Nintendo Challenge

This!
#pragma GCC optimize "-O3"
With this option my solution runs the 256-bit testcase in 850 microseconds.
Disappointingly, std::arrays are still twice as slow as C-style arrays…

Indeed, with __attribute__((always_inline)) on non-recursive calls, I’m around ~2 ms, and sometimes fall to ~1.8-1.9 ms.

And with #pragma GCC optimize “-O3”, it’s around 650 µs, gj for finding this ^^

From my experience, “-g” doesn’t incur any significant overhead, nor does the execution with gdb (except of course startup time).

I don’t know how you run gdb, but I confirm that gdb slows your code down a lot, because it has to keep track of the stack state every time you do something.

I’m not going to debate whether or not programs run under gdb if CG staff told you so [1], but I’m with @Plopx on this one, there’s no reason for code running under gdb to be slower than standalone after it actually started.

There is no need for gdb to “keep stack states” at runtime any more than the program already does on its own. For compiled languages, the program’s status (registers and memory including stack) is available externally when the program isn’t running (whether SIGSTOPped or crashed) and mapping to symbols only needs be done once.

I don’t know how you run gdb, but I confirm that when actually measured, slowness wrt standalone running is lost in the noise.

C prime sieve: 0.05834s ⋘ σ

  • Standalone: 2.68186s (σ=0.12621s)
  • Under gdb: 2.80568s (σ=0.11785s)
  • gdb startup: 0.11960s (σ=0.00774s)

In C++ (STL, streams): 0.08128s ⋘ σ

  • Standalone: 3.89629s (σ=0.27718s)
  • Under gdb: 4.10208s (σ=0.35259s)
  • gdb startup: 0.12451s (σ=0.00836s)

Fact-supported counterarguments welcome! (like, something with severe recursion)


[1] It’s debatable because it’s not needed at all: if all you want is a symbolic stack trace from a crashed program, you can extract it from the core dump. If the overtimes reported in this thread boil down to gdb startup (and that it’s counted against program time), I’d consider that a platform bug.


I made a few tests with this code: http://pastebin.com/7k0Gj8qf (I use standard output to print the time because on CodinGame I don’t really have a choice).

I tested this code in the Onboarding puzzle. The output is around 200 ms (you can test it). If you add #pragma GCC optimize "-O3" at the start of the code, the output is around 5 ms.

On my computer, if I try both codes (with and without the pragma) compiled like this:
g++ code.cpp -o code.exe -std=c++11

I get the same times, and adding a -g flag doesn’t change them either. But if I run the program under gdb (a simple command like gdb code.exe), I get 520 ms for the first code (without the pragma) and 325 ms for the second one (with the pragma directive). Adding the -g flag changes nothing, and I get the same results if I compile with the -O3 flag.

I don’t know how you can run C++ code with gdb without losing almost any time, but I’ll be glad to learn how.

For CodinGame, I don’t think the code is really running under gdb: the performance gap is far too big. It would be impossible to do a puzzle like Nintendo under gdb; the code would be far too slow.

But on CodinGame you can test it easily: just write two codes, one with an inline function and a second one without it (just copy/paste the function body), and see how both perform. An inline function is supposed to have no impact on speed (if you use it properly, of course), but on CodinGame it does.

The -g flag can impact performance, but it depends on the compiler. gcc has a -g3 option to avoid that. It is not used by CodinGame, but that’s not a big problem.

My code is pretty simple, so the pragma directive acts the same as the -O3 flag. With a bigger code you’ll see some differences. Now look at these two codes:

http://pastebin.com/VV626vH9
http://pastebin.com/1mMak1iM

If you try these two codes on the Onboarding puzzle, the first one takes around 28 ms and the second one takes less than 0.001 ms. If you remove the -O3 pragma directive, the first one takes around 45 ms and the second one around 5 ms.

But according to the C++ documentation and specifications, an inlined function (with reference parameters and const values) should not impact performance, and a vector has the least overhead possible. Now try to run this code on your computer with the -O3 flag at compilation (not the pragma directive): on my machine I get 5 ms for the first code and less than 0.001 ms for the second.

The problem on CodinGame is the lack of -O3 at compilation. Without it, you can’t use classes and functions without slowing down your code. That always leads to bad practices to keep your code fast (like I explained in my previous post).


All pragma & attributes optimization removed ^^
(I heard the code was running into valgrind somewhere, but with this test, I really don’t think it’s the case).

Within lldb, it’s only 10% slower with -O0, while in valgrind, it’s ~50 times slower

00:20:23 [~/Desktop] schlum$ g++ -O0 -g -W -Wall -Wextra -o nintendo nintendo.cpp
00:20:27 [~/Desktop] schlum$ valgrind nintendo
==40920== Memcheck, a memory error detector
==40920== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==40920== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==40920== Command: nintendo
==40920== 
256
4af6fc33 39029380 465c5267 c72f6a8b 0906e6d0 ca60550f 14a5e47c 42ad10fb 4a3bb446 bb74360a 5ea02b9c 23c68553 3fade253 e270ba24 39e141ad 6c38c43d
320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299 33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4
33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4 320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299
time: 184544 µs
==40920== 
==40920== HEAP SUMMARY:
==40920==     in use at exit: 30,275 bytes in 188 blocks
==40920==   total heap usage: 279 allocs, 91 frees, 38,193 bytes allocated
==40920== 
==40920== LEAK SUMMARY:
==40920==    definitely lost: 0 bytes in 0 blocks
==40920==    indirectly lost: 0 bytes in 0 blocks
==40920==      possibly lost: 0 bytes in 0 blocks
==40920==    still reachable: 8,192 bytes in 2 blocks
==40920==         suppressed: 22,083 bytes in 186 blocks
==40920== Rerun with --leak-check=full to see details of leaked memory
==40920== 
==40920== For counts of detected and suppressed errors, rerun with: -v
==40920== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
00:20:36 [~/Desktop] schlum$ lldb nintendo
(lldb) target create "nintendo"
Current executable set to 'nintendo' (x86_64).
(lldb) run
Process 40926 launched: '/Data/HOME/Desktop/nintendo' (x86_64)
256
4af6fc33 39029380 465c5267 c72f6a8b 0906e6d0 ca60550f 14a5e47c 42ad10fb 4a3bb446 bb74360a 5ea02b9c 23c68553 3fade253 e270ba24 39e141ad 6c38c43d
320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299 33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4
33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4 320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299
time: 3815 µs
Process 40926 exited with status = 0 (0x00000000) 
(lldb)  quit
00:23:48 [~/Desktop] schlum$ ./nintendo
256
4af6fc33 39029380 465c5267 c72f6a8b 0906e6d0 ca60550f 14a5e47c 42ad10fb 4a3bb446 bb74360a 5ea02b9c 23c68553 3fade253 e270ba24 39e141ad 6c38c43d
320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299 33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4
33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4 320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299
time: 3457 µs
00:23:59 [~/Desktop] schlum$ 

You appear to be compiling on a DOS derivative. CG runs Linux. This is IMHO the most likely source of discrepancies around the foreign process debugging interface.

I tested the same code on a Linux machine (Debian) and I got the same results. Can I ask how you compile and run code with gdb? (And what code are you testing?)

Here are the timings I got with your codes, numbered from top to bottom code0 to code2. Cases “pragma” and “nopragma” are respectively with and without the “pragma” line; “O3” means adding “-O3” to the compilation command line. Gdb is run with “gdb -batch -ex run ./command”. Gcc is version 4.9.2, gdb version 7.7.1 (Debian Jessie). The timings are the mean and standard deviation over 10 runs; for each run, the command is executed twice in a row to mitigate caching effects.

  code0     nopragma   nopragma+O3    pragma   pragma+O3
  direct  217.22±2.06   3.57±0.11   6.11±0.13  3.56±0.12
  gdb     217.28±2.75   3.41±0.08   6.02±0.26  3.39±0.10

  code1    nopragma   nopragma+O3    pragma    pragma+O3
  direct  52.17±1.77   5.34±0.19   29.96±0.33  5.35±0.10
  gdb     52.29±1.21   5.31±0.13   29.72±0.40  5.19±0.11

  code2    nopragma  nopragma+O3(e-05)  pragma(e-05)  pragma+O3(e-05)
  direct  5.58±0.09      10.17±2.35      11.15±2.21      9.81± 2.14
  gdb     5.54±0.09      10.06±2.52       9.15±1.05     12.85±11.71

It appears clearly that there is no overhead with gdb. The difference with your timings is maybe elsewhere (http://stackoverflow.com/questions/27380443/why-is-gdb-so-slow-in-windows), but I don’t have any Windows system to verify.

We can also see that #pragma GCC optimize "-O3" is not as good as giving “-O3” on the command line. Is it a bug in gcc? I don’t know. Anyway, that supports your argument that -O3 should be turned on by the CG team.

Finally, code2 is very fast with optimizations turned on. The explanation is that the test is a bit flawed. The result of the computation is not used at all, and gcc is able to remove it completely. You can check that by looking at the assembler code (option -S).

Interesting. Thanks for the post Plopx.

As far as I remember, this is not a bug, but I’m not completely sure. There are some optimizations that the pragma directive can’t do. For example, if you use some C libraries in C++, they are most of the time not covered by the -O3 flag, but they do work with the pragma directive.

I didn’t know that. I was trying to see why the code with vector is slower; I didn’t think -O3 completely removed the line.

This puzzle was quite challenging! Would deserve to be in a very very hard category :slight_smile:

Solving the challenge in Python first turned out to be very useful for me. Python has support for big integers by default, and numpy has many built-in functions to speed up the implementation. Once the problem was solved in Python, it was easier to translate it to C++ since I could always compare intermediate results with the Python implementation.

Definitely don’t look at this if you don’t want any spoilers:

I wrote up (or rather, generated from the solution) a post with details on each step of my particular solution. I’d be interested in hearing about others’ approaches and differences.

(Note the puzzle does pass with 100%, but there’s one specific minor but annoying gotcha that’s not covered by the particular examples in that link.)

edit: removed link

I’m not sure this is a good idea for a sponsored puzzle. CodinGame even disabled the ability to view others’ code after solving the problem…

Even without giving the source, in this case, understanding what to do is at least half of the work.

It might hurt Nintendo (by providing them with candidates who wouldn’t have done all the work), players (because Nintendo would not be sure that they did solve it themselves) and eventually CodinGame (because the logical exit in this case is that Nintendo pulls out of the system).

Your work doesn’t necessarily have to go to waste though: you could remove all references to the Nintendo problem (including the encoding/decoding details) and make it a post on the underlying problem, which is interesting enough by itself! :slight_smile:


Is the problem correct as stated? I get a suspicion that the pseudocode should read

b[(i+j)/32] ^= ((a[i/32] >> (i%32)) & (a[j/32 + size/32] >> ((j+size)%32)) & 1) << ((i+j)%32)

or something to that respect.

I’m pretty sure that it is correct as stated.

You can also verify yourself that the pseudocode is consistent with the provided C++ implementation.


I confirm. The problem is correct.

Well, I think we can all agree that the C++ code is not correctly implemented, after just a few test cases starting with size=17…

To be honest I have no idea. :slight_smile:

But the obfuscated part is correct.

What about the case where size = 33?

The problem stipulates that one should read size/16 32-bit values, yet on the first pass through the inner loop at i=0, j=32, the program attempts to read from a[2] ( a[j/32 + size/32] => a[32/32 + 33/32] ).

Or am I misunderstanding something fundamental about this problem?