Nintendo Challenge

CodinGame compiles the C++ code with the -g flag, so everything is slower than real-life C++. Unfortunately, if you want fast code on CodinGame, you have to resort to bad practices just to avoid the overhead of the -g flag: never write functions (even inlined ones), avoid all value copies, avoid using classes (even STL classes), abuse global variables, …

I took a quick peek at the Nintendo Challenge. It’s been a long time since I played with C++. Am I overthinking the syntax b[ (i+j)/32 ]? In the case where i=0 and j=0, is that interpreted as index 0, or index 1 of 32, or something totally off the wall? Some syntax isn’t exactly searchable on the web.

In C++, integer division stays in integers: 0/32 gives 0, 1/32 gives 0, 31/32 gives 0, 32/32 gives 1, 33/32 gives 1, 65/32 gives 2.

b[0/32] is the same as b[0].
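
For context, a minimal sketch of the packed-bit indexing pattern that expression usually belongs to (the array name b and the sizes here are illustrative, not the puzzle’s actual code, assuming 32-bit words):

#include <cstdio>

int main() {
    unsigned b[8] = {0};              // 256 bits stored as 8 x 32-bit words

    int i = 0, j = 0;
    // Integer division selects the word, the remainder selects the bit in it.
    // For i = j = 0: (0+0)/32 == 0, so this touches bit 0 of b[0].
    b[(i + j) / 32] ^= 1u << ((i + j) % 32);

    printf("%08x\n", b[0]);           // prints 00000001
    return 0;
}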

I was overthinking it. Thank you.

It is indeed a drawback to be forced into bad practices by not being able to turn on optimizations.

#pragma GCC optimize "-O3" at the top of the code helps, though.
I also use inline __attribute__((always_inline)) for that matter.
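
For reference, a minimal sketch of what I mean (the popcount32 helper is just a placeholder, not code from the puzzle):

#pragma GCC optimize "-O3"

#include <cstdio>

// Force inlining even when the compiler isn't optimizing aggressively.
static inline __attribute__((always_inline)) unsigned popcount32(unsigned x) {
    return __builtin_popcount(x);
}

int main() {
    printf("%u\n", popcount32(0xF0F0F0F0u));   // prints 16
    return 0;
}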

Regarding the puzzle, it was a nice journey into a new field for me.
My solution runs in 20ms and can likely be improved a lot.
I’m using CZ as EDF, but I guess BTA is superior for this.

Edit: Just by cutting out the unneeded parts, I get 12ms.

Actually, the problem is far worse than that. Missing -O3 and having -g on the compilation command line is bad, but not that bad. (EDIT: I was wrong, see my post below.)

After a discussion with someone from the staff: C++ is slow on CodinGame for another reason. C++ always runs under gdb. This is why we get the line and the stack trace for any error when our code crashes. It is very useful in the IDE, but at submit time it is very, very, very, very (very) slow. And having gdb handle your code encourages bad practices such as:

  • Abuse global variables.
  • Never ever write a utility function. It will slow down your code even with inline. Copy/paste your code or use compiler directives.
  • Never use classes (even a simple vector<…>) if you don’t really need them. Use plain arrays.
  • Recode everything. Don’t use the STL, it is slow.
  • NEVER use pow(x, 2).
  • Keep your code as short as possible. Don’t leave dead code around, because it will slow down gdb. If you have debug functions, remove them for the submit (I use compiler directives for that now; see the sketch after this list).
  • Keep as few variables in scope as possible to speed up gdb. You need an int and you already have one? Re-use it. Be dirty.
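
As an illustration of the debug-code point above, a minimal sketch of the compiler-directive trick (the LOCAL macro name is just an assumption; define it only on your own machine, e.g. with g++ -DLOCAL):

#include <cstdio>

// With LOCAL undefined (as on a CodinGame submit), DBG(...) expands to nothing,
// so the debug code disappears entirely from the compiled program.
#ifdef LOCAL
#define DBG(...) fprintf(stderr, __VA_ARGS__)
#else
#define DBG(...)
#endif

int grid[64];   // global state, in the spirit of the list above

int main() {
    for (int i = 0; i < 64; ++i) grid[i] = i * i;
    DBG("grid[10] = %d\n", grid[10]);   // stripped at submit time
    printf("%d\n", grid[63]);
    return 0;
}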

EDIT: But the discussion was unclear, or I misunderstood, because after a few tests I don’t think the code is running under gdb or any other debugger. All the problems I describe above come from the lack of -O3 in the compilation flags.

What I meant is that it is indeed a drawback regarding the value of the work performed here,
namely from a technical recruiter’s standpoint.

On the pure performance end, there are indeed ways out, even though the code won’t ever be as fast as if there were no debugging at all.

I was thinking of the possibility of having a debug mode and a release mode at some point later on.
I believe it would make sense for competitions involving any kind of performance.

This!
#pragma GCC optimize "-O3"
With this option my solution runs the 256-bit test case in 850 microseconds.
Disappointingly, std::array is still twice as slow as built-in arrays…
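
If you want to reproduce that kind of gap yourself, here is a rough micro-benchmark sketch (not the actual solution; the sizes and the XOR workload are arbitrary). Without -O3, std::array’s operator[] stays a real function call, which is presumably where the difference comes from:

#include <array>
#include <chrono>
#include <cstdio>

int main() {
    const int N = 1024;
    static unsigned raw[N];
    static std::array<unsigned, N> arr;
    for (int i = 0; i < N; ++i) raw[i] = arr[i] = i;

    typedef std::chrono::steady_clock clk;
    unsigned sink = 0;

    clk::time_point t0 = clk::now();
    for (int r = 0; r < 10000; ++r)
        for (int i = 0; i < N; ++i) sink ^= raw[i];     // built-in array
    clk::time_point t1 = clk::now();
    for (int r = 0; r < 10000; ++r)
        for (int i = 0; i < N; ++i) sink ^= arr[i];     // std::array::operator[]
    clk::time_point t2 = clk::now();

    long long raw_us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    long long arr_us = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    printf("built-in: %lld us, std::array: %lld us (sink=%u)\n", raw_us, arr_us, sink);
    return 0;
}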

Indeed, with __attribute__((always_inline)) on non-recursive calls, I’m around ~2 ms, and sometimes it falls to ~1.8–1.9 ms.

And with #pragma GCC optimize "-O3", it’s around 650 µs; gj for finding this ^^

In my experience, "-g" doesn’t incur any significant overhead, nor does execution under gdb (except, of course, startup time).

I don’t know how you run gdb, but I confirm that gdb slows your code down a lot, because it has to keep track of the stack state every time you do something.

I’m not going to debate whether or not programs run under gdb if CG staff told you so [1], but I’m with @Plopx on this one: there’s no reason for code running under gdb to be slower than standalone after it has actually started.

There is no need for gdb to “keep stack states” at runtime any more than the program already does on its own. For compiled languages, the program’s status (registers and memory, including the stack) is available externally when the program isn’t running (whether SIGSTOPped or crashed), and the mapping to symbols only needs to be done once.

I don’t know how you run gdb, but I confirm that when actually measured, the slowness relative to standalone running is lost in the noise.

C prime sieve: 0.05834s ⋘ σ

  • Standalone: 2.68186s (σ=0.12621s)
  • Under gdb: 2.80568s (σ=0.11785s)
  • gdb startup: 0.11960s (σ=0.00774s)

In C++ (STL, streams): 0.08128s ⋘ σ

  • Standalone: 3.89629s (σ=0.27718s)
  • Under gdb: 4.10208s (σ=0.35259s)
  • gdb startup: 0.12451s (σ=0.00836s)

Fact-supported counterarguments welcome! (like, something with severe recursion)


[1] It’s debatable because it’s not needed at all: if all you want is a symbolic stack trace from a crashed program, you can extract it from the core dump. If the overtimes reported in this thread boil down to gdb startup (and it’s counted against program time), I’d consider that a platform bug.


I made a few tests with this code: http://pastebin.com/7k0Gj8qf (I use the standard output to print the time, because on CodinGame I don’t really have the choice).

I tested this code in the Onboarding puzzle. The output is around 200ms (you can test it). If you add #pragma GCC optimize "-O3" at the start of the code, the output is around 5ms.

On my computer, if I try both codes (with and without the pragma) with this compilation:
g++ code.cpp -o code.exe -std=c++11

I get the same times, and I also get the same times if I add a -g flag. But if I run it with gdb (a simple command like gdb code.exe), I get 520ms for the first code (without the pragma) and 325ms for the second code (with the pragma directive). Adding the -g flag doesn’t change anything, and I get the same results if I compile with the -O3 flag.

I don’t know how you can run C++ code with gdb without losing almost any time, but I’ll be glad to learn how.

For CodinGame, I don’t think the code is really running under gdb. The performance is far too different. It would be impossible to do a puzzle like Nintendo under gdb; the code would be far too slow.

But on CodinGame you can test it easily: just make two codes, one with an inline function and a second one without it (just copy/paste the function body), and see how both codes do. An inline function is supposed to have no impact on code speed (if you are using it properly, of course), but on CodinGame it does.
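
A sketch of the kind of two-version test I mean (the dist2 helper and the loop bounds are made up):

#include <cstdio>

// Version 1: the work goes through an inline helper.
// Version 2 would be identical, with the call replaced by the helper's body;
// compare the running times of the two versions on CodinGame.
inline int dist2(int x1, int y1, int x2, int y2) {
    int dx = x1 - x2, dy = y1 - y2;
    return dx * dx + dy * dy;
}

int main() {
    long long total = 0;
    for (int x = 0; x < 2000; ++x)
        for (int y = 0; y < 2000; ++y)
            total += dist2(x, y, 1000, 1000);
    printf("%lld\n", total);   // use the result so nothing gets optimized away
    return 0;
}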

The -g flag can impact performance, but that depends on the compiler. GCC has a -g3 option to avoid that. It is not used by CodinGame, but that’s not a big problem.

My code is pretty simple, so the pragma directive acts the same as the -O3 flag. With a bigger code you’ll see some differences. Now look at these two codes:

http://pastebin.com/VV626vH9
http://pastebin.com/1mMak1iM

If you try these two codes on the Onboarding puzzle, the first one takes around 28ms and the second one takes less than 0.001ms. If you remove the -O3 pragma directive, the first one takes around 45ms and the second one takes around 5ms.

But according to the C++ documentation and specifications, an inlined function (with reference parameters and const values) should not impact performance, and a vector has as little overhead as possible. Now try to run this code on your computer with the -O3 flag at compilation (not with the pragma directive). On my computer I get 5ms for the first code and less than 0.001ms for the second code.

The problem on CodinGame is the lack of -O3 at compilation. Without it, you can’t use classes and functions without impacting the speed of your code. That always leads to bad practices to keep your code fast (as I explained in my previous post).


All pragma & attribute optimizations removed ^^
(I heard somewhere that the code was running inside Valgrind, but with this test, I really don’t think that’s the case.)

Within lldb, it’s only 10% slower with -O0, while in Valgrind it’s ~50 times slower:

00:20:23 [~/Desktop] schlum$ g++ -O0 -g -W -Wall -Wextra -o nintendo nintendo.cpp
00:20:27 [~/Desktop] schlum$ valgrind nintendo
==40920== Memcheck, a memory error detector
==40920== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==40920== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==40920== Command: nintendo
==40920== 
256
4af6fc33 39029380 465c5267 c72f6a8b 0906e6d0 ca60550f 14a5e47c 42ad10fb 4a3bb446 bb74360a 5ea02b9c 23c68553 3fade253 e270ba24 39e141ad 6c38c43d
320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299 33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4
33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4 320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299
time: 184544 µs
==40920== 
==40920== HEAP SUMMARY:
==40920==     in use at exit: 30,275 bytes in 188 blocks
==40920==   total heap usage: 279 allocs, 91 frees, 38,193 bytes allocated
==40920== 
==40920== LEAK SUMMARY:
==40920==    definitely lost: 0 bytes in 0 blocks
==40920==    indirectly lost: 0 bytes in 0 blocks
==40920==      possibly lost: 0 bytes in 0 blocks
==40920==    still reachable: 8,192 bytes in 2 blocks
==40920==         suppressed: 22,083 bytes in 186 blocks
==40920== Rerun with --leak-check=full to see details of leaked memory
==40920== 
==40920== For counts of detected and suppressed errors, rerun with: -v
==40920== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
00:20:36 [~/Desktop] schlum$ lldb nintendo
(lldb) target create "nintendo"
Current executable set to 'nintendo' (x86_64).
(lldb) run
Process 40926 launched: '/Data/HOME/Desktop/nintendo' (x86_64)
256
4af6fc33 39029380 465c5267 c72f6a8b 0906e6d0 ca60550f 14a5e47c 42ad10fb 4a3bb446 bb74360a 5ea02b9c 23c68553 3fade253 e270ba24 39e141ad 6c38c43d
320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299 33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4
33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4 320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299
time: 3815 µs
Process 40926 exited with status = 0 (0x00000000) 
(lldb)  quit
00:23:48 [~/Desktop] schlum$ ./nintendo
256
4af6fc33 39029380 465c5267 c72f6a8b 0906e6d0 ca60550f 14a5e47c 42ad10fb 4a3bb446 bb74360a 5ea02b9c 23c68553 3fade253 e270ba24 39e141ad 6c38c43d
320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299 33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4
33749a7f 6cc9659d dc503569 ef4d0ef5 73b746c5 b8fb36d3 7616e9d6 b21251c4 320a18d5 b61b13f6 1aaaa61c 0afe2a41 1a4ff107 84cc2efc 956ff31d fa595299
time: 3457 µs
00:23:59 [~/Desktop] schlum$ 

You appear to be compiling on a DOS derivative. CG runs Linux. This is IMHO the most likely source of discrepancies around the foreign process debugging interface.

I tested the same code on a Linux machine (Debian) and got the same results. May I ask how you compile and run code with gdb? (And what code are you testing?)

Here are the timings I got with your codes, numbered from top to bottom as code0 to code2. The "pragma" and "nopragma" cases are respectively with and without the pragma line; "O3" means adding "-O3" to the compilation command line. Gdb is run with "gdb -batch -ex run ./command". Gcc is version 4.9.2, gdb version 7.7.1 (Debian Jessie). The timings are the mean and standard deviation, in milliseconds, over 10 runs. For each run, the command is executed twice in a row to mitigate caching effects.

  code0     nopragma   nopragma+O3    pragma   pragma+O3
  direct  217.22±2.06   3.57±0.11   6.11±0.13  3.56±0.12
  gdb     217.28±2.75   3.41±0.08   6.02±0.26  3.39±0.10

  code1    nopragma   nopragma+O3    pragma    pragma+O3
  direct  52.17±1.77   5.34±0.19   29.96±0.33  5.35±0.10
  gdb     52.29±1.21   5.31±0.13   29.72±0.40  5.19±0.11

  code2    nopragma  nopragma+O3(e-05)  pragma(e-05)  pragma+O3(e-05)
  direct  5.58±0.09      10.17±2.35      11.15±2.21      9.81± 2.14
  gdb     5.54±0.09      10.06±2.52       9.15±1.05     12.85±11.71

It clearly appears that there is no overhead with gdb. The difference from your timings may lie elsewhere (http://stackoverflow.com/questions/27380443/why-is-gdb-so-slow-in-windows), but I don’t have any Windows system to verify.

We can also see that #pragma GCC optimize "-O3" is not as good as giving -O3 on the command line. Is it a bug in gcc? I don’t know. Anyway, that supports your argument that -O3 should be turned on by the CG team.

Finally, code2 is very fast with optimizations turned on. The explanation is that the test is a bit flawed: the result of the computation is not used at all, and gcc is able to remove it completely. You can check that by looking at the assembler code (option -S).
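
A minimal illustration of that effect (not code2 itself, just a sketch of the same trap; compare g++ -O3 -S output with and without the final printf):

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v(1000000, 1);

    long long sum = 0;
    for (int x : v)
        sum += x;

    // With the next line commented out, -O3 is free to delete the whole loop
    // above, since its result is never observed; the "benchmark" then measures
    // nothing. Keeping it (or storing sum into a volatile) preserves the work.
    printf("%lld\n", sum);
    return 0;
}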

Interesting. Thanks for the post Plopx.

As far as I remember, this is not a bug, but I’m not completely sure. There are some optimizations that the pragma directive can’t do. For example, if you try to use some C library from C++, it is often not covered by the -O3 flag, but it does work with the pragma directive.

I didn’t know that; I was trying to see why the code with vector is slower. I didn’t think -O3 would completely remove the line.

This puzzle was quite challenging! It would deserve to be in a “very very hard” category :slight_smile:

Solving the challenge in Python first turned out to be very useful for me. Python supports big integers by default, and numpy has many built-in functions to speed up the implementation. Once the problem was solved in Python, it was easier to translate it to C++ since I could always compare intermediate results against the Python implementation.

Definitely don’t look at this if you don’t want any spoilers:

I wrote up (or rather, generated from the solution) a post with details on each step of my particular solution. I’d be interested in hearing about others’ approaches and differences.

(Note: the puzzle does pass with 100%, but there’s one specific minor but annoying gotcha that’s not covered in the particular examples in that link.)

edit: removed link