Nintendo Challenge

magaiti · April 3, 2016, 8:54pm

I have found the source of my problem. It was the very 5 lines of code I added “for optimization purposes” at some point. So actually it was my solution that was overoptimized, except the part that my “optimization” was doing the oposite of what i supposed it should do.

Now I have stable 8 msec on 256b testcase.

I had to rewrite several lines of code due to the removal of these lines, but in the end the code looks just as clean.

Wildcat-SC · April 3, 2016, 11:35pm

See, I told you, I did not work on optimization
I’m pretty sure there is at least 50% or more to gain on my implementation of B (for example, I did not sort my matrix lines)
But as it took 5ms on my computer, I did not see the point in spending more time on it ^^

[Edit] Well, all I could gain optimizing my B. was 0.5 ms And then, I optimized my degree recalculation function and gets 7.5 ms (3.5 ms on my iMac) ; I’m a bit stunned, but I guess it is logical : lots of XOR, so lots of recalculations to do, and seems an expensive operation

[Edit2] Gained one more ms using __builtin_clzll for the degree recalculation -> 6.4 ms ^^

Agade · April 4, 2016, 12:47pm

The recent activity on this thread got me curious about my 1000ms performance. So here is what I found: it becomes 100ms locally with O3. I believe we are not using the same algorithm though. Furthermore I think my implementation must be partly to blame, for one thing my code is leaking memory like crazy, just not enough to run out of it.

Wildcat-SC · April 4, 2016, 1:03pm

Mhh, we may not use the same algorithm indeed ^^ The only thing I allocate on the heap is an array of ~530 kB

Wildcat-SC · April 7, 2016, 10:51pm

Found out a new optimization for the matrix initialization (maths is a beautiful thing :D)
-> 2.8 ms (on CG server) now for the 256b case

Now I guess it IS overoptimized ^^

magaiti · April 8, 2016, 1:08am

Wow. I’m at 7-7.5 msec now, after optimizing get_bit() method. AFAIK my main algorithm takes ~5 msec, and there 's probably nothing I can do about it, other than rewriting the whole thing. As for initialization, maybe i’ll try to see what i can do.

Update: got to 5msec total time after unrolling the innermost xor cycle.

Wildcat-SC · April 8, 2016, 7:37am

If you have tools for this, try to instrument an infinite loop of 256b resolution. That’s what I did, and I saw that the matrix init was taking 63% of the CPU time ^^;

Finally, I found out a mathematical way to init it with at most one <<2, one <<1 and three XOR per line, so I gained a good part of those 63%

magaiti · April 8, 2016, 11:40am

Well… I am disappointed with std::array.
From the way it looks, I assumed it was like a wrapper for c++ arrays, only with some STL methods supplied.
That’s why I had my data defined as array<uint64_t,Size> a; all the time.
After I’ve changed it to uint64_t a[Size], I got like two times increase in speed.

Allocating the array on the stack also seems faster than from heap. So now my routine casually bites off ~40Kb of call stack.

My matrix initialization uses about as much operations as yours.

A-and… now I’m at 2.5ms on CG server.

Update: 2.15-2.20 ms with get_bit() optimized in the innermost loop (offset and mask precalculated outside of loop)

Wildcat-SC · April 8, 2016, 6:14pm

Indeed, I get the same times working on the stack instead of on the heap. I work already with an old school array.

Btw, sometimes I have ~2100 µs, some others ~2400 µs ; seems it depends of the server state ; hard to have something stable with such low times

magaiti · April 8, 2016, 6:49pm

I’ve read from several sources that std::array is a thin wrapper over c++ arrays, and both std::vector and std::array are as fast as c++ arrays, when used correctly. Well, it just seems that CG server doesn’t do all optimizations needed for that.

Yes, I get values between 2150 and 2200 mcsec most of the time, but sometimes it goes down to about 2100, or up to like 2300-2600. One time i got over 4000.

Agade · April 9, 2016, 8:01am

Indeed on the CG servers most of the STL is slower than C style code. Such as int array [N] being faster than vector<int> array(N).

Magus · April 9, 2016, 8:10am

CodinGame compile the C++ code with the -g flag. So everything is slower than in C++ real life. Unfortunaly,if you want a fast code on CodinGame, you have to abuse of bad pratices just to avoid the overheads of the -g flag. Never do functions (even inlined), avoir all values copy, avoid using classes (even STL classes), abuse of global variables, …

MatthewDaiprai · April 10, 2016, 4:57am

I took a quick peek at the Nintendo Challenge. It’s been a long time since I played with C++. Am I overthinking the syntax b[ (i+j)/32 ]? In the case if i=0 and j=0 is that interpreted as index 0 or index 1 of 32 or something totally off the wall? Some syntax isn’t exactly searchable on the web.

Magus · April 10, 2016, 7:37am

In C++ integer division stay in integers. 0/32 gives 0. 1/32 gives 0. 31/32 gives 0. 32/32 gives 1. 33/32 gives 1. 65/32 gives 2.

b[0/32] is the same as b[0].

MatthewDaiprai · April 10, 2016, 5:02pm

I was overthinking it. thank you.

FredericBautista · April 16, 2016, 2:30am

This is actually indeed a drawback to be forced into bad practices by not being able to turn on optimizations.

#pragma GCC optimize “-O3” on top of the code helps though.
I also use inline __attribute__((always_inline)) for that matter.

Regarding the puzzle, it was a nice journey in a new field for me.
My solution run in 20ms and can likely be improved a lot.
i’m using CZ as EDF, but I guess BTA is superior on this.

Edit: Just cutting out the unneeded parts I get 12ms.

Magus · April 16, 2016, 7:07am

Actually the problem is far worse than that. Missing -O3 and have a -g in the compilation command line is bad, but no that much. (EDIT: I was wrong, see my post here)

After a discussion with someone of the staff, C++ is slow on CodinGame for another reason : C++ always run with gdb. This is the reason we got the line and the stack of any error when our code crash. It is very useful in the IDE, but at the submit it is very very very very (very) slow. And having gdb handling your code encourage bad pratices such as:

Abuse global variables
Never ever do an utility function. It will slow down your code even with an inline. Copy/paste your code or use compilator instructions
Never use classes (even a simple vector<…>) if you don’t really need it. Use pure array.
Recode everything. Don’t use the STL, it is slow.
NEVER use pow(x, 2)
Keep your code as short as possible. Don’t let dead code. Because it will slow down gdb. If you have some debug functions, remove it for the submit (i use compilator instructions for that now).
Keep the minimum of variable in the scope to speed up gdb. You need an int and you already have an int ? Re-use it. Be dirty.

EDIT: But the discussion was unclear or i misunderstood. Because after a few tests, i don’t think the code is running in gdb or any other debugger. All the problem i talk above comes from the lack of -O3 in the compilation flags.

FredericBautista · April 16, 2016, 7:20am

What I meant is that it was indeed a drawback regarding the value of the work performed here.
Namely from a technical recruiter standpoint.

On the pure performance end, there are indeed ways out even though,
the code won’t ever be as fast as if there was no debugging at all.

I was thinking of a possibility to have a debug mode and a release mode at some point later on.
I believe that It would make sense for competition involving any kind of performance.

magaiti · April 16, 2016, 8:13am

This!
#pragma GCC optimize "-O3"
With this option my solutions runs 256b testcase in 850 microseconds.
Disappointingly, std::arrays are still twice as slow as C++ arrays…

Wildcat-SC · April 16, 2016, 12:54pm

Indeed, with __attribute__((always_inline)) on non-recursive calls, I’m around ~2 ms, and sometimes fall ~ 1.8-9

And with #pragma GCC optimize “-O3”, it’s around 650 µs, gj for finding this ^^