"Killed" on C++ memory reservations

_CG_XorMode · November 15, 2019, 12:45pm

I have tried something else: removing servers from our pool which have less memory than others. @Marchete, could you try again please?

Note: That means CG is running on half its capacity, so you may have delays…

EDIT: Users are complaining, so I have to put servers back online, BUT I am going to restart them in a special, potentially slower, mode which consumes way less memory. So testing again would still be useful.

_CG_XorMode · November 15, 2019, 1:14pm

I tried Marchete’s Hypersonic test code multiple times without a failure after removing our low-memory servers from the pool.

EDIT: ARG, reached the play limit. So I hit the play button many times and did not see the red player failing any longer.

darkhorse64 · November 15, 2019, 1:55pm

I did a test on Oware. Still got one failure at startup, but also 10% timeouts during play.

_CG_XorMode · November 15, 2019, 1:57pm

We may have multiple issues: one linked to memory (Marchete’s issue) and one linked to performance of the compiled code. Let’s serialize issues. Would like to hear from Marchete about his startup issue first.

darkhorse64 · November 15, 2019, 2:00pm

No worries. I just wanted to point out that startup problems have decreased a lot but that there are now timeouts during play that were not seen before.

Marchete · November 15, 2019, 2:35pm

Still one crash, sorry
https://www.codingame.com/share-replay/422095497

INIT CACHE 142187KB CHUNK3906 SIZE:749952 524288

ABCDEF0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,

At 38. Also a cc1plus, but that’s another bug.
But it’s much, much less frequent: I saw it only once before I reached the "Oups An error occurred (#407): “You reached the limit of plays for a period of time.”. Please contact codersHS@codingame.com" message.

_CG_XorMode · November 15, 2019, 4:18pm

Well, will have to do for the week-end. The thing I understand the least is why it is always the red player???

FYI, we only changed the language runtimes and lowered the available memory on the machine by 100MB. I reverted the 100MB change, so now only the runtimes have been updated. We changed nothing in the computing engine, which makes debugging difficult moving forward. I am a bit out of options there (I do not wish to revert as we did for Java in the past).

@darkhorse64, @amurushkin, do you see the issues only when you are the second player, or does it seem random as well?

RoboStac · November 15, 2019, 4:32pm

My startup problems appear to be getting worse again quite quickly (4 in 20 games on this submit). I’m seeing roughly equal occurrences as either player.

darkhorse64 · November 15, 2019, 4:40pm

No more timeouts during play, but more at startup, whether I am first or second. I decreased my search by 5 ms but it does not change anything.

_CG_XorMode · November 15, 2019, 6:12pm

So even the instability is unstable…

Marchete · November 15, 2019, 7:13pm

Well, instability is just that, unstable

I can’t see the innards of the process, so I can only assume that the jail somehow kills the player that goes beyond the dynamic memory limits. The thing is that my precompiled HS bot had zero “Killed” timeouts after reducing hashtables to 99MB, and only 1 normal timeout in 500+ matches, even when packed into C++ code.

C# test bot starts with a high memory usage. Command ps -aux shows:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
u65601 1 0.0 5.3 128796 13896 ? Sl 18:44 0:00 /usr/bin/mono --debug -O=-inline /tmp/Tester.exe

And after vector initializations:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
u65601 1 4.0 11.1 142376 29332 ? Sl 18:44 0:00 /usr/bin/mono --debug -O=-inline /tmp/Tester.exe

Almost no new memory is reserved (I don’t understand it, but that’s the magic of managed code).

In C++ it is different:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
u65501 1 0.0 1.2 7652 3404 ? S 18:54 0:00 /tmp/Answer
INIT CACHE 142187KB CHUNK3906 SIZE:749952 524288

After initialization:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
u65501 1 14.0 76.7 205488 201240 ? S 18:54 0:00 /tmp/Answer

cat /proc/meminfo reports 256MB:

MemTotal: 262144 kB
MemFree: 262144 kB
SwapTotal: 0 kB
SwapFree: 0 kB

ulimit seems normal (8MB stack is meh, but that’s vanilla linux):

time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) 0
memory(kbytes) unlimited
locked memory(kbytes) 16384
process 100
nofiles 200
vmemory(kbytes) unlimited
locks unlimited
rtprio 0

76% memory usage seems a bit too high; maybe something is killing the process as a protection. Maybe the first player doesn’t pass the limit but the second one does, and once someone is killed the rest of the players are good to go. Instability at its finest.
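As a rough cross-check of the figures above: the %MEM column from ps looks consistent with RSS divided by the MemTotal of the mounted /proc/meminfo (262144 kB). A minimal Python sketch using only the numbers quoted in this post; the helper name mem_percent is made up for illustration.

```python
# RSS values (kB) copied from the ps output above; MemTotal from /proc/meminfo.
MEM_TOTAL_KB = 262144

def mem_percent(rss_kb, total_kb=MEM_TOTAL_KB):
    """Return RSS as a percentage of total memory (hypothetical helper)."""
    return 100.0 * rss_kb / total_kb

samples = {
    "mono at start": 13896,
    "mono after init": 29332,
    "Answer at start": 3404,
    "Answer after init": 201240,
}

for label, rss_kb in samples.items():
    print(f"{label}: {mem_percent(rss_kb):.2f}%")
```

The C++ bot comes out at about 76.77% of the 256MB figure, which matches the 76.7 shown by ps if ps truncates rather than rounds (an assumption on my side).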

amurushkin · November 16, 2019, 7:51am

_CG_XorMode it happens to me as both players. Sometimes it happens to my opponent.
Also, on UTTT I timed out many times between moves 4 and 20, but I did not see that before with the same code.

_CG_XorMode · November 18, 2019, 9:56am

By “unstable instability”, I meant that at some point someone tells me everything is fixed (or much better), and then 2 hours later that the instability has come back (i.e. 10% of matches fail).

A bit of data about our process in case it helps:

We do not have the word “Killed” in our code. So it must come from a Linux or C++ mechanism.
We rely on cgroups for limiting memory. The OOM killer comes into play when your program exceeds 768MB usage (which I do not think you reached).
The /proc/meminfo is a completely fake, hard-coded file we mount in our jail. It was done for languages that are not cgroup aware (Java 8 for example). I could try unmounting this file for the purpose of the experiment and let Linux provide the host meminfo (15 or 30GB depending on server). I wonder if the C++ runtime uses any info from that file.
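For anyone who wants to compare the fake /proc/meminfo with the limit the cgroup actually enforces, here is a minimal Python sketch. The two paths are the usual locations on a stock Linux system (cgroup v1 and v2). Whether the jail exposes either of them is an assumption, and the function names are made up for illustration.

```python
from pathlib import Path

# Usual locations on stock Linux; whether the jail exposes them is an assumption.
CANDIDATES = [
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
]

def parse_limit(text):
    """Convert the file content to bytes, or None when it says "max" (unlimited)."""
    value = text.strip()
    return None if value == "max" else int(value)

def cgroup_memory_limit():
    """Return the first readable limit in bytes, or None if nothing usable is found."""
    for path in CANDIDATES:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue
    return None

if __name__ == "__main__":
    limit = cgroup_memory_limit()
    print("no readable cgroup limit" if limit is None else f"{limit / 2**20:.0f} MiB")
```

If the 768MB limit mentioned above is what the cgroup enforces, the file would contain 805306368.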

Marchete · November 18, 2019, 10:24am

Maybe the oom_score calculated for the process uses that /proc/meminfo
http://man7.org/linux/man-pages/man5/proc.5.html

I keep getting Killed on other bots, and after reducing the memory usage they no longer crash.

Edit: at least red hat uses that file for memory calculations https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-captun

RoboStac · November 18, 2019, 10:32am

It seems to have continued to get worse over the weekend (and languages other than C++ are starting to see Killed as well).

Killed comes from sh/bash when the running process receives a SIGKILL. It’s almost certainly the kernel oom-killer trying to keep the system alive. Running ‘dmesg 1>&2’ as a bash submission shows a lot of oom killer messages in the logs. The output also shows the contents of the cgroup, and to me it looks like there is an old process hanging around in the group taking up memory (though I’m not sure how your systems work exactly).
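To illustrate the SIGKILL point, here is a small Python sketch (not CodinGame’s starter, just a stand-in) showing how a parent process can tell that its child was terminated by SIGKILL, which is the signal the oom-killer sends. It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Start a child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Send SIGKILL, the same signal the kernel OOM killer delivers.
child.send_signal(signal.SIGKILL)
child.wait()

# On POSIX, a negative returncode -N means "terminated by signal N".
killed_by = -child.returncode if child.returncode < 0 else None
print("terminated by signal:", killed_by)
```

A shell that sees this exit status for a foreground job is what prints the word “Killed”.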

Wild Speculation: Whenever the oom killer activates, the group appears to have 2 copies of ‘starter’ running (I assume this is the process that starts our code), but the first one has a much lower pid (and shows up in multiple messages, whereas the pid of the second starter continually changes). The first starter appears to have run something called ‘Answer’ which has a similar pid and is also consistent across multiple oom kills. Across 5 different machines it’s always been starter and Answer for the low two pids (and the memory usage of Answer is very similar), whereas the higher variable pids have been a combination of starter and something else (I’ve seen python3, g++, Answer). The newest process is then killed, as that’s what the oom-killer prioritizes to keep the system stable, so it never recovers.

Marchete · November 18, 2019, 12:03pm

That’s a very good point.
dmesg | grep -i killed 1>&2

[250312.343835] Task in /CODEMACHINE_6_MEMORY_LIMIT_1 killed as a result of limit of /CODEMACHINE_6_MEMORY_LIMIT_1
[250312.346153] Killed process 29972 (cc1plus) total-vm:180752kB, anon-rss:125296kB, file-rss:15200kB, shmem-rss:0kB
[250356.976554] Task in /CODEMACHINE_3_MEMORY_LIMIT killed as a result of limit of /CODEMACHINE_3_MEMORY_LIMIT
[250356.985061] Killed process 29463 (node) total-vm:1271764kB, anon-rss:770276kB, file-rss:24228kB, shmem-rss:0kB
[250364.070200] Task in /CODEMACHINE_2_MEMORY_LIMIT killed as a result of limit of /CODEMACHINE_2_MEMORY_LIMIT
[250364.080306] Killed process 30066 (node) total-vm:1272308kB, anon-rss:772268kB, file-rss:1648kB, shmem-rss:0kB
[250423.752841] Task in /CODEMACHINE_0_MEMORY_LIMIT_2 killed as a result of limit of /CODEMACHINE_0_MEMORY_LIMIT_2
[250423.754967] Killed process 26059 (Answer) total-vm:598564kB, anon-rss:126244kB, file-rss:2904kB, shmem-rss:80kB
[250455.291707] Task in /CODEMACHINE_4_MEMORY_LIMIT_2 killed as a result of limit of /CODEMACHINE_4_MEMORY_LIMIT_2
[250455.294479] Killed process 4792 (Answer) total-vm:400228kB, anon-rss:126224kB, file-rss:3264kB, shmem-rss:108kB
[250467.600583] Task in /CODEMACHINE_1_MEMORY_LIMIT_2 killed as a result of limit of /CODEMACHINE_1_MEMORY_LIMIT_2

That explains a lot of things. You can see both cc1plus and Answer being killed. So the cc1plus error is related to the memory killer: cc1plus is being killed at about 180MB of virtual memory (total-vm), with about 125MB of anon-rss.
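To compare the kills more easily, the “Killed process” lines can be parsed with a regular expression. A minimal Python sketch, assuming the exact message format shown in this post (it varies between kernel versions); the function name and field names are made up for illustration.

```python
import re

# Two lines copied from the dmesg output above.
LOG = """\
[250312.346153] Killed process 29972 (cc1plus) total-vm:180752kB, anon-rss:125296kB, file-rss:15200kB, shmem-rss:0kB
[250423.754967] Killed process 26059 (Answer) total-vm:598564kB, anon-rss:126244kB, file-rss:2904kB, shmem-rss:80kB
"""

# Assumes the message layout seen in this thread.
KILLED = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB, shmem-rss:(?P<shmem_rss>\d+)kB"
)

def parse_killed(log):
    """Yield (name, pid, resident_kb, total_vm_kb) for each 'Killed process' line."""
    for m in KILLED.finditer(log):
        resident = int(m["anon_rss"]) + int(m["file_rss"]) + int(m["shmem_rss"])
        yield m["name"], int(m["pid"]), resident, int(m["total_vm"])

records = list(parse_killed(LOG))
for name, pid, resident_kb, total_vm_kb in records:
    print(f"{name} (pid {pid}): resident {resident_kb} kB, virtual {total_vm_kb} kB")
```

Summing the three rss fields gives roughly 140MB resident for cc1plus and 129MB for Answer, both well below the 768MB limit mentioned earlier in the thread.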

Maybe, as RoboStac suggests, there are some zombie processes lying around. In brutaltester I saw that Java’s Process.destroy() doesn’t kill subprocesses. So when I used a bot that called system() to run a precompiled binary (or simply a script.sh that called the binary), the binary stayed alive forever, hanging my PC after a couple of matches. Maybe those subprocesses aren’t killed properly and take up memory and CPU time.
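The subprocess problem is not specific to Java: killing only the direct child leaves its descendants running. One common workaround on POSIX is to start the bot in its own process group and kill the whole group. A minimal Python sketch of that idea (this is not brutaltester’s code; WRAPPER is a hypothetical stand-in for a script.sh that launches a binary):

```python
import os
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a wrapper script: a parent that spawns a long-running child.
WRAPPER = (
    "import subprocess, sys, time\n"
    "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "time.sleep(60)\n"
)

# start_new_session=True makes the wrapper the leader of a new process group,
# so its descendants share a group id equal to the wrapper's pid.
proc = subprocess.Popen([sys.executable, "-c", WRAPPER], start_new_session=True)
time.sleep(0.5)  # give the wrapper time to spawn its child

# Signal the whole group rather than only the direct child process.
os.killpg(proc.pid, signal.SIGKILL)
proc.wait()
print("wrapper return code:", proc.returncode)
```

This only reaches descendants that stay in the same process group; a child that calls setsid itself would escape it.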

amurushkin · November 18, 2019, 12:21pm

For me, the g++ “Killed” error at compilation is starting to appear more often. Sometimes it happens 3-4 times in a row when I run code in the IDE. Even Python code that uses numpy was killed.

ThomasNicoullaud · November 18, 2019, 3:56pm

Hi,

80% of my last UTTT games in the arena timed out on the first turn.
I get one of 2 errors, at random:

"Code failed: your program was terminated before reaching the main entry point for your language
(possible reasons: segfault on static initializer, too much memory used, etc.) "
“g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.”

I’m faaaaaaaaaaaaaalling

_CG_XorMode · November 18, 2019, 5:17pm

What you describe is consistent with something slowly degrading on the machine: judging from your various feedback, the problem is not there just after a reboot and becomes more and more significant as time passes.

Unfortunately there are no zombies on these machines, nor long running programs and there is plenty of free memory available.

These dmesg messages are quite interesting for trying to understand what happens. We’ll investigate more tomorrow on our preprod environments.

Thanks for helping. I’ll keep you posted.

_CG_XorMode · November 18, 2019, 6:20pm

OK, there are no zombies, but there are indeed “Answer” processes in a weird state that seem to be counted as consuming >600MB. Hence, when your program is executed in a cgroup where such a process lives, the OOM killer kills your program, based on the election mechanism described above by RoboStac.

When I say weird, I mean really weird: they do not appear in ps but do appear in htop, and they are not listed by ls -l /proc | grep XXXX, BUT you can cd /proc/XXXX and land in that process’s directory. More investigation tomorrow.