After a day, we have just one of these processes on one of our servers. It is normal C++ code for UTTT.
The only thing I found weird (and probably the reason for the issue) is that this code creates one computation thread which is “detached” using “detach()”. From the thread.detach() doc, I do not see why someone would need to do that in the context of CodinGame. Have the 9.2 C++ libs changed something in the implementation of detach?
The game produced is valid.
Anyway, we are doing some more tests, like submitting the code on our test server to try to reproduce, but so far no luck.
I think the detach is there because the code is also used outside of CG (as I believe a lot of top CodinGamers do when implementing their solutions), and the detach() may make sense in that case (boilerplate code to make it compatible with execution inside or outside CG).
Anyway, this detach() should not cause issues with the Linux cgroup freezer system. I wonder if this is a Linux cgroup bug that was brought to light by the new C++ 9.2 libs.
In my own code, the .detach() is there because I want to have two threads that run independently (and this makes sense even if the code is run on a single-core machine); see the sketch after this list:
one thread updates a tree (MCTS) continuously,
one thread interacts with the referee and reads the tree in order to answer in due time.
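Roughly, the layout looks like this (a minimal sketch, not my actual bot: the atomic counter stands in for the tree, and the 90 ms think budget is a placeholder):

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Stand-in for the shared MCTS tree (the real thing is a tree, of course).
std::atomic<long> iterations{0};

int main() {
    // Thread 1: grows the tree continuously, for the whole game.
    std::thread th([] {
        for (;;) ++iterations;
    });
    th.detach(); // main never joins it; it simply dies with the process

    // Thread 2 (main): talks to the referee and reads the tree every turn.
    std::string line;
    while (std::getline(std::cin, line)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(90)); // placeholder budget
        std::cout << iterations.load() << std::endl; // answer derived from the tree
    }
    return 0;
}
```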
I can actually remove the call to .detach() and still have a bot with the same behavior. In the end, the issue might be somewhere else.
This timeout is not an issue (at least not a CG issue).
In fact, I had already found at turn 49 that teccles would win no matter what.
The issue (in my code) is that teccles plays a winning sequence of moves that is not the one my bot was expecting. So my bot fails, and thus times out.
Yes, I mentioned this replay because your program stayed in memory after this specific game and we just could not kill it (we had to reboot the server).
Anyway, we made some progress, as we can now reproduce on our test servers. After looking at the Linux kernel code based on the stacks of the stuck threads, we believe this is a Linux cgroup defect, so we are going to change our kill procedure slightly to make sure we do not fall into this potential defect.
NB: I don’t understand your use of detach(), but that’s OK; I ruled it out, as I could reproduce the problem with or without detach().
In my history, I can find four games similar to the replay above, where teccles wins and my bot crashes (legend says that cats are stronger than angry turtles at UTTT).
Did you have the issue only for one game?
NB: th.detach() is here to ensure that my working thread th will be stopped properly, should my code ever reach the end of the main function.
Well, there have been several games provoking the issue, but that’s the only one since we added the debugging that allowed us to tie a semi-dead process to a game. I believe there is a race condition of some kind, as we only reproduce after waiting a few minutes once we submit your code on our test server.
N.B.: If we are talking about the detach() thing, I would first say that on CG your code is most of the time killed, so your main does not reach the end. Then, I have a different reading of the detach() documentation (meaning, I believe that detach() makes your thread stop improperly in case you reach the end of the main function), and it seems to be bad practice in general based on these resources: https://stackoverflow.com/questions/22803600/when-should-i-use-stdthreaddetach, https://stackoverflow.com/questions/3756882/detached-vs-joinable-posix-threads. Anyway, whether or not a thread completed properly once your process is itself completed does not really matter: all resources are collected properly by the OS in the end.
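To illustrate that reading, here is a tiny sketch (my own example, not CG code): the detached worker is cut off when main returns, so the destructor of an object on its stack never runs.

```cpp
#include <chrono>
#include <iostream>
#include <thread>

struct Guard {
    ~Guard() { std::cout << "destructor ran\n"; } // never printed below
};

int main() {
    std::thread([] {
        Guard g; // lives on the worker's stack
        std::this_thread::sleep_for(std::chrono::seconds(10));
    }).detach();
    // main returns right away: the process exits, the detached thread is
    // terminated without unwinding its stack, and ~Guard never runs.
    return 0;
}
```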
Then you should check whether the issue only happens when (the main thread of) my bot is crashing.
edit: I submitted a new version of my code by mistake (this “submit” button is too close to “replay in same conditions” for me).
If things are getting worse (as I fear), tell me and I will submit my old Haskell code instead.
As for detach, both those threads say that I should use either join or detach.
Some people seem convinced that a thread must always return a value, and that join is therefore the only way to go. But using join in my context would imply creating one new thread each turn, which is not desirable (the idea is precisely to run the computation of the MCTS tree once and for all).
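To make that concrete, this is roughly the per-turn pattern join would force on me (a sketch with a hypothetical empty search body):

```cpp
#include <iostream>
#include <string>
#include <thread>

int main() {
    std::string line;
    while (std::getline(std::cin, line)) { // one referee input per turn
        // A brand-new search thread every turn, just so join() has
        // something to wait on; the long-lived tree is gone.
        std::thread worker([] { /* per-turn search only */ });
        worker.join(); // must finish before we can answer
        std::cout << "move" << std::endl;
    }
    return 0;
}
```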
So here is what we saw: we upgraded all the runtimes of our jail, then problems started to show up and CG players started to complain. So the investigation started…
The first thing that came to mind was that something was wrong with the new runtimes. After a lot of trying to understand and reproduce, we came to believe that something was wrong with Linux itself.
But why when updating the runtimes? In fact, as a procedure, when we update our jails, prior to installing the jails we install security updates on the host system (Ubuntu 18.04). This is automated, so I did not realize it had occurred (it is usually a seamless operation).
But this time, the kernel changed from 4.15.0-65 to 4.15.0-70, and guess what: when reverting to the base 4.15.0-58 image, the problem no longer occurs.
The 4.15.0-70 patch was delivered on Nov 13, just a day or two before we updated our runtimes in production. Bad luck, really…
My guess is that it is too new for others to have reported the problem. Each time there has been this type of issue (unkillable “D”-state processes), it was reported by the Docker community, so I would not be surprised to see a report/fix in the coming weeks.
Anyway, we are doing some more tests, and we’ll reimage all our servers to the base Ubuntu 18.04 image, which is at 4.15.0-58.
@amurushkin, go ahead. What I fixed was mostly related to timeouts due to a lack of memory, so I am not sure we fixed all timeout types. At least we’ll know.
I was able to pass the Hypersonic resize test with 650 MB+ of reservations (64 blocks of 11 MB, plus some other vector resizes). No crashes until I reached the submit limit.
With more memory usage, the oom-killer kicks me out, and dmesg | grep -i killed 1>&2 is consistent with a bot going over the 768 MB limit.
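For reference, the probe looked roughly like this (a minimal sketch, not my exact test code; the block count and size follow the numbers above):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<std::vector<char>> blocks;
    const std::size_t blockSize = 11 * 1024 * 1024; // ~11 MB per block
    for (int i = 0; i < 64; ++i) {                  // 64 blocks ≈ 700 MB total
        blocks.emplace_back(blockSize, 0); // value-init actually touches the pages
        std::fprintf(stderr, "reserved block %d (~%zu MB total)\n",
                     i + 1, (i + 1) * blockSize / (1024 * 1024));
    }
    // Past the ~768 MB limit the oom-killer steps in, which then shows up
    // in dmesg as a "Killed process" entry.
    return 0;
}
```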