Search for Ghosts in the Cell
I still think there must be some kind of zombie process/thread. Well, “orphaned” in Linux terms
Well no, no orphans or zombies, just cgroup-frozen processes (as you may know, frozen processes cannot be killed until they are unfrozen).
What we see is a CodinGamer AI program (don’t know which code at this point), which escapes its memory cgroup when it is in a frozen state.
I believe it is always the same program because it always consumes exactly the same amount of memory across all our servers. It is slowly poisoning all our servers (some started on Nov 15, others on Nov 16, and so on).
What should happen is this: our “starter” program, which is placed in a memory cgroup, starts a CodinGamer program as a child. The cgroup mechanism then automatically places this child in the same memory cgroup as its parent, the “starter” program. We also place the child program in a “freezer” cgroup, so we can freeze/unfreeze the program during a game by changing the “freeze” status of that cgroup.
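To make the freeze/unfreeze step concrete, here is a minimal sketch of how a freezer cgroup can be driven from a script. It assumes a cgroup v1 freezer hierarchy (typically mounted under /sys/fs/cgroup/freezer); the function names and the directory you pass in are hypothetical, and this is not the actual “starter” code.

```python
from pathlib import Path

# Sketch only: assumes a cgroup v1 freezer hierarchy. The cgroup
# directory passed in (e.g. somewhere under /sys/fs/cgroup/freezer)
# is an assumption, not the real CodinGame layout.

def add_task(cgroup_dir, pid):
    """Attach a process to a cgroup by writing its PID to cgroup.procs."""
    (Path(cgroup_dir) / "cgroup.procs").write_text(f"{pid}\n")

def set_frozen(cgroup_dir, frozen):
    """Freeze or thaw every task in the cgroup via freezer.state."""
    state = "FROZEN" if frozen else "THAWED"
    (Path(cgroup_dir) / "freezer.state").write_text(state + "\n")

def freezer_state(cgroup_dir):
    """Return the current state; on a real kernel this can also be FREEZING."""
    return (Path(cgroup_dir) / "freezer.state").read_text().strip()
```

On a real server these writes need root (or ownership of the cgroup directory). Children forked after the parent was attached start out in the parent's cgroup, which is why only the freezer placement has to be done explicitly.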
Now what we do see is a “starter” program with a child in a frozen state (“D” status) BUT no longer in the memory cgroup of its parent (i.e. not listed in the “tasks” or “cgroup.procs” files of the cgroup). It is also nowhere to be seen in any “freezer” cgroup (and so cannot be unfrozen). So somehow this process escaped its cgroups jail but is still accounted for at the memory level by its original memory cgroup.
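For anyone who wants to check for this kind of mismatch on their own machine, here is a rough sketch of the comparison: what the process itself reports in /proc versus what the cgroup directory lists. The helper names and the proc_root parameter are mine, for illustration only; the file formats (/proc/PID/cgroup, /proc/PID/stat, cgroup.procs) are the standard Linux ones.

```python
from pathlib import Path

def pids_in_cgroup(cgroup_dir):
    """PIDs the cgroup itself lists in its cgroup.procs file."""
    text = (Path(cgroup_dir) / "cgroup.procs").read_text()
    return {int(pid) for pid in text.split()}

def cgroups_of(pid, proc_root="/proc"):
    """Map controller list -> cgroup path, as reported by /proc/PID/cgroup.

    Each line has the form "hierarchy-id:controllers:path".
    """
    result = {}
    lines = (Path(proc_root) / str(pid) / "cgroup").read_text().splitlines()
    for line in lines:
        _hier_id, controllers, path = line.split(":", 2)
        result[controllers] = path
    return result

def process_state(pid, proc_root="/proc"):
    """One-letter state from /proc/PID/stat ("D" = uninterruptible sleep)."""
    stat = (Path(proc_root) / str(pid) / "stat").read_text()
    # The command name is in parentheses and may itself contain ")" or
    # spaces, so split on the last ")" before reading the state field.
    return stat.rsplit(")", 1)[1].split()[0]
```

A process in “D” state whose PID is missing from pids_in_cgroup() for the cgroup it was started in would match the symptom described above.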
Once we identify the CodinGamer source code causing the issue, things may become clearer but that’ll take time.
I’m sorry for all the troubles, but from a geek/technical point of view the bug is very interesting
Oh yes, I agree:) Just wish to have the answer.
For information, I rebooted the servers yesterday evening except one to be able to study it.
Curiously, this morning, the faulty behavior has not yet occurred on the rebooted servers…
- CodinGame for Work, same technology (lxc like, chroot+cgroups), different pool
- tech.io (or CodinGame) playgrounds, different technology, docker based
So there’s someone who successfully wrote code that can escape from the CodinGame jail. And since the issue comes back quickly after a server reboot, this code is currently being used/submitted. I’m very curious what kind of code can do that.
and who’s that guy?
It half escapes. The process is no longer listed in any cgroup, but its original memory cgroup still counts it. Every attempt on our part to unfreeze and kill the process has failed. Only rebooting does it.
Anyway, deployment in progress with additional debug which will allow us to quickly identify the origin code and CodinGamer.
BTW, new processes started to show at 8:16 and 8:39 UTC this morning.
Debug deployed. All servers rebooted. Just have to wait.
I’m intrigued but I must say I understand almost nothing about cgroups or jails.
I managed a lot of Linux servers but all of them were simple ones, just a bunch of services running in each one, LAMP and not much more.
I found some paper that talks about processes going out of cgroups:
To see if it helps.
Chances are it’s not even intentional. Probably some fancy fork work that CG can’t handle. It’d be hilarious.
So do we know who the witch is?
For now the immortal processes have not reappeared. Do you have issues at the moment?
No issues seen on a full submit of two different games.
Great, let’s just wait for these processes to come back, eventually.
Well, it seems it does not happen often, but I still see timeouts during games, this time even on CSB, where it did not happen before.
After a day, we have just one of these processes on one of our servers. It is a normal C++ code for UTTT.
The only thing I found weird (and probably the reason for the issue) is that this code creates one computation thread which is “detached” using “detach()”. From the thread.detach() doc, I do not see why someone would need to do that in the context of CodinGame. Have the 9.2 C++ libs changed something in the implementation of detach?
The game produced is valid.
Anyway, I am doing some more tests, like submitting the code on our test server to try to reproduce the issue, but so far no luck.
I’ll keep you posted.
For the purpose of forcing a timeout for the opponent?