Well no, no orphans or zombies, just cgroup-frozen processes (as you may know, frozen processes cannot be killed until they are unfrozen).
What we see is a CodinGamer AI program (don’t know which code at this point), which escapes its memory cgroup when it is in a frozen state.
I believe it is always the same program because it consumes exactly the same amount of memory on every server. It is slowly poisoning all our servers (some started on Nov 15, others on Nov 16, and so on).
What should happen is this: our “starter” program, which is placed in a memory cgroup, starts a CodinGamer program as a child. The cgroup mechanism then automatically places this child in the same memory cgroup as its parent, the “starter” program. We also place the child program in a “freezer” cgroup, so we can freeze/unfreeze the program during a game by changing the freeze state of that cgroup.
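To make the intended setup concrete, here is a minimal sketch of the freeze/unfreeze step, assuming the cgroup v1 freezer interface (a `freezer.state` file that accepts `FROZEN` and `THAWED`, and a `cgroup.procs` file that accepts a PID). The function names and the example path are hypothetical, not our actual starter code.

```python
import os

def add_to_cgroup(cgroup_dir, pid):
    """Move a process into a cgroup by writing its PID to cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))

def set_frozen(cgroup_dir, frozen):
    """Freeze (True) or thaw (False) every task in a v1 freezer cgroup."""
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write("FROZEN" if frozen else "THAWED")

# Hypothetical usage (needs root and a mounted v1 freezer hierarchy):
#   cg = "/sys/fs/cgroup/freezer/game42"   # assumed path, for illustration
#   add_to_cgroup(cg, child_pid)
#   set_frozen(cg, True)    # pause the player between turns
#   set_frozen(cg, False)   # resume it
```

Note that the thaw only reaches tasks that are still members of the freezer cgroup, which is why a process that has dropped out of it stays stuck.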
Now what we do see is a “starter” program with a child in a frozen state (“D” status) BUT no longer in the memory cgroup of its parent (i.e. not listed in the “tasks” or “cgroup.procs” files of the cgroup). It is also nowhere to be seen in any “freezer” cgroup (and so cannot be unfrozen). So somehow this process escaped its cgroup jail but is still accounted for at the memory level by its original memory cgroup.
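For anyone wanting to check for the same symptom, here is a rough sketch of how the membership mismatch can be detected. It assumes the usual `/proc/<pid>/cgroup` line format (`hierarchy-id:controllers:path`) and a `cgroup.procs` file with one PID per line; the helper names are made up for illustration.

```python
import os

def parse_proc_cgroup(text):
    """Map each controller name to the cgroup path the process reports."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _hier_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            result[controller] = path
    return result

def is_listed(pid, cgroup_dir):
    """Return True if pid appears in the cgroup's cgroup.procs file."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return pid in {int(tok) for tok in f.read().split()}

def reported_cgroups(pid, proc_root="/proc"):
    """Read what the kernel reports for this pid, per controller."""
    with open(os.path.join(proc_root, str(pid), "cgroup")) as f:
        return parse_proc_cgroup(f.read())
```

An escaped child would be one whose parent is listed in the memory cgroup while `is_listed(child_pid, ...)` returns False for both the memory and the freezer cgroup directories.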
Once we identify the CodinGamer source code causing the issue, things may become clearer but that’ll take time.