Timeouts with C++ in CSB (might be more general)


#1

I have been trying to find an elusive timeout bug in my CSB bot these last few days. I tried a lot of things but no luck. Then I decided that maybe it is the timing mechanism. I wrote a small bot: https://pastebin.com/P0P40wcs

The bot only measures time. Each loop iteration should take less than a microsecond, but it measures occasional jumps of over 1 ms, sometimes even 11 ms. I do this with steady_clock in C++, but I also tried high_resolution_clock (steady_clock is the better choice anyway because its is_steady is true, so it never goes backwards). I don’t know why these jumps happen, but they do explain my bot’s timeouts, as I get time jumps of exactly the same size as the ones that time me out.
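A minimal sketch of this kind of measurement loop is shown here. It is not the pastebin code itself: GapStats, measure_gaps and the threshold parameter are names made up for illustration. It spins on steady_clock and records the gap between consecutive readings.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical result type, not taken from the linked bot.
struct GapStats {
    std::int64_t iterations = 0;  // loop iterations executed
    std::int64_t spikes = 0;      // gaps larger than the threshold
    std::int64_t max_gap_us = 0;  // largest gap seen, in microseconds
};

// Busy-loops for `run_for` and records the time between two consecutive
// clock readings. On an undisturbed core every gap should be tiny.
GapStats measure_gaps(std::chrono::milliseconds run_for,
                      std::chrono::microseconds spike_threshold) {
    using clock = std::chrono::steady_clock;
    assert(run_for.count() > 0);

    GapStats stats;
    const auto start = clock::now();
    auto previous = start;
    for (;;) {
        const auto now = clock::now();
        const auto gap =
            std::chrono::duration_cast<std::chrono::microseconds>(now - previous);
        if (gap > spike_threshold) ++stats.spikes;
        if (gap.count() > stats.max_gap_us) stats.max_gap_us = gap.count();
        previous = now;
        ++stats.iterations;
        if (now - start >= run_for) break;
    }
    return stats;
}
```

Calling something like measure_gaps(std::chrono::milliseconds(50), std::chrono::microseconds(1000)) once per turn and printing the fields to std::cerr would show whether any gap exceeded 1 ms during that turn.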

My questions:

  1. What causes this? It is possible that I am doing something wrong, but I don’t see it.

  2. Is it language-specific or arena-specific?

  3. What can we do about it? Is there anything we can do besides setting a wide margin?


#2

We’ll look into it.


#3

On CG I always keep a margin of 10 or 15ms (it depends on the puzzle), because if you try to use exactly 100ms (or 50ms), you’ll time out many times.
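A margin like this can be sketched as a small timer that stops the search before the official limit. TurnTimer and its parameters are hypothetical names, and the 100ms / 10ms values in the usage note are just the figures mentioned here, not a recommendation.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: the usable budget is the turn limit minus a margin.
class TurnTimer {
public:
    using clock = std::chrono::steady_clock;

    TurnTimer(std::chrono::milliseconds limit, std::chrono::milliseconds margin)
        : start_(clock::now()), budget_(limit - margin) {
        assert(margin < limit);  // otherwise nothing is left to search with
    }

    // Time the bot allows itself for this turn.
    std::chrono::milliseconds budget() const { return budget_; }

    // True once the self-imposed budget has been used up.
    bool expired() const { return clock::now() - start_ >= budget_; }

private:
    clock::time_point start_;
    std::chrono::milliseconds budget_;
};
```

A search loop would construct TurnTimer timer(std::chrono::milliseconds(100), std::chrono::milliseconds(10)) at the start of the turn and run while (!timer.expired()), possibly checking only every few iterations if reading the clock is too costly. The margin only protects against spikes that are smaller than it.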

I know the CodinGame devs have always said that the referee only counts the time spent in our code, but I know that’s false, and many of us have already experienced it.

From the SDK code, I think that sometimes the GC of the SDK process starts its collection while the referee is waiting for our code’s response. So the referee counts our time plus the SDK GC time.


#4

I agree that you need a margin, but 10-15 ms seems too much. I don’t remember needing this much before either. CSB only has 75 ms max, so it hurts quite a bit. If we understood how the problem happens, we could maybe get a better estimate for a healthy margin.

For example, it might be that when we get this jump, it is partially compensated for in the timeout check. So maybe when I get an 11 ms jump, the process also gives my bot more leeway if it goes over 75 ms (maybe it would allow up to 78 in that case or something). I have no idea if this is the case, but it’s important to know something like that.


#5

Just play Wondev Woman multi… I don’t know what the cause may be, but I noticed many timeouts there. Maybe it’s just my impression, but WW seems the worst of all.


#6

@Magus, I’m not sure why you say we say false things, and I’m not sure how we could have denied the simple fact that timeouts do occur if you try to use the full 100ms. The reasons are just hard to pinpoint exactly.

So we do try to give you the 100ms you deserve, but it is also true that our game engine operates on the same CPU as your process and may steal some CPU cycles from your code when reading from your input (including if/when GC occurs).

Using two CPUs per game (the easy way) would double the cost of our infrastructure. Developing an alternative solution (measuring CPU time rather than user time, or segregating all game engine processes on a given CPU) is a huge task and we are not ready to do it just yet.
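To illustrate the distinction from the bot’s side, here is a rough sketch that times the same piece of work with two clocks. It assumes that “user time” above means elapsed wall-clock time. Elapsed, time_both and sleep_20ms are hypothetical names, and nothing here reflects how the referee actually measures.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Elapsed {
    double wall_ms;  // elapsed real time, including time spent descheduled
    double cpu_ms;   // processor time charged to this process only
};

// Hypothetical helper: runs `work` and reports both measurements.
template <typename Work>
Elapsed time_both(Work work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Elapsed e;
    e.wall_ms =
        std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    e.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return e;
}

// Example workload that waits without using the CPU, standing in for a
// process that has been paused while something else runs on its core.
inline void sleep_20ms() {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
}
```

With time_both(sleep_20ms), wall_ms comes out at roughly 20 or more while cpu_ms stays close to zero. A limit enforced on wall-clock time charges the process for pauses it did not cause, which is the effect described above.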

We do understand the frustration though.


#7

Maybe you can set a timeout of T+10ms, then just do some validation at the end of the turn to filter out those spikes. If a bot takes 96ms on 100 turns and 106ms on 1 turn, that one turn should be considered a spike.
This is just a change in the referee’s TimeoutException handling:

static final int TIMEOUT_GUARD = 10; // extra milliseconds tolerated before a hard timeout

// As written, the guard is added to TIMEOUTN only; turn 0 keeps TIMEOUT0.
gameManager.setTurnMaxTime(gameTurn == 0 ? TIMEOUT0 : TIMEOUTN + TIMEOUT_GUARD);
for (int player = 0; player < 2; player++) {
    Player sdkplayer = gameManager.getPlayer(player);
    try {
        /******** doRefereeStuff ********/
        // Send inputs
        // Receive output

        // Validate timeouts. Maybe it's better to move that inside gameManager.
        int msPlayer = getUsedMS();
        msPlayerTimeHistory[player].add(msPlayer);

        boolean overTime = msPlayer > (gameTurn == 0 ? TIMEOUT0 : TIMEOUTN);
        if (overTime) {
            // Check when the last overTime happened and do more checks to
            // ensure the guard is not being abused.
            // isOverTimeAbused is a placeholder for that logic, not an SDK method.
            if (isOverTimeAbused(msPlayerTimeHistory[player])) {
                throw new TimeoutException();
            }
        }
    } catch (InvalidActionHard e) {
        HandleError(gameManager, sdkplayer, sdkplayer.getNicknameToken() + ": " + e.getMessage());
        return;
    } catch (TimeoutException e) {
        HandleError(gameManager, sdkplayer, sdkplayer.getNicknameToken() + " timeout!");
        return;
    }
}

It’s not a token bucket (because that can be harder to fine tune).


#8

I never said the CG devs denied the fact that timeouts occur. I just said that the CG devs say that the referee only counts the time spent in our code, and from the SDK code it’s “true”.

I’m sorry if there’s a misunderstanding.

I’m pretty sure that the SDK is also counting the SDK garbage collector (if it is triggered at that moment). It’s also possible that the referee’s GC is counted in that time too, because it’s even worse when the referee is “slow” (as pointed out by Marchete with Wondev Woman).


#9

Hi guys, we just made sure that the Referee cannot have any impact on the active player.
Please let us know if it changes anything on your side.


#11

Edit: Not all timeouts are solved. I think the remaining ones are not referee-related. They usually happen on the first turn, before the bot has even received any data:
“Code failed: your program was terminated before reaching the main entry point for your language
(possible reasons: segfault on static initializer, too much memory used, etc.)
mount proc: Device or resource busy”

It crashed on turn 0 (for a bot that takes around 50-60ms on its first turn). The bot doesn’t even seem to start.

It seems more frequent in the first 10 matches of a submit, but it also happens at random. The one above was from around the 25th match of a resubmit. On a bad resubmit, 3 out of 10 failed.


#12

My bot, which is written in Java, also gets this error, though not as often as yours. I would say 1 out of 100 games.
Replay: https://www.codingame.com/replay/366897481


#13

Aaah I also have these timeouts, I was wondering if I had something wrong on my side (it might be the case).


#14

I still get the same unexplained time spikes. You can easily check it yourself by running the timer bot on CSB that I shared in the first post. It may have gotten slightly better, I’m not sure, but it is probably still the same. A few tests gave me spikes of 5-7 ms per game. Let me know if you think I am measuring it wrong. The timer bot is really simple.


#15

I got this error message several times over the last weekend while doing practice puzzles in Go (e.g. Mars Lander Level 2, Flood Fill Example…).


#16

Hello,

Yes, we used your timer bot all along. It is very simple and very effective in detecting spikes, so thank you for that. Based on what we saw, we concluded that for some games (not CSB), the Referee had an impact on these spikes, which were very predictable (always occurring at turn X). This is why we decided to block the Referee during a player’s turn (just like we block other players during a given player’s turn). This fixed the predictable spikes.

There are still some spikes on CSB and other games that we need to investigate, but we believe that even for CSB the number of occurrences has been lowered.

Also we were not aware of the “Mount proc: Resource busy” issue. This is the issue we are focusing on right now.

So, as usual with Timeouts, it is very hard to separate one issue from the other, as each CodinGamer in this thread is talking about a different issue.

But we are getting there…


#17

I think referee timeouts are reduced, at least in Wondev Woman where I saw a lot of timeouts before.
If the “mount error” also happens in single-player puzzles it may be a problem. You can submit a puzzle and some hidden validator can fail due to that error. It would confuse the player.

It may be a problem in the jail creation. That’s one step where mount is used to create the /tmp for each user (I don’t know the details but I guess it works that way). It also happens in any game, and is language-independent.
I thought at first it might be related to the C++ ugliness we use for performance reasons (pragmas, big arrays created at initialization that take up a lot of memory, stack settings), but Java is less dirty in most cases, and it has the problem too.

I keep seeing timeouts in both tests and submits.
I would prefer that, if it’s a server-side issue, the match is retried.


#18

This “mount proc: Device or resource busy” is very annoying. I lose games due to it in every CSB submit. Sometimes it’s more frequent than others, like losing 4 games in the first batch of 20 games (10*2 because of swap) of a submit.


#19

Number of “mount proc…” errors occurring on the platform yesterday

Now that we can measure, let’s fix it!


#20

If not already done, you should check n_errors normalised by the number of games played over the same period, to see whether the probability, rather than the raw number, is changing.
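As a quick sketch of why the normalisation matters (error_rate is a hypothetical helper and the counts in the usage note are made up purely for illustration):

```cpp
#include <cassert>

// Hypothetical helper: fraction of games that hit the error in a period.
double error_rate(long errors, long games) {
    assert(games > 0 && errors >= 0 && errors <= games);
    return static_cast<double>(errors) / static_cast<double>(games);
}
```

With invented numbers, 50 errors in 10000 games is a lower rate than 40 errors in 5000 games, so error_rate(50, 10000) < error_rate(40, 5000) even though the raw count went down from 50 to 40.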


#21

@Marchete yes, based on our code, I think this is language-independent too, but I believe it is limited to compiled languages. Between compilation and execution we do a umount of /proc followed by a remount of /proc. It is this remount that fails. Unless you tell me that you also see the “mount proc” error during the compilation phase, or for a non-compiled language, in which case my theory is wrong.

We’ll add more logs to validate or invalidate this theory.