Java - Random timeouts due to


#1

Good day,

During last contest I experienced random timeouts although my code is used to reply within 4 ms out of 50 ms. So I decided to investigate.

The first lead to explore was to check my code. The timeout seemed to happen at different locations depending on the game but for a given game, it seemed it always happened at the same location. So I printed timestamps to have an idea where it comes from. And for the first game I studied, the timeout came from a basic stream().collect() in order to append strings. This statement took 40 ms to execute which was very strange. I decided to replace it by a loop but it didn’t solve the timeout. I tried to study other games and each time the timeouts came from different basic statements.

Then I started to explore a second lead. The garbage collector. Maybe a garbage collection was triggered from time to time and suspended the player run. In Java it is possible to get the number of GC with this code: “ManagementFactory.getGarbageCollectorMXBeans().get(i).getCollectionCount()”. The result was
very fast to analyse since there were no GCs during the whole game (at least nothing reported by this method). So it seemed that the garbage collection was not the issue.
Edit: actually there were two garbage collections; see my additional comment below.

At that moment it remained only one last lead in my mind. The JIT compilation. In Java the time spent in JIT compilation is available with “ManagementFactory.getCompilationMXBean().getTotalCompilationTime()” (if supported by the JVM). I decided to print timestamps and JIT compilation time in my simulation loop (each simulation is supposed to take almost the same time).

Here is the result of the first round.

iteration / timestamp in µs / JIT compilation time in ms (from the beginning)
1 / 14952 / 228
2 / 16747 / 229
3 / 19466 / 231
4 / 19739 / 231
5 / 19960 / 231
6 / 20201 / 231

On iteration 4, 5, 6, there are no JIT compilations and each simulation lasts around 250 µs. On the other hand there is a JIT compilation of 1 ms and another one of 2 ms on iteration 2 and 3 and these simulations last 1.8 ms and 2.7 ms. So it seems that the JIT compilation really impacts the main thread accordingly. Since
Codingame’s system is single-threaded, the game should be suspended to execute the JIT compilation.

In addition here is the result of round 2 (for a different simulation loop).
0 / 569 / 193
1 / 1722 / 193
2 / 2263 / 193
3 / 2642 / 193
4 / 3064 / 193
5 / 56230 / 194
6 / 56555 / 194
7 / 56784 / 194
8 / 57017 / 194
9 / 57262 / 194
10/ 57484 / 194

On iteration 5, we can see a JIT compilation of 1 ms which impacts the game of 53 ms and causes the timeout (we can also see that the simulation lasts 400 µs before JIT and 250 µs after). This behaviour is observable on all games. Around the round 2 or 3, there is a JIT compilation which impacts the game of several milliseconds (I don’t really know why the reported JIT compilation time does not match the delay on the simulation in this case).
Edit: as mentioned in my additional comment below, this specific JIT compilation actually also matches a GC execution.

Along the full game, there are usually only four or five JIT compilations which happen during the player run. They mainly appear during the first round, then there is the “very impacted” one on round 2 or 3, then there can be one or two other JIT compilation during later rounds which could delay the main thread of 1 ms. Overall the JIT compilation takes between 1 and 2 seconds. Fortunately all other JIT compilations occur when the player is waiting for the referee input (I guess the scheduler takes this opportunity to switch the threads). By the way, funny thing: I saw a game which timed out before reading any input; the “impacted” JIT compilation occured while the player was trying to read the first input…

Now I hope we can find a solution to that problem. Personnaly I never worked on that specific topic and I’m not sure what we could do. My suggestions would be:

  • increase the maximum time of the first rounds to avoid timeouts;
  • use the JVM option -Xcomp (to compile everything at the first call).

I would be happy to read your comments.

Xavier.


#2

Maybe the issue is not java related only.

With a C++ code you can have a timeout even with a timer at 40ms (so you keep a margin of 10ms, but you can have a random timeout).

According to the SDK code, i’m starting to be sure that the referee can trigger it’s own garbage collector when it is waiting for a player output. So the referee garbage collecting is counted in our computation time.

My second guess is system dependent because there’s no guarantee that the OS will stream an input to a process directly. It is asynchronous and it can be delayed because the OS decided so.

But this delay is almost never greater than 5ms. That’s why we keep a 10ms margin.

For java related issues, the garbage collection will likely never take more than 20ms. For the JIT issue, there was an issue with streams. The first time you used a stream, you could timeout because of JIT. But codingame already fixed that issue (by forcing the stream classes compilation before our code). But maybe you can trigger that issue with something else.

Can you find what new class you are using at round 5 ?

A solution is to force the load of the classes you use during the first round.

I remember [CG]Thibaud said that someday (one eternity later :smiley: ), we could customize the used command line for compilation. But maybe we’ll aslo have access to the running command line. I don’t know. It could solve the issue. But CodinGame has a timeout at the compilation (i already reach it with a C++ code abusing of recursive templates :smiley: ).


#3

No new classes. Here are the logs.

iteration / timestamp / JIT compilation / number of loaded classes (from startup)
0 / 499 / 244 / 959
1 / 10028 / 245 / 959
2 / 10819 / 245 / 959
3 / 11139 / 245 / 959


#4

So JIT use 10ms to compile … nothing ? What sorcery is this ?


#5

I would like to correct my first report. There was a mistake in the way I printed out the number of garbage collections. I didn’t see it because System.gc() that I used in my test was correctly taken into account but not other types of GCs.

Now logs look like that (cumulative data from startup).

iteration / timestamp(µs) / JIT compilation(ms) / number of loaded classes / GC time(ms)
0 / 601 / 186 / 959 / 0
1 / 1166 / 186 / 959 / 0
2 / 2008 / 186 / 959 / 0
3 / 2699 / 186 / 959 / 0
4 / 13393 / 188 / 959 / 0
5 / 13657 / 188 / 959 / 0
6 / 73964 / 188 / 959 / 59
7 / 74212 / 188 / 959 / 59

On iteration 4 there is a JIT compilation of 2 ms. It results to a ~10 ms latency and next iterations seem to be 50% faster.

On iteration 6 the latency is much bigger (60 ms) and is due to a GC of 59 ms. In the case of my player, there are only two GCs during a full game. The above one around round 2-3 and another one in the middle of the game. The second one is less impacting but still 10 ms.

Since my player is able to provide an action with more than 10 ms margin, my target was to deal with the first GC of +/- 50 ms which was responsible of all the timeouts (sometimes the GC is faster and does not cause any timeouts or is not triggered during a round). That said I’m not sure why a GC is triggered since the player uses ~17mB (memory) out of 61mB (initial heap allocation). Anyway I found a workaround which has eliminated this first GC. Since the first round can last 1 second, I trigger the GC myself before printing my first action. And magically the system does not trigger this first impacting GC itself (the second one is still executed).


#6

thank you for the investigation @Xa2017 it’s really helpful. We’re doing some tests internally, and we already have solved an issue. Will tell more when we’re finished.


#7

New servers brought a high amount of random timeouts and device resource busy errors.
Not only Java, but at least C++ too. I just assume now that like 3 out of 100 matches I’ll timeout (using 41ms out of 50ms). Also there are many first turn timeouts when I use much less than 1s(using just 60ms, so Ican’t explain these).
And at least in XR (last multi I played a lot) I needed to resubmit a lot because sometimes I timeouted like 30% of first 10 matches.
Maybe it’s related to what Magus suggests, referee computation time or referee garbage collection time counted as user computation time, but definitely there’s an issue with timeouts.


#8

For first turn timeouts, do you have stats how often you timeout as player 1 vs player 2?


#9

Hi guys, we just made sure that the Referee cannot have any impact on the active player.
Please let us know if it changes anything on your side.

(In order to mix problems, would be great to check with a C++ player. Java players have issues of their own)


#10

These changes didn’t change anything. I reviewed my players of previous contests and the worst result is with Code Royale. When the multi mode opened, I was able to reach the legend league. Now all games time out.

I did some statistics of the memory used by some of my players (cumulative memory allocated in heap - observed when games reach the maximum number of rounds):

  • Coders of the Caribbean = 103 mB
  • Wondev Woman = 60 mB
  • Code Royale = 524 mB
  • Legends of Code & Magic = 5 mB
  • Code of Kutulu = 369 mB
  • Xmas Rush = 40 mB

Considering that JVM is started with -Xmx450M, I would suggest to try to disable garbage collections (if your servers allow us to allocate 450 mB). In Java 11, there is an experimental GC -XX:+UseEpsilonGC which does not run any GC.


#11

Hi,

same for me on Code Royal. RIP my Java Solution with random timout.
Few days ago it was ok (1st gold on CR) but since a recent plateform update it’s a big disaster (+120).

It looks like the garbage collector wich take more time than the previous JVM version.
I added lot of time markers for analysis and sometimes i noticed +70ms separate two basic lines of code (like simple arithmetics opérations).

Moreover the crash appears generally between frame 8-15. I don’t think my code is too expensive memory to fall as fast.

Perhaps you should try other GC configuration to minimize latency (ConcMarkSweep GC could be a solution).


I found this parameter for G1GC (default GC on Java 11): -XX:MaxGCPauseMillis (default is 200ms)
Try 10ms ?


#12

My Java uTTT bot is now consistently timeouting on turn 1. It’s doing some precalculations, very little memory usage. I don’t have the numbers at hand, but I think it was not using anything close to the 1000ms of the first turn before. This problem seems fairly recent, as my bot is still #33 in legend. Maybe something is seriously wrong with the Java 11 update, or with another recent change?


#13

For information, the current GC is the serial Garbage Collector.

You can see this by executing the following code:

import java.lang.management.GarbageCollectorMXBean;
...


        List<GarbageCollectorMXBean> gcBeans = java.lang.management.ManagementFactory.getGarbageCollectorMXBeans();

        for (GarbageCollectorMXBean gcBean : gcBeans) {
            String name = gcBean.getName();
            System.err.println(name);
        }

Java 11 is cgroup-aware (contrary to java 8) so I guess that as Java sees that the program operates on a single cpu it selects the serial GC (as we do not force any from the command line). This is a good choice I guess for maximum throughput on a single cpu but the program is completely blocked during GC. Maybe with Java 8, it was using the parallel GC which was competing with the main program but at least you could escape the while() loop.

We’ll try to force some other Garbage collectors. We’ll use some of your bots for testing. Sorry for the issue!


#14

I would also be interested in a solution here. My bot (Java/Code a la mode) keeps timing out on random turns as well and I can’t see a reason. Happens mostly on turn 2 or early in the game. It just suddenly stops with “timeout” message with the last timestamp having plenty of time left. I cannot definitely say that it is not my fault (all very experimental) but as the output just stops it is hard to tell.

Maybe as an intermediate solution you could give the bots 200 ms extra time if they don’t respond? They could still time out after 50 ms and won’t get another turn but this way we could check if it is a code-based or a system-based timeout and get some debug output?


#15

I was looking into that myself after i talked to Thibaud about this yesterday.
It does not seem to be the GC. I added ‘finalize’ methods to all my own classes and there was no garbage collection during those peeks. Even when i gave the GC a hint (System.gc()) and i saw him cleaning there was no improvement and still got the same timeouts.


#16

Hmm.

If you want to check whether it comes from GC or not, the best would be to build on the code above:

for (GarbageCollectorMXBean gcBean : gcBeans) {
            String name = gcBean.getName();
            System.err.println(name);
            ((NotificationEmitter)gcBean).addNotificationListener(new GCListener(), null, null);
  }
...
public static class GCListener implements NotificationListener {

        @Override
        public void handleNotification(Notification notification, Object handback) {
            if (!notification.getType().equals(GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION)) {
                return;
            }

            GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo.from((CompositeData) notification.getUserData());
            GcInfo gcInfo = info.getGcInfo();
            String message = String.format("GC #%d: %dms [Cause:%s]", gcInfo.getId(), gcInfo.getDuration(), info.getGcCause());
            System.err.println(message);
}

But if you timeout, you won’t get the message…

If does not come from GC, then we are in trouble. Could come from the JIT as well.

Anyway investigating…

Could you suggest a code that fails always with Java 11 and used to work with Java 8 (just let me know which bot/game to take the code from)?


#17

Similar timeout problems for Clojure in Code a la mode (previously Java 8, Clojure 1.9 and now Clojure 1.10):



#18

Currently my bots in ‘Code a la Mode’ multiplayer is a pretty save crash and it was working before the update.
Just take any of the last battles i end as rank 3 and you should get a good crash.

Thats also the bot that i located the use 10ms for a loop init problem in

e.g. in https://www.codingame.com/replay/381570637 with my current ide code i can see in frame 54 a waste of
9-10ms between line 480 and 483

all i do at those lines is

		System.err.println("blub " + (System.nanoTime() - begin));
		for (int i = 0; i < ingredientsNeeded.size(); ++i) {
			System.err.println(i + " " + ingredientsNeeded.get(i) + " " + ingredientsNeeded.size());
			System.err.println("blub1 " + (System.nanoTime() - begin));

Output is

blub 22025533
0 TART 2
blub1 31818889

Sometimes i’ve also seen that time is beeing wasted between line 272 and 277 and there is also only some comparison in between, shouldn’t take that long


#19

Thanks @Tiramon. I do not see the “blub” and “blub1” thing on frame 54 of your replay.

Anyway, when I replay your code I can see you time out with the last output being “blub xxx” and then when a new “Code a la Mode” round starts I see the continuation of the ouput like 100 frames further (so the difference between “blub” and “blub 1” timings can be explained because you are eliminated at round 1 and do not get more time until round 2, the processes which timeouts are frozen in between the 2 rounds).

Anyway that does not explain why you timeout in the first place, but it gave me a hint as you have a lot of traces all over your code: I believe all your problems will be solved if you comment out all your System.err.println statements.

Could you try it?

If it solves the problem, we’ll fix this Java 8->Java 11 regression easily.


#20

I don’t think it is just the System.err.println. I did a quick try to printout dots and linebreaks for 40ms and got no timeout while this simple code timed out twice (out of 4 tries):

public static void main(String args[]) {
    Scanner in = new Scanner(System.in);
    int numAllCustomers = in.nextInt();
    for (int i = 0; i < numAllCustomers; i++) {
        String customerItem = in.next(); // the food the customer is waiting for
        int customerAward = in.nextInt(); // the number of points awarded for delivering the food
    }
    in.nextLine();
    for (int i = 0; i < 7; i++) {
        String kitchenLine = in.nextLine();
    }

    // game loop
    int numTablesWithItems = 0;
    int numCustomers = 0;
    while (true) {
        in.nextInt(); in.nextInt(); in.nextInt(); in.next(); in.nextInt();in.nextInt(); in.next();
        numTablesWithItems = in.nextInt();
        for (int i = 0; i < numTablesWithItems; i++) { in.nextInt(); in.nextInt(); in.next(); }
        in.next(); in.nextInt(); 
        numCustomers = in.nextInt(); // the number of customers currently waiting for food
        for (int i = 0; i < numCustomers; i++) { in.next(); in.nextInt(); }

        try { Thread.sleep(40); } catch(Exception ex) {}
        // =====================================================================
        
        System.out.println("WAIT");
    }
}

(Basically the starter code + a thread.sleep of 40ms).

Best Regards
Martin