Measuring CPU time instead of Wall clock time to reduce timeouts

Would it make sense, and be possible, for CG to measure our AI’s computation time in CPU time instead of wall clock time? The Fall2020 contest had quite a lot of timeouts, even in AIs limiting themselves to 30ms of wall clock time.
Sometimes, these timeouts are the programmer’s own fault. For example, the AI may spend a long time in destructors when exiting a function. But it seems that CG shares some of the blame for the timeouts. I, for example, saw turns where my AI performed 200 simulations in 50ms, and I am pretty sure I was avoiding all mallocs and map rehashes.
I have to suspect that something is sometimes not giving our AIs CPU time server-side, and timing us by CPU time would solve this issue.

If CG were to make this change, it would not break existing bots, because existing bots time themselves by wall clock time, and a single-threaded bot’s CPU time never exceeds its wall clock time.
If CG were to make this change, it would probably not significantly affect their server costs, because for most turns wall clock time should be very close to CPU time.

Open questions that others should feel free to chime in on:

  1. Do you think this would indeed reduce timeouts in the arena?
  2. Is it possible to implement this on CG side?

On question 2, I vaguely remember CG doing this back in Tron days because people were threading to get more CPU time in the same wall clock time.

As we already discussed, I’m not sure this is the only issue.

Some players tested with a simple C++ program that does the same thing every turn (and outputs WAIT each turn). It revealed that sometimes just doing this takes 20ms (according to wall clock time).

So, from my point of view, switching everyone to CPU time before fixing this 20ms “hiccup” will only make games take longer to compute. And CG already has a hard time keeping submits fast during a contest…

But I’m not sure you can get a 20ms difference between wall clock and CPU time within such a short window (50ms). So I think there are other issues.

I’m mostly in favor. I do see the caveat that CPU time is less available across the language spectrum, though.

It’s possible it could help reduce the timeouts in the arena. But it’s far from certain, since even CG themselves do not know the root cause of the timeouts (or hiccups, as I prefer to call that specific issue: random spikes of 10-20ms of lag for no reason, even in C++). If they happen while the bot’s process is scheduled, they could very well still count towards the bot’s CPU time and change nothing. But we could validate that by running @blasterpoard’s test again while measuring CPU time.
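
To make that validation concrete, here is a minimal sketch of such a probe. It is not @blasterpoard’s actual test (that was C++ and is not reproduced in this thread); `busy_work`, `probe` and `worst_gap` are hypothetical names, and the workload size is an arbitrary assumption. It runs the same fixed workload every turn and records both clocks, so a spike that shows up in wall time but not in CPU time would point at the process not being scheduled.

```python
import time

def busy_work(n=20_000):
    """Fixed, allocation-light workload standing in for one turn of a bot."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def probe(turns=50):
    """Run the same workload each turn and record (wall, cpu) seconds per turn."""
    samples = []
    for _ in range(turns):
        wall_start = time.perf_counter()   # monotonic wall clock
        cpu_start = time.process_time()    # CPU time of this process (user + system)
        busy_work()
        cpu = time.process_time() - cpu_start
        wall = time.perf_counter() - wall_start
        samples.append((wall, cpu))
    return samples

def worst_gap(samples):
    """Largest per-turn (wall - cpu) difference, in seconds."""
    return max(wall - cpu for wall, cpu in samples)

if __name__ == "__main__":
    samples = probe()
    print(f"max wall: {max(w for w, _ in samples) * 1000:.2f} ms")
    print(f"max cpu:  {max(c for _, c in samples) * 1000:.2f} ms")
    print(f"worst wall-cpu gap: {worst_gap(samples) * 1000:.2f} ms")
```

One caveat: the resolution of `time.process_time()` depends on the platform and can be coarse, so small gaps (including slightly negative ones) are noise. Only gaps in the 10-20ms range would be meaningful here.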

But overall I would be in favor of this change, as it only brings upsides (more stability!) with the only downside of being more complicated to explain to most people. And as JBM rightly pointed out, maybe more difficult to get a hold of in some languages.

I’m not sure I understand exactly what people are complaining about here:

  • unreliability in measuring the elapsed time in a turn (and thus in avoiding a timeout)?
  • a significant variance in the number of “instruction cycles” available per turn?
  • both?

Besides, for the record, the topic has already been discussed here some time ago.

This is addressing a new issue on CG where a 10-20ms delay can be added at random, pretty much anywhere in your code, even in native C++. If this happens towards the end of a turn, you are guaranteed to time out, with no alternative other than using much less time than allowed.

See blasterpoard’s example here which could experience such massive hiccups doing nothing at all.

You would surely have experienced it yourself if you had participated in the last contest. It was very noticeable.

But I did! You just need to scroll down the ranking for some time… I did experience timeouts too, but there were clearly other things to consider in my code before blaming a narcoleptic clock.

Then I am not sure what’s not clear to you.

Thank you for the suggestion. The devs told me that we already thought about this but it seems it’s not that simple:

  • It’s not possible to get CPU time in many languages (on the other hand, it’s possible to get wall clock for most languages)
  • you could sleep for X seconds and not consume any CPU time, so we would actually need to measure both, which is more complex
  • in real-time systems, what matters is often the wall clock
  • it’s not certain it would solve the observed timeouts (spikes/hiccups)

We have an upcoming discussion this week about a list of issues to address before the next big challenge, and the timeouts are one of them.
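
The sleep point in the list above is easy to check. The following is a small sketch in Python, not anything CG runs; `measure` and `sleeper` are hypothetical helpers. It shows a call that blocks for about 50ms of wall clock time while consuming almost no CPU time, which is why a referee that only counted CPU time would also need a wall clock limit.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent in fn()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    return (time.perf_counter() - wall_start,
            time.process_time() - cpu_start)

def sleeper():
    time.sleep(0.05)  # blocks for ~50 ms but burns almost no CPU

wall, cpu = measure(sleeper)
print(f"wall: {wall * 1000:.1f} ms, cpu: {cpu * 1000:.1f} ms")
```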

Another option I’ve suggested in the past was the “time budget” concept. The general idea being to let the AI be more flexible in how it allocates its time. In broad strokes:

  • the AI is provided with a “remaining time” input
  • it starts at 1s
  • at each turn, it’s increased by 0.050s
  • the AI’s compute time then eats into it
  • the AI is responsible for ending the game with >0
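
The bullets above can be sketched as a small bookkeeping class. This is only an illustration of the idea, not an existing CG mechanism: `TimeBudget`, `start_turn` and `end_turn` are hypothetical names, all values are assumed to be in seconds, and the per-turn timings in the usage part are made up.

```python
class TimeBudget:
    """Bank of compute time carried across turns (all values in seconds)."""

    def __init__(self, initial=1.0, per_turn=0.050):
        self.remaining = initial
        self.per_turn = per_turn

    def start_turn(self):
        """Credit the per-turn allowance; return the value sent to the AI."""
        self.remaining += self.per_turn
        return self.remaining

    def end_turn(self, elapsed):
        """Debit the time the AI used this turn; return True while above zero."""
        self.remaining -= elapsed
        return self.remaining > 0


# Hypothetical game of three turns, the second one hit by a 20 ms spike.
budget = TimeBudget()
for elapsed in [0.030, 0.070, 0.045]:
    budget.start_turn()
    budget.end_turn(elapsed)
print(round(budget.remaining, 3))
```

In this example the 70ms turn would be a timeout under a strict 50ms per-turn limit, but the budget absorbs it and the AI still ends above zero.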

Pros:

  • can erase load spikes
  • a (lax) per-turn timeout can still be applied
  • match time is still bounded
  • both per-turn timeout and endgame >0 check can be made external to the game and more of a “platform health-related” variable. And, for example, increased by 20ms when there are unexplained spikes happening randomly.

Cons:

  • an AI can pile up time to use in the end. Well, I’m not sure that’s really a con. Match time is still bounded. It’s just a possibly different feel.
  • needs per-game protocol adjustment

Many variations are possible, taking inspiration from accounting.

  • overdraft allowed, must be repaid within 5 turns
  • overdraft carries interest (less added time per turn while in debt)
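
One way the two variations above could combine, as a rough sketch only. Everything here is an assumption made for illustration: the class name `OverdraftBudget`, the reduced credit of 40ms per turn while in debt, and the reading of "repaid within 5 turns" as "at most 5 consecutive turns may end with a negative balance".

```python
class OverdraftBudget:
    """Time budget that tolerates short debts (all values in seconds)."""

    def __init__(self, initial=1.0, per_turn=0.050,
                 debt_per_turn=0.040, max_debt_turns=5):
        self.remaining = initial
        self.per_turn = per_turn
        self.debt_per_turn = debt_per_turn    # reduced credit while in debt ("interest")
        self.max_debt_turns = max_debt_turns
        self.debt_turns = 0                   # consecutive turns ended in debt

    def play_turn(self, elapsed):
        """Credit, then debit one turn; return False once the debt has lasted too long."""
        credit = self.debt_per_turn if self.remaining < 0 else self.per_turn
        self.remaining += credit - elapsed
        self.debt_turns = self.debt_turns + 1 if self.remaining < 0 else 0
        return self.debt_turns <= self.max_debt_turns
```

With these assumed numbers, an AI that overdraws by 50ms and then uses 20ms per turn climbs back above zero after three turns, while one that keeps using 40ms per turn never repays and is flagged on the sixth turn in debt.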

That’s a post-hoc rationalization, not really an explanation… In real-time systems, application developers often have much more global control over the system.

It wouldn’t. And it’s not the goal. The goal is to make rankings resilient in face of the spikes.
