Let me start by saying that I believe the result of “Fantastic bits” is fair, with the top 4 being Magus > reCurse > (pb4 and Neumann).
This post is not made to discuss the exact result of the contest : it is intended to discuss the procedure.
- Approx. 100 games are played by each AI when they are submitted.
- After those 100 games, the AI’s score has generally come to a stable point, and oscillates around that point.
- A re-run procedure was used where each AI plays a large number of BO5 games.
- During that re-run, every top AI’s score has continually increased.
A bit of history :
- It is the nature of TrueSkill that the score fluctuates with each win and each loss.
- A re-run procedure was designed after STC, where the score would gradually stabilize by computing its average value over a number of games.
- The assumption when proposing that re-run procedure was that the scores would have already stabilized close to their equilibrium point, and the average computation would simply compensate for fluctuations around this equilibrium.
- A document was provided by _anst simulating the positive effect of the re-run procedure.
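To make that assumption concrete, here is a minimal sketch of what averaging does to a score that is already stable versus one that is still climbing. Both traces are made-up numbers chosen for illustration, not leaderboard data, and `running_average` is a hypothetical helper, not the actual re-run code.

```python
def running_average(scores):
    """Return the cumulative mean of the scores after each game."""
    averages, total = [], 0.0
    for count, score in enumerate(scores, start=1):
        total += score
        averages.append(total / count)
    return averages


# Hypothetical traces (assumptions for illustration only).
# "stable" oscillates by +/-1 around an equilibrium of 30.
stable = [30 + (1 if i % 2 == 0 else -1) for i in range(100)]
# "drifting" climbs steadily from 25.0 to 34.9 and never settles.
drifting = [25 + 0.1 * i for i in range(100)]

stable_avg = running_average(stable)[-1]
drifting_avg = running_average(drifting)[-1]

print(f"stable:   last score {stable[-1]:.2f}, average {stable_avg:.2f}")
print(f"drifting: last score {drifting[-1]:.2f}, average {drifting_avg:.2f}")
```

In the stable case the average lands on the equilibrium value. In the drifting case the average sits well below the last score, so it mostly reflects how long the score has been moving rather than where it is heading.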
Of win/loss ratios :
- This is not exactly the case, but let’s simplify by saying that a 55% win rate of one player against another corresponds to roughly a 1 point score difference.
- If there is a 55% w/l ratio between two players in BO1, then there is a 60% w/l ratio between those two players in BO5. (the figures are not exact, I haven’t taken the time to make the exact calculation)
- When switching to a BO5 system, the w/l ratio difference between players is amplified, and the score difference reflects this.
- We have indeed observed that the top AI’s scores have continually increased during the re-run.
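For those who want the exact figure, the BO1 to BOx conversion can be computed directly. This is a small sketch that assumes every game in a series is independent and has the same win probability, which is a simplification. The function name `best_of_n_win_probability` is mine.

```python
from math import comb


def best_of_n_win_probability(p: float, n: int = 5) -> float:
    """Probability of winning a best-of-n series (n odd).

    Assumes each game is independent with the same win probability p.
    """
    needed = n // 2 + 1
    # Winning the series is equivalent to winning at least `needed`
    # games if all n games were played out.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


for p in (0.50, 0.55, 0.60, 0.70):
    bo3 = best_of_n_win_probability(p, 3)
    bo5 = best_of_n_win_probability(p, 5)
    print(f"BO1 {p:.2f} -> BO3 {bo3:.3f} -> BO5 {bo5:.3f}")
```

With a 55% BO1 win rate this gives about 57.5% in BO3 and about 59.3% in BO5, so the 60% quoted above is a reasonable approximation. The amplification grows as the BO1 win rate moves away from 50%.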
My question today is : does it make sense to switch to BO5 for the re-run ?
To me, the only reason to use a BOx format is when there is no averaging of the results, such as direct elimination tournaments. If Codingame can launch 200*5 games for a re-run, I do believe it would be better to use 1000*BO1 results instead of 200*BO5 results.
- 150 matches are played in Legend on each submit.
I would prefer 333*BO3.
A lot of games are quite random, and you can easily lose a game because of one early move even if your overall gameplay is better.
Point taken, it’s 150 not 100.
Regarding the proposal to use BO3, I disagree with the “random” argument. TrueSkill + averaging is already fully capable of dealing with the randomness of results.
Switching from BO1 to BO3 or BO5 adds a dynamic element which I am not sure anybody fully understands.
Rating-based systems and BOx play are redundant, and combining them is possibly damaging.
Calculating ratings for BO1 is different than calculating ratings for BO3 or BO5. Those yield different final ratings and possibly rankings.
I didn’t know it was being used in re-runs at all. Have you done any measurement which shows it helps in any way to switch to BO5?
I think the best argument is: Is there any advantage to BO5?
Personally I don’t see it.
Prepare to see ugly graphs, because I don’t know how to use OpenOffice
Here is some historical data taken during the CodeBusters contest. Each line corresponds to one player in the top 20. CodeBusters is the first contest when the BO5 procedure was used.
The abscissa is the nth time at which the screenshot was taken. For example, abscissa #20 means it is the 20th time I took a screenshot of the leaderboard this evening. I took screenshots regularly from 7:40PM to 11:05PM.
The ordinate is the score given for that person on the leaderboard.
One can compare two things :
- The final score obtained by the AIs at the end of the graph.
- The average score obtained by these AIs during the transient conditions.
(that’s the graph below)
Simple observations : players from ranks 3 and 4 would be exchanged. Khao (rank 8) would move to rank 5. Medici (rank 10) would move to rank 15.
(Full data available there for those interested. Beware, I didn’t have time to tidy up the presentation.)
Time to conclude, or else I’ll have another “tl;dr.” by SaiksyApo.
The averages I’m computing don’t really make any sense, I know. But hey, they’re averages, right ? They’re supposed to smooth things out ?
As we can see : not really. Why ? I have no idea. I suspect transient conditions due to the BO5 procedure, but who knows ?