Propositions to replace Bo5 (final contest rerun)

Magus · May 29, 2017, 5:12pm

After some discussions between @reCurse, @pb4, @Agade, me and some others, here are the propositions we all agree on, sorted by preference.

1. A final rerun with 1000 Bo1 games for everyone in the legend league (or at least the top 50). We know it won’t happen, of course, but it’s the best solution.
2. A final rerun with the following rules:

   If possible, a reset of the legend league: force a resubmit for everyone. It’s not a rerun, so we don’t use average ELO here, just a normal submit.
   Then we can do the actual rerun:
   - Players in the top 15 have 1000 Bo1 games to play.
   - Other players in the legend league (or at least the top 50) have 200 Bo1 games to play.
   - During the rerun, if a player reaches the top 15, add 800 games to play for this player.

3. Just strip the Bo5 away and dispatch more games for everyone during the rerun. Rerun only the top 50 with 640 Bo1 games for each player (the same number of games as now).
4. Just strip the Bo5 away and dispatch more games for everyone during the rerun. Rerun only the top 100 with 320 Bo1 games for each player (the same number of games as now).

TwoSteps · May 30, 2017, 7:18am

Thank you for that Magus and all.

Before I go and check the assessment of these proposals with R&D, I wanted to come back to the reasons for a change. Let me know if I got it right:

Bo5 (in the rerun only) can produce changes in the ranking of top 15 players that do not feel right.
A minimum of played matches should be ensured for top players during rerun.

Now a question: how sure are you that the proposals will indeed be better for the ranking and won’t have any unwanted side effects?

Magus · May 30, 2017, 8:05am

Yes for both

For propositions 1, 3 and 4, I’m pretty sure there are no side effects, but we can’t be sure it’s “better” without testing, and we can’t really test it.

For proposition 2, I’ll let @pb4 answer that.

inoryy · May 30, 2017, 8:13am

Why not? Explicitly define the scenarios you feel are broken and simulate them.

Magus · May 30, 2017, 8:23am

Well, if you have the time to implement a TrueSkill arena and test it, I won’t stop you.

inoryy · May 30, 2017, 8:24am

Take your pick: https://github.com/search?q=trueskill

TwoSteps · May 30, 2017, 4:01pm

Alright, we discussed it. Even if we believe that 200 Bo1 are ok, we’ll implement the 4th proposal. So rerun only top 100 with 320 Bo1. 3rd proposal could have been ok, but I’m afraid it could create issues with the T-Shirts for top 50.

pb4 · May 30, 2017, 8:45pm

Alright, I took the bait. I made a series of simulations, replicating the propositions made above.

Here was the method:

Have a pool of 10 players.

Define the win ratios of the players in the following manner :

double getWinRatio(int i, int j) {
    // Returns win ratio of player i against player j
    double skew = 25.0;  // 25.0 : player 0 wins 51% against player 1
    return (skew + j) / (2.0 * skew + i + j);
}

Start from a situation where all players have mu = 25.0, sigma = 25.0/3.0. (This means that all players start with a score of zero, since score = mu - 3 * sigma.)
Repeat N1 times : for each player, find a random opponent and play a game.
Repeat N2 times : for each player, find a random opponent and play a game. At the end of each game, push the player’s current score into a score history data structure
Average each player’s score history, and rank the players accordingly
Compare the first three ranks against the expected value : rank[0] == player0 && rank[1] == player1 && rank[2] == player2
Repeat the full procedure above 500 times for each value of N1 and N2 : measure the average number of times when the top 3 was correct
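
The procedure above can be sketched roughly as follows. This is only an illustration, not pb4's actual code (which was not posted): it assumes the standard two-player, no-draw TrueSkill update with the usual default parameters, and the names `Rating`, `kBeta`, `kTau`, `update`, `playRound`, `top3Correct` and `fractionCorrect` are all hypothetical. The arena's real matchmaking and update rules may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

constexpr int kPlayers = 10;
constexpr double kPi = 3.14159265358979323846;
constexpr double kMu0 = 25.0, kSigma0 = 25.0 / 3.0;
constexpr double kBeta = kSigma0 / 2.0;   // assumption: TrueSkill default
constexpr double kTau = kSigma0 / 100.0;  // assumption: TrueSkill default

struct Rating { double mu = kMu0, sigma = kSigma0; };

// Win ratio of player i against player j, as defined in the post.
double getWinRatio(int i, int j) {
    double skew = 25.0;
    return (skew + j) / (2.0 * skew + i + j);
}

// Conservative score: zero for a fresh player.
double score(const Rating& r) { return r.mu - 3.0 * r.sigma; }

double pdf(double x) { return std::exp(-0.5 * x * x) / std::sqrt(2.0 * kPi); }
double cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Assumed two-player TrueSkill update without draws: w beat l.
void update(Rating& w, Rating& l) {
    double vw = w.sigma * w.sigma + kTau * kTau;
    double vl = l.sigma * l.sigma + kTau * kTau;
    double c = std::sqrt(2.0 * kBeta * kBeta + vw + vl);
    double t = (w.mu - l.mu) / c;
    double v = pdf(t) / std::max(cdf(t), 1e-300);
    double wgt = v * (v + t);
    w.mu += vw / c * v;
    l.mu -= vl / c * v;
    w.sigma = std::sqrt(vw * (1.0 - vw / (c * c) * wgt));
    l.sigma = std::sqrt(vl * (1.0 - vl / (c * c) * wgt));
}

// Each player plays one game against a random opponent. If sums is non-null,
// the player's score after the game is accumulated (the "score history").
void playRound(std::vector<Rating>& r, std::mt19937& rng, std::vector<double>* sums) {
    std::uniform_int_distribution<int> pick(0, kPlayers - 2);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    for (int p = 0; p < kPlayers; ++p) {
        int q = pick(rng);
        if (q >= p) ++q;  // skip self
        if (coin(rng) < getWinRatio(p, q)) update(r[p], r[q]);
        else update(r[q], r[p]);
        if (sums) (*sums)[p] += score(r[p]);
    }
}

// One full run: n1 warm-up rounds, n2 recorded rounds, then check the top 3.
bool top3Correct(int n1, int n2, std::mt19937& rng) {
    std::vector<Rating> r(kPlayers);
    std::vector<double> sums(kPlayers, 0.0);
    for (int k = 0; k < n1; ++k) playRound(r, rng, nullptr);
    for (int k = 0; k < n2; ++k) playRound(r, rng, &sums);
    std::vector<int> rank(kPlayers);
    std::iota(rank.begin(), rank.end(), 0);
    // Every player has exactly n2 entries, so sorting by sum equals sorting by average.
    std::sort(rank.begin(), rank.end(), [&](int a, int b) { return sums[a] > sums[b]; });
    return rank[0] == 0 && rank[1] == 1 && rank[2] == 2;
}

// Fraction of runs in which the top 3 came out in the expected order.
double fractionCorrect(int n1, int n2, int reps, unsigned seed) {
    std::mt19937 rng(seed);
    int ok = 0;
    for (int k = 0; k < reps; ++k) ok += top3Correct(n1, n2, rng);
    return static_cast<double>(ok) / reps;
}
```

A call such as `fractionCorrect(120, 2000, 500, 42)` would correspond to one cell of the results table, though the exact percentages depend on the update rule and the random seed, so they should not be expected to match pb4's numbers.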

Here are the results (rows: N1, columns: N2):

                              N2
           | 50    | 120   | 400   | 1000  | 2000  |
           |_______|_______|_______|_______|_______|
     0     | 1.8%  | 2.6%  | 9.8%  | 26.4% | 46.0% |
           |       |       |       |       |       |
 N1  120   | 3.2%  | 5.8%  | 14.2% | 31.6% | 50.8% |
           |       |       |       |       |       |
     240   | 3.6%  | 6.4%  | 13.2% | 30.2% | 48.8% |
           |_______|_______|_______|_______|_______|

       Table: Fraction of the time when the top 3 is correct

My key take-aways :

Starting from a non-stabilized situation (N1 = 0) is significantly worse than a stabilized situation (N1 = 120 or 240).
Even in much better conditions than what we already have (N2 = 2000), the ranking system fails to rank the top 3 properly 50% of the time ! (under the assumption that there is a 1% win rate difference between players).
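
For reference, the 1% figure can be checked by plugging the top three player indices into getWinRatio with skew = 25. The arithmetic below uses only that formula; "P(i beats j)" is just shorthand for getWinRatio(i, j).

```latex
\begin{align*}
P(0 \text{ beats } 1) &= \frac{25 + 1}{50 + 0 + 1} = \frac{26}{51} \approx 0.510 \\
P(1 \text{ beats } 2) &= \frac{25 + 2}{50 + 1 + 2} = \frac{27}{53} \approx 0.509 \\
P(0 \text{ beats } 2) &= \frac{25 + 2}{50 + 0 + 2} = \frac{27}{52} \approx 0.519
\end{align*}
```

So adjacent players are separated by roughly one percentage point of head-to-head win rate, and players two ranks apart by roughly two.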

My opinion :

I won’t fight for pushing the number of games too high. I can only be thankful that Codingame lets us play these games for free.
What I WILL push for, is that given a “budget” of games, these should be used in the best way possible to discriminate between players.

pb4 · May 30, 2017, 8:58pm

Out of curiosity, I have pushed the calculation to N1 = 120 and N2 = 10 000.

The result is a very good 92% correct.

inoryy · May 31, 2017, 4:36am

Good work!
Can you try increasing the iteration count of the full procedure (ideally to something like 10000)?
It most likely won’t affect the conclusion, but it’s curious that N1 = 120 produced better results than N1 = 240 at N2 = 2000. Is this expected? If not, then one simple answer would be that at m = 500 your error margin is still too high (in fact, p > 0.05 for 0/120 equality test => can’t reject H_0: p_0 = p_120).
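
The kind of check mentioned here can be sketched as a pooled two-proportion z-test. This is only one plausible reading of the "equality test": the function name `twoProportionPValue` is hypothetical, and applying it to the N2 = 2000 column (46% vs 50.8%, 500 runs each) is an assumption about which cells were compared.

```cpp
#include <cmath>

// Two-sided p-value of a pooled two-proportion z-test.
// p1, p2: observed success fractions; n1, n2: number of runs behind each.
double twoProportionPValue(double p1, int n1, double p2, int n2) {
    double pooled = (p1 * n1 + p2 * n2) / (n1 + n2);
    double se = std::sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2));
    double z = std::fabs(p1 - p2) / se;
    return std::erfc(z / std::sqrt(2.0));  // equals 2 * (1 - Phi(z))
}
```

Under those assumptions, `twoProportionPValue(0.46, 500, 0.508, 500)` comes out at about 0.13, which is above 0.05, so with 500 runs per cell the difference between N1 = 0 and N1 = 120 cannot be distinguished from noise at that threshold. The same holds, even more clearly, for N1 = 120 vs N1 = 240 (50.8% vs 48.8%).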

Magus · May 31, 2017, 6:13am

As I have always said, if one day we have 2 nearly identical AIs in the top 3, the ranking will be random. But I think that case is close to impossible: it requires 2 AIs with nearly the same win ratio (<1% difference) against all the AIs in the top 10. I have never seen that case in any contest on CodinGame.

pb4 · May 31, 2017, 6:28am

N1 = 120, N2 = 120, 10000 repetitions : 5.38 %
N1 = 240, N2 = 120, 10000 repetitions : 5.58%

I see no reason why N1 = 120 would be better than N1 = 240.

pb4 · May 31, 2017, 6:33am

It happens all the time that the AIs at the top are very close:

STC, nearly all the top 4 ?
CSB : Jeff06 and Owl (top 1-2, < 1% difference)
COTC : Agade and me (top 2-3)
GITC : Vadasz and DarthPED (top 2-3)
FB : reCurse, me and Neumann
and so on…

Magus · May 31, 2017, 6:35am

These AIs never had the same win ratio against all of the top 10. You always have differences, like a triangular battle.