Smash The Code - Question about the final ranking


#42

Maxime,
The questions you ask are valid, but I’ll make a difference between the ranking system and how it is used. There are two important questions :

  • Based on a database of win/loss information how can we rank people ? (that’s the ranking system)
  • How can we fill-in a database so that special situations are fairly treated ? (that’s the ranking information given to the ranking system)

Here is how I woud approach the second question, assuming there is a good ranking system to which results are provided :

  • On 1v1 games with fair (symmetric) seeds :
    ** Play one game
    ** If there is a winner, feed the result to the ranking system. Both players’ game counter is incremented by one game.
    ** If there is a draw, the game should not be counted. The result is not fed to the ranking system, and the game counter is not incremented for both players.

  • On 1v1 games with unfair (non symmetric) seeds :
    ** Play two games between the players, with the same seed and reversed positions.
    ** If one player wins both, feed the ranking system with the one win the that player, one loss for the other. Both players’ game counter is incremented by one game.
    ** If no player wins both, the games should not be counted. The result is not fed to the ranking system and the game counter is not incremented for both players.

  • On 1v1v1 games : they are unfair (non-symmetric) seeds:
    ** Pick one seed and play the six combinations for player positions.
    ** Sum the ranks for each player.
    ** For each pair (playerA, playerB), if the sum of ranks indicates that one is better than the other, feed the result to the ranking system. The game counter is incremented by one game for both players.
    ** For one series of 6 1v1v1 games, this means that the ranking system will receive between 0 and 3 1v1 game results. Draws are not provided to the ranking system.


A few additional comments :

Note that one set of 1v1v1 games will provide the ranking system with up to 3 results. A 1v1 game will provide a maximum of 1 result. I believe the number of 1v1 and 1v1v1 games should be adjusted so that there is a good balance of “result importance” between 1v1 and 1v1v1 games. Personal opinion : for previous contests such as TGE and BTTC, I would say the weight of 1v1v1 games was too strong.

I understand that for Coders Strike Back there was a system where players would have 10 1v1 games against the same opponent. The average result was fed to the ranking system. While I understand the merit of such a system (prevent unlucky losses against a much weaker opponent), I see more reasons why the system should not be used. Consider the (real) case scenario in CSB where pb4608 has 30% chances to win against Jeff06, and Magus has only 10% chances to win against Jeff06. In order to get a win according to the ranking system, pb4608 has to win 6 games out of 10 against Jeff06. This is very rare when the individual chance to win each game is only 30%. It is so rare that the ranking system can’t make any difference between pb4608 and Magus based on the games against Jeff06. In order to accurately differentiate players, it is better to feed the ranking system with as many small results as possible than to provide it with a smaller number of stronger results.

On the complexity of rank centrality as described by _anst : the complexity will explode if you want to do an exhaustive search for more than 10 players. The problem is somehow similar to the traveling salesman and could advantageously be approached via genetic algorithms or simulated annealing.

More on the complexity of rank centrality : this complexity comes from the discrete nature of ranking. This is the reason why I proposed an approach based on evaluating a player’s strength instead of its rank. It seems easier to find an optimum strength within a continuous search space : methods such as gradient ascent are available to converge towards a proper player strength.

Under a system where no matches are forgotten, Magus commented that early losses against a much worse player would be detrimental. Indeed, we would be tempted to submit our AI repeatedly until we get a “lucky start” with no losses. This was not an issue with the current use of TrueSkill, but may become one in a system where old games are not forgotten. My proposal is to keep the current version of TrueSkill during the contest. At the end, N games are relaunched for the top100 players.

The previous point goes against my comment a few days ago stating that opponents should be uniformely sampled (top5 players have as many chances to play against another top5 player as against a top1000 player). I repeal this comment as it seems to bring more difficulties than anything else…


#43

I agree on many points, especially like the remarks about (non-)averaging a batch of 1v1 battles and giving more weight to 1v1 battles vs permutations of 3 or 4-player games. Agree also with the proposal of keeping TrueSkill (at least for now), relaunching at the end for top 100 (or 150 or something). Sorry for repeating, but I’d love to see averaging of TrueSkill points as per @Manwe’s experiment (not to be confused with averaging of batches of 1v1 battles, which is bad). At least it’s an easy solution that’ll converge to something.

Just two remarks.

  • A clarification concerning Rank Centrality: what @pb4608 said about complexity, applies to Kemeny-Young brute-force search (impracticable indeed beyond 10 players).
    Rank Centrality however doesn’t explode in complexity. Actually, the inverse is true: it behaves well with as many as O(n*log(n)) random samples. In terms of relative error, it was shown to perform as good as the ML estimator in pairwise comparison models, see here p. 22-23. In my MATLAB tests, it takes about 30-40sec for 3000 players (once you have the data, of course).

  • TrueSkill (+ ELO and some ranking systems in sports etc.) use a gaussian model of “skill”. It’s relevant for human performance varying from one day to another and evolving with time. But I believe this is a wrong choice for judging fixed AI bots.
    First, the outcome of a contest is just a definitive set of pairwise comparisons (no more will be added and nothing will change). The skill is once for all defined by a family of of win/loose ratios (no point to estimate gaussian ‘uncertainty’ parameter).
    Given such data, more relevant is the model used by @Manwe, well-known in stats as BTL. Assuming this model, we want to find the player strengths (w_1,...w_n) that best explain the observed data, given that the win/loose ratio is w_i/(w_i + w_j).
    You do this with the Maximal Likelihood Estimator; in this model it’s given by the Logistic Regression (see e.g. here page 14 for the formula; I think this is different from what @pb4608 computed in his earlier post, since he indicated a gaussian model).
    I tend to believe that this is the best solution to our fair ranking problem. BTW, the Rank Centrality agrees pretty well with MLE results and is said to use less computational power; might be a practical solution.


#44

If I wrote Gaussian, this is a mistake. I used pretty much exactly the Bradley Terry model without knowing the name at that time.

As for Manwe, I understand he did not use Bradley Terry to evaluate the rank. What he did was generate a set of “model data” based on a rule similar to the Bradley Terry model.


#45

OK. Then, basically, we 100% agree.

That’s right, he didn’t do the estimation (I said he used thi model ;). Well, what he showed is somewhat optimistic: given this model, the averaged TrueSkill rank converges to the correct thing).


#46

@anst @pb4608 @Magus : can you ping me on the chat when you are all available to talk about what we can easily do? (Preferably next week)


#47

Si c’est dans la semaine la journée je suis au boulot mais entre 12h et 13h30 c’est ma pause manger et comme je mange devant mon PC je suis dispo. Sinon après c’est le week end ou le soir après 19h.


#48

Have you had your discussion ? Should we expect changes in the ranking system for CodeBusters ?


#49

We are currently running some experiments to have a leaderboard more stable. Nothing official yet but the idea would be to use a mean of the TrueSkill scores during the last phase (when we add new games, after the end of the challenge).

And yes, we want it for CodeBusters, that’s why we’ve picked that solution. We’ll use this system if everything is ready in time.


#50

Thank you for the information… I really appreciate this forum for helping me out