I have implemented the “Maximum Likelihood Estimation” method locally and applied it to two model cases.
**Case #1:** This case was given by Manwe in his posts.
[quote]If we have two players with skill levels j1 and j2, I chose (and this can most certainly be improved) that:
the probability that j1 wins is: j1/(j1+j2)
the probability that j2 wins is: j2/(j1+j2)
Taking numerical examples with j1=5 and j2=9, this gives:
j1 wins with probability 5/14, and j2 wins with probability 9/14
If we have j1=1 and j2=9, then j1 wins 1 time out of 10, etc.
With each AI playing 1000 matches, we then obtain the following curves for each player's score:[/quote]
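For reference, Manwe's model is straightforward to simulate. Here is a minimal sketch (the function and variable names are mine, not from his posts):

```python
import random

def play_match(s1, s2):
    """Return True if player 1 wins, under Manwe's model:
    P(player 1 wins) = s1 / (s1 + s2)."""
    return random.random() < s1 / (s1 + s2)

# Numerical example from the quote: j1=5, j2=9 => j1 should win ~5/14 of matches
wins = sum(play_match(5, 9) for _ in range(1000))
print(f"j1 won {wins}/1000 matches (expected about {1000 * 5 / 14:.0f})")
```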
My results with Maximum Likelihood Estimation are the following:


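Manwe's win model is exactly the Bradley-Terry model, so its maximum likelihood estimate can be computed with the classic Zermelo/Ford minorization-maximization iteration. The sketch below is my own illustration of such a fit, not the code behind the results above; the names and the normalization choice are mine:

```python
from collections import defaultdict

def bradley_terry_mle(matches, iters=200):
    """MLE of player strengths under P(i beats j) = s_i / (s_i + s_j),
    computed with the standard Zermelo/Ford MM iteration.
    matches: iterable of (winner, loser) pairs."""
    matches = list(matches)
    players = {p for match in matches for p in match}
    wins = defaultdict(int)        # total wins per player
    pair_count = defaultdict(int)  # matches played per unordered pair
    for winner, loser in matches:
        wins[winner] += 1
        pair_count[frozenset((winner, loser))] += 1

    s = {p: 1.0 for p in players}
    for _ in range(iters):
        new = {}
        for i in players:
            # sum over every opponent j: (games between i and j) / (s_i + s_j)
            denom = sum(n / (s[i] + s[j])
                        for pair, n in pair_count.items() if i in pair
                        for j in pair if j != i)
            new[i] = wins[i] / denom if denom > 0 else s[i]
        mean = sum(new.values()) / len(new)
        s = {p: v / mean for p, v in new.items()}  # strengths are only defined up to scale
    return s
```

On matches generated from the model in Case #1, the recovered strengths should approach the true 5:9 ratio as the number of matches grows; that consistency is the main appeal of the method.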
**Case #2:** This case is taken from the “Smash the code” TOP10 match history. It is a sample of 800 matches that I used to rank people relative to each other.
I don't know how ties appear in the database; probably as a loss for one of the players…
My results with Maximum Likelihood Estimation are the following:


(I have to thank Magus for preparing the database)
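Feeding such a match database into the Bradley-Terry fit sketched above could look like this; the file name and CSV layout are my assumptions, since the post doesn't describe how the database is stored:

```python
# Hypothetical loader: the actual database format isn't described in the post.
matches = []
with open("smash_the_code_top10.csv") as f:      # assumed file name
    for line in f:
        winner, loser = line.strip().split(",")  # assumed "winner,loser" rows
        matches.append((winner, loser))

ranking = bradley_terry_mle(matches)             # fit from the sketch above
for player in sorted(ranking, key=ranking.get, reverse=True):
    print(f"{player}: {ranking[player]:.3f}")
```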
My opinion at the moment is:
- Maximum Likelihood Estimation seems to be the theoretically best solution for fairly ranking AIs.
- Given Manwe's position, averaging the ranks once they stabilize doesn't seem to be a bad solution, and it might be easier to implement.
- In any case, the current opponent selection system should be changed so that the top 2 players don't play each other 50% of the time. I can provide simulations showing how much this skews the results (a sketch of one such simulation follows this list)…
- Also, the current system gives you more matches the higher you are ranked. Once again, this can strongly skew the results and should be avoided in a fair ranking system. If adding new games for 2500 AIs isn't doable, then it should be done for a subset (only 100? only 50?), but all the AIs in that subset should receive the same number of additional games.
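The post doesn't include the simulations mentioned above, but here is one way the skew could be demonstrated: rank 10 players of known strength by raw win rate, once with uniform pairings and once with the top 2 facing each other in half of all matches. All names and parameters here are mine, and win rate stands in as a deliberately naive ranking:

```python
import random

rng = random.Random(0)
players = list(range(10))
strengths = {p: p + 1 for p in players}  # true skill: player 9 strongest, player 0 weakest

def simulate(schedule, n_matches=5000):
    """Play matches under Manwe's win model and rank players by raw win rate."""
    wins = {p: 0 for p in players}
    games = {p: 0 for p in players}
    for _ in range(n_matches):
        a, b = schedule()
        winner = a if rng.random() < strengths[a] / (strengths[a] + strengths[b]) else b
        wins[winner] += 1
        games[a] += 1
        games[b] += 1
    return sorted(players, key=lambda p: wins[p] / max(games[p], 1), reverse=True)

def uniform():
    """Every pairing equally likely."""
    return rng.sample(players, 2)

def top_heavy():
    """Half of all matches pit the two strongest players against each other."""
    return (8, 9) if rng.random() < 0.5 else rng.sample(players, 2)

print("uniform pairing  :", simulate(uniform))    # should roughly recover 9, 8, 7, ...
print("top-heavy pairing:", simulate(top_heavy))  # 8 and 9 typically sink below weaker players
```

Under the top-heavy schedule, the two strongest players win only about half of their many mutual matches, which drags their overall win rates below those of weaker players who only face average opposition.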