Smash The Code - Question about the final ranking

Psyho · May 8, 2016, 10:06pm

I apologize in case this is/was already explained somewhere, but I wasn’t able to find it on the forums (I’m still feeling a bit lost here).

I looked a bit at the scoreboard while last tests were finishing and it looked like you just added more tests without reducing variance in the formula (or reducing the weight of each game). If I understand correctly, for the people at the top (including me), the current (final?) ranking really depends only on the last few games since everyone at the top of the ranking has a rather similar win ratio. Locally, I needed around 2-3K games in order to decisively tell that an AI that has 52-53% win ratio is indeed better. While I know you don’t have that many resources available to do it very precisely, currently it looks like throwing a dice in order to decide top3 (or maybe top4 since it looks like eldidou was unfairly affected by a bug/feature). And well, spending huge part of the night on improving AI, so that all of my efforts goes to waste feels a little bit disheartening.

And I still hope, that someday you’ll get a proper ranking system that takes into account that AIs have fixed skill levels, not fluctuating ones. Thus, each game should have equal weight (except for confidence stuff).

Medici · May 8, 2016, 10:15pm

What happens to eldidou ?
His AI was very strong until the last several hours.

magaiti · May 8, 2016, 10:19pm

forum/t/timeout-in-smash-the-code/1513

Medici · May 8, 2016, 10:29pm

pity. By the way, there is a mature ranking system used by chess or other 1 vs 1 game.

Psyho · May 8, 2016, 10:42pm

They’re using TrueSkill (at least this is what they say), which should be slightly better than Elo. Still this ranking is not appropriate for ranking AIs, since it’s rather easy to develop something that will be an order of magnitude more stable than anything elo-based.

Btw, I looked through my match history, and it seems that I indeed lost because the tests finished on my losing streak.

Psyho · May 8, 2016, 11:48pm

Ok, so I took the pain and parsed the games history for the top3. In order to make it fair for everyone, I only count last 530 games and only include opponents where at least 4 games were played. Also take note that since pb4608 didn’t resubmit near the end of the contest, his results are going to look a little bit different (although they should be still quite representative).

W, T, L: Win, Tie, Loss
Total includes omitted opponents as well.

Psyho:
ATS W:15 T:1 L:4
Florenze W:86 T:4 L:87
FredericBautista W:21 T:0 L:6
Jeff06 W:10 T:0 L:2
Rafbill W:24 T:0 L:12
Recar W:7 T:0 L:5
Romka W:17 T:2 L:8
eldidou W:31 T:0 L:21
pb4608 W:78 T:4 L:67
siman W:8 T:0 L:0
Total:  W:307 T:11 L:213

Florenze:
ATS W:15 T:1 L:9
FredericBautista W:22 T:2 L:5
Ixanezis W:6 T:0 L:2
Jeff06 W:8 T:1 L:2
Psyho W:88 T:4 L:87
Rafbill W:22 T:0 L:7
Recar W:12 T:1 L:7
Romka W:19 T:1 L:11
eldidou W:17 T:6 L:19
imazato W:8 T:0 L:2
pb4608 W:39 T:6 L:44
siman W:14 T:1 L:7
Total:  W:299 T:26 L:206

pb4608:
ATS W:22 T:0 L:10
Agade W:3 T:0 L:1
Florenze W:45 T:7 L:40
FredericBautista W:25 T:0 L:7
Jeff06 W:9 T:0 L:1
Psyho W:71 T:4 L:92
Rafbill W:33 T:0 L:26
Recar W:12 T:1 L:5
Romka W:14 T:1 L:13
eldidou W:20 T:0 L:20
imazato W:6 T:1 L:1
siman W:15 T:0 L:4
Total:  W:296 T:14 L:221

Based on our top3, it looks like I have a small edge, but not very significant. From those results, it also looks that eldidou should be very close to us considering some of those losses were probably time outs.

So to sum things up. The results at the top3/4 are really broken. It would be great if they weren’t.

pb4 · May 9, 2016, 12:01am

Based on similar statistics I parsed around the end of the contest, I also did not expect to end up in the first place.

My bet was on the following top 4 : Psyho -> Eldidou -> pb4608 -> Florenze.

I read about this numerous times on the forum (I remember reading posts dating as far as 2014 about this) : once submitted, our AIs have a fixed strength. It would be better if this fact is reflected in the ranking system by giving old games a similar weight to new games in the match history.

Medici · May 9, 2016, 1:07am

Anyway, congratulations to @pb4608

_CG_Maxime · May 9, 2016, 6:56am

We are indeed using TrueSkill to both select the opponents and compute the scores. It is similar to Elo and has been used in other programming challenges such as the Google AI Challenge. It shares some similarities with Elo - winning against a strong opponent gives more points.

It allows us to rank a lot of people with a reasonable amount of games. The main drawback of this system is that is gives more importance to the last games because it is designed for human players that can improve their skills, and not for bots.

If you can find or develop an alternative solution that is proven to be better, we’d be willing to discuss it. If you have any suggestion regarding Smash The Code that could satisfy all of you, we’re all ears too.

FredericBautista · May 9, 2016, 8:15am

That is quite what I envisaged as well. However I did not take eventual last minute AI changes into account.
I wasn’t looking at the ranking when it finished, I have been told there was a lot of changes in the last games and this seem odd after hundred of matches that things can move so much in a few games.

My guess is that Trueskill and other Elo ratings are dynamic ranking systems intended to adapt to changing skill values. While in this case the skill of the AIs are fixed when there cannot be any more changes and the dynamic adaptativity of the algorithm should be progressively decreased in order to minimize the variations.

maybe the concept of playing more games at the end for the top 100 should be applied several times with a reducing range, maybe top 100 then top 50, 25, and so on. This with a reducing impact of each game in order to tend toward a stabilisation of the ranking by progressively reducing the adaptability coefficients to better fit this particular case of static skills ranking.

Edit: after having a quick look at http://trueskill.org/ and the related http://www.moserware.com/assets/computing-your-skill/The%20Math%20Behind%20TrueSkill.pdf
I suspect this adaptability reduction could be simulated through using the “partial play” function of the Trueskill framework by progressively reducing the proportional participation coefficient.

For instance this could be implemented into the current system by starting the last 100 games with a participation coefficient of 100 and decreasing it by 1 at every step.

Idea being that you start the last 100 games with an already relatively sorted ranking based on a variable sample size for every last version of the AI of each player. The confidence growing with the size of the sample you should be able to slowly decrease the adaptability of the algorithm.

Now if the sample size is just not large enough then there is indeed is no solution for accurate ranking.

Edit2: After further investigation it seems that it’s about tweaking the “Tau (τ): The Additive Dynamics Factor” of the algorithm, the paper indicate that the higher τ is, the more the volatility and that it could be tweaked based on contextual needs. I don’t know if CG use the default value and if and how does the framework allows them to parameter it. But if it’s not to difficult to change, maybe using a lower τ (or eventually a dynamically decreasing one) for the last 100 games could prove to be a good solution.

Magus · May 9, 2016, 8:37am

The problem is, if you give the same weight to an old game, if you submit and you have a lose against let say a 200th AI, your AI will never reach the top 10 before 500 games. And we have only 120 games for a submit.

If the ELO system (and so does TrueSkill) gives more impact for the first games, this is not because your skill can change, but because the system want to rank you the fastest way.

But for this contest, if you take an AI submitted at 20h, it will have 120 games for the submit. And then 100 more games for the final ranking. 220 games is clearly too low to have a correct result when AI are so close in skills. I would totally agree if CodinGame add many many more games (like +1000 games by players). And if we must wait en entire day before the final results, i’m ok with this. Make us wait an entire week if you have to. We’ll wait.

FredericBautista · May 9, 2016, 8:58am

If it is to my post you replied, I did not mean to change the way the rating is performed for the most part. Just to do more games for the higher ranks with eventually progressively reducing the variance dynamically by slowly reducing the adaptability of the algorithm.

Idea being that for instance the shape of the top 4 was known long before the end but the order in this top 4 couldn’t be precisely revealed because of high changing rates in the latest played games.

Edit: I guess it was to psycho you replied in fact but the “reply to” thingy doesn’t show for some reason.

anst · May 9, 2016, 9:40am

Considering the ranking system, what about PageRank?
After all, who is a strong player? The one who wins against other strong players. Rings any bells?

If I understand correctly, the tables by Psyho here (and pb4608 or Jeff last time) are exactly kind of data PageRank can use: instead of weblinks one can use transition probabilities from these tables.
In plain terms, if you walk the graph with the given probabilities, the nodes you visit more often are dubbed more important. This is far from complete, some other ideas could be necessary (e.g. for pairing, using more matches for some players etc…). Just thinking.

Maybe I’m a dreamer, but what about a collaborative project to develop a ranking system for CG?
(not necessarily pageRank based, maybe just a modified ELO to reflect a particular character of AI contest).
Anyone?

BTW, great contest, well weighted leagues system, congrats to CG! And to the winners!!

Psyho · May 9, 2016, 12:58pm

If you can find or develop an alternative solution that is proven to be better, we’d be willing to discuss it.

The quickest simple fix I can think of and that doesn’t have any drawbacks was already mentioned by @FredericBautista Simply reduce the weight of games over time (especially after contest have ended) and run more games for people at the top. If you want to do it automatically, you can run enough games that the difference between two players is above 2-3 standard deviations (2 is probably enough). I’m blindly guessing that TrueSkill computes volatility (i.e. confidence interval) for each player. If not, it’s still possible to compute this by hand.

Another option is to disregard elo-based ranking entirely and perform series of round-robin knockout tournaments for the top (I’d guess top 20-32 is a nice range). The drawback is that you probably would need to redesign UI in order to incorporate this so I’m guessing that’s least favorable option.

And lastly, “just” invent a new system. I admit that after giving it more thought it doesn’t look as simple as it seems (and @Magus post helped me with seeing that), because while TrueSkill has many flaws, some of them actually behave as cheap work-arounds for potential problems. For example, because only people with similar skills are matched against each it actually acts like a dynamic leagues system, where results from previous league are rather quickly discarded.

The nice thing about developing your own ranking system is that CG have probably the best dataset in order to do that. Simply run TrueSkill with reduced weight and then try to recreate those results (for example by choosing RMSLE as an evaluation function) with significantly reduced amount of data (games). That way you can safely test that the new algorithm is indeed better than TrueSkill, while it still have most (all?) of the desirable features of it.

Psyho · May 9, 2016, 1:39pm

If you have any suggestion regarding Smash The Code that could satisfy all of you, we’re all ears too.

Obviously I’m not very happy about what happened. I made the decision that I want to try my best at reaching #1 and worked overnight from Saturday to Sunday on it, since I didn’t have enough free time to do it earlier. I knew that the current rankings were very random, but since during the last contest, tests were running for the several hours after the contest and you also managed to drastically reduce the variance of the game (games were played in a series of ten, and AFAIK each series was treated as a one game, so 5:5 didn’t affect the rating change). I assumed that you’ll do your best in order to create fair final standings, which unfortunately didn’t happen (and for some strange reason you also ran very few games after the contest ended).

From what I posted it’s pretty clear that we have a very tight top3 (or top4, because I really don’t know what happened to @eldidou, it’s only clear to me that his solution suddenly started timing out, even though it wasn’t doing it before). It’s also clear for me that the results are based on pretty much only the last 50-100 games (and in reality I’m guessing that the final results are decided by 2-3 games that went one way or another), which considering how close we are, seems like it’s completely random. The fact that there’s only 2 points difference between top3 and 5th+ despite us having 65%+ winrate (which in this game is absolutely enormous) against everyone else is even more bizarre.

So, well, I’d wish the final results were fair and that I didn’t had to post the stats in order to show how ridiculous they are. Personally, that’s probably the most broken thing I encountered in a programming contest organized by a professional company in the past several years (and I’ve been doing a lot of them, since for quite a lot of time, I made a living out of them). I don’t want to really propose anything since I’ll put other people from top3 in an uncomfortable position - so if anything, the proposal should come out of CG. Also, since I’m rather new to CG I don’t know what are your standard procedures, historical decisions, etc. Anyway, I guess it’s pretty obvious that the only thing I find fair is to rerun top3. Maybe not the most pleasant, but fair.

And for the record, I’d still make this post even if “RNGesus” was favorable for me. Just probably with a weaker emotional aspect

Jeff06 · May 9, 2016, 1:58pm

I agree, I was surprised with how quickly this contest ended. For CSB, the final matches between Owl and I lasted for a few hours, and we did a couple thousand matches total. At the end, the stats of all the matches were clear, and there was no arguing about the final result.

However, I know this is an extremely complicated thing to do, especially when the podium is so close. Maybe next contest should be about designing the best ranking system? Or more seriously, maybe you can open-source your ranking algorithm, like that everyone can understand exactly how it works, and offer suggestions/pull requests as to how to improve it?

anst · May 9, 2016, 2:12pm

Unfortunately, TrueSkill seems to be patented and not that well documented…

FredericBautista · May 9, 2016, 2:20pm

Here you have the TueSkill functionalities overview and the maths behind it.

Magus · May 9, 2016, 2:27pm

For CSB there was many many more games. Any time we have to face a player in the ranking we had to make 10 games, not just one. This contest ranking was not built on the same rules and i don’t know why.

FredericBautista · May 9, 2016, 2:29pm

For some reason my previous reply was to last anst post. (It’s showing anst as the replied to person in the edition draft) But it’s showing as a reply to myself instead… Maybe because I quoted part of my own previous message? Strange…