Where to begin and how to explain? Maybe with some history. Because in my case, this is not the result of a mere 10 days of competition, but rather years in the making. It all started in 2019 when @pb4 and @Agade broke all CSB records by submitting, to my knowledge, the very first overwhelmingly dominant bots based on neural networks (NN) and reinforcement learning (RL) at CG. I remember being completely floored by the achievement, and I am so thankful for all the help and pointers @pb4 generously provided to get me started on my own journey of machine learning (ML). Over the next two years I read dozens of papers on the subject, failed hundreds of attempts and made a few things work along the way (Coders Strike Back, Bit Runner 2048), as well as applying ML to a project at work.
Fast forward to around last Christmas, when I started building a new RL pipeline inspired by DeepMind’s AlphaGo/Zero. I like naming things but had no inspiration until this contest, so I am now calling it “Totozero”. It began, as usual, with having way too much curiosity and free time on my hands, a general desire to learn more ML, as well as accumulated years of frustration / hatred at being unable to gain a significant edge with my bot at Ultimate-Tic-Tac-Toe. At first I only wanted to finally see what near-perfect play was like, even if it could not run on CG. But one thing led to another, and after months of development and optimization, I finally got something to work and briefly took the #1 spot.
It is a lot of hard work, but the payoff is so high when you get something to finally learn on its own that I could not stop there. Soon after, the pipeline became flexible enough to easily plug in other games, such as Breakthrough, Othello and Checkers. Results were generally very good and achieved in a short time, though not as dominant as initially hoped. However, the motivation faded soon after the first results, as I was much more interested in the process than the games themselves. (BUT if I get to chess one day…)
Given this context, when the contest started, the game rules created the perfect storm: board game, 1v1, perfect information, state with a simple representation, large state space, complex evaluation, 100ms per turn… I could not have asked for a better fit. Having all the ingredients and tools ready for RL, all that was needed was to code the simulation, plug it into the pipeline, figure out the right way to do training, pick the right neural architecture, let it run and make more tweaks along the way. Oh, and make it fit under contest constraints, which are stricter than those of the multiplayer games (100K source code, no “obfuscation”).
It still took a few days before finally starting to get results. First tests in the IDE were disappointing, maybe around top 100 at the time? For some reason, I initially thought a “single-player” Monte Carlo Tree Search (MCTS), where the next state is obtained by simply pairing my own chosen move with a predicted opponent move, would be enough. Maybe because I assumed opponent interaction to be minimal and therefore mostly ignorable by focusing on maximizing my own score. It turned out to be an awful direction. Self-play training hit difficult problems where it would often collapse into both sides WAITing most of the rounds. One side thought it was winning for sure, so it ended up WAITing to accelerate the end of the game, which caused the move prediction to disproportionately output WAITs. That in turn made the other bot think it was winning as well, because it assumed the opponent would WAIT and made only minimal adjustments, so it ended up mostly WAITing too. The whole thing turned into a feedback loop with a side of self-fulfilling prophecy and matches averaging 30 turns. Even after adding hacks left and right, the results were still unconvincing.
I was afraid of biting the bullet of exponential action space by considering the opponent’s moves, but it had to be done. Enter Decoupled UCT (DUCT) into the mix, and at the end of the first training it… worked so unreasonably well.
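For readers unfamiliar with DUCT, here is a minimal sketch of the core idea: each player keeps its own action statistics at a node and picks its action independently, and the pair of picks forms the joint move. This is only an illustration under assumptions, not the bot's actual code: the class name `DuctNode`, the plain UCB1 formula and the constant `c` are all chosen for the example, and an AlphaZero-style search would more likely weight the exploration term with the network's policy priors.

```python
import math

class DuctNode:
    """Statistics for one simultaneous-move state (hypothetical layout)."""

    def __init__(self, num_actions_me, num_actions_opp):
        self.visits = 0
        # Index 0 = me, index 1 = opponent. Each player has its own table,
        # so storage grows with A + B instead of A * B per node.
        self.n = [[0] * num_actions_me, [0] * num_actions_opp]      # visit counts
        self.w = [[0.0] * num_actions_me, [0.0] * num_actions_opp]  # value sums
        self.children = {}  # joint action (a_me, a_opp) -> DuctNode

    def select(self, c=1.4):
        """Each player picks its own action with UCB1, ignoring the other's pick."""
        joint = []
        for p in (0, 1):
            best, best_score = 0, -math.inf
            for a, (n, w) in enumerate(zip(self.n[p], self.w[p])):
                if n == 0:
                    score = math.inf  # try every action at least once
                else:
                    score = w / n + c * math.sqrt(math.log(self.visits) / n)
                if score > best_score:
                    best, best_score = a, score
            joint.append(best)
        return tuple(joint)

    def update(self, joint, value_for_me):
        """Back up one result; the opponent sees the negated value (zero-sum)."""
        self.visits += 1
        for p, a in enumerate(joint):
            self.n[p][a] += 1
            self.w[p][a] += value_for_me if p == 0 else -value_for_me


# Toy usage: a one-shot matrix game where my action 1 always wins.
payoff = [[-1.0, -1.0], [1.0, 1.0]]
root = DuctNode(2, 2)
for _ in range(200):
    joint = root.select()
    root.update(joint, payoff[joint[0]][joint[1]])
```

In a full search, `children[joint]` would hold the node reached by applying both actions at once, and the value passed to `update` would come from the network evaluation of the leaf.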
No seriously, it was really hard to find losses against the top bots at the time. I have participated in competitions for years and it just never happens, so it felt a bit surreal. To the point where nothing afterwards felt nearly as groundbreaking. Sure, a lot more time was still spent on dozens of fine-tuning tweaks here and there, while letting my GPU run hot like it’s in a cryptomining farm, and it did end up gaining around +600 self-play Elo rating throughout the process, after it was already hardly losing. This was measured by keeping a local arena with dozens of checkpoints playing against each other, acting as a sort of weak validation that the training is not stagnating or regressing. I used Ordo to compute ratings, which is a great little tool for that. Sure, the arena does not confirm the bots are not stuck in a pointless little meta of their own, but that’s the gamble of self-play. The RL setting seems to help counteract a lot of the homogeneity problems that occur in local arenas like that, where the performance is often meaningless because of local exploitation / overfitting.
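To give a rough feel for what a +600 rating gap means, the snippet below converts a rating difference into an expected score using the standard logistic Elo formula. That formula is an assumption here: Ordo's own scale may differ slightly, and self-play ratings do not transfer directly to the live leaderboard.

```python
def expected_score(elo_diff):
    """Expected score (win = 1, draw = 0.5, loss = 0) of the higher-rated
    side under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Under this model a 600-point gap is an expected score of roughly 0.97
# for the later checkpoint against the earlier one.
gap_600 = expected_score(600)
```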
My confidence was high when finally submitting my first version from the Bronze league on Saturday night, before it got hit by multiple losses in the Silver league.
After a bit of panic, the cause turned out to be something hard to believe. It was losing because the opponent was playing… too badly?!
By the midgame, the bot was getting such impossibly big leads, never seen during training (and not as visible during my testing in the IDE against legend bots), that it caused the neural network to become complacent and incredibly unhelpful. It saw every single move, even after going through multiple plies, as winning with over 99.9% certainty. But convergence to a good move cannot be achieved without some contrast in the scoring signal to tell what’s good and what’s bad. Nope, “there is literally no way to lose from this position”. Except it lost by more or less playing random moves until the situation sometimes got too dire to recover from.
There comes a point in every programmer’s career where the need to “ship it” takes over everything else, and thus a disgusting hack was born. If the NN returns too high a win (or loss) percentage, the bot takes the state, computes the final score delta between self and opponent as if the game ended right there, and… multiplies that by a win percentage, adding it directly to the result. Yes, that means nonsense like 110% or even 200% winrate could be backpropagated. At least it’s a signal that converges into “winning more” and finally converts the game. If it works, is simple, gets the job done and you need to ship now… Well, ship it!
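The hack described above can be sketched roughly as follows. The names `SATURATION` and `BONUS_PER_POINT`, their values, and the `[-1, 1]` value convention are assumptions made for this illustration; the actual thresholds and scaling used by the bot are not given.

```python
SATURATION = 0.99        # hypothetical: beyond this the NN value is "too sure"
BONUS_PER_POINT = 0.01   # hypothetical: value added per point of score delta

def adjusted_value(nn_value, my_score, opp_score):
    """Return the value to back up for a leaf.

    nn_value is the network's estimate in [-1, 1]. my_score and opp_score
    are the final scores as if the game ended right now.
    """
    if abs(nn_value) < SATURATION:
        return nn_value  # the network still discriminates between moves
    # Saturated: add a term proportional to the score delta so that
    # "winning by more" (or "losing by less") still ranks higher.
    return nn_value + BONUS_PER_POINT * (my_score - opp_score)
```

With these made-up constants, a saturated leaf with a 20-point lead backs up a value above 1.0, which is exactly the kind of "110% winrate" nonsense mentioned above, but it restores an ordering between otherwise indistinguishable moves.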
It turns out this was also sometimes helpful even in Legend league. It took a bit of tuning to get right, but it was definitely beneficial. I did sense a bit of irony plugging magic hardcoded constants, old-fashioned style, into something meant not to need them in the first place, but hey, that’s life. Looking back now, I have several ideas that could more properly fix this issue, but that will be something to explore with much less time pressure.
The Devil in the Details
For good or (probably, mostly) bad reasons, I am not comfortable publicly exposing too many details, so I apologize for that. Still, hopefully this will satiate most of your curiosity.
- Bitboards were extensively used to simulate the game. This was mainly motivated by training performance, not bot performance.
- I only used minimal pruning, preferring to give more freedom for the AI to learn on its own. Only one source of seeding was allowed per cell, and it favors the biggest tree to do it. No completes on last turn if it lowers the final score. No grows on last turn. No seeding at all on last round since I did not think it should matter, something I was very wrong about.
- As a result of minimal pruning, I have no idea if my bot violates common knowledge like “do not seed next to your tree” on purpose or by mistake. To be continued.
- Only a few thousand states are evaluated on average each turn. This is often the source of bad plays in endgames where more bruteforce approaches would be strictly better.
- The neural network is made of “many” hex-convolutional layers stacked like pancakes.
- Its input is the hex grid stored in two dimensions with “many” channels per cell.
- Its outputs consist of one value [-1,1] for win/lose confidence and two policies, one per player, to accommodate asymmetry and simultaneous play.
- Inference was done through extensive use of C++ templates, AVX2 intrinsics, and verification through disassembly.
- Evaluations are cached in a hashtable and reused throughout the game.
- One of the most common misconceptions on CG is that NNs are very hard or impossible to use because of the size limits. However only 71% of the maximum allowed source size was utilized to compile my bot. “100K ought to be enough for anybody!” (see encoding techniques already described in this thread, though personally I prefer bit streams :P)
- As a sidenote, IMO the real constraint on CG has always been the low CPU runtime rather than the code size. But I have often compared Codingame to the demoscene of AI. I find having to make the best out of very limited (even if maybe arbitrary) constraints to be a beauty in itself.
- The final model played approximately 1.3 million games in self-play for over 50 hours of training, using an Intel i7-8700K, a GeForce RTX 3080 and 64GB of well-utilized RAM. I lost track of the failed runs, and did not include the best model I ended up not using for safety reasons (only +45 Elo), but the main one is where the vast majority of training went.
- The local arena contained 85 checkpoints and played 189k matches in total.
- For those concerned about the asymmetry of “AI hiding”, believe it or not, I have run zero batches during this competition. Only maybe a couple hundred plays in the IDE as random validation that things did not go wrong. I am however, regardless of that, still extremely opposed to CG crippling batches permanently as of this contest.
- A note to CG for future contests: please write the mapgen code in a way that is easy to port to languages other than Java. I had to hardcode the result of that line, which was luckily constant regardless of the seed.
- Speaking of map generation, my NN is very unhappy with some of the maps coming out of the generator, because of the sun asymmetry. Sometimes it even estimated its chance of winning at 20% before the game even started…
- Probably forgot to mention something.
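As an illustration of the "hex grid stored in two dimensions" and "hex-convolutional" points in the list above, here is one common way to do it, which is not necessarily what Totozero does: store the board in axial coordinates inside a square array, and mask two opposite corners of a 3x3 kernel so that it only covers a cell and its six hex neighbours. The board radius of 3 and all names below are assumptions chosen for the example.

```python
RADIUS = 3                # assumed board radius (centre cell plus 3 rings)
SIZE = 2 * RADIUS + 1     # side of the square array holding the hex grid

def in_board(q, r):
    """True if axial coordinate (q, r) lies on the hexagonal board."""
    return max(abs(q), abs(r), abs(q + r)) <= RADIUS

def to_index(q, r):
    """Map axial (q, r) to (row, col) in the SIZE x SIZE array."""
    return (r + RADIUS, q + RADIUS)

# All valid cells; the remaining corners of the square array are padding.
CELLS = [(q, r)
         for r in range(-RADIUS, RADIUS + 1)
         for q in range(-RADIUS, RADIUS + 1)
         if in_board(q, r)]

# 3x3 kernel mask indexed as [dr + 1][dq + 1]. In axial coordinates the
# offsets (-1, -1) and (+1, +1) are not hex neighbours, so they are zeroed.
HEX_KERNEL_MASK = [
    [0, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
]

def neighbors(q, r):
    """On-board hex neighbours of (q, r), derived from the kernel mask."""
    out = []
    for dr in (-1, 0, 1):
        for dq in (-1, 0, 1):
            if (dq, dr) == (0, 0) or not HEX_KERNEL_MASK[dr + 1][dq + 1]:
                continue
            if in_board(q + dq, r + dr):
                out.append((q + dq, r + dr))
    return out
```

Multiplying an ordinary 3x3 convolution kernel by `HEX_KERNEL_MASK` is then enough to get a hexagonal receptive field out of a standard 2D convolution.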
Thanks to Codingame and the community for another great competition! I sincerely hope this will remain one of the main focuses of this website for many years to come.