85th overall, 23rd in gold
I didn’t use any search; I just trained a NN to pick the move it thinks is best.
On the first day I trained directly against the referee using evolution strategies, to see whether a policy NN was feasible. It was; that method alone was enough to reach mid gold. But for most of the contest I used deep Q-learning to train the network.
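The idea behind that evolution-strategies bootstrap is roughly the following (a schematic sketch, not my actual code; the fitness function, noise scale, learning rate and population size are placeholders):

```python
import numpy as np

def evolution_strategies(theta, fitness, iterations=1000,
                         population=50, sigma=0.1, lr=0.02):
    """OpenAI-style ES: estimate the gradient of the expected fitness by
    sampling Gaussian perturbations of the flattened NN weight vector."""
    for _ in range(iterations):
        noise = np.random.randn(population, theta.size)
        # Evaluate each perturbed policy, e.g. by win rate over a batch of
        # games played against the referee (fitness is a placeholder here).
        rewards = np.array([fitness(theta + sigma * n) for n in noise])
        # Normalize rewards and step toward the better-scoring perturbations.
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        theta = theta + lr / (population * sigma) * noise.T @ rewards
    return theta
```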
Most of the time was spent tweaking hyperparameters and the net architecture/inputs. I did several runs from scratch, which gave me several differently trained bots to test against. The bot was easily in the top 100 the whole time, but unfortunately that was not enough for legend. I desperately tried several things at the end, even some search, but it didn’t work, because I had never done a serious search for simultaneous games before, let alone 3 players plus randomness. All I can say with pride is that my bot doesn’t do any simulation. It was weird throughout the runs to see my bot among the strongest ones, yet lose with 0 points against 2 default AIs in the IDE, because it would not earn any gold or silver medals in hurdles. But because it got all golds in the other minigames, it would still think it was winning (its moves got good scores).
Given that the game was easy to simulate, I wonder why I didn’t try that from the start; two weeks should be enough to learn about tree search for such games. Meh, maybe next time.
I really liked the design of the game, the pixel art and the simple rules, though the statement contained some errors and was not 100% clear in some aspects. The viewer could have shown more detailed information, at least in debug mode. People on Discord were constantly asking why they got 0 points when they had the most gold medals :). It seems using AI for cover images is a tradition now.
Appendix:
The NN has 2 hidden layers with 200 nodes each, 214 inputs and 4 outputs: 214x200 + 200x200 + 200x4 weights. The inputs are from the perspective of me vs. the others. I experimented with various inputs. One thing Lysk told me was to use hurdle positions relative to the players; it sped up the training, as can be seen here (the y-axis is the win rate against test bots; green is the updated inputs).
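In PyTorch terms that topology would look roughly like this (the framework and the activation function are my choices for illustration; the post only fixes the layer sizes):

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """214 inputs -> 200 -> 200 -> 4 outputs, one Q-value per move."""
    def __init__(self, n_inputs: int = 214, n_hidden: int = 200, n_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),
            nn.ReLU(),  # activation is an assumption, not stated in the post
            nn.Linear(n_hidden, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```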
Deep Q-learning params (a rough sketch of the update loop follows the list):
- double deep Q-learning, update the target network every 4096 turns
- NN learning: SGD with 0.8 momentum and learning rate alpha
- alpha: 0.001 to 0.0005 during a run; in the final run, fine-tuning with 0.00005, which stabilized the training
- exploration: e-greedy with e = 0.05 (more in the beginning)
- gamma: 0.99
- replay buffer of 1 million positions
- learn from 32 positions sampled randomly from the buffer each turn
- each turn, evaluate all 3 players and get their results, so 3 positions were added to the buffer per turn
- reward only at the end: 1.0 for 1st, 0.25 for 2nd and 0 for 3rd
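Putting those parameters together, a schematic double-DQN update step could look like this (PyTorch again, reusing the QNet sketch from above; anything not in the bullets, such as the loss function or the environment/feature encoding, is an assumption):

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

GAMMA, EPSILON, BATCH_SIZE, TARGET_UPDATE = 0.99, 0.05, 32, 4096

online, target = QNet(), QNet()
target.load_state_dict(online.state_dict())
# lr decayed 0.001 -> 0.0005 during a run (0.00005 for fine-tuning); not shown here.
optimizer = torch.optim.SGD(online.parameters(), lr=0.001, momentum=0.8)
replay = deque(maxlen=1_000_000)  # (state, action, reward, next_state, done) tuples

def act(state: torch.Tensor) -> int:
    """e-greedy move selection over the 4 outputs."""
    if random.random() < EPSILON:
        return random.randrange(4)
    with torch.no_grad():
        return int(online(state.unsqueeze(0)).argmax(dim=1))

def learn(step: int) -> None:
    if len(replay) < BATCH_SIZE:
        return
    s, a, r, s2, done = zip(*random.sample(replay, BATCH_SIZE))
    s, s2 = torch.stack(s), torch.stack(s2)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)      # 1.0 / 0.25 / 0 at terminal states only
    done = torch.tensor(done, dtype=torch.float32)

    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Double DQN: the online net picks the action, the target net evaluates it.
        best = online(s2).argmax(dim=1, keepdim=True)
        q_next = target(s2).gather(1, best).squeeze(1)
        y = r + GAMMA * (1.0 - done) * q_next

    loss = F.mse_loss(q, y)  # loss choice is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % TARGET_UPDATE == 0:
        target.load_state_dict(online.state_dict())
```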