Hi,
My current bot for CSB is also an NN, but I used a different approach from Agade's and pb4's.
My NN directly outputs a joint policy for both of my bots (6 choices for each bot, so 6x6 = 36 for the joint policy), and I sample from this policy at each turn. I have no search algorithm on top of it: I do only one NN evaluation per turn.
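For illustration, here is a minimal sketch (not my actual code; `probs`, the RNG, and the action encoding are placeholders) of what sampling a joint action from the 36-way softmax output can look like in C:

```c
#include <stdlib.h>

/* Sample a joint action index from the 6x6 policy.
   'probs' is assumed to hold the 36 softmax outputs of the network. */
int sample_joint_action(const float probs[36])
{
    float r = (float)rand() / (float)RAND_MAX;
    float cum = 0.0f;
    for (int a = 0; a < 36; a++) {
        cum += probs[a];
        if (r <= cum)
            return a; /* e.g. bot 1 plays a / 6, bot 2 plays a % 6 */
    }
    return 35; /* safety net if the probabilities sum slightly below 1 */
}
```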
For the NN training, I used a policy gradient algorithm: A2C. A few points:
- self-play using the same NN for both players
- full roll-outs and GAE (Generalized Advantage Estimation) instead of n-step returns for the advantage (see the sketch after this list)
- advantages are normalized for each batch
- an entropy bonus on the policy in the cost function to (try to) avoid converging too fast on a sub-optimal policy (also sketched below)
- 2 types of reward, a training reward and the real reward; I linearly switched from the training reward to the real reward during the first part of training (details in the rewards section below)
- Adam for the gradient descent, with really large batches (a few thousand samples)
- input normalization (hand-guessed statistics at first, then statistics gathered over a whole training run; sketched below)
- very long training duration. The NN learns a very good runner really fast (in less than 30 minutes, I think), but getting a good blocker takes a very long time.
- a gamma close to 1 (I am not sure which value I used for the submitted bot: either 0.999 or 0.9999)
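The advantage computation looks roughly like this (a minimal sketch, assuming textbook GAE over a full roll-out; all names are placeholders):

```c
#include <math.h>

/* GAE over one full roll-out of length T. rewards[t] and values[t]
   come from the roll-out; values[T] is the bootstrap value (0 for a
   terminal state). gamma and lambda are the usual GAE parameters. */
void compute_gae(const float *rewards, const float *values, int T,
                 float gamma, float lambda, float *advantages)
{
    float gae = 0.0f;
    for (int t = T - 1; t >= 0; t--) {
        float delta = rewards[t] + gamma * values[t + 1] - values[t];
        gae = delta + gamma * lambda * gae;
        advantages[t] = gae;
    }
}

/* Per-batch advantage normalization: zero mean, unit variance. */
void normalize_advantages(float *adv, int n)
{
    float mean = 0.0f, var = 0.0f;
    for (int i = 0; i < n; i++) mean += adv[i];
    mean /= (float)n;
    for (int i = 0; i < n; i++) var += (adv[i] - mean) * (adv[i] - mean);
    float std = sqrtf(var / (float)n) + 1e-8f;
    for (int i = 0; i < n; i++) adv[i] = (adv[i] - mean) / std;
}
```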
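The entropy term is the standard A2C one; in code (a sketch, with `beta` and the loss combination written in the usual textbook form, not my exact constants):

```c
#include <math.h>

/* Entropy H = -sum_a p_a * log(p_a) of the 36-way joint policy.
   Subtracting beta * H from the loss rewards staying stochastic,
   which delays premature convergence on a sub-optimal policy. */
float policy_entropy(const float probs[36])
{
    float h = 0.0f;
    for (int a = 0; a < 36; a++)
        if (probs[a] > 0.0f)
            h -= probs[a] * logf(probs[a]);
    return h;
}

/* loss = policy_loss + c_v * value_loss - beta * policy_entropy(probs) */
```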
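And the input normalization is nothing fancy (a sketch; the `mean`/`std` arrays stand for the gathered statistics):

```c
/* Per-feature normalization of the 80 inputs: x <- (x - mean) / std.
   mean[] and std[] come from hand guesses at first, then from
   statistics gathered over a whole training run. */
void normalize_inputs(float x[80], const float mean[80], const float std[80])
{
    for (int i = 0; i < 80; i++)
        x[i] = (x[i] - mean[i]) / std[i];
}
```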
For the rewards:
- Real reward: -1, 0, or 1 at the end of the game, and 0 otherwise
- Training reward: the real reward plus a small bonus/malus for each checkpoint passed and a bonus for each collision, at each step (sketched after this list)
- A hack to avoid run-aways: I made a player lose if one of its bots went too far away. It helped during early training, but it had side effects: my NN didn’t do anything anymore when an opponent bot was running away, and I had to work around that in my submitted bot (thanks pb4 for showing me this issue).
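Putting the two rewards and the linear switch together, the per-step reward can be sketched like this (the bonus constants and the `alpha` schedule are placeholders, not my real values):

```c
/* Per-step reward during training. 'real_reward' is -1/0/1 at the end
   of the game and 0 otherwise. 'alpha' ramps linearly from 0 to 1 over
   the first part of training, fading the shaping terms out. */
float training_reward(float real_reward, int checkpoints_passed,
                      int collisions, float alpha)
{
    float shaping = 0.05f * (float)checkpoints_passed  /* placeholder bonus */
                  + 0.01f * (float)collisions;         /* placeholder bonus */
    return real_reward + (1.0f - alpha) * shaping;
}
```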
My NN has 80 inputs, 1 + 36 outputs, and 4 hidden layers using leaky ReLU (1 common + 3 per output head), with tanh for the value and softmax for the policy.
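In code, the shape of the forward pass is roughly the following (a sketch only: the hidden width `H` and the weight layout are placeholders, the real sizes are not the point here):

```c
#include <math.h>

#define N_IN  80
#define H     64   /* hypothetical hidden width */
#define N_ACT 36

typedef struct { const float *W; const float *b; } layer_t; /* row-major W */

static float leaky_relu(float x) { return x > 0.0f ? x : 0.01f * x; }

/* y = activation(W x + b) for a dense layer of shape n_out x n_in. */
static void dense(layer_t l, const float *x, int n_in,
                  float *y, int n_out, int relu)
{
    for (int o = 0; o < n_out; o++) {
        float s = l.b[o];
        for (int i = 0; i < n_in; i++)
            s += l.W[o * n_in + i] * x[i];
        y[o] = relu ? leaky_relu(s) : s;
    }
}

static void softmax(float *x, int n)
{
    float m = x[0], sum = 0.0f;
    for (int i = 1; i < n; i++) if (x[i] > m) m = x[i];
    for (int i = 0; i < n; i++) { x[i] = expf(x[i] - m); sum += x[i]; }
    for (int i = 0; i < n; i++) x[i] /= sum;
}

/* 1 common hidden layer, then 3 hidden layers per head, a tanh value
   output and a 36-way softmax policy output. */
void forward(const float in[N_IN], layer_t common,
             const layer_t vh[3], const layer_t ph[3],
             layer_t vout, layer_t pout,
             float *value, float policy[N_ACT])
{
    float c[H], h1[H], h2[H], v;
    dense(common, in, N_IN, c, H, 1);

    dense(vh[0], c,  H, h1, H, 1);      /* value head */
    dense(vh[1], h1, H, h2, H, 1);
    dense(vh[2], h2, H, h1, H, 1);
    dense(vout,  h1, H, &v, 1, 0);
    *value = tanhf(v);

    dense(ph[0], c,  H, h1, H, 1);      /* policy head */
    dense(ph[1], h1, H, h2, H, 1);
    dense(ph[2], h2, H, h1, H, 1);
    dense(pout,  h1, H, policy, N_ACT, 0);
    softmax(policy, N_ACT);
}
```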
To get a submitted bot below 100k, I do a final training pass with the weights quantized down to 7 bits, using a quantization-'aware' gradient descent with a reduced learning rate and entropy coefficient.
It shrinks the NN from ~200k of binary floats to a ~60k base85-encoded string used by my code (in C) while keeping quite good accuracy.
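The quantization itself could look like this (a sketch under assumptions: I am showing a symmetric uniform scheme with a per-layer scale and a straight-through-style update, which is one common way to do 'aware' training, not necessarily the exact scheme I used):

```c
#include <math.h>

/* Symmetric uniform 7-bit quantization (assumed scheme): weights are
   mapped to integers in [-63, 63] with a per-layer scale. During the
   'aware' pass, the forward pass uses dequantize7(quantize7(w)) while
   the gradient keeps updating the full-precision w. */
int quantize7(float w, float scale)
{
    int q = (int)lroundf(w / scale);
    if (q >  63) q =  63;
    if (q < -63) q = -63;
    return q;
}

float dequantize7(int q, float scale)
{
    return (float)q * scale;
}
```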
Good game to Agade and pb4, and thanks a lot for your article. I tried multiple times to do Nash/CE Q-learning, but I never succeeded in getting anything to converge.
Don’t hesitate to ask if you have any questions.