I was also recently able to successfully do Q-learning on CSB. I trained a runner bot with an 8-64-64-64-6 neural network using Deep Double Q-Learning. The network takes as inputs information relative to the position of the next two checkpoints and outputs the Q-values of 6 possible moves: (+18, 0, -18 angle) x (0, 200 thrust). It performs pretty well at its task of passing checkpoints as fast as possible, leading to replays such as this, this and this. I have submitted it to the online arena and it reached a rank of ~90, so you can play games against it if you’d like.
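To make the 6-move output concrete, here is a minimal sketch of how an output index could be mapped to a move and how the greedy move is picked from the Q-values. The index layout (index = 2 * angle slot + thrust slot) and the names `Move`, `decodeMove` and `argmaxQ` are assumptions for illustration, not necessarily how my bot orders its outputs.

```cpp
#include <array>

// One discrete move: a steering change in degrees and a thrust value.
struct Move {
    int angleDelta;  // -18, 0 or +18 degrees
    int thrust;      // 0 or 200
};

constexpr int kNumMoves = 6;  // 3 angles x 2 thrusts

// Hypothetical layout: index = 2 * angleSlot + thrustSlot.
Move decodeMove(int index) {
    static const int angles[3] = {-18, 0, 18};
    static const int thrusts[2] = {0, 200};
    return Move{angles[index / 2], thrusts[index % 2]};
}

// Greedy policy: return the index of the largest predicted Q-value.
int argmaxQ(const std::array<double, kNumMoves>& q) {
    int best = 0;
    for (int i = 1; i < kNumMoves; ++i) {
        if (q[i] > q[best]) best = i;
    }
    return best;
}
```

During a race the bot would call `decodeMove(argmaxQ(q))` on the network output; during training the epsilon-greedy policy replaces the argmax with a random index part of the time.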
Unlike [CPC]Herbert (previously rOut) I haven’t tried to train a blocker yet. However, unlike his, my training is stable for activation functions other than tanh. I suspect this comes from my strict application of DeepMind’s Deep Q-learning algorithm, including: error clipping, so that the rare huge gradients are never passed through the NN (the error signal passed back is no larger than 1 in absolute value); sweeping the epsilon of the epsilon-greedy policy from 0.95 to 0.05 at the beginning of the training; and the Deep Double Q-learning improvement. I also use one network instead of his two separate networks for thrust and angle.
Supervised vs Reinforcement Learning
A while back I did my first NN runner bot on CSB. It ranked pretty well for a runner bot (~37 at the time), but it was trained by supervised learning to copy my heuristic runner bot, and this showed in its play style. On average it would copy pretty well, but it would sometimes make a small mistake like turning too early and waste something like 50 turns going back to the CP it had missed. What is really nice about this reinforcement learning algorithm is that it directly optimises for what you care about, in this case racing, instead of optimising for the average error on a copy, which is only indirectly related to racing.
Non-NN Moves
I hardcoded the first move: boost towards the next CP. I also override the NN’s thrust decision with “SHIELD” if there will be a collision with the enemy next turn (assuming no thrusts) at a relative speed above some constant. The SHIELD doesn’t seem to influence the ranking much, though.
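The shield override can be sketched as a small collision test: let both pods coast at their current velocity for one turn and check whether they touch, then compare the relative speed to a threshold. The `Pod` struct, the function name and both parameters are assumptions for illustration (I believe the CSB pod radius is 400, but treat that value and the threshold as things to check yourself).

```cpp
#include <cmath>

struct Pod {
    double x, y;    // position
    double vx, vy;  // velocity
};

// True if the pods would touch within the next turn while coasting (no thrust)
// and their relative speed is above minRelSpeed.
bool shouldShield(const Pod& me, const Pod& enemy,
                  double podRadius, double minRelSpeed) {
    const double dx = enemy.x - me.x, dy = enemy.y - me.y;
    const double dvx = enemy.vx - me.vx, dvy = enemy.vy - me.vy;
    const double relSpeed2 = dvx * dvx + dvy * dvy;
    if (relSpeed2 <= minRelSpeed * minRelSpeed) return false;  // too gentle to matter

    // Solve |d + t * dv|^2 = (2 * radius)^2 for the earliest time t.
    const double r = 2.0 * podRadius;
    const double a = relSpeed2;
    const double b = 2.0 * (dx * dvx + dy * dvy);
    const double c = dx * dx + dy * dy - r * r;
    if (b >= 0.0) return false;  // pods are moving apart
    if (c <= 0.0) return true;   // already touching and closing in
    const double disc = b * b - 4.0 * a * c;
    if (disc < 0.0) return false;  // trajectories never get close enough
    const double t = (-b - std::sqrt(disc)) / (2.0 * a);
    return t <= 1.0;  // contact happens during the coming turn
}
```

A call such as `shouldShield(me, enemy, 400.0, threshold)` would then replace the thrust chosen by the NN with SHIELD when it returns true.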
Implementation
I implemented it in C++ without libraries. I was able to thread the mini-batch learning process, but the scaling isn’t very good: a bit under 2x when using 4 cores.
Details
Linear output layer, relu fully connected hidden layers.
Learning rate of 1e-4 to 1e-7 in the backpropagation. I lower it to 1e-7 by hand at the end of the training to get a bit closer to the optimum.
Reward of 1 for passing a CP, 0 otherwise
Gamma 0.95 in the Bellman equation
Mini-batch of 100 Bellman iterations per turn of a race
Memory of 1000 for action replay
Sweep epsilon from 0.95 to 0.05 linearly over the first 1e4 races
50 races until updating the target network in deep double Q-Learning
Inputs normalised to mean 0 and standard deviation 1: I sample random states to estimate the mean and standard deviation of each input, then subtract the empirical mean and divide by the empirical standard deviation.
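Two of the details above are easy to show in code: the linear epsilon sweep and the input normalisation. This is a minimal sketch with assumed names (`epsilonForRace`, `Normaliser`); the guard for constant inputs is my own addition for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Linear sweep from epsStart to epsEnd over the first sweepRaces races,
// then constant at epsEnd.
double epsilonForRace(int race, double epsStart = 0.95, double epsEnd = 0.05,
                      int sweepRaces = 10000) {
    const double f = std::min(1.0, static_cast<double>(race) / sweepRaces);
    return epsStart + f * (epsEnd - epsStart);
}

// Per-input normalisation fitted on randomly sampled states.
struct Normaliser {
    std::vector<double> mean, stddev;

    // samples must be non-empty and all of the same length.
    void fit(const std::vector<std::vector<double>>& samples) {
        const std::size_t n = samples.size(), d = samples.front().size();
        mean.assign(d, 0.0);
        stddev.assign(d, 0.0);
        for (const auto& s : samples)
            for (std::size_t i = 0; i < d; ++i) mean[i] += s[i] / n;
        for (const auto& s : samples)
            for (std::size_t i = 0; i < d; ++i)
                stddev[i] += (s[i] - mean[i]) * (s[i] - mean[i]) / n;
        for (std::size_t i = 0; i < d; ++i) {
            stddev[i] = std::sqrt(stddev[i]);
            if (stddev[i] < 1e-12) stddev[i] = 1.0;  // constant input: avoid dividing by zero
        }
    }

    // Subtract the empirical mean and divide by the empirical standard deviation.
    std::vector<double> apply(const std::vector<double>& x) const {
        std::vector<double> out(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) out[i] = (x[i] - mean[i]) / stddev[i];
        return out;
    }
};
```

The normaliser is fitted once before training and the same statistics are then hardcoded in the submitted bot, so the network always sees inputs on the scale it was trained on.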
Conclusion
I think I can improve performance to train my NNs faster. The ideal would be to train with a high-speed library like TensorFlow on a GPU and then export to C++ code. I would like to convert my Stochastic Gradient Descent to an adaptive-learning-rate algorithm like RMSprop, hoping for better and quicker convergence. I want to try training a blocker. I’m also curious about Policy Gradient optimisation, which is considered a better alternative to Q-learning; here is an article linked by inorry on that subject.
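For reference, the RMSprop update I have in mind looks roughly like this: each weight keeps a running average of its squared gradient, and the step is divided by the square root of that average. It still has a base learning rate, but it should need much less hand-tuning than my current schedule. The struct name and the default values are assumptions for illustration; I have not tested this on CSB yet.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct RmsProp {
    double learningRate = 1e-4;  // base step size (assumed default)
    double decay = 0.9;          // smoothing of the squared-gradient average
    double epsilon = 1e-8;       // avoids division by zero
    std::vector<double> meanSquare;

    // One update: weights and grads must have the same length.
    void step(std::vector<double>& weights, const std::vector<double>& grads) {
        if (meanSquare.size() != weights.size()) meanSquare.assign(weights.size(), 0.0);
        for (std::size_t i = 0; i < weights.size(); ++i) {
            meanSquare[i] = decay * meanSquare[i] + (1.0 - decay) * grads[i] * grads[i];
            weights[i] -= learningRate * grads[i] / (std::sqrt(meanSquare[i]) + epsilon);
        }
    }
};
```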
Acknowledgement
Doing this I sparked pb4’s interest in Reinforcement learning again so he has come back to try it and has very recently succeeded as well. Chatting with him, he helped me find important bugs. I also used as inputs what he was using back in our Supervised learning days. Of course [CPC]rOut/Herbert’s work was a major inspiration as well.