Hi!
I’ve been playing for a few months with Artificial Neural Networks - or ANNs - and in particular, I’ve been trying to create an AI for CSB using an ANN, trained with reinforcement learning (Deep Q-learning, as explained in the DeepMind paper) instead of mimicking an existing AI with supervised learning.
As I discussed it a little bit on the chat, and as my AI has been doing quite OK so far (currently around 30th in the Legend league), I’ve been asked to explain a little bit how it works, so here it is.
Introduction / References
As an introduction, I think a bit of knowledge about ANNs is welcome - see the links in the posts above.
ANN introduction
To make it very simple, here is my understanding of an ANN:
An ANN is an approximation of a non-linear function with i input dimensions and o output dimensions, consisting of the composition of n linear functions and n-1 non-linear activation functions:
f: ℝ^i → ℝ^o,  f = f_n ∘ a_(n-1) ∘ f_(n-1) ∘ … ∘ a_1 ∘ f_1
I’m not going into the math details, but this roughly means that an ANN can be used to approximate any other function, as long as there are enough hidden layers, or the hidden layers are large enough (the f_k for 0 < k < n).
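For example, with a single hidden layer (n = 2) this is simply f(x) = f_2(a_1(f_1(x))) = W_2 · tanh(W_1 · x + b_1) + b_2, where the W_k and b_k are the weight matrices and bias vectors tuned during training (here with tanh as the activation).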
Q-learning introduction
For a system that has states and transitions, one can define a function Q whose inputs are a given state and a possible transition from this state, and whose output is a single value: the expected reward. This value can then be used to select, in any given state of the system, the most promising transition in order to maximize the final reward.
Example: in the CSB game, the state of the game is all the inputs given by CG at the beginning of a turn (positions of the checkpoints, positions of the pods, speeds of the pods, directions of the pods, next checkpoint ids), and the possible transitions are all the possible actions that a player could choose.
In a state close to the end of the run, the Q value of any action that leads to an immediate win would be higher than the Q value of any action that leads to a win in 2 or more turns.
The next idea behind Q-learning is that such a Q function can be approximated iteratively: play many games and update the approximation at each transition taken, until the approximation appears to be good enough:
For all states s
    For all transitions t from state s
        Q(s,t) := Initial value
S := Initial state
Loop
    T := Sample from all possible transitions at state S
    S' := Compute S, T
    R := Evaluate immediate reward
    R' := Evaluate max(Q(S',t) for all transitions t from state S')
    Q(S,T) := Q(S,T) + α * (R + γ * R' - Q(S,T))
    S := S'
- The α factor is the learning rate, which would be 1 if the system is deterministic, or < 1 if the system is stochastic.
- The γ factor is the weight of future rewards compared to immediate rewards.
For more detail, please see the Wikipedia page which is good once you’ve got the basic idea: https://en.wikipedia.org/wiki/Q-learning
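To make the update rule above concrete, here is a minimal tabular sketch in C++ (the State and Transition types, and the way transitions are enumerated, are hypothetical placeholders, not CSB code):

#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Minimal tabular Q-learning update, for a small discrete state/transition space.
using State = int;
using Transition = int;

std::map<std::pair<State, Transition>, double> Q;  // the Q(S,T) table
const double alpha = 0.5;    // learning rate
const double gamma_ = 0.95;  // weight of future rewards vs immediate rewards

void q_update(State S, Transition T, State S_next, double R,
              const std::vector<Transition>& next_transitions)
{
    // R' := max(Q(S',t)) over all transitions t available from S'
    double R_next = 0.0;  // stays 0 if S' is terminal (no outgoing transitions)
    if (!next_transitions.empty()) {
        R_next = Q[{S_next, next_transitions.front()}];
        for (Transition t : next_transitions)
            R_next = std::max(R_next, Q[{S_next, t}]);
    }
    // Q(S,T) := Q(S,T) + alpha * (R + gamma * R' - Q(S,T))
    double& q = Q[{S, T}];
    q += alpha * (R + gamma_ * R_next - q);
}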
Cool but what to do with that now?!
ANNs are cool for approximating functions, but when you don’t have any reference function, you don’t know what to approximate. Supervised learning is about approximating an existing model, but we don’t want a model, we want the AI to learn on its own!
Q-learning looks awesome, but in practice it’s not very usable as-is: who’s going to store the Q values for all the possible (S,T) pairs when there is an almost infinite number of possible transitions for each state?!
That’s what the DeepMind paper is about: approximating the Q function using an ANN. The ANN input would be (S,T) and the ANN output would be the Q value. Or better: the ANN input would be S, and its output would be n Q values, one for each possible transition.
Then, all you need is to make your computer crunch numbers until the ANN converges to a good approximation of the Q function… Easy!
Intermission: Of good ANN design
ANN design is not for everyone: choosing the right combination of the number of layers, the size of each layer and the type of each layer can be tricky.
But rest assured, there are some simple rules to follow to find a good design!
…
Just kidding, I couldn’t find any, it’s a mess.
I’ve got no clue why my 5-hidden-layer network (8-16-16-16-16-16-37) works better than the 8-16-16-16-16-37 one but worse than the 3-hidden-layer version, or why sigmoid and ReLU activation functions never converged while tanh works every time.
Anyway, I wrote myself a simple generic fully connected layer template and a tanh activation function, and that’s it: I just tweak the hidden layer count and sizes until I have something that converges fast enough.
Keeping both small is usually good, though not too small. The number of layers apparently matters less than the size of the layers, unless you feel your inputs really need to be combined a lot.
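For reference, a fully connected layer with a tanh activation is really just a matrix-vector product followed by tanh; the forward pass can be as simple as this (a simplified sketch, not my actual template):

#include <cmath>
#include <vector>

// Minimal fully connected layer: out = tanh(W * in + b).
// Weights are stored row-major; sizes are fixed when the layer is built.
struct dense_layer {
    int in_size, out_size;
    std::vector<double> W;  // out_size * in_size weights
    std::vector<double> b;  // out_size biases

    std::vector<double> forward(const std::vector<double>& in) const {
        std::vector<double> out(out_size);
        for (int o = 0; o < out_size; ++o) {
            double sum = b[o];
            for (int i = 0; i < in_size; ++i)
                sum += W[o * in_size + i] * in[i];
            out[o] = std::tanh(sum);  // tanh activation (skip it on the output layer)
        }
        return out;
    }
};

Chaining a few of these, with the activation dropped on the output layer so the Q values are not squashed into [-1, 1], gives the 8-16-…-37 networks mentioned above.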
For CSB, I had the feeling that a single ANN modelling all the possible actions at a given state would be troublesome. Even after limiting the action space to 37 possible angles and 50 values for the thrust, that would still be 1850 possible transitions, so 1850 Q values as outputs of the ANN. That didn’t feel right.
Instead, I went with two different ANNs: one giving me the Q values for the possible angles (37 outputs), and one giving me the Q values for the possible thrusts (50 outputs).
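At play time this simply means running both networks on the same inputs and keeping, for each of them, the index with the highest predicted Q value. Roughly (the network objects are hypothetical, standing in for whatever forward-capable type you use):

#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical play-time sketch: one network scores the candidate angles, the
// other the candidate thrusts, and we keep the best index from each.
template <typename Net>
std::pair<int, int> pick_action(const Net& angle_net, const Net& thrust_net,
                                const std::vector<double>& inputs)
{
    std::vector<double> angle_q  = angle_net.forward(inputs);   // 37 Q values
    std::vector<double> thrust_q = thrust_net.forward(inputs);  // 50 Q values

    int best_angle  = std::max_element(angle_q.begin(),  angle_q.end())  - angle_q.begin();
    int best_thrust = std::max_element(thrust_q.begin(), thrust_q.end()) - thrust_q.begin();

    // These are indices; they still have to be mapped back to an actual rotation
    // and thrust value according to the chosen discretization.
    return {best_angle, best_thrust};
}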
For the inputs, I got some inspiration from discussions on the chat. I believe pb4608 was advocating the use of normalized inputs and inputs relative to the pod, so I started with that as well.
At first I tried to be clever and preprocessed the inputs, for example giving the distance to the checkpoint along different directions as different inputs… but once I started to get some converging ANNs, it became clear that this was useless.
A simple dot(podFacingVector, podToCPVector) as input (plus the same with the orthogonal vector, both normalized) was enough; the ANN was able to chew on it without my needing to be clever.
My runner inputs currently look like this:
inputs[0] = dot(point{1, 0} * pod.dir, pod.spd) / max_speed;                                  // speed along the pod's facing direction
inputs[1] = dot(point{0, 1} * pod.dir, pod.spd) / max_speed;                                  // speed across the pod's facing direction
inputs[2] = dot(point{1, 0} * pod.dir, game::points[pod.next] - pod.pos) / max_distance;      // next CP, along
inputs[3] = dot(point{0, 1} * pod.dir, game::points[pod.next] - pod.pos) / max_distance;      // next CP, across
inputs[4] = dot(point{1, 0} * pod.dir, game::points[pod.next + 1] - pod.pos) / max_distance;  // CP after that, along
inputs[5] = dot(point{0, 1} * pod.dir, game::points[pod.next + 1] - pod.pos) / max_distance;  // CP after that, across
inputs[6] = dot(point{1, 0} * pod.dir, game::points[pod.next + 2] - pod.pos) / max_distance;  // second CP after that, along
inputs[7] = dot(point{0, 1} * pod.dir, game::points[pod.next + 2] - pod.pos) / max_distance;  // second CP after that, across
The point{1, 0} * and point{0, 1} * factors are for vector rotations.
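In case that notation is unclear: it reads like a complex-number multiplication, which rotates a vector into the frame defined by another one. A possible point type supporting it could look like this (an assumption about the implementation, just to show the idea):

struct point {
    double x, y;
};

// Complex-style product: rotates a by the angle of b (and scales by |b|).
point operator*(point a, point b) {
    return {a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x};
}

double dot(point a, point b) { return a.x * b.x + a.y * b.y; }

// With this, point{1, 0} * pod.dir is pod.dir itself, and point{0, 1} * pod.dir is
// pod.dir rotated by 90°, so (assuming pod.dir is a unit vector) the dot products
// above give the components of a vector along and across the pod's facing direction.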
Deep Neural Q-Training Super Plus™
Training the ANN to approximate the Q function is the easy part: it’s basically the same thing as in the pseudo-code above.
NN := Random weights
S := Initial state
While epsilon > epsilon_min
    Qs := NN.forward(S)
    If Random(0,1) < epsilon
        T := Sample from all possible transitions at state S
    Else
        T := Transition with the highest value in Qs
    S' := Compute S, T
    R := Evaluate immediate reward
    R' := Evaluate max(NN.forward(S')[t] for all transitions t from state S')
    Qs[T] := Qs[T] + α * (R + γ * R' - Qs[T])
    NN.backward(Qs)
    NN.update_weights()
    S := S'
    Decay epsilon
At each step, the only update that is back-propagated into the ANN is the update to the value of Q(S,T), and the ANN will - hopefully - converge to the Q function.
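In code, this just means building the training target from the network’s own output and changing a single component, so the back-propagated error is zero everywhere except at the chosen transition. A minimal sketch, assuming a hypothetical network type with the forward/backward/update_weights interface from the pseudo-code and a squared-error loss:

#include <vector>

// One training step. `Net` is a hypothetical network type with the
// forward/backward/update_weights interface used in the pseudo-code above.
template <typename Net>
void train_step(Net& NN, const std::vector<double>& S, int T,
                double R, double R_next, double alpha, double gamma)
{
    std::vector<double> Qs = NN.forward(S);
    std::vector<double> target = Qs;                    // same as the prediction everywhere...
    target[T] += alpha * (R + gamma * R_next - Qs[T]);  // ...except for the transition taken
    NN.backward(target);      // the error (target - Qs) is non-zero only at index T
    NN.update_weights();
}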
…
Of course, that doesn’t work, but we are close, I can feel it.
The first ANNs were not converging to anything very interesting. The problem I saw is that in CSB the fixed rewards - see below for the reward discussion - are sparse: rewards only happen when the pod crosses a CP, or when the pod times out.
If you implement the above pseudo-code directly, the Q values of the transitions leading to the eventual fixed reward are all updated before that fixed reward is actually known. This means that for the high reward of a winning sequence of actions to be propagated back to the first transition of the sequence, the sequence would have to be played again and again, n times, n being the number of transitions in the sequence.
In CSB, a timeout reward only happens after 100 transitions, so for the reward to propagate back to the transition leading to this timeout, the same 100 transitions would have to be picked 100 more times, from the same initial state, and this is highly unlikely to happen.
Now, the best way to quickly propagate the reward would be to update the transitions’ Q values in reverse order, but this would also be difficult to do.
So a good compromise, taken from the DeepMind paper, is to store each (S,T,S',R) event in a fixed-size memory, and instead of updating only the Q value of the currently selected transition, to randomly sample a batch of events from the memory and update the Q values for each of them.
Since the events are replayed often and out of order, the fixed rewards quickly propagate back to the transitions leading to them:
NN := Random weights
S := Initial state
While epsilon > epsilon_min
    Qs := NN.forward(S)
    If Random(0,1) < epsilon
        T := Sample from all possible transitions at state S
    Else
        T := Transition with the highest value in Qs
    S' := Compute S, T
    R := Evaluate immediate reward
    Memory.Append(S, T, S', R)
    S := S'
    Decay epsilon
    For i from 0 to batch_size
        Sm, Tm, Sm', Rm := Sample from Memory
        Rm' := Evaluate max(NN.forward(Sm')[t] for all transitions t from state Sm')
        Qs := NN.forward(Sm)
        Qs[Tm] := Qs[Tm] + α * (Rm + γ * Rm' - Qs[Tm])
        NN.backward(Qs)
    NN.update_weights()
Plus, using an event memory working in cooperation with a neural network feels absolutely natural.
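The replay memory itself can stay very simple: a fixed-size buffer that overwrites the oldest events and is sampled uniformly at random. A sketch (the event fields follow the (S,T,S',R) tuples above):

#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// One stored event: state, chosen transition, resulting state, immediate reward.
struct event {
    std::vector<double> S;       // state (the ANN inputs)
    int                 T;       // index of the chosen transition
    std::vector<double> S_next;  // resulting state S'
    double              R;       // immediate reward
};

// Fixed-size replay memory: once full, the oldest events are overwritten.
struct replay_memory {
    std::vector<event> events;
    std::size_t capacity;
    std::size_t next = 0;

    explicit replay_memory(std::size_t cap) : capacity(cap) { events.reserve(cap); }

    void append(event e) {
        if (events.size() < capacity) events.push_back(std::move(e));
        else                          events[next] = std::move(e);
        next = (next + 1) % capacity;
    }

    const event& sample() const {
        return events[std::rand() % events.size()];  // uniform random replay
    }
};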
High reward, dead or alive
As with the ANN design, the reward design is another bit of black magic.
Basically, I would say that you don’t want to create rewards out of nowhere: if there is nothing special about a state, just give a 0 reward, and only give rewards when something positive or negative happens. For example, in CSB, for my Runner training there are only three cases where I give a reward:
- The pod passes through a CP: +50, a new random CP is generated
- The pod times out: -10, state is reset to random state
- The pod goes too far from the CP: -10, state is reset to random state (actually this one may not even be necessary, it only saves a bit of time)
That’s it, and it works well. That said, using different values (like +1 / -1, +500 / -100, etc.) can completely mess up your ANN convergence… Once again, black voodoo magic™.
Initially I tried to be clever here as well, giving rewards depending on the pod’s timeout counter, its proximity to the CP, the remaining timeout when passing the CP, etc. That didn’t do any good and mostly confused the ANN, which didn’t converge most of the time.
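As an illustration, the three cases fit in a handful of lines. A sketch with a hypothetical pod type and helper names (only the +50 / -10 / 0 values come from the rules above):

// Sketch of the Runner reward cases described above. The pod type and every
// helper function here are hypothetical; only the reward values are the real ones.
template <typename Pod>
double evaluate_reward(Pod& pod) {
    if (crossed_checkpoint(pod)) {          // the pod passes through a CP
        generate_random_checkpoint(pod);    // a new random CP is generated
        return 50.0;
    }
    if (timed_out(pod)) {                   // the pod times out
        reset_to_random_state(pod);
        return -10.0;
    }
    if (too_far_from_checkpoint(pod)) {     // the pod wandered too far from the CP
        reset_to_random_state(pod);         // same reset, it just saves a bit of time
        return -10.0;
    }
    return 0.0;                             // nothing special happened this turn
}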
Numbers
Currently, I’m using the following values for the various parameters, and they gave satisfying results so far:
- 8 inputs for the Runner, 12 inputs for the Blocker
- 3 to 5 hidden layers, with 16 inputs/outputs each; more may be needed if the number of inputs increases
- γ = 0.95
- 1000000 to 2000000 training steps
- 50 to 100 batch updates per step
- a memory capacity of (training steps / 100)
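Gathered in one place, for reference (the field names are mine; the values are just the ones listed above):

// Training parameters from the list above (field names are hypothetical).
struct training_params {
    int    runner_inputs   = 8;
    int    blocker_inputs  = 12;
    int    hidden_layers   = 3;          // 3 to 5
    int    hidden_size     = 16;
    double gamma           = 0.95;
    long   training_steps  = 1000000;    // 1000000 to 2000000
    int    batch_updates   = 50;         // 50 to 100 per step
    long   memory_capacity = training_steps / 100;
};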
Conclusion
I think the best part of this method is that every time I tried to be clever with the reward values or with the ANN inputs, I failed.
It looks like it’s a big number-crunching machine that will do everything better than you do, so keep it simple, feed it raw data and it will do miracles. I still haven’t tried to feed it absolute or non-normalized inputs… who knows, maybe it would work even better.
But seeing an AI learn on its own is very satisfying, even though I’m not really doing anything special. Especially the Blocker AI, which I didn’t believe could work at first, but which did converge after all into an average/good blocker.
I would say that the disadvantages are:
- the time it takes to train, but thankfully I wrote small visualization windows so I can stare at bleeps and bloops to pass the time
- the various black voodoo magic™ parameters that can turn a well-converging ANN into complete nonsense if you tweak them a little bit wrong
- the code size required by the hardcoded ANN weights once the training is done*
* And I had to work around the CG code size limitation, in very very horrible ways…