I think it’s when I understood that NNs are merely function approximators that I really started to understand what they are and how they work. Even in “classic” usage like classification and supervised learning, there’s a hidden ideal input → output mapping function, and training the NN is just about approximating it by tuning the parameters with gradient descent.
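To make that concrete, here’s a toy illustration (nothing to do with CSB, just the idea): the hidden mapping is y = 3x, and we approximate it with a one-parameter model by gradient descent on the squared error.

#include <cstdio>

int main() {
    double w = 0.0;                             // the single trainable parameter
    const double lr = 0.1;                      // learning rate
    for (int step = 0; step < 1000; ++step) {
        double x = (step % 10) / 10.0;          // a training input
        double target = 3.0 * x;                // the hidden ideal mapping: y = 3x
        double y = w * x;                       // model output
        double grad = 2.0 * (y - target) * x;   // d/dw of (y - target)^2
        w -= lr * grad;                         // gradient descent step
    }
    std::printf("learned w = %f (ideal: 3.0)\n", w);
}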
Well, this isn’t my idea. I believe it’s the way they do it in the DeepMind paper as well. I’m not sure about your concern though.
When the NN is updated, the reward gradient is back-propagated into it. So we update the last layer’s weights first, then the second-to-last, etc. Even if we back-propagated the rewards one at a time (which we don’t anyway), the inputs of the last layer would be the same for two different T, but the output gradients wouldn’t be.
From this output gradient we then compute the gradients of the last layer’s inputs (and of its weights). Both are different for two different T, because the output gradients are different. Since the gradient of the last layer’s inputs is then used as the output gradient of the second-to-last layer, the reward differences get propagated recursively through the whole network.
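In code form, a tiny hand-made sketch (illustrative only, not the actual training code): for a single linear output neuron, the same last-layer inputs with two different output gradients give different weight gradients and different input gradients, and the input gradient is exactly what gets passed back to the previous layer.

#include <array>
#include <cstdio>

// Minimal sketch: one linear output neuron y = w0*h0 + w1*h1.
// Back-propagating an output gradient dL/dy gives
//   dL/dw_i = dL/dy * h_i   (weight gradient)
//   dL/dh_i = dL/dy * w_i   (input gradient, passed to the previous layer)
// Same inputs h, different output gradients => different gradients everywhere.
int main() {
    std::array<double, 2> h = {0.5, -1.0};   // last-layer inputs (same for both samples)
    std::array<double, 2> w = {0.3, 0.7};    // last-layer weights

    for (double dy : {1.0, -2.0}) {          // two different output (reward) gradients
        double dw0 = dy * h[0], dw1 = dy * h[1];  // weight gradients
        double dh0 = dy * w[0], dh1 = dy * w[1];  // input gradients -> previous layer
        std::printf("dy=%+.1f  dw=(%+.2f, %+.2f)  dh=(%+.2f, %+.2f)\n",
                    dy, dw0, dw1, dh0, dh1);
    }
}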
I may be wrong, and for what it’s worth, the last layer of my network is also the only one that shows noticeable patterns. However, although the runner NN works OK with only one hidden layer, I could never get a blocker NN to converge with only one layer, whatever size I use.
I also had the intuition that making the good or bad moves stand out in the replay memory could help. So I tried inserting the moves into the memory according to their reward, so that good/bad moves would be replayed more often. I think it helped a little at the beginning of training, when moves are chosen at random and the NN only outputs random rewards, but it mostly ruined everything towards the end of training, when the good/bad moves are already known and we mostly want to refine the reward of the neutral moves that lead to them. Maybe this could be improved by also taking into account how far along the training session we are.
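Roughly, what I mean by inserting according to the reward is something like this (simplified sketch, not my actual replay-memory code):

#include <cmath>
#include <cstdlib>
#include <vector>

struct transition { /* state, action, next state, ... */ double reward; };

// Transitions with a larger absolute reward are inserted several times, so
// uniform sampling of the memory replays good/bad moves more often than
// neutral ones.
void insert_weighted(std::vector<transition>& memory, const transition& t) {
    int copies = 1 + static_cast<int>(std::abs(t.reward) * 4.0);  // assuming reward in [-1, 1]
    for (int i = 0; i < copies; ++i)
        memory.push_back(t);
}

transition sample(const std::vector<transition>& memory) {
    return memory[std::rand() % memory.size()];  // uniform over the biased memory
}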
But this is clearly a problem with the way training is done as well. For example, even training a runner with fixed rewards only on timeout or when a CP is reached, without any other constraint, doesn’t work very well. I had to add a distance-to-CP limit and a fixed negative reward when the pod escaped too far, to help discard such moves.
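For reference, the kind of terminal rewards I’m talking about looks roughly like this (names and values are illustrative, not the exact ones I use):

// Rough sketch of the fixed rewards described above.
double terminal_reward(double dist_to_cp, int turns_left, bool cp_reached,
                       double max_train_distance) {
    if (cp_reached)
        return 1.0;                 // fixed reward when the checkpoint is reached
    if (turns_left <= 0)
        return -1.0;                // fixed penalty on timeout
    if (dist_to_cp > max_train_distance)
        return -1.0;                // penalty when the pod escapes too far
    return 0.0;                     // otherwise no terminal reward this turn
}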
I guess, but I haven’t.
Yep, that’s it.
When a new initial game state is generated (initially or each time the pod loses), I create:
- Pod0 position random in [0 16000] x [0 9000]
- Pod0 speed random vector with random norm in [0 max_speed]
- Pod0 direction random unit vector
- CP0 position random in [0 16000] x [0 9000] so that 4 * point_radius < distance(CP0, Pod0) < max_train_distance / 2
- CP1, CP2 position random in [0 16000] x [0 9000] so that 4 * point_radius < distance(CPn, CPm) < max_train_distance
When the pod passes through CP0, I shift the CPs and I generate a new CP2 with the same rules as above. The pod’s next CP is always 0.
There’s also a Pod1 that’s generated using almost the same rules as Pod0 when I want to add a blocker to the game; all other pods are ignored.
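In code, the CP placement rule looks roughly like this (illustrative sketch using rejection sampling, not the exact code):

#include <cmath>
#include <random>

struct point { double x, y; };

double distance(point a, point b) { return std::hypot(a.x - b.x, a.y - b.y); }

// Draw random positions in the map until the distance constraint is met.
point random_cp(std::mt19937& rng, point previous, double min_dist, double max_dist) {
    std::uniform_real_distribution<double> dx(0.0, 16000.0), dy(0.0, 9000.0);
    point cp;
    do {
        cp = point{dx(rng), dy(rng)};
    } while (distance(cp, previous) < min_dist || distance(cp, previous) > max_dist);
    return cp;
}

// e.g. CP0: random_cp(rng, pod0.pos, 4 * point_radius, max_train_distance / 2)
//      CP1: random_cp(rng, cp0,      4 * point_radius, max_train_distance)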
Sure, it’s almost the same as the runner, except that the CPs provided are those of the target runner, and the runner’s position / speed / direction are also provided:
// Blocker’s own speed
inputs[0] = dot(point{1, 0} * pod.dir, pod.spd) / max_speed;
inputs[1] = dot(point{0, 1} * pod.dir, pod.spd) / max_speed;
// Runner’s position relative to the blocker
inputs[2] = dot(point{1, 0} * pod.dir, runner.pos - pod.pos) / max_distance;
inputs[3] = dot(point{0, 1} * pod.dir, runner.pos - pod.pos) / max_distance;
// Runner’s speed and facing direction
inputs[4] = dot(point{1, 0} * pod.dir, runner.spd) / max_speed;
inputs[5] = dot(point{0, 1} * pod.dir, runner.spd) / max_speed;
inputs[6] = dot(point{1, 0} * pod.dir, runner.dir);
inputs[7] = dot(point{0, 1} * pod.dir, runner.dir);
// Runner’s next three CPs, relative to the blocker
inputs[8] = dot(point{1, 0} * pod.dir, game::points[runner.next] - pod.pos) / max_distance;
inputs[9] = dot(point{0, 1} * pod.dir, game::points[runner.next] - pod.pos) / max_distance;
inputs[10] = dot(point{1, 0} * pod.dir, game::points[runner.next + 1] - pod.pos) / max_distance;
inputs[11] = dot(point{0, 1} * pod.dir, game::points[runner.next + 1] - pod.pos) / max_distance;
inputs[12] = dot(point{1, 0} * pod.dir, game::points[runner.next + 2] - pod.pos) / max_distance;
inputs[13] = dot(point{0, 1} * pod.dir, game::points[runner.next + 2] - pod.pos) / max_distance;
I got a little tired of waiting for the NN trainings to complete, actually. I need to speed things up, but that would require writing multithreaded NN training code, or GPU training code, or using a NN library that already does it, or buying a new and more powerful computer… In any case it’s not going to happen soon.
But I’m not giving up, and I will probably come back to it at some point. Also, it’s proved to work great on CSB, but I’m wondering if I could also make it work for other multiplayer games.
Noted, thanks!
Well sure, here it is: /* -*- C++ -*- include/005-drawing ------------------------------------------ */ - Pastebin.com. I used the cairo 2D drawing library, and the window initialization code is Linux-specific… but cairo is portable and can work with any OS windowing system as long as you can provide it a supported drawable surface (for example a Win32 surface, or a PNG image). Something more portable using the SDL or SFML libraries for window creation could certainly be written, but I didn’t want to spend too much time on it.
There’s an ENABLE_DRAWING preprocessor switch that allows me to write things like this in my CG code for local debugging, without compilation or runtime errors when running the code on CG:
debug(auto game_debug = draw::init(16000, 9000, 800, 450));
// [...]
debug({
// First pod: position, next two CPs, speed vector and facing direction
draw::circle(game_debug, state[0].pos, pod_radius, draw::rgba{1., 0., 0., 1.});
draw::circle(game_debug, game::points[state[0].next], point_radius, draw::rgba{0., 1., 0., 1.});
draw::circle(game_debug, game::points[state[0].next + 1], point_radius, draw::rgba{0., 1., 0., 1.});
draw::arrow(game_debug, state[0].pos, state[0].pos + state[0].spd * 1000., draw::rgba{1., 1., 0., 1.});
draw::arrow(game_debug, state[0].pos, state[0].pos + state[0].dir * 5000., draw::rgba{1., 0., 1., 1.});
// Second pod: same drawings
draw::circle(game_debug, state[1].pos, pod_radius, draw::rgba{1., 0., 0., 1.});
draw::circle(game_debug, game::points[state[1].next], point_radius, draw::rgba{0., 1., 0., 1.});
draw::circle(game_debug, game::points[state[1].next + 1], point_radius, draw::rgba{0., 1., 0., 1.});
draw::arrow(game_debug, state[1].pos, state[1].pos + state[1].spd * 1000., draw::rgba{1., 1., 0., 1.});
draw::arrow(game_debug, state[1].pos, state[1].pos + state[1].dir * 5000., draw::rgba{1., 0., 1., 1.});
});
// [...]
debug({
draw::flush(game_debug);
std::this_thread::sleep_for(std::chrono::milliseconds(16)); // ~60 frames per second
draw::clear(game_debug, draw::rgba{0., 0., 0., 0.75}); // semi-transparent black clear, so previous frames fade out
});
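The debug macro itself isn’t in the snippet above, but a minimal version could look like this (illustrative, my actual header may differ slightly):

// When ENABLE_DRAWING is defined (local builds), debug(...) pastes its argument
// as-is; otherwise (on CG) it expands to nothing, so the drawing code disappears.
#ifdef ENABLE_DRAWING
#define debug(...) __VA_ARGS__
#else
#define debug(...)
#endif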