I think it’s when I understood that NNs are merely function approximators that I really started to understand what they are and how they work. Even in “classic” usage like classification and supervised learning, there’s a hidden ideal input → output mapping function, and training the NN is just about approximating it by tuning the parameters with gradient descent.
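To make that concrete, here’s a toy illustration (nothing to do with CSB, just the idea): the hidden mapping is y = 3x, and we approximate it with a one-parameter model by gradient descent on the squared error.

#include <cstdio>

int main() {
    double w = 0.0;                             // the single trainable parameter
    const double lr = 0.1;                      // learning rate
    for (int step = 0; step < 1000; ++step) {
        double x = (step % 10) / 10.0;          // a training input
        double target = 3.0 * x;                // the hidden ideal mapping: y = 3x
        double y = w * x;                       // model output
        double grad = 2.0 * (y - target) * x;   // d/dw of (y - target)^2
        w -= lr * grad;                         // gradient descent step
    }
    std::printf("learned w = %f (ideal: 3.0)\n", w);
}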
Well, this isn’t my idea. I believe it’s the way they do it in the DeepMind paper as well. I’m not sure about your concern though.
When the NN is updated, the reward gradient is back-propagated into it. So we update the last layer’s weights first, then the second-to-last, etc. Even if we back-propagated the rewards one at a time (which we don’t anyway), the inputs of the last layer would be the same for two different T, but the output gradients wouldn’t be.
From this output gradient we then compute the gradients of the last layer’s inputs (and of its weights). Both are different for two different T, because the output gradients are different. Since the gradient of the last layer’s inputs is then used as the output gradient of the second-to-last layer, the reward differences get propagated recursively through the whole network.
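In code form, a tiny hand-made sketch (illustrative only, not the actual training code): for a single linear output neuron, the same last-layer inputs with two different output gradients give different weight gradients and different input gradients, and the input gradient is exactly what gets passed back to the previous layer.

#include <array>
#include <cstdio>

// Minimal sketch: one linear output neuron y = w0*h0 + w1*h1.
// Back-propagating an output gradient dL/dy gives
//   dL/dw_i = dL/dy * h_i   (weight gradient)
//   dL/dh_i = dL/dy * w_i   (input gradient, passed to the previous layer)
// Same inputs h, different output gradients => different gradients everywhere.
int main() {
    std::array<double, 2> h = {0.5, -1.0};   // last-layer inputs (same for both samples)
    std::array<double, 2> w = {0.3, 0.7};    // last-layer weights

    for (double dy : {1.0, -2.0}) {          // two different output (reward) gradients
        double dw0 = dy * h[0], dw1 = dy * h[1];  // weight gradients
        double dh0 = dy * w[0], dh1 = dy * w[1];  // input gradients -> previous layer
        std::printf("dy=%+.1f  dw=(%+.2f, %+.2f)  dh=(%+.2f, %+.2f)\n",
                    dy, dw0, dw1, dh0, dh1);
    }
}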
I may be wrong, and for what it’s worth, the last layer of my network is also the only one that shows noticeable patterns. However, although the runner NN works OK with only one hidden layer, I could never get a blocker NN to converge with only one layer, whatever size I use.
I also had the intuition that making the good or bad moves stand out in the replay memory could help. So I tried inserting the moves into the memory according to their reward, so that good/bad moves would be replayed more often. I think it helped a little at the beginning of training, when moves are chosen at random and the NN only outputs random rewards, but it mostly ruined everything towards the end of training, when the good/bad moves are already known and we mostly want to refine the reward of the neutral moves that lead to them. Maybe this could be improved by also taking into account how far along the training session we are.
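Roughly, what I mean by inserting according to the reward is something like this (simplified sketch, not my actual replay-memory code):

#include <cmath>
#include <cstdlib>
#include <vector>

struct transition { /* state, action, next state, ... */ double reward; };

// Transitions with a larger absolute reward are inserted several times, so
// uniform sampling of the memory replays good/bad moves more often than
// neutral ones.
void insert_weighted(std::vector<transition>& memory, const transition& t) {
    int copies = 1 + static_cast<int>(std::abs(t.reward) * 4.0);  // assuming reward in [-1, 1]
    for (int i = 0; i < copies; ++i)
        memory.push_back(t);
}

transition sample(const std::vector<transition>& memory) {
    return memory[std::rand() % memory.size()];  // uniform over the biased memory
}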
But this is clearly a problem with the way training is done as well. For example, even training a runner with fixed rewards only on timeout or when a CP is reached, without any other constraint, doesn’t work very well. I had to add a distance-to-CP limit and a fixed negative reward when the pod escaped too far, to help discard such moves.
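For reference, the kind of terminal rewards I’m talking about looks roughly like this (names and values are illustrative, not the exact ones I use):

// Rough sketch of the fixed rewards described above.
double terminal_reward(double dist_to_cp, int turns_left, bool cp_reached,
                       double max_train_distance) {
    if (cp_reached)
        return 1.0;                 // fixed reward when the checkpoint is reached
    if (turns_left <= 0)
        return -1.0;                // fixed penalty on timeout
    if (dist_to_cp > max_train_distance)
        return -1.0;                // penalty when the pod escapes too far
    return 0.0;                     // otherwise no terminal reward this turn
}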
I guess, but I haven’t.
Yep, that’s it.
When a new initial game state is generated (initially or each time the pod loses), I create:
- Pod0 position random in [0 16000] x [0 9000]
- Pod0 speed random vector with random norm in [0 max_speed]
- Pod0 direction random unit vector
- CP0 position random in [0 16000] x [0 9000] so that 4 * point_radius < distance(CP0, Pod0) < max_train_distance / 2
- CP1, CP2 position random in [0 16000] x [0 9000] so that 4 * point_radius < distance(CPn, CPm) < max_train_distance
When the pod passes through CP0, I shift the CPs and I generate a new CP2 with the same rules as above. The pod’s next CP is always 0.
There’s also a Pod1 that’s generated using almost the same rules as Pod0 when I want to add a blocker to the game; all other pods are ignored.
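In code, the CP placement rule looks roughly like this (illustrative sketch using rejection sampling, not the exact code):

#include <cmath>
#include <random>

struct point { double x, y; };

double distance(point a, point b) { return std::hypot(a.x - b.x, a.y - b.y); }

// Draw random positions in the map until the distance constraint is met.
point random_cp(std::mt19937& rng, point previous, double min_dist, double max_dist) {
    std::uniform_real_distribution<double> dx(0.0, 16000.0), dy(0.0, 9000.0);
    point cp;
    do {
        cp = point{dx(rng), dy(rng)};
    } while (distance(cp, previous) < min_dist || distance(cp, previous) > max_dist);
    return cp;
}

// e.g. CP0: random_cp(rng, pod0.pos, 4 * point_radius, max_train_distance / 2)
//      CP1: random_cp(rng, cp0,      4 * point_radius, max_train_distance)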
Sure, it’s almost the same as the runner, except that the CPs provided are those of the target runner, and the runner’s position / speed / direction are also provided:
// Blocker’s own speed
inputs[0] = dot(point{1, 0} * pod.dir, pod.spd) / max_speed;
inputs[1] = dot(point{0, 1} * pod.dir, pod.spd) / max_speed;
// Runner’s position relative to the blocker
inputs[2] = dot(point{1, 0} * pod.dir, runner.pos - pod.pos) / max_distance;
inputs[3] = dot(point{0, 1} * pod.dir, runner.pos - pod.pos) / max_distance;
// Runner’s speed and facing direction
inputs[4] = dot(point{1, 0} * pod.dir, runner.spd) / max_speed;
inputs[5] = dot(point{0, 1} * pod.dir, runner.spd) / max_speed;
inputs[6] = dot(point{1, 0} * pod.dir, runner.dir);
inputs[7] = dot(point{0, 1} * pod.dir, runner.dir);
// Runner’s next three CPs, relative to the blocker
inputs[8] = dot(point{1, 0} * pod.dir, game::points[runner.next] - pod.pos) / max_distance;
inputs[9] = dot(point{0, 1} * pod.dir, game::points[runner.next] - pod.pos) / max_distance;
inputs[10] = dot(point{1, 0} * pod.dir, game::points[runner.next + 1] - pod.pos) / max_distance;
inputs[11] = dot(point{0, 1} * pod.dir, game::points[runner.next + 1] - pod.pos) / max_distance;
inputs[12] = dot(point{1, 0} * pod.dir, game::points[runner.next + 2] - pod.pos) / max_distance;
inputs[13] = dot(point{0, 1} * pod.dir, game::points[runner.next + 2] - pod.pos) / max_distance;
I got a little tired of waiting for the NN trainings to complete, actually. I need to speed things up, but that would require writing multithreaded NN training code, or GPU training code, or using a NN library that already does it, or buying a new and more powerful computer… In any case it’s not going to happen soon.
But I’m not giving up, and I will probably come back to it at some point. Also, it’s proved to work great on CSB, but I’m wondering if I could also make it work for other multiplayer games.
Noted, thanks!
Well sure, here it is: /* -*- C++ -*- include/005-drawing ------------------------------------------ */ - Pastebin.com. I used the cairo 2D drawing library, and the window initialization code is Linux-specific… but cairo is portable and can work with any OS windowing system as long as you can provide it a supported drawable surface (for example a Win32 surface, or a PNG image). Something more portable using the SDL or SFML libraries for window creation could certainly be written, but I didn’t want to spend too much time on it.
There’s an ENABLE_DRAWING preprocessor switch that allows me to write things like this in my CG code for local debugging, without compilation or runtime errors when running the code on CG:
debug(auto game_debug = draw::init(16000, 9000, 800, 450));
// [...]
debug({
// First pod: position, next two CPs, speed vector and facing direction
draw::circle(game_debug, state[0].pos, pod_radius, draw::rgba{1., 0., 0., 1.});
draw::circle(game_debug, game::points[state[0].next], point_radius, draw::rgba{0., 1., 0., 1.});
draw::circle(game_debug, game::points[state[0].next + 1], point_radius, draw::rgba{0., 1., 0., 1.});
draw::arrow(game_debug, state[0].pos, state[0].pos + state[0].spd * 1000., draw::rgba{1., 1., 0., 1.});
draw::arrow(game_debug, state[0].pos, state[0].pos + state[0].dir * 5000., draw::rgba{1., 0., 1., 1.});
// Second pod: same drawings
draw::circle(game_debug, state[1].pos, pod_radius, draw::rgba{1., 0., 0., 1.});
draw::circle(game_debug, game::points[state[1].next], point_radius, draw::rgba{0., 1., 0., 1.});
draw::circle(game_debug, game::points[state[1].next + 1], point_radius, draw::rgba{0., 1., 0., 1.});
draw::arrow(game_debug, state[1].pos, state[1].pos + state[1].spd * 1000., draw::rgba{1., 1., 0., 1.});
draw::arrow(game_debug, state[1].pos, state[1].pos + state[1].dir * 5000., draw::rgba{1., 0., 1., 1.});
});
// [...]
debug({
draw::flush(game_debug);
std::this_thread::sleep_for(std::chrono::milliseconds(16)); // ~60 frames per second
draw::clear(game_debug, draw::rgba{0., 0., 0., 0.75}); // semi-transparent black clear, so previous frames fade out
});
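The debug macro itself isn’t in the snippet above, but a minimal version could look like this (illustrative, my actual header may differ slightly):

// When ENABLE_DRAWING is defined (local builds), debug(...) pastes its argument
// as-is; otherwise (on CG) it expands to nothing, so the drawing code disappears.
#ifdef ENABLE_DRAWING
#define debug(...) __VA_ARGS__
#else
#define debug(...)
#endif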