Ultimate Tic Tac Toe [Puzzle discussion]

@Cyberpunk @k4ng0u @dbf @MaximePRZ @bortol @szoti @nmahoude @oreshnik @mdaw @LoganWlv also Great MCTS Java Programmers could you share your attitude as well?

I got into legend with java bot then. I donā€™t remember the playouts count, but it was more than 2k. For game implementation I used bitboards, the win check of small board is done with lookup table. For MCTS part, I used these enhancement: mcts solver; fine-tuning exploration constant (itā€™s generally much less than typical sqrt(2)); tree reuse; 1-ply search win check during expansion and during playouts - if there is a win move then use it.

2 Likes

Hi there,
as far as I remember, I compute approx. 160k simulations in the first round and 20k+ in later rounds. Despite bitboards and tuning the MCTS parameters I did a lot of memory and CPU usage profiling. A major performance bottleneck is the default Java PRNG as far as I recall. Using a LCG64ShiftRandom from Jenetics was a significant improvement. Avoid garbage collection by reusing tree nodes (I use a cache of 350k) and pre-computing stuff that is reused a lot. I hope this helps :slight_smile:

2 Likes

@luigi_gelato thanks for sharing. 20k+ in java is really impressive. You must have done much better than only LCG64ShiftRandom. I tried now switching my ThreadLocalRandom.current() to LCG64ShiftRandom and it doesnā€™t make any difference for better. But ThreadLocalRandom is already better than java.util.Random.

I made a lot of performance-related fixes in my code: custom random number generator, node object pools to avoid GC, lookup tables to check winners, precalculated open book for first several moves, avoid foreach/collections.
I think I have about 30-40k moves playouts (after openbook). It was hard to get into legend with Java, I have not optimized code for any game like that one.
CodingameFramework/README.md at master Ā· wala-fr/CodingameFramework Ā· GitHub is a good start to review what could be optimized.
One of the larges improvements was tuning of MCTS score parameter.

1 Like

@jacek1 thanks for sharing, I use EXPLORATION_PARAM=5, but for regular mcts (without solver). I donā€™t use bitboards. From what both of you write I will have to implement it if I want to progress.

One thing that may affect speed is calculating log only once, for parent, instead every time for child in UCT. Or precalculate all log values in lookup table.

I do not know if it can help you: I represented a small/local board as a number base 3 and the large/whole board as a number base 4; this allows to use it as the index in a list of options, or evaluation, lookup table: the simulation becomes fairly quick. Also I used symmetries of he board to reduce the 81 first moves to ~ 15ā€¦

1 Like

@jgall I got hooked up with the game again after your post and started to re-implement it in Rust :wink: (not fully done yet). I dived into my old Java code and noticed a few things that might help you:

  1. Use a profiler to identify critical paths in your code base. I run the profiler against benchmarks for my game engine (with a random move AI) and for my AIs. Use the observations as basis for hypotheses where optimization may improve your outcomes. Of course, this focuses on technical aspects and not on tuning search space exploration/exploitationā€¦ This is optimization 101, but generally people rely a lot on their intuition, which is surprisingly often wrong (at least for myself)
  2. Have a solid test bed. I developed the code mostly test driven or test first. I ensured to cover all possible edge cases. Additionally, I added code behind a feature flag to validate that actions passed to my game engine are sound (and turn it on in CG only if something is obviously broken, but always run it locally). This way, I am pretty confident that changes donā€™t break my solution and it already saved me countless hours of debugging because my tests where telling me straightaway where things go wild.
  3. One of my first findings after applying rule 1: avoid creating new objects that may get garbaged collected later as much as possible. This may sound intuitive to some, but for me GC was never an optimization topic in my professional work :wink: - if the GC starts its work, youā€™ll either timeout or loose precious CPU cycles. A side effect of pre-computing or caching stuff is that you can do all of that in the first round with lots of time, so you wonā€™t loose time for memory allocations and potentially complex computations later onā€¦ However, this is trickier than it sounds and contributed to a significantly more complex code base than I had before these optimizations.
  4. When re-implementing the game in Rust, I first was surprised that using bitboards did not give me significant performance improvements over my simple 2D array and rule based game logic. I didnā€™t follow rule 1 :-). When using MCTS, youā€™ll spend most of your time on random playouts. And this was (and still is) my bottleneck. So besides using a more performant PRNG, I optimized the way I pick a random action. I donā€™t determine all possible actions first and then pick a random one from this list. Instead I maintain counters to keep track of how many actions are remaining (overall and per board). I use this info to ā€œrollā€ a random index and directly determine the random action based on this index. Using bitboards was very useful to implement this strategy. With this change, I observed a massive performance boost (from 10k to 20k in the first rounds).
  5. I used a very simple MCTS with standard exploration constant 1/sqrt(2). So I seem to not have optimized anything recarding MCTS at the timeā€¦
1 Like

i have some questions about bitboards storing in c++ ?

i actually use a std::bitset<81> to store mutliple board masks (player_x, player_o, non_playable_moves, ā€¦) but when i read some code extracts of this topic, i see some people storing them in uint16_t (but from my understanding, the full board cannot be stored in a 16-bits long variable).

So i have few questions about that :

Is my understanding of the uint16_t type correct ?

Is it better to store a full board in one big variable or to store a full board across multiple 9-bits long boards ?

is there a way to manipulate bitboards with a structure more efficient than bitsets ?

Is my understanding of the uint16_t type correct?

Yes. uint16_t has exactly 16 bits, so you can only store a small board with 2 of it, like so:

struct small_board_t {
    uint16_t x, o;
};

Is it better to store a full board in one big variable or to store a full board across multiple 9-bits long boards?

It mainly depends on the performance of the operations (making moves, generating legal moves, checking win conditions, etc.) you design. If you can design faster operations for a particular representation, then that representation may be right for you. I personally use the latter approach, as I think that it is easier to work on and enables reasonably fast operations.

Memory usage is yet another consideration.

Is there a way to manipulate bitboards with a structure more efficient than bitsets?

As far as I know, std::bitset is just a wrapper around an array of integers, with its methods implemented using efficient bit-wise operations. So for operations supported on std::bitset, I donā€™t see how one can do better by directly using an integer or an array thereof and re-implementing the operations. But if one wants to use operations not supported on std::bitset to speed up computation, like multiplying the value by a magic constant, then itā€™s definitely better to directly use an integer or an array thereof.

1 Like

Could someone explain to me the logic of children storing with the Magus node class ?

class Node {
public:
    Node* childrenBegin;
    Node* childrenEnd;
    unsigned char grid;
    unsigned char action;
    float visits;
    float wins;
};

I tried to implement it with a static list of 5 000 000 nodes, but after 7 rounds the list is fullā€¦ I feel like Iā€™m missing something, hereā€™s the logic Iā€™m have set up:

I initialize my nodelist + a static counter of the last nodelist index:

Node pool[5_000_000];
long pool_index = 0;

every time I need a new node, I get the following available index like this:

Node* get_new_node(){
    return &pool[pool_index++];
}

every time I need to access the children of a node, I reserve n nodes from the nodelist:

void init_node_children(Node* parent, int children_count){

    if(parent->initilized)
        return;

    parent->children_begin = get_new_node();
    parent->children_end = parent->children_begin + children_count;
    parent->initialized = true;
}

This way of doing things has the advantage of making my life easier to pick up a random child, but it has the major drawback of taking up too much spaceā€¦

I think that I could fix this problem by dynamically adding a child to a node, each time I need to explore that specific child (instead of adding ALL node children), but I have no idea how to handle the list of children of a node with only two pointers

Note: at the moment Iā€™m not using UCT, Iā€™m just doing full-random, until the game is over

Thanks a lot for your answer, iā€™m going to try the multiple 9-bits grids approach, to see i ican win some performance !

My guess is you are probably initialising too many nodes. In mcts generally you should be initialising one node per iteration (selection/exploration/rollout/backprop) - the moves in the random rollout phase shouldnā€™t be using the tree or using nodes as this will be very slow and use too much memory. Most positions are only ever visited once so storing information about them isnā€™t helpful.

The next thing to check is if you can have more nodes (your node is probably 32 bytes in size after padding so 5 million of them is only 160mb - plenty of spare memory). My pool was 15 million for pure mcts.

After that you need a way to reuse nodes. Iā€™d recommend just starting a fresh tree every turn (set your pool_index to 0) for now. There are a few ways so reuse the tree across turns (resetting to 0 when you are getting near full or a circular buffer of nodes), but this tends to be a very small increase in bot strength and definitely not something to be worried about until youā€™ve got MCTS fully working.

1 Like

i like the game!

Hello, Im silver now in Ultimate Tic Tac Toe. My code was optimizing every possible move to win the game (begin in center, then goes to the extremities. If you can win a line do it, if not try do defend, ā€¦)

I code in python3.
Now I want to try MCTS. Iā€™ve understood the theory, but I donā€™t understand the way in CodinGame. I need to try every path to score it.
To do it , I need to simulate a lot of games. The only way I find to do it in Condingames, it is with the function ā€œprintā€ (example : print(row,column)); When I do this thhe game is progressing and I canā€™t simulate, I canā€™t go back. I think itā€™s obvious but I donā€™t know.
Can you help me?

If you want to simulate a game, do not send anything to CG. The interface is only designed for playing moves. There is no undo and, in any case, it would be awfully slow.
What you must do is manage games by yourself within your program: pick your starting position, select a move for white, then for black and repeat the process until the end of the game. Then restart a new game until you are out of time. Thus, you will accumulate game results that will feed your MCTS.
One last note: MCTS is all about simulating as many games as possible. Python is not the right tool for that. You have absolutely 0 chance to make it to Legend.

1 Like

Went up to 8th in golden league, with around 10k rollouts(basic monte carlo tree search) after turn 1( using C++). It turns out i wasnā€™t using compilation pragma. When added, rollouts went up to around 25k. I didnā€™t know that and found this hint by luck and it happens to be enough to reach legend.

1 Like

Congrats. Now optimize further your bot to reach 80-90k and add a solver to make it into top 50.

Added solver and now sitting around 120. Still very far from 80k sim, I think that my simulation is very slow (2 micro seconds on average for a simulation from start position to endgame), but iā€™m clueless on how to improve it. Currently i use bitboard, and limit win condition (only when a miniboard get closed). Any idea if it is generally worth going the base 3 encoding vs bitboard ?