Background
I was inspired by carykh's excellent Evolution Simulator series to try my own hand at programming a basic AI. I had some AI experience from grad school but had never implemented my own neural network from scratch, so I decided to try it out and see how far I could get. Note that all of the code for this project is available on GitHub.
I chose to have my AI learn to play the game 2048 since I enjoy that game quite a lot and it would be relatively simple to program from scratch. I built my own version of 2048 using Python and numpy and put a simple GUI on top of it to make the game look like the original. If you aren't familiar with the game, there is a 4 x 4 grid where tiles are randomly generated with values of 2 or 4. When you swipe left, right, up, or down, the tiles all move in that direction, and adjacent matching tiles combine to form a tile equal to their sum. The goal is to reach a tile with a value of 2048.
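To give a feel for the mechanics, here is a rough sketch of how a single row can be handled on a left swipe (a simplified illustration, not necessarily the exact code from the repo):

def merge_row_left(row):
    # Slide the non-zero tiles to the left, then merge equal neighbors once.
    tiles = [t for t in row if t != 0]
    merged = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # adjacent matching tiles combine into their sum
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))  # pad back out to the row length

merge_row_left([2, 2, 4, 0])  # -> [4, 4, 0, 0]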
Implementation
I decided that I wanted to design my own implementation of a neural network and train the network via genetic learning, as opposed to some of the traditional back-propagation/gradient descent methods. This was likely not the most efficient method to use, but I was more interested in exploring how to implement these systems rather than finding the most 'correct' one.
My neural network implementation was pretty standard. I created Neurons which were organized into Layers, and Layers were connected to form a Network. The hyperbolic tangent function was used as the activation function for each neuron. The network had 3 layers: the input layer had 16 neurons, each taking the value of a single tile as its one input; the hidden layer was fully connected to the input layer and had 8 neurons; and the output layer was fully connected to the hidden layer and had 4 neurons, one for each possible action (left, right, up, down). The neuron with the largest output value determined the selected move.
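As a minimal sketch of what a forward pass through this 16-8-4 network amounts to (using plain numpy arrays rather than the Neuron/Layer/Network classes from the repo):

import numpy as np

def forward(board, w1, b1, w2, b2):
    # board: the 4 x 4 grid as a numpy array; w1: (8, 16), b1: (8,), w2: (4, 8), b2: (4,)
    x = board.flatten()                 # one input per tile
    hidden = np.tanh(w1 @ x + b1)       # fully connected hidden layer with tanh activation
    output = np.tanh(w2 @ hidden + b2)  # one output neuron per move
    return int(np.argmax(output))       # index of the move with the largest output value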
My implementation of the genetic algorithm consisted of a collection of Agents (also known as a generation) where each agent had its own Genome. The genome was represented as a list of numbers where each value mapped to a weight or bias value in the neural network structure. These genomes were initialized randomly, and the goal was to learn the weights of the network that would maximize the agent's score in the game. Note that each game was scored as:
score = 10 * max_tile + 1 * sum_of_all_tiles
This scoring would give larger rewards for reaching bigger tiles, but would also ensure that progress towards the next tile was counted in the score as well.
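To make the genome-to-network mapping concrete, here is a sketch (with assumed array shapes for the 16-8-4 network and hypothetical helper names) of slicing a flat genome into weights and biases, along with the scoring formula above:

import numpy as np

def genome_to_weights(genome):
    # Slice the flat genome into the weight and bias arrays of the 16-8-4 network.
    g = np.asarray(genome, dtype=float)
    w1, g = g[:8 * 16].reshape(8, 16), g[8 * 16:]
    b1, g = g[:8], g[8:]
    w2, g = g[:4 * 8].reshape(4, 8), g[4 * 8:]
    b2 = g[:4]
    return w1, b1, w2, b2

def game_score(board):
    # score = 10 * max_tile + 1 * sum_of_all_tiles
    return 10 * board.max() + board.sum()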
For each generation, the process was:
def run_one_generation(self):
    self.select_survivors()         # keep a fraction of the current agents, weighted by score
    self.generate_offspring()       # refill the population by copying/crossing over selected parents
    self.mutate()                   # randomly perturb some genome values
    self.population_fitness_test()  # each agent plays a round of 2048 to get its score
First, a fraction of survivors from the current generation were selected via stochastic sampling without replacement, so that a higher score gave an agent a better chance of sticking around but didn't guarantee it a spot in the next generation. I did, however, specifically ensure that the best agent from the previous generation would survive.
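One way to implement that kind of selection (a sketch with made-up helper names, assuming non-negative scores) is score-weighted sampling without replacement plus a guaranteed slot for the best agent:

import numpy as np

def select_survivors(agents, scores, survival_rate, rng):
    # Keep the best agent outright, then fill the remaining survivor slots by
    # score-weighted sampling without replacement.
    n_survivors = int(len(agents) * survival_rate)
    best = int(np.argmax(scores))
    probs = np.asarray(scores, dtype=float)
    probs[best] = 0.0                    # the best agent already has a guaranteed spot
    probs = probs / probs.sum()
    others = rng.choice(len(agents), size=n_survivors - 1, replace=False, p=probs)
    return [agents[best]] + [agents[i] for i in others]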
Next, the offspring were generated by selecting a pair of parents for each pair of offspring needed and either directly copying both parents or crossing over their genomes to produce two similar, but slightly different, children. Note that one agent could be selected as a parent for multiple offspring, and higher-scoring agents had a better chance of being selected for reproduction.
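For illustration, a single-point crossover (one standard scheme, shown here as a sketch rather than the exact code used) might look like:

def crossover(parent_a, parent_b, crossover_rate, rng):
    # With probability crossover_rate, swap the genome tails at a random cut point;
    # otherwise the children are direct copies of the parents.
    if rng.random() >= crossover_rate:
        return list(parent_a), list(parent_b)
    cut = int(rng.integers(1, len(parent_a)))
    child_a = list(parent_a[:cut]) + list(parent_b[cut:])
    child_b = list(parent_b[:cut]) + list(parent_a[cut:])
    return child_a, child_b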
Next, every agent's genome had a chance of being mutated slightly by randomly replacing some values in the genome with new random values. This mutation step keeps the population diverse and lets it explore a larger solution space than the one it was initialized with.
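A sketch of this kind of mutation, assuming each gene is independently replaced with probability mutation_rate:

import numpy as np

def mutate(genome, mutation_rate, rng):
    # Each gene is independently replaced with a new random value with probability mutation_rate.
    genome = np.asarray(genome, dtype=float).copy()
    mask = rng.random(genome.shape) < mutation_rate
    genome[mask] = rng.uniform(-1.0, 1.0, size=int(mask.sum()))  # assumed range for new values
    return genome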
Finally, the population underwent a fitness test, in which each agent played a round of 2048 with its current genome serving as the weights and biases of the neural network. The test ended when the agent finished the game (win or lose) or stopped making progress (e.g. repeatedly sending the 'left' command when moving left is no longer a valid move).
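Putting the pieces together, the fitness test for one agent might look roughly like this (Game2048-style method names here are hypothetical stand-ins for the real game class in the repo):

def fitness_test(agent, game, max_stalled_moves=10):
    # Play one game: pick the move with the largest network output each turn, and stop
    # when the game ends or the board stops changing (a stalled agent).
    stalled = 0
    while not game.is_over() and stalled < max_stalled_moves:
        move = agent.choose_move(game.board)  # argmax over the 4 output neurons
        changed = game.apply_move(move)       # assume this returns False if the board didn't change
        stalled = 0 if changed else stalled + 1
    return 10 * game.board.max() + game.board.sum()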
I also added some quality-of-life features like the ability to save and load the current generation from a file and the ability to replay (and visualize) the previous game that an agent played. All random values were created using seeded random streams so the entire system would produce repeatable results when using the same initial seed.
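With numpy, that kind of reproducibility can come from explicit seeded generators rather than global random state (a sketch of the idea, not necessarily how the project wires it up):

import numpy as np

rng_game = np.random.default_rng(seed=42)      # controls where new tiles spawn
rng_evolution = np.random.default_rng(seed=7)  # controls selection, crossover, and mutation
# Re-running with the same seeds reproduces the exact same training run.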
The parameters I used for my testing were as follows:
survival_rate = 0.5
mutation_rate = 0.001
crossover_rate = 0.7
n_agents = 1500
Results
The initial generation of agents produced essentially random output, and they clearly didn't make it very far, as the median agent of that generation shows.
However, there were some agents who did relatively well by chance. This is the best agent from the first generation. You can see that its motions are still pretty random, with lots of back and forth, but this agent does well because it varies the selected move instead of always picking a single direction (which almost always ends the game within a few turns).
After training for a while, an improvement is definitely noticeable: the best agent after 500 generations reaches 256 and has all the tiles needed to make 512 when the game ends.
Looking at the median agent shows that the population as a whole is progressing. This agent gets two 64 tiles by the time it finishes the game.
After training for 500 more generations, the best agent can reach 512. However, the median agent doesn't seem to have progressed much, only just barely reaching 128. Additionally, looking at some of the surrounding generations shows that the best agent of this generation got lucky to hit 512 as only about 1 in every 10 generations had an agent reach 512.
Looking a little bit further, it seems that the population has learned the game up to a point, but there is still quite a bit of random luck involved in getting a good score. Taking the best agent and having it play a couple of new games generates some very interesting results (a new game uses different random seeds than the game it played to get its high score during the fitness test, which just changes where the new tiles pop up during the game). We can see that on new games, this agent performs very similarly to the median agent.
Additionally, having the median agent play two new games shows very similar results to the best agent's new games.
This suggests to me that the population has fundamentally learned as much of the game as it can with the parameters I have set and now the higher scores are just a matter of luck rather than agent strategy.
Conclusion
Is this the best possible result I can get from my AI? Probably not; however, it is where I am going to leave it for now. I achieved my goal of learning how neural networks and genetic algorithms work together to create agents that can learn to play a game, even if they didn't make it to the point of fully winning the game.
However, I have also learned (as I expected from the outset) that my implementation is slow and inefficient. Training these 1000 generations took a couple of days because I was running the code on my CPU (instead of a GPU like most machine learning tasks) and my code was not well optimized. This did not lend itself particularly well to trying different combinations of parameters, tweaks to the scoring function, or any other adjustments that required re-running the entire learning process many times.
One change that I think would have been of particular help would have been to have each agent scored by playing multiple rounds of the game, to better reward a more generalized understanding of the game, rather than rewarding one lucky round. However, since running the game was the slowest part of the code, increasing the runtime two or three-fold (or more) did not sound particularly appealing.
I am hoping to continue to tinker around with machine learning in the future, but with actual established tools such as TensorFlow or Keras. Maybe I will revisit 2048 at some point with one of those tools to see if I can do better.