
History

MuZero is a model-based reinforcement learning algorithm. It is the most recent in a long line of reinforcement learning algorithms developed by researchers at DeepMind. The story starts in 2014 with the creation of the famous AlphaGo, which defeated Go champion Lee Sedol in 2016.

AlphaGo relies on Monte Carlo Tree Search (MCTS), an established planning technique, to explore the rest of the game as a decision tree. The key addition is a neural network that guides the search toward promising moves, drastically reducing the number of future moves that have to be simulated.
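To make that idea concrete, here is a minimal, purely illustrative sketch of PUCT-style child selection, the kind of scoring rule used by the AlphaGo family to let the network prior steer the search. The class and field names are ours for illustration, not code from this repository.

```python
import math

class Node:
    """A node in the search tree (illustrative, not the repository's class)."""
    def __init__(self, prior):
        self.prior = prior        # P(s, a): probability assigned to this move by the network
        self.visit_count = 0      # N(s, a)
        self.value_sum = 0.0      # sum of values backed up through this node
        self.children = {}        # action -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.25):
    """Pick the child maximizing Q + U, where U is boosted by the network prior.

    Moves the network considers promising are explored first, so far fewer
    simulations are needed than with uniform (vanilla) MCTS.
    """
    total_visits = sum(child.visit_count for child in node.children.values())

    def score(action_child):
        _, child = action_child
        exploration = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.value() + exploration

    return max(node.children.items(), key=score)
```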

AlphaGo's neural network is initially trained on a history of games played by humans. In 2017, AlphaGo was replaced by AlphaGo Zero, which no longer uses this history. For the first time, the algorithm learns to play entirely on its own, without any prior knowledge of Go strategies: it plays against itself (with MCTS) and then learns from those games (by training the neural network).

AlphaZero is a version of AlphaGo Zero from which all the Go-specific tricks were removed, so that it generalizes to other board games.

In November 2019, the DeepMind team released a new, even more general version of AlphaZero called MuZero. Its particularity is to be model-based, meaning that the algorithm builds its own internal model of the game: the rules are learned by the neural network rather than hard-coded. MuZero can thus imagine on its own what the game will look like if a given action is taken, whereas before MuZero the effect of an action on the game had to be hard-coded. As a result, it can approach any game like a human discovering it for the first time.
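Concretely, MuZero replaces the hand-coded simulator with three learned functions, usually called representation, dynamics and prediction. The toy sketch below only illustrates their roles and signatures; the layer sizes, class name and method names are our own assumptions, not the network defined in models.py.

```python
import torch
import torch.nn as nn

class TinyMuZeroNetwork(nn.Module):
    """Toy illustration of MuZero's three functions (not the repository's model)."""

    def __init__(self, observation_size, action_space_size, hidden_size=32):
        super().__init__()
        # h: observation -> hidden state (the network's internal "understanding")
        self.representation = nn.Linear(observation_size, hidden_size)
        # g: (hidden state, action) -> next hidden state and reward (the learned rules)
        self.dynamics = nn.Linear(hidden_size + action_space_size, hidden_size + 1)
        # f: hidden state -> policy logits and value (used to guide MCTS)
        self.prediction = nn.Linear(hidden_size, action_space_size + 1)
        self.action_space_size = action_space_size

    def initial_inference(self, observation):
        hidden = torch.tanh(self.representation(observation))
        policy_value = self.prediction(hidden)
        return hidden, policy_value[..., :-1], policy_value[..., -1]

    def recurrent_inference(self, hidden, action):
        # One-hot encode the action and "imagine" the next state without the real rules
        one_hot = nn.functional.one_hot(action, self.action_space_size).float()
        out = self.dynamics(torch.cat([hidden, one_hot], dim=-1))
        next_hidden, reward = torch.tanh(out[..., :-1]), out[..., -1]
        policy_value = self.prediction(next_hidden)
        return next_hidden, reward, policy_value[..., :-1], policy_value[..., -1]
```

During planning, MuZero calls initial_inference once on the real observation, then unrolls recurrent_inference inside the search tree, never touching the real environment.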

Further reading

To understand the technical parts, we found the following articles to be interesting.

Our implementation

This repository is based on the MuZero pseudocode. We filled in the missing parts and used Ray to run the different components in parallel.
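For readers unfamiliar with Ray, the minimal example below shows the mechanism we rely on: a class decorated with @ray.remote becomes an actor that lives in its own worker process and can be called asynchronously. The Counter class is just an illustration, not part of this repository.

```python
import ray

ray.init()

@ray.remote
class Counter:
    """A toy actor: each instance runs in its own worker process."""
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

# The two actors run concurrently; .remote() returns immediately with a future
counter_a, counter_b = Counter.remote(), Counter.remote()
futures = [counter_a.increment.remote(), counter_b.increment.remote()]
print(ray.get(futures))  # [1, 1]
```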

Structure

There are four components, implemented as classes that run simultaneously, each in a dedicated thread. The shared storage holds the latest neural network weights; the self-play workers use those weights to generate games and store them in the replay buffer. Finally, the trainer uses those played games to train the network and stores the updated weights back in the shared storage, closing the loop.
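To make the data flow explicit, here is a simplified, purely illustrative sketch of that loop. In the repository the components run concurrently as Ray actors, so the function, method and variable names below are assumptions for illustration rather than the actual API.

```python
def training_loop(shared_storage, replay_buffer, self_play, trainer, num_iterations):
    """Illustrative data flow between the four components (the real ones run concurrently)."""
    for _ in range(num_iterations):
        # 1. Self-play reads the latest weights from the shared storage...
        weights = shared_storage.get_weights()
        # 2. ...plays a game with MCTS guided by those weights...
        game_history = self_play.play_game(weights)
        # 3. ...and stores the game in the replay buffer.
        replay_buffer.save_game(game_history)
        # 4. The trainer samples batches from the buffer, updates the network...
        batch = replay_buffer.sample_batch()
        new_weights = trainer.update_weights(batch)
        # 5. ...and publishes the new weights back to the shared storage.
        shared_storage.set_weights(new_weights)
```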

Those components are launched and managed from the MuZero class in muzero.py and the structure of the neural network is defined in models.py.
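As a quick orientation, a run looks roughly like the sketch below; the game name and method call are examples rather than a guaranteed interface, so check muzero.py for the exact, up-to-date API.

```python
# Hypothetical usage sketch; see muzero.py for the actual interface.
from muzero import MuZero

muzero = MuZero("cartpole")  # the game name here is only an example
muzero.train()               # launches self-play, replay buffer, trainer and shared storage
```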

[Diagram: how the four components interact]

See also: MuZero network summary