MuZero

Like its predecessor AlphaZero, MuZero is an algorithm for mastering games, but it does so without knowing their rules.


We refer to the paper Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model and to this official DeepMind pseudocode for the coding part.

Introduction

The MuZero algorithm combines a tree-based search with a learned model, achieving superhuman performance in a variety of visually complex domains, without any knowledge of their underlying dynamics.

We will give a brief explanation of how MuZero works, using diagrams as intuitive support.

MuZero in short

MuZero is an algorithm for mastering games like Go, Chess, Shogi and Atari without being told the rules, thanks to its ability to plan winning strategies in unknown environments. To overcome the limitations of previous algorithms such as AlphaZero, MuZero does not model the entire environment; it models only the aspects that are crucial to the agent’s decision-making process: the value (how good the current position is), the policy (which action is best to take) and the reward (how good the last action was).

All these elements are learned with a deep neural network, and they are all that is needed to understand what happens when a certain action is taken and to plan accordingly.

As an overall picture, the MuZero algorithm consists of two parts: self-play (creating game data) and training (producing improved versions of the neural network).

How MuZero acts in the environment

Consider the following picture.

Fig. 1. MuZero algorithm acting in the environment

During a game, MuZero acts in the environment by performing a Monte Carlo Tree Search (MCTS) at each timestep t (the tree search procedure is depicted in Fig. 2). The MCTS algorithm outputs a recommended policy πt and an estimated value vt. An action at+1 is sampled from the policy πt; the environment receives the action and generates a new observation ot+1 and a new reward ut+1. At the end of the episode the trajectory data (the sequence of what has happened in terms of states, actions, rewards, etc.) is stored in a replay buffer.
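This acting loop is condensed in the play_game routine of the official pseudocode. The sketch below follows that routine closely, leaning on the helpers it defines (Node, expand_node, add_exploration_noise, run_mcts, select_action); treat it as an illustration of the flow rather than a standalone implementation.

def play_game(config: MuZeroConfig, network: Network) -> Game:
  game = config.new_game()  # wrapper around the real environment

  while not game.terminal() and len(game.history) < config.max_moves:
    # Build the root of the search tree from the current observation o_t.
    root = Node(0)
    current_observation = game.make_image(-1)
    expand_node(root, game.to_play(), game.legal_actions(),
                network.initial_inference(current_observation))
    add_exploration_noise(config, root)

    # Monte Carlo Tree Search at timestep t (see Fig. 2).
    run_mcts(config, root, game.action_history(), network)

    # Sample a_{t+1} from the visit-count policy pi_t, step the environment
    # (which returns o_{t+1} and u_{t+1}) and store the search statistics.
    action = select_action(config, len(game.history), root, network)
    game.apply(action)
    game.store_search_statistics(root)
  return game  # the finished trajectory is later saved to the replay buffer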

How MuZero uses its model to plan

The MuZero model consists of three main components: a representation function, a dynamics function and a prediction function. Each of these components is implemented as a neural network.

The root state s0 is initialized by a representation function h that encodes the past observations (e.g. the Go board or the Atari screen) o1, o2, . . ., ot. A difference between MuZero and previous approaches like AlphaZero is that a hidden state sk carries no semantics of the real environment state; it only has to support future predictions. MuZero has no knowledge of the rules and does not model the environment (this is why it needs not only a prediction function, but also a representation function and a dynamics function).

The policy pk and the value vk are computed from the state sk by a prediction function f. The policy is a probability distribution over all moves and the value is an estimate of the future reward. The prediction takes place every time the MCTS algorithm hits an unexplored leaf node, so that an immediate value can be assigned to it and probabilities can be assigned to the possible actions. After each simulated rollout, the predicted rewards and the leaf value are backed up the search path to update the value estimates of the visited nodes. Final outcomes (lose, draw, win) in board games are treated as rewards (−1, 0, +1) occurring at the final step of the episode. After several simulations, the root node gives the algorithm a reasonable estimate of the future value of the current state, having explored many possible futures.

The dynamics function g returns, at each step k, an immediate reward rk and a new internal state sk; it can be viewed as a function of the previous state sk-1 and the last selected action ak.
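In the official pseudocode these three functions are grouped behind a single Network interface with two calls: initial_inference (representation followed by prediction, used at the root) and recurrent_inference (dynamics followed by prediction, used inside the tree). The outline below is a minimal sketch of that interface, not a full implementation.

import collections

# One model call returns the quantities discussed above.
NetworkOutput = collections.namedtuple(
    'NetworkOutput', ['value', 'reward', 'policy_logits', 'hidden_state'])

class Network:
  """Groups the learned functions h (representation), g (dynamics), f (prediction)."""

  def initial_inference(self, observations) -> NetworkOutput:
    # representation: s_0 = h(o_1, ..., o_t)
    # prediction:     p_0, v_0 = f(s_0)   (the reward is 0 at the root)
    ...

  def recurrent_inference(self, hidden_state, action) -> NetworkOutput:
    # dynamics:   r_k, s_k = g(s_{k-1}, a_k)
    # prediction: p_k, v_k = f(s_k)
    ...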

Fig. 2. MuZero model components and Monte Carlo tree search

Let’s briefly describe the MCTS part (see Fig. 2). The game state is encoded by the representation network h into a root state. The path from the root to the most promising leaf node is built one action a at a time, always selecting the action with the highest UCB (Upper Confidence Bound) score. The prediction function f gives predictions of the policy p and the value v. The dynamics function g then yields a reward r and a new state s on which f acts again; this joint application of g and f is repeated until the path traversal ends. Finally, the value of the leaf node and the collected rewards are backpropagated up the tree to the root node.
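The whole procedure corresponds to the run_mcts routine of the official pseudocode, shown here in slightly condensed form; select_child is where the UCB score is evaluated, and MinMaxStats keeps the normalization bounds for values.

def run_mcts(config: MuZeroConfig, root: Node, action_history: ActionHistory,
             network: Network):
  min_max_stats = MinMaxStats(config.known_bounds)

  for _ in range(config.num_simulations):
    history = action_history.clone()
    node = root
    search_path = [node]

    # Selection: walk down the tree, always picking the child with the
    # highest UCB score, until an unexpanded leaf is reached.
    while node.expanded():
      action, node = select_child(config, node, min_max_stats)
      history.add_action(action)
      search_path.append(node)

    # Expansion: apply g and f (recurrent_inference) to the parent's hidden
    # state and the last selected action to evaluate the new leaf.
    parent = search_path[-2]
    network_output = network.recurrent_inference(parent.hidden_state,
                                                 history.last_action())
    expand_node(node, history.to_play(), history.action_space(), network_output)

    # Backup: propagate the leaf value and the collected rewards to the root.
    backpropagate(search_path, network_output.value, history.to_play(),
                  config.discount, min_max_stats)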

How MuZero trains its model

Training begins by sampling a trajectory from the replay buffer. Initially, h encodes the past observations o1, o2, . . ., ot from the sampled trajectory. The model is then unrolled for K steps and, at each step k, the state sk-1 and the true action at+k are fed to the dynamics function g. The parameters of the neural networks h, f and g are trained jointly to predict three quantities: the policy pk ≈ πt+k, the value vk ≈ zt+k and the reward rk ≈ ut+k. The value target zt+k is a sample return (the final reward for board games or an n-step return for Atari).
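This training step corresponds to the update_weights routine in the official pseudocode; the sketch below condenses it. scalar_loss and scale_gradient come from the pseudocode, while policy_loss stands in for the softmax cross-entropy used there, and the L2 weight-decay term is only hinted at in a comment.

def update_weights(optimizer, network: Network, batch, weight_decay: float):
  loss = 0
  for image, actions, targets in batch:
    # Initial step: h encodes the observations, f predicts from s_0.
    value, reward, policy_logits, hidden_state = network.initial_inference(image)
    predictions = [(1.0, value, reward, policy_logits)]

    # Recurrent steps: unroll g along the actions a_{t+k} actually taken.
    for action in actions:
      value, reward, policy_logits, hidden_state = network.recurrent_inference(
          hidden_state, action)
      predictions.append((1.0 / len(actions), value, reward, policy_logits))

    # Match each unrolled prediction against its targets pi_{t+k}, z_{t+k}, u_{t+k}.
    for prediction, target in zip(predictions, targets):
      gradient_scale, value, reward, policy_logits = prediction
      target_value, target_reward, target_policy = target
      l = (scalar_loss(value, target_value) +
           scalar_loss(reward, target_reward) +
           policy_loss(policy_logits, target_policy))
      loss += scale_gradient(l, gradient_scale)

  # The pseudocode also adds an L2 weight-decay term over the network weights.
  optimizer.minimize(loss)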

Fig. 3. MuZero training

Implementation

Understanding the inner workings of MuZero in detail would take a long time; for a quick overview, a reasonable approach is to focus on the main muzero function. The pseudocode for this entrypoint is the following (the full pseudocode is here).

# MuZero training is split into two independent parts: 
# Network training and self-play data generation.
# These two parts only communicate by transferring the latest 
# network checkpoint from the training to the self-play, and 
# the finished games from the self-play to the training.

def muzero(config: MuZeroConfig):
  storage = SharedStorage()
  replay_buffer = ReplayBuffer(config)

  for _ in range(config.num_actors):
    launch_job(run_selfplay, config, storage, replay_buffer)

  train_network(config, storage, replay_buffer)

  return storage.latest_network()

Notice that the first two objects created are a SharedStorage object to store/retrieve model parameters and a ReplayBuffer object to store/retrieve trajectory data. To make things work, MuZero must play thousands of games against itself, save them in a buffer and train on the data from those games. Initially, some simulations are run (num_actors is the number of parallel game simulations). After many self-play games have been completed, their data are stored in the replay buffer and subsequently retrieved to build training batches. After training, the model parameters (weights) are stored in the shared storage and later loaded to run further simulations. It is a circular process, as shown in the figure below.

Fig. 4. MuZero circular process
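The self-play half of this circle is the run_selfplay job launched for each actor; the version below is essentially the one from the official pseudocode, with comments added.

def run_selfplay(config: MuZeroConfig, storage: SharedStorage,
                 replay_buffer: ReplayBuffer):
  while True:
    network = storage.latest_network()  # freshest weights from the trainer
    game = play_game(config, network)   # one full self-play episode
    replay_buffer.save_game(game)       # trajectory data for future batches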

There are many code examples for MuZero; probably the most complete and functional is the one in this repository. However, if you want simple, easy-to-follow code (also convenient to step through with the debugging tools provided by several IDEs), you could also try this example, which uses TensorFlow 2 and Gym (specifically, the CartPole environment).

Useful links

MuZero original article.

MuZero pseudocode.

DeepMind page about MuZero.

Several code implementations.
