MuZero

Like its predecessor AlphaZero, MuZero is an algorithm for mastering games, but it does so without knowing their rules.


We refer to the paper Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model and to this official DeepMind pseudocode for the coding part.

Introduction

The MuZero algorithm combines a tree-based search with a learned model, achieving superhuman performance in a variety of visually complex domains, without any knowledge of their underlying dynamics.

We will give a brief explanation of how MuZero works, using diagrams as intuitive support.

MuZero in short

MuZero is an algorithm for mastering games like Go, Chess, Shogi and Atari without being told the rules, thanks to its ability to plan winning strategies in unknown environments. To overcome the limitations of previous algorithms such as AlphaZero, MuZero does not model the entire environment; it models only the aspects that are crucial to the agent’s decision-making process: the value (how good the current position is), the policy (which action is best to take) and the reward (how good the last action was).

All these elements are learned with a deep neural network, and they are all that is needed to understand what happens when a certain action is taken and to plan accordingly.

As an overall picture, the MuZero algorithm consists of two parts: self-play (creating game data) and training (producing improved versions of the neural network).

How MuZero acts in the environment

Consider the following picture.

Fig. 1. MuZero algorithm acting in the environment

During a game, MuZero acts in the environment by performing a Monte Carlo Tree Search (MCTS) at each timestep t (the tree search procedure is depicted in Fig. 2). The MCTS algorithm outputs a recommended policy πt and an estimated value vt. An action at+1 is sampled from the policy πt; the environment receives the action and generates a new observation ot+1 and a new reward ut+1. At the end of the episode the trajectory data (the sequence of what has happened in terms of states, actions, rewards, etc.) is stored in a replay buffer.
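This acting loop is condensed in the play_game routine of the official pseudocode. The sketch below follows that routine closely, leaning on the helpers it defines (Node, expand_node, add_exploration_noise, run_mcts, select_action); treat it as an illustration of the flow rather than a standalone implementation.

def play_game(config: MuZeroConfig, network: Network) -> Game:
  game = config.new_game()  # wrapper around the real environment

  while not game.terminal() and len(game.history) < config.max_moves:
    # Build the root of the search tree from the current observation o_t.
    root = Node(0)
    current_observation = game.make_image(-1)
    expand_node(root, game.to_play(), game.legal_actions(),
                network.initial_inference(current_observation))
    add_exploration_noise(config, root)

    # Monte Carlo Tree Search at timestep t (see Fig. 2).
    run_mcts(config, root, game.action_history(), network)

    # Sample a_{t+1} from the visit-count policy pi_t, step the environment
    # (which returns o_{t+1} and u_{t+1}) and store the search statistics.
    action = select_action(config, len(game.history), root, network)
    game.apply(action)
    game.store_search_statistics(root)
  return game  # the finished trajectory is later saved to the replay buffer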

How MuZero uses its model to plan

The MuZero model consists of three main components: a representation function, a dynamics function and a prediction function. Each of these components is implemented as a neural network.

The root state s0 is initialized by a representation function h that encodes the past observations (e.g. the Go board or the Atari screen) o1, o2, . . ., ot. A difference between MuZero and previous approaches like AlphaZero is that a hidden state sk carries no semantics of the real environment state; it only has to support future predictions. MuZero has no knowledge of the rules and does not model the environment (this is why it needs not only a prediction function, but also a representation function and a dynamics function).

The policy pk and the value vk are computed from the state sk by a prediction function f. The policy is a probability distribution over all moves and the value is an estimate of the future reward. The prediction takes place every time the MCTS algorithm hits an unexplored leaf node, so that an immediate value can be assigned to it and probabilities can be assigned to the possible actions. After each simulated rollout, the predicted rewards and the leaf value are backed up the search path to update the value estimates of the visited nodes. Final outcomes (lose, draw, win) in board games are treated as rewards (−1, 0, +1) occurring at the final step of the episode. After several simulations, the root node gives the algorithm a reasonable estimate of the future value of the current state, having explored many possible futures.

The dynamics function g returns, at each step k, an immediate reward rk and a new internal state sk; it can be viewed as a function of the previous state sk-1 and the last selected action ak.
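In the official pseudocode these three functions are grouped behind a single Network interface with two calls: initial_inference (representation followed by prediction, used at the root) and recurrent_inference (dynamics followed by prediction, used inside the tree). The outline below is a minimal sketch of that interface, not a full implementation.

import collections

# One model call returns the quantities discussed above.
NetworkOutput = collections.namedtuple(
    'NetworkOutput', ['value', 'reward', 'policy_logits', 'hidden_state'])

class Network:
  """Groups the learned functions h (representation), g (dynamics), f (prediction)."""

  def initial_inference(self, observations) -> NetworkOutput:
    # representation: s_0 = h(o_1, ..., o_t)
    # prediction:     p_0, v_0 = f(s_0)   (the reward is 0 at the root)
    ...

  def recurrent_inference(self, hidden_state, action) -> NetworkOutput:
    # dynamics:   r_k, s_k = g(s_{k-1}, a_k)
    # prediction: p_k, v_k = f(s_k)
    ...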

Fig. 2. MuZero model components and Monte Carlo tree search

Let’s briefly describe the MCTS part (see Fig. 2). The game state is encoded by the representation network h into a root state. The path from the root to the most promising leaf node is built one action a at a time, always selecting the action with the highest UCB (Upper Confidence Bound) score. The prediction function f gives predictions of the policy p and the value v. The dynamics function g then yields a reward r and a new state s on which f acts again; this joint application of g and f is repeated until the path traversal ends. Finally, the value of the leaf node and the collected rewards are backpropagated up the tree to the root node.
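The whole procedure corresponds to the run_mcts routine of the official pseudocode, shown here in slightly condensed form; select_child is where the UCB score is evaluated, and MinMaxStats keeps the normalization bounds for values.

def run_mcts(config: MuZeroConfig, root: Node, action_history: ActionHistory,
             network: Network):
  min_max_stats = MinMaxStats(config.known_bounds)

  for _ in range(config.num_simulations):
    history = action_history.clone()
    node = root
    search_path = [node]

    # Selection: walk down the tree, always picking the child with the
    # highest UCB score, until an unexpanded leaf is reached.
    while node.expanded():
      action, node = select_child(config, node, min_max_stats)
      history.add_action(action)
      search_path.append(node)

    # Expansion: apply g and f (recurrent_inference) to the parent's hidden
    # state and the last selected action to evaluate the new leaf.
    parent = search_path[-2]
    network_output = network.recurrent_inference(parent.hidden_state,
                                                 history.last_action())
    expand_node(node, history.to_play(), history.action_space(), network_output)

    # Backup: propagate the leaf value and the collected rewards to the root.
    backpropagate(search_path, network_output.value, history.to_play(),
                  config.discount, min_max_stats)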

How MuZero trains its model

Training begins by sampling a trajectory from the replay buffer. Initially, h encodes the past observations o1, o2, . . ., ot from the sampled trajectory. The model is then unrolled for K steps and, at each step k, the state sk-1 and the true action at+k are fed to the dynamics function g. The parameters of the neural networks h, f and g are trained jointly to predict three quantities: the policy pk ≈ πt+k, the value vk ≈ zt+k and the reward rk ≈ ut+k. The value target zt+k is a sample return (the final reward for board games or an n-step return for Atari).
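This training step corresponds to the update_weights routine in the official pseudocode; the sketch below condenses it. scalar_loss and scale_gradient come from the pseudocode, while policy_loss stands in for the softmax cross-entropy used there, and the L2 weight-decay term is only hinted at in a comment.

def update_weights(optimizer, network: Network, batch, weight_decay: float):
  loss = 0
  for image, actions, targets in batch:
    # Initial step: h encodes the observations, f predicts from s_0.
    value, reward, policy_logits, hidden_state = network.initial_inference(image)
    predictions = [(1.0, value, reward, policy_logits)]

    # Recurrent steps: unroll g along the actions a_{t+k} actually taken.
    for action in actions:
      value, reward, policy_logits, hidden_state = network.recurrent_inference(
          hidden_state, action)
      predictions.append((1.0 / len(actions), value, reward, policy_logits))

    # Match each unrolled prediction against its targets pi_{t+k}, z_{t+k}, u_{t+k}.
    for prediction, target in zip(predictions, targets):
      gradient_scale, value, reward, policy_logits = prediction
      target_value, target_reward, target_policy = target
      l = (scalar_loss(value, target_value) +
           scalar_loss(reward, target_reward) +
           policy_loss(policy_logits, target_policy))
      loss += scale_gradient(l, gradient_scale)

  # The pseudocode also adds an L2 weight-decay term over the network weights.
  optimizer.minimize(loss)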

Fig. 3. MuZero training

Implementation

Understanding the inner workings of MuZero in detail would take a long time; for a quick overview, a reasonable approach is to focus on the main muzero function. The pseudocode for this entrypoint is the following (the full pseudocode is here).

# MuZero training is split into two independent parts: 
# Network training and self-play data generation.
# These two parts only communicate by transferring the latest 
# network checkpoint from the training to the self-play, and 
# the finished games from the self-play to the training.

def muzero(config: MuZeroConfig):
  storage = SharedStorage()
  replay_buffer = ReplayBuffer(config)

  for _ in range(config.num_actors):
    launch_job(run_selfplay, config, storage, replay_buffer)

  train_network(config, storage, replay_buffer)

  return storage.latest_network()

Notice that the first two objects created are a SharedStorage object to store/retrieve model parameters and a ReplayBuffer object to store/retrieve trajectory data. To make things work, MuZero must play thousands of games against itself, save them in a buffer and train on the data from those games. Initially, some simulations are run (num_actors is the number of parallel game simulations). After many self-play games have been completed, their data are stored in the replay buffer and subsequently retrieved to build training batches. After training, the model parameters (weights) are stored in the shared storage and later loaded to run further simulations. It is a circular process, as shown in the figure below.

Fig. 4. MuZero circular process
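The self-play half of this circle is the run_selfplay job launched for each actor; the version below is essentially the one from the official pseudocode, with comments added.

def run_selfplay(config: MuZeroConfig, storage: SharedStorage,
                 replay_buffer: ReplayBuffer):
  while True:
    network = storage.latest_network()  # freshest weights from the trainer
    game = play_game(config, network)   # one full self-play episode
    replay_buffer.save_game(game)       # trajectory data for future batches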

There are many code examples for MuZero; probably the most complete and functional is the one in this repository. However, if you want simple, easy-to-follow code (also convenient to step through with the debugging tools provided by several IDEs), you could also try this example, which uses TensorFlow 2 and Gym (specifically, the CartPole environment).

Useful links

MuZero original article.

MuZero pseudocode.

DeepMind page about MuZero.

Several code implementations.
