MuZero figures out chess, rules and all

Alpha-even-more-Zero...

In 1980 the first chess computer with an auto response board, the Chafitz ARB Sargon 2.5, was released. It was programmed by Dan and Kathe Spracklen and had a sensory board and magnet pieces. The magnets embedded in the pieces were all the same kind, so that the board could only detect whether there was a piece on the square or not. It would signal its moves with LEDs located on the corner of each square.

Chafitz ARB Sargon 2.5 | Photo: My Chess Computers

Some years after the release of this computer I visited the Spracklens in their home in San Diego, and one evening had an interesting discussion, especially with Kathe. What would happen, we wondered, if we set up a Sargon 2.5 in a jungle village where nobody knew chess? If we left the people alone with the permanently switched-on board and pieces, would they be able to figure out the game? If they lifted a piece, the LED on that square would light up; if they put it on another square that LED would light up briefly. If the move was legal, there would be a reassuring beep, and the computer would reply: the square of a piece of the opposite colour would light up, and if they picked up that piece another LED would light up to show where it should go. If the original move wasn’t legal, the board would make an unpleasant sound.

Our question was: could they figure out, by trial and error, how chess was played? Kathe and I discussed it at length, over the Sargon board, and in the end came to the conclusion that it was impossible: they could never figure out the game without human instructions. Chess is far too complex.

Now, three decades later, I have to modify our conclusion somewhat: maybe humans indeed cannot learn chess by pure trial and error, but computers can...

DeepMind’s MuZero teaches itself

You will remember how AlphaGo and AlphaZero were created by Google's DeepMind division. The programs Leela and Fat Fritz were generated using the same principle: tell an AI program the rules of the game, how the pieces move, and then let it play millions of games against itself. The program draws its own conclusions about the game and starts to play master-level chess. In fact, it can be argued that these programs are the strongest entities, human or computer, ever to have played chess.

Now DeepMind has come up with a fairly atrocious (but scientifically fascinating) idea: instead of telling the AI software the rules of the game, just let it play, using trial and error. Let it teach itself the rules of the game, and in the process learn to play it professionally. DeepMind combined a tree-based search (where a tree is a data structure used for locating information from within a set) with a learning model. They called the project MuZero. The program must predict the quantities most relevant to game planning — not just for chess, but for 57 different Atari games. The result: MuZero, we are told, matches the performance of AlphaZero in Go, chess, and shogi.
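To make the idea of combining a search with a learned model more concrete, here is a minimal sketch in Python. It is not DeepMind's code. MuZero uses Monte Carlo tree search guided by neural networks, whereas this toy uses a plain exhaustive depth-limited lookahead, and the "learned" model and value function are hand-written stand-ins. All names (`toy_model`, `toy_value`, `lookahead`, `plan`) and the little number-line game are assumptions made up for illustration. The point is only that the search never consults the real rules: it asks the model what would happen.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions, not MuZero's API):
# a model maps (state, action) -> (imagined next state, predicted reward),
# a value function maps state -> estimated future return.
Model = Callable[[int, int], Tuple[int, float]]
ValueFn = Callable[[int], float]

ACTIONS = (-1, +1)   # toy game: step left or right on a number line
TARGET = 3           # reaching this square pays a reward of 1


def toy_model(state: int, action: int) -> Tuple[int, float]:
    """Stand-in for a learned model; in MuZero this would be a trained network."""
    next_state = state + action
    reward = 1.0 if next_state == TARGET else 0.0
    return next_state, reward


def toy_value(state: int) -> float:
    """Stand-in for a learned value estimate: closer to the target is better."""
    return -0.1 * abs(TARGET - state)


def lookahead(state: int, depth: int, model: Model, value_fn: ValueFn,
              discount: float) -> float:
    """Best discounted return reachable within `depth` imagined steps."""
    if depth == 0:
        return value_fn(state)
    best = float("-inf")
    for action in ACTIONS:
        next_state, reward = model(state, action)
        future = lookahead(next_state, depth - 1, model, value_fn, discount)
        best = max(best, reward + discount * future)
    return best


def plan(state: int, depth: int, model: Model = toy_model,
         value_fn: ValueFn = toy_value, discount: float = 0.9) -> int:
    """Pick the action whose imagined continuation looks best (depth >= 1)."""
    def score(action: int) -> float:
        next_state, reward = model(state, action)
        return reward + discount * lookahead(
            next_state, depth - 1, model, value_fn, discount)
    return max(ACTIONS, key=score)
```

With these toy definitions, `plan(0, depth=3)` chooses `+1` (towards the target) and `plan(5, depth=2)` chooses `-1`. Swapping in a different `model` changes the plan without touching the search, which is the separation the paragraph above describes.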

And this is how MuZero works (description from VentureBeat):

"Fundamentally MuZero receives observations (images of a Go board or Atari screen) and transforms them into a hidden state. This hidden state is updated iteratively by a process that receives the previous state and a hypothetical next action, and at every step the model predicts the policy (e.g., the move to play), value function (e.g., the predicted winner), and immediate reward (e.g., the points scored by playing a move)."
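The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual implementation. In MuZero the three functions are trained neural networks, while here `represent`, `dynamics` and `predict` are hand-written toy formulas, and the names, the two-number hidden state and the three-action setup are all invented for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

State = Tuple[float, ...]
NUM_ACTIONS = 3  # assumed toy action count


@dataclass
class StepPrediction:
    policy: List[float]  # probability assigned to each action
    value: float         # predicted outcome from this hidden state
    reward: float        # predicted immediate reward of the action just imagined


def represent(observation: Sequence[float]) -> State:
    """Toy stand-in: observation -> hidden state."""
    return (sum(observation) / len(observation), max(observation))


def dynamics(state: State, action: int) -> Tuple[State, float]:
    """Toy stand-in: (previous hidden state, hypothetical action) -> (next state, reward)."""
    next_state = tuple(0.5 * s + 0.1 * (action + 1) for s in state)
    reward = 0.1 * action
    return next_state, reward


def predict(state: State) -> Tuple[List[float], float]:
    """Toy stand-in: hidden state -> (policy, value)."""
    scores = [math.exp(state[0] * (a + 1)) for a in range(NUM_ACTIONS)]
    total = sum(scores)
    policy = [s / total for s in scores]   # softmax, so it sums to 1
    value = math.tanh(sum(state))          # squashed into [-1, 1]
    return policy, value


def unroll(observation: Sequence[float], actions: Sequence[int]) -> List[StepPrediction]:
    """Encode the observation once, then step the hidden state through each action."""
    state = represent(observation)
    policy, value = predict(state)
    outputs = [StepPrediction(policy, value, 0.0)]  # no action taken yet, so no reward
    for action in actions:
        state, reward = dynamics(state, action)
        policy, value = predict(state)
        outputs.append(StepPrediction(policy, value, reward))
    return outputs
```

Calling `unroll([0.0, 1.0, 0.5], [0, 2, 1])` yields four predictions: one for the encoded observation and one for each imagined action. After the first call to `represent`, the real board is never looked at again. Everything else happens inside the hidden state.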

Evaluation of MuZero throughout training in chess, shogi, Go, and Atari — the y-axis shows Elo rating | Image: DeepMind

As the DeepMind researchers explain, one form of reinforcement learning (the technique in which rewards drive an AI agent toward goals) involves models. This form models a given environment as an intermediate step, using a state transition model that predicts the next state and a reward model that anticipates the reward. If you are interested in this subject you can read the article on VentureBeat, or visit the DeepMind site. There you can read this paper on the general reinforcement learning algorithm that masters chess, shogi and Go through self-play. Here is its abstract:

The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess), as well as Go.

That refers to the original AlphaZero development, which has now been extended to MuZero. It turns out that it is possible not just to become highly proficient at a game by playing it a million times against yourself, but also to work out the rules of the game by trial and error.

I have only just learned about this development and need to think about the consequences and discuss them with experts. My first, somewhat flippant reaction to a member of the DeepMind team: "What next? Show it a single chess piece and it figures out the whole game?"


Frederic Friedel, Editor-in-Chief emeritus of the ChessBase News page. Studied Philosophy and Linguistics at the University of Hamburg and Oxford, graduating with a thesis on speech act theory and moral language. He started a university career but switched to science journalism, producing documentaries for German TV. In 1986 he co-founded ChessBase.

Discuss


Yasser Seirawan 12/14/2019 05:51

Fascinating.

It seems to me that the next step is for MuZero to devise its own board game. One that it would consider sufficiently complex. That is after many "millions of training steps" it would still be challenged-learning-improving at its own creation. It would create its own board (battleground) as well as two armies.

Once MuZero finished its task, it would then be up to us to figure out the rules of play.

Yasser

besominov 12/13/2019 12:35

Yes, but who is telling the program what moves are legal or not?

Like in the hypothetical example with the Sargon computer and the jungle villagers: in that case the board is telling the user what moves are legal or not.

So somebody who knows the rules is telling the program what the rules are, just not in a direct way.

So it does not really seem like such a big step from the previous "zero" approach, just a bit more generalized (and cumbersome).

XecutionStyle 12/13/2019 10:34

The final hyperbole emphasizes what just happened.
Injecting knowledge was an important objective to learning, especially reinforcement learning, as it dramatically reduced the trials necessary for optimal policy. AlphaZero was given such knowledge through the rules of the game; thereon, its predictions were purely statistical: "based on my experience, this move given the current configuration yields a favorable outcome with a probability of...". It knew nothing of the game. It's like throwing a ball, and predicting motion through kinematics - knowing "why" i.e. Gravity, air-friction etc. was not necessary.
MuZero, in the ball example, models those entities involved such as gravity, and uses them to understand how the system behaves. Therefore its predictions are based on fundamentally answering "why", rather than "how" like AlphaZero.
The benefits reach beyond the obvious. Not only will it be safer (as simulating on a model of the environment is harmless compared to needing actual experience), require less experience (understanding gravity exists is far more useful than implicitly accounting for it using statistics), but with a model, we'll be able to explain most system-dynamics such as how a butterfly could/couldn't trigger a hurricane without needing data of every permutation involving butterflies and hurricanes.

ketchuplover 12/13/2019 04:09

Regarding the jungle village example, I say you can learn how the pieces move, but not castling, en passant and pawn promotion.

Keshava 12/12/2019 08:23

"Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess..." Well, Lc0 is how AlphaZero would currently do in fair competition (run by third parties) against other top engines. On the CCRL 40/4 Rating List Lc0 is one of the top 4 engines, and it is a close 2nd to Stockfish on the CCRL 40/40 Rating List. If you run your own tests you can usually find a way to get the results you want.
