Review of "Computer Analysis of World Chess Champions"
by Matej Guid and Ivan Bratko, published in ICGA Journal, Vol 29, No. 2, June 2006, pages 65-73, republished by ChessBase.com.
By Dr. Søren Riis, Oxford, UK
In the paper the authors present a method that is claimed to identify the best chess player of all time. The basic idea is to compare the moves played by world champions with the evaluation of those moves given by a strong computer chess program.
If we are to believe the authors, it is possible to determine a player’s strength by having a version of Crafty (one that always searches to a fixed depth of 12 plies) judge the quality of the champions’ moves. The quality of a move is measured by how many pawns, in the program’s own evaluation, the move chosen by the player falls short of the move the program judges to be the best.
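The scoring idea described above can be sketched in a few lines of Python. This is only an illustration of the averaging step, not the authors' implementation: the function name `mean_loss` and the sample numbers are hypothetical, and the engine's evaluations are assumed to be given in pawns from the point of view of the player to move.

```python
from statistics import mean

def mean_loss(evaluations):
    """Average number of pawns by which the played moves fall short of the engine's choice.

    `evaluations` is a list of (best_eval, played_eval) pairs, one per move,
    both taken from the same engine (an assumption of this sketch).
    """
    losses = [best - played for best, played in evaluations]
    return mean(losses)

# Made-up evaluations for three moves, purely for illustration.
sample = [(0.30, 0.30), (0.50, 0.25), (1.00, 0.50)]
average = mean_loss(sample)
print(average)
```

Under this sketch a lower average means closer agreement with the engine, which is exactly why the score measures similarity to the engine's judgement rather than playing strength as such.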
First, let me note that if we tried to decide which contemporary chess program is the strongest based on the authors’ method, we would almost certainly get some quite absurd results!
There are different versions of Crafty, but none of them has a rating of more than 2700 on the latest rating lists. The version used by the authors is a modified (“amputated”) version of Crafty that for each move searches a fixed number of moves (6 moves, and 6.5 in the endgame) before evaluating the quality of each move available in the position. The strength of the program becomes quite unreliable because the horizon effect sets in. Anyway, assume (for the sake of the argument) that this amputated version of Crafty plays roughly at the same level as the standard version of Crafty.
- Absurdly, on top of the list we would (by definition!) have the "amputated" version of Crafty itself (used by the authors).
- Almost as absurdly, we would expect that the standard version of Crafty would also be on top.
On the other hand, some top programs (especially when run on fast 4 CPU machines) are much stronger than Crafty, and would almost literally shred Crafty to pieces. Yet, essentially, the stronger a program is, the less likely it is to behave like Crafty.
Thus, to put it in a somewhat simplified way, Crafty would have a tendency to rank all engines rated above 2700 in reverse order, with the weakest at the top of the list, and the strongest engines appearing further down.
But maybe the method makes sense when testing former world champions? No! What the authors are testing is simply which of the world champions played chess most in the style of the "amputated" version of Crafty. Capablanca played quite simple chess where the way to make progress apparently is within reach of Crafty. On the other hand, Kasparov played numerous games that are well above the grasp of Crafty. It is worth noticing that quite frequently, engines of the level of Crafty (but also much stronger engines) misjudge positions and moves considerably. Most newsgroups on computer chess are full of such examples. The frequency with which computers misjudge moves and positions varies with the type of position, among other things. However, there is no doubt that some players play chess that is simply too deep to be fully appreciated by an engine at Crafty's level.
In fact, many pretty standard moves are completely missed by Crafty at search depth 12. Crafty penalises Fischer’s Rxh5! against Larsen (played in Portoroz) by 0.41 pawns. Crafty at depth 12 thinks Bxg7 is the better move, while in fact only Fischer’s move leads to a clear win. Kasparov’s Bh6! against Short in Zurich 2001 is crushing, and might be the only winning move. Yet Crafty searching at depth 12 penalises Kasparov’s brilliant concept by more than two pawns. In fact, Crafty does not have Bh6 among its 20 best moves!! It is not unlikely that Kasparov based part of his attack on the possibility of Bh6, and saw this move even earlier in the game. This is utterly beyond what Crafty can handle.
In fact, my conclusion (based on more examples) is that in general Crafty completely fails to understand the depth of Kasparov’s play. Capablanca plays chess that Crafty apparently finds easier to appreciate, even though Crafty occasionally also punishes Capablanca unfairly (though I did not find any examples where Crafty completely fails to understand a move by Capablanca).
Now suppose we do the same test with significantly stronger engines. Would the proposed method then make sense? I'm afraid this would not give very good results either. First of all, it would favor safe positional players over wild attacking players – I suppose that Capablanca would still look much better than Tal. In fact Tal might look like a hopeless patzer who was lucky to be playing against other patzers.
But suppose we really just want to be "objective" without any preconception and simply ask which of the world champions played the most perfect chess. What is so wrong with taking the strongest programs – assuming that they are significantly better than Crafty – and asking for their opinion? To pinpoint the problem I will look at the issue from a somewhat theoretical perspective.
Objectively, each chess position is either won for white, drawn or won for black. For won positions the quality of a move can be judged on how much closer it moves the position to mate. The best achievable is +1 (i.e. one move closer). From this abstract perspective a move of minus 10 is a "mistake" in the sense that it changes the position to a position where there are 10 more moves to the mate. I will call such mistakes "harmless" mistakes. A much more serious type of mistake occurs if the player makes a move that converts a won position to a drawn or lost position. From this highly abstract view let us call a move that changes a won position (drawn position) to a drawn position (lost position) a "serious mistake", and a move that changes a won position to a lost position a "double serious mistake".
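The classification just described can be written down as a small Python sketch. The names `WIN`, `DRAW`, `LOSS` and `classify_move` are hypothetical, and the sketch assumes that the game-theoretic value of a position (from the mover's point of view) is known before and after the move, which in practice is only true for positions covered by endgame tablebases.

```python
# Game-theoretic values from the point of view of the player making the move.
WIN, DRAW, LOSS = 1, 0, -1

def classify_move(value_before, value_after):
    """Classify a move by how far it lowers the mover's game-theoretic value."""
    drop = value_before - value_after
    if drop < 0:
        # Against perfect play no move can raise the mover's own value.
        raise ValueError("a move cannot improve the game-theoretic value")
    if drop == 0:
        # Includes "harmless" mistakes that only lengthen the path to mate.
        return "outcome preserved"
    if drop == 1:
        return "serious mistake"
    return "double serious mistake"

print(classify_move(WIN, DRAW))   # serious mistake
print(classify_move(WIN, LOSS))   # double serious mistake
```

Note that every move that keeps a drawn position drawn lands in the same category here, which is the crux of the argument that follows.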
Is there any way we can measure the quality of moves in positions that are objectively drawn? There is not! This is where psychology and knowledge of the opponent enter the equation. A move that is best against one opponent (i.e. most likely, in the long term, to lead the opponent to make a "mistake" and produce a lost position) might differ from the best move against another opponent. From a purely theoretical (and logical) perspective, there is no objective measure of why one move is better than another as long as the position stays balanced (i.e. is objectively drawn). All moves guaranteeing a draw are equally good against perfect play. But in a real game the opponent is not perfect. The task is to produce moves that maximise the likelihood that the opponent at some stage makes a serious mistake leading to a lost position. But what the best way to achieve this is depends to some extent on the opponent and his or her strengths and weaknesses. Maybe Capablanca’s way of playing was good enough to get convincing results in 1920. However, Capablanca’s way of playing balanced positions might not have worked very well against contemporary masters. In modern chess, some players find it more important to create complex, difficult positions than positions with a cosmetic advantage that are unlikely to cause the opponent great difficulties.
There are, of course, some general principles for how best to put pressure on the opponent. Chess is a game of skill with relatively clear criteria for good play, so grandmasters often have pretty similar ways of judging positions. It is, however, important to realise that the evaluation of balanced chess positions is, to some extent, an art, and that the greatest players (like Kasparov) also, to some extent, take psychological factors and the strengths and weaknesses of the opponent into account when playing.
Chess engines in the future might play at such a high level that all games essentially result in a draw, even when one engine is given much less time than the other! Still, different engines (though in some sense they play perfectly) might evaluate balanced positions somewhat differently. Thus even future (almost) perfect engines might not agree on which of the champions was the greatest.
To let Crafty judge who was the greatest World Chess Champion is an insult. It is like having a tone-deaf person judge who was the greatest composer.
Søren Riis is a computer scientist at Queen Mary University of London. He has a PhD in pure mathematics from the University of Oxford. He is Danish but currently lives near Oxford. He used to play competitive chess around 20 years ago (Elo 2300). Riis has been briefly involved with chess programming, and his interests include theoretical aspects of computer chess.
The following letter was sent to us independently of Søren Riis's article. It was in reaction to some of the letters that follow below, and to messages that were posted on different computer forums.
Computer Analysis of World Chess Champions – answer to some comments
We would like to thank the readers for their interest in our article on computer analysis of chess champions (ChessBase, 30 October 2006).
We would also like to answer a frequent comment by the readers. The comment goes like this: “A very interesting study, but it has a flaw in that the program Crafty, whose rating is only about 2620, was used to analyse the performance of players stronger than this. For this reason the results cannot be useful”. Some readers speculate further that the program will give a better ranking to players that have a similar rating to the program itself.
These reservations are perhaps based on a straightforward intuition that the program used must necessarily be stronger than the players analysed. However, things are not so simple and the intuition seems to be misguided in this case. Simple mathematics shows, possibly surprisingly, that:
(a) To obtain a sensible ranking of players, it is not necessary to use a computer that is stronger than the players themselves. There are good chances to obtain a sensible ranking even using a computer that is weaker than the players.
(b) The (fallible) computer will not exhibit preference for players of similar strength to the computer.
These points can be illustrated by a simple example. Let there be three players and let us assume that it is agreed what is the best move in every position. Player 1 plays the best move in 90% of positions, player 2 in 80%, and player 3 in 70%. Assume that we do not know these percentages, so we use a computer program to estimate the players’ performance. Let the program available for the analysis only play the best move in 70% of the positions. In addition to the best move in each position, let there be 10 other moves that are inferior to the best move, but the players occasionally make mistakes and play one of these moves instead of the best move. For simplicity we take that each of these moves is equally likely to be chosen by mistake by a player. So player 1 who plays the best move 90% of the time, will distribute the remaining 10% equally among these 10 moves, giving 1% chance to each of them. Similarly, player 2 will choose any of the inferior moves in 2% of the cases, etc. We also assume that mistakes by all the players, including the computer, are probabilistically independent.
In what situations will the computer, in its imperfect judgement, credit a player for the “best” move? There are two possibilities:
- The player plays the best move, and the computer also believes that this is the best move;
- The player makes an inferior move, and the computer also confuses this same inferior move for the best.
By simple probabilistic reasoning we can now work out the computer’s approximations of the players’ performance based on the computer’s analysis of a large number of positions. The formula given below shows that the computer will report the estimated percentages of correct moves as follows: player 1: 63.3%, player 2: 56.6%, and player 3: 49.9%. These values are quite a bit off the true percentages, but they nevertheless preserve the correct ranking of the players. The example also illustrates that the computer did not particularly favour player 3, although that player is of similar strength to the computer.
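The three-player example can also be checked by simulation. The following Python sketch is an illustration under the letter's stated assumptions (one best move, 10 equally likely inferior moves, independent errors); the function name `simulated_agreement`, the number of trials and the random seed are hypothetical choices of this sketch.

```python
import random

def simulated_agreement(p, pc, n=10, trials=100_000, seed=0):
    """Estimate how often the computer credits the player with the "best" move.

    Move 0 stands for the truly best move; moves 1..n are the inferior ones.
    """
    rng = random.Random(seed)

    def pick(prob_best):
        # Play the best move with probability prob_best, else a uniformly
        # chosen inferior move (randint is inclusive at both ends).
        return 0 if rng.random() < prob_best else rng.randint(1, n)

    hits = sum(pick(p) == pick(pc) for _ in range(trials))
    return hits / trials

# Players 1, 2 and 3 from the example, judged by a computer with pc = 0.7.
estimates = {p: simulated_agreement(p, pc=0.7) for p in (0.9, 0.8, 0.7)}
for p, estimate in estimates.items():
    print(f"true accuracy {p:.1f}: computer reports about {estimate:.3f}")
```

With enough trials the reported values should settle close to the percentages quoted above, and the ordering of the three players should be preserved.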
The simple example above does not exactly correspond to our method which also takes into account the cost of mistakes. But it should help to bring home the point that for sensible analysis we do not necessarily need computers stronger than human players. This is of course not to say that a stronger program, if available, would not be more desirable. Also, it should be noted that our method makes other, more subtle assumptions. Our results should therefore be interpreted in the light of these assumptions.
Ivan Bratko and Matej Guid
P.S. Formula to compute computer’s estimates:
p’ = p * pc + (1 – p) * (1 – pc) / n
p = probability of the player making the best move
pc = probability of the computer making the best move
p’ = computer’s estimate of player’s accuracy p
n = number of inferior moves in a position
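As a quick check, the formula can be evaluated directly in Python for the three players of the example (pc = 0.7, n = 10). The function name `computer_estimate` is a hypothetical label chosen for this sketch.

```python
def computer_estimate(p, pc, n):
    """Computer's estimate p' of a player's accuracy p, per the formula above."""
    return p * pc + (1 - p) * (1 - pc) / n

for p in (0.9, 0.8, 0.7):
    print(f"p = {p:.1f} -> p' = {computer_estimate(p, pc=0.7, n=10):.3f}")
# p = 0.9 -> p' = 0.633
# p = 0.8 -> p' = 0.566
# p = 0.7 -> p' = 0.499
```

Rearranged, the formula reads p' = p * (pc - (1 - pc) / n) + (1 - pc) / n, which is linear in p. Within this simplified model the ranking is therefore preserved whenever pc > 1 / (n + 1), so that the slope is positive.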
Peter Ballard, Adelaide, Australia
Here's what I don't understand: the authors analysed only world championship matches. Capa squashed an aging Lasker in a 14 game match, but then was outplayed by Alekhine in a 30+ game marathon. So that equates to a relatively poor record in WC matches. I'd like to see the results for individual matches. In how many matches did the loser score a better "quality of play" index than the winner? If Capa scored better on "quality of play" than Alekhine in the 1927 match, what does that say about their methodology?
Albert Silver, Rio de Janeiro, Brazil
When the authors write "The basic criterion for evaluating World Champions was the average difference between moves played and best evaluated moves by computer analysis", they are basically stating that the moves of the world champions at tournament time controls are less likely to be correct than Crafty at 15-30 seconds a move (the rough time taken to reach the depth chosen by the authors). After all, instead of seeing whether the computer can find the moves of the champions, as is often the case of test suites, here the champions have the unenviable burden of having to play like Crafty. So, who do you trust more on average? Kasparov (or Karpov, Kramnik, etc) at 3 minutes a move, or Crafty at 15-30 seconds?
The histogram of the average error also implies that that is the edge Crafty would have over said World Champion. So if Kasparov has an average error rate over all his moves of 0.1292, this means that in a match where Crafty is given 12 plies limit to play, Kasparov, with all his ability and positional judgement, should lose 5-4 even with 6 times more thinking time. Capablanca being much stronger would only lose 6-5... If the edge is in pawns and not points, then it means that for every 8 moves on average, Crafty expects to gain an advantage of one extra pawn over Kasparov. Have they any idea how utterly absurd that sounds??
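The "one pawn every 8 moves" figure follows from simple arithmetic on the quoted average error, as this small Python check shows. It assumes, as the letter does, that the average error is measured in pawns per move; the variable names are hypothetical.

```python
average_error = 0.1292  # Kasparov's average error in pawns, as quoted in the letter

# Number of moves after which the accumulated average error reaches one pawn.
moves_per_pawn = 1 / average_error
print(f"{moves_per_pawn:.2f}")  # 7.74, i.e. roughly every 8 moves
```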
Julian Wan, Ann Arbor, USA
Thank you for the very interesting article. It makes several points:
- It shows that raw analysis of how often a player mirrors a computer program's choice of move is not necessarily "proof" of computer assistance – in simpler situations and positions, it may actually reflect that player's great sense and judgement.
- It may open up new avenues of research – note that the games used were only from the matches – if one were to subject games over a period of time, it may show objectively a shift in style.
- It shows how style of play is a complex issue – note that Kasparov, who is known for his aggressive style, is actually quite close to Karpov, who is often viewed as having a different, more positional style.
Mohamed Nisthar, Riyadh, Saudi Arabia
Capablanca has been nominated as the best of the champions. But if I am not mistaken, he was thoroughly defeated in a match with one player from the Indian Subcontinent, Sultan Khan!!! Please check on this and it would be valuable if an analysis is made of those games.
Editor's note: Sultan Khan was one of a few players who had a plus record against Capablanca (as well as against Frank Marshall and Savielly Tartakower). But we only know one game between the two: it was Sultan Khan's white piece victory over Capablanca at the Hastings tournament of 1930: 1.Nf3 Nf6 2.d4 b6 3.c4 Bb7 4.Nc3 e6 5.a3 d5 6.cxd5 exd5 7.Bg5 Be7 8.e3 O-O 9.Bd3 Ne4 10.Bf4 Nd7 11.Qc2 f5 12.Nb5 Bd6 13.Nxd6 cxd6 14.h4 Rc8 15.Qb3 Qe7 16.Nd2 Ndf6 17.Nxe4 fxe4 18.Be2 Rc6 19.g4 Rfc8 20.g5 Ne8 21.Bg4 Rc1+ 22.Kd2 R8c2+ 23.Qxc2 Rxc2+ 24.Kxc2 Qc7+ 25.Kd2 Qc4 26.Be2 Qb3 27.Rab1 Kf7 28.Rhc1 Ke7 29.Rc3 Qa4 30.b4 Qd7 31.Rbc1 a6 32.Rg1 Qa4 33.Rgc1 Qd7 34.h5 Kd8 35.R1c2 Qh3 36.Kc1 Qh4 37.Kb2 Qh3 38.Rc1 Qh4 39.R3c2 Qh3 40.a4 Qh4 41.Ka3 Qh3 42.Bg3 Qf5 43.Bh4 g6 44.h6 Qd7 45.b5 a5 46.Bg3 Qf5 47.Bf4 Qh3 48.Kb2 Qg2 49.Kb1 Qh3 50.Ka1 Qg2 51.Kb2 Qh3 52.Rg1 Bc8 53.Rc6 Qh4 54.Rgc1 Bg4 55.Bf1 Qh5 56.Re1 Qh1 57.Rec1 Qh5 58.Kc3 Qh4 59.Bg3 Qxg5 60.Kd2 Qh5 61.Rxb6 Ke7 62.Rb7+ Ke6 63.b6 Nf6 64.Bb5 Qh3 65.Rb8 1-0
Benoit Chamuleau, Istanbul, Turkey
First of all, thank you ChessBase for the very diversified news you offer on the chess world – for three years or so I have checked it on an almost daily basis! Being fascinated by the studies that attempt to compare the strongest chessplayers in the world, I feel that one dimension is continuously overlooked: the fact that chess theory was not as advanced as it is today. If, for example, the number of times that the best move is played is assessed, does this mean the best moves as known today or the best moves as theoretically known at the era of play? Whichever the case is, it means an 'unfair' comparison: players that played strongly in the past may be weak in today's competition due, for example, to the much greater complexity (in terms of number of 'good' moves) of today's games.
Indeed, people have a certain ceiling in their capacity, which was not as quickly apparent in the past – due to the relatively lower level, or simpler play – as it is today. I therefore think it may be interesting for studies like the one now published on ChessBase to consider the theoretical knowledge known at the era of play, and assess the players accordingly. It should always be noted that the capacities of players from different times cannot objectively be compared: maybe Steinitz would not have done badly at all against Kramnik, had he known the theory of today. On the other hand, maybe even you or I could be World Champ at Steinitz's time!
Tobias Nordquist, Sandviken, Sweden
This computer thing could even be the ground for a new rating system. ChessBase should develop an algorithm or program that works like the "Analyse game" function in Fritz; the difference is that the program should spit out a rating number. For example, Kramnik's error in game 2 should not be punished so much because the opponent didn't see the error. But Topalov's error in game 9 should be punished. Why? Because his opponent saw it. Why is this so important? Because the computer cannot understand the non-objective things in chess. Call them subjective things; as long as an error doesn't get punished, it's not as wrong as the computer says it is. Hope you caught that!
Pavel Dimnik, Toronto, Canada
One of the reasons I love chess is its depth, its amazing ability to regularly surprise and intrigue. On that note, I feel I must comment on an important facet of chess that this study does not really touch upon. To be fair, with regard to this type of mathematical analysis, perhaps it is impossible to do so. The facet that was not touched upon is the ability of great players to create the type of positions they want. Kramnik stymied Kasparov, Tal always found an explosion in the position, and Capablanca always found himself in a logical, positional game (to name a few). I do not think that a quantitative analysis can fully capture this ability, except to act as a comparison between two players, but even then it would favor the player who could most force the play into the form that would benefit his own style.
I applaud the effort taken, but for me this study serves to highlight and remind me of the fact that quantitative analysis can never appreciate the beauty of chess, or the true genius of its champions. It provides a useful tool of course, and chess programs are practically stronger than ever, but without the human 'deus ex machina' to oversee and appreciate what happened on the board, there would be no chess.
Paul Muljadi, USA
Thank you for sharing the Bratko-Guid paper. While this is a scientific and worthwhile attempt to identify the best chess player of all time, I think the study and paper need much improvement before we can get closer to the truth. First of all, Capablanca being the best chess player of all time does not surprise me at all. I've always concluded the same because I think he is the best endgame player of all time. I think there is a strong positive correlation between being the best endgame player and the best chess player. The paper needs to address different phases of the game and their best players. Secondly, we need to include other great chess players who have not been recognized formally as the world champions, such as Philidor, Morphy, Keres, etc. Chess titles are important, but some of the great players never had a chance for the world chess championship titles. And finally, the paper needs to address the psychological, physical, and other external aspects of chess competition as well. Lasker and Botvinnik made significant contributions in these areas.
Frank Dixon, Kingston, Canada
On the whole, this is an outstanding piece of work. I want to thank the two scientists who have put this together, and also to thank ChessBase for making it available. I want to add my additional opinion that GM Vladimir Kramnik, who won the reunification match with GM Topalov a few days ago, may be the first computer-trained World Champion in the history of chess. He was coming up at a time when computers started to reach GM levels of playing strength, and when the silicon beasts started to be used extensively for actual training by top players. This should be taken into account when commenting upon Kramnik's very low percentage of errors. The two writers have not elaborated on this important point. Kramnik's early top-level play, starting in 1991 when he was 16 years of age and made his debut at high levels, was highly tactical while still being strategically sound, for the most part. As he matured, the tactical nature sharply diminished in favour of the strategical style of play, clear evidence of the impact of computer training upon his talent. Players before him, such as Fischer, did not have this opportunity to train with computers. Fischer displayed outstanding tactical prowess in complex positions throughout his career (up to and including 1972), tempering it with more strategical insight as he matured and gained in strength.
Gary Furness MD, Santa Rosa, California, USA
Thanks for that very interesting article. I certainly never thought there would be so many ways to measure the champions within their own peer group.
David Korn, Seattle, USA
First, hearty thanks to both Matej Guid and Ivan Bratko for their excellent article attempting to objectively quantify the relative strengths of the fourteen World Chess Champions. I found this fascinating and read every line several times carefully, and totally delighted in their straightforward application of simple but well conceived metrics to the performance of the champions of chess.
I am wondering, since Mr. Sonas has already so well charted the historical Elo ratings of these same players, and of others as well, on his ChessMetrics.com site, as well as in other articles comparing the greatest performances over time, which highlight both Karpov and Kasparov for their cumulative strength in major tournaments over many years: what insights do Mr. Guid and Mr. Bratko have to reconcile their results with the general perception of Garry Kasparov, and at times Karpov, as perhaps the greatest chess player of all time, both having won many games in super-strong tournaments over long periods of time?
Let me hasten to repeat, in no way do I wish to subtract anything but heartfelt and gusty applause to their work, and this wonderful article. Many times we have heard of the depth of Capablanca's play, and how he just seemed to 'know' the right moves; similarly, this confirms what we all hear about Kramnik, that he has the deepest understanding of chess. Some say deeper than Kasparov.
Now here is the rub: if, as this article says, Capablanca and Kramnik are the leaders in accuracy and blunder rate taken together in more complex positions, while at the same time other analysis points solidly to Kasparov and Karpov, then can we not say that these two may not have had the most accurate play in absolute terms, but that they led the world for so long thanks to will power, guile, determination, combativeness, tenacity, will to win, etc., that is, due to factors other than accuracy alone, if you follow? Not to split hairs.
But Kramnik must not have played nearly as many games as the two other K's, Karpov and Kasparov. Similarly, Capablanca did indeed lose few games, but he also did not play that many games, just as Fischer did not, relative to other world chess champions.
So all agreed as Mr. Guid and Bratko suggest; but also Kasparov and Karpov won for so long that their results indicate supremacy for reasons including but not limited to accuracy and blunder reduction. Also, there is something to be said for knowing when to play the absolute best moves, or when to make the deepest calculations.
This seems to relate this discussion not just to raw calculation, but to physiological endurance and a sense of emotional economy, while at other times it hints at the will to win, which again is the emotional or physical resolve to try to win every game, as distinct from accuracy alone, thus again bespeaking Garry Kasparov’s and Robert J. Fischer’s well-known command and desire to win every game.
Matt Goddard, Atlanta, USA
It's nice to see an engine-based evaluation of historic player strength, but the study has made at least one dubious move, and at least one blunder. On the dubious side, we have the mechanism for adjusting strengths based on a subset of complicated positions, to account for playing style, when in reality that playing style itself may be the greater factor in a player's strength. It's understandable that we'd want to measure a player's ability to handle complications, especially since these are things we can most accurately assess through an engine, but there is no correlating measure of a player's ability to avoid (or create) complications.
The blunder is that the study doesn't take into account opening theory. Nowadays, Kramnik plays many more moves that are prepared by theory and confirmed by computer engines; the assessment of his games is going to show a higher percentage of "perfect" moves in the opening than an assessment of Steinitz's games, and as a result, players' scores will be inflated over time. It may be argued that a knowledge of opening theory is part and parcel of a player's strength, but that would run counter to the intent of comparing players across historic periods in the first place. Instead, it would seem that the study would need to find some line of demarcation in the games, between opening theory and over-the-board play. Simply looking at the games from, say, move 20 is problematic, since the depth of (accurate) opening theory has increased over time, and also because such hedging would provide a greater sample for positional players, who tend to have longer games. An idea to base the demarcation on the first dubious move would also have inherent problems: King's Gambit theory would all be sampled, for instance, and also there may be an imbalance imposed when playing strength is measured after one side is already at a disadvantage. Using the computer's opening books as a demarcation also fails: theory to a modern player may well be over-the-board play for the historic one; also, the entirety of Kramnik's preparation is not contained in the files. So, it's not an easy problem to overcome.