Ranking chess players according to the quality of their moves

by Frederic Friedel
4/27/2017 – How do you rate players from different periods? An AI researcher has undertaken to do it based not on the results of the games, but on the quality of the moves played. Jean-Marc Alliot used a strong chess engine running on a 640-processor cluster to analyse over two million positions that occurred in 26,000 games of World Champions since Steinitz. From this he produced a table of probable results between players of different eras. Example: Carlsen would have beaten Smyslov 57:43.


Artificial Intelligence evaluates chess champions

The Elo rating system in chess, well known to all of us, is based on the results of players against each other. Designed by the Hungarian-born physics professor and chess master Árpád Imre Élő and adopted by FIDE in 1970, the system is used to predict the probability of rated players winning or losing their games against other rated players. If a player performs better or worse than predicted, then rating points are added to or deducted from his rating. However, the Elo system does not take into account the quality of the moves played during a game and is therefore unable to reliably rank players who have played at different periods in history.
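
As a rough illustration of how such a prediction works, the standard logistic form of the Elo expected score can be sketched in a few lines of Python. The function names and the K-factor of 20 are assumptions chosen for this example, not something taken from the article; federations use different K values for different groups of players.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B (win = 1, draw = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating: float, expected: float, actual: float, k: float = 20.0) -> float:
    """New rating after one game; k = 20 is an assumed value, not a fixed rule."""
    return rating + k * (actual - expected)


if __name__ == "__main__":
    e = expected_score(2800, 2700)
    print(f"expected score of the 2800 player: {e:.3f}")
    print(f"rating after a draw: {updated_rating(2800, e, 0.5):.1f}")
```

With these definitions a 2800 player is expected to score about 64% against a 2700 player, and loses a few points if the game is only drawn.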

Now computer scientist and AI researcher Jean-Marc Alliot of the Institut de Recherche en Informatique de Toulouse has come up with a new system (and reported on it in the journal of the International Computer Games Association) that does exactly that: rank players by evaluating the quality of their actual moves. He does this by comparing the moves of World Champions with those of a strong chess engine – the program Stockfish running on a supercomputer. The assumption is that the engine is executing almost perfect moves.

Alliot has evaluated 26,000 games played by World Champions since Steinitz, estimating the probability of their making a mistake – and the magnitude of the mistake – for each position in their games. From this he derived a probabilistic model for each player, and used it to compute the win/draw/lose probability for any given match between any two players. The predictions, he says, have proven not only to be extremely close to the results from actual encounters between the players, but they also fare better than those based on Elo scores. The results, he claims, demonstrate that the level of chess players has been steadily increasing. The current world champion, Magnus Carlsen, tops the list, while Bobby Fischer is third.
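
To make the idea of "the probability of a mistake and its magnitude" concrete, here is a deliberately simplified Python sketch. It is not Alliot's model, whose details are in the paper. It only assumes that for each position we already have two engine evaluations in centipawns, one for the engine's preferred move and one for the move actually played, both from the mover's point of view. The function name and the sample numbers are made up for illustration.

```python
from statistics import mean


def mistake_profile(evals, threshold=0):
    """Summarise a player's errors from (best_eval_cp, played_eval_cp) pairs.

    Returns (probability of a mistake, mean size of a mistake in centipawns).
    A move counts as a mistake when it loses more than `threshold` centipawns
    compared with the engine's preferred move.
    """
    if not evals:
        raise ValueError("need at least one evaluated position")
    # Loss is never negative: a move cannot be better than the engine's best.
    losses = [max(0, best - played) for best, played in evals]
    mistakes = [loss for loss in losses if loss > threshold]
    p_mistake = len(mistakes) / len(losses)
    mean_size = mean(mistakes) if mistakes else 0.0
    return p_mistake, mean_size


if __name__ == "__main__":
    # Hypothetical evaluations for four positions.
    sample = [(30, 30), (50, 10), (0, -100), (20, 20)]
    print(mistake_profile(sample))
```

A real study would condition these figures on the kind of position and feed them into a model of the whole game; the sketch stops at the per-player summary.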

Here are the predicted results of games between the world champions, each taken in his best year. Each cell gives the row player's expected score, in percent, against the column player:

              Ca  Kr  Fi  Ka  An  Kh  Sm  Pe  Kp  Ks
Carlsen        -  52  54  54  57  58  57  58  56  60
Kramnik       49   -  52  52  55  56  56  57  55  59
Fischer       47  49   -  51  53  57  56  57  56  59
Kasparov      47  49  50   -  53  54  54  54  53  57
Anand         44  46  48  48   -  54  52  53  53  57
Khalifman     43  45  44  47  47   -  50  51  52  53
Smyslov       43  45  45  47  49  51   -  50  51  53
Petrosian     43  44  45  47  49  50  51   -  52  53
Karpov        44  46  45  48  48  49  50  49   -  51
Kasimdzhanov  41  43  42  45  45  48  48  48  50   -
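
For readers used to thinking in rating points, a percentage from the table can be converted into an approximate Elo difference by inverting the usual logistic Elo expectation. This is only a reading aid, under the assumption that the table entries behave like Elo expected scores; the function name is hypothetical and the conversion is not part of Alliot's paper.

```python
import math


def elo_gap(expected: float) -> float:
    """Rating difference that would give `expected` as the Elo expected score."""
    if not 0.0 < expected < 1.0:
        raise ValueError("expected score must lie strictly between 0 and 1")
    return 400.0 * math.log10(expected / (1.0 - expected))


if __name__ == "__main__":
    # Carlsen's predicted score against Smyslov, taken from the table.
    print(f"{elo_gap(0.57):.0f} rating points")
```

Under that assumption, Carlsen's 57% against Smyslov corresponds to a gap of roughly 49 rating points.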

Under current conditions, Alliot feels, this new ranking method cannot immediately replace the Elo system, which is easier to set up and implement. However, increases in computing power will make it possible to extend the new method to an ever-growing pool of players in the near future.

Read the full detailed paper published by Jean-Marc Alliot in the ICGA Journal, Volume 39-1, April 2017. Mathematically proficient readers are welcome to comment on his method and his results.

Editor-in-Chief of the ChessBase News Page. Studied Philosophy and Linguistics at the University of Hamburg and Oxford, graduating with a thesis on speech act theory and moral language. He started a university career but switched to science journalism, producing documentaries for German TV. In 1986 he co-founded ChessBase.
Discussion and Feedback

imdvb_8793 5/17/2017 12:58
"While it is a fascinating study, I think it is more accurate from an era starting from Fischer/Karpov or maybe even Kasparov. Right now, precise play and winning are two very equivalent goals. However, in the past, they were not. The primary goal was to win, and while players tried to do it with precise play, many used the psychology of the opponent and knowingly entered lines that were harder for the opponent, even though they had more solid alternatives."

Excellent point, k2a2!

"Also when solid individuals like Carlsen, Karpov etc play guys like Tal, Alekhine, Kasparov and Fischer the more unhinged and 'low' quality moves, especially if made nearer the time control, can actually be decisive." - MikeyBoy


"Khalifman over Karpov is scientific proof that this study is invalid!" - bbrodinsky.

:) Pretty much... (Except for the 'scientific' part - but it's common sense proof. Khalifman wasn't better than Karpov even when they were both still playing at the same time, and Karpov was WAY past his peak.)
imdvb_8793 5/17/2017 12:25
"The assumption is that the engine is executing almost perfect moves."

This is the assumption I've always had the biggest problem with when it comes to these studies... Chess players have been making this assumption since the beginning of time about the strongest players in the world (I'm not making this up, I've read such claims in various books from all eras), and the next generation has always "proven" this to be wrong through "better analysis". Nowadays the strongest players happen to be the engines. I'm not sure that guarantees anything, move quality-wise... In any case, I know of no scientific proof that any individual move the engine gives as best over what the old GM (for example) played in his game is actually, mathematically better. It's just "better" based on the evaluation parameters set for that engine (and most others, probably), which are far from being mathematical absolutes.

Of course, those who do think engines play near-perfect chess are impossible to convince they might be wrong. The people who thought Morphy or Capablanca played near-perfect chess (which evidence now suggests they most likely didn't) were the same...

Also, 54-47 (or even 54-46) - Carlsen's expected score, under this system, vs. both Fischer and Kasparov - doesn't represent, in percentages, a statistically relevant difference, if I'm not mistaken. I could easily be - I'm not an expert on this topic; I'm just going by what I remember hearing about this sort of thing from people that are.

Lastly, as has been mentioned before, here, too, positional players (which Carlsen and Kramnik, for instance, most definitely are) seem to have a clear advantage in studies like this one. Another reason to doubt the validity of the resulting probable scores.
psychess 5/5/2017 04:41
WildKid you nailed it! Furthermore, a Tal/Kasparov-like player may induce more blunders in their opponents but the number of blunders the opponent makes isn't taken into consideration and it's difficult to ascertain forced blunders because the quality of the opposition also influences the number of blunders an opponent makes. Thus this system actually detects more recent players facing high quality opposition who have unaggressive styles of play where humans are more likely to make moves that computers like. The authors attempted to derive measures of position complexity but they did not greatly discriminate between the players. Instead of measuring how much each player played like Stockfish, they need to derive more meaningful indicators such as creating winning positions and blunder percentage of self and opponent if they want to detect meaningful differences between players.
lajosarpad 5/1/2017 10:44
@RayLopez 2/2

"The relative ranking of Capablanca, Euwe, Steinitz, and the players in the control group are preserved at all depths using any of the programs at any level of search. These experimental findings reject possible speculations that using chess programs stronger than CRAFTY for the analysis could lead to completely different results, and that Capablanca’s result is merely a consequence of his style being similar to CRAFTY’s style. We also view these results as another confirmation that in order to obtain a sensible ranking of the players, it is not necessary to use a computer that is stronger than the players themselves."
This is a given result for a given set of players. If we assume - and we cannot know this for sure - that the result is not mistaken, then it is wrong to conclude that this small pattern will describe the pattern in other cases, so this is a fallacious argument to make readers accept the methodology. A given experimental result is not a proof that other experiments will have similar results. It is only the first step. Then one must prove that the result of the experiment will imply similar results of other experiments to prove the point shown. This is missing in the quoted study of 2011 and that wrong conclusion which resulted in the fallacious usage of mathematical induction is used in your premise in this debate. No wonder that you contradict yourself when you state that increased accuracy of analysis will not change the ranking of the players, when the ranking highly depends on the comparison's accuracy.
lajosarpad 5/1/2017 10:33
@RayLopez 1/2 The title of the article is "Using Heuristic-Search Based Engines for Estimating Human Skill at Chess". The keyword here is estimation, that is, we are not talking about exact values. The exact values are not known, they are approached by a learning algorithm. The method used was empirical, that is, the statistical result of a finite set of experiments formulates an educated guess and that guess is assumed to be close to reality. In this study the strength of players is measured by the correctness of moves. If the depth of analysis is small, then the accuracy of this measure is very low indeed. Possible bugs of the chosen engine are not checked against using other engines. A single metric is used which has its flaws. All these problems generate a lot of noise, which makes the analysis inaccurate. If you are not accurately determining the correctness of the moves in your analysis, then you are not accurately evaluating the strength of the player who made the moves (not to mention the other problems that are not handled by this methodology). If you are not accurately determining the strength of the players who made the moves, then the result of the comparison between inaccurately determined player strengths will be inaccurate as well. If the comparison of player strengths is inaccurate, then the sorting which is based on your inaccurate comparison will be inaccurate as well. And if the sorting of players is inaccurate, then the relative ranking will be inaccurate, incorrect as well. I am not saying the order of the players reached by the study is inaccurate. It can be accurate by accident, but the methodology is clearly inadequate to be even close to providing a result which we can trust. I would have my doubts even if more plies were used and more engines and more metrics. Even in that case the noise could result in slight inaccuracies, especially when two players are very close to each other in strength.
However, the current methodology does not even approach that accuracy.

"The issue is relative rankings of players, not 'accuracy'."
Relative rankings of players based on inaccurate measure will be inaccurate as well. If two players with very close playing strength can be accurately sorted (ranked) by this approach, then this is an accurate measure. If this is an accurate measure, then it surely will be able to always find out the correctness of the moves. And if that is the case, then Kramnik should not use supercomputers when preparing for a game. He should use Stockfish with a depth of 5, as according to your assumption that is enough to accurately determine the ranking of the players. Good luck explaining this to super GMs.
RayLopez 4/30/2017 07:00
@lajosarpad - see this paper: http://en.chessbase.com/post/using-che-engines-to-estimate-human-skill Notice that stronger and weaker engines give the same relative rankings. The issue is relative rankings of players, not 'accuracy'. Of course the stronger engine is more accurate.
lajosarpad 4/30/2017 12:25

I highly doubt that a computer having a horizon of n plies would have the same accuracy as a computer having a horizon of m plies, where m > n. You claim that a given (very small!) depth is accurate enough. I highly doubt it. If there is any difference between the result of n plies and m plies, when m > n, then the result will not be similar. But just make an engine at depth of 5 play against an engine at depth of 15 over 12 games and see the result. I know it is better for a human to wait for a small amount of time and the profit is greater if the "response" arrives earlier, but I am sure your argument that the depth does not really matter is a fallacy. Why do you think super GMs analyze with super computers into great depth? If you are right they should switch to smaller depth, because that's enough, right? Good luck convincing Kramnik that he should switch his engine to 7 plies when preparing for a game. And since people are using engines set to great depth to find out the truth in a position, evaluating their moves using a much smaller depth is clearly fallacious, especially in the case of home preparations or in positions very similar to some of their previous studies analyzed at great depth. It is up to the authors to decide whether they want to do a serious study or they just want to bluff. As for your suggestion of using a tiny horizon limit: the study would be quicker, but of low quality. You also say that my point that if you use only Stockfish you will not overcome its bugs is irrelevant. I disagree. How can you say that a provable way to reduce the number of bugs and therefore increase the accuracy of the study is irrelevant? Please share the URL where they prove that a smaller number of plies has the same accuracy as a greater number of plies as a limit. Note that my suggestion was to use multiple engines so if an engine has a given semantic bug, there is at least a chance the others will detect it. 
It is a provably superior approach in terms of accuracy: Stockfish has bugs. The other engines might be correct where Stockfish is incorrect. The makers of such studies would at least have the possibility to observe such discrepancies. You seem to firmly believe the study is perfect and if so, it is understandable why you advocate against suggestions for boost of accuracy. But if a position is misunderstood by Stockfish at a depth of 5 moves and the players understand it (for instance it is a long endgame they know how to play), then Stockfish will consistently punish all their moves until they are very close to the end of the variation. If the variation is of 40 moves length, the players each will be punished invalidly for inaccuracy 35 times. In comparison: many games are even shorter than that and if we say that a chess game lasts maybe 35 moves on average, then the approach you support would say a GM was having only blunders for an amount of moves which could be a whole game, just because of that inaccuracy. So your approach is clearly wrong and the study is a step towards the right direction but it is very far from being accurate indeed. Please, don't try to convince me that Stockfish has no bugs or they are not relevant or the accuracy is the same with smaller amount of plies than larger ones. I know why those arguments are fallacious but would like to avoid allocating time for explaining it.
Igor Freiberger 4/29/2017 11:44
This is an interesting discussion. Actually, I don't think an attacker is better than a defender, nor that tactical play is better than positional. What I argue is that the model has flaws when it does not balance these differences – and thus produces suspicious results like ranking Khalifman and Kasimdzhanov above Capablanca or Spassky.
RayLopez 4/29/2017 06:13
@Igor Freiberger - I think we agree more than disagree. I am not saying Tal was no good (attack works better than defense in short time controls, that's why gambits work in blitz). I am not endorsing Keene's work, just saying he and I agree: I'd rather take a good defender (Petrosian! Capa!) over a good attacker (Fischer! Alekhine!) in a long match, well, maybe Petrosian in his peak and Fischer not at his peak ;-) But that does not mean the paper is wrong. Defense works. That's the strength of a computer. When I play against my computer (set to expert level), it fails at long-term strategies, and makes mistakes, but within the 'event horizon' of 5 to 7 moves, it plays great defense and perfect chess. In short, it plays better chess than me most of the time. That's why PCs are hard to beat: within their event horizon they play perfect chess. And a player who plays like a PC, like So, or Carlsen, or any of the top 10 (not sure about Naka however, as he's more of a wild attacker) plays 'defensive' and 'perfect' chess. Oh, btw, Giri (defender) beat Jobava (attacker) in a poisoned pawn variation in Iceland just now. Just one data point, so not reliable, but it makes my point. If you mention Kasparov (attacker) beating Karpov (defender), I would counter that Karpov was just too low rated. After all, Khalifman says the PC is better than Karpov.

@lajosarpad - "If you only use Stockfish and Stockfish is stronger than you, then you will not be able to find Stockfish's mistakes." - true but irrelevant. The paper I mention in the Chessbase archives shows that the relative rating of players doesn't change even if you use stronger machines. Example: PC1 can see 5 moves and rates Carlsen, Kramnik, Fischer as the best of all time, in that order. PC2 can see 10 moves and also would rate these players in the same order (says this paper). It does not matter that at move 7 PC2 could beat PC1, which cannot see that far. This goes to my point about how most of the time, "a blunder is a blunder". We all like to think our bad sacrifices that are refuted are brilliant positional sacs, but most of the time, chess being largely a tactical game, they are not. Positional sacs are rare, or not as common as you think, is the short way of saying this.
FlannDefence 4/29/2017 06:07
Results, not moves, are what counts. Humans sometimes struggle to understand computer moves, but humans will often make good, well motivated moves for which the computer will comprehend no reason.
blueflare 4/29/2017 05:21
I don't know, man, but FIDE champs were no world champs, except Anand, Karpov, and Topalov.
lajosarpad 4/29/2017 11:17
@Igor Freiberger you are right in stating that players going into risky variations tend to make more mistakes due to the nature of positions they play and that does not necessarily mean that they are weaker than more solid players. I think this is a good argument to state that the sole metric described in the paper is not enough.

@RayLopez I did not advocate changing the engine. My suggestion was to use several engines to be able to detect the particular mistakes of given engines by comparing their results to the other engines and noticing the disagreement and to take into account the minimal number of plies to accurately understand the position when we evaluate a mistake, don't just add the position drops. Adding them is much easier, but far from correct in determining player strength. If you only use Stockfish and Stockfish is stronger than you, then you will not be able to find Stockfish's mistakes. Using more engines will not change the position being analyzed, but it will change the result of the analysis of some positions, making the study provably superior compared to the current version. And the number of plies needed to accurately evaluate a move is a necessary, but not complete metric to determine the difficulty of accurately understanding a given position. There is a difference in missing a mate in 1 and missing a mate in 35, don't you agree? And what about positions where all positional heuristics point out there is a draw, but in fact it is a mate in 35? If 35 moves (70 plies) is outside of the horizon, then the engine will fail to correctly evaluate the position and thus it will reach mistaken conclusions and therefore, misunderstanding the situation it will not necessarily be able to determine the best move in the position. So the number of plies is essential when we determine the horizon of the engines or the greatness of the error. It might well be that the evaluation drops by the same amount, but the drop of the position does not directly correlate to the greatness of the mistake of a human player.
Igor Freiberger 4/29/2017 09:14
@RayLopez: firstly, a sacrifice with a refutation is not a 'bad sacrifice'. You are thinking about perfect moves and not competition. Second, the handicap for risky players is not a matter of opinion. It is objective: games played by Topalov or Shirov or Jobava contain many more incorrect moves (by both players) than those played by Giri, Leko or Karjakin. A complex, risky game naturally causes less precise moves due to the abundance of variations and tactical possibilities. Alliot's model measures errors, so it necessarily favors a secure, positional, endgame-oriented style. Third, Tal was NOT 'only a winter champion'. This is a misinformed statement. Please read about his illness before the 1961 rematch and his further problems in Curaçao. And his results in 1969-1972 and 1980-1981. It is well known that Keene's works are not a good source of historic information – even when he copies others' books.
RayLopez 4/28/2017 07:54
@WildKid, @Igor Freiberger. I did say that some chess masters get an 'unfair boost' from the algorithm that penalizes bad sacrifices. In another Chessbase article on rating past masters using PCs, it may be the link by The_Tenant, it was pointed out that Capablanca gets high scores (and maybe the other grandmasters mentioned in this thread) because Capa would reduce to the endgame, where there's less chance for "wild moves" that may or may not be sound. Another way of putting it: who is a better chess player, Capa or Tal? Much as I love Tal, I think it's Capa. Tal was only a winter champion and he lost to the solid Botvinnik (Keene & Divinsky's 'Warriors of the Mind' puts Tal behind Botvinnik as well). But in any event, while offense is better than defense in chess, for most time controls, I think the study does not severely handicap attackers that much.

@lajosarpad: another Chessbase article in the archives shows that changing the engine, making it stronger, does not change the relative position of the chess players ranked. This is because seeing extra ply (more moves) does not change the fact that a bad move is usually a bad move, even after three moves.

@fgkdjlkag - yes, out-of-sample predictions are always a remedy to 'p-hacking' statistics tricks, time will tell.

The assumption a lot of people against this paper are making is that in chess a bad move can turn into a good move (a positional sacrifice) many moves later. This is usually not the case, in fact it's quite rare. Masters blunder, and usually a bad move is a bad move. The reason positional sacrifices are so delightful is that they are so rare. Positional sac games like these below are the exception, not the rule.

(Queen sacs that computers initially think make no sense, though, given time, they tend to agree the sacs are sound)

http://www.chessgames.com/perl/chessgame?gid=1139428 Milko Bobotsov vs Mikhail Tal
http://www.chessgames.com/perl/chessgame?gid=1102190 Abram Davidovich Zamikhovsky vs Rashid Gibiatovich Nezhmetdinov
pocketknife 4/28/2017 12:22
Nice play with the numbers, but it is not exactly about good moves. Instead, it is a 'who plays like a machine' competition. As we play against humans, it has little relevance in a chess game.
Mark S 4/28/2017 10:30
This move analysis evaluation using strong chess engines is not a novel idea because I have seen this discussion since Jan 31 2017.
It was discussed on that thread around 3 months ago. I won't be surprised if the same tool of Ferdinand Mosca was used to automate the analysis.
Mark S 4/28/2017 10:24
I agree with KevinC and other posts which point out some weaknesses in this statistical research.
One of these is that there are many positions where there is not one winning move but four or more. Making a sub-optimal move would be flagged as a blunder, but World Champions would choose such a suboptimal move if it is simpler and still surely wins. For Stockfish it would be considered a blunder because its eval score dropped sharply, from +23.00 to +9.00. For humans, +9.00 is perfectly winning and simplifies the position faster than going for the +23.00 move, which is very complex.
Igor Freiberger 4/28/2017 09:11
1. Recent players receive a huge advantage because their opening moves will be much closer to correct than moves played decades ago. The 'conformance' of opening moves does not measure player strength.

2. Lasker, Tal and Topalov are severely punished for their style by this model. And solid positional players are favored.

3. The way used to find the best year of each player is flawed. Number of victories is very relative. Kasparov was playing marvelously in 1986, but his number of victories was limited by the World Championship match against Karpov. Of course, a +1 against Karpov was then much more significant than a +5 in a tournament with weaker opponents.

4. Regardless of all the technical background, one cannot consider this study really valid when it says that Khalifman – who never played a WCH match nor reached the top 5 – would win against Karpov or Botvinnik.
benedictralph 4/28/2017 09:06
Another thing worth noting with respect to this kind of study is that theoretically, "perfect play" by both sides very likely results in a draw. Let that sink in for a while. Especially with regard to claims about "who might have defeated whom". Stockfish is far from a perfect player, by the way.
WildKid 4/28/2017 09:06
RayLopez: You seem to have missed part of my comparison of Tal and Petrosian. In simple, stable positions, almost all moves by both parties will be close to optimal. In highly complex and unbalanced positions, errors on both sides will be more common even if the players are equally good. Therefore, players whose playing style encourages complex and unbalanced positions (e.g. Tal) will be penalized by this metric relative to those who prefer to play simple and stable positions (Petrosian, Capablanca, maybe Carlsen). That was my point - it wasn't principally about 'unsound' sacrifices etc.
lajosarpad 4/28/2017 08:52
Very interesting article. I think this study is very useful, yet in its current state it is very far from reliably comparing the players. However, it is an attempt to fill a void, and I think this attempt is made in the right direction.

Criticism: We need to use several engines for this study to rule out the known and unknown bugs of particular engines. This would still not deal with the so-called common bugs, but there is no known method to effectively address those in this context. The study uses a single metric: the correctness of moves (according to engines). While we know this metric is not 100% reliable in determining the actual correctness, this metric is the best which can be provided (if, again, more engines are being used). However, in agreement with @WildKid I think a single metric is not enough, since it will punish risky play without compensation. I think @RayLopez is not entirely correct when he tells us that Tal's risky play would be ineffective against modern opposition. Risky play is often rewarded by Caissa, just think about the recent example of Kramnik's rook sacrifice, which is probably objectively incorrect, not to mention that if the assumption was correct and Tal would be ineffective, his style would evolve and not necessarily into the assumed direction. We would need metrics like trickiness (a move might appear to be wrong for a high number of plies until the point is clear, leading opponents to wrong conclusions), and we should take into account the minimal number of plies needed to find out that a given move was mistaken. Also, another metric could be an aggregate number of acceptable responses to a move (how much precision is needed to deal with a given player). The shock value suggested by @WildKid is also a nice addition, but it should be rated with reduced weight since it measures the number of times the opponent was successfully tricked, which is unreliable and highly depends on the opposition of the player, not necessarily on the player himself. So while I agree he is right and the shock value should be used, due to its unreliability, we should not overrate it. Objective metrics should have a higher weight, of course.
fgkdjlkag 4/28/2017 06:35
@RayLopez 2) The "p-hacking" argument (data fitting): always valid, but so what? Would you rather ex-post have the theory *not* fit the data?

The point is, this method and Elo could be used on out-of-sample data. Thus we see which is actually better.

Another point - it's incorrect that we know with any statistical probability who would win in historical match-ups, as more recent players have more ideas to study. Give Fischer access to all of Carlsen's games, and the result is not going to be 54-47 in favor of Carlsen, as predicted. Of course all the newer players will generally speaking, seem superior to any past player.
The_Tenant 4/28/2017 04:09
* http://www.alliot.fr/CHESS/ficga.html.en

Look at where Capablanca ranks on the list....

Then look at the Guid and Bratko analysis, where Capablanca consistently ranks in the top three strongest players of all time.

* http://en.chessbase.com/post/the-quality-of-play-at-the-candidates-090413

* http://en.chessbase.com/post/using-che-engines-to-estimate-human-skill
bbrodinsky 4/28/2017 02:19
Khalifman over Karpov is scientific proof that this study is invalid!
tsttst70 4/28/2017 02:12
TL;DR: Khalifman > Karpov.
RayLopez 4/28/2017 01:54
Great paper. I fully agree with it. Good points made by the commentators, which I rebut:

1) The "Tal is penalized" argument, aka 'sub-optimal moves played for a reason'. Tal would not thrive against a fully booked equal player of today. And please, not every unsound move, played for shock value, is 'genius' (if so, I'm a genius, since at club play I almost always sac a piece for two pawns for fun, but in tournament play that won't work). And does anybody still believe in the "Steinitz King" (i.e., don't play O-O)? As GM John Nunn has pointed out, players from long ago played some really bad chess at times.

2) The "p-hacking" argument (data fitting): always valid, but so what? Would you rather ex-post have the theory *not* fit the data?

3) The "Stockfish penalizes players in a won position if they don't win quickly" argument. Logically, this would only penalize a player who gets a won position, then plays sloppily to finally win. Well you say, that's most people. OK, then nobody is being penalized since everybody would get equally penalized, so nobody is unfairly penalized! Logically, the only people who might get an 'unfair advantage' from this factor would be players who always simplify to an endgame then play the best move to win. It's been said that Capablanca was such a player, and another is Kramnik, and perhaps Carlsen is another, so perhaps they would get a slight 'unfair boost' from this factor. But if everybody 'plays sloppy once they won', then there's no unfair advantage to any one person. And, btw, it's not good chess to play sloppily in a won position. I can't count how many games I've lost when up 400 centipawns and played sloppily (I'm just a club player).

An argument not made but common when these papers are discussed: 'you can't use a PC to rate human games' (nonsense, but around 20 years ago this argument was common).

Finally, this type of study is not completely new. If you search Chessbase's archives you'll see others have used a PC to rate moves, it's well known. And indeed it's been found by many others that players of today make fewer blunders than players of yesteryear, hence they must be playing better chess. That said, I personally think the 2700Chess site "all time Elo" list is a bit biased against older players from yesteryear, who did not need to step up their game since they were already clearly so much stronger. Only very strong players, perhaps not present back then, would bring out the best. So I would think Capablanca, Lasker, Fischer would hold their own against the top ten of today.
The_Tenant The_Tenant 4/28/2017 01:36
26 plies seems too shallow a depth for a comprehensive analysis. I think they should have gone to at least 30-32 plies. Any computer chess buffs care to chime in on this?
MikeyBoy MikeyBoy 4/28/2017 12:11
I'm not sure any of the above other than Fischer, Kramnik and Kasparov would have over 50% chance of beating Karpov when he was at his strongest. I don't see the peak Carlsens or Anands standing a chance.

Also when solid individuals like Carlsen, Karpov etc play guys like Tal, Alekhine, Kasparov and Fischer the more unhinged and 'low' quality moves, especially if made nearer the time control, can actually be decisive.

Of course it is very easy for me to sit here and criticize. At the end of the day this is still a brilliant study and has opened up some great debate.
k2a2 k2a2 4/27/2017 11:03
While it is a fascinating study, I think it is more accurate from an era starting from Fischer/Karpov or maybe even Kasparov. Right now, precise play and winning are two very equivalent goals. However, in the past, they were not. The primary goal was to win, and while players tried to do it with precise play, many used the psychology of the opponent and knowingly entered lines that were harder for the opponent, even though they had more solid alternatives.
genem genem 4/27/2017 09:37
Nice study. A more empirical, and therefore more accurate, way to articulate the results of this study involves discarding the term 'quality'. Instead, this study ranks chess players on their 'degree of agreement with Stockfish'.
Computers excel over humans in tactics, but perhaps not in positional play. Therefore the ranking approach used in this fine study would disfavor someone like Petrosian (who was more about positional play than about attacking tactics, relative to his peers).

No one study is perfect. An issue with this study is that it does not account for the playing strength of the opponent. On average, Kasparov faced tougher competition than did Fischer (such as comparing the USSR Champ tournaments, versus the USA Champ tournaments). It is easier to find the best move when playing a weak opponent than when playing a strong opponent.
Perhaps the methodology used in this study could take that into account and compensate accordingly.

I've wondered why nobody did this study years ago.
And I still wonder why no wrapper around generic Fritz (meaning anything like Stockfish or Komodo) yet estimates a player's Elo rating from an input packet of his games (seems do-able).
Exabachay Exabachay 4/27/2017 09:36
This has about zero historical and comparative relevance; you don't compare Newton and Hawking based on pure objective knowledge of physics.
besler besler 4/27/2017 09:06
Definitely an interesting paper.
I have a suggestion for an improvement along the lines of the first suggestion of 'WildKid': The evaluation algorithm for each move should only count against the player if it can show both that the move was sub-optimal, AND that the move CONTRIBUTED TO the eventual loss of the game. So, if the move in question was the beginning of a series of moves which resulted in steadily decreasing evaluations, and eventual loss, this would count against the player. On the other hand, if the move resulted in a sudden, downwards 'Blip' of evaluation, which was isolated and not part of a trend leading to a loss of the game, the evaluation should give the player the 'benefit of the doubt', i.e. it should assume that the move had some positive practical aspects that aren't apparent in the evaluation (e.g. a Tal-like move).
This method might miss some situations where a blunder goes unpunished by the opponent (i.e. two '??' moves in a row), but my assumption would be that a double-blunder type situation is much less common in high level games than the situation I described above, and thus this would be an overall improvement.
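The filter described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `penalized_moves`, the thresholds `drop` and `tolerance`, and the "did the evaluation ever recover?" test are all hypothetical choices made for this sketch, not anything taken from the paper.

```python
from typing import List


def penalized_moves(evals: List[float], lost: bool,
                    drop: float = 0.5, tolerance: float = 0.2) -> List[int]:
    """Return the indices of moves that count against the player.

    evals[0] is the engine evaluation (in pawns, from the player's point of
    view) before the player's first move; evals[i] is the evaluation after
    the player's i-th move. The thresholds are assumptions for illustration.
    """
    if not lost:
        # The move must have contributed to a loss; no loss, no penalty.
        return []
    flagged = []
    for i in range(1, len(evals)):
        before, after = evals[i - 1], evals[i]
        if before - after < drop:
            continue  # not a clearly sub-optimal move
        # An isolated "blip": the evaluation later climbs back to roughly
        # its level before the move, so give the benefit of the doubt.
        recovered = any(e >= before - tolerance for e in evals[i + 1:])
        if not recovered:
            flagged.append(i)
    return flagged


# Hypothetical evaluation trace: one isolated blip (move 2), then a steady slide.
trace = [0.3, 0.2, -0.6, 0.25, 0.1, -0.7, -1.5, -3.0]
print(penalized_moves(trace, lost=True))
```

With this trace the dip at move 2 is forgiven because the evaluation recovers, while the moves that start and continue the final slide are flagged. As noted above, a blunder that the opponent fails to punish would also be forgiven by such a filter.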
elmerdsangalang elmerdsangalang 4/27/2017 08:19
By rating individual players' strength relative to the perfect computer, the ratings will be inflation/deflation-free. We have acquired the means by which the strength of players from different periods can be accurately compared. (e.g. Will the Fischer of 1972 beat the Spassky of 1969?)
diegoami diegoami 4/27/2017 04:03
Well, this does not consider the fact that OTB you may play "practical chess": moves that are not sound, but for which the opponent may not be prepared ("novelties").
Players such as Kasparov, Topalov and Lasker, for example.
timisis timisis 4/27/2017 03:28
Sadly the common man, and quite a few of the specialists too, may miss the futility of such academic work. Not only does it penalize mavericks as pointed out by WildKid, it also penalizes someone like Karpov for winning a game "slowly". In fact, since scoring the point is the point of the game, you can never do much better than working with the results like the ELO system does. Perhaps on a smaller scale, for example if someone is playing great chess but blunders a point or two in a tournament, or conversely is playing bad but is gifted a couple of points, one could adjust the rating for the "luck" factor. But at the very top of the chess food chain, the question is usually if the game ever leaves the "winning zone" or the "drawing margin", and at that the machines are still largely clueless; you can count on Smyslov and Kramnik to have a better idea whether capturing a pawn with opposite-colored bishops allows winning chances or not, or if Karpov is taking another 7 moves regrouping for a pawn break that was probably winning, but now winning "more". And of course we have to factor in the "opposition factor", whether a player is forced to play ultraprecisely or not. So basically we could focus on quantifying the machine's ability to predict whether a position is winning or not, which, by the way, would allow our silicon friends to become even more scary. To predict who would win a match between Smyslov and Carlsen is, on the other hand, laughable, so let's try not to make too many improvements to that predictive method!
benedictralph benedictralph 4/27/2017 02:22
There appear to be things not uncommon between human players that the research does not account for. For instance, mistakes (human fallibility) that either side has a chance of recovering from. This may lead to unbalanced positions (materially) that could go either way. We see this quite often, even between games at the highest levels. A computer engine simply would not allow for this sort of thing. Those positions and games are pruned out of existence in terms of what constitutes the "correct" or "best" move in any given situation. Basically, the research fails to take into account the "art" of human play, for lack of a better word. Just replace Stockfish with Carlsen (and remove Carlsen from the list of players analyzed). How viable does the research look now? Consider then that Stockfish isn't even human yet is being used as a benchmark of sorts for humans. Again, how viable does the research look now? I fail to see the point of this kind of study, sorry to say.
WildKid WildKid 4/27/2017 01:17
I have one further statistical criticism. The authors show that their measure 'predicts' World Championship results better than ELO, and imply that this shows their measure is better than ELO. This is fallacious, since they are retrofitting the very data from which their model is derived.

In general, if we have a large dataset D and a subset S, a reasonable measure based on S will almost always retrofit S better than a reasonable model based on D. For example, a theory T(post) derived to fit the data after an experiment will almost always fit the data better than a theory T(prior) devised before the experiment. That does NOT mean that T(post) is a better theory than T(prior). This fallacy comes up all the time in Evidence-Based Medicine, among other fields.

To make their inference valid, the authors would need to compare to an ELO-like measure based only on the set of games they are using, rather than based on all ELO-valid games.
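A toy illustration of the retrofitting point, using made-up numbers rather than any chess data: the "model" here is simply a mean, and the names `D`, `S`, `sse`, `model_from_S` and `model_from_D` are assumptions chosen for this sketch. A model fitted on the subset S always matches S at least as well as a model fitted on the whole of D, yet it does no better on the data that was left out.

```python
import random
from statistics import mean


def sse(values, prediction):
    """Sum of squared errors of a single constant prediction."""
    return sum((v - prediction) ** 2 for v in values)


random.seed(0)
# D: hypothetical full dataset; S: the small subset the model was built from.
D = [random.gauss(0.5, 0.2) for _ in range(1000)]
S = D[:50]
held_out = D[50:]

model_from_S = mean(S)  # "theory" fitted after looking at S only
model_from_D = mean(D)  # "theory" fitted on everything

# Retrofit on S: the S-based model cannot lose, by construction.
print("fit on S:       ", sse(S, model_from_S), "vs", sse(S, model_from_D))
# Data outside S: the advantage disappears.
print("fit on held-out:", sse(held_out, model_from_S), "vs", sse(held_out, model_from_D))
```

The first comparison favors the S-based model only because it is scored on the data it was fitted to, which is why a fair comparison needs both measures to be built from the same set of games.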
KevinC KevinC 4/27/2017 12:55
Or often second-best moves are the clearest way to win for a human.
anamanam anamanam 4/27/2017 12:04
Statistics is a fantastic science.

1. "it is now possible to predict the outcome of a match between any World Champion from any active year with any other Champion taken in any active year; it is even possible to predict the result of Fischer 1970 against Fischer 1971."
Looking forward to that, too, not only best-year base ranking.

2. "for each player, the “best year” was found by searching for the year where the player had the largest number of victories against all other players and all other years".
How different would it be if the "best year" was defined based on the highest conformance, i.e. the year of best match between actual vs. computer moves?