The following article was translated from the original German by a non-statistician and may contain some terminological inaccuracies, for which we apologise. The intention should be clear. Note too that Jeff Sonas is currently on a business trip to Hyderabad, India, and may not be able to take up the debate for a week or two.
I have a PhD in mathematical statistics, and I would therefore like to make a few comments on Jeff Sonas' article as a statistician rather than as a chess programmer.
The Elo number is the maximum-likelihood estimator of true playing strength. Even under the assumption that the Elo model is true, the Elo number is therefore only an estimate and not the true value, as is the case for all statistical quantities. With only eleven games, as in the case of Brutus, the true value can lie a long way from the estimate. It is therefore standard statistical practice to report the 95% or 99% confidence interval of the estimator as well (the true value lies within this region with 95% or 99% probability). This is what the Swedes do in the SSDF lists.
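As a rough illustration of how wide such an interval is after only eleven games, here is a small sketch of one way to compute it (my own illustration, not from the original article): a Wilson score interval for the mean game result, mapped to an Elo difference through the logistic Elo curve. The 6.5/11 score is invented, and draws are ignored, so a draw-aware interval would probably be somewhat narrower.

    import math

    def elo_diff_from_score(p):
        # Invert the logistic Elo expectation p = 1 / (1 + 10**(-d/400)).
        return -400.0 * math.log10(1.0 / p - 1.0)

    def elo_confidence_interval(points, n, z=1.96):
        # Wilson score interval for the mean result of n games (win = 1,
        # draw = 0.5), mapped to an Elo difference against the opposition.
        p = points / n
        denom = 1.0 + z * z / n
        centre = (p + z * z / (2.0 * n)) / denom
        half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
        lo, hi = max(centre - half, 1e-9), min(centre + half, 1.0 - 1e-9)
        return elo_diff_from_score(lo), elo_diff_from_score(hi)

    # Hypothetical 6.5/11 result against nominally equal opposition.
    print(elo_confidence_interval(6.5, 11))

Even in this made-up example the 95% interval spans several hundred Elo points: eleven games pin down almost nothing.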
Dr Christian ("Chrilly") Donninger. In the Slashdot discussion forum one contributor said he looked like the archetypical "mad scientist".
I cannot immediately tell you what this confidence interval might be, but I estimate that it would reveal no significant difference between the top programs. The standard deviation of the estimator decreases with the square root of the number of games; after 11 games it is therefore still about one third of that of an individual game. Even after 1000 games the Elo value has a standard deviation of about +/- 10 points around the true value. This statistical fact is a very serious problem for chess programmers. Adding a new feature usually improves a program by at most about 10 Elo points, so one needs on the order of 1000 games to determine whether such a feature is really an improvement.
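To make these orders of magnitude concrete, here is a quick back-of-the-envelope sketch (my own, not the author's calculation). It assumes a per-game score standard deviation of 0.5 (no draws; with many draws it is smaller) and uses the slope of the logistic Elo curve at an even score to translate the standard error of the mean result into Elo points.

    import math

    def elo_standard_error(n, score_sd=0.5):
        # Slope of the logistic curve 1 / (1 + 10**(-d/400)) at d = 0 is
        # ln(10)/400 * 0.25 expected points per Elo point.
        slope = math.log(10.0) / 400.0 * 0.25
        return (score_sd / math.sqrt(n)) / slope

    for n in (11, 100, 1000, 10000):
        print(n, round(elo_standard_error(n)))
    # 11 games    -> roughly +/- 105 Elo
    # 1000 games  -> roughly +/- 11 Elo, in line with the figure quoted above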
The standard statistical method is as follows: one defines a null hypothesis, e.g. that all top programs have the same playing strength, and this hypothesis is then either accepted or rejected. One cannot answer this question with the estimator alone, in our case the Elo value. Without the variance and the confidence interval, the Elo value is fairly meaningless.
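As an illustration of what such a test could look like in the simplest case (my own sketch, with invented numbers), one can apply an exact binomial test to the decisive games of a match between two programs: under the null hypothesis of equal strength each decisive game is a coin flip.

    from math import comb

    def equal_strength_p_value(wins_a, decisive):
        # Exact two-sided binomial test of the null hypothesis that both
        # programs are equally strong, using only the decisive games (in this
        # simple model a draw says nothing about which side is stronger).
        pmf = [comb(decisive, k) * 0.5 ** decisive for k in range(decisive + 1)]
        return sum(p for p in pmf if p <= pmf[wins_a] + 1e-12)

    # Invented match: 200 games, 60 decisive, 38 of them won by program A.
    # The p-value comes out near 0.05 -- only borderline evidence of a difference.
    print(equal_strength_p_value(38, 60))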
We have to ask ourselves: do chess programs have the same variance as human players? And, more relevantly: do games between chess programs have the same distribution as games played between human beings? Games between human players and computers also have different dynamics from games between members of the same species. Does Kasparov play at a 2800 level when he faces a computer? Apart from that, there are specialists who have demonstrated an anti-computer rating far higher than their rating in over-the-board play against humans.
I cannot answer these questions. But until they are seriously addressed, and there are statistical tests for doing so, it is difficult to give precise Elo ratings.
The most important point, however, is the following: it is well known that the tails of the normal distribution underestimate the probabilities in chess games. In other words, if two players with rather different Elo ratings play against each other, the chances of the weaker player are underestimated, and conversely the chances of the stronger player are overestimated. Especially at GM level the draw margin is quite large: one must have a considerable advantage to win a game.
The Elo model does not take this fact into account. It is one of the reasons why Garry Kasparov does not play against Elo 2500 players: it would ruin his reputation as the highest-rated Elo player in history. Jeff Sonas has measured exactly this well-known phenomenon, namely the effect of using the normal distribution to calculate Elo ratings. If one can conclude anything from his data, it is this: the stronger the human opponent, the higher the Elo score of the program. Shredder has not become a weaker player; the human opponents in Argentina were simply weaker.
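The shape of the tails is easy to see numerically. The sketch below (my own comparison, not the author's calculation) puts Elo's original normal curve (per-player standard deviation 200, hence 200*sqrt(2) for the difference of two performances) next to the logistic curve with scale 400 that some rating implementations use instead; in the tails the normal curve gives the weaker player noticeably smaller chances, which is the direction of the effect described above.

    import math

    SIGMA = 200.0 * math.sqrt(2.0)   # SD of the difference of two performances

    def expected_score_normal(deficit):
        # Elo's original model: Phi(-deficit / SIGMA), written via erf.
        return 0.5 * (1.0 + math.erf(-deficit / (SIGMA * math.sqrt(2.0))))

    def expected_score_logistic(deficit):
        return 1.0 / (1.0 + 10.0 ** (deficit / 400.0))

    print("deficit  normal   logistic")
    for d in (100, 200, 400, 600, 800):
        print(f"{d:7d}  {expected_score_normal(d):.4f}   {expected_score_logistic(d):.4f}")
    # Up to about a 200-point deficit the two curves almost coincide; at a
    # 600-800 point deficit the normal curve gives the weaker player roughly
    # half or less of the expected score the logistic curve gives.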
Some years ago a Hungarian lady entered the bookstore Wiener Schachverlag. She said she wanted to buy a chess book for her husband, who had "2300 living points". This caused a bit of puzzlement, until a polyglot member of the company figured it out. The Hungarian word "élő" means "living". The good lady had never heard of the Hungarian physicist Prof. Arpad Elo, who invented the Elo system. She simply assumed that her husband had been talking about "living points", and that was the expression she translated into German.