Some remarks on Jeff Sonas' article
By Chrilly Donninger
The following article was translated from the original
German by a non-statistician and may contain some terminological inaccuracies,
for which we apologise. The intention should be clear. Note too that Jeff
Sonas is currently on a business trip to Hyderabad, India, and may not be
able to take up the debate for a week or two.
I have a PhD in mathematical statistics. I would therefore like to comment on Jeff Sonas' article as a statistician rather than as a chess programmer.
The Elo number is the maximum-likelihood estimator of true playing strength.
Even under the assumption that the Elo model is true, the Elo number is therefore
only an estimate and not the true value, as is the case for all statistical
quantities. With only eleven games, as in the case of Brutus, the true value can
lie a long way from the estimate. It is therefore standard statistical
practice to report the 95% or 99% confidence interval of the estimator as well
(the true value lies within this region with 95% or 99% probability). This is what the
Swedes do in the SSDF lists.

Dr Christian ("Chrilly") Donninger. In the Slashdot
discussion forum one contributor said he looked like the archetypical
"mad scientist".
I cannot immediately tell you what this confidence interval might be, but I
estimate that it will reveal that there is no significant difference between
the top programs. The standard deviation of the estimator decreases with the square root
of the number of games. After 11 games it is therefore about one third
of that of an individual game. Even after 1000 games the Elo value still has an uncertainty of
about +/-10 points around the true value. This statistical fact is a very serious
problem for chess programmers. Adding a new feature usually improves a program
by 10 Elo points at best. One needs 1000 games to determine whether this new
feature is really an improvement.
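The square-root rule can be checked with a rough calculation. The following is only a sketch under stated assumptions, none of which come from the article: a single game result (1, 0.5 or 0) is assumed to have a standard deviation of 0.4, the two opponents are assumed to be about equally strong, and the score is converted to Elo points with the classical normal model in which the rating difference has a standard deviation of 200 times the square root of 2. The function name `elo_standard_error` is hypothetical.

```python
import math
from statistics import NormalDist

# ASSUMPTION: classical normal Elo model, each player's performance has
# a standard deviation of 200 points, so the difference has 200 * sqrt(2).
SIGMA_DIFF = 200 * math.sqrt(2)


def elo_standard_error(n_games, per_game_sd=0.4, score=0.5):
    """Approximate standard error (in Elo points) after n_games.

    per_game_sd is an ASSUMED standard deviation of a single game result;
    0.5 would be the value for a 50% score with no draws at all.
    """
    # Standard error of the average score shrinks with sqrt(n_games).
    se_score = per_game_sd / math.sqrt(n_games)
    # Delta method: slope d(Elo)/d(score) of the inverse normal curve.
    z = NormalDist().inv_cdf(score)
    slope = SIGMA_DIFF / NormalDist().pdf(z)
    return se_score * slope


for n in (11, 100, 1000):
    se = elo_standard_error(n)
    print(f"{n:5d} games: standard error ~ {se:6.1f} Elo, "
          f"95% interval ~ +/-{1.96 * se:6.1f} Elo")
```

Under these assumptions the standard error after 1000 games comes out a little under 10 Elo points, and after 11 games it is roughly ten times larger. The exact figures depend on the assumed draw rate.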
The standard statistical method is as follows: one defines the null
hypothesis, e.g. that all top programs have the same playing strength. This hypothesis
is then either retained or rejected. One cannot answer this question with the estimator alone (in our case the Elo value). Without the variance and the confidence
interval the Elo value is fairly meaningless.
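As an illustration of such a test, here is a minimal sketch of a two-sided z-test of the hypothesis "both sides are equally strong", i.e. an expected score of 50%. The function name `equal_strength_test` and the game results below are hypothetical and are not taken from any real match. The normal approximation used here is crude for as few as eleven games, so the numbers should be read as an indication only.

```python
import math
from statistics import NormalDist


def equal_strength_test(wins, draws, losses):
    """Two-sided z-test of the null hypothesis 'expected score = 50%'.

    Returns (average score, z statistic, p-value). Uses a normal
    approximation and needs at least two different game results.
    """
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Sample variance of the individual game results (1, 0.5, 0).
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / (n - 1)
    se = math.sqrt(var / n)
    z = (mean - 0.5) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return mean, z, p_value


# HYPOTHETICAL results: the same winning tendency at two sample sizes.
for result in ((4, 5, 2), (400, 300, 300)):
    mean, z, p = equal_strength_test(*result)
    verdict = "rejected" if p < 0.05 else "not rejected"
    print(f"W/D/L = {result}: score {mean:.3f}, z = {z:.2f}, "
          f"p = {p:.4f}, equal strength {verdict} at the 5% level")
```

With the hypothetical eleven-game result the hypothesis of equal strength cannot be rejected, while a similar tendency over 1000 games would be clearly significant.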
We have to ask ourselves: do chess programs have the same variance as human
players? And more relevantly: do games between chess programs have the same
distribution as those played between human beings? Games between human players
and computers also have different dynamics to those between creatures of the
same species. Does Kasparov play at a 2800 level when he faces a computer? Apart
from that there are specialists who have demonstrated that they have a far higher
anti-computer rating than in over-the-board human play.
I cannot answer these questions. But until they are seriously addressed (and there are statistical tests for doing so), it is difficult to give precise
Elo ratings.
The most important point, however, is the following: it is well known that
the tails of the normal distribution underestimate the actual probabilities in chess
games. In other words, if two players with rather different Elo ratings play
against each other, the chances of the weaker player are underestimated, and
conversely the chances of the stronger player are overestimated. Especially at GM level
the draw margin is quite large. One must have a considerable advantage to win
the game.
The Elo model does not take this fact into account. It is one of the reasons
that Garry Kasparov does not play against Elo 2500 players. It would ruin his
reputation as the highest-rated player in history. Jeff Sonas
has measured exactly this well-known phenomenon, namely the consequence of using the normal distribution
for calculating Elo ratings. If one can conclude anything from his data, it
is this: the stronger the human opponent, the higher the Elo score of the program.
Shredder has not become a weaker player; the human opponents in Argentina were
simply weaker.
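To make the normal-distribution model concrete, the sketch below computes the score it predicts for the stronger player at various rating gaps. It assumes the classical formulation in which each player's performance is normally distributed with a standard deviation of 200 points, so that the difference has a standard deviation of 200 times the square root of 2. That parameter and the function name `expected_score_normal` are assumptions for illustration and are not given in the article.

```python
import math
from statistics import NormalDist

# ASSUMPTION: classical normal Elo model with a per-player standard
# deviation of 200 points, hence 200 * sqrt(2) for the difference.
SIGMA_DIFF = 200 * math.sqrt(2)


def expected_score_normal(rating_diff):
    """Score the normal model predicts for the higher-rated player."""
    return NormalDist(0.0, SIGMA_DIFF).cdf(rating_diff)


for diff in (0, 100, 200, 300, 400, 500):
    strong = expected_score_normal(diff)
    print(f"rating gap {diff:3d}: stronger player {strong:.3f}, "
          f"weaker player {1.0 - strong:.3f}")
```

The claim in the text is that in practice the weaker player scores better than this model predicts, particularly at large rating gaps.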
An Elo anecdote
Some years ago a Hungarian lady entered the bookstore Wiener Schachverlag.
She said she wanted to buy a chess book for her husband, who had "2300
living points". This caused a bit of puzzlement, until a polyglot member
of the company figured it out. The word "élő" in Hungarian means
"living". The good lady had never heard of the Hungarian-born physicist
Prof. Arpad Elo, who invented the Elo system. She simply assumed that her
husband had been talking about "living points". That was the expression
she translated into German.