The Sonas Rating Formula – Better than Elo?
by Jeff Sonas
Every three months, FIDE publishes a list of chess ratings for thousands of
players around the world. These ratings are calculated by a formula that Professor
Arpad Elo developed decades ago. This formula has served the chess world quite
well for a long time, but I believe that the time has come to make some significant
changes to that formula.
At the start of August, I participated in a four-day conference in Moscow about
rating systems, sponsored by WorldChessRating. One of the conclusions from this
conference was that an extensive "clean" database of recent games
was needed, in order to run tests on any new rating formula that was developed.
In subsequent weeks, Vladimir Perevertkin collected the raw results from hundreds
of thousands of games between 1994 and 2001, and I have imported that information
into my own database for analysis.
I have experimented with lots of different rating formulas, generating historical
ratings from 1994-2001 based upon those formulas. For instance, we can see what
would have happened if all of the blitz and rapid games were actually included
in the rating calculation, or if different coefficients within the formulas
were adjusted. All of the following suggestions are based upon that analysis.
EXECUTIVE SUMMARY – FOUR MAIN SUGGESTIONS
Suggestion #1: Use a more dynamic K-Factor
I believe that the basic FIDE rating formula is sound, but it does need to
be modified. Instead of the conservative K-Factor of 10 which is currently used,
a value of 24 should be used instead. This will make the FIDE ratings more than
twice as dynamic as they currently are. The value of 24 appears to be the most
accurate K-Factor, as well. Ratings that use other K-Factors are not as successful
at predicting the outcomes of future classical games.
Suggestion #2: Get rid of the complicated Elo table
Elo's complicated table of numbers should be discarded, in favor of a simple
linear model where White has a 100% expected score with a 390-point (or more)
rating advantage, and a 0% expected score with a 460-point (or more) rating
disadvantage. Other expected scores in between can be interpolated along a simple
straight line. Note that this assigns a value of 35 rating points to having
the White pieces, so White will have an expected score of 50% with a 35-point
rating deficit, and an expected score of 54% if the players' ratings are identical.
This model is far more accurate than Elo's table of values. Elo's theoretical
calculations do not match the empirical data from actual results, and do not
take the color of pieces into account either. They also show a statistical bias
against the higher-rated players.
Suggestion #3: Include faster time control games, which receive less weight than a classical game
Classical games should be given their normal importance. Games played at the
"modern" FIDE control are not as significant, and thus should only
be given an 83% importance. Rapid games should be given a 29% importance, and
blitz games an 18% importance. The choice to rate these types of games will
actually improve the ratings' ability to predict the outcome of future classical
games. By using these particular "weights", the ratings will be more
accurate than if rapid and blitz games were completely excluded. The exact values
of 83%, 29%, and 18% have been optimized for maximal accuracy and classical
predictive power of the ratings. If you prefer a more exact definition that
recognizes different types of rapid controls, or one that incorporates increments,
I have included a graph further down which allows you to calculate more precise
coefficients for arbitrary time controls.
Suggestion #4: Calculate the ratings monthly rather than quarterly
There is no reason why rating lists need to be out of date. A monthly interval
is quite practical, considering that the calculation time for these ratings
is almost negligible. The popularity of the Professional ratings shows that
players prefer a more dynamic and more frequently-updated list.
A SIMPLER FORMULA
In some ways, the Elo approach is already very simple. Whenever a "rated"
game of chess is played, the difference in FIDE ratings is checked against a
special table of numbers to determine what each player's "predicted"
score in the game should be. If you do better than that table predicts, your
rating will increase by a proportionate amount. If you do worse than "predicted",
your rating will decrease correspondingly.
Let's say, for instance, that you have a rating of 2600, and you play a 20-game
match against somebody rated 2500. In these games, your rating advantage is
100 points. The sacred Elo table of numbers tells us that your predicted score
in that match is 12.8/20. Thus if you actually score +5 (12.5/20), that would
be viewed as a slightly sub-par performance, and your rating would decrease
by 3 points as a result.
However, the unspoken assumption here is that the special table of numbers
is accurate. Today's chess statistician has the advantage of incredible computing
power, as well as millions of games' worth of empirical evidence. Neither of
these resources was available to Elo at the time his table of numbers was proposed.
Thus it is possible, today, to actually check the accuracy of Elo's theory.
Here is what happens if you graph the actual data:

Elo's numbers (represented by the white curve) came from a theoretical calculation.
(If you care about the math, Elo's 1978 book tells us that the numbers are based
upon the distribution of the difference of two Gaussian variables with identical
variances but different means.) This inverse exponential distribution is so
complicated that there is no way to provide a simple formula predicting the
score from the two players' ratings. All you can do is consult the special table
of numbers.
I don't know why it has to be so complicated. Look at the blue line in my graph.
A straight line, fitted to the data, is clearly a more accurate depiction of
the relationship than Elo's theoretical curve. Outside of the +/- 350 range,
there is insufficient data to draw any conclusions, but this range does include
well over 99% of all rated games. I have a theory about where Elo's calculations
may have gone astray (having to do with the uncertainty of rating estimates),
but the relevant point is that there is considerable room for improvement in
Elo's formula.
Why do we care so much about this? Well, a player's rating is going to go up
or down, based on whether the player is performing better than they "should"
be performing. If you tend to face opponents at the same strength as you, you
should score about 50%; your rating will go up if you have a plus score, and
down if you have a minus score. However, what if you tend to face opponents
who are 80-120 points weaker than you? Is a 60% score better or worse than predicted?
What about a 65% score? More than half of the world's top-200 actually do have
an average rating advantage of 80-120 points, across all of their games, so
this is an important question.
Let's zoom into that last graph a little bit (also averaging White and Black
games together). The white curve in the next graph shows you your predicted
score from the Elo table, if you are the rating favorite by 200 or fewer points.
That white curve is plotted against the actual data, based on 266,000 games
between 1994 and 2001, using the same colors as the previous graph:

There is a consistent bias in Elo's table of numbers against the higher-rated
player. To put it bluntly, if you are the higher-rated player, a normal performance
will cause you to lose rating points. You need an above-average performance
just to keep your rating level. Conversely, if you are the lower-rated player,
a normal performance will cause you to gain rating points.
For instance, in that earlier example where you had a rating of 2600 and scored
12.5/20 against a 2500-rated opponent, you would lose a few rating points. As
it turns out, your 12.5/20 score was actually a little BETTER than would be
expected from the ratings. Using the blue line in the last graph, you can see
that a 100-point rating advantage should lead to a score slightly over 61%,
and you actually scored 62.5%. Thus, despite a performance that was slightly
above par, you would actually lose rating points, due to the inaccuracy of Elo's
table of numbers.
It may seem trivial to quibble over a few rating points, but this is a consistent
effect which can have large cumulative impact over time. For instance, it appears
that this effect cost Garry Kasparov about 15 rating points over the course
of the year 2000, and the same for Alexei Shirov. With their very high ratings,
each of those players faced opposition that (on average) was weaker by 80-120
points, and so the ratings of both Kasparov and Shirov were artificially diminished
by this effect.
In contrast, Vladimir Kramnik also had a high rating in 2000, but due to his
large number of games against Kasparov during that year, Kramnik's average rating
advantage (against his opponents) was far smaller than Kasparov's or Shirov's.
Thus, this bias only cost Kramnik 1 or 2 rating points over the course of the
year 2000.
The bias also has an effect on the overall rating pool. It compresses the ratings
into a smaller range, so the top players are underrated and the bottom players
are overrated. Players who tend to be the rating favorites in most of their
games (such as the top-100 or top-200 players) are having their ratings artificially
diminished due to this effect. Thus the rise in grandmaster ratings that we
have seen in recent years would have been even greater had a more accurate
rating system been in place. You will see an illustration of this later on,
when we look at some monthly top-ten lists since 1997 using various rating formulas.
It's great to have some sort of scientific justification for your formula,
as Professor Elo did, but it seems even more important to have a formula which
is free of bias. It shouldn't matter whether you face a lot of stronger, weaker,
or similar-strength opponents; your rating should be as accurate an estimate
of your strength as possible, and this simply does not happen with Elo's formula.
My "linear model" is much simpler to calculate, easier to explain,
significantly more accurate, and shows less bias.
A MORE DYNAMIC FORMULA
For all its flaws, the Elo rating formula is still a very appealing one. Other
rating systems require more complicated calculations, or the retention of a
large amount of historical game information. However, the Professional ratings
are known to be considerably more dynamic than the FIDE ratings, and for this
reason most improving players favor the Professional ratings. For instance,
several months ago Vladimir Kramnik called the FIDE ratings "conservative
and stagnant".
Nevertheless, it is important to realize that there is nothing inherently "dynamic"
in Ken Thompson's formula for the Professional ratings. And there is nothing
inherently "conservative" in Arpad Elo's formula for the FIDE ratings.
In each case there is a numerical constant, used within the calculation, which
completely determines how dynamic or conservative the ratings will be.
In the case of the Elo ratings, this numerical constant is the attenuation
factor, or "K-Factor". In case you don't know, let me briefly explain
what the K-Factor actually does. Every time you play a game, there is a comparison
between what your score was predicted to be, and what it actually was. The difference
between the two is multiplied by the K-Factor, and that is how much your rating
will change. Thus, if you play a tournament and score 8.5 when you were predicted
to score 8.0, you have outperformed your rating by 0.5 points. With a K-Factor
of 10, your rating would go up by 5 points. With a K-Factor of 32, on the other
hand, your rating would go up by 16 points.
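The K-Factor arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the update rule as stated in the text; the function name rating_change is an assumption made for this example, and the predicted scores are taken as given rather than looked up in any table.

```python
def rating_change(actual_score, predicted_score, k_factor):
    """Elo-style update: K times the gap between actual and predicted score."""
    return k_factor * (actual_score - predicted_score)


# Tournament example from the text: scored 8.5 against a prediction of 8.0.
for k in (10, 24, 32):
    print(k, rating_change(8.5, 8.0, k))

# Earlier match example: 12.5/20 against a predicted 12.8/20 with K = 10.
print(round(rating_change(12.5, 12.8, 10), 1))
```

With K = 10 the tournament result is worth 5 points and with K = 32 it is worth 16 points, matching the figures above; the match example comes out at a loss of 3 points.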
In the current FIDE scheme, a player will forever have a K-Factor of 10, once
they reach a 2400 rating. With a K-Factor of 5, the FIDE ratings would be far
more conservative. With a K-Factor of 40, they would leap around wildly, but
the ratings would still be more accurate than the current ratings. The particular
choice of 10 is somewhat arbitrary and could easily be doubled or tripled without
drastic consequences, other than a more dynamic (and more accurate) FIDE rating
system.
As an example of how the K-Factor affects ratings, consider the following graph
for Viktor Korchnoi's career between 1980 and 1992. Using the MegaBase CD from
Chessbase, I ran some historical rating calculations using various K-Factors,
and this graph shows Korchnoi's rating curve for K-Factors of 10, 20, and 32.
Note that these ratings will differ from the actual historical FIDE ratings,
since MegaBase provides a different game database than that used by the FIDE
ratings.

You can see that the red curve (K-Factor of 10) is fairly conservative, slower
to drop during 1982-3 when Korchnoi clearly was declining, and remaining relatively
constant from 1985 through 1992, almost always within the same 50-point range.
For a K-Factor of 20, however, Korchnoi's rating jumps around within a 100-point
range over the same 1985-1992 period (see the blue curve), whereas with a K-Factor
of 32 there is almost a 200-point swing during those years (see the yellow curve).
Thus the K-Factor can easily cause an Elo formula to be either very conservative
or very dynamic.
For the Thompson formula, there is also a numerical constant which determines
how dynamic the ratings will be. The current Professional ratings use a player's
last 100 games, with the more recent games weighted more heavily. If they used
the last 200 games instead, the ratings would be sluggish and resistant to change.
If they used the last 50 games, they would be even more dynamic. You might think
that Professional ratings using only the last 50 games would be far more dynamic
than any reasonable Elo-style formula, but in fact the Elo formula with a K-Factor
of 32 seems to be even more dynamic than a Thompson formula which uses only
the last 50 games. Take a look at the career rating curve for Jan Timman from
1980 to 1992, using those two different formulas. Again, I did these calculations
myself, using data from MegaBase 2000.

It is clear that the red curve (Elo-32) is even more dynamic than the blue
curve (Thompson-50), with higher peaks and lower valleys. However, it should
also be clear that the two rating systems are very similar. If you could pick
the right numerical constants, the Thompson and Elo formulas would yield extremely
similar ratings. In these examples, I chose Korchnoi and Timman more or less
at random; my point was to show that there is nothing inherently "dynamic"
about the Professional ratings or "conservative" about the FIDE ratings.
It is really almost a mathematical accident that they are this way, unless perhaps
the initial Thompson formula was specifically intended to be more dynamic than
FIDE's ratings.
So, it is clear that the FIDE ratings could be made more dynamic simply by
increasing the K-Factor. Is this a good idea?
In an attempt to answer this question, I have run many rating calculations
for the time period between 1994 and 2001, using various formulas. In each case,
I retroactively determined how accurate the ratings were at predicting future
results. Based on those calculations, it became possible to draw a curve showing
the relationship between K-Factor and accuracy of the ratings:

It appears that a K-Factor of 24 is optimal. For smaller values, the ratings
are too slow to change, and so ratings are not as useful in predicting how well
players will do each month. For larger values, the ratings are too sensitive
to recent results. In essence, they "over-react" to a player's last
few events, and will often indicate a change in strength when one doesn't really
exist. You can see from this graph that even using a super-dynamic K-Factor
of 40 would still result in greater accuracy than the current value of 10.
RAPID AND BLITZ
Recent years have seen an increased emphasis on games played at faster time
controls. Official FIDE events no longer use the "classical" time
controls, and rapid and blitz games are regularly used as tiebreakers, even
at the world championship level. There are more rapid events than ever, but
rapid and blitz games are completely ignored by the master FIDE rating list.
Instead, a separate "rapid" list, based on a small dataset, is maintained
and published infrequently and sporadically.
For now, to keep things simple, I want to consider only four classifications
of time controls. The "Classical" time control, of course, refers
to the traditional time controls of two hours for 40 moves, one hour for 20
moves, and then half an hour for the rest of the game. "Modern" (FIDE)
controls are at least 90 minutes per player per game, up to the Classical level.
"Blitz" controls are always five-minute games with no increments,
and "Rapid" has a maximum of 30 minutes per player per game (or 25
minutes if increments are used). I understand that these four classifications
don't include all possible time controls (what about g/60, for instance?). However,
please be patient. I will get to those near the end of this article.
The question of whether to rate faster games, and whether to combine them all
into a "unified" list, is a very controversial topic. I don't feel
particularly qualified to talk about all aspects of this, so as usual I will
stick to the statistical side. Let's go through the argument, point-by-point.
(1) I am trying to come up with a "better" rating formula.
(2) By my definition, a rating formula is "better" if it is more accurate
at predicting future classical games.
(3) The goal is to develop a rating formula with "optimal" classical
predictive power.
(4) Any data which significantly improves the predictive power of the rating
should be used.
(5) If ratings that incorporate faster-time-control games are actually "better"
at predicting the results of future classical games, then the faster games should
be included in the rating formula.
It is clear that Modern, Rapid, and Blitz games all provide useful information
about a player's ability to play classical chess. The statistics confirm that
conclusion. However, the results of a single Classical game are more significant
than the results of a single Modern game. Similarly, the results of a single
Modern game are more significant than the results of a single Rapid game, and
so on.
If we were to count all games equally, then a 10-game blitz tournament, played
one afternoon, would count the same as a 10-game classical tournament, played
over the course of two weeks. That doesn't feel right, and additionally it would
actually hurt the predictive power of the ratings, since they would be unduly
influenced by the blitz results. Thus it appears that the faster games should
be given an importance greater than zero, but less than 100%.
This can be accomplished by assigning "coefficients" to the various
time controls, with Classical given a coefficient of 100%. For example, let's
say you did quite well in a seven-round Classical tournament and as a result
you would gain 10 rating points. What if you had managed the exact same results
in a seven-round Rapid tournament instead? In that case, if the coefficient
for Rapid time controls were 30%, then your rating would only go up by 3 points,
rather than 10 points.
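As a rough sketch of how such coefficients would act, the rating change that a result would earn at the Classical control is simply scaled down. The names COEFFICIENTS and scaled_change are assumptions made for this example; the percentages are the ones quoted in the executive summary, and the 30% figure is the hypothetical value used in the paragraph above.

```python
# Coefficients quoted in the executive summary of this article.
COEFFICIENTS = {"classical": 1.00, "modern": 0.83, "rapid": 0.29, "blitz": 0.18}


def scaled_change(classical_change, coefficient):
    """Scale the rating change a result would earn at the Classical control."""
    return classical_change * coefficient


# Hypothetical 30% Rapid coefficient from the example above: 10 points become 3.
print(round(scaled_change(10, 0.30), 2))

# The same 10-point result under each of the summary's coefficients.
for name, coefficient in COEFFICIENTS.items():
    print(name, round(scaled_change(10, coefficient), 2))
```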
How should those coefficients be determined? The question lies somewhat outside
of the realm of statistics, but I can at least answer the statistical portion
of it. Again, I must return to the question of accuracy and predictive power.
If we define a "more accurate" rating system as one which does a better
job of predicting future outcomes than a "less accurate" rating system,
then it becomes possible to try various coefficients and check out the accuracy
of predictions for each set. Data analysis would then provide us with "optimal"
coefficients for each time control, leading to the "optimal" rating
system.
Before performing the analysis, my theory was that a Modern (FIDE) time control
game would provide about 70%-80% as much information as an actual classical
game, a rapid game would be about 30%-50%, and a blitz game would be about 5%-20%.
The results of the time control analysis would "feel" right if it
identified coefficients that fit into those expected ranges. Here were the results:

The "optimal" value for each coefficient appears as the peak of each
curve. Thus you can see that a coefficient of 83% for Modern is ideal, with
other values (higher or lower) leading to less accurate predictions in the ratings.
Similarly, the optimal value for Blitz is 18%, and the optimal value for Rapid
is 29%. Not quite in the ranges that I had expected, but nevertheless the numbers
seem quite reasonable.
A MORE ACCURATE FORMULA
To summarize, here are the key features of the Sonas rating formula:
(1) Percentage expectancy comes from a simple linear formula:
White's %-score = 0.541767 + 0.001164 * White rating advantage, treating White's
rating advantage as +390 if it is better than +390, or -460 if it is worse than
-460.
(2) Attenuation factor (K-Factor) should be 24 rather than 10.
(3) Give Classical games an importance of 100%, whereas Modern games are 83%,
Rapid games are 29%, and Blitz games are 18%. Alternatively, use the graph at
the end of this article to arrive at an exact coefficient which is specific
to the particular time control being used.
(4) Calculate the rating lists at the end of every month.
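The linear expectancy in point (1) can be written out as a minimal Python sketch. The constants come straight from the formula above; the function names, and the idea of averaging one game with each color to compare against the earlier 100-point example, are assumptions made for illustration.

```python
INTERCEPT = 0.541767  # White's expected score when the ratings are equal
SLOPE = 0.001164      # change in expected score per rating point


def white_expected_score(white_rating, black_rating):
    """Linear expectancy, with the advantage clamped to the range -460..+390."""
    advantage = white_rating - black_rating
    advantage = max(-460, min(390, advantage))
    return INTERCEPT + SLOPE * advantage


def favorite_expected_score(advantage):
    """Average expected score for the favorite over one White and one Black game."""
    as_white = white_expected_score(advantage, 0)
    as_black = 1.0 - white_expected_score(0, advantage)
    return (as_white + as_black) / 2


print(round(white_expected_score(2600, 2600), 4))  # equal ratings
print(round(favorite_expected_score(100), 4))      # 100-point favorite
```

For a 100-point favorite, the color-averaged value is about 0.6164, which agrees with the "slightly over 61%" read from the blue line earlier in the article.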
This formula was specifically optimized to be as accurate as possible, so it
should come as no surprise that the Sonas ratings are much better at predicting
future classical game outcomes than are the existing FIDE ratings. In fact,
in every single month that I looked at, from January 1997 through December 2001,
the total error (in predicting players' monthly scores) was higher for the FIDE
ratings than for the Sonas ratings:

How can I claim that the Sonas ratings are "more accurate" or "more
effective at predicting"? I went through each month and used the two sets
of ratings to predict the outcome of every game played during that month. Then,
at the end of the month, for each player, I added up their predicted score using
the Elo ratings, and their predicted score using the Sonas ratings. Each of
those rating systems had an "error" for the player during that month,
which was the absolute difference between the player's actual total score and
the rating system's predicted total score.
For example, in April 2000 Bu Xiangzhi played 18 classical games, with a +7
score for a total of 12.5 points. Based on his rating and his opponents' ratings
in those games, the Elo rating system had predicted a score of 10.25, whereas
the Sonas rating system had predicted a score of 11.75. In this case, the Elo
error would be 2.25, whereas the Sonas error would be 0.75. By adding up all
of the errors, for all players during the month, we can see what the total error
was for the Sonas ratings, and also for the Elo ratings. Then we can compare
them, and see which rating system was more effective in its predictions of games
played during that month. In the last graph, you can see that the Sonas ratings
turned out to be more effective than the Elo ratings in every single one of
the 60 months from January 1997 to December 2001.
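The error measure described above is just a sum of absolute differences, and can be sketched as follows. The function name total_error and the list-of-pairs input format are assumptions made for this example; the only data used is the Bu Xiangzhi case quoted in the text.

```python
def total_error(monthly_results):
    """Sum of |actual - predicted| over players; each item is (actual, predicted)."""
    return sum(abs(actual - predicted) for actual, predicted in monthly_results)


# Bu Xiangzhi, April 2000: actual 12.5, Elo predicted 10.25, Sonas predicted 11.75.
elo_error = total_error([(12.5, 10.25)])
sonas_error = total_error([(12.5, 11.75)])
print(elo_error, sonas_error)
```

In a full comparison the list would hold one pair per player for the month, and the two totals would be compared month by month.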
You are probably wondering what the top-ten list would look like, if the Sonas
formula were used instead of the Elo formula. Rather than giving you a huge
list of numbers, I'll give you a few pictures instead.
First, let's look at the "control group", which is the current Elo
system (including only Classical and Modern games). These ratings are based
upon a database of 266,000 games covering the period between January 1994 and
December 2001. The game database is that provided by Vladimir Perevertkin, rather
than the actual FIDE-rated game database, and these ratings are calculated 12
times a year rather than 2 or 4. Thus the ratings shown below are not quite
the same as the actual published FIDE ratings, but they do serve as an effective
control group.

Next, you can see the effect of a higher K-Factor. Using a K-Factor of 24 rather
than 10, players' ratings are much more sensitive to their recent results. For
instance, you can see Anatoly Karpov's rating (the black line) declining much
more steeply in the next graph. Similarly, with the more dynamic system, Garry
Kasparov dropped down very close to Viswanathan Anand after Linares 1998. In
fact, Kasparov briefly fell to #3 on this list in late 2000, after Kramnik defeated
him in London and then Anand won the FIDE championship. And Michael Adams was
very close behind at #4.

Finally, by examining the next graph, you can see the slight effect upon the
ratings if faster time controls are incorporated. In the years between 1994
and 1997, Kasparov and Anand did even better at rapid chess than at classical
chess, and so you can see that their ratings are a little bit higher when rapid
games are included. Some other players show some differences, but not significant
ones. In general, the two graphs are almost identical.

You might also notice that the ratings based upon a linear model with a K-Factor
of 24 are about 50 points higher than the ratings with the current formula.
As I mentioned previously, this is mostly due to a deflationary effect in the
current formula, rather than an inflationary effect in the linear model. Since
there is an unintentional bias against higher-rated players in the Elo table
of numbers, the top players are having their ratings artificially depressed
in the current system. This bias would be removed through the use of my linear
model.
It is unsurprising that a rating system with a higher K-Factor would have some
inflation, though. If a player does poorly over a number of events and then
stops playing, they will have "donated" rating points to the pool
of players. Perhaps someone scored 30/80 rather than the predicted 40/80, over
a few months. In the current system, they would have donated 100 points to the
pool, whereas with a K-Factor of 24, it would have been 240 points instead.
Since a very successful player will probably keep playing, while a very unsuccessful
player might well stop playing, this will have an inflationary effect on the
overall pool. Of course, this is a very simplistic explanation and I know that
the question of inflation vs. deflation is a very complicated one.
I am not suggesting that we suddenly recalculate everyone's rating and publish
a brand-new rating list. For one thing, it's not fair to retroactively rate
games that were "unrated" games at the time they were played. By showing
you these graphs, I am merely trying to illustrate how my rating system would
behave over time. Hopefully this will illustrate what it would mean to have
a K-Factor of 24 rather than 10, and you can also see the impact of faster time
controls.
For the sake of continuity of the "official" rating list, it seems
reasonable that if this formula were adopted, everyone should retain their previous
rating at the cut-over point. Once further games were played, the ratings would
begin to change (more rapidly than before) from that starting point.
OTHER TIME CONTROLS
The above conclusions about time controls were based upon only four different
classifications: Blitz, Rapid, Modern, and Classical. However, those classifications
do not include all typical time controls. For instance, Modern has a minimum
of 90 minutes per player per game, whereas Rapid has a maximum of 30 minutes
per player per game. Ideally, it would be possible to incorporate the coefficients
for these four classifications into a "master list" which could tell
you what the coefficient should be for g/60, or g/15 vs. g/30 for that matter.
I did a little bit of analysis on some recent TWIC archives, and determined
that about 50% of games last between 30 and 50 moves, with the average game
length being 37 moves. I therefore defined a "typical" game length
as 40 moves, and then looked at how much time a player would use in a "typical"
game in various time controls, if they used their maximum allowable time to
reach move 40.
This means a player would spend 5 minutes on a typical Blitz game, 5-30 minutes
on a typical Rapid game, 90-120 minutes on a typical Modern game, and 120 minutes
on a typical Classical game. Finally, I graphed my earlier coefficients of 18%,
29%, 83%, and 100% against the typical amount of time used, and arrived at the
following important graph:

This sort of approach (depending upon the maximum time used through 40 moves)
is really useful because it lets you incorporate increments into the formula.
A blitz game where you have 5 minutes total will obviously count as a 5-minute
game in the above graph, and you can see that the coefficient would be 18%.
A blitz game where you get 5 minutes total, plus 15 seconds per move, would
in fact typically be a 15 minute game (5 minutes + 40 moves, at one extra minute
per four moves = 15 minutes), and so the recommended coefficient would be 27%
instead for that time control.
The very common time control of 60 minutes per player per game would of course
count as a 60-minute game, and you can see that this would be 55%. And the maximum
coefficient of 100% would be reached by a classical time control where you get
a full 120 minutes for your first 40 moves.
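The "typical game" calculation can be sketched in Python as below. The typical_minutes function follows the text directly (base time plus 40 moves of increment). The coefficient function is only a stand-in: it interpolates linearly between the four values quoted above (18% at 5 minutes, 27% at 15, 55% at 60, 100% at 120), which is an assumption and not a reading of the actual curve in the graph.

```python
TYPICAL_MOVES = 40  # "typical" game length used in this article


def typical_minutes(base_minutes, increment_seconds=0):
    """Time available to reach move 40: base time plus 40 moves of increment."""
    return base_minutes + TYPICAL_MOVES * increment_seconds / 60


# (minutes, coefficient) pairs quoted in the text.
KNOWN_POINTS = [(5, 0.18), (15, 0.27), (60, 0.55), (120, 1.00)]


def coefficient(minutes):
    """Piecewise-linear stand-in for the graph (an assumption, not the real curve)."""
    if minutes <= KNOWN_POINTS[0][0]:
        return KNOWN_POINTS[0][1]
    if minutes >= KNOWN_POINTS[-1][0]:
        return KNOWN_POINTS[-1][1]
    for (x0, y0), (x1, y1) in zip(KNOWN_POINTS, KNOWN_POINTS[1:]):
        if x0 <= minutes <= x1:
            return y0 + (y1 - y0) * (minutes - x0) / (x1 - x0)


print(typical_minutes(5, 15))                          # 5 minutes + 15 s/move
print(round(coefficient(typical_minutes(5, 15)), 2))
print(round(coefficient(60), 2))
```

Values between the quoted points should be treated as placeholders; the graph itself remains the reference for any particular time control.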
CONCLUSION
It is more important than ever before for ratings to be accurate. In the past,
invitations to Candidate events were based upon a series of qualification events.
Now, however, invitations and pairings are often taken directly from the rating
list. The field for the recent Dortmund candidates' tournament was selected
by averaging everyone's FIDE and Professional ratings into a combined list,
and then picking the top players from that list. For the first time, a tournament
organizer has acknowledged that the FIDE ratings are not particularly accurate,
and that a different formula might work better.
The FIDE ratings are way too conservative, and the time control issue also
needs to be addressed thoughtfully. I know that this is an extremely tricky
issue, and it would be ridiculous to suggest that it is simply a question of
mathematics. If change does come about, it will be motivated by dozens of factors.
Nevertheless, I hope that my efforts will prove useful to the debate. I also
hope you agree with me that the "Sonas" formula described in this
article would be a significant improvement upon the "Elo" formula
which has served the chess world so well for decades.
Please send me e-mail at jeff@chessmetrics.com if you have any questions, comments,
or suggestions. In addition, please feel free to distribute or reprint text
or graphics from this article, as long as you credit the original author (that's
me).
