The Sonas Rating Formula – Better than Elo?
by Jeff Sonas
Every three months, FIDE publishes a list of chess ratings for thousands of players around the world. These ratings are calculated by a formula that Professor Arpad Elo developed decades ago. This formula has served the chess world quite well for a long time, but I believe that the time has come to make some significant changes to that formula.
At the start of August, I participated in a four-day conference in Moscow about rating systems, sponsored by WorldChessRating. One of the conclusions from this conference was that an extensive "clean" database of recent games was needed, in order to run tests on any new rating formula that was developed. In subsequent weeks, Vladimir Perevertkin collected the raw results from hundreds of thousands of games between 1994 and 2001, and I have imported that information into my own database for analysis.
I have experimented with lots of different rating formulas, generating historical ratings from 1994-2001 based upon those formulas. For instance, we can see what would have happened if all of the blitz and rapid games were actually included in the rating calculation, or if different coefficients within the formulas were adjusted. All of the following suggestions are based upon that analysis.
EXECUTIVE SUMMARY – FOUR MAIN SUGGESTIONS
Suggestion #1: Use a more dynamic K-Factor
I believe that the basic FIDE rating formula is sound, but it does need to be modified. Instead of the conservative K-Factor of 10 which is currently used, a value of 24 should be used instead. This will make the FIDE ratings more than twice as dynamic as they currently are. The value of 24 appears to be the most accurate K-Factor, as well. Ratings that use other K-Factors are not as successful at predicting the outcomes of future classical games.
Suggestion #2: Get rid of the complicated Elo table
Elo's complicated table of numbers should be discarded, in favor of a simple linear model where White has a 100% expected score with a 390-point (or more) rating advantage, and a 0% expected score with a 460-point (or more) rating disadvantage. Other expected scores in between can be interpolated with a simple straight line. Note that this assigns a value of 35 rating points to having the White pieces, so White will have an expected score of 50% with a 35-point rating deficit, and an expected score of 54% if the players' ratings are identical. This model is far more accurate than Elo's table of values. Elo's theoretical calculations do not match the empirical data from actual results, and do not take the color of pieces into account either. They also show a statistical bias against the higher-rated players.
Suggestion #3: Include faster time control games, which receive less weight than a classical game
Classical games should be given their normal importance. Games played at the "modern" FIDE control are not as significant, and thus should only be given an 83% importance. Rapid games should be given a 29% importance, and blitz games an 18% importance. The choice to rate these types of games will actually improve the ratings' ability to predict the outcome of future classical games. By using these particular "weights", the ratings will be more accurate than if rapid and blitz games were completely excluded. The exact values of 83%, 29%, and 18% have been optimized for maximal accuracy and classical predictive power of the ratings. If you prefer a more exact definition that recognizes different types of rapid controls, or one that incorporates increments, I have included a graph further down which allows you to calculate more precise coefficients for arbitrary time controls.
Suggestion #4: Calculate the ratings monthly rather than quarterly
There is no reason why rating lists need to be out of date. A monthly interval is quite practical, considering that the calculation time for these ratings is almost negligible. The popularity of the Professional ratings shows that players prefer a more dynamic and more frequently-updated list.
A SIMPLER FORMULA
In some ways, the Elo approach is already very simple. Whenever a "rated" game of chess is played, the difference in FIDE ratings is checked against a special table of numbers to determine what each player's "predicted" score in the game should be. If you do better than that table predicts, your rating will increase by a proportionate amount. If you do worse than "predicted", your rating will decrease correspondingly.
Let's say, for instance, that you have a rating of 2600, and you play a 20-game match against somebody rated 2500. In these games, your rating advantage is 100 points. The sacred Elo table of numbers tells us that your predicted score in that match is 12.8/20. Thus if you actually score +5 (12.5/20), that would be viewed as a slightly sub-par performance, and your rating would decrease by 3 points as a result.
However, the unspoken assumption here is that the special table of numbers is accurate. Today's chess statistician has the advantage of incredible computing power, as well as millions of games' worth of empirical evidence. Neither of these resources was available to Elo at the time his table of numbers was proposed. Thus it is possible, today, to actually check the accuracy of Elo's theory. Here is what happens if you graph the actual data:
Elo's numbers (represented by the white curve) came from a theoretical calculation. (If you care about the math, Elo's 1978 book tells us that the numbers are based upon the distribution of the difference of two Gaussian variables with identical variances but different means.) This inverse exponential distribution is so complicated that there is no way to provide a simple formula predicting the score from the two players' ratings. All you can do is consult the special table of numbers.
I don't know why it has to be so complicated. Look at the blue line in my graph. A straight line, fitted to the data, is clearly a more accurate depiction of the relationship than Elo's theoretical curve. Outside of the +/- 350 range, there is insufficient data to draw any conclusions, but this range does include well over 99% of all rated games. I have a theory about where Elo's calculations may have gone astray (having to do with the uncertainty of rating estimates), but the relevant point is that there is considerable room for improvement in Elo's formula.
Why do we care so much about this? Well, a player's rating is going to go up or down, based on whether the player is performing better than they "should" be performing. If you tend to face opponents at the same strength as you, you should score about 50%; your rating will go up if you have a plus score, and down if you have a minus score. However, what if you tend to face opponents who are 80-120 points weaker than you? Is a 60% score better or worse than predicted? What about a 65% score? More than half of the world's top-200 actually do have an average rating advantage of 80-120 points, across all of their games, so this is an important question.
Let's zoom into that last graph a little bit (also averaging White and Black games together). The white curve in the next graph shows you your predicted score from the Elo table, if you are the rating favorite by 200 or fewer points. That white curve is plotted against the actual data, based on 266,000 games between 1994 and 2001, using the same colors as the previous graph:
There is a consistent bias in Elo's table of numbers against the higher-rated player. To put it bluntly, if you are the higher-rated player, a normal performance will cause you to lose rating points. You need an above-average performance just to keep your rating level. Conversely, if you are the lower-rated player, a normal performance will cause you to gain rating points.
For instance, in that earlier example where you had a rating of 2600 and scored 12.5/20 against a 2500-rated opponent, you would lose a few rating points. As it turns out, your 12.5/20 score was actually a little BETTER than would be expected from the ratings. Using the blue line in the last graph, you can see that a 100-point rating advantage should lead to a score slightly over 61%, and you actually scored 62.5%. Thus, despite a performance that was slightly above par, you would actually lose rating points, due to the inaccuracy of Elo's table of numbers.
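The comparison in this example can be sketched in a few lines of Python. This is only an illustration, under the assumption that the blue line in the zoomed graph is the color-averaged form of the linear formula given in the summary further down (slope 0.001164 per rating point, with the White bonus cancelling out when White and Black games are averaged). The function name is made up for the sketch, and no clamping is applied because the example stays well inside the fitted range.

```python
SLOPE = 0.001164  # expected-score gain per rating point, from the linear formula

def linear_expected(rating_advantage):
    """Color-averaged expected score per game under the linear model.

    Assumption: averaging the White and Black cases cancels the White
    bonus, leaving 0.5 plus the slope times the rating advantage.
    """
    return 0.5 + SLOPE * rating_advantage

games = 20
actual = 12.5            # the +5 score from the example
elo_predicted = 12.8     # Elo table prediction quoted in the text
linear_predicted = games * linear_expected(100)

print("linear prediction:", round(linear_predicted, 2))
print("change with Elo table, K=10:", round(10 * (actual - elo_predicted), 1))
print("change with linear model, K=10:", round(10 * (actual - linear_predicted), 1))
```

The actual score of 12.5 falls between the two predictions, which is why the same result costs rating points under the Elo table but would gain a little under the linear model.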
It may seem trivial to quibble over a few rating points, but this is a consistent effect which can have large cumulative impact over time. For instance, it appears that this effect cost Garry Kasparov about 15 rating points over the course of the year 2000, and the same for Alexei Shirov. With their very high ratings, each of those players faced opposition that (on average) was weaker by 80-120 points, and so the ratings of both Kasparov and Shirov were artificially diminished by this effect.
In contrast, Vladimir Kramnik also had a high rating in 2000, but due to his large number of games against Kasparov during that year, Kramnik's average rating advantage (against his opponents) was far smaller than Kasparov's or Shirov's. Thus, this bias only cost Kramnik 1 or 2 rating points over the course of the year 2000.
The bias also has an effect on the overall rating pool. It compresses the ratings into a smaller range, so the top players are underrated and the bottom players are overrated. Players who tend to be the rating favorites in most of their games (such as the top-100 or top-200 players) are having their ratings artificially diminished due to this effect. Thus the rise in grandmaster ratings, that we have seen in recent years, would have been even greater had a more accurate rating system been in place. You will see an illustration of this later on, when we look at some monthly top-ten lists since 1997 using various rating formulas.
It's great to have some sort of scientific justification for your formula, as Professor Elo did, but it seems even more important to have a formula which is free of bias. It shouldn't matter whether you face a lot of stronger, weaker, or similar-strength opponents; your rating should be as accurate an estimate of your strength as possible, and this simply does not happen with Elo's formula. My "linear model" is much simpler to calculate, easier to explain, significantly more accurate, and shows less bias.
A MORE DYNAMIC FORMULA
For all its flaws, the Elo rating formula is still a very appealing one. Other rating systems require more complicated calculations, or the retention of a large amount of historical game information. However, the Professional ratings are known to be considerably more dynamic than the FIDE ratings, and for this reason most improving players favor the Professional ratings. For instance, several months ago Vladimir Kramnik called the FIDE ratings "conservative and stagnant".
Nevertheless, it is important to realize that there is nothing inherently "dynamic" in Ken Thompson's formula for the Professional ratings. And there is nothing inherently "conservative" in Arpad Elo's formula for the FIDE ratings. In each case there is a numerical constant, used within the calculation, which completely determines how dynamic or conservative the ratings will be.
In the case of the Elo ratings, this numerical constant is the attenuation factor, or "K-Factor". In case you don't know, let me briefly explain what the K-Factor actually does. Every time you play a game, there is a comparison between what your score was predicted to be, and what it actually was. The difference between the two is multiplied by the K-Factor, and that is how much your rating will change. Thus, if you play a tournament and score 8.5 when you were predicted to score 8.0, you have outperformed your rating by 0.5 points. With a K-Factor of 10, your rating would go up by 5 points. With a K-Factor of 32, on the other hand, your rating would go up by 16 points.
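The K-Factor arithmetic described above is small enough to write out directly. The following is a minimal sketch; the function name is illustrative and not taken from any rating software.

```python
def rating_change(actual_score, predicted_score, k_factor):
    """Rating change for an event: K-Factor times (actual - predicted)."""
    return k_factor * (actual_score - predicted_score)

# The tournament example from the text: 8.5 scored versus 8.0 predicted.
for k in (10, 32):
    print("K =", k, "change =", rating_change(8.5, 8.0, k))
```

The same over-performance of half a point moves the rating more than three times as far with a K-Factor of 32 as with a K-Factor of 10.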
In the current FIDE scheme, a player will forever have a K-Factor of 10, once they reach a 2400 rating. With a K-Factor of 5, the FIDE ratings would be far more conservative. With a K-Factor of 40, they would leap around wildly, but the ratings would still be more accurate than the current ratings. The particular choice of 10 is somewhat arbitrary and could easily be doubled or tripled without drastic consequences, other than a more dynamic (and more accurate) FIDE rating system.
As an example of how the K-Factor affects ratings, consider the following graph for Viktor Korchnoi's career between 1980 and 1992. Using the MegaBase CD from Chessbase, I ran some historical rating calculations using various K-Factors, and this graph shows Korchnoi's rating curve for K-Factors of 10, 20, and 32. Note that these ratings will differ from the actual historical FIDE ratings, since MegaBase provides a different game database than that used by the FIDE ratings.
You can see that the red curve (K-Factor of 10) is fairly conservative, slower to drop during 1982-3 when Korchnoi clearly was declining, and remaining relatively constant from 1985 through 1992, almost always within the same 50-point range. For a K-Factor of 20, however, Korchnoi's rating jumps around within a 100-point range over the same 1985-1992 period (see the blue curve), whereas with a K-Factor of 32 there is almost a 200-point swing during those years (see the yellow curve). Thus the K-Factor can easily cause an Elo formula to be either very conservative or very dynamic.
For the Thompson formula, there is also a numerical constant which determines how dynamic the ratings will be. The current Professional ratings use a player's last 100 games, with the more recent games weighted more heavily. If they used the last 200 games instead, the ratings would be sluggish and resistant to change. If they used the last 50 games, they would be even more dynamic. You might think that Professional ratings using only the last 50 games would be far more dynamic than any reasonable Elo-style formula, but in fact the Elo formula with a K-Factor of 32 seems to be even more dynamic than a Thompson formula which uses only the last 50 games. Take a look at the career rating curve for Jan Timman from 1980 to 1992, using those two different formulas. Again, I did these calculations myself, using data from MegaBase 2000.
It is clear that the red curve (Elo-32) is even more dynamic than the blue curve (Thompson-50), with higher peaks and lower valleys. However, it should also be clear that the two rating systems are very similar. If you could pick the right numerical constants, the Thompson and Elo formulas would yield extremely similar ratings. In these examples, I chose Korchnoi and Timman more or less at random; my point was to show that there is nothing inherently "dynamic" about the Professional ratings or "conservative" about the FIDE ratings. It is really almost a mathematical accident that they are this way, unless perhaps the initial Thompson formula was specifically intended to be more dynamic than FIDE's ratings.
So, it is clear that the FIDE ratings could be made more dynamic simply by increasing the K-Factor. Is this a good idea?
In an attempt to answer this question, I have run many rating calculations for the time period between 1994 and 2001, using various formulas. In each case, I retroactively determined how accurate the ratings were at predicting future results. Based on those calculations, it became possible to draw a curve showing the relationship between K-Factor and accuracy of the ratings:
It appears that a K-Factor of 24 is optimal. For smaller values, the ratings are too slow to change, and so ratings are not as useful in predicting how well players will do each month. For larger values, the ratings are too sensitive to recent results. In essence, they "over-react" to a player's last few events, and will often indicate a change in strength when one doesn't really exist. You can see from this graph that even using a super-dynamic K-Factor of 40 would still result in greater accuracy than the current value of 10.
RAPID AND BLITZ
Recent years have seen an increased emphasis on games played at faster time controls. Official FIDE events no longer use the "classical" time controls, and rapid and blitz games are regularly used as tiebreakers, even at the world championship level. There are more rapid events than ever, but rapid and blitz games are completely ignored by the master FIDE rating list. Instead, a separate "rapid" list, based on a small dataset, is maintained and published infrequently and sporadically.
For now, to keep things simple, I want to consider only four classifications of time controls. The "Classical" time control, of course, refers to the traditional time controls of two hours for 40 moves, one hour for 20 moves, and then half an hour for the rest of the game. "Modern" (FIDE) controls are at least 90 minutes per player per game, up to the Classical level. "Blitz" controls are always five-minute games with no increments, and "Rapid" has a maximum of 30 minutes per player per game (or 25 minutes if increments are used). I understand that these four classifications don't include all possible time controls (what about g/60, for instance?). However, please be patient. I will get to those near the end of this article.
The question of whether to rate faster games, and whether to combine them all into a "unified" list, is a very controversial topic. I don't feel particularly qualified to talk about all aspects of this, so as usual I will stick to the statistical side. Let's go through the argument, point-by-point.
(1) I am trying to come up with a "better" rating formula.
(2) By my definition, a rating formula is "better" if it is more accurate at predicting future classical games.
(3) The goal is to develop a rating formula with "optimal" classical predictive power.
(4) Any data which significantly improves the predictive power of the rating should be used.
(5) If ratings that incorporate faster-time-control games are actually "better" at predicting the results of future classical games, then the faster games should be included in the rating formula.
It is clear that Modern, Rapid, and Blitz games all provide useful information about a player's ability to play classical chess. The statistics confirm that conclusion. However, the results of a single Classical game are more significant than the results of a single Modern game. Similarly, the results of a single Modern game are more significant than the results of a single Rapid game, and so on.
If we were to count all games equally, then a 10-game blitz tournament, played one afternoon, would count the same as a 10-game classical tournament, played over the course of two weeks. That doesn't feel right, and additionally it would actually hurt the predictive power of the ratings, since they would be unduly influenced by the blitz results. Thus it appears that the faster games should be given an importance greater than zero, but less than 100%.
This can be accomplished by assigning "coefficients" to the various time controls, with Classical given a coefficient of 100%. For example, let's say you did quite well in a seven-round Classical tournament and as a result you would gain 10 rating points. What if you had managed the exact same results in a seven-round Rapid tournament instead? In that case, if the coefficient for Rapid time controls were 30%, then your rating would only go up by 3 points, rather than 10 points.
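As a rough sketch of how such a coefficient would be applied, the rating change that a result would earn at the Classical control is simply multiplied by the coefficient for the time control actually played. The function name below is hypothetical, and the 30% figure is the illustrative value from the paragraph above rather than a fitted one.

```python
def scaled_change(classical_change, coefficient):
    """Scale a rating change by the time-control coefficient (1.0 = Classical)."""
    return classical_change * coefficient

# A result worth 10 points at the Classical control, played at Rapid instead.
print(round(scaled_change(10, 0.30), 2))  # coefficient from the example above
print(round(scaled_change(10, 1.00), 2))  # Classical games count in full
```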
How should those coefficients be determined? The question lies somewhat outside of the realm of statistics, but I can at least answer the statistical portion of it. Again, I must return to the question of accuracy and predictive power. If we define a "more accurate" rating system as one which does a better job of predicting future outcomes than a "less accurate" rating system, then it becomes possible to try various coefficients and check out the accuracy of predictions for each set. Data analysis would then provide us with "optimal" coefficients for each time control, leading to the "optimal" rating system.
Before performing the analysis, my theory was that a Modern (FIDE) time control game would provide about 70%-80% as much information as an actual classical game, a rapid game would be about 30%-50%, and a blitz game would be about 5%-20%. The results of the time control analysis would "feel" right if it identified coefficients that fit into those expected ranges. Here were the results:
The "optimal" value for each coefficient appears as the peak of each curve. Thus you can see that a coefficient of 83% for Modern is ideal, with other values (higher or lower) leading to less accurate predictions in the ratings. Similarly, the optimal value for Blitz is 18%, and the optimal value for Rapid is 29%. Not quite in the ranges that I had expected, but nevertheless the numbers seem quite reasonable.
A MORE ACCURATE FORMULA
To summarize, here are the key features of the Sonas rating formula:
(1) Percentage expectancy comes from a simple linear formula:
White's %-score = 0.541767 + 0.001164 * White rating advantage, treating White's rating advantage as +390 if it is better than +390, or -460 if it is worse than -460.
(2) Attenuation factor (K-Factor) should be 24 rather than 10.
(3) Give Classical games an importance of 100%, whereas Modern games are 83%, Rapid games are 29%, and Blitz games are 18%. Alternatively, use the graph at the end of this article to arrive at an exact coefficient which is specific to the particular time control being used.
(4) Calculate the rating lists at the end of every month.
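Points (1) through (3) can be combined into a short Python sketch of a single-game update. Several details here are assumptions rather than part of the formula as stated: the update is applied game by game instead of being batched at the end of the month, Black's rating moves by the same amount as White's in the opposite direction, and the time-control coefficient simply multiplies the K-Factor. All function and variable names are illustrative.

```python
K_FACTOR = 24
WEIGHTS = {"classical": 1.00, "modern": 0.83, "rapid": 0.29, "blitz": 0.18}

def white_expected_score(white_rating, black_rating):
    """White's expected score under the linear model of point (1)."""
    advantage = white_rating - black_rating
    # Treat advantages beyond +390 or below -460 as exactly those limits.
    advantage = max(-460, min(390, advantage))
    return 0.541767 + 0.001164 * advantage

def update_after_game(white_rating, black_rating, white_result,
                      time_control="classical"):
    """Return (new_white, new_black) after one game.

    white_result is 1.0, 0.5 or 0.0 from White's point of view.
    Assumption: the update is zero-sum between the two players.
    """
    expected = white_expected_score(white_rating, black_rating)
    delta = K_FACTOR * WEIGHTS[time_control] * (white_result - expected)
    return white_rating + delta, black_rating - delta

# A draw between equally rated players costs White about one point,
# because White was expected to score roughly 54%.
print(update_after_game(2600, 2600, 0.5))
print(update_after_game(2600, 2600, 0.5, time_control="blitz"))
```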
This formula was specifically optimized to be as accurate as possible, so it should come as no surprise that the Sonas ratings are much better at predicting future classical game outcomes than are the existing FIDE ratings. In fact, in every single month that I looked at, from January 1997 through December 2001, the total error (in predicting players' monthly scores) was higher for the FIDE ratings than for the Sonas ratings:
How can I claim that the Sonas ratings are "more accurate" or "more effective at predicting"? I went through each month and used the two sets of ratings to predict the outcome of every game played during that month. Then, at the end of the month, for each player, I added up their predicted score using the Elo ratings, and their predicted score using the Sonas ratings. Each of those rating systems had an "error" for the player during that month, which was the absolute difference between the player's actual total score and the rating system's predicted total score.
For example, in April 2000 Bu Xiangzhi played 18 classical games, with a +7 score for a total of 12.5 points. Based on his rating and his opponents' ratings in those games, the Elo rating system had predicted a score of 10.25, whereas the Sonas rating system had predicted a score of 11.75. In this case, the Elo error would be 2.25, whereas the Sonas error would be 0.75. By adding up all of the errors, for all players during the month, we can see what the total error was for the Sonas ratings, and also for the Elo ratings. Then we can compare them, and see which rating system was more effective in its predictions of games played during that month. In the last graph, you can see that the Sonas ratings turned out to be more effective than the Elo ratings in every single one of the 60 months from January 1997 to December 2001.
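The error measure described above is easy to express in code. The sketch below assumes each player's monthly results have already been reduced to a pair of totals (actual score, predicted score); the function name and the input layout are illustrative only.

```python
def total_error(monthly_results):
    """Sum of |actual - predicted| over all players in one month.

    monthly_results: iterable of (actual_score, predicted_score) pairs,
    one pair per player, each summed over that player's games in the month.
    """
    return sum(abs(actual - predicted) for actual, predicted in monthly_results)

# Bu Xiangzhi in April 2000, using the figures quoted in the text.
elo_error = total_error([(12.5, 10.25)])
sonas_error = total_error([(12.5, 11.75)])
print("Elo error:", elo_error)
print("Sonas error:", sonas_error)
```

Running the same function over every player in a month, once per rating system, gives the two monthly totals that are compared in the graph.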
You are probably wondering what the top-ten-list would look like, if the Sonas formula were used instead of the Elo formula. Rather than giving you a huge list of numbers, I'll give you a few pictures instead.
First, let's look at the "control group", which is the current Elo system (including only Classical and Modern games). These ratings are based upon a database of 266,000 games covering the period between January 1994 and December 2001. The game database is that provided by Vladimir Perevertkin, rather than the actual FIDE-rated game database, and these ratings are calculated 12 times a year rather than 2 or 4. Thus the ratings shown below are not quite the same as the actual published FIDE ratings, but they do serve as an effective control group.
Next, you can see the effect of a higher K-Factor. Using a K-Factor of 24 rather than 10, players' ratings are much more sensitive to their recent results. For instance, you can see Anatoly Karpov's rating (the black line) declining much more steeply in the next graph. Similarly, with the more dynamic system, Garry Kasparov dropped down very close to Viswanathan Anand after Linares 1998. In fact, Kasparov briefly fell to #3 on this list in late 2000, after Kramnik defeated him in London and then Anand won the FIDE championship. And Michael Adams was very close behind at #4.
Finally, by examining the next graph, you can see the slight effect upon the ratings if faster time controls are incorporated. In the years between 1994 and 1997, Kasparov and Anand did even better at rapid chess than at classical chess, and so you can see that their ratings are a little bit higher when rapid games are included. Some other players show some differences, but not significant ones. In general, the two graphs are almost identical.
You might also notice that the ratings based upon a linear model with a K-Factor of 24 are about 50 points higher than the ratings with the current formula. As I mentioned previously, this is mostly due to a deflationary effect in the current formula, rather than an inflationary effect in the linear model. Since there is an unintentional bias against higher-rated players in the Elo table of numbers, the top players are having their ratings artificially depressed in the current system. This bias would be removed through the use of my linear model.
It is unsurprising that a rating system with a higher K-Factor would have some inflation, though. If a player does poorly over a number of events and then stops playing, they will have "donated" rating points to the pool of players. Perhaps someone scored 30/80 rather than the predicted 40/80, over a few months. In the current system, they would have donated 100 points to the pool, whereas with a K-Factor of 24, it would have been 240 points instead. Since a very successful player will probably keep playing, while a very unsuccessful player might well stop playing, this will have an inflationary effect on the overall pool. Of course, this is a very simplistic explanation and I know that the question of inflation vs. deflation is a very complicated one.
I am not suggesting that we suddenly recalculate everyone's rating and publish a brand-new rating list. For one thing, it's not fair to retroactively rate games that were "unrated" games at the time they were played. By showing you these graphs, I am merely trying to illustrate how my rating system would behave over time. Hopefully this will illustrate what it would mean to have a K-Factor of 24 rather than 10, and you can also see the impact of faster time controls.
For the sake of continuity of the "official" rating list, it seems reasonable that if this formula were adopted, everyone should retain their previous rating at the cut-over point. Once further games were played, the ratings would begin to change (more rapidly than before) from that starting point.
OTHER TIME CONTROLS
The above conclusions about time controls were based upon only four different classifications: Blitz, Rapid, Modern, and Classical. However, those classifications do not include all typical time controls. For instance, Modern has a minimum of 90 minutes per player per game, whereas Rapid has a maximum of 30 minutes per player per game. Ideally, it would be possible to incorporate the coefficients for these four classifications into a "master list" which could tell you what the coefficient should be for g/60, or g/15 vs. g/30 for that matter.
I did a little bit of analysis on some recent TWIC archives, and determined that about 50% of games last between 30 and 50 moves, with the average game length being 37 moves. I therefore defined a "typical" game length as 40 moves, and then looked at how much time a player would use in a "typical" game in various time controls, if they used their maximum allowable time to reach move 40.
This means a player would spend 5 minutes on a typical Blitz game, 5-30 minutes on a typical Rapid game, 90-120 minutes on a typical Modern game, and 120 minutes on a typical Classical game. Finally, I graphed my earlier coefficients of 18%, 29%, 83%, and 100% against the typical amount of time used, and arrived at the following important graph:
This sort of approach (depending upon the maximum time used through 40 moves) is really useful because it lets you incorporate increments into the formula. A blitz game where you have 5 minutes total will obviously count as a 5-minute game in the above graph, and you can see that the coefficient would be 18%. A blitz game where you get 5 minutes total, plus 15 seconds per move, would in fact typically be a 15-minute game (5 minutes + 40 moves, at one extra minute per four moves = 15 minutes), and so the recommended coefficient would be 27% instead for that time control.
The very common time control of 60 minutes per player per game would of course count as a 60-minute game, and you can see that this would be 55%. And the maximum coefficient of 100% would be reached by a classical time control where you get a full 120 minutes for your first 40 moves.
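The "typical time used" calculation can be sketched as follows. The function name is hypothetical, and the small lookup table contains only the four graph readings quoted in the text; no attempt is made to reproduce the rest of the curve, so other time controls would still need to be read off the graph.

```python
def effective_minutes(base_minutes, increment_seconds=0, typical_moves=40):
    """Maximum time a player can use through a typical 40-move game."""
    return base_minutes + typical_moves * increment_seconds / 60

# Coefficients quoted in the text for specific effective times (in minutes).
QUOTED_COEFFICIENTS = {5: 0.18, 15: 0.27, 60: 0.55, 120: 1.00}

for base, increment in ((5, 0), (5, 15), (60, 0), (120, 0)):
    minutes = effective_minutes(base, increment)
    print(base, "min +", increment, "sec/move ->", minutes, "minutes,",
          "coefficient", QUOTED_COEFFICIENTS.get(minutes))
```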
It is more important than ever before for ratings to be accurate. In the past, invitations to Candidate events were based upon a series of qualification events. Now, however, invitations and pairings are often taken directly from the rating list. The field for the recent Dortmund candidates' tournament was selected by averaging everyone's FIDE and Professional ratings into a combined list, and then picking the top players from that list. For the first time, a tournament organizer has acknowledged that the FIDE ratings are not particularly accurate, and that a different formula might work better.
The FIDE ratings are way too conservative, and the time control issue also needs to be addressed thoughtfully. I know that this is an extremely tricky issue, and it would be ridiculous to suggest that it is simply a question of mathematics. If change does come about, it will be motivated by dozens of factors. Nevertheless, I hope that my efforts will prove useful to the debate. I also hope you agree with me that the "Sonas" formula described in this article would be a significant improvement upon the "Elo" formula which has served the chess world so well for decades.
Please send me e-mail at firstname.lastname@example.org if you have any questions, comments, or suggestions. In addition, please feel free to distribute or reprint text or graphics from this article, as long as you credit the original author (that's me).