What's Wrong With the Elo System?

by Jeff Sonas
4/19/2020 – Want to know an expert statistical opinion about which FIDE-strength players you should avoid facing? The Elo system is supposed to be fair toward players of all strengths, but Jeff Sonas sees major problems with the current state of the FIDE Elo rating pool, and it's getting worse each year. It will take multiple articles to walk you through the analysis, but here is chapter one of the story.


Analysis of the problem by Jeff Sonas

Part 1

In theory, the FIDE Elo rating system ought to function equally well no matter what your opponent's Elo rating is. You wouldn't need to worry much about whether you're facing opponents with higher, lower, or similar Elo ratings to your own, because your expected score would be realistic and fair in all those cases. However, according to chess statistician Jeff Sonas it looks like the FIDE Elo system is not actually working well in this regard. In fact Jeff thinks the problem is getting worse and worse each year.

Let's start with a concrete example. Imagine a 28-year-old English player named Michael, with a standard FIDE Elo rating of 1950 and a K-factor of 20. He suspects it is possible to "game the system" a bit, maximizing his rating by being careful about which tournaments he plays in. He will choose one of these three events to play in:

  1. Midland Championship Major (average player rating: 1550)
  2. Northumbria Challengers (average player rating: 1750)
  3. Northumbria Masters (average player rating: 2150)

For this example, we will make the simplifying assumption that Michael's opponents in the Midland Championship Major would have ratings ranging from 1400 to 1699 (with an average of 1550), and the opponents in the Northumbria Challengers would range from 1600 to 1899 (with an average of 1750), and the opponents in the Northumbria Masters would range from 2000 to 2299 (with an average of 2150). Which event should he choose?

Many people think it's best for your rating if you just face (and dominate) relatively weak opponents in an event where you can build up a big plus-score, and so your initial answer might be that it's best (for Michael's rating) for him to enter the weakest event: the Midland Championship Major, with players averaging 400 points weaker than Michael. But it's not that simple.

Certainly it is true that Michael's best chances for a high percentage score would be in the weakest event. On the other hand, with Michael having such a large rating advantage, the Elo Expectancy Table would also predict Michael to score really well against those weak opponents, making it hard for him to exceed expectation and actually gain rating points. From the perspective of his desired rating gain/loss, he is really competing against the expected scores calculated from that table. He could win the event and lose rating points at the same time!

Another plausible answer is that it doesn't matter which event Michael plays in. In the Midland Championship Major, his rating says he should score 89% and so perhaps we just trust that on average he would score 89% and break even from a rating standpoint. Whereas in the Northumbria Challengers, his rating says he would score 74%, and in the Northumbria Masters, his rating says he would score 26%, and we could make the same argument in all these events, that he ought to perform in accordance with the Elo expectation. A decade or two ago, this might have been a reasonable argument. However, it doesn't work in today's chess world and today's FIDE rating pool.
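The expectancies quoted above come from FIDE's Elo lookup table. For illustration, here is a sketch of the standard continuous (logistic) form of that curve; note this is an approximation, not FIDE's exact table, so it yields numbers a point or two off from the article's bucketed 89%, 74%, and 26%:

```python
# Standard continuous Elo expectancy (the logistic curve that FIDE's
# bucketed lookup table approximates). Illustrative sketch only.
def expected_score(player_rating: int, opponent_rating: int) -> float:
    diff = opponent_rating - player_rating
    return 1.0 / (1.0 + 10.0 ** (diff / 400.0))

# Michael (1950) against the three tournaments' average opposition:
for avg_opp in (1550, 1750, 2150):
    print(avg_opp, round(expected_score(1950, avg_opp), 3))
```

Against the 1550-average field this gives about 0.91, against 1750 about 0.76, and against 2150 about 0.24 — close to, but not identical with, the table values in the article.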

I have access to a very large database of chess games, including all of the game results that FIDE has used for rating calculation, going back more than a decade, and so I am going to bring some real-world data into this argument. We can look at the last few years' worth of games, in each case checking what the Elo expected score was and what the actual result was, and then compare how well they have matched up (on average).

To keep our analysis a manageable size, we will create a few groups of players based on rounding their ratings down to the nearest 100. Or you can think of it like this: all players in the same group have the same first two digits in their rating. So all rated players would be in one of 19 groups: those rated 1000-1099, those rated 1100-1199, and so on, up through 2700-2799 and 2800-2899. And players move from one group to another over time, as their ratings change. To save space, we will call these the 10xx, 11xx, ..., 27xx, and 28xx rating groups.
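The grouping scheme described above can be sketched in a couple of lines (Python, for illustration):

```python
# Bucket a rating into the article's 100-point rating groups
# ("10xx" through "28xx") by integer division.
def rating_group(rating: int) -> str:
    return f"{rating // 100}xx"

print(rating_group(1950))  # 19xx
print(rating_group(2099))  # 20xx
```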

For each rating group, we will see how they performed against each other rating group. We will only consider standard (classical) chess for now, and to make sure we are capturing the current state of things, we will only consider games played in 2017, 2018, or 2019.

Michael is rated 1950, and our simplified Midland Championship Major has players ranging from 1400 to 1699. Let's start by checking our database and seeing how 19xx players (such as Michael) have done against 14xx opponents.

Across 2017-2019, I can find 22,150 games where a player rated 1900-1999 faced an opponent rated 1400-1499. In such a matchup, the Elo table always gives the higher-rated player an expected score of 92% (the "400-point rule" ensures this), but in the actual game results, the higher-rated player has only scored 86.6%. From the perspective of the 19xx player, here are the numbers:

So if Michael played a 10-game event against nothing but 14xx opponents, he would probably average about 8.66 points out of 10. This would actually be pretty bad for his rating, because the Elo system predicts 9.2 points out of 10 for him, and thus overall we would expect Michael (having a K-factor of 20) to lose about 10.8 rating points in such an event. You can see this in the rightmost column in the above table.
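The arithmetic behind that 10.8-point loss is the standard Elo update — K times (actual score minus expected score) — applied over the whole event. A minimal sketch using the article's numbers:

```python
# Expected rating change over an event: K * (actual points - expected
# points), where both totals are summed over the event's games.
def event_rating_change(k: int, actual_points: float,
                        expected_points: float) -> float:
    return k * (actual_points - expected_points)

# Ten games at K=20 against 14xx opposition: 8.66 actual vs 9.2 expected.
print(round(event_rating_change(20, 8.66, 9.2), 1))  # -10.8
```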

And then we can do a similar calculation for the games where players of Michael's strength (namely, in the 19xx group) faced opponents in the 15xx or 16xx groups. This completes the 300-point rating range in our hypothetical "Midland Championship Major". Let's add those two groups to our table, and the bottom row will show the overall average across all three categories of opponent in this tournament.

We will also color-code the final column (the overall expected rating gain/loss for Michael in a ten-game event) with a red/blue gradient. Under this coloring scheme, larger negative numbers are a darker red, larger positive numbers are a darker blue, and the red/blue colors get lighter and lighter as they approach zero (which is white). So blue is good for Michael's rating, red is bad for Michael's rating, and the darker colors are the more extreme rating changes.

So we can see that games against the 15xx or 16xx groups are even worse for Michael's rating (the negative numbers are larger in magnitude, and the red color is darker). Overall, across the three ranges, the Elo system would predict an 89.0% score for him at this tournament, whereas the empirical data suggests he would score 81.5% instead. So across ten games at this tournament, Michael can expect to lose 15 rating points, even if he performs at a typical level for a player with his rating. Thus it seems that entering a tournament like this against significantly weaker players is not a good tactical choice for Michael, if he wants to maintain his rating.

What about the second option? The hypothetical "Northumbria Challengers" is a little stronger event where the opponents would be (on average) 200 points weaker than Michael. Is this a good opportunity for him to build up his rating against weaker foes? Well, when we check the numbers again, it's not quite as bad as the first tournament, but this event turns out to be another one where we expect Michael to lose rating points, just based on how players in his rating group tend to do against such opponents:

Across ten games against this level of opposition, Michael would probably score about 6.5 or 7.0 points (the 68.3% in the bottom row tells us this), but the Elo system yields an expected score of 74.4% for him, and so overall he would expect to lose about 12 rating points in such a ten-game event, if he performs like a typical 19xx player. Another red tournament to avoid!

Fortunately for Michael, there is one more option. In the third tournament, the hypothetical "Northumbria Masters", Michael would finally be outrated (by an average of 200 points). At last there is some good news for Michael: he can expect to outperform his Elo prediction against these opponents! Here is what the numbers look like for 19xx players against the three groups (20xx, 21xx, and 22xx):

In a ten-game event against these opponents, his Elo rating would predict a 25.7% score, but empirical evidence suggests Michael would really average more like 30%, and thus he can expect to gain about 9 rating points if he performs in typical fashion for his rating group. The rightmost column is colored blue because Michael can expect to gain rating points from his games against such opponents. And in fact, the 22xx opponents are preferable to the 21xx or 20xx opponents, as you can see from the blue being darker and the +11.3 being a bigger number than in the other two rows. So even better than the "Northumbria Masters" would be if Michael could face exclusively the 22xx group.

What has Michael learned from this exercise? It's better for his rating if he plays somewhat stronger opponents rather than somewhat weaker opponents. And amazingly, this is not just true for someone of Michael's approximate strength, but rather it's true across the entire rating list, although the effect is not as pronounced at the very top. We will look at this in greater detail in upcoming articles.

If you are comfortable by now with the red/blue coloring scheme, and with the idea that we are always looking at what the expected rating change will be for someone with K=20 across a ten-game event, we can take the logical next step, which is to see what the pattern is for all possible opponents of a 19xx player like Michael. We will get rid of the middle columns from our table, and instead we will just show the leftmost (Opponent Rating Range) and rightmost (Average Rating Change) columns. Again, the rating change is from the perspective of the 1900-1999 player with a K-factor of 20.

This reveals a characteristic pattern to the empirical data. Note that we are only showing data for cells where we have at least 500 games of historical data, so that's why the cells are blank against the 27xx and 28xx groups; there are not enough games played by 19xx players against such opponents to get useful data. Also note that when a 19xx player faces other 19xx players, the average expected score is 50% and the average actual score is 50% (due to the symmetry of both people counting as both the "player" and the "opponent"), and so the cell is a solid white (zero expected rating change) on the "1900-1999 row" when the group faces itself.

From the darkest-colored cells, and/or the largest magnitude numbers, we can see that the opponents most advantageous to Michael's 19xx group are those in the 21xx, 22xx, and 23xx groups. Against opponents that are 200-400 points stronger, Michael can expect to outscore his expectation by about 5 percentage points, meaning if he is K=20 then he will gain about 10 rating points for every ten games played. However, if he faces opponents that are 200-400 points weaker than him (the 15xx, 16xx, and 17xx groups) then he can expect to lose about 15 or 16 rating points for every ten games played.

This effect peaks at a rating difference of around 200-400 Elo points. And even though the darkest colors are at +/- 200-400 points, for even larger rating differences we see that something else takes over: namely, the "400-point rule" that I mentioned earlier. This rule was originally instituted as a "350-point rule" to incentivize grandmasters to play in open tournaments. It was adjusted to become a "400-point rule" about a decade ago, and it ensures that the expected score from a one-sided game can never be higher than 92% or lower than 8%. As it turns out, if Michael gets to face weak enough players, he can score far better than 92%. That is why we see some dark blue at the very top: empirical data tells us that the 19xx group tends to score 96% against 11xx opponents and 97% against 10xx opponents, so if Michael can arrange to face really weak opposition like that, he can expect to dominate, to the considerable benefit of his rating.
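The cap can be sketched by clamping the rating difference before evaluating the expectancy curve. This is an illustrative sketch, not FIDE's exact procedure: it uses the continuous logistic form, which gives about 91% at a 400-point gap, versus the 92% top bucket of FIDE's actual table.

```python
# Sketch of the "400-point rule": for expectancy purposes, a rating gap
# larger than 400 points is treated as exactly 400, capping the expected
# score (here at ~91%/9%; FIDE's bucketed table uses 92%/8%).
def expected_score_capped(player_rating: int, opponent_rating: int) -> float:
    diff = max(-400, min(400, player_rating - opponent_rating))
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# A 19xx player vs. a 10xx player is treated as "only" 400 points apart:
print(round(expected_score_capped(1950, 1050), 2))  # 0.91
```

Because of the clamp, a 900-point gap and a 600-point gap produce exactly the same expectancy — which is precisely why a strong player can beat the cap against very weak opposition.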

And conversely, the "400-point rule" works against Michael if he faces really strong opposition himself. When the 19xx group faces opponents in the 26xx group, they have only managed to score about 3% on average, while the Elo table still expects 8% from them, which means Michael would lose significant rating points if he performed at the typical level for his group against 26xx opposition. We can see this from the darker red color at the bottom of the grid, against 2600-2699 opposition.

Enough about Michael and the three hypothetical tournaments he was choosing between. If you have a rating yourself, and you are still with me in reading this article, then first of all: thanks for persevering! And second of all, let's expand that last graphic to the left and the right to include all 19 of the rating groups, not just Michael's 19xx group. You may well be wondering what level of opponents you should seek out, or avoid, in order to maximize the likely effect on your own rating. If so, just find the column that corresponds to your rating group, and then look for the darkest blue cells in your column:

Also remember that these numbers refer to the expected rating change from playing 10 games at K=20. If you have a K-factor of 40, you should double these numbers, and if you have a K-factor of 10, you should divide them by 2.
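Because the per-game change is K times (score minus expected), the table's K=20 numbers scale linearly with your actual K-factor, as a quick sketch shows:

```python
# Rescale an expected rating change quoted at K=20 to another K-factor.
def scale_for_k(change_at_k20: float, k: int) -> float:
    return change_at_k20 * k / 20.0

print(scale_for_k(-15.0, 40))  # K=40: twice the loss, -30.0
print(scale_for_k(-15.0, 10))  # K=10: half the loss, -7.5
```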

If the Elo expectancy table were systematically fair and realistic, then this grid would be colored solid white, or only very light red or light blue. Unfortunately there are indeed some very dark regions, especially where players rated 1300-1499 are facing opponents rated 1600-1799. A few things you might notice upon reflection:

  1. It is true up and down the rating list that players with a 200-400 point Elo advantage in a game will have real trouble just "treading water" and maintaining their rating, because the Elo table consistently overstates their chances in such games.
  2. This effect is strongest among relatively weak players (1400-1800 Elo rating) and not as present among stronger players.
  3. The "400-point rule" is too tight of a constraint; instead of the expected score being capped at 92%/8% for rating differences exceeding 400 points, it would function better statistically as a "600-point rule" or "700-point rule", capping the expected score at 98%/2% or even 99%/1%. This would admittedly provide lesser incentive for master-level players to compete in open events, but it is worth considering.

And I think the most important insight of all is this: the FIDE Elo rating pool is too stretched out. This has the result of exaggerating the true difference in strength between any two players, and so the expected score is calculated to be further from 50% than it really should be. And this is a very difficult problem to solve. It cannot be solved by simply adjusting the Elo Expectancy Table; if we did that, then the ratings would respond accordingly and we would be in the same boat again. It is much trickier to solve than that.

This is far from the end of the story, but perhaps it is a good place to end the first chapter. Hopefully we have set the stage for the next analysis. In the next article I will dig into this topic more deeply, including my assertion that things are getting worse and worse each year.

Please feel free to email me at jeff@chessmetrics.com if you want to discuss this further, or if you have any suggestions on what you would like me to talk about in upcoming analysis.

 


Jeff Sonas is a statistical chess analyst who invented the Chessmetrics system for rating chess players, which is intended as an improvement on the Elo rating system. He is the founder and proprietor of the Chessmetrics.com website, which gives Sonas' calculations of the ratings of current players and historical ratings going back as far as January 1843. Sonas graduated with honors with a B.S. in Mathematical and Computational Sciences from Stanford University in 1991.


plughxyzzy plughxyzzy 10/11/2020 03:35
The real problem is insufficient worldwide competition, so there are too many areas with players that tend to play among themselves. The best strategy is to find the weakest area and dominate those players to build up a rating. Depending on how they figure the initial rating, it could be better to, for the first time, play in a very strong tournament with high rated players, so even if you score poorly your initial rating will be higher and you can then build off that start instead of trying to catch up later.
JimGeary JimGeary 4/28/2020 09:47
Did you / can you control for age?
I found that the best way to game the system is to play in tournaments where at least 40% of the field remembers Chumbawumba.
fgkdjlkag fgkdjlkag 4/22/2020 09:26
I find @GreenKlaser and @Zagliveri_chess's comments very interesting. If they are accurate, it suggests that changes to the tournament format and prize structure should be made. How should they look? I wonder about Shahade's suggestion of random pairings within a score group (or a similar variation). This would dramatically change the desired outcome of a game between 2 players depending on the round. Eg, 2 top players facing each other toward the end of a Swiss tournament each know what result they need for a tournament prize (which may be different for both players, who may both not be eligible for a prize), which will heavily influence the moves in the game itself. But if these 2 top players faced each other in an early round (which happens at a very low probability in Swiss tournaments), the character of the game would be very different.

FIDE is reportedly working on a rating system for Fischer Random. What implications does the author and all the commenters' analysis have on what this rating system should look like, assuming they start from scratch?
Former Prodigy Former Prodigy 4/20/2020 08:15
I have only now spotted a comment of elmerdsangalang, with whom I completely agree. I have already had an interesting correspondence on this topic with Mr Michael Hantschke in February 2018 after he published his article on a similar topic.
GM David Navara, Czech Republic
Former Prodigy Former Prodigy 4/20/2020 08:08
In fact, it is absolutely normal that higher-rated players usually lose rating when facing lower-rated opponents.
There are overrated and underrated players, the overrated ones usually having higher ratings than the underrated ones. This means that lower-rated players are more likely to be underrated and to gain rating (until they become overrated), while the higher-rated players are prone to be overrated.
See "Regression toward the mean" on Wikipedia: https://en.wikipedia.org/wiki/Internal_validity#Regression_toward_the_mean
I will quote a few lines from there:
"This type of error occurs when subjects are selected on the basis of extreme scores (one far away from the mean) during a test. For example, when children with the worst reading scores are selected to participate in a reading course, improvements at the end of the course might be due to regression toward the mean and not the course's effectiveness. If the children had been tested again before the course started, they would likely have obtained better scores anyway. Likewise, extreme outliers on individual scores are more likely to be captured in one instance of testing but will likely evolve into a more normal distribution with repeated testing."

That said, I am not questioning the rating system (except for beginners, juniors and players with few rated games), but the real differences tend to be slightly lower than those expressed by ratings. It is a general statistical tendency rather than a golden rule. In 2004-2019, I mostly faced nominally weaker opponents but still gained some rating against them, for I was not overrated and got used to facing them (rather than the world elite).
If lower-rated players were not likely to increase their ratings when facing higher-rated opponents, there would be no generational change and so on. Higher-rated players would remain higher-rated forever. (Well, I am simplifying a bit.)
adbennet adbennet 4/20/2020 07:16
elmerdsangalang wrote: "Elo ratings are statistical approximations of playing strength."

That's not true at all.

Here's an idea to "fix" the colors on the chart - no rating change for a draw. Simple, elegant. Because a correctly played game is a draw, drawing against a higher rated player in no way "proves" that you should be closer than your published rating indicates. So just leave the status quo. 1200 draws against Magnus Carlsen? No change!
elmerdsangalang elmerdsangalang 4/20/2020 05:18
The title of the article is inaccurate and misleading. There is nothing wrong with the Elo Rating System. What is being commented on by the author is the implementation of a particular rule that introduces rating biases leading to erroneous rating results. Elo ratings are statistical approximations of playing strength. They are rough estimates in the beginning when based on scant data but considerably improve when more game results get included in the calculations over time. The underrating and overrating also get corrected over time And we should not forget that earning a chessplayer rating is only incidental to and not the primary objective for playing the game. All sorts of impurities get included in the rating results such as blunders, nerves and even dishonesty.
Bank2010 Bank2010 4/20/2020 05:10
One thing that is neglected in the chess scoring system is 'the color of pieces' you play. It's always assumed that playing with White or Black is the same. It's not. If you look at the statistics of any big database, White has a better winning rate. So to solve the problem (or at least get closer), the expectancy score should be created separately for White and Black.

Aside from that, I don't understand why the tie-break system almost NEVER credits the players who have more Black games, or why at least something isn't done for players who are doing well with Black.
Strength In Numbers Strength In Numbers 4/20/2020 03:46
My first thought is that this is all caused by improving players' ratings lagging behind their true skill level. The lower someone's rating, the more likely he is to improve and the faster he will do so. As a consequence of this, the lower rated player in a game will be "more underrated" on average and thus score better than the expected result from the table.

In the article here we see the worst offenders are 1300s versus 1600s. The 1300s are scoring 0.1 higher than expected per game (20 rating gain at k=20 over 10 games), which is a performance of around 1400. With k=40, a player with 1300 rating performing like this gains 4 rating points per game on average, i.e. it will take him 25 games to actually reach his skill level of 1400 (and that's ignoring that gains slow down at 1350..). This is a huge number of games compared to how quickly some players can improve from 1300 to 1400, especially considering many games at this level aren't even FIDE rated.

At higher ratings the difference between the Elo table and the actual results becomes smaller and for 2300s playing against stronger opponents there's hardly any difference anymore. Then for 2400s it becomes larger again, because here we go from k=20 to k=10, causing the ratings of improving players to lag behind again!
adbennet adbennet 4/20/2020 05:51
typo, sorry, should have been "players about 200-300 Elo weaker", not "200-200"
adbennet adbennet 4/20/2020 05:48
The basic defect of the Elo system, like the other systems that preceded it (Harkness, Ingo, etc.) is that it incorrectly gives the rating change for a draw to be midway between the rating change for a win and that for a loss. But then, they knew that was problematic when they created these systems. What they didn't know was that people would come to care so much about a silly number that was never supposed to be more than a rough approximation. All attempts to "fix" the rating system are just making the silliness worse.

As for the colored cells in the table, chessplayers don't need any fancy statistics to inform them that playing down hurts them rating-wise. It's a felt experience OTB that players about 200-200 Elo weaker are close enough in strength to have a good chance of getting a draw, siphoning away some rating points. But I think most players overcompensate and make the problem worse. I know I did. I once did a rating analysis of my own play. As a 2200 facing an average 2000 (average of all my opponents), I realized that if I drew every game as white against the Sicilian, my rating would go up! Basically I was playing way too sharply and killing myself by getting into time pressure. Playing less sharply and "not caring" about the occasional draw worked like a charm, in fact I did quickly gain about 50 Elo without any other changes to my play.

And it's obvious from the colors on the chart that the highest rated players have learned the same lesson. Play objectively. If the opponent plays well then it's a draw and there's nothing to be done about it. So GMs give up the penalty rating points against the 200-300 Elo worse players, but they don't double down by overpressing, which would give up even more rating points.
Zagliveri_chess Zagliveri_chess 4/20/2020 03:01
GreenKlaser explains very well the patterns observed in empirical data. If the expected performance is 70% and the actual is 60%, then your effort was not on par with your abilities. Often players play for tournament results, not for rating. How many and how often is rating range-dependent. That leads to the patterns observed. Nothing wrong with the ELO system, no evidence of theoretical flaws.

I would add that in early rounds of tournaments higher rated players tend to offer draws unjustified by position on the boards, perhaps to keep their energy for later rounds which decide tournament prizes. Just add tournament type (for example swiss, round-robin) as explanatory variable in your model, and you will be able to quantify this tendency. Other times you have draws because the opponents know each other and the higher rated one is sympathetic to the weaker player. Finally, draws are offered to avoid having to continue a game for too long. Hence player diligence and effort is not consistent leading to the observed patterns.
GreenKlaser GreenKlaser 4/20/2020 12:03
Not considered is that players do not only play for rating points. They play for tournament scores. They even play for prizes. These goals can conflict. For example, a player needing only a draw in the last round to win clear first place will often offer a draw to a lower rated player and go home early. It may be claimed that since this sort of thing has happened before, that the rating system based on past games has factored this in, but has it?
meninstein meninstein 4/19/2020 06:46
Excellent analysis. Chessmetrics is a fantastic place to learn about metrics. Thanks for your time.
Michael Jones Michael Jones 4/19/2020 04:32
Obviously professionals have good reason to be concerned with their rating since it determines which tournaments they will be invited to, but I find it rather sad that amateur players are so obsessed with it that they'll pick which tournaments to enter based on which are likely to maximise their rating gain. As a 1780 Michael, I pick tournaments according to which are convenient for me in terms of dates and location. If I picked them more carefully, could I become the 1950 Michael of the article? Maybe. Do I care? No - I'll continue to enjoy playing chess, regardless of my rating.
jimijr jimijr 4/19/2020 03:05
In my best result I went 3-2 against opponents rated 450 points higher than me on average. First I got bonus points to raise me to within 400 points, then with the plus score it added up to almost 100 points gained. So yeah, play 'up' if you can.
ChessSpawnVermont ChessSpawnVermont 4/19/2020 02:43
Whilst I don't play many rated tournaments and am in one of the lower rating categories (USCF), I now only enter FIDE rated tournaments with much stronger opposition relative to my rating. This means a) I don't face unrated (provisional) or scholastic rated players who too often have ratings that don't reflect their real ability, and b) I play against opponents consistently rated 400 to 800 points higher. In FIDE sections, if I lose to such a higher rated player, I lose very few rating points. If I draw, I gain more than enough rating points to offset any points lost. If I win against a higher rated opponent, I hit the jackpot. By using this approach after playing in the U-1200 section of a local tournament, winning three out of four rounds and LOSING eight rating points, I've played three FIDE tournaments, raising my rating by close to 100 points. My playing strength is about 500 points higher than my actual rating, as testified to by my FIDE opponents chatting after our games and by my coach of four years, who is an IM. In one case I had a player rated close to 900 points higher than me tell me that he would have offered a draw after 38 moves, but he would have lost too many rating points. Wisely, he continued. I played a less than accurate move at move 43. I resigned at move 59.

Is this "gaming" the system? It sure is. The result is that I face known, stronger opposition that makes for interesting games that challenge me and help improve my game. The only objection that I've ever heard is that looking at my low rating unfairly lulls opponents into an off-guard state. That was undoubtedly true early on, but now players know that my rating doesn't reflect my playing strength. Thus, we all play our best games although, admittedly, my opponents are at greater risk vis-à-vis rating points.