4/10/2016 – An interesting phenomenon of the ELO rating system is that when examining game outcomes for a group of lower rated players, higher rated players tend to underperform relative to the theoretical ELO probability, and lower-rated players overperform. What are the causal factors explaining the inability of the ELO system to predict results? A study by Ganesh Viswanath.

An interesting phenomenon of the ELO rating system is that when examining game outcomes for a group of lower rated players, higher rated players tend to under-perform relative to the theoretical ELO probability, and lower-rated players over-perform. This raises the question: what causal factors explain the inability of the ELO system to predict these results? I hypothesize that age is the key factor and that most of the discrepancy between the theory and the data can be explained by underrated juniors.

The ELO system and its discrepancies have been well documented in the past. It was not too long ago that a ratings conference was held in Athens (1) to discuss the future of the rating system, and various issues were brought up such as ratings inflation, the optimal K factor and other statistical concerns. As a primer for those unfamiliar with the rating system, the ELO is a statistically based system for ranking players. The original ELO formula proposes that a player who is rated 100 points higher should win a game with a 64% probability, and a 200 point difference gives the higher rated player approximately a 75% chance of winning. A key assumption of ELO is that players’ abilities are normally distributed (2). Under this assumption, the expected probability of winning closely follows the logistic curve illustrated in Figure 1.

Figure 1: Theoretical ELO curve
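As a rough illustration, the theoretical curve can be sketched in a few lines of Python. The logistic form with a 400-point scale used here is the commonly published version of the formula; it is an assumption, since the article does not state which exact formula underlies Figure 1, and the function name is hypothetical.

```python
def expected_score(rating_diff: float) -> float:
    """Expected score of a player rated `rating_diff` points above the opponent.

    Assumes the widely used logistic form with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


# Rating differences quoted in the primer above
for diff in (0, 100, 200):
    print(diff, round(expected_score(diff), 2))
```

Under this form a 100-point edge gives roughly 0.64 and a 200-point edge roughly 0.76, close to the figures quoted in the text.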

A result established in a previous article by chess statistician Jeff Sonas (3) is that there are significant discrepancies between the implied probabilities of the ELO system and the data. How does one measure these discrepancies? A simple way is to compile a database of game results and player ratings, which I have done using Chessbase’s MegaBase 2016. Based on this sample I can estimate a conditional expectation function, which is the average score players make at each rating difference. For example, if there are 100 games in my sample with a rating difference of 100 points, and the higher rated player scores 60/100, then the sample estimate of the expected score of the higher rated player is 0.6. Using this procedure, I look at a cohort of players born in the 1970s and subdivide the group into players at grandmaster level (rated >2500), players at expert to master level (rated between 2000 and 2500), and sub-expert players with a rating between 1500 and 2000. The results of this exercise, shown in Figure 2, indicate that the ELO system works well for the higher rating groups. There is more noise in the lower rating group, however, with evidence that lower rated players are performing above the expected score given by the ELO system and higher rated players are underperforming. This raises the question: what causal factors explain this discrepancy in the rating system for the lower rated group? Is it just noise, or is there a more compelling explanation?

Figure 2: Comparing sample distribution of results for different rating groups
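The conditional expectation estimate described above can be sketched as follows. This is a minimal illustration and not the code used for the article: the function names, the 50-point bin width and the logistic 400-point formula are all assumptions, and the sample reproduces the 60/100 example from the text with a made-up split into 60 wins and 40 losses.

```python
from collections import defaultdict


def expected_score(rating_diff):
    """Logistic ELO expectation for the higher-rated player (400-point scale assumed)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def empirical_scores(games, bin_width=50):
    """Average score of the higher-rated player within each rating-difference bin.

    games: iterable of (rating_diff, score) pairs, with rating_diff >= 0 and
    score equal to 1.0, 0.5 or 0.0 from the higher-rated player's point of view.
    Returns {bin_lower_edge: (mean_score, number_of_games)}.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for rating_diff, score in games:
        edge = int(rating_diff // bin_width) * bin_width  # lower edge of the bin
        totals[edge][0] += score
        totals[edge][1] += 1
    return {edge: (s / n, n) for edge, (s, n) in sorted(totals.items())}


# Hypothetical sample: 100 games at a 100-point difference, scored 60/100
games = [(100, 1.0)] * 60 + [(100, 0.0)] * 40
mean_score, n_games = empirical_scores(games)[100]
gap = mean_score - expected_score(100)  # negative: higher-rated side underperforms
print(f"n={n_games} observed={mean_score:.2f} gap={gap:+.2f}")
```

Plotting the per-bin means against `expected_score` for each rating group would give a comparison of the kind shown in Figure 2.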

I hypothesize that what really drives this discrepancy is the systematic undervaluation of young players. This is intuitive, as we all know that young players are often underrated and their current ratings might lag their true potential by a significant margin. The ELO system works great in a perfect world where all players have transitioned to their “steady-state” ratings; however, if there are vast discrepancies between realized ratings and true ability, then this can explain the trends we are seeing in the data. First, I examine the game results of all current juniors (4), and divide the sample into those that are currently under 2000 and a 2000+ category (Figure 3). Systematic undervaluation of lower rated players is evident in both samples, indicating that it is a junior-wide phenomenon. It is startling to note how extensive the undervaluation of the younger player is. Lower rated juniors are achieving an expected score 10 percentage points higher per game than implied by the ELO rating system. This result is intuitive when we think of how juniors who are at master level strength today have clearly overperformed their ratings in order to get to their current level.

Figure 3: Undervaluation of lower rated players is a youth phenomenon

To further establish my reasoning, I compare the current under 2000 rated sample of junior players to the game results of adult players from the same rating group who belong to a cohort born in the 1970s (Figure 4). As I suspected, the adults tend to perform in accordance with the ELO system when they are lower rated; however, they are more likely than juniors to underperform when they are higher rated than their opponents.

What are the implications of this result? Ratings statisticians recognize that younger players are often more likely to improve and transition to a higher rating. Both FIDE and the USCF give higher K factors (5) to younger players so they can converge to their steady state faster. However, this still does not solve the intrinsic problem of ratings undervaluation that we see in the data. One of the potential costs of the system is that it implicitly penalizes adult players who have to face many juniors and continually lose rating points due to their under-performance relative to the ELO system. One potential solution would be a recalibration of the parameters of the current ELO formula for predicting probabilities, in particular by controlling for age.

Figure 4: Comparing Under 2000 juniors to adults at a comparative rating level

My work in this article is by no means complete, and further robustness checks and data analysis are needed to support my conclusion that underrated youth are the driving force behind rating discrepancies. The next step in my analysis would be to extend my sample size and track junior-junior, junior-adult and adult-adult pairings more accurately to demonstrate that all of the rating discrepancy is being driven by games involving juniors. My motivation in writing this article is to shed light on an understudied issue in chess statistics, one that is increasingly relevant as chess becomes a game of youth with an increasing share of young players in the modern game.

*I would like to acknowledge the assistance of Jeff Sonas in starting the data collection for this project, and Jonas Tungodden for feedback.*

(1) http://en.chessbase.com/post/impreions-from-fide-rating-conference-2010

(2) Although this is not the subject of this article, there are arguments against the validity of the normality assumption.

(3) See article by Jeff Sonas at http://en.chessbase.com/post/sonas-overall-review-of-the-fide-rating-system-220813

(4) All players registered in the player encyclopedia of MegaBase 2016, born on 1st of January, 1998 or later

(5) K factors determine how many points a player can win/lose from a game. The rating change is equal to K factor × (game outcome − expected winning score). For example, if you are 100 points higher rated you should win 64% of the time according to the ELO system so the expected winning score is 0.64.
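The rating update in footnote (5) can be worked through with a small sketch. The K factor of 20 is only an illustrative value, the function names are hypothetical, and the expected score again assumes the logistic form with a 400-point scale.

```python
def expected_score(rating_diff):
    """Logistic ELO expectation (400-point scale assumed)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def rating_change(k_factor, outcome, rating_diff):
    """Rating change = K factor * (game outcome - expected score).

    outcome is 1.0 for a win, 0.5 for a draw, 0.0 for a loss;
    rating_diff is the player's rating minus the opponent's rating.
    """
    return k_factor * (outcome - expected_score(rating_diff))


# Illustrative case: K = 20 (assumed), player rated 100 points above the opponent
for outcome in (1.0, 0.5, 0.0):
    print(outcome, round(rating_change(20, outcome, 100), 1))
```

With these assumed values the higher rated player gains about 7 points for a win but loses about 13 for a loss, which is why repeated under-performance against underrated opponents is costly.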

Ganesh Viswanath is from Perth, Australia and is currently studying for a PhD in economics at the University of California, Berkeley. In his spare time he likes to play chess tournaments in the Bay Area (facing a lot of juniors!) and has more recently been trying to venture into chess statistics.

Discussion and Feedback

@Hillion "lower rated players are subject to a greater variability in their 'strength' :much more blunders causing loss or wins in their games"

I don't see this variability/blunder effect at all, and I don't think lower rated players blunder more. Make worse moves. Sure. Blunder? No.

If a blunder occurs on a chessboard, but the opponent does not see it, it is not a blunder.

The ability to spot the blunder is not a random variable, it is skill. You have the problem then of distinguishing between a genuine blunder (Kasparov missing mate in one qualifies as a blunder), or a bad move that reflects the lack of a chess skill.

Let me explain:

If I cannot see more than a move ahead, then many of my moves will appear as a random blunder, but they are not. They reflect a lack of skill in calculating, and that is what the rating should pick up.... Kasparov can see more than a move ahead. Missing a mate in one for him is a blunder. A beginner maybe cannot see that far. It is not a blunder for a beginner then, it is a lack of skill.

If I never ask myself what the opponent is threatening, but only look at my plans, I will miss obvious threats, and that will also appear to be a random blunder. But it is not random at all. It will occur anytime my opponent makes a threat.

Most chess errors are systematic, not random. Which is good, since by studying one's games, we can try to eradicate the errors. You can't so easily eradicate an error that is truly random.

The main variable is who is the lower rated player paired against. Stronger players will spot the blunders, and reveal the weakness of the player. But they are stronger and will have higher ratings, so not much impact on the rating. Weaker players will not even spot the "blunder", so it is not a blunder at all. If your rating formula is producing a large variability in the ratings of lower rated players (after a large number of games are played), then you have a lousy rating system, and it needs to be fixed.

Reversion to the mean is moot.

Thanks, Ganesh. I appreciate the feedback.

Hi Azlan,

I think that a universal elo system would solve the undervaluation problem to some degree as it means that ratings are being updated at a much higher frequency. Whether it would completely eliminate it is questionable as I think with juniors there is also a lot more noise generating the 'reversion to the mean' hypothesis that another discussant mentioned.

How does one explain the fact that a young player like Chidambaram outplayed Carlsen from an inferior position out of the opening in the recent Qatar Open?

Hi, Ganesh. Interesting work related to the present Elo system. I wonder, however, would your analysis also be applicable, in principle, to the universal rating system that I have proposed?

https://www.researchgate.net/publication/299366440_A_Proposal_for_a_Universal_Chess_Rating_System

The purpose of my universal rating system, however, is not so much "accuracy" (of strength of play) as it is inclusiveness (of more people) and the potential of the expansion of the game to new markets. I believe this is more toward what Kasparov envisioned. At least, that's what he told me that he had in mind when we last met.

Glad to see the discussion of the article, want to make a few points.

1) Firestorm, I have left out regression analysis for the time being but the simplest way to measure the age effect is to use a logistic regression with an age dummy. This will capture the shift in the estimated function we see when comparing adults and juniors in Figure 4, and no doubt the age dummy will be statistically significant (I think the anova analysis you recommend will give the same result).

2) Firestorm, I agree tracking junior progress from their starting age is a more granular way of looking at this problem. Indeed I expect progress is non-linear and is probably faster the earlier the starting age. After all, learning chess is a bit like learning a language, it is much easier to absorb information on a new game when you are young.

3) Narceleb, I think that Kalman filter would be interesting as it is probably the optimal technique to use for rating adjustment, and from an academic point of view most certainly it would be interesting to see whether it is a significant improvement over Elo when looking at the lower rated group, and can sort out the undervaluation problem we are seeing. From a practical point of view fide, uscf and other organizations will not consider this because it lacks transparency and will complicate rating formula considerably as now the K factor becomes stochastic.

kid => natural developing brain process => unexpected skills discoveries => elo rating improving faster than expected.

adult => mature brain process => predictable play in its own style => fair elo rating.

ELO -> Electric Light Orchestra.

I agree with A7... that age is not the factor, but growing (or declining) skill which causes the current Elo to lag behind the "true Elo".

To fix this I think you have to take into account past Elo history to determine if the Elo is increasing (as it will be for young players usually, but not exclusively), and use some form of regression to predict "true current Elo" and use that information to modify the Elo algorithm.

There are of course numerous technical problems to be solved to flesh this out to a reliable algorithm, but perhaps it can be done (and earn a PhD in the process :).

The first problem is assuming the distribution of ratings in Gaussian (a.k.a., "normal" or a "bell curve"). It is not. Ratings more closely follow a Chi-squared distribution.

I recommend investigation into developing a rating system based on a Kalman Filter. If one has not played for a while, the estimate of one's current rating becomes more uncertain, and his next result will carry more weight for that player, but less for his opponents. Also, if a player's rating has been on an upward trajectory, if there is a two-month gap between tournaments, his estimated rating (not used for pairing purposes, but for computing his new rating and those of his opponents) is projected by advancing that trajectory two months forward to the tournament date.

In my opinion (as an 'expert' researcher in statistics) lower rated players are subject to a greater variability in their 'strength' :much more blunders causing loss or wins in their games- which are, on average, shorter than those of high rated players. This causes a regression to the mean, well known for instance in models with (random) errors in the variables.

@ganvisnat

The ratings of some juniors are indeed absurdly low, but by leaving them out of your graphics, you're painting a pretty picture about the problem, compared to how bad it actually is.

As long as FIDE remains hellbent on giving an ELO rating to just about anyone who plays a tournament, the only practical solution to avoid "penalising adult players that are having to face many juniors" would be to give both a higher K to the lower rated players and a lower one to the higher rated ones. And the fluctuation should be proportional, i.e. a 1000 ELO player should have a higher K than a 1100 ELO one.

I liked your article, and the fact that you gave a measured and constructive response to observations from people who read and commented it (compared to another author of articles submitted as scientific to Chessbase who sadly responds quite differently), and the quality of your approach. There is, I would have thought, a technically simple (statistically speaking) check you can do for age effects- but hard to control for confounds on progress because of everything else that happens throughout childhood and adolescence in terms of changes. That is, if you can control for age at which a player started (I won't say junior, because we want to partial out that implicit assumption for the time being), then you can include age as part of your analysis in e.g. an ANOVA- data should conform to the requirements, though of course there are other statistical approaches you can use. I suspect an ANOVA is probably the most powerful if you have other factors you want to control for in a multi-factorial approach (rather than two-way). I'm guessing you know what I'm on about here, given your academic area.

The argument "we should see the same effect in adults taking up the game on the same study program" is testable but I wouldn't argue it- younger brains are much better than older brains at making connections (and I do mean at the neurological level, not as a general observation) and hence acquiring skills is faster- plasticity, basically. But that's about interpretation, not analysis per se, and my gut feeling doesn't change the need for data and analysis one iota. Just saying- there is an argument you are on the right track, but that empirical rigour is nonetheless required.

Nice succinct article! One final point, and this is a little bugbear of mine. Everyone knows what you mean by "ELO", of course- you mean the Elo rating system, but "ELO" is not an acronym- it refers to the guy who did the original work to create a workable rating system for chess- Arpad Elo, a Hungarian statistician. So the correct term when talking about the Elo system is, of course, "Elo".

In conclusion, thank you again, Chessbase, for providing a forum for work like this, and good luck Ganesh on your work on statistics in chess. Measurement can be an end in itself, of course, but your point is very good, and has real world application, and even if you do find "adults are similarly underrated with no statistically significant difference from children", and "age at which a child starts is not a statistically significant factor in degree of underrating" (though I bet you don't), it won't invalidate your results for the population you are looking at- whether adults are underrated or not, children certainly are in cases, and it's important to correct that. Back in the seventies an organiser involved in juniors in London (quite famous, Leonard Barden), produced ratings for juniors because of the suspected lag of about 18 months between actual strength OTB and their rating on the standard list (produced, at the time, once a year)- the problem has been around for years, but the options for real qualitative data analysis have improved substantially since then for many reasons you'll clearly understand. Actually ... one final, final point- if you do look at age starting in chess versus improvement, it would be really interesting (well, for me, anyway), to look at structure of progress- I bet it's not going to be linear, and that has the potential to be useful in looking at training and motivation- when a junior hits a plateau, for example, rather than the more negative "the data says you should have reached rating "n" by now but you haven't"- very discouraging.

All in all, really good work- I hope you can keep it going.

Nice article Ganesh! A sensible theory analysed by numbers. That is how I imagine a work by a PhD should look like. Much much much better than other PhDs publishing on the Chessbase page.

Hi A7fecd1676b88

Would like to comment on your points (I'm author of article).

1) Based on discussions with some of my colleagues, we did think about game experience as being the real factor as opposed to age. i.e. it is the fact that juniors have played few games that mean they are further from the steady state and so are more likely to be underrated. This would mean that new adults should follow a similar pattern to juniors as they are also transitioning to the steady state. That is a robustness check I ought to do.

2) I'm aware of the absurdly low ratings of some juniors so I trimmed my sample to just look at 1500-2000 rated games of juniors and adults.

3) Overrated adults is an interesting phenomenon, I guess to show that I need to refine my results a bit more to only get junior-adult pairings to show this. Right now I haven't classified the opponents' age in my data analysis but will get to that step soon.

Ganesh

Obviously Doug Eckert's scenario can unfortunately happen, but it is also made possible by FIDE lowering its rating floor. Previous floors did not allow 1800 - 1900 FIDE.

The problem in the U.S. is not that complicated. There are a lot of junior players who play U.S. Chess Federation rated tournaments, but not many, if any, FIDE rated tournaments. That can lead to situations where the kids are rated 2200 - 2300 USCF and 1800 - 1900 FIDE. That is a losing proposition for the adults and has resulted in 100 FIDE points being lopped off my ELO in the last year. The solution is pretty simple. If an adult is playing someone under the age of 16 whose FIDE rating is lower than 2200, give the adult player a K factor of 5 for that game. The kid can still get the rating points if he wins, but, the adult won't be penalized.

Underrating can happen when true progress is faster than the manifestation of that progress in rated games.

" I hypothesize that age is the key factor and that most of the discrepancy between the theory and the data can be explained by underrated juniors."

1) I don't see where you get "age" and "juniors" as the KEY factors. Factors yes, because a young player's brain is still developing in all areas cognitively, so you have to expect some improvement as they get older even if they don't study, just because of that. This is the inverse of the mental decline we see in the old.

Traditionally however, juniors tended to be underrated because they can, if they study, improve significantly between tournaments, so that their rating is only a measure of their old skill they had at the last tournament, not their current strength. But that logic must apply to ANYBODY who is fairly new to the game. An adult who just learned the game but began a serious program of study must also show the same effect of being underrated. Age is only indirectly a factor because most juniors have a lot to learn. But not all. The Polgars at age 11 or so would be considered juniors, but probably would not be underrated.

So I have to disagree with the wording of the hypothesis, it is perhaps explained by new rapidly improving players who are underrated (although these tend to be juniors).

To make the point more explicit. If playing chess was illegal until you turned 21, you would see the same underrated effect, but on people in their early 20s, not juniors.

2) There is also this trend of "junior/scholastic" tournaments, where kids only play other kids. If you look at these tournament cross tables, you see absurdly low USCF ratings...like 400 or 500. As a kid, you could go 5-0 in such tournaments, and not get a high rating, even though you are reading chess books and practicing against Fritz at home. Such a kid might be quite strong in reality, so we have yet another mechanism for producing an underrated junior.

3) Where age might become a factor, as opposed to one's newness to the game, is that older players lack the stamina demanded by a chess tournament. In this case, it is perhaps the opposite effect of the young player, the older player can be overrated because his faculties are declining -- unless you are Korchnoi that is.

4) Normal distribution is indeed an assumption that must be looked at, simply because I don't believe it is what we actually find in the USCF rating pool.
