Arpad Elo and the Elo Rating System
By Daniel Ross
Indiana University of Pennsylvnia
Math 563 – Mathematical Statistics I
Ask any United States Chess Federation (USCF) member about his or her "Elo;" you will likely get an immediate and accurate response. Now ask that same USCF member where the term "Elo" comes from. Something along the lines of, "I forget... someone told me what it stands for once, but I can't remember" is more likely than the correct response. Although the most widely used chess rating system in the world bears his name, few have anything more than a vague idea of how the system works while still fewer are aware of the statistician Dr. Arpad Elo himself.
A physicist by discipline, the Hungarian-born Elo was a fervent amateur chess player the greater part of his life. Performing at the master level, Elo won the Wisconsin championship eight times between the years of 1935 and 1961. In 1959 – having already served the USCF for twenty years – Elo was appointed chairman of the USCF rating committee; the following year the USCF adopted the revolutionary new rating system that he conceived. A decade later, in 1970, the World Chess Federation, better known as FIDE (Fédération Internationale des Échecs), also implemented the Elo system to track the relative strength of all masters worldwide. Today, nearly a half-a-century after inception, the Elo system very closely resembles the original form. Further, the Elo system has been applied to many other forms of pairwise competition including Scrabble, Massively Multiplayer Online Role-Playing Games (MMORPGs) such as World of Warcraft and various professional and collegiate sports.
Arpad Emrick Elo (original Hungarian: Árpád Imre Élö), born in 1903 in Hungary, but moved to the United States with his parents as a child in 1913. He became a professor of physics at Marquette University in Milwaukee, Wisconsin. He was also a chess master who won the Wisconsin State Championship eight times. He died in Brookfield, Wisconsin in 1992.
Arpad Elo is best known for his system of rating two-player games such as chess. He developed his formula and a chess rating system which was approved and passed at a meeting of the United States Chess Federation in St. Louis in 1960. In 1970, FIDE, the World Chess Federation, agreed to adopt the Elo Rating System. From then until the mid-1980s, Elo himself made the rating calculations. At the time, the computational task was relatively easy because fewer than 2000 players were rated by FIDE.
FIDE reassigned the task of managing and computing the ratings to others, excluding Elo. FIDE also added new "Qualification for Rating" rules to its handbook awarding arbitrary ratings (typically in the 2200 range, which is the low end for a Chess Master) for players who scored at least 50% in the games he played at selected events, such as named Chess Olympiads. Elo and others objected to these new rules as arbitrary and politically-driven. Source: Wikipedia.
While his rating system remains the most significant of his contributions, Dr. Elo also made noteworthy observations regarding the development of chess players by demographics such as age, place of birth and gender. Additionally, Elo applied his rating system to historical tournament results dating back to the early 19th century. In doing so, he made it possible for the first time to compare the relative strength of any two chess players of significant strength for over a century-and-a-half. Although ardently contested, these ratings are still considered by many the most accurate way of rating players who preceded any formal rating system.
Upon being appointed chairman of the USCF rating committee in 1959, Arpad Elo was given the daunting task of overhauling the current rating system. Developed by Kenneth Harkness in the early 1950s, the system – now generally referred to as the "Harkness system" – was originally embraced by members of the chess community who for the first time had a way of quantifying his or her abilities. Within a few years, however, it was obvious that the Harkness system – which could be summed up in single table (see Appendix A) – was insufficient in its simplicity. Although Harkness ratings were often deemed fair in the common case, certain fringe circumstances resulted in statistically inaccurate results. For the sake of tradition and, more importantly, USCF members, Elo kept two important parts of the Harkness system intact: the rating scale and the class categories of that scale.
The rating scale, which has a minimum bound of 0, places the cutoff for candidate masters (also known as experts) at 2000. Although the maximum of the scale is technically unbounded, it would be unprecedented for a player to exceed a rating of 3000. Because the magnitude of these numbers are arbitrary, Elo thought it wise to leave people at their current rating for the sakes of both the chess community (who could remain ignorant of the changes in calculation) and the USCF (who would otherwise need to recalculate the rating of every one of their members). The more important and easily overlooked concept that Elo took from the Harkness system was the concept of a player "class" which is defined as a 200 point gap in ranking (see Appendix B). Through observation of previous tournament results, Elo found that one class accurately portrayed the standard deviation (σ) in terms of strength of performance for a given chess player over a series of games.
Using past results and Harkness ratings, Elo observed that the distribution of individual performances resembles a normal distribution with a σ of one class (200 points). Using a mean (μ) of zero(1) gives us the following probability density function (pdf):
Because there are two participants in a single game of chess – with each participant having a performance deviation of one class (σ1 and σ2) – the standard deviation used in the pdf derived as follows:
The cumulative distribution function (cdf) of our pdf using this σ' is integral to the Elo system in terms of calculating expected performances (prior to a tournament), calculating performance ratings (during a tournament), and updating ratings (following a tournament). For example, consider the following graph:
Here we have graphed the cdf of a normal distribution with μ = 0 and σ' = 282.84 ELO. The dots correspond to a given player winning a single game against an individual rated one class higher (24%), half-a-class higher (36%) and equally rated (50%). Note that draws count as half a win for each participant and is the most likely case when two players of equal strength are competing.
Using this graph, we can quantify expected results and actual performances, which, combined, are used to calculate new ratings following a tournament. This is more instructive and arguably more interesting to demonstrate by example; consider the most recent FIDE World Championship in Mexico City. Eight players participated and played every other player twice (once as white and once as black) with the expected results:
The current rating (Rc) for each player is known at the beginning of the tournament as the cumulative results of his past performances. The average rating (Ra) for the competition of each participant as well as the difference between Ra and his current rating (Dc) is trivial, but integral to subsequent calculations. The expected winning percentage (Pe) over the course of the tournament for each individual comes from the cumulative normal distribution Pe = P [ X ≤ Dc]. This can be approximated from the graph above, or formally solved as follows:
When the Elo system was originally implemented, a close approximation of this number could be looked up on a chart. This number is then multiplied by the total number of games, in our example fourteen, so the expected number of wins (We) can be calculated. For a more accurate result, one could calculate the Pe for each game and sum them together. This method, although recognized as more accurate since the Elo system was originally implemented, was far too time consuming – especially in large tournaments – for those calculating ratings by hand, but is given here as We’.
Now that the expected results are known, they can be compared to the actual final results of the tournament:
Here, the observed number of wins (Wo) is shown along with the winning percentage (Po) which is simply the number of wins divided by the number of games played. The performance rating (Rp) is calculated by taking the observed winning percentage and essentially reversing the process of finding the expected winning percentage. In other words, we’re looking for the Do that satisfies the equation Po = [ X ≤ Do ] for the cdf of the normal distribution given above; this number is then added to Ra to get Rp.
Of much interest to the participants involved – as well as the chess community – is how much Rp deviated from his current rating (ΔRp) which is a better representation of his performance in terms of his ability than Wo. For example: Vladimir Kramnik and Boris Gelfand may have tied for second place, but Kramnik’s rating (2769) going into the tournament indicated that he was expected to perform better than Gelfand (2733). That is why, despite having the same number of points, Kramnik’s ΔRp (+31) was significantly lower than Gelfand’s (+72).
The difference between Wo and We/We’ (ΔW/ΔW’) is also given here. The average of the absolute value of ΔW and ΔW’ for the tournament is 0.79 and 0.71, respectively; as expected, We’ was slightly smaller and therefore closer to the actual results.
Finding an approximation of the probable error for ΔW is as follows: take half of the largest ΔRc between any two individuals the group and find Pe; let Qe = 1 - Pe; define the variance as σ2 = N * Qe * Pe; the square root of this number (the standard deviation) divided by 0.67449(2) is the probable error. For our example, the probable error is ±1.26 wins. Statistical tolerance expects half of the contestants to fall outside of this range yet only one does. Two probable reasons for this abnormally low number exist: a small sample set and the fact that less performance deviation exists at the highest levels of play.
Finally, we have the new rating (Rn) of each participant which is calculated by multiplying ΔW by a coefficient, K, known as the K-factor, and adding it to Rc. Increased K-factors increase the variability of ratings, therefore, most federations (including USCF and FIDE) use a proportionally larger K-factor for players with fewer than 25 games played since the confidence in their given rating is less than that of an established player. Here a K-factor of 10 – which is relatively low – was used to calculate Rn since this is what FIDE currently uses for players rated 2400 and above. When originally implemented, the USCF used a K-factor of 32, but has since moved to a dynamic system which gives players with a lower rating or fewer games played a higher number than experienced or higher-rated players.
Juxtapose the merits of the Elo system are several well documented weaknesses. The first, and perhaps most obvious, weakness of the Elo system is that no advantage is given to White. This advantage is apparent even at my level (B Class) of play: my Po is 0.708 as White and 0.556 as Black. Although Elo argued that this difference would offset since that most players play about half of their games as White and half as Black (I would beg to differ), the fact is that when ratings were calculated by hand this kind of sophistication in the system was much too time consuming. Another arguable weakness of the Elo system is the concept of rating inflation. For example, Bobby Fischer – considered by many the best chess player of all time – had a top rating of 2780 which would currently put him as the fourth-ranked player in the world. Others argue, however, that this indicates exactly what it implies: that, due to increased knowledge of the game including the use of computers in preparation, the top players of today are simply better than the best players of ten, twenty or fifty years ago. Either way, this point relates to a third weakness of the Elo system: Ratings are relative to your competition which opens the possibility of abnormally high (or low) ratings within a controlled group of players. This knowledge allowed Claude Bloodgood, only a master level player, to reach a rating of 2702 (which was second best in the USCF at the time) by organizing and participating in hundreds of prison tournaments that had mostly weak players. Similarly, the Union of Myanmar went from four rated players in January of 1997 to six players in the top 100 in January of 2000 by organizing closed tournaments. While such behavior may seem juvenile, consider that the biggest tournaments in the world pay exorbitant appearance fees to the highest rated players thereby encouraging players to have the highest rating possible. To date, neither USCF nor FIDE have dealt with any of the addressed weaknesses of the Elo system other than on a case-to-case basis.
Beyond these weaknesses are possible exploitations, the most common of which is the fact that only active players can have an active rating. Often a promising young player will stop playing in tournaments for an extended period of time, but continue to advance his or her abilities. This individual will eventually enter a large tournament with lucrative prize funds for each class, thereby increasing his or her chances of winning a large amount of money by participating with players of theoretically less ability. Another common exploitation – prevalent in online play – is selective pairing. A higher rated player will often only challenge or accept challenges from significantly weaker opponents. Because most online variations of the Elo system allow for a minimum gain of one point for the winning player, a rare, off-chance loss is compensated by a large number of relatively easy victories.
Despite these flaws, very few changes have been made to the Elo system over the past 47 years. As previously mentioned, because of advances in computer assistance, USCF now uses a dynamic K-factor which is higher for lower-rated and less-experienced players. The use of computers also allows every game to be rated individually and independent of the tournaments in which they are played. This is transparent to USCF members since official ratings are posted quarterly, but continually tabulated ratings (which no longer need to be rounded to the nearest Elo point) are statistically favorable in their accuracy. The final change made by the USCF – also made possible by the increased accessibility of computers – is the transition from a normal distribution to a logistic distribution. By observing a large number of results, the USCF determined that a logistic distribution most accurately extrapolated outcomes. FIDE still uses the normal distribution that Elo had originally put in place.
Regardless, the general principles of the Elo system have withstood the test of time, which is all the more impressive when one considers that Dr. Elo lacked the processing power of modern computers. To me, it’s disappointing that so many people can use something without appreciating it on even a base level. So next time a ches player asks you what “ELO” stands for, at the very least let them know that it was actually someone’s last name.
Appendix B: USCF Classes and Distribution of Classes Amongst Active Members(3)
- Elo, Arpad. The Rating of Chess Players Past and Present. New York: Arco, 1978.
- Harkness, Kenneth. The Official Blue Book and Encyclopedia of Chess. New York: McKay, 1956.
- Crowther, Mark. "FIDE Ratings Crisis." This Week in Chess. 24 January, 2000.
- Crowther, Mark. "Claude F. Bloodgood III July 14th 1924 - August 4th 2001." This Week in Chess. 20 August, 2001.
- Sonas, Jeff. "The Sonas Rating Formula – Better Than Elo?" Chessbase. 22 October, 2002.
- Marin, Mihail. "Anand's Triumph in the Final Round." Chessbase. 30 September, 2007.
- "Elo Rating System." Wikipedia, The Free Encyclopedia. 4 October, 2007, 15:44 UTC. Wikimedia Foundation, Inc. 9 October 2007.
1. A μ of zero indicates the difference in an individual's current rating and their expected performance which is obviously most probable at zero. It may seem more intuitive to use an individual's current rating as Î¼; this will simply offset the distribution by μ points. When the Elo system was introduced, all ratings were calculated by hand. Having a single model to describe all players allowed those calculating ratings the convenience of looking values up from a table or graph, thereby saving time.
2. This is derived from the definition of probable error; the ratio of the standard deviation that contains 75% of the cumulative distribution of a function. For example, in order to get a Pe of 75% with a σ of 282.84, a Dc of 0.67449 * 282.84 = 190.77 would be needed.
3. Data available to USCF members at https://secure.uschess.org/MembersOnly/download.php
|Dan Ross is 28 years old and lives in Johnstown, Pennsylvania, USA. He is a full-time software engineer and part-time graduate student (Indiana University of Pennsylvania, Applied Mathematics). Dan enjoys playing ice hockey (just not particularly well) and chess. "Chess, for me, is a hobby," he writes, "not a job (partially by choice, but mostly because I'm just not that good – 1605-P21 USCF). I prefer to enjoy my games; occasionally even turning down the a better move for a more interesting one As White I tend to play the King's Gambit or an Italian Game; as Black I've been experimenting with the Pirc or the Dutch (with mixed results) My best result thus far in a USCF tournament was the 49th Gateway Open in Pittsburgh, Pennsylvania (3.5/4.0 in the U1600 section – second place).|