Arpad Elo and the Elo Rating System
By Daniel Ross
Indiana University of Pennsylvania
Math 563 – Mathematical Statistics I
Ask any United States Chess Federation (USCF) member about his or her "Elo" and
you will likely get an immediate and accurate response. Now ask that same USCF
member where the term "Elo" comes from. Something along the lines of, "I
forget... someone told me what it stands for once, but I can't remember"
is more likely than the correct response. Although the most widely used chess
rating system in the world bears his name, few have anything more than a vague
idea of how the system works while still fewer are aware of the statistician
Dr. Arpad Elo himself.
A physicist by discipline, the Hungarian-born Elo was a fervent amateur chess
player the greater part of his life. Performing at the master level, Elo won
the Wisconsin championship eight times between the years of 1935 and 1961. In
1959 – having already served the USCF for twenty years – Elo was
appointed chairman of the USCF rating committee; the following year the USCF
adopted the revolutionary new rating system that he conceived. A decade later,
in 1970, the World Chess Federation, better known as FIDE (Fédération
Internationale des Échecs), also implemented the Elo system to track
the relative strength of all masters worldwide. Today, nearly half a century
after its inception, the Elo system still very closely resembles its original form. Further,
the Elo system has been applied to many other forms of pairwise competition
including Scrabble, Massively Multiplayer Online Role-Playing Games (MMORPGs)
such as World of Warcraft and various professional and collegiate sports.
Arpad Emrick Elo (original Hungarian: Árpád Imre Élő) was born
in 1903 in Hungary and moved to the United States with his parents
as a child in 1913. He became a professor of physics at Marquette University
in Milwaukee, Wisconsin. He was also a chess master who won the Wisconsin
State Championship eight times. He died in Brookfield, Wisconsin in 1992.
Arpad Elo is best known for his system of rating two-player games such
as chess. He developed his formula and a chess rating system which was
approved and passed at a meeting of the United States Chess Federation
in St. Louis in 1960. In 1970, FIDE, the World Chess Federation, agreed
to adopt the Elo Rating System. From then until the mid-1980s, Elo himself
made the rating calculations. At the time, the computational task was
relatively easy because fewer than 2000 players were rated by FIDE.
FIDE reassigned the task of managing and computing the ratings to others,
excluding Elo. FIDE also added new "Qualification for Rating"
rules to its handbook awarding arbitrary ratings (typically in the 2200
range, which is the low end for a Chess Master) for players who scored
at least 50% in the games they played at selected events, such as named
Chess Olympiads. Elo and others objected to these new rules as arbitrary
and politically driven.
While his rating system remains the most significant of his contributions,
Dr. Elo also made noteworthy observations regarding the development of chess
players by demographics such as age, place of birth and gender. Additionally,
Elo applied his rating system to historical tournament results dating back to
the early 19th century. In doing so, he made it possible for the first time
to compare the relative strength of any two chess players of significant strength
across more than a century and a half. Although ardently contested, these ratings are
still considered by many the most accurate way of rating players who preceded
any formal rating system.
Upon being appointed chairman of the USCF rating committee in 1959, Arpad
Elo was given the daunting task of overhauling the current rating system. Developed
by Kenneth Harkness in the early 1950s, the system – now generally referred
to as the "Harkness system" – was originally embraced by members
of the chess community, who for the first time had a way of quantifying their
abilities. Within a few years, however, it was obvious that the Harkness
system – which could be summed up in a single table (see Appendix
A) – was insufficient in its simplicity. Although Harkness ratings
were often deemed fair in the common case, certain fringe circumstances resulted
in statistically inaccurate results. For the sake of tradition and, more importantly,
USCF members, Elo kept two important parts of the Harkness system intact: the
rating scale and the class categories of that scale.
The rating scale, which has a minimum bound of 0, places the cutoff for candidate
masters (also known as experts) at 2000. Although the maximum of the scale is
technically unbounded, it would be unprecedented for a player to exceed a rating
of 3000. Because the magnitude of these numbers is arbitrary, Elo thought it
wise to leave people at their current ratings for the sake of both the chess
community (who could remain ignorant of the changes in calculation) and the
USCF (who would otherwise need to recalculate the rating of every one of their
members). The more important and easily overlooked concept that Elo took from
the Harkness system was the concept of a player "class," which is defined
as a 200-point span of the rating scale (see Appendix B). Through
observation of previous tournament results, Elo found that one class accurately
portrayed the standard deviation (σ) in terms of strength of performance
for a given chess player over a series of games.
Using past results and Harkness ratings, Elo observed that the distribution
of individual performances resembles a normal distribution with a σ of
one class (200 points). Using a mean (μ) of zero(1)
gives us the probability density function (pdf) f(x) = (1 / (σ√(2π))) e^(−x² / (2σ²)) with σ = 200.
Because there are two participants in a single game of chess – with
each participant having a performance deviation of one class (σ1 = σ2 = 200)
– the standard deviation of the rating difference is derived as σ' = √(σ1² + σ2²) = 200√2 ≈ 282.84.
The cumulative distribution function (cdf) of our pdf using this σ' is
integral to the Elo system in terms of calculating expected performances (prior
to a tournament), calculating performance ratings (during a tournament), and
updating ratings (following a tournament). For example, consider the following graph.
Here we have graphed the cdf of a normal distribution with μ = 0 and
σ' = 282.84 Elo points. The dots correspond to a given player winning a single
game against an individual rated one class higher (24%), half a class higher
(36%), and equally rated (50%). Note that a draw counts as half a win for each
participant and is the most likely outcome when two players of equal strength meet.
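These percentages can be reproduced numerically; a minimal sketch (the σ' of 282.84 comes from the text, while `win_prob` is a name chosen here for illustration):

```python
import math

SIGMA_PRIME = 200 * math.sqrt(2)  # ≈ 282.84, combined deviation of two players

def win_prob(rating_diff):
    """Probability of beating an opponent rated `rating_diff` points higher:
    the normal cdf (mu = 0, sigma = 282.84) evaluated at -rating_diff."""
    # Note: SIGMA_PRIME * sqrt(2) works out to exactly 400
    return 0.5 * (1 + math.erf(-rating_diff / (SIGMA_PRIME * math.sqrt(2))))

print(round(win_prob(200), 2))  # one class higher -> 0.24
print(round(win_prob(100), 2))  # half a class higher -> 0.36
print(round(win_prob(0), 2))    # equally rated -> 0.5
```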
Using this graph, we can quantify expected results and actual
performances, which, combined, are used to calculate new ratings following a
tournament. This is more instructive and arguably more interesting to demonstrate
by example; consider the most recent FIDE World Championship in Mexico City.
Eight players participated and played every other player twice (once as white
and once as black) with the expected results:
The current rating (Rc) for each player is known at the beginning
of the tournament as the cumulative result of his past performances. The average
rating (Ra) of each participant's competition, as well as the
difference between his current rating and Ra (Dc), is trivial to compute
but integral to subsequent calculations. The expected winning percentage (Pe)
over the course of the tournament for each individual comes from the cumulative
normal distribution: Pe = P[X ≤ Dc] = Φ(Dc / σ'), where Φ is the cdf of the
normal distribution with μ = 0 and σ' = 282.84. This can be
approximated from the graph above or solved formally.
When the Elo system was originally implemented,
a close approximation of this number could be looked up on a chart. This number
is then multiplied by the total number of games, in our example fourteen, so
the expected number of wins (We) can be calculated. For a more accurate
result, one could calculate the Pe for each game and sum them together.
This method, although recognized as more accurate since the Elo system was originally
implemented, was far too time consuming – especially in large tournaments –
for those calculating ratings by hand, but is given here as We’.
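Both estimates can be sketched in a few lines. The ratings below are hypothetical stand-ins (not the actual Mexico City field); only σ' = 282.84 and the 14-game double round robin come from the text:

```python
import math

SIGMA_PRIME = 200 * math.sqrt(2)  # ≈ 282.84

def expected_score(diff):
    """Normal-cdf expected score against an opponent rated `diff` points lower."""
    return 0.5 * (1 + math.erf(diff / (SIGMA_PRIME * math.sqrt(2))))

player = 2750                                        # hypothetical rating
field = [2850, 2800, 2750, 2700, 2650, 2600, 2550]   # hypothetical opponents
games = [r for r in field for _ in range(2)]         # double round robin: 14 games

# Coarse method: one Pe from the average rating difference, times 14 games (We)
d_c = player - sum(games) / len(games)
w_e = len(games) * expected_score(d_c)

# Finer method: sum the per-game expectations (We')
w_e_prime = sum(expected_score(player - r) for r in games)

print(round(w_e, 2), round(w_e_prime, 2))  # -> 7.98 7.92
```

With this spread of opponents the per-game sum comes out slightly below the coarse estimate, since the normal cdf is flatter far from zero.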
Now that the expected results are known, they can
be compared to the actual final results of the tournament:
Here, the observed number of wins (Wo)
is shown along with the winning percentage (Po) which is simply the
number of wins divided by the number of games played. The performance rating
(Rp) is calculated by taking the observed winning percentage and
essentially reversing the process of finding the expected winning percentage.
In other words, we’re looking for the Do that satisfies the equation
Po = P[X ≤ Do] for the cdf of the normal distribution
given above; this number is then added to Ra to get Rp.
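Reversing the cdf is an inverse-normal lookup; a minimal sketch using Python's `statistics.NormalDist` (the ratings and score here are hypothetical, and a perfect score would need special handling, since the inverse cdf diverges at Po = 1):

```python
from statistics import NormalDist

SIGMA_PRIME = 282.84  # combined deviation from the text

def performance_rating(r_a, wins, games):
    """Invert the cdf: find the Do with P[X <= Do] = Po, then add it to Ra."""
    p_o = wins / games                                  # observed winning percentage
    d_o = NormalDist(mu=0, sigma=SIGMA_PRIME).inv_cdf(p_o)
    return r_a + d_o

# Hypothetical: 8/14 against opposition averaging 2700
print(round(performance_rating(2700, 8, 14)))  # -> 2751
```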
Of much interest to the participants involved – as well as the chess community
– is how much Rp deviated from a player's current rating (ΔRp),
which is a better representation of his performance in terms of ability
than Wo. For example: Vladimir Kramnik and Boris Gelfand may have
tied for second place, but Kramnik’s rating (2769) going into the tournament
indicated that he was expected to perform better than Gelfand (2733). That is
why, despite having the same number of points, Kramnik’s ΔRp
(+31) was significantly lower than Gelfand’s (+72).
The difference between Wo and We/We’ (ΔW/ΔW’)
is also given here. The average of the absolute value of ΔW and ΔW’
for the tournament is 0.79 and 0.71, respectively; as expected, ΔW’
was slightly smaller, making We’ the closer estimate of the actual results.
Finding an approximation of the probable error for ΔW is as follows: take
half of the largest ΔRc between any two individuals in the group
and find Pe; let Qe = 1 − Pe; define the variance
as σ² = N * Qe * Pe; the square root of
this number (the standard deviation) multiplied by 0.67449(2)
is the probable error. For our example, the probable error is ±1.26 wins. Statistical
tolerance expects half of the contestants to fall outside of this range yet
only one does. Two probable reasons for this abnormally low number exist: a
small sample set and the fact that less performance deviation exists at the
highest levels of play.
Finally, we have the new rating (Rn) of each participant, which is calculated
by multiplying ΔW by a coefficient, K, known as the K-factor, and adding
it to Rc. Increased K-factors increase the variability of ratings;
therefore, most federations (including the USCF and FIDE) use a proportionally larger
K-factor for players with fewer than 25 games played since the confidence in
their given rating is less than that of an established player. Here a K-factor
of 10 – which is relatively low – was used to calculate Rn since
this is what FIDE currently uses for players rated 2400 and above. When originally
implemented, the USCF used a K-factor of 32, but has since moved to a dynamic
system which gives players with a lower rating or fewer games played a higher
number than experienced or higher-rated players.
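The update rule itself is a single line; the numbers in this example are hypothetical, while K = 10 is the FIDE value for players rated 2400 and above mentioned in the text:

```python
def new_rating(r_c, w_obs, w_exp, k=10):
    """Rn = Rc + K * (Wo - We)."""
    return r_c + k * (w_obs - w_exp)

# A hypothetical player rated 2750 who scores 8/14 when 7.1 wins were expected:
print(round(new_rating(2750, 8.0, 7.1)))  # -> 2759
```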
Alongside the merits of the Elo system are several well-documented weaknesses. The first,
and perhaps most obvious, weakness of the Elo system is that no advantage is
given to White. This advantage is apparent even at my level (B Class) of play:
my Po is 0.708 as White and 0.556 as Black. Although Elo argued that
this difference would offset since most players play about half of their
games as White and half as Black (I would beg to differ), the fact is that when
ratings were calculated by hand this kind of sophistication in the system was
much too time consuming. Another arguable weakness of the Elo system is the
concept of rating inflation. For example, Bobby Fischer – considered by many
the best chess player of all time – had a top rating of 2780 which would currently
put him as the fourth-ranked player in the world. Others argue, however, that
this indicates exactly what it implies: that, due to increased knowledge of
the game including the use of computers in preparation, the top players of today
are simply better than the best players of ten, twenty or fifty years ago. Either
way, this point relates to a third weakness of the Elo system: ratings are relative
to one's competition, which opens the possibility of abnormally high (or low)
ratings within a controlled group of players. This knowledge allowed Claude
Bloodgood, only a master level player, to reach a rating of 2702 (which was
second best in the USCF at the time) by organizing and participating in hundreds
of prison tournaments that had mostly weak players. Similarly, the Union of
Myanmar went from four rated players in January of 1997 to six players in the
top 100 in January of 2000 by organizing closed tournaments. While such behavior
may seem juvenile, consider that the biggest tournaments in the world pay exorbitant
appearance fees to the highest rated players thereby encouraging players to
have the highest rating possible. To date, neither USCF nor FIDE have dealt
with any of the addressed weaknesses of the Elo system other than on a case-by-case
basis. Related to these weaknesses are possible exploitations, the most common of which stems from the
fact that only active players can have an active rating. Often a promising young
player will stop playing in tournaments for an extended period of time, but
continue to advance his or her abilities. This individual will eventually enter
a large tournament with lucrative prize funds for each class, thereby increasing
his or her chances of winning a large amount of money by participating with
players of theoretically less ability. Another common exploitation – prevalent
in online play – is selective pairing. A higher rated player will often only
challenge or accept challenges from significantly weaker opponents. Because
most online variations of the Elo system allow for a minimum gain of one point
for the winning player, a rare, off-chance loss is compensated by a large number
of relatively easy victories.
Despite these flaws, very few changes have been made to the Elo system over the past
47 years. As previously mentioned, because of advances in computer assistance,
USCF now uses a dynamic K-factor which is higher for lower-rated and less-experienced
players. The use of computers also allows every game to be rated individually
and independent of the tournaments in which they are played. This is transparent
to USCF members, since official ratings are still posted quarterly, but the continually
tabulated ratings (which no longer need to be rounded to the nearest Elo point)
are statistically more accurate. The final change made by the
USCF – also made possible by the increased accessibility of computers – is the
transition from a normal distribution to a logistic distribution. By observing
a large number of results, the USCF determined that a logistic distribution
most accurately extrapolated outcomes. FIDE still uses the normal distribution
that Elo had originally put in place.
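The two curves can be compared directly. The logistic form below is the standard 1/(1 + 10^(−D/400)) expected-score curve; the normal form is Elo's original with σ' = 282.84. The specific rating differences chosen are illustrative:

```python
import math

def expected_normal(diff):
    """Elo's original normal-cdf expectation, sigma' = 200*sqrt(2) ≈ 282.84."""
    return 0.5 * (1 + math.erf(diff / 400))  # sigma' * sqrt(2) is exactly 400

def expected_logistic(diff):
    """Logistic expectation: 1 / (1 + 10^(-diff/400))."""
    return 1 / (1 + 10 ** (-diff / 400))

for d in (0, 100, 200, 400):
    print(d, round(expected_normal(d), 3), round(expected_logistic(d), 3))
```

The two curves agree closely near zero and diverge in the tails, where the logistic gives the weaker player a somewhat better expected score.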
Regardless, the general principles of the Elo system have withstood the test
of time, which is all the more impressive when one considers that Dr. Elo lacked
the processing power of modern computers. To me, it’s disappointing that so
many people can use something without appreciating it on even a base level.
So next time a chess player asks you what “Elo” stands for, at the very least
let them know that it was actually someone’s last name.
Appendix A: The Harkness System
Appendix B: USCF Classes and Distribution of Classes
Amongst Active Members(3)
- Elo, Arpad. The Rating of Chess Players, Past and Present. New York: Arco, 1978.
- Harkness, Kenneth. The Official Blue Book and Encyclopedia of Chess. New
York: McKay, 1956.
- Crowther, Mark. "FIDE Ratings Crisis." This Week in Chess. 24
- Crowther, Mark. "Claude F. Bloodgood III July 14th 1924 - August 4th
2001." This Week in Chess. 20 August, 2001.
- Sonas, Jeff.
"The Sonas Rating Formula – Better Than Elo?" Chessbase.
22 October, 2002.
- Marin, Mihail.
"Anand's Triumph in the Final Round." Chessbase. 30 September, 2007.
- "Elo Rating System." Wikipedia, The Free Encyclopedia. 4 October, 2007,
15:44 UTC. Wikimedia Foundation, Inc. 9 October 2007.
1. A μ of zero indicates that the difference between an
individual's current rating and his expected performance is most
probable at zero. It may seem more intuitive to use an individual's current
rating as μ; this will simply offset the distribution by μ points. When
the Elo system was introduced, all ratings were calculated by hand. Having a
single model to describe all players allowed those calculating ratings the convenience
of looking values up from a table or graph, thereby saving time.
2. This is derived from the definition of probable
error: the multiple of the standard deviation below which 75% of the cumulative
distribution falls. For example, in order to get a Pe of
75% with a σ of 282.84, a Dc of 0.67449 * 282.84 = 190.77 would be required.
3. Data available to USCF members at https://secure.uschess.org/MembersOnly/download.php
Dan Ross is 28 years old and lives in Johnstown, Pennsylvania,
USA. He is a full-time software engineer and part-time graduate student
(Indiana University of Pennsylvania, Applied Mathematics). Dan enjoys playing
ice hockey (just not particularly well) and chess. "Chess, for me,
is a hobby," he writes, "not a job (partially by choice, but mostly
because I'm just not that good – 1605-P21 USCF). I prefer to enjoy
my games, occasionally even turning down a better move for a more interesting
one. As White I tend to play the King's Gambit or an Italian Game; as Black
I've been experimenting with the Pirc or the Dutch (with mixed results).
My best result thus far in a USCF tournament was the 49th Gateway Open in
Pittsburgh, Pennsylvania (3.5/4.0 in the U1600 section – second place)."