The Deloitte/FIDE Chess Rating Challenge
By Jeff Sonas
Deloitte Australia sponsored the $10,000 first prize. Deloitte is a preeminent
provider of analytics globally and helps companies capture, manage and analyze
their data as part of their overall business strategy.
The contest itself was hosted on the Kaggle website. Contest participants could
download a very large historical dataset of chess games (prepared by me
from FIDE's archives). This dataset included all known game-by-game outcomes
among 54,000 different players in FIDE-rated events, across an eleven-year period
from January 1999 through December 2009. Player names and FIDE ratings were
not included in the game data; only a unique ID for each player was provided.
In order to do well in the contest, you were expected to develop your own approach
to calculating chess ratings and making predictions, or you could base your
system on publicly-known methods. The eleven years of historical game data, covering about two million games, would allow your system to “learn” about chess ratings and chess outcomes, so that you could determine which approaches work best at using past results to make future predictions. Analyzing all these games let you “train” your system with optimal parameters, as well as calculate ratings (and any other measures) for all 54,000 players going into January 2010.
Once you were ready to make predictions, an additional 88,000 game matchups were provided (corresponding to actual games played among the 54,000 chess players during the first three months of 2010), and you had to predict the outcome of each one. Of course you were not told the actual outcomes of any of those games: only who had White and who had Black. Once you submitted your predictions for all 88,000 games on the Kaggle website, they would be immediately scored for accuracy, and the updated scores shown on the
contest leaderboard. Each team could submit two sets of predictions each day,
and the single submission that contained the most accurate predictions (according
to a "log-likelihood" scoring algorithm) would win first place.
There were two prize categories for the contest: the “main prize” category
and the restricted “FIDE prize” category. The FIDE prize category was for the
most promising enhancement to, or alternative to, the Elo system, and the finalists
are currently being evaluated by a team of FIDE representatives (more on this
in another article). In the main prize category, there were no restrictions
on the methodology used for predictions, although you were forbidden from determining
the actual identities of players and then looking up their known actual results
from 2010. The winner would take home the $10,000 first prize sponsored by
Deloitte, and the 2nd/3rd/4th place finishers won chess software autographed
by top players and donated by ChessBase.
As expected, there was a wide variety of methodologies used. It wouldn’t be
nearly good enough to just use the Elo system, or other well-known approaches
such as Mark Glickman's Glicko algorithm or my own Chessmetrics formula. You
would need an artificial-intelligence (“machine learning”) algorithm that could search through an N-dimensional parameter space, using “stochastic gradient descent” to hunt for the set of ratings that minimized your prediction error. Or you needed to use what is called an “ensemble” approach, computing dozens of simplistic rating calculations and other indicators, and allowing the computer to search for the most effective way to combine those simple indicators into a powerful predictive system. Or something else equally sophisticated! The participants
had a head start in this contest, thanks to a similar contest that was held
on a smaller scale last year, also hosted by Kaggle. At the end of that contest,
the top ten finishers provided documentation of their approaches, and those
writeups were publicly available on the Kaggle website.
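As a rough illustration of the gradient-descent idea, here is a minimal sketch that fits one rating per player by nudging ratings after every game in the direction that would have reduced that game's prediction error. It is my own toy version, not any contestant's actual method: the starting rating, learning rate, white-advantage bonus, and logistic expectancy curve are all illustrative assumptions.

```python
import math
import random

def expected_score(r_white, r_black, white_adv=35.0):
    """Elo-style logistic expectancy for White, with a small first-move bonus."""
    return 1.0 / (1.0 + 10.0 ** (-(r_white + white_adv - r_black) / 400.0))

def fit_ratings(games, epochs=20, learning_rate=8.0):
    """games: list of (white_id, black_id, white_score), score in {0, 0.5, 1}.

    Stochastic gradient descent in spirit: each game pulls the two players'
    ratings toward values that would have predicted its result better.
    """
    ratings = {}
    for _ in range(epochs):
        random.shuffle(games)
        for white, black, score in games:
            r_w = ratings.setdefault(white, 2200.0)
            r_b = ratings.setdefault(black, 2200.0)
            residual = score - expected_score(r_w, r_b)  # positive if White over-performed
            ratings[white] = r_w + learning_rate * residual
            ratings[black] = r_b - learning_rate * residual
    return ratings

# Tiny usage example with made-up player IDs and results.
sample_games = [(1, 2, 1.0), (2, 3, 0.5), (3, 1, 0.0), (1, 3, 1.0)]
print(fit_ratings(sample_games))
```

A real entry was of course far more elaborate, but this is the basic search-for-ratings-that-minimize-error loop described above.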

Contest winner Tim Salimans, Ph.D. student in Econometrics at Erasmus University
Rotterdam
The winner of this most recent contest, Tim Salimans, has no connection to
chess, although his university research focuses heavily on predictions based
on data. His basic approach was to infer the skills of all players by means
of factored approximate Bayesian inference, using expectation propagation. This
method was inspired strongly by the approaches used by the top two finishers
(Yannis Sismanis and Jeremy Howard) from the previous contest, as well as the
TrueSkill model developed by Microsoft and used in their Xbox gaming system
to track ratings in a variety of video games.
Second place was taken by a reclusive data prediction professional known only
as Sami, who "occasionally plays three-minute patzer level blitz games
in FICS”. Sami used an ensemble method, but also estimated the rating difference
separately for each pair of players, rather than trying to calculate each player’s
rating independently. Sami led the contest a couple of times but ultimately
finished in second place. According to Sami:
My model wasn’t a rating system because a rating is a weak predictor. To
my knowledge, Shirov never beat Kasparov, even though their ratings were close
enough, hence it will not work to try predicting a score for a Shirov vs.
Kasparov or Shirov vs. Kramnik game from the same ratings.
Third place was won by Andy Cotter, a graduate student at TTI-Chicago studying
machine learning. For his predictions, Andy developed a blended skill measure
combining a logistic-regression based approach with a modified version of Mark
Glickman's Glicko rating system. Andy had a meteoric rise at the end of the
contest, showing up in the top twenty for the first time only in the final week
of the 12-week contest, and eventually finishing in third place. Although “not
a good chess player”, Andy was drawn to the contest partially due to his fond
memories of playing speed chess with friends in high school.
Jason Tigg and David Clague, who both have Ph.D.’s in particle physics and
live in the UK, formed a team that led for a week near the end but ultimately
finished in fourth place. Jason is a “keen but not particularly accomplished
chess player”, and David has “minimal chess knowledge”. You can see a growing
trend about what type of person does well in these contests! Their model was
based on a hyperbolic tangent expectancy function, trained using stochastic gradient descent. Vladimir Nikulin from Australia (who focuses primarily on data mining
and “can play chess at some basic level”) finished fifth, and last year's winner,
Yannis Sismanis, finished in sixth place.
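For readers curious what a hyperbolic tangent expectancy function looks like, here is a small sketch; the exact form and the 600-point scale are my own assumptions for illustration, not the team's published model. Like the logistic curve used by Elo, it maps a rating difference to an expected score between 0 and 1.

```python
import math

def tanh_expectancy(rating_diff, scale=600.0):
    """Map a rating difference (White minus Black) to an expected score in (0, 1)."""
    return 0.5 * (1.0 + math.tanh(rating_diff / scale))

# Under this illustrative curve, a 200-point favourite is expected to score about 0.66.
print(round(tanh_expectancy(200.0), 2))
```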
This type of contest is based upon the fundamental premise that "predictive
power" is useful in a chess rating system. This is a concept that many
people have a big problem with, insisting that ratings are only supposed to
describe what has already happened, and are not intended to predict anything
about the future. Regardless of the "intent" of ratings, I think
it is very useful to know how closely a chess rating is correlated with a player's
next results, and to strive for a rating formula that correlates with next results
as closely as possible. Certainly it is not the only factor (simplicity and
fairness being two other important aspects of an ideal rating system) but there
is no reason we shouldn't try to maximize the predictive accuracy of our rating system as long as we are not sacrificing other factors. For some applications of chess ratings, the complexity of the calculation doesn't matter at all, and so a formula that is as accurate as possible would be ideal.
You might wonder: at what level are we measuring "predictive power"?
We could measure it at the tournament level, looking for which approach most
closely predicts the order of finish in each tournament. Or we could measure
it at the player/month level, looking for which approach most closely predicts
each player's total score out of the games they played in a month. Finally,
we could measure it on a game-by-game level, looking for which approach most
closely predicts the outcome in each individual game. In fact, the scoring
for this contest was applied on a game-by-game level, whereas the previous contest
had scored at the player/month level. Any of these approaches will work well,
but the game-by-game approach is most useful in allowing contest participants
to truly optimize their models through sound statistical techniques such as
cross-validation.
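To make the cross-validation idea concrete, here is a self-contained toy sketch of game-level k-fold cross-validation. It is my own illustration, not any contestant's pipeline, and the "model" being validated is deliberately trivial (it just predicts the average White score seen in training); the point is the pattern of fitting on some games and scoring only on games the model never saw.

```python
import math

def game_error(p, y, eps=0.01):
    """Base-10 binomial deviance for one game, as sketched earlier."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log10(p) + (1.0 - y) * math.log10(1.0 - p))

def cross_validated_error(white_scores, folds=5):
    """Estimate out-of-sample error of a trivial model by k-fold cross-validation."""
    errors = []
    for k in range(folds):
        train = [y for i, y in enumerate(white_scores) if i % folds != k]
        test = [y for i, y in enumerate(white_scores) if i % folds == k]
        prediction = sum(train) / len(train)      # fitted on the training games only
        errors.extend(game_error(prediction, y) for y in test)
    return sum(errors) / len(errors)

# Made-up results in which White scores roughly 55% overall.
toy_results = [1, 0.5, 0.5, 1, 0, 1, 0.5, 0, 1, 0.5] * 20
print(round(cross_validated_error(toy_results), 3))
```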
During the contest (held from February 2011 through May 2011), participants
were only asked to submit predictions for games played in the recent past, from
January 2010 through March 2010. Ideally the identities of the chessplayers,
and the outcomes of those 88,000 games from last year, would have remained a
secret throughout the contest. We are not aware of anyone "cracking the
code" and figuring out the actual identities of all 54,000 players. Nevertheless,
some of the participants did figure out a clever (and technically legal) trick
based on particular knowledge about chess. They realized that even if you don't
know the outcomes of those games from 2010, you can look at the strength of
opposition faced by a player, and if the player faced unusually strong opposition,
it is likely that they were in a Swiss event and that they did well in their
games (otherwise they would have faced players closer to their own strength). If
they faced unusually weak opposition, it is likely that they did poorly in their
Swiss event. And so the predictions of those games could be adjusted accordingly,
with a big boost to your score in the contest! Although the winners of the contest were shown to have superior predictive systems even after this "future scheduling" trick was removed, it was a big flaw in my contest design, and is certainly a loophole that will have to be firmly closed if such a contest is ever held again!
Since the contestants had many weeks in which to optimize their predictions
for the games from January 2010 to March 2010, we also held a brief follow-up
stage where the winners needed to use their same formulas to predict a new set
of games, from April 2010 through June 2010. This would help to assess the
robustness of the winning methods, and the “future scheduling” loophole was
also eliminated in this follow-up stage. The top two finishers in this stage were Tim Salimans and the team of Jason Tigg and David Clague, and a simple
“winners ensemble”, averaging the predictions of those two teams together, was
most accurate of all.
To give an idea of why I can claim that these approaches are a lot more accurate
than the Elo system, allow me to briefly illustrate a couple of ways to measure
predictive accuracy. We will take a small, elite tournament as an example (the
Capablanca Memorial from June 2010), and compare the predictions made from players’ Elo ratings with the predictions made by the “winners ensemble” of Tim Salimans and the team of Jason Tigg and David Clague. First of all, we could score the
predictions from each individual game.
Rnd | Result (White listed first) | White score | Elo pred. | Elo error | Ensemble pred. | Ensemble error
 1  | Nepomniachtchi 1/2 Alekseev | 0.5         | 54%       | 0.302     | 54%            | 0.302
 1  | Dominguez 1/2 Bruzon        | 0.5         | 59%       | 0.309     | 62%            | 0.314
 1  | Short 0-1 Ivanchuk          | 0.0         | 48%       | 0.281     | 46%            | 0.270
 …  | …                           | …           | …         | …         | …              | …
 10 | Ivanchuk 1/2 Nepomniachtchi | 0.5         | 59%       | 0.309     | 60%            | 0.311
 10 | Dominguez 1-0 Alekseev      | 1.0         | 56%       | 0.254     | 55%            | 0.263
 10 | Short 1-0 Bruzon            | 1.0         | 56%       | 0.251     | 62%            | 0.211
    | (all 30 games)              |             |           | Avg=0.296 |                | Avg=0.293
By this scoring measure (using the same Binomial Deviance error function as
was used in the actual contest), we see that the Winners’ Ensemble has an average
error of 0.293, and the Elo System has an average error of 0.296 (slightly less
accurate). Of course, this is a tiny sample of only 30 games, but when we look
across all 112,837 games from April 2010 through June 2010, we find that the
Winners’ Ensemble had an average error of 0.248 and the Elo system had an average
error of 0.257. This is quite a significant difference.
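If you would like to check these figures, the snippet below recomputes two of the per-game errors from the table using the same base-10 binomial deviance sketched earlier. (The exact formula remains my assumption; the table shows forecasts rounded to whole percentage points, which is why some recomputed values differ by a few thousandths.)

```python
import math

def game_error(p, y):
    """Base-10 binomial deviance for one game."""
    return -(y * math.log10(p) + (1.0 - y) * math.log10(1.0 - p))

# Round 1, Nepomniachtchi 1/2 Alekseev: 54% forecast for White, game drawn.
print(round(game_error(0.54, 0.5), 3))   # 0.302, matching the table

# Round 10, Dominguez 1-0 Alekseev: 56% forecast for White, White won.
print(round(game_error(0.56, 1.0), 3))   # 0.252, vs. 0.254 in the table (the 56% is rounded)
```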
Another way to compare the predictions would be to look at the final scores
of each player in a tournament, compared to their predicted final scores. To
continue the small example of the Capablanca Memorial again, we can see what
the Elo prediction for each player’s final score would be (rounded to the nearest
half-point), compared to the prediction from the Winners’ Ensemble (again rounded
to the nearest half-point). By adding up the absolute errors from each prediction,
we get an overall measure of which prediction was best:
Player          | Total (/10) | Elo pred. | Elo error | Ensemble pred. | Ensemble error
Ivanchuk        | 7.0         | 5.5       | 1.5       | 6.0            | 1.0
Nepomniachtchi  | 6.0         | 5.0       | 1.0       | 5.0            | 1.0
Dominguez       | 5.5         | 5.0       | 0.5       | 5.0            | 0.5
Short           | 5.5         | 5.0       | 0.5       | 5.0            | 0.5
Alekseev        | 3.0         | 5.0       | 2.0       | 5.0            | 2.0
Bruzon Batista  | 3.0         | 4.5       | 1.5       | 4.0            | 1.0
                |             |           | Total=7.0 |                | Total=6.0
By this measure, we see that the Elo predictions had a total error of 7.0 points,
compared to the Winners’ Ensemble with its total error of 6.0, and so for this
tournament at least, the Elo predictions were less accurate. Again, this is
just one tournament and way too tiny a sample for drawing any conclusions at
all, but when you look at a lot of tournaments, the trend is clear. For instance,
there were 221 tournaments (between April 2010 and June 2010) having an average
participant Elo rating of 2200+ and at least 30 games played. If we perform
the same comparison again, for each tournament separately, and then add up the
results, we find that the Winners’ Ensemble method is much more likely than the Elo method to make the better prediction of everyone’s final score.
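As a rough sketch of how such expected final scores and absolute errors can be computed from game-level forecasts, here is a minimal illustration of my own; the half-point rounding follows the Capablanca Memorial table above, while the per-game probabilities in the usage example are invented.

```python
def expected_final_score(per_game_expectations):
    """Sum of a player's per-game expected scores, rounded to the nearest half point."""
    return round(sum(per_game_expectations) * 2) / 2

def total_absolute_error(actual_scores, predicted_scores):
    """Tournament-level measure: sum of |actual - predicted| over all participants."""
    return sum(abs(a - p) for a, p in zip(actual_scores, predicted_scores))

# Invented per-game expectations for one player across a ten-round event.
one_player = [0.54, 0.62, 0.48, 0.55, 0.51, 0.60, 0.58, 0.47, 0.52, 0.56]
print(expected_final_score(one_player))           # 5.5

# Whole-event comparison using the Elo column of the Capablanca Memorial table.
actual    = [7.0, 6.0, 5.5, 5.5, 3.0, 3.0]
predicted = [5.5, 5.0, 5.0, 5.0, 5.0, 4.5]
print(total_absolute_error(actual, predicted))    # 7.0
```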

All of these predictions were calculated during 2011, but were for games already
played (from the first half of 2010). And of course, the true test of a prediction
system is whether it can predict the future (not the past) successfully. So,
just for fun, after the contest the top finishers were also asked to calculate
ratings up to the present (they were given additional data from FIDE in order
to do this), and to use their system to submit predictions (still without knowing
identities of the players) that would allow me to calculate winning odds in
the Kazan FIDE Candidates Matches. The contest winner (Tim Salimans) as well
as two others from the top five (Vladimir Nikulin and the team of Jason Tigg
and David Clague) agreed to participate in this little experiment.
As above, any conclusions about the overall "accuracy" of a rating
method, based only upon the outcome of one small tournament, should carry very
little weight. Nevertheless, it is interesting to note that all three systems
drew identical conclusions about which players in the top 20 seemed to have
Elo ratings that were too low. Alexander Grischuk (who recently made it to the
finals of the Candidates' event) was identified as the most underrated of the
eight Candidates, by far. All three systems felt that Grischuk (ranked #12 on
the May 1 list) was underrated by about 20 Elo points, and ought to be #8 or
#9 on the world rating list instead.
Of course, there is a flip side – all three systems felt that Boris Gelfand
and Hikaru Nakamura were two of the three players in the top twenty who were
most overrated by the Elo system. And in fact, both players recently won short
matches where the three systems would have considered them to be very slight
underdogs (Gelfand in his match against Grischuk, and Nakamura in his match
against Ruslan Ponomariov). Further, Gelfand also had to first defeat two other
top players just to reach the Candidates' finals match. Elo-based predictions
would have given pre-tournament odds of 28-to-1 against Gelfand winning the event, whereas with the other three systems (on average) the pre-tournament odds would have been 35-to-1!
Such experiments are interesting, but the real way to measure the accuracy
of a rating system is to look at very large numbers of players and games, such
as was done in these two Kaggle contests. And it is clear that whether you are
predicting the final standings in each tournament, or evaluating each game independently,
the top finishers have developed rating systems that are far more accurate at
predicting future outcomes than the Elo system (which finished in 80th place
out of 189 in the latest contest). I consider the winners’ systems to be about
15% more accurate than the Elo system in their predictions. Another big question
– what are the most promising practical improvements to the FIDE rating system
– is a separate issue, pertaining to the “FIDE prize” category of the contest. This
prize category was specifically targeted at practical improvements to the existing
FIDE rating system. There is more to say on this matter, but it is a topic for
another time!
Finally, I would very much like to thank the sponsors (Deloitte and ChessBase)
of the main contest. It is quite encouraging that the winner of the main prize
in this contest had a base model inspired by the winners of the previous contest.
It suggests that these contests are useful in advancing the state of the art
for chess predictions, and perhaps we are approaching the "optimal"
system!
Copyright Jeff Sonas / ChessBase