Sonas: The Deloitte/FIDE Chess Rating Challenge

6/8/2011 – The Deloitte/FIDE Chess Rating Challenge, a worldwide contest to determine the most accurate rating algorithm for predicting chess game outcomes, has recently concluded. The winner of the $10,000 first prize was Tim Salimans, a Ph.D. student in the Netherlands, whose predictions were the most accurate of all 189 teams that participated.

By Jeff Sonas

Deloitte Australia sponsored the $10,000 first prize.  Deloitte is a preeminent provider of analytics globally and helps companies capture, manage and analyze their data as part of their overall business strategy.

The contest itself was hosted on the Kaggle website. Contest participants could download a very large historical dataset of chess game data (prepared by me from FIDE's archives). This dataset included all known game-by-game outcomes among 54,000 different players in FIDE-rated events, across an eleven-year period from January 1999 through December 2009. Player names and FIDE ratings were not included in the game data; only a unique ID for each player was provided.

In order to do well in the contest, you were expected to develop your own approach to calculating chess ratings and making predictions, or you could base your system on publicly-known methods. The eleven years of historical game data, covering about two million games, would allow your system to “learn” about chess ratings and chess outcomes, so that you could determine which approaches worked best for using past results to make future predictions. Analyzing all these games allowed you to “train” your system with optimal parameters, as well as to calculate ratings (and any other measures) for all 54,000 players going into January 2010.

Once you were ready to make predictions, you were given an additional 88,000 game matchups (corresponding to actual games played among the 54,000 chess players during the first three months of 2010) and had to predict the outcome of each. Of course you were not told the actual outcomes of any of those games: only who had White and who had Black. Once you submitted your predictions for all 88,000 games through the Kaggle website, the accuracy of your predictions was immediately scored by the website, and the updated scores were shown on the contest leaderboard. Each team could submit two sets of predictions each day, and the single submission that contained the most accurate predictions (according to a "log-likelihood" scoring algorithm) would win first place.

There were two prize categories for the contest: the “main prize” category and the restricted “FIDE prize” category. The FIDE prize category was for the most promising enhancement to, or alternative to, the Elo system, and the finalists are currently being evaluated by a team of FIDE representatives (more on this in another article). In the main prize category, there were no restrictions on the methodology used for predictions, although you were forbidden from determining the actual identities of players and then looking up their known actual results from 2010.  The winner would take home the $10,000 first prize sponsored by Deloitte, and the 2nd/3rd/4th place finishers won chess software autographed by top players and donated by ChessBase.

As expected, there was a wide variety of methodologies used. It wouldn’t be nearly good enough to just use the Elo system, or other well-known approaches such as Mark Glickman's Glicko algorithm or my own Chessmetrics formula. You would need an artificial intelligence "machine learning" algorithm that could search through N-dimensional space, using “stochastic gradient descent” to hunt for the set of ratings that minimized your error.  Or you needed to use what is called an “ensemble” approach, combining dozens of simplistic rating calculations and other indicators, and allowing the computer to search for the most effective way to combine the simple indicators into a complex and effective predictive system.  Or something else equally sophisticated!  The participants had a head start in this contest, thanks to a similar contest that was held on a smaller scale last year, also hosted by Kaggle. At the end of that contest, the top ten finishers provided documentation of their approaches, and those writeups were publicly available on the Kaggle website.
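To make the "stochastic gradient descent" idea concrete, here is a minimal sketch in Python. It is not any participant's actual model: the logistic expectancy, the learning rate, the number of passes and the toy games are all assumptions chosen only for illustration. Each game nudges the two players' ratings in the direction that reduces the prediction error for that game.

```python
import math
import random

def expected_score(rating_white, rating_black):
    """White's expected score under an assumed logistic expectancy curve."""
    return 1.0 / (1.0 + math.exp(-(rating_white - rating_black)))

def fit_ratings(games, n_players, epochs=50, learning_rate=0.1, seed=0):
    """Fit one rating per player by stochastic gradient descent.

    games: list of (white_id, black_id, white_score), score in {1.0, 0.5, 0.0}.
    """
    rng = random.Random(seed)
    ratings = [0.0] * n_players
    order = list(games)
    for _ in range(epochs):
        rng.shuffle(order)  # visit the games in random order on every pass
        for white, black, score in order:
            predicted = expected_score(ratings[white], ratings[black])
            # For a log-loss error, the gradient with respect to the rating
            # difference is (predicted - score), so we step the other way.
            step = learning_rate * (score - predicted)
            ratings[white] += step
            ratings[black] -= step
    return ratings

# Hypothetical toy data: player 0 beats everyone, player 1 outscores player 2.
toy_games = [(0, 1, 1.0), (1, 0, 0.0), (1, 2, 1.0), (2, 1, 0.5), (0, 2, 1.0)]
ratings = fit_ratings(toy_games, n_players=3)
print([round(r, 2) for r in ratings])
```

A real entry would need far more than this (regularization, time weighting, a White advantage term, and so on), but the core loop of "predict, compare, adjust" is the same.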


Contest winner Tim Salimans, Ph.D. student in Econometrics at Erasmus University Rotterdam

The winner of this most recent contest, Tim Salimans, has no connection to chess, although his university research focuses heavily on predictions based on data. His basic approach was to infer the skills of all players by means of factored approximate Bayesian inference, using expectation propagation. This method was inspired strongly by the approaches used by the top two finishers (Yannis Sismanis and Jeremy Howard) from the previous contest, as well as the TrueSkill model developed by Microsoft and used in their Xbox gaming system to track ratings in a variety of video games.

Second place was taken by a reclusive data prediction professional known only as Sami, who “occasionally plays three-minute patzer level blitz games in FICS”. Sami used an ensemble method, but also estimated the rating difference separately for each pair of players, rather than trying to calculate each player’s rating independently. Sami led the contest a couple of times but ultimately finished in second place. According to Sami:

My model wasn’t a rating system because a rating is a weak predictor. To my knowledge, Shirov never beat Kasparov, even though their ratings were close enough, hence it will not work to try predicting a score for a Shirov vs. Kasparov or Shirov vs. Kramnik game from the same ratings.

Third place was won by Andy Cotter, a graduate student at TTI-Chicago studying machine learning.  For his predictions, Andy developed a blended skill measure combining a logistic-regression based approach with a modified version of Mark Glickman's Glicko rating system.  Andy had a meteoric rise at the end of the contest, showing up in the top twenty for the first time only in the final week of the 12-week contest, and eventually finishing in third place.  Although “not a good chess player”, Andy was drawn to the contest partially due to his fond memories of playing speed chess with friends in high school.

Jason Tigg and David Clague, who both have Ph.D.’s in particle physics and live in the UK, formed a team that led for a week near the end but ultimately finished in fourth place. Jason is a “keen but not particularly accomplished chess player”, and David has “minimal chess knowledge”. You can see a growing trend about what type of person does well in these contests! Their model was based on a hyperbolic tangent expectancy function, trained using stochastic descent. Vladimir Nikulin from Australia (who focuses primarily on data mining and “can play chess at some basic level”) finished fifth, and last year's winner, Yannis Sismanis, finished in sixth place.

This type of contest is based upon the fundamental premise that "predictive power" is useful in a chess rating system.  This is a concept that many people have a big problem with, insisting that ratings are only supposed to describe what has already happened, and are not intended to predict anything about the future.  Regardless of the "intent" of ratings, I think it is very useful to know how closely a chess rating is correlated with a player's next results, and to strive for a rating formula that correlates with next results as closely as possible.  Certainly it is not the only factor (simplicity and fairness being two other important aspects of an ideal rating system) but there is no reason we shouldn't try to maximize the predictive accuracy of our rating system as long as we are not sacrificing other factors. For some applications of chess ratings, the complexity of the calculation doesn't matter at all, and so a formula that is as accurate as possible would be ideal.

You might wonder: at what level are we measuring "predictive power"?  We could measure it at the tournament level, looking for which approach most closely predicts the order of finish in each tournament.  Or we could measure it at the player/month level, looking for which approach most closely predicts each player's total score out of the games they played in a month.  Finally, we could measure it on a game-by-game level, looking for which approach most closely predicts the outcome in each individual game.  In fact, the scoring for this contest was applied on a game-by-game level, whereas the previous contest had scored at the player/month level.  Any of these approaches will work well, but the game-by-game approach is most useful in allowing contest participants to truly optimize their models through sound statistical techniques such as cross-validation.

During the contest (held from February 2011 through May 2011), participants were only asked to submit predictions for games played in the recent past, from January 2010 through March 2010.  Ideally the identities of the chessplayers, and the outcomes of those 88,000 games from last year, would have remained a secret throughout the contest. We are not aware of anyone "cracking the code" and figuring out the actual identities of all 54,000 players.  Nevertheless, some of the participants did figure out a clever (and technically legal) trick based on particular knowledge about chess.  They realized that even if you don't know the outcomes of those games from 2010, you can look at the strength of opposition faced by a player, and if the player faced unusually strong opposition, it is likely that they were in a Swiss event and that they did well in their games (otherwise they would have faced players closer to their own strength). If they faced unusually weak opposition, it is likely that they did poorly in their Swiss event. And so the predictions of those games could be adjusted accordingly, with a big boost to your score in the contest!  Although the winners of the contest were shown to have the superior predictive systems even when removing this "future scheduling" trick from their system, it was a big flaw in my contest design, and is certainly a loophole that will have to be firmly closed if such a contest is ever held again!
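The "future scheduling" trick can be sketched in a few lines of Python. The article does not describe how contestants actually implemented it, so the linear adjustment, the `weight` value and the clamping bounds below are purely illustrative assumptions. The only idea taken from the text is the signal itself: how much stronger or weaker a player's upcoming opponents are than the player.

```python
def schedule_signal(own_rating, opponent_ratings):
    """Average rating of the opponents in the games to be predicted,
    minus the player's own rating. Positive means a tough schedule."""
    return sum(opponent_ratings) / len(opponent_ratings) - own_rating

def adjusted_prediction(base_prediction, signal, weight=0.0002):
    """Nudge the expected score up after a tough schedule, down after a soft one.

    The linear form and the weight are hypothetical, not taken from any entry.
    """
    adjusted = base_prediction + weight * signal
    return min(max(adjusted, 0.01), 0.99)

# Hypothetical player rated 2400 who was paired against much stronger opposition,
# which in a Swiss event hints that the player was scoring well.
signal = schedule_signal(2400, [2500, 2550, 2600])
print(signal, adjusted_prediction(0.40, signal))
```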

Since the contestants had many weeks in which to optimize their predictions for the games from January 2010 to March 2010, we also held a brief follow-up stage where the winners needed to use their same formulas to predict a new set of games, from April 2010 through June 2010.  This would help to assess the robustness of the winning methods, and the “future scheduling” loophole was also eliminated in this follow-up stage.  The top two finishers in the future stage were Tim Salimans and the team of Jason Tigg and David Clague, and a simple “winners ensemble”, averaging the predictions of those two teams together, was most accurate of all.
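The "winners ensemble" described here is nothing more than a game-by-game average of two sets of predictions. A minimal Python sketch, using made-up prediction values (the real predictions are not published in this article):

```python
def average_ensemble(*prediction_lists):
    """Average several systems' predictions of White's expected score,
    game by game. All lists must cover the same games in the same order."""
    n_systems = len(prediction_lists)
    return [sum(game) / n_systems for game in zip(*prediction_lists)]

# Hypothetical predictions from two systems for the same three games.
system_a = [0.60, 0.45, 0.70]
system_b = [0.50, 0.55, 0.80]
ensemble = average_ensemble(system_a, system_b)
print(ensemble)
```

One reason simple averaging is a safe choice: a logarithmic error measure is convex in the prediction, so the averaged prediction can never score worse than the average of the individual systems' scores, although it is not guaranteed to beat the better of the two.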

To give an idea of why I can claim that these approaches are a lot more accurate than the Elo system, allow me to briefly illustrate a couple of ways to measure predictive accuracy. We will take a small, elite tournament as an example (the Capablanca Memorial from June 2010), and compare the predictions made from players’ Elo ratings, versus the predictions made by the “winners ensemble” of Tim Salimans and the team of Jason Tigg and David Clague. First of all, we could score the predictions from each individual game. 

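| Rnd | Result (White listed first) | W score | Elo | Elo error | Ensemble | Ensemble error |
|-----|-----------------------------|---------|-----|-----------|----------|----------------|
| 1 | Nepomniachtchi 1/2 Alekseev | 0.5 | 54% | 0.302 | 54% | 0.302 |
| 1 | Dominguez 1/2 Bruzon | 0.5 | 59% | 0.309 | 62% | 0.314 |
| 1 | Short 0-1 Ivanchuk | 0.0 | 48% | 0.281 | 46% | 0.270 |
| ... | ... | ... | ... | ... | ... | ... |
| 10 | Ivanchuk 1/2 Nepomniachtchi | 0.5 | 59% | 0.309 | 60% | 0.311 |
| 10 | Dominguez 1-0 Alekseev | 1.0 | 56% | 0.254 | 55% | 0.263 |
| 10 | Short 1-0 Bruzon | 1.0 | 56% | 0.251 | 62% | 0.211 |
| | (all 30 games) | | | Avg=0.296 | | Avg=0.293 |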

By this scoring measure (using the same Binomial Deviance error function as was used in the actual contest), we see that the Winners’ Ensemble has an average error of 0.293, and the Elo System has an average error of 0.296 (slightly less accurate). Of course, this is a tiny sample of only 30 games, but when we look across all 112,837 games from April 2010 through June 2010, we find that the Winners’ Ensemble had an average error of 0.248 and the Elo system had an average error of 0.257. This is quite a significant difference.
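For readers who want to reproduce the per-game errors in the table above, here is a small Python sketch of a Binomial Deviance calculation. The base-10 logarithm is inferred from the table values rather than stated in the article, and the clamping of predictions to the range 0.01 to 0.99 is an assumption added to keep the logarithm finite. Because the percentages in the table are rounded, recomputed values can differ from the printed ones in the third decimal place.

```python
import math

def binomial_deviance(white_score, predicted, eps=0.01):
    """Per-game error: -[y*log10(p) + (1-y)*log10(1-p)].

    white_score (y) is 1.0, 0.5 or 0.0; predicted (p) is White's expected score.
    The clamp to [eps, 1-eps] is an assumption, not taken from the article.
    """
    p = min(max(predicted, eps), 1.0 - eps)
    return -(white_score * math.log10(p)
             + (1.0 - white_score) * math.log10(1.0 - p))

def mean_deviance(games):
    """Average error over a list of (white_score, predicted) pairs."""
    return sum(binomial_deviance(y, p) for y, p in games) / len(games)

# Round 1 of the Capablanca Memorial with the Elo predictions from the table.
round_one_elo = [(0.5, 0.54), (0.5, 0.59), (0.0, 0.48)]

print(round(binomial_deviance(0.5, 0.54), 3))  # 0.302, as in the first table row
print(round(mean_deviance(round_one_elo), 3))
```

Lower is better: a prediction of exactly 50% always scores about 0.301 whatever the result, so a system only beats that baseline by being both confident and right more often than not.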

Another way to compare the predictions would be to look at the final scores of each player in a tournament, compared to their predicted final scores. To continue the small example of the Capablanca Memorial again, we can see what the Elo prediction for each player’s final score would be (rounded to the nearest half-point), compared to the prediction from the Winners’ Ensemble (again rounded to the nearest half-point).  By adding up the absolute errors from each prediction, we get an overall measure of which prediction was best:

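| Player | Total/10 | Elo | Elo error | Ensemble | Ensemble error |
|--------|----------|-----|-----------|----------|----------------|
| Ivanchuk | 7.0 | 5.5 | 1.5 | 6.0 | 1.0 |
| Nepomniachtchi | 6.0 | 5.0 | 1.0 | 5.0 | 1.0 |
| Dominguez | 5.5 | 5.0 | 0.5 | 5.0 | 0.5 |
| Short | 5.5 | 5.0 | 0.5 | 5.0 | 0.5 |
| Alekseev | 3.0 | 5.0 | 2.0 | 5.0 | 2.0 |
| Bruzon Batista | 3.0 | 4.5 | 1.5 | 4.0 | 1.0 |
| | | | Total=7.0 | | Total=6.0 |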
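The totals in this table are simple sums of absolute differences between actual and predicted final scores. As a quick check, the calculation can be reproduced in Python with the figures copied from the table (the function and variable names are just for illustration):

```python
# (player, actual score out of 10, Elo prediction, ensemble prediction),
# copied from the table above.
results = [
    ("Ivanchuk", 7.0, 5.5, 6.0),
    ("Nepomniachtchi", 6.0, 5.0, 5.0),
    ("Dominguez", 5.5, 5.0, 5.0),
    ("Short", 5.5, 5.0, 5.0),
    ("Alekseev", 3.0, 5.0, 5.0),
    ("Bruzon Batista", 3.0, 4.5, 4.0),
]

def total_absolute_error(rows, prediction_column):
    """Sum of |actual - predicted| over all players."""
    return sum(abs(row[1] - row[prediction_column]) for row in rows)

elo_total = total_absolute_error(results, 2)
ensemble_total = total_absolute_error(results, 3)
print(elo_total, ensemble_total)  # 7.0 6.0
```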

By this measure, we see that the Elo predictions had a total error of 7.0 points, compared to the Winners’ Ensemble with its total error of 6.0, and so for this tournament at least, the Elo predictions were less accurate.  Again, this is just one tournament and way too tiny a sample for drawing any conclusions at all, but when you look at a lot of tournaments, the trend is clear. For instance, there were 221 tournaments (between April 2010 and June 2010) having an average participant Elo rating of 2200+ and at least 30 games played. If we perform the same comparison again, for each tournament separately, and then add up the results, we find that the Winners’ Ensemble method is much more likely to make a better prediction than the Elo method for everyone’s final score:

All of these predictions were calculated during 2011, but were for games already played (from the first half of 2010).  And of course, the true test of a prediction system is whether it can predict the future (not the past) successfully.  So, just for fun, after the contest the top finishers were also asked to calculate ratings up to the present (they were given additional data from FIDE in order to do this), and to use their system to submit predictions (still without knowing identities of the players) that would allow me to calculate winning odds in the Kazan FIDE Candidates Matches.  The contest winner (Tim Salimans) as well as two others from the top five (Vladimir Nikulin and the team of Jason Tigg and David Clague) agreed to participate in this little experiment. 

As above, any conclusions about the overall "accuracy" of a rating method, based only upon the outcome of one small tournament, should carry very little weight.  Nevertheless, it is interesting to note that all three systems drew identical conclusions about which players in the top 20 seemed to have Elo ratings that were too low. Alexander Grischuk (who recently made it to the finals of the Candidates' event) was identified as the most underrated of the eight Candidates, by far. All three systems felt that Grischuk (ranked #12 on the May 1 list) was underrated by about 20 Elo points, and ought to be #8 or #9 on the world rating list instead.

Of course, there is a flip side – all three systems felt that Boris Gelfand and Hikaru Nakamura were two of the three players in the top twenty who were most overrated by the Elo system. And in fact, both players recently won short matches where the three systems would have considered them to be very slight underdogs (Gelfand in his match against Grischuk, and Nakamura in his match against Ruslan Ponomariov). Further, Gelfand also had to first defeat two other top players just to reach the Candidates' finals match. Elo-based predictions would have given pre-tournament odds of 28-to-1 against Gelfand winning the event, whereas with the other three systems (on average) the pre-tournament odds would have been 35-to-1!

Such experiments are interesting, but the real way to measure the accuracy of a rating system is to look at very large numbers of players and games, such as was done in these two Kaggle contests. And it is clear that whether you are predicting the final standings in each tournament, or evaluating each game independently, the top finishers have developed rating systems that are far more accurate at predicting future outcomes than the Elo system (which finished in 80th place out of 189 in the latest contest). I consider the winners’ systems to be about 15% more accurate than the Elo system in their predictions. Another big question – what are the most promising practical improvements to the FIDE rating system – is a separate issue, pertaining to the “FIDE prize” category of the contest. This prize category was specifically targeted at practical improvements to the existing FIDE rating system. There is more to say on this matter, but it is a topic for another time!

Finally, I would very much like to thank the sponsors (Deloitte and ChessBase) of the main contest. It is quite encouraging that the winner of the main prize in this contest had a base model inspired by the winners of the previous contest.  It suggests that these contests are useful in advancing the state of the art for chess predictions, and perhaps we are approaching the "optimal" system!

Copyright Jeff Sonas / ChessBase


Previous articles by Jeff Sonas on ChessBase.com

Sonas: Assessment of the EU performance calculation
16.04.2011 – The 2011 European Individual Championship left 29 players with a tied score vying for eight places in the next World Cup. To break the tie the ECU used performance ratings, but calculated them in a way that led to some bizarre results – and to a formal protest by at least one player. Jeff Sonas introduces us to other, more logical systems. As usual his report is presented with exceptional clarity.

The Elo rating system – correcting the expectancy tables
30.03.2011 – In recent years statistician Jeff Sonas has participated in FIDE meetings of "ratings experts" and received access to the historical material from the federation archives. After thorough analysis he has come to some remarkable new conclusions which he will share with our readers in a series of articles. The first gives us an excellent overview of how the rating system works. Very instructive.

The Deloitte/FIDE Chess Rating Challenge
20.02.2011 – Statistician Jeff Sonas and Kaggle, a site specializing in data modeling with regular prediction competitions, have launched a new online contest to develop a more accurate chess rating system. Professional services firm Deloitte provides a $10,000 prize to the winner, and FIDE will also bring a top finisher to Athens, Greece to present their rating system. Report and California pictorial.

Can you out-predict Elo? – Competition update
21.09.2010 – Can we devise a more accurate method for measuring chess strength and predicting results than Elo, which has done good service for half a century? Jeff Sonas has given statisticians 65,000 games which they must use to predict the results of 7,800 other games. The idea is to find out who can out-perform Elo. In the lead is 28-year-old Portuguese biochemist Filipe Maia. Current status.

Impressions from FIDE rating conference 2010
10.06.2010 – The FIDE ratings conference, held last week in Athens, Greece, spent quite a bit of time discussing the problem of rating inflation. Two different opinions met head on: one of chess statistician Jeff Sonas, USA, and one represented by Polish GM Bartlomiej Macieja. The subject matter is not easy to understand, but our colleague Michalis Kaloumenos made a serious effort to do so. Food for thought.

Rating inflation – its causes and possible cures
27.07.2009 – Thirty years ago there was one player in the world rated over 2700. Fifteen years ago there were six. Today there are thirty-three. What is the cause of this "rating inflation"? A general improvement of chess skills? A larger number of players in the rating pool? The way the initial ratings are conducted? In this clearly written article statistician Jeff Sonas addresses these questions. Must read!

Rating debate: is 24 the ideal K-factor?
03.05.2009 – FIDE decided to speed up the change in their ratings calculations, then turned more cautious about it. Polish GM Bartlomiej Macieja criticised them for balking, and Jeff Sonas provided compelling statistical reasons for changing the K-factor to 24. Finally John Nunn warned of the disadvantages of changing a well-functioning system. Here are some more interesting expert arguments.

FIDE: We support the increase of the K-factor
29.04.2009 – Yesterday we published a letter by GM Bartlomiej Macieja asking the World Chess Federation not to delay the decision to increase the K-factor in their ratings calculation. Today we received a reply to Macieja's passionate appeal from FIDE, outlining the reasons for the actions. In addition interesting letters from our readers, including one from statistician Jeff Sonas. Opinions and explanations.

Making sense of the FIDE cycle
10.12.2005 – Why, many of our readers have asked, are eight players who got knocked out in round four of the World Cup in Khanty-Mansiysk still playing in the tournament? What, they want to know, is the point? Has it something to do with the 2007 FIDE world championship cycle? It most certainly does. Jeff Sonas explains in meticulous detail. Please concentrate.

The Greatest Chess Player of All Time – Part IV
25.05.2005 – So tell us already, who was the greatest chess performance of all time? After analysing and dissecting many different aspects of this question, Jeff Sonas wraps it up in the final installment of this series, awarding his all-time chess "Oscar" nomination to the overall greatest of all time.

The Greatest Chess Player of All Time – Part III
06.05.2005 – What was the greatest chess performance of all time? Jeff Sonas has analysed the duration different players have stayed at the top of the ratings list, at the highest individual rating spikes and best tournament performances. Today he looks at the most impressive over-all tournament performances in history, and comes up with some very impressive statistics.

The Greatest Chess Player of All Time – Part I
24.04.2005 – Last month Garry Kasparov retired from professional chess. Was he the greatest, most dominant chess player of all time? That is a question that can be interpreted in many different ways, and most answers will be extremely subjective. Jeff Sonas has conducted extensive historical research and applied ruthless statistics to seek a solution to an age-old debate.

How (not) to play chess against computers
11.11.2003 – Is there any way that human chess players can withstand the onslaught of increasingly powerful computer opponents? Only by modifying their own playing style, suggests statistician Jeff Sonas, who illustrates a fascinating link between chess aggression and failure against computers. There may still be a chance for humanity. More...

Physical Strength and Chess Expertise
07.11.2003 – How can humans hope to hold their ground in their uphill struggle against chess computers? Play shorter matches, stop sacrificing material, and don't fear the Sicilian Defense, says statistician Jeff Sonas, who also questions the high computer ratings on the Swedish SSDF list. Here is his evidence.

Are chess computers improving faster than grandmasters?
17.10.2003 – The battle between humans and machines over the chessboard appears to be dead-even – in spite of giant leaps in computer technology. "Don't forget that human players are improving too," says statistician Jeff Sonas, who doesn't think it is inevitable that computers will surpass humans. Here is his statistical evidence.

Man vs Machine – who is winning?
08.10.2003 – Every year computers are becoming stronger at chess, holding their own against the very strongest players. So very soon they will overtake their human counterparts. Right? Not necessarily, says statistician Jeff Sonas, who doesn't believe that computers will inevitably surpass the top humans. In a series of articles Jeff presents empirical evidence to support his claim.

Does Kasparov play 2800 Elo against a computer?
26.08.2003 – On August 24 the well-known statistician Jeff Sonas presented an article entitled "How strong are the top chess programs?" In it he looked at the performance of top programs against humans, and attempted to estimate an Elo rating on the basis of these games. One of the programs, Brutus, is the work of another statistician, Dr Chrilly Donninger, who replies to Jeff Sonas.

Computers vs computers and humans
24.08.2003 – The SSDF list ranks chess playing programs on the basis of 90,000 games. But these are games the computers played against each other. How does that correlate to playing strength against human beings? Statistician Jeff Sonas uses a number of recent tournaments to evaluate the true strength of the programs.

The Sonas Rating Formula – Better than Elo?
22.10.2002 – Every three months, FIDE publishes a list of chess ratings calculated by a formula that Professor Arpad Elo developed decades ago. This formula has served the chess world quite well for a long time. However, statistician Jeff Sonas believes that the time has come to make some significant changes to that formula. He presents his proposal in this milestone article.

The best of all possible world championships
14.04.2002 – FIDE have recently concluded a world championship cycle, the Einstein Group is running their own world championship, and Yasser Seirawan has proposed a "fresh start". Now statistician Jeff Sonas has analysed the relative merits of these three (and 13,000 other possible) systems to find out which are the most practical, effective, inclusive and unbiased. There are some surprises in store (the FIDE system is no. 12,671 on the list, Seirawan's proposal is no. 345). More.
