Rating debate (6): Here comes the proof!

5/4/2009 – "I couldn't believe my eyes when I read GM John Nunn's opinion," writes GM Bartlomiej Macieja (pronunciation supplied), the original initiator of this debate. He presents proof for the fact, challenged by Nunn, that the K-factor and the frequency of rating lists are related to one another. Other readers have also weighed in, and a wrap-up reply by John Nunn will appear soon. Long, interesting read.

Before we come to the Great K-factor Debate (6) we would like to clear up an important matter: how to pronounce the name of the initiator (instigator?) of this discussion, Polish GM Bartlomiej Macieja.

Listen to the Polish pronunciation of Maciej – spoken by beeloo (male from Poland). You must add a "ya" at the end for the surname, and the stress is on the second syllable ("Mah-CHEY-ya").

We come to the first name, Bartłomiej – which in the original contains a cunningly crossed out "l". It is of course the English Bartholomew, which comes from the Aramaic and means "son of Tholmai". Tholmai normally means "furrows", and Bartholomew is accordingly "son of the furrows", meaning one who is rich in land. Alternatively, it is speculated that Tholmai is a form of the Greek name Ptolemy, or comes from some minor characters in the Bible called Talmai (תלמי). Before you listen to the next daunting sound file, know that his friends call him Bartek, which is simply pronounced Bar-tek.

Polish pronunciation of Bartłomiej – spoken by Marcyyy (female from the US).

So it is Bartek Mah-chey-ya, and we hope that our readers will no longer shy away from discussing the important subject of rating calculations and change just because they are afraid of trying to pronounce the name of one of its prime advocates.


The K-factor – here comes the proof!

By GM Bartlomiej Macieja

I couldn't believe my eyes when I read GM John Nunn's opinion: "The K-factor and the frequency of rating lists are unrelated to one another. Rating change depends on the number of games you have played. If you have played 40 games in six months, it doesn't make any difference whether FIDE publishes one rating list at the end of six months or one every day; you've still played the same number of games and the change in your rating should be the same."

It does make a significant difference how often rating lists are published. To understand this effect it is enough to imagine a player rated 2500 playing one tournament a month. With two rating lists published yearly, if he wins 10 points in every tournament, his rating after half a year will be 2500+6*10=2560. If rating lists are published four times a year, after three months his rating becomes 2500+3*10=2530 so it gets more difficult for him to gain rating points in further tournaments. After three more tournaments the player reaches the final rating of only about 2500+3*10+3*6=2548. With six rating lists published yearly, the final rating of the player (after half a year) is only about 2500+2*10+2*7+2*5=2544.

Obviously this is only an approximation and the exact values may differ slightly, but the effect is clear. The rating change, contrary to GM John Nunn's opinion, is not the same. And that's what I meant by: "The higher frequency of publishing rating lists reduces the effective value of the K-factor, thus the value of the K-factor needs to be increased in order not to make significant changes in the whole rating system."
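The arithmetic of the 2500 player can be reproduced with a short sketch. It rests on assumptions that are not stated in the text: K=10, ten games per tournament, and a linear expectation curve with a slope of 1/700 per rating point. These were chosen only because they come close to the quoted figures.

```python
def final_rating(start, tournaments, per_list, base_gain=10.0, k=10.0, games=10):
    """Rating after `tournaments` events when a new list appears every `per_list` events.

    Assumptions (not from the article): K=10, 10 games per tournament, and a
    linear expectation curve with slope 1/700 per rating point.
    """
    published = start   # rating used for all calculations until the next list
    pending = 0.0       # points gained since the last list
    for played in range(1, tournaments + 1):
        # The same performance earns less once the published rating has risen.
        pending += base_gain - k * games * (published - start) / 700.0
        if played % per_list == 0:   # a new list is published
            published += pending
            pending = 0.0
    return published + pending


for per_list, label in [(6, "two lists a year"), (3, "four lists a year"), (2, "six lists a year")]:
    print(label, round(final_rating(2500, 6, per_list), 1))
```

Under these assumptions the three results land within a point of 2560, 2548 and 2544, and they fall as the lists become more frequent.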

There are many possible ways to establish the correct value of the K-factor. The following approach certainly deserves attention:

Let's imagine two players with different initial ratings, let's say 2500 and 2600, achieving exactly the same results against exactly the same opponents for a year. The main idea of the Elo system is that if two players do participate in tournaments and show exactly the same results, their ratings should be the same. You can also think of it as "forgetting about very old results". Please note that this is very different from the approach used in many other sports, for instance tennis. In the Elo system, if a player doesn't participate in tournaments, his rating doesn't change (I don't want to discuss here whether that is correct). But if he does, there is no reason why his rating should be different from the rating of another player achieving exactly the same results against exactly the same opponents.

With one rating list published yearly, as was initially done by FIDE, a value of at least K=700/N was needed to reach this goal, where N is the number of games played per year. As the majority of professional players play more than 70 rated games per year, K=10 would do the job. However, with more rating lists published yearly, the initially higher rated player will always have a higher rating than his initially lower rated colleague (if both achieve exactly the same results), unless the K-factor is extremely high. For this reason it is better to ask the question: which value of the K-factor will reduce the initial difference of ratings by a factor of 100 (for instance from 100 points to only 1 point) in a year?

In a good approximation, the answer is K = (m*700/N)*[1-(0.01)^(1/m)], where m is the number of lists published per year. For N=80 (the suggestion of GM John Nunn) we get: for m=2, K should be 16; for m=4, K should be 24; for m=6, K should be 28. Otherwise, an initially higher rated player may still have a higher rating a year later even if he was achieving worse results than an initially lower rated player. It would not only be strange, but also unfair, as for many competitions, including the World Championship Cycle, the participants are qualified by rating. Please note that if N is lower, the K-factor should be even higher.
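One way to read the formula, offered here as an interpretation rather than the author's stated derivation: if each list covers N/m games and the expectation curve is treated as linear with slope 1/700, every list multiplies the rating gap by 1 - K*(N/m)/700, and after m lists that product should equal 1/100. The sketch below evaluates the resulting expression for the values quoted above; the function and parameter names are illustrative.

```python
def k_factor(games_per_year, lists_per_year, reduction=100.0):
    """K that shrinks an initial rating gap by `reduction` within one year.

    Evaluates K = (m*700/N) * (1 - (1/reduction)**(1/m)),
    with N = games_per_year and m = lists_per_year.
    """
    m, n = lists_per_year, games_per_year
    return (m * 700.0 / n) * (1.0 - (1.0 / reduction) ** (1.0 / m))


for m in (2, 4, 6):
    print("m =", m, "-> K is about", round(k_factor(80, m)))
```

For N=80 this gives roughly 15.75, 23.93 and 28.13, which round to the 16, 24 and 28 stated in the text.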

Some people suggest that twelve months in a row of showing identical results may still not be enough to consider two players to be equally strong (or, to be more precise, to have their initial rating difference reduced by a factor of 100). Let's calculate which value of the K-factor will reduce the initial difference of ratings by a factor of 100 in two years. For N=80 (160 games in two years) we get K=17, for N=70 (140 games in two years) we get K=19.
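The two-year figures can be checked the same way. The text does not say which list frequency was used; the sketch below assumes six lists per year, because that choice reproduces K=17 and K=19. It also assumes the linear approximation with slope 1/700, and it verifies the closed form by shrinking the gap list by list.

```python
def k_for_horizon(games_per_year, lists_per_year, years, reduction=100.0):
    """K that shrinks a rating gap by `reduction` over `years` years."""
    m, n = lists_per_year, games_per_year
    total_lists = m * years
    return (m * 700.0 / n) * (1.0 - (1.0 / reduction) ** (1.0 / total_lists))


def remaining_gap(gap, k, games_per_year, lists_per_year, years):
    """Apply the per-list shrink factor once for every published list."""
    games_per_list = games_per_year / lists_per_year
    for _ in range(lists_per_year * years):
        gap *= 1.0 - k * games_per_list / 700.0
    return gap


# Assumption: six lists per year (not stated in the article).
for n in (80, 70):
    k = k_for_horizon(n, 6, 2)
    print("N =", n, "K is about", round(k), "gap left:", round(remaining_gap(100.0, k, n, 6, 2), 3))
```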

I believe that a sound judgement can be made from the last 100 games (that is even more than Professor Elo recommended). This means that the value of the K-factor accepted in Dresden during the General Assembly (K=20) was a wise choice.


More reader feedback

Alexander Kornijenko

John Nunn's reasoning, to quote himself, doesn't stand up to examination. I mean this one: "If you have played 40 games in 6 months, it doesn't make any difference whether FIDE publishes one rating list at the end of six months or one every day; you've still played the same number of games and the change in your rating should be the same."

Now, a concrete (unrealistic) example: all 40 games are played by a 2400 player who won all games against 40 different players rated 2400. Assume equal distribution of games in time (an unequal one would make the difference even bigger but I don't aim to show the extreme difference, just the fact that the results are not equal):

  1. All games are rated in two three-month periods: after the first three months 20 games are rated and the player gains 20*5=100 pts. His new rating is 2500. After the second three-month period he gains 20*3.6=72 pts. So his rating after 6 months is 2572.

  2. Games are rated every two months. Let's assume he plays 13 games in the first two months, 13 in the second two and 14 in the third two (again 40 games in 6 months). After the first two months he gets 13*5=65 pts. New rating: 2465. After the second two months he gets 13*4.1=53.3 pts. The 0.3 gets rounded off for the new list, so 53 points, new rating 2518. Finally, in the third list he gets 14*3.4=47.6, rounded to 48 points. New rating: 2566.

Now it's obvious, 2566 is not equal to 2572. So, Dr. Nunn was wrong about rating not depending on period length. This example ends up in a small difference, just six points, but one can easily construct an example with a much more extreme difference.
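The same comparison can be run in a few lines. This sketch uses the logistic Elo expectation formula rather than the rounded lookup table the hand calculation appears to use, so the three-list total may differ from 2566 by a point; K=10 is taken from the gains of 5 points per win quoted above.

```python
def expected_score(rating, opponent):
    """Logistic Elo expectation (an approximation of the published table)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def rate_in_periods(start, opponent, wins_per_period, k=10):
    """Rate a run of wins against equally rated opponents, one list per period."""
    rating = start
    for wins in wins_per_period:
        # Every game in a period is calculated against the last published rating.
        gain = k * wins * (1.0 - expected_score(rating, opponent))
        rating += round(gain)   # published ratings are whole numbers
    return rating


two_lists = rate_in_periods(2400, 2400, [20, 20])
three_lists = rate_in_periods(2400, 2400, [13, 13, 14])
print(two_lists, three_lists)
```

Splitting the same 40 wins over more lists leaves the player with fewer points, which is the effect the example is meant to show.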

As for cheating: it is easily ruled out if one requires a certain minimum number of games per rating period and a maximum variance of this number within the last three to four rating periods around the qualification event in question. Really, if in one period a player cheats and gets a rating higher than his performance, then in the next period this high rating will go down, because he will perform below his (new) rating.

As for inflation: I don't think there is any inflation in ratings at all. Games between >2400 and <2400 players happen relatively rarely, and if both categories of players do as well as the probability tables predict, the ratings do not change. But since the games are rare, the <2400 players do slightly worse than the table predicts (since such a game is likely to be lost) - and since they have a higher K, they lose more points than the >2400 opponents gain, which leads to deflation and not inflation. Of course, if the <2400 player wins, it happens the other way around, but it evens out as long as the probability tables are correct.

As for an ideal K-factor generally, I don't think it exists. It depends very much on the activity of the players – which changes all the time. I think both the previous and the new K-factor regulations are acceptable, but the new one is better since the previous system was too inert (especially with respect to rapidly improving or declining players).

Michael Babigian, Elk Grove, USA

It seems so easy to test the effect of various ratings changes in our computerized world, that I don't understand why the formula debate is framed by the "opinions" of politicians or top players.

First, it seems clear that a decision must be made about what is most important. Jeff Sonas specifically states and clearly believes that the most important use of a player's rating is to "predict players' future results." John Nunn states that individual results are "subject to a wide random variation" and changing the K-factor is not likely to "more accurately reflect a player's strength." Herein lies the rub. These two things, predicting future results and a player's smoothed average strength, are not necessarily well correlated (perhaps they are, but again, test it). So first we must answer the real question. Do we want:

  1. smoothed strength numbers over many games to get a general idea of the overall playing level in the recent past,
  2. a rating that best predicts future results, or
  3. are these both obtainable?

Here's where the testing comes in. We have years of tournament results including the dates the events took place as well as the round number within the tournament. Using these results databases, it appears trivial to have a computer apply the current rating system to the players in chronological order and then as you move through the data, check how accurately the rating predicts the next game result. After thousands and thousands of games the accuracy of the predictions can be determined for a specific formula or K-factor within some margin of error. You can then change the rating system and retest other K values, rating formulas, etc. Assuming the new results are outside the margin of error, you will have, through actual analysis, put this political debate to rest – "IF" predicting results is the most important goal. If you are looking for smoothed strength numbers to be used as event qualifiers, perhaps a different test methodology is needed, but again, test the idea against the actual data to determine the formula's performance and most importantly "publish the methodology used and the results."
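A minimal sketch of such a test might look like the following. It is not FIDE's procedure: ratings are updated after every game rather than per published list, every player starts from a hypothetical 1500, accuracy is measured as mean squared error, and the game list is made up purely to show the mechanics.

```python
def expected_score(r_a, r_b):
    """Logistic Elo expectation for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def backtest(games, k, initial=1500.0):
    """Mean squared prediction error over chronologically ordered games.

    `games` is a list of (player_a, player_b, score_a) with score_a in
    {0, 0.5, 1}. Each game is predicted first and only then used to update.
    """
    ratings = {}
    total_error = 0.0
    for a, b, score in games:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        predicted = expected_score(r_a, r_b)
        total_error += (score - predicted) ** 2
        ratings[a] = r_a + k * (score - predicted)
        ratings[b] = r_b - k * (score - predicted)
    return total_error / len(games) if games else 0.0


# Made-up toy results, for illustration only.
toy_games = [("A", "B", 1), ("A", "C", 1), ("B", "C", 0.5), ("A", "B", 1), ("C", "A", 0)]
for k in (10, 20, 30):
    print("K =", k, "mean squared error:", round(backtest(toy_games, k), 4))
```

With a real results database, the same loop could be repeated for each candidate K or formula and the errors compared against the margin of error.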

This of course does not address Dr Nunn's concerns about rating manipulation and cheating. It's sad to have to recognize this type of problem when constructing a rating system, but cheating appears in every known sport and chess is no different. It seems to me there are three ways to address this problem and unfortunately choosing between them is more a matter of picking one and deciding after the fact whether it was effective than about mathematical analysis. Predicting and accounting for the techniques used by cheaters is not a trivial matter and any system you design will be analyzed by cheaters and they will find new ways to exploit it.

Here are some rough thoughts on the cheating issue:

  1. Build the rating formula to provide the "most accurate results" as if cheating never happened and then employ other techniques to detect and weed out cheaters and their results. This approach has the advantage of keeping the rating formula constant over the long haul, but is susceptible to manipulation by politicians or governing bodies as the "definition" of cheating changes in order to catch new cheating techniques being used by the players.

  2. Build a rating formula that is less accurate but much less susceptible to manipulation by cheats. The down side is that eventually cheats will find ways of manipulating even the most well thought out formulas and you reduce the accuracy of the calculation.

  3. Use a combination of a slightly less accurate formula along with other techniques outside the rating calculation to detect and reduce the effect of cheating.

There are no easy answers when it comes to cheating, but establishing a rating formula that is accurate (by whatever measure you choose) and is not subject to wild inflation/deflation etc. is a simple matter of testing past results.

My final suggestion would be to hire/recruit the assistance of some independent statisticians from perhaps a well known university to test various methodologies and then publish the entire process along with their recommendations. Giving them the results databases, explaining what we hope to get from the rating numbers, and a description of known cheating methods, would give them a great place to start their study. Should the results spawn additional questions a follow up from the statisticians could be published addressing these areas.

P.S. Having a different K-factor for different rating levels is not mathematically sound. This creates an anomaly right at the threshold (i.e. 2400) where the rating methodology is no longer accurate. Losing at 2400 costs fewer points than would be gained for winning at 2399. This asymmetry can't "improve accuracy." If it did, why not always make wins worth more than losses? In addition, if a specific K-factor better predicts results at 2400, why would it fail to provide optimum results at 2350?
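The threshold anomaly is easy to put in numbers. The sketch below assumes a two-tier scheme with K=15 below 2400 and K=10 at or above it; those values are used here for illustration and are not given in the letter.

```python
def k_for(rating, threshold=2400, k_below=15, k_above=10):
    """Hypothetical two-tier K-factor: higher below the threshold."""
    return k_above if rating >= threshold else k_below


def change(rating, opponent, score):
    """Rating change for one game, using the logistic Elo expectation."""
    expected = 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))
    return k_for(rating) * (score - expected)


gain_at_2399 = change(2399, 2400, 1)   # the 2399 player wins
loss_at_2400 = change(2400, 2399, 0)   # the 2400 player loses the same game
print(round(gain_at_2399, 2), round(loss_at_2400, 2))
```

The winner gains more than the loser gives up, so a single game across the threshold changes the total number of rating points in the pool.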

Baquero Luis, Medellín, Colombia
If Macieja, Sonas and others are promoting a change based on arguments that according to a mathematician (Dr. Nunn) don't carry any weight, we must suspect that the proposal is just a move by someone in a negotiation class workshop. Who would benefit from this? Let me blitz some possibilities:

  • Managers, with an immediate profit mind, and money to buy rating points
  • Managers, who would love to see each month a supermatch between the new highest ever rating (false star) and his pupil
  • Anti-Kasparov mediocrities, who would like to see him every day lower in the history of ratings

I agree with Dr. Nunn that there are negative consequences, in the direction of some behavior we already see today: high rated players protecting their rating, keeping away from tournaments or arranging draws; using forbidden computer aid during games; buying, not a game, nor a tournament, but a string of tournaments. Serious players are not dreaming of permanent rule changes in order to surge; unfortunately FIDE officials are willing to approve any change if they smell that the public mass of chess enthusiasts would increase.

Haldun Unalmis, Houston, Texas
I totally agree with GM Nunn. The current rating system has a proven track record and there is no need for a dramatic change. The playing strength of a player may fluctuate from month to month, but that doesn't necessarily mean that his overall chess strength dramatically changes. The rating needs to be an indicator of the chess strength over a significant period of time not just over recent results.

Bill Coyle, Canon City, Colorado, USA
Has publication of a trailing average been considered? Retain the current K-factor and append a trailing average of the last twelve published ratings. That would give an immediate picture of a player's current strength relative to his past performance.
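A trailing average of this kind takes only a few lines. The sketch assumes a plain, unweighted mean over the most recent twelve published ratings and uses made-up values.

```python
def trailing_average(published, window=12):
    """Unweighted mean of the most recent `window` published ratings."""
    if not published:
        raise ValueError("at least one published rating is required")
    recent = published[-window:]   # fewer than `window` values early on
    return sum(recent) / len(recent)


# Made-up ratings from 13 consecutive lists, for illustration only.
history = list(range(2400, 2413))
print(history[-1], trailing_average(history))
```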

Robert Coeglin, Uppsala, Sweden
John Nunn writes: "The K-factor and the frequency of rating lists are unrelated to one another. Rating change depends on the number of games you have played. If you have played 40 games in 6 months, it doesn't make any difference whether FIDE publishes one rating list at the end of six months or one every day; you've still played the same number of games and the change in your rating should be the same."

This is just not true. It's simple mathematics. For example: you start with 2500 Elo and win 50 games, and each of these wins is calculated against your 2500 rating. From this you gain 200 Elo points and end up with 2700 Elo in the old rating system. If you produce new ratings more often, your rating will gradually increase, and the Elo points you get from each win will be fewer since your own rating is higher than 2500. The K-factor and the frequency of rating lists ARE related to each other, of course...

William Dubinsky, New Jersey, USA
John Nunn didn't read Sonas's letter. Sonas DID back-check the effect of an increase in K and concluded that it would increase the predictive value. At the same time, I think the letter from Mark Adams is exactly on point in that less emphasis on ratings and more emphasis on playing makes a lot of sense.

John Crooks, Stilwell, USA
I tend to agree with Dr. Nunn. I am an actuary by trade, and thus have a background in statistics and in determining likely expected outcomes based on prior information. I do not believe that increasing the K-factor will make ratings better predictors of likely outcomes in future events, with a few exceptions. In the case of a rapidly advancing or declining player the rating will always lag that player's current strength. A higher K-factor will shorten the time in which this occurs, but at what price? For established players, whose ratings would bounce around much more than before, it would do a worse job of predicting future results, unless you really believe that a player's mature strength intrinsically and predictably varies over short periods of time. There is plenty of evidence to contradict this. Look at the performances of some of our best players in the past few years (Topalov, Kamsky, Anand, Ivanchuk). They have had dramatic differences in tournament performances over short periods of time. Not only does a higher K-factor NOT help to predict these results, but it would in fact do a poorer job, as the rating moves between events would be larger in both directions.

IM Dietmar Kolbus, Trier, Germany
Raising the K-factor will also lead to a higher number of tournament disruptions due to withdrawals or non-appearance of players. As life proves, particularly during the winter season, the common cold or flu is widely spread, even in large tournament halls, and with the rating stakes raised, players with base symptoms of these diseases may have a much higher incentive not to show up for the game than in the past. This may wipe out legitimate title norm chances for some players and would be a frustration for organisers and sponsors. It can also be expected that the number of players withdrawing from a tournament after poor early results may increase sharply as well.

GM Nigel Davies, Southport UK
One other major issue here is that greater volatility in players' ratings will make it a lot harder for organisers to target particular categories of round robin tournaments. A second point is that it will also increase title inflation as players are more likely to bob above particular rating thresholds at some point.

Kris Littlejohn, Dallas, USA
As many others have said, it seems astonishing that FIDE has not (either themselves or with the help of a third party) simply recalculated the rating lists of the past 2-3 years using the proposed new system to provide all players with a simple side-by-side comparison. If this cannot (will not) be done, then perhaps someone out there could recalculate a handful of players' ratings, or even just one player's, based on the data that is already provided on the FIDE web site. At the top the obvious player that comes to mind is Ivanchuk - let's say from Jan. 2007 - present. He is far and away the most active and also "streakiest" of players at the very highest level - under the current system already fluctuating between nearly 2700 and 2800 in not especially long periods of time. With an increased K-factor would he now fluctuate between 2650 and 2850? It wouldn't be this dramatic of course since lists are also published more often, but just how wide would the pendulum swing?

This of course would not be perfect, as it would be done in a vacuum without all of his opponents' ratings also being recalculated. However, while I am no statistician and lack the expertise to judge, this seems like it would be a fairly small difference when looking at an individual. While this would provide little or no insight into the concerns about players below 2400 and title requirements, it might help to either allay or justify the fears our top grandmasters might have about the changes.

Nick Barnett, Cape Town, South Africa
Please encourage Jeff Sonas to get involved in rating matters. Elo ratings are fair but there is an obvious creep, using the great Korchnoi as an example. Viktor goes down the ladder while not appearing to lose that many rating points from his prime. A lot depends on whether you play in open events with rapidly improving kids or in restricted events against established opponents. I believe the Australians have abandoned Elo for something else.

Paul Dargan, Dubai, UAE
John Nunn states that K and the frequency of rating lists are unrelated. While this may be true for active professionals, I do not believe it is the case for amateurs who do not play many games. If you are under (or indeed over) rated then you gain extra credit for your performances until your rating changes. If you can only get one tournament in before your rating is corrected upwards, this slows the adjustment of your rating to its true level. Therefore publishing lists more frequently slows the correction of ratings that do not reflect playing strength. I'm not saying that increasing K is the answer, nor that it is a proportionate response. But Dr Nunn (rather unusually) seems to have missed the point!? Maybe there's some cunning tactical continuation he's got in mind that I haven't seen?

John Nunn has promised to send us a wrap-up answer to all the feedback.
