Before we come to the Great K-factor Debate (6) we would like to clear up an
important matter: how to pronounce the name of the initiator (instigator?) of
this discussion, Polish GM Bartlomiej Macieja.
Listen to the Polish pronunciation of Maciej – spoken by
beeloo (male from Poland). You must add a "ya"
at the end for the surname, and the stress is on the second syllable
("Mah-CHEY-ya").
We come to the first name, Bartłomiej – which in the original contains
a cunningly crossed out "l". It is of course the English Bartholomew,
which comes from the Aramaic and means "son of Tholmai". Tholmai normally
means "furrows", and Bartholomew is accordingly "son of the furrows",
meaning one who is rich in land. Alternatively, it is speculated that Tholmai
is a form of the Greek name Ptolemy, or that it derives from minor characters in the Bible
called Talmai (תלמי). Before you listen to the next
daunting sound file, know that his friends call him Bartek, which is simply
pronounced Bar-tek.
Polish pronunciation of Bartłomiej – spoken by
Marcyyy (female from the US).
So it is Bartek Mah-chey-ya, and we hope that our readers will no longer shy
away from discussing the important subject of rating calculations and change
just because they are afraid of trying to pronounce the name of one of its prime
advocates.
The K-factor – here comes the proof!
By GM Bartlomiej Macieja
I couldn't believe my eyes when I read GM John Nunn's opinion: "The K-factor
and the frequency of rating lists are unrelated to one another. Rating change
depends on the number of games you have played. If you have played 40 games
in six months, it doesn't make any difference whether FIDE publishes one rating
list at the end of six months or one every day; you've still played the same
number of games and the change in your rating should be the same."
It does make a significant difference how often rating lists are published.
To understand this effect it is enough to imagine a player rated 2500 playing
one tournament a month. With two rating lists published yearly, if he wins 10
points in every tournament, his rating after half a year will be 2500+6*10=2560.
If rating lists are published four times a year, after three months his rating
becomes 2500+3*10=2530 so it gets more difficult for him to gain rating points
in further tournaments. After three more tournaments the player reaches the
final rating of only about 2500+3*10+3*6=2548. With six rating lists published
yearly, the final rating of the player (after half a year) is only about 2500+2*10+2*7+2*5=2544.
Obviously this is only an approximation and the exact values may differ slightly, but
the effect is clear. The rating change, contrary to GM John Nunn's opinion,
is not the same. And that's what I meant by: "The higher frequency of publishing
rating lists reduces the effective value of the K-factor, thus the value of
the K-factor needs to be increased in order not to make significant changes
in the whole rating system."
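The effect can also be checked with a short simulation. The Python sketch below is only illustrative: it assumes a player of roughly 2600 strength listed at 2500, scoring 64% each month in a nine-round tournament against 2500-rated opposition, with K=10, so the exact numbers differ from the figures above, but the pattern is the same: the more often the list is published, the smaller the total gain after six months.

```python
# Assumed setup: a player of roughly 2600 strength listed at 2500 scores 64%
# each month in a nine-round tournament against 2500-rated opposition, K = 10.
# Between publications, games are rated against the last published rating.

def expected_score(diff):
    """Standard logistic Elo expectancy for a rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def rating_after_six_months(lists_per_year, k=10, games=9, score=0.64,
                            start=2500, opp=2500, months=6):
    published = start            # rating the calculations are based on
    actual = start               # accumulated but not yet published rating
    period = 12 // lists_per_year
    for month in range(1, months + 1):
        actual += k * games * (score - expected_score(published - opp))
        if month % period == 0:  # a new list is published
            published = actual
    return actual

for m in (2, 4, 6, 12):
    print(f"{m:>2} lists/year -> rating after six months: "
          f"{rating_after_six_months(m):.0f}")
# Prints roughly 2576, 2561, 2558 and 2555: the more frequent the lists,
# the smaller the total gain for the same results.
```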
There are many possible ways to establish the correct value of the K-factor.
The following approach certainly deserves attention:
Let's imagine two players with different initial ratings, let's say 2500 and
2600, achieving exactly the same results against exactly the same opponents
for a year. The main idea of the Elo system is that if two players do participate
in tournaments and show exactly the same results, their ratings should be the
same. You can also think of it as "forgetting about very old results".
Please note that it is not at all the same approach as is used in many other sports,
for instance in tennis. In the Elo system, if a player doesn't participate in
tournaments, his rating doesn't change (I don't want to discuss now if it is
correct or not). But if he does, there is no reason why his rating should be
different from the rating of another player achieving exactly the same results
against exactly the same opponents.
With one rating list published yearly, as was initially done by FIDE, the value
of at least K=700/N was needed to reach the goal. As the majority of professional
players play more than 70 rated games per year, a value of K=10 would do the
job. However, with more rating lists published yearly, the initially higher
rated player will always have a higher rating than his initially lower rated colleague
(if both achieve exactly the same results), unless the K-factor is extremely
high. For this reason it is better to ask: which value of the K-factor
will reduce the initial difference of ratings by a factor of 100 (for instance from 100
points to only 1 point) in a year?
To a good approximation, the answer is K = (m*700/N)*[1 - (0.01)^(1/m)],
where m is the number of lists published per year. For N=80 (suggestion of GM
John Nunn), we get: if m=2 -> K should be 16, if m=4 -> K should be 24,
if m=6 -> K should be 28. Otherwise, an initially higher rated player may
still have a higher rating a year later even if he was achieving worse results
than an initially lower rated player. It would not only be strange, but also
unfair, as for many competitions, including the World Championship Cycle, the
participants are qualified by rating. Please note that if N is lower, the K-factor
should be even bigger.
Some people suggest that twelve months in a row of showing identical results
may still not be enough to consider two players to be equally strong (or, to
be more precise, to have their initial rating difference reduced by a factor of 100). Let's
calculate which value of the K-factor will reduce the initial difference of
ratings by a factor of 100 in two years. For N=80 (160 games in two years) we get K=17,
for N=70 (140 games in 2 years) we get K=19.
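These figures can be verified with a few lines of Python. The one-year values follow directly from the approximation K = (m*700/N)*[1 - (0.01)^(1/m)]; for the two-year values the sketch below assumes the same six lists per year, i.e. 2*m rating periods before the factor-100 reduction is required, which reproduces K≈17 and K≈19.

```python
# K that shrinks an initial rating difference by a factor of `factor`
# within `years` years, given N rated games per year spread over m lists.

def k_for_reduction(n_games_per_year, lists_per_year, years=1, factor=100):
    m = lists_per_year
    periods = m * years
    return (m * 700 / n_games_per_year) * (1 - (1 / factor) ** (1 / periods))

for m in (2, 4, 6):
    print(f"1 year,  N=80, m={m}: K = {k_for_reduction(80, m):.0f}")    # 16, 24, 28

print(f"2 years, N=80, m=6: K = {k_for_reduction(80, 6, years=2):.0f}")  # 17
print(f"2 years, N=70, m=6: K = {k_for_reduction(70, 6, years=2):.0f}")  # 19
```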
I believe that a sound judgement can be made from a player's last 100 games (which is
even more than Professor Elo recommended). This means that the value of the
K-factor accepted in Dresden during the General Assembly (K=20) was a wise choice.
More reader feedback
Alexander Kornijenko
John Nunn's reasoning, to use his own phrase, doesn't stand up to examination. I
mean this passage: "If you have played 40 games in 6 months, it doesn't make
any difference whether FIDE publishes one rating list at the end of six months
or one every day; you've still played the same number of games and the change
in your rating should be the same."
Now, a concrete (unrealistic) example: all 40 games are played by a 2400 player
who wins every game against 40 different players rated 2400. Assume an even distribution
of games over time (an uneven one would make the difference even bigger, but I
don't aim to show the extreme difference, just the fact that the results are
not equal):
-
All games are rated in two three-month periods: after the first three months
20 games are rated and the player gains 20*5=100 pts. His new rating is
2500. After the second three-month period he gains 20*3.6=72 pts. So his rating
after 6 months is 2572.
-
Games rated every two months. Let's assume he plays 13 games in the first two
months, 13 in the second two and 14 in the third (again 40 games in
6 months). After the first two months he gets 13*5=65 pts. New rating 2465.
After the second two months he gets 13*4.1=53.3 pts. The 0.3 gets rounded off
for the new list, so 53 points and a new rating of 2518. Finally, in the third list he gets
14*3.4=47.6, rounded to 48 points. New rating: 2566.
Now it's obvious: 2566 is not equal to 2572. So Dr. Nunn was wrong about the rating
change not depending on the period length. This example ends up with a small difference,
just six points, but one can easily construct an example with a much more extreme
difference.
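The arithmetic can be reproduced with a short Python sketch. It uses the standard logistic expectancy formula instead of the FIDE lookup table (which is why the per-win gains of 5, 3.6, 4.1 and 3.4 points come out almost, but not exactly, the same), with K=10 and whole-point rounding at each list, as in the letter.

```python
# K = 10, a 2400 player winning 40 games against 2400-rated opponents,
# with the rating rounded to a whole number at each list publication.

def expected_score(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def rate_periods(wins_per_period, start=2400, opp=2400, k=10):
    rating = start
    for wins in wins_per_period:
        gain = wins * k * (1 - expected_score(rating - opp))
        rating = round(rating + gain)    # published lists carry whole numbers
    return rating

print(rate_periods([20, 20]))       # two three-month lists  -> 2572
print(rate_periods([13, 13, 14]))   # three two-month lists  -> 2565
```

The three-list total lands a point below the table-based 2566, but either way the two publication schedules end on different ratings, which is the point of the example.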
As for cheating: it is easily ruled out if one requires a certain minimum
number of games per rating period and a maximum variance of this number within
the last three to four rating periods around the qualification event in question.
Really, if in one period a player cheats and gets a rating higher than his performance
warrants, then in the next period this high rating will come down, because he will
perform below his (new) rating.
As for inflation: I don't think there is any inflation in ratings at all. Games
between >2400 and <2400 players happen relatively rarely, and if both categories
of players do as well as the probability tables predict, the ratings do not
change. But since such games are rare, the <2400 players do slightly worse
than the table predicts (since such a game is likely to be lost) - and since
they have a higher K, they lose more points than their >2400 opponents gain,
which leads to deflation, not inflation. Of course, if the <2400 player
wins, it happens the other way around, but it evens out as long as the probability
tables are correct.
As for an ideal K-factor in general, I don't think it exists. It depends greatly
on the activity of the players – which changes all the time. I think both
the previous and the new K-factor regulations are acceptable, but the new one is better,
since the previous system was too inert (especially with respect to rapidly
improving or declining players).
Michael Babigian, Elk Grove, USA
It seems so easy to test the effect of various rating changes in our computerized
world, that I don't understand why the formula debate is framed by the "opinions"
of politicians or top players.
First, it seems clear that a decision about what is most important must be made.
Jeff Sonas specifically states and clearly believes that the most important
use of a player's rating is to "predict players' future results."
John Nunn states that individual results are "subject to a wide random
variation" and changing the K-factor is not likely to "more accurately
reflect a player's strength." Herein lies the rub. These two things –
predicting future results and a player's smoothed average strength – are not necessarily
well correlated (perhaps they are, but again, test it). So first we must answer
the real question. Do we want
- smoothed strength numbers over many games to get a general idea of the overall
playing level in the recent past,
- a rating that best predicts future results, or
- are these both obtainable?
Here's where the testing comes in. We have years of tournament results including
the dates the events took place as well as the round number within the tournament.
Using these results databases, it appears trivial to have a computer apply the
current rating system to the players in chronological order and then as you
move through the data, check how accurately the rating predicts the next game
result. After thousands and thousands of games the accuracy of the predictions
can be determined for a specific formula or K-factor within some margin of error.
You can then change the rating system and retest other K values, rating formulas,
etc. Assuming the new results are outside the margin of error, you will have,
through actual analysis, put this political debate to rest – "IF"
predicting results is the most important goal. If you are looking for smoothed
strength numbers to be used as event qualifiers, perhaps a different test methodology
is needed, but again, test the idea against the actual data to determine
the formula's performance and most importantly "publish the methodology
used and the results."
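A sketch of such a backtest might look like the following. The game-record layout (player identifiers and a score for White) and the starting rating of 1500 are hypothetical placeholders rather than FIDE's actual data format; the idea is simply to measure each game's prediction error before using that game to update the ratings.

```python
# A minimal sketch of the backtest described above: replay the games in
# chronological order and, before rating each game, score how well the
# current ratings predicted its result.  The (white, black, score) record
# layout and the 1500 starting rating are hypothetical placeholders.

from collections import defaultdict

def expected_score(diff):
    """Standard logistic Elo expectancy for a rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def backtest(games, k):
    """games: iterable of (white_id, black_id, white_score) in date order.
    Returns the mean squared prediction error for this K-factor."""
    ratings = defaultdict(lambda: 1500.0)
    sq_err, n = 0.0, 0
    for white, black, score in games:
        e = expected_score(ratings[white] - ratings[black])
        sq_err += (score - e) ** 2          # measure the prediction first ...
        n += 1
        ratings[white] += k * (score - e)   # ... then update both ratings
        ratings[black] += k * (e - score)
    return sq_err / n

# Usage idea: compare K values on the same historical game list.
# for k in (10, 15, 20, 24):
#     print(k, backtest(load_games("games.pgn"), k))   # load_games is hypothetical
```

Comparing the mean error across K values on the same historical game list would give exactly the kind of published, reproducible evidence argued for above.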
This of course does not address Dr Nunn's concerns about rating manipulation
and cheating. It's sad to have to recognize this type of problem when constructing
a rating system, but cheating appears in every known sport and chess is no different.
It seems to me there are three ways to address this problem and unfortunately
choosing between them is more a matter of picking one and deciding after the
fact whether it was effective than about mathematical analysis. Predicting and
accounting for the techniques used by cheaters is not a trivial matter and any
system you design will be analyzed by cheaters and they will find new ways to
exploit it.
Here are some rough thoughts on the cheating issue:
-
Build the rating formula to provide the "most accurate results"
as if cheating never happened and then employ other techniques to detect
and weed out cheaters and their results. This approach has the advantage
of keeping the rating formula constant over the long haul, but is susceptible
to manipulation by politicians or governing bodies as the "definition"
of cheating changes in order to catch new cheating techniques being used
by the players.
-
Build a rating formula that is less accurate but much less susceptible
to manipulation by cheats. The downside is that eventually cheats will
find ways of manipulating even the most well-thought-out formulas, and you
reduce the accuracy of the calculation.
-
Use a combination of a slightly less accurate formula along with other
techniques outside the rating calculation to detect and reduce the effect
of cheating.
There are no easy answers when it comes to cheating, but establishing a rating
formula that is accurate (by whatever measure you choose) and is not subject
to wild inflation/deflation etc. is a simple matter of testing past results.
My final suggestion would be to hire or recruit the assistance of some independent
statisticians, perhaps from a well-known university, to test various methodologies
and then publish the entire process along with their recommendations. Giving
them the results databases, explaining what we hope to get from the rating numbers,
and providing a description of known cheating methods would give them a great place to
start their study. Should the results spawn additional questions, a follow-up
from the statisticians could be published addressing these areas.
P.S. Having a different K-factor for different rating levels is not mathematically
sound. This creates an anomaly right at the threshold (i.e. 2400) where the
rating methodology is no longer accurate. Losing at 2400 costs fewer points
than would be gained for winning at 2399. This asymmetry can't "improve
accuracy." If it did, why not always make wins more than losses? In addition,
if a specific K-factor better predicts results at 2400, why would it fail to
provide optimum results at 2350?
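A two-line calculation illustrates the asymmetry. The K values of 10 above the threshold and 15 below it are assumed here purely for illustration; only their inequality matters for the point being made.

```python
# K = 10 above the 2400 threshold and K = 15 below it are assumed here for
# illustration; the exact values matter less than the fact that they differ.

def expected_score(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def change(own, opp, k, score):
    """Rating change for one game, expectancy based on `own - opp`."""
    return k * (score - expected_score(own - opp))

high, low = 2401, 2399            # players on either side of the threshold

print("high player loses:", round(change(high, low, 10, 0), 1))   # about -5.0
print("low player wins:  ", round(change(low, high, 15, 1), 1))   # about +7.5
# The loser sheds fewer points than the winner collects, so points are
# created; when the higher-rated player wins, points are destroyed instead.
```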
Baquero Luis, Medellín, Colombia
If Macieja, Sonas and others are promoting a change based on arguments that,
according to a mathematician (Dr. Nunn), don't have any weight, we must suspect
that the proposal is just a move from someone in a negotiation workshop.
Who would benefit from this? Let me blitz through some possibilities:
- Managers with an immediate profit in mind, and money to buy rating points
- Managers who would love to see each month a supermatch between the new
highest rating ever (a false star) and their pupil
- Anti-Kasparov mediocrities, who would like to see him slip every day lower in the
history of ratings
I agree with Dr. Nunn that there would be negative consequences, in the direction
of behaviour we already see today: highly rated players protecting their rating by
keeping away from tournaments or arranging draws; using forbidden computer aid
during games; buying not a game, nor a tournament, but a string of tournaments.
Serious players do not dream of permanent rule changes in order to surge;
unfortunately FIDE officials are willing to approve any change if they sense
that the mass of chess enthusiasts would increase.
Haldun Unalmis, Houston, Texas
I totally agree with GM Nunn. The current rating system has a proven track record
and there is no need for a dramatic change. The playing strength of a player
may fluctuate from month to month, but that doesn't necessarily mean that his
overall chess strength dramatically changes. The rating needs to be an indicator
of chess strength over a significant period of time, not just of recent
results.
Bill Coyle, Canon City, Colorado, USA
Has publication of a trailing average been considered? Retain the current K-factor
and append a trailing average of the last twelve published ratings. That would
give an immediate picture of a player's current strength relative to his past
performance.
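A trailing average of this kind is trivial to compute. The sketch below uses a made-up list of published ratings purely as an example.

```python
# The published-rating history below is a made-up example.

def trailing_average(published_ratings, window=12):
    """Average of the most recent `window` published ratings."""
    recent = published_ratings[-window:]
    return sum(recent) / len(recent)

history = [2655, 2660, 2648, 2651, 2663, 2670,
           2668, 2659, 2655, 2662, 2671, 2666, 2674]
print(f"current: {history[-1]}, 12-list average: {trailing_average(history):.0f}")
```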
Robert Coeglin, Uppsala, Sweden
John Nunn writes: "The K-factor and the frequency of rating lists are unrelated
to one another. Rating change depends on the number of games you have played.
If you have played 40 games in 6 months, it doesn't make any difference whether
FIDE publishes one rating list at the end of six months or one every day; you've
still played the same number of games and the change in your rating should be
the same."
This is just not true. It's simple mathematics. For example: you start with
a 2500 Elo rating, win 50 games, and each of these wins is calculated against your 2500
rating. From this you gain 200 Elo points and end up at 2700 in
the old rating system. If you produce new ratings more often, your rating will
increase gradually, and the Elo points you get from each win will be fewer, since
your own rating is already higher than 2500. The K-factor and the frequency of rating
lists ARE related to each other, of course...
William Dubinsky, New Jersey, USA
John Nunn apparently didn't read Sonas's letter. Sonas DID back-test the predictive
value of an increase in K and concluded that it would increase the predictive
value. At the same time, I think the letter from Mark Adams is exactly on point
in that less emphasis on ratings and more emphasis on playing makes a lot of
sense.
John Crooks, Stilwell, USA
I tend to agree with Dr. Nunn. I am an actuary by trade, and thus have a background
in statistics and determining likely expected outcomes based on prior information.
I do not believe that increasing the K factor will make ratings better predictors
of likely outcomes in future events - with a few exceptions. In the case of
a rapidly advancing or declining player, the rating will always lag that player's
current strength. A higher K-factor will shorten the time over which this occurs,
but at what price? For established players, whose ratings would bounce around
much more than before, it would do a worse job of predicting future results,
unless you really believe that a player's mature strength intrinsically and
predictably varies over short periods of time. There is plenty of evidence to
contradict this. Look at the performances of some of our best players in the
past few years (Topalov, Kamsky, Anand, Ivanchuk). They have had dramatic differences
in tournament performances over short periods of time. Not only does a higher
K-factor NOT help to predict these results, but it would in fact do a poorer
job, as the rating moves between events would be larger in both directions.
IM Dietmar Kolbus, Trier, Germany
Raising the K-factor will also lead to a higher number of tournament disruptions
due to withdrawals or non-appearance of players. As life shows, particularly
during the winter season, the common cold and flu are widespread, even in large tournament
halls, and with the rating stakes raised, players with mild symptoms of these
illnesses may have a much higher incentive not to show up for a game than in
the past. This may wipe out legitimate title norm chances for some players and
would be a frustration for organisers and sponsors. It can also be expected
that the number of players withdrawing from a tournament after poor early results
may increase sharply as well.
GM Nigel Davies, Southport UK
One other major issue here is that greater volatility in players' ratings will
make it a lot harder for organisers to target particular categories of round
robin tournaments. A second point is that it will also increase title inflation
as players are more likely to bob above particular rating thresholds at some
point.
Kris Littlejohn, Dallas, USA
As many others have said, it seems astonishing that FIDE has not (either itself
or with the help of a third party) simply recalculated the rating lists of the
past 2-3 years using the proposed new system to provide all players with a simple
side-by-side comparison. If this cannot (or will not) be done, then perhaps someone
out there could recalculate a handful of players' ratings, or even just one player's, based
on the data that is already provided on the FIDE website. At the top the obvious
player that comes to mind is Ivanchuk - let's say from Jan. 2007 to the present.
He is far and away the most active and also "streakiest" of players
at the very highest level - under the current system his rating already fluctuates
between nearly 2700 and 2800 over not especially long periods of time. With an increased
K-factor would he now fluctuate between 2650 and 2850? It wouldn't be this dramatic,
of course, since lists are also published more often, but just how wide would
the pendulum swing?
This of course would not be perfect, as it would be done in a vacuum, without
all of his opponents' ratings also being recalculated. However, while I am no
statistician and lack the expertise to judge, this seems like it would make a
fairly small difference when looking at an individual. While this would provide
little or no insight into the concerns about players below 2400 and title requirements,
it might help to either allay or justify the fears our top grandmasters might
have about the changes.
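Such a single-player recalculation is straightforward to sketch, with the caveat the letter already mentions: the opponents' published ratings are held fixed. The game tuples and the loader function in the usage comment are hypothetical, not an existing FIDE data feed.

```python
# Replay one player's games in date order with a different K, holding the
# opponents' published ratings fixed.  The (opponent_rating, score) tuples
# and the loader in the usage comment are hypothetical input formats.

def expected_score(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def replay(start_rating, games, k):
    """games: list of (opponent_published_rating, score) in date order."""
    rating = start_rating
    trajectory = [rating]
    for opp, score in games:
        rating += k * (score - expected_score(rating - opp))
        trajectory.append(rating)
    return trajectory

# Usage idea: feed the same game list through K = 10 and K = 20 and compare
# the high and low points of the two trajectories.
# games = load_player_games("Ivanchuk", since="2007-01")   # hypothetical loader
# print(max(replay(2750, games, 10)), max(replay(2750, games, 20)))
```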
Nick Barnett, Cape Town, South Africa
Please encourage Jeff Sonas to get involved in rating matters. Elo ratings are
fair, but there is an obvious creep - take the great Korchnoi as an example.
Viktor goes down the ladder while not appearing to lose that many rating points
from his prime. A lot depends on whether you play in open events with rapidly improving
kids or in restricted events against established opponents. I believe the Australians
have abandoned Elo for something else.
Paul Dargan, Dubai, UAE
John Nunn states that K and the frequency of rating lists are unrelated. While
this may be true for active professionals, I do not believe it is the case for
amateurs who do not play many games. If you are under- (or indeed over-) rated,
then you gain extra credit for your performances until your rating changes. If
you can only get one tournament in before your rating is corrected upwards, this
slows the adjustment of your rating to its true level. Therefore publishing
lists more frequently slows the correction of ratings that do not reflect playing
strength. I'm not saying that increasing K is the answer, nor that it is a proportionate
response. But Dr Nunn (rather unusually) seems to have missed the point!? Maybe
there's some cunning tactical continuation he's got in mind that I haven't seen?
John Nunn has promised to send us a wrap-up answer to all
the feedback.