In Part I and Part II of this series, we looked at historical evidence suggesting that right now the top human grandmasters and the top chess computers are extremely closely matched. Further, there is no compelling evidence to indicate that computers are soon going to pull ahead of the top humans.
With the Kasparov-Deep Blue matches so far in the past, it must come as a big surprise to many people that computers have not yet surpassed the top grandmasters. Although computers are obviously getting stronger due to hardware and software improvements, humans have also improved their play against computers, faster than expected.
What does the future hold for grandmasters against computers? It all depends on which group can improve faster, relative to the other. "Improvement" typically suggests that a player is adding something positive to their play, but remember that it can also mean removing something negative from their play. Either one constitutes improvement.
I can think of three main categories where grandmasters and/or computers could improve:

1. Improved computer hardware
2. Improved computer software
3. Improved human play against computers
Let's go through those three categories and see how they apply to computer improvement against humans, as well as human improvement against computers.
Clearly, improving the hardware will allow a chess program to play objectively stronger. Faster hardware lets a program search deeper, or evaluate more accurately, in the same amount of "thinking" time. From examining the past several years of the SSDF (Swedish Chess Computer Association) computer list, we can say that hardware leaps of 80 points have happened approximately every two years. This would suggest that computer hardware is providing an annual increase of 40 points of strength.
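To put those point figures in concrete terms, here is a minimal sketch of the standard Elo expected-score formula (the function name and printed example are my own illustration, not from the article):

```python
def expected_score(rating_diff: float) -> float:
    """Standard Elo expected score for the higher-rated side,
    given its rating advantage in points."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

# The roughly biennial 80-point hardware leap seen on the SSDF list
# corresponds to about a 61% expected score in computer-vs-computer play.
print(round(expected_score(80), 3))   # 0.613
```

So an 80-point edge is real but modest: roughly 61 games out of 100, not domination.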
However, it is important to remember chess columnist Mig Greengard's quote:
"The computer doesn't really play chess. It plays another game that looks like chess but has its own rules."

When two computers are playing each other, if one can search 17 moves deep and the other can only search 15-16 moves deep, then the first computer has a big advantage, because it sees everything the second computer sees, and then some. That is why it is conceivable that computers are gaining 40 points a year against older computers from search depth alone. Against humans, however, a 17-move search versus a 15-move search is not as relevant, because the human isn't calculating nearly that far ahead in the first place.
Of course, it is difficult for humans to improve their raw playing strength against computers. However, there is still a very effective way to "improve" their results, and that is to remove a negative factor which has hindered grandmasters' performance: the lengthy match. You can see here that computers do progressively better against humans as a one-on-one match progresses:
This is likely due to the effect of physical and mental fatigue upon the human, as the match continues. In a match between two humans, the fatigue would mostly balance out as the match progressed, since both humans would get tired. But obviously the computer does not get tired or discouraged. It is also possible that this effect is related to humans using up their opening novelties at the start of a match, or some other factor, but fatigue seems likely to be the real culprit.
I should also point out that you don't see this effect in Swiss or round-robin tournaments that have both computer and human participants. Computers do about the same against humans, whether at the start, middle, or end of a tournament, so there seems to be something particularly draining for the humans about a one-on-one match against a computer.
Certainly, upgraded software will play objectively stronger chess, even on the same hardware. Improved chess knowledge, better opening books, better endgame tablebases, better search techniques, and better utilization of hardware will all enable superior moves to be found in the same amount of thinking time. How can we express this in terms of rating points? Well, in Part I we looked at how the SSDF ratings of the top-ranked computers have progressed over time. Let's review that graph once again:
However, remember that this only applies to games between computers. In the same way that hardware upgrades probably don't give the full 40-point annual improvement against humans, it seems likely that software upgrades also don't provide an additional 30-point annual improvement against humans. Surely some of those 30 points will come from improvements to a program's opening library. Since the older programs are commercially available, it is fairly straightforward to play thousands of games against them and identify holes in their opening books. This allows new software to dominate old software, but against humans, opening-book improvements (while useful) probably won't translate into a full 30-point annual gain.
With a 40-point annual improvement due to hardware upgrades, and a 30-point annual improvement due to software upgrades, that would normally suggest that computers are getting stronger at a rate of 70 Elo points a year, relative to humans. This is clearly not happening. If the SSDF list is indeed over-estimating the true rate of improvement of computer programs, what would we expect to see? Over time, the ratings of the top programs would drift higher and higher, until they got so ridiculously high that some sort of correction would need to be applied to reflect the true strength of the computer programs.
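The scale of the implied drift is easy to see with some arithmetic. This is purely an illustration of the argument above; the starting rating and time span below are made-up round numbers, not figures from the SSDF list:

```python
HARDWARE_GAIN = 40   # points/year suggested by SSDF hardware comparisons
SOFTWARE_GAIN = 30   # points/year suggested by SSDF software progression

def projected_rating(start: float, years: float) -> float:
    """Where the top SSDF rating would land if the nominal
    70-point annual gain were fully real against humans."""
    return start + (HARDWARE_GAIN + SOFTWARE_GAIN) * years

# From a hypothetical 2650 top rating, five years of a genuine
# 70-point annual gain would imply a 3000-rated program.
print(projected_rating(2650, 5))   # 3000.0
```

Since no program was performing at anything like that level against top grandmasters, the list-internal gains clearly were not transferring at face value.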
This is exactly what happened a few years ago, and that explains the curious downturn of the SSDF graph in mid-2000. This is what Thoralf Karlsson, SSDF chairman, had to say in August 2000:
The SSDF rating list provides information about the relative strength of chess programs, when tested in the way SSDF does, but does not necessarily say which Elo-rating a certain program would achieve after having played hundreds of tournament games against human players. How good or bad the individual correlation between SSDF- and ELO-ratings is, will most likely never be established. So many games against humans will never be played.
Apart from establishing relative ratings, we have had the ambition that the general level of the list would be fairly realistic, compared to human ratings. From our start in 1984 we have used tournament games against Swedish chess players to calibrate the list. At some points we have discarded older games, believing that human chess players with time have become better to exploit the weaknesses of chess programs. Until the latest rating list the level of the list has been unchanged from summer 1991, and was based on 337 tournament games against Swedish players between 1987 and 1991. Regrettably it has not been possible for us to play any more games for many years now.
For some time we had the general impression that the level of the list was rather OK. But during the latest years it has become more and more obvious that the best programs on the latest hardware don't get as high Elo-ratings as our list could be interpreted to predict. If this is due to differences between Swedish- and Elo-ratings, to the "human learning effect", to some kind of "spreading effect" in a computer-computer list or a combination of these and perhaps other factors, we don't know.
It is difficult to find a perfect solution, but we have chosen to correlate the level of the list to the results of tournament games between computers and Elo-rated humans, played during the latest years. For us it has been very convenient to use Chris Carson's compilation of such games. Calculations based on these games indicate that the level of the list is about 100 points too high. So from now on we have lowered the list with 100 points!
To summarize: before the correction, in early 2000, the SSDF ratings were still accurate in how they ranked computers against each other, but the actual rating numbers were too high across the board. Those numbers ultimately derived from a few hundred games played against Swedish players in 1987-1991, and it was becoming too much of a stretch to extrapolate forward from games played a dozen years earlier by the top Mephisto and Fidelity machines, on 68020 processors, against lower-rated humans. For one thing, there was no allowance for the fact that human players had gotten objectively stronger, or had learned to play better against computers, since 1991.
So, at that point, about 100 games were analyzed from events between humans and computers in 1997-2000. The humans in those games had an average FIDE rating below 2400; the only two events against really strong humans were Junior at Dortmund 2000, and Fritz at the Dutch Championships in 2000. Thoralf Karlsson also had to make some assumptions about the impact of different hardware, since the hardware used by Junior and Fritz in those events (for example) was different from that used by the SSDF. The conclusion from all of this analysis was that all SSDF ratings should be reduced by 100 points. There have been no further corrections since then.
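A calibration like this rests on comparing a program's performance rating against humans with its SSDF rating. Here is a sketch of the common linear approximation of performance rating; the function and the example numbers are mine, chosen only to be in the spirit of the SSDF sample, not taken from Chris Carson's data:

```python
def performance_rating(avg_opp: float, wins: int, draws: int, losses: int) -> float:
    """Linear approximation of tournament performance rating:
    average opponent rating plus 400 * (wins - losses) / games."""
    games = wins + draws + losses
    return avg_opp + 400.0 * (wins - losses) / games

# Hypothetical example: a program scoring 13/20 against opponents
# averaging 2400 performs at about 2520.
print(performance_rating(2400, 10, 6, 4))   # 2520.0
```

If a program's SSDF rating sits about 100 points above performance figures computed this way, lowering the whole list by 100 points is the natural correction.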
However, I believe that the same kind of upward drift has continued in the three-plus years since August 2000. It is true that today's top computers would dominate the top computers from three years ago, leading to a 200-point difference on the SSDF list. However, I don't see that it necessarily means that today's top computers would play 200 points better against top grandmasters.
For one thing, computers were doing unusually well against humans, exactly in that time frame. If you remember the performance rating graph from Part II a couple weeks ago, top computers had a performance rating (against humans) of 2444 between 1995 and 1997, and then it shot up 200 points (to 2647) between 1998 and 2000. But the improvement didn't continue at that rate; the performance rating of computers against humans only went up by a total of 62 points between the 1998-2000 range and the 2001-2003 range. And as I tried to prove in Part II, even that improvement only came from computers becoming more dominant against the lower-rated humans; humans rated 2550+ are just as successful against computers today as they were five years ago.
Since the SSDF list is calibrated against human-computer results from 1997-2000, and more than 80% of those humans were rated below 2550 anyway, I think it is a mistake to look at the 2800+ SSDF ratings of the top programs and to conclude that those top programs will dominate today's top grandmasters. The battle is not over yet.
In Part IV Jeff Sonas examines playing style and the question of whether it is possible to "tune" computers to play especially well against humans. He includes statistical analysis on which openings are especially suited to the playing style of computers, i.e. which lines humans should probably avoid. This article will appear this weekend – well, you're just going to have to wait like everyone else, aren't you, Garry...
Jeff Sonas is a statistical chess analyst who has invented a new rating system and used it to generate 150 years of historical chess ratings for thousands of players. You can explore these ratings on his Chessmetrics website. Jeff is also Chief Architect for Ninaza, providing web-based medical software for clinical trials.