Database basics - part 7

by ChessBase
10/8/2004 – Our ongoing ChessBase Workshop series on database searches continues this week with an examination of some common mistakes that might prevent you from gathering the information you require. Learn how to avoid these missteps by reading Steve Lopez's latest article.


DATABASE BASICS -- PART 7

by Steve Lopez

Many years ago, while I was working in auto parts, a customer came to the counter to inquire about a part. He gave me the part description and his vehicle's specs. I looked up the part, pulled it from inventory, and brought it out to him.

"That's not what I want!" he cried indignantly. I dutifully reviewed the description of what he'd asked for and the year, model, and engine of his car. Double-checking everything, I discovered that I'd pulled the exact part he'd requested. "But this is what you asked for," I told him.

"Damn it!" he cried. "Don't get me what I asked for -- get me what I want!"

I see a lot of parallels between that story and the experiences of some chess database users. Back when I did telephone support for chess software I would sometimes have to diagnose the reason why a user's database search wasn't locating what he needed. Unless there was some weird technical glitch (like a corrupted database file), the problem wasn't with the program -- the root cause was located between the keyboard and the chair.

See, a piece of computer software is stupid; it can't think for itself. It can't make intuitive leaps -- it can only search for and find what you ask it to find. It will always get you what you asked for, not necessarily what you wanted.

That's why you need to be specific about what you ask Fritz' (or any other program's) database Search mask to locate for you. This is exactly the reason why we've spent the last five ChessBase Workshops describing in detail how to use the various Search mask dialogues: to help you learn to ask for the material you want with a minimum of confusion.

In this week's article, we're going to start looking at "putting it all together" by using the different dialogues to locate what you need and help you avoid a few pitfalls along the way.

I've learned that the single biggest mistake that users make is to ask the program for too much information. Just because a dialogue item exists doesn't mean that you have to use it in every single search. We'll start with a simple example. Let's look for all of the King's Indian Defense games in a particular large database. The easiest way to do this is by searching for ECO codes E60 to E99. [1] After performing the search, we come up with 156,443 games.

[1] If you're going to do a lot of database searches, I heartily recommend that you learn the ECO codes for the openings that you regularly play. You can just type the alphanumeric code into the "ECO" field of the Search mask and very quickly be presented with a list of all games which qualify. If you want to see a "translation table" of ECO codes and their equivalent English names, you can find one here. (And please, please, please bookmark it as one of your "Favorites" now. You would not believe the number of e-mails I get from people who ask me for some link or other that I presented in an article from years gone by. Trust me -- after writing over 400 chess software articles on various websites, if you can't remember the article in which I presented a link, neither can I.)

Now let's look for K.I.D. games again, but this time we'll add an isolated White pawn on d4 to the search parameters.[2] With this extra parameter added, the search turns up considerably fewer games -- this time the program finds 3,706 games.

[2] The technical details for how to do this are as follows. Click the "Game data" tab and type "E60" (without the quotes) in the left-hand box to the right of "ECO". In the right-hand box, type "E99". You've just told the program that you want all games from ECO codes E60 to E99. Now click the "Position" tab. With the "'Look for' board" radio button selected, place a White pawn on d4. Now click the "'Exclude' board" radio button and place White pawns as follows: c2 through c7 inclusive, e2 through e7 inclusive, and d2, d3, d5, d6, and d7. Click the "OK" button and the program will search for all King's Indian Defense games in which White has an isolated d4 pawn.
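For the propellerheads in the audience, here is a small Python sketch of the "Exclude" board recipe from the footnote above. It is not anything the program runs internally; the function name and the square notation are my own assumptions, and it simply lists the squares that must be free of friendly pawns for a given pawn to count as isolated under that recipe.

```python
FILES = "abcdefgh"

def isolated_pawn_exclude_squares(square):
    """List the squares that must hold no friendly pawn, per the footnote's recipe.

    `square` is a string such as "d4". Pawns can only stand on ranks 2 to 7.
    """
    file, rank = square[0], square[1]
    idx = FILES.index(file)
    # Neighbouring files (only one neighbour for the a- and h-files)
    neighbours = [FILES[i] for i in (idx - 1, idx + 1) if 0 <= i < 8]
    exclude = [f + str(r) for f in neighbours for r in range(2, 8)]
    # Same file: every rank except the one the pawn itself stands on
    exclude += [file + str(r) for r in range(2, 8) if str(r) != rank]
    return sorted(exclude)

d4_exclude = isolated_pawn_exclude_squares("d4")
print(d4_exclude)
```

For "d4" this gives c2 through c7, e2 through e7, and d2, d3, d5, d6, and d7: the same seventeen squares you'd fill in by hand on the "Exclude" board.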

Now let's toss one more parameter into the mix: we'll do the same search but limit it to the years 1990 through 1999. Upon doing this search we come up with a total of 2,060 games. So we can see that each time we add an extra parameter to the search we get fewer hits, because a game has to match every parameter you've entered in order to qualify. It appears to be a paradox, but it's true: the more information you supply in the Search mask, the less information you receive in return. (I wrote a longer piece on this a few months ago in ChessBase Workshop.)
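If it helps to see the narrowing effect in miniature, here's a rough Python sketch. Everything in it is an assumption made for illustration: the `Game` record, the `search` helper, and the five toy games are invented, and none of it reflects how the program stores or searches its data. The point is only that criteria combine with a logical AND.

```python
from dataclasses import dataclass

@dataclass
class Game:
    eco: str    # ECO opening code, e.g. "E97"
    year: int   # year the game was played

def search(games, *criteria):
    """Return the games that satisfy every criterion (a logical AND)."""
    return [g for g in games if all(test(g) for test in criteria)]

def is_kid(game):
    # ECO codes share one letter-plus-two-digits format,
    # so plain string comparison is enough for a range check
    return "E60" <= game.eco <= "E99"

def in_nineties(game):
    return 1990 <= game.year <= 1999

# Invented toy data, not taken from any real database
games = [Game("E97", 1995), Game("E62", 1985), Game("B90", 1993),
         Game("E99", 1999), Game("C65", 2001)]

kid_games = search(games, is_kid)
kid_nineties = search(games, is_kid, in_nineties)
print(len(kid_games), len(kid_nineties))
```

Each extra criterion can only keep the list the same size or shrink it; it can never add a game that the earlier criteria had already thrown out.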

Now we'll look at another interesting phenomenon that I like to call "Garbage in: Garbage out".[3] If you do a player search you need to spell the player's name correctly if you want any hits. I wish I had a dollar for every phone call or e-mail I've received from users who say, "Your program is crap! I did a search for all of Bobby Fischer's games and got nothing back! Your software doesn't work!"

[3] "Garbage in: Garbage out" (or GIGO for short) is an old computer geek term for exactly the same phenomenon I'm describing. If you input junk, you get junk back.

Of course, upon further investigation it's discovered that the problem again occurred somewhere between the keyboard and the chair. The first thing I do is ask how the user spells "Fischer". And, of course, 99% of the time they've left out the "c" and spelled it "Fisher". Chess computer software can't make "fuzzy" assumptions about what you're really looking for; when you type in "Fisher", the program looks for games played by people with that exact name. That old rant "Don't get me what I asked for -- get me what I want!" just doesn't cut it here -- all a program can do is find exactly what you tell it to find. And if you tell it the wrong thing, you get something other than what you wanted. Garbage in, garbage out.
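Here's a tiny Python sketch of the "Fisher" problem. The surname list is invented, and the helper functions are my own assumptions rather than anything inside the program. The first helper does what the text describes, an exact character-for-character match. The second uses Python's standard `difflib` module as a do-it-yourself spelling check, which is something you could run on your own list of names, not a feature of the Search mask.

```python
import difflib

# Invented sample of surnames as they might appear in game headers
surnames = ["Fischer", "Fine", "Flohr", "Fischer", "Euwe"]

def exact_search(names, query):
    """Return only the entries that match the query character for character."""
    return [n for n in names if n == query]

def spelling_hints(names, query, cutoff=0.8):
    """Suggest stored spellings that are close to the query."""
    candidates = sorted(set(names))
    return difflib.get_close_matches(query, candidates, n=3, cutoff=cutoff)

misses = exact_search(surnames, "Fisher")    # the "c" is missing: no hits
hits = exact_search(surnames, "Fischer")     # spelled as stored: two hits
hints = spelling_hints(surnames, "Fisher")   # a nudge toward the stored spelling
print(misses, hits, hints)
```

The exact search comes back empty for "Fisher" even though two "Fischer" entries are sitting right there, which is the whole garbage-in, garbage-out story in three lines.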

Most of the other 1% of "Fischer errors" are caused by the user typing "Bobby" in the field for the player's first name. In professional quality databases, the man's name is given as "Robert", not "Bobby". We'll come back to the remainder of that last 1% in a moment, after we first hit on another important point.

Player name searches can be tricky and much depends on the quality of the database you're using. Let's use Bobby Fischer as an example again. If you do a player search for "Fischer" (no first name or initial), you'll get Bobby's games -- but you'll also get other players whose last names are also Fischer. You'll need to cut down the search by adding a parameter: the first initial "R". This will get you closer, but you'll still get some other players mixed in. So you spell out the entire first name: "Robert".

And this leads right to another rant I once heard. "Your database is crap! I did a search for 'Robert Fischer' and got games played between 1973 and 1991, and after 1992! Everybody knows that Bobby Fischer was inactive during those years! What are you guys trying to pull??"

Nothing. The database search is correct. The other Robert Fischer is a USCF Master and is a frequent player in the DC/Maryland/Virginia area.[4] Bob's games will turn up in a search of any of the larger databases when you use the player name parameters of "Robert Fischer".

[4] I know Bob and he's a really good guy; if you bump into him at a tournament, please give him my absolute best regards. One of my favorite memories from Virginia Chess Federation tournaments was when some young kid would see Bob's name on a wall chart and freak out: "Bobby Fischer's here! He's playing on Board Three!" Man, I don't care how many times I witnessed that; it never got old.

So you might try adding yet another parameter: specifying an Elo rating of 2600+. The problem is that this will eliminate most of Bobby Fischer's games: the majority of his career took place before the introduction of the Elo system, so the bulk of his games won't carry a rating attached to his name.
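A quick Python sketch shows why the rating filter backfires. The records below are made-up values for illustration only (the years and ratings are not real data), and the `min_elo` helper is hypothetical. A header with no rating is stored here as `None`, and a "2600 or better" test can never pass for a rating that isn't there.

```python
# Made-up records: (player, year, elo); None means the header carries no rating
records = [
    ("Fischer, Robert", 1958, None),
    ("Fischer, Robert", 1963, None),
    ("Fischer, Robert", 1972, 2700),
    ("Fischer, Robert", 1995, 2250),
]

def min_elo(rows, threshold):
    """Keep only rows that have a rating and whose rating meets the threshold."""
    return [r for r in rows if r[2] is not None and r[2] >= threshold]

rated_2600 = min_elo(records, 2600)
dropped_unrated = [r for r in records if r[2] is None]
print(len(rated_2600), len(dropped_unrated))
```

The filter does get rid of the lower-rated namesake, but it also silently drops every game with no rating in the header, which is exactly the trade-off described above.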

The point? Sometimes you have to pare down the search results manually because life ain't perfect. In a perfect world (at least as far as databasing is concerned), Bobby Fischer's entire career would have been Elo rated, or he'd never have left chess, or never have made a brief 1992 comeback, or the US Master named Robert Fischer would have been named "Fred Fernwinkle" instead (though Bob would doubtless take issue with that last point). Sometimes you're just going to have to live with the fact that a perfectly-specified search is going to turn up some unwanted results due to the fact that the world isn't perfect. Murphy's Law is the underlying rule of the universe.

Another example is "A. Karpov". You can do a search for "Karpov" and get all of the former World Champion's games -- along with lots of other Karpovs and even a "Karpova" or two. If you limit the search by using "Karpov, A." you get closer, but you also get Al Karpov's games (no joke -- try it and see). You can try "Karpov, Anatoly" and get dangerously close, but then you miss games listed as "Karpov, Anatoli" as well as games in which no first name or initial is provided.

And that drags us kicking and screaming to two more points. The first is that a database search is only as good as the source data. If you're working with a database in which the games of some players appear under multiple name spellings, or in which some use first initials/names as part of the header info while others don't, you're going to get incomplete results. Even professional, commercial databases contain mistakes here and there -- not a happy thought but understandable when you realize that most commercial databases contain over two million games these days.[5]

[5] It's interesting (and kind of humorous) to note here that there has been a huge explosion of available chess data, accelerated by the rise in popularity of the Internet. I started in the chess software business in 1992 and back then a 100,000 game database was considered to be "da bomb". Software programs typically shipped with databases of 2,000 games or less. Keep in mind, though, that the most you could fit on a high-density 3.5" floppy disk back then was 5,000 unannotated games. You can fit a lot more on a CD or DVD today with plenty of room to spare. These days if a chess program doesn't ship with at least a half-million game database the consumer feels ripped off. Ah, the march of progress...

So the quality of the source data definitely has an impact on your searches. Another related problem is an insurmountable one: translation problems between alphabets.

It's tough to translate names from one alphabet to another, especially when certain characters have no single English equivalent. Spelling names phonetically isn't a foolproof solution, either, when some languages contain sounds which have no single equivalent in another tongue. Don't believe me? Try this experiment (if you have the necessary library resources). Pull out a present-day atlas and look up the capital of China; you'll likely see it printed as "Beijing". Now find an atlas from as recently as thirty years ago and find the same location -- it'll be printed as "Peking". Go back farther, to the early 20th century, and you'll see it as "Peiping". But the city's name hasn't changed. The Chinese have been pronouncing it the same way since antiquity. The varied spellings represent the ongoing struggle of Westerners to approximate the Chinese pronunciation in print, exacerbated by the fact that the phonetic sounds don't "translate" well into English characters.

The same thing happens with translations between alphabets. It's a noted fact that many strong chessplayers over the last three-quarters of a century have been from the former Soviet Union.[6] This creates problems for folks compiling chess databases because some characters in the Cyrillic alphabet have no single corresponding character in the English alphabet. Viktor Korchnoi's name is the leading example of this problem. In addition to his first name being variously spelled with a "c" or a "k", his last name has been spelled "Kortchnoi", "Korchnoi", "Kortschnoy", "Korchnoy", "Kortchnoj", etc. etc. etc. ad infinitum.

[6] Here's the "Cliffs Notes" version. Early in the life of the Soviet Union, it was decided that the alleged superiority of the Communist system over Capitalism would be proven on many battlegrounds: athletic, scientific, military, and intellectual. The arena the Soviets chose for the intellectual battle was chess. It was a natural choice, since chess was an ingrained part of the Eastern European culture anyway. Government-sponsored programs were established in the USSR to develop the populace's chess skills, particularly in the identification and training of child prodigies. This worked wonderfully well -- that's why we've seen many, many more strong Soviet chessplayers than we've seen emerge from the Western bloc. In fact, the "average, man on the street" citizen of the USSR (back in the day) was a much better chess player than his or her Western counterpart: I've read estimates that an "average" Soviet chessplayer's skill corresponded to a USCF Class A rating. I once knew a Russian émigré who became something of a regional "superstar" in the DC/Baltimore area back in the early 1990s, as he was a very strong USCF Master (once he got rated on these shores). He was baffled by the attention -- "back home" he was considered to be a bit better than average, but nothing special.

So how do we deal with this? If you have a database you've built yourself from a variety of sources, you'll need to manually edit the names to try to achieve some type of uniformity (or else live with the fact that you'll need to do multiple searches to find all of the requisite games). If you have a commercial database all you need to do is try a search; if you get no hits, manually scan down the database list to pick out the player's name and make a note of the spelling.
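If your homemade database lives in a format you can process with a script, the "achieve some type of uniformity" step might look roughly like this Python sketch. The variant spellings are the ones listed above; the choice of "Korchnoi, Viktor" as the target spelling is arbitrary, the "Surname, First" header layout is an assumption, and the `normalize` function is hypothetical rather than a feature of any ChessBase product.

```python
# Variant spellings from the text, mapped to one arbitrarily chosen spelling
SURNAME_VARIANTS = {
    "Kortchnoi": "Korchnoi",
    "Kortschnoy": "Korchnoi",
    "Korchnoy": "Korchnoi",
    "Kortchnoj": "Korchnoi",
}
FIRST_NAME_VARIANTS = {"Victor": "Viktor"}

def normalize(name):
    """Rewrite a 'Surname, First' header into the chosen spelling."""
    surname, _, first = name.partition(",")
    surname = surname.strip()
    first = first.strip()
    # Names not listed in the tables pass through unchanged
    surname = SURNAME_VARIANTS.get(surname, surname)
    first = FIRST_NAME_VARIANTS.get(first, first)
    return f"{surname}, {first}" if first else surname

print(normalize("Kortschnoy, Victor"))
print(normalize("Kortchnoj"))
print(normalize("Karpov, Anatoly"))
```

A lookup table like this only fixes the spellings you already know about, so you'd still want to scan the player list by eye now and then for new variants.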

OK, now we need to backtrack to Fischer again, but it's to make an important point. I once had a baffling phone call from a user who tried a search for Fischer's games as White and got no hits, but he'd spelled Bobby's last name correctly. This puzzled me for a moment or two, until I had a sudden epiphany.[7] I asked the guy to start clicking on the other Search mask tabs, and I found the problem. He'd previously been doing a search for Grob games (1.g4) and hadn't reset the Search mask. The board in the "Position" dialogue still had a pawn on g4 when he'd typed Fischer's name into the Player field. So the program was looking for all games with Fischer as White that started with 1.g4. There are none -- with only a couple of exceptions, Bobby was strictly an e4 player as White.

[7] I've been asked (many times through the years) how I became a help desk/technical support person and what are the skills required. It takes a really weird type of person to be a "tech head" (or "propellerhead", as I prefer to refer to myself), and I'm no exception. About 90% of it requires an intimate familiarity with the software -- you need to use it extensively and know the features like the back of your hand. Another 7% or 8% of the job requires logical thinking skills and experimentation -- reconstructing the steps that the user was taking when the problem occurred and a willingness to risk your machine and data (and occasionally your sanity) to experiment and uncover the problem and solution. The other 2% or 3% is the ability to "think outside the box" and hit these "sudden epiphanies" which lead you to the solution (usually doing the aforementioned experimentation for confirmation). I'm not blowing my own horn here, far from it. It's often tough to be "different". In fact, most of us propellerheads are genuine mutants -- and Professor X isn't taking any more applications.

That happy little accident leads us right to the last point we'll make this week: the Search mask doesn't reset between searches unless you click the "Reset" button. (The one exception: the Search mask also resets when you exit the program.) This is a huge point -- it's mondo important. I've had dozens of similar calls over the years and, in almost all cases, the problem was that the user hadn't reset the Search mask between two unrelated searches.

So the "Reset" button is your friend. Don't be afraid to use it.
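To see why a leftover setting is so sneaky, here's a toy Python model of a search form that remembers its settings. The `SearchMask` class, its field names, and its methods are all invented for illustration and don't correspond to the program's actual internals; the sketch just replays the Grob-then-Fischer phone call.

```python
class SearchMask:
    """Toy model of a search form that keeps its settings between searches."""

    def __init__(self):
        self.criteria = {}

    def set(self, field, value):
        self.criteria[field] = value

    def reset(self):
        # Equivalent in spirit to clicking the "Reset" button
        self.criteria.clear()

    def describe(self):
        return dict(self.criteria)

mask = SearchMask()
mask.set("position", "white pawn on g4")   # first search: Grob games
mask.set("white", "Fischer")               # second search, mask NOT reset
stale = mask.describe()                    # still carries the g4 pawn

mask.reset()
mask.set("white", "Fischer")
clean = mask.describe()                    # only the player name remains
print(stale)
print(clean)
```

The stale version is asking for Fischer games as White with a pawn on g4, which is a perfectly valid request that just happens to match nothing.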

Still more to come next week. Until then, have fun!


© 2004, Steven A. Lopez. All rights reserved.

