Law of diminishing returns

11/27/2003 – Cramming the ChessBase search mask full of search criteria should result in plenty of "hits", right? Not exactly – the more you put in, the less you get out. In this week's ChessBase Workshop, database expert Steve Lopez explains the "law of diminishing returns".

THE LAW OF DIMINISHING RETURNS

by Steve Lopez

I recently had an interesting e-mail exchange with a friend regarding the search features in ChessBase (although the discussion could apply equally well to the database search function of the Fritz family of playing programs). My friend did a search on the Mega Database (over 2 million games) for all the games of a specific player using a specific opening, all of which were won by White in fewer than forty moves, and which involved an isolated d-pawn for White. My friend was simply astonished that the search didn't come up with thousands of games. "With all that specific stuff I was looking for, you'd think I'd have come up with a lot more than three games!"

Noooooo, I wouldn't think that, because I know how data searches work: the more info you put into the search mask, the less material you get back. And this applies to all kinds of data searches, not just chess database searches. Let's look at a typical (albeit non-chess) example to give you an idea of how this works. Let's say you live in a medium-sized town and you have one of those phone book CD collections which contains millions of names, addresses, and phone numbers culled from various public phone records in the U.S. Let's say you do a search for every phone number from your home town -- the software returns 50,000 "hits"; that is, 50,000 phone numbers from your hometown phone book.

Now let's assume you do a search for every number from your home town in which a person's last name starts with the letter "A". With this (more detailed) search, you get 4,000 hits. Now you narrow the search to everyone from your home town whose last name is "Adams". This brings back 10 hits. You narrow the search again, to everyone from your home town whose last name is Adams and who lives on Main Street. This time you get two hits. So you narrow it again to everyone from your home town whose last name is Adams, who lives on Main Street, and who lives within three blocks of the town square; this time you get a single hit.

You can easily see what's happened: the more criteria that you cram into a database search request, the less material you get back.

Now this is second nature to a lot of people, but there are also a lot of ChessBase users who are laboring under the misconception that database searches jam-packed full of search criteria should yield more info than sparser searches. That's just not so -- and to understand why, we need to look at Boolean algebra. Relax -- this'll be pretty painless.

Let's go back to our electronic phone book on CD. If we did a search for everyone who lives in your home town OR whose name starts with the letter A, we'd get millions of hits. That's because we're looking for two different sets of data and lumping them together: everyone from your home town OR everyone whose name starts with "A". Sure, there will be some overlap (people from your town whose last names start with "A"), but with an OR search you're going to get people who fit into either category.

Now we'll do a search for people who are from your hometown AND whose last name starts with "A". As described above (in our made-up example), you get 4,000 hits. That's because you're looking for people who fit into both categories: people living in your town AND whose last name begins with "A". An AND search is much more specific than an OR search, because when you do an AND search information has to match all of the categories you include in the search to qualify as a "hit".
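The difference between the two kinds of search can be sketched with Python sets, where OR corresponds to a union and AND to an intersection. This is only an illustration: the tiny phone book, the town names, and the helper `matching` are all invented for the example.

```python
# Hypothetical toy phone book: (last_name, town) pairs invented for illustration.
PHONE_BOOK = [
    ("Adams", "Springfield"),
    ("Avery", "Springfield"),
    ("Baker", "Springfield"),
    ("Allen", "Shelbyville"),
    ("Clark", "Shelbyville"),
    ("Abbott", "Ogdenville"),
]


def matching(predicate):
    """Return the set of entries for which predicate(entry) is true."""
    return {entry for entry in PHONE_BOOK if predicate(entry)}


from_town = matching(lambda e: e[1] == "Springfield")     # 3 entries
starts_with_a = matching(lambda e: e[0].startswith("A"))  # 4 entries

or_hits = from_town | starts_with_a   # union: fits EITHER category
and_hits = from_town & starts_with_a  # intersection: fits BOTH categories

print(len(or_hits))   # 5
print(len(and_hits))  # 2
```

The AND result can never be larger than either of the sets it was built from, while the OR result can never be smaller than either of them.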

While it is possible to do certain kinds of OR searches in the ChessBase search mask (I'm thinking specifically of the "'Or' board" when you do a position search), in general your searches will be AND searches; therefore, the more details you include in the search mask, the fewer games will come up in the search.

Let's look at an example using Mega Database 2003 as our game source and I'll show you how adding more criteria to the search reduces the number of games that the search uncovers.

  • Total number of games in the database: 2,312,072 (the first entry in the database is an introductory text, so I didn't count it as a game)
  • Games played by players named "Kasparov": 2,486
  • Games played by players named "Garry Kasparov": 2,253
  • Games played by Garry Kasparov as Black: 923
  • Games won by Garry Kasparov as Black: 377
  • Games won by Garry Kasparov as Black in the King's Indian Defense: 56
  • Games won by Garry Kasparov as Black in the King's Indian Defense in which both players were rated 2500 or higher: 42
  • Games won by Garry Kasparov as Black in the King's Indian Defense in which both players were rated 2500 or higher, in 35 moves or fewer: 14
  • Games won by Garry Kasparov as Black in the King's Indian Defense in which both players were rated 2500 or higher, in 35 moves or fewer, in which a Black Knight was played to b4 between moves 5 and 30: 2

(And for players keeping score at home, those games were against Kavalek, Bugojno 1982, and Portisch, Linares 1990. Players gifted with prodigious memories will recognize these search criteria: I used similar ones for an older article I wrote on January 24, 1999, in which I searched Mega Database 1999 and got different results for all of the searches except the last one.)

Take a look at the numbers. Each time we added a new element to the search, we got a lower number of hits: the more specific the search, the lower the number of hits the search produces.
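The same shrinking pattern can be reproduced with a rough sketch in Python that applies each criterion on top of the previous ones and records the hit count after every step. The six game records, the `Game` class, and the `narrowing_counts` helper are invented for illustration; they are not taken from Mega Database and do not reflect how ChessBase stores or searches games.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    white: str
    black: str
    result: str  # "1-0", "0-1" or "1/2-1/2"
    eco: str     # ECO code such as "E97"
    moves: int   # game length in moves


# Invented records for illustration only.
GAMES = [
    Game("Player A", "Kasparov, Garry", "0-1", "E97", 31),
    Game("Player B", "Kasparov, Garry", "0-1", "E92", 48),
    Game("Player C", "Kasparov, Garry", "1/2-1/2", "E81", 25),
    Game("Player D", "Kasparov, Garry", "0-1", "B90", 29),
    Game("Kasparov, Garry", "Player E", "1-0", "D85", 40),
    Game("Player F", "Player G", "0-1", "E60", 22),
]

# Each criterion is ANDed with all the ones before it.
CRITERIA = [
    ("Kasparov as Black", lambda g: g.black == "Kasparov, Garry"),
    ("won by Black", lambda g: g.result == "0-1"),
    ("King's Indian (E60-E99)", lambda g: "E60" <= g.eco <= "E99"),
    ("35 moves or fewer", lambda g: g.moves <= 35),
]


def narrowing_counts(games, criteria):
    """Apply the criteria cumulatively and return the hit count after each."""
    counts = [len(games)]
    hits = list(games)
    for _label, test in criteria:
        hits = [g for g in hits if test(g)]
        counts.append(len(hits))
    return counts


counts = narrowing_counts(GAMES, CRITERIA)
print(counts)  # [6, 4, 3, 2, 1]
```

Because every step filters the survivors of the previous step, the list of counts can stay level or fall, but it can never rise.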

My point with all of this is not just to play around with numbers. There's actually a practical application here. If you're doing a search for an opening, start with an ECO code search; this will dredge up the most material. For example, if you search Mega DB 2003 for all King's Indian Defense games (codes E60 through E99), you'll get 156,443 games. If you narrow the search to just E60 through E69, you'll get 53,955 games. If you narrow the search further to games using those openings, but in which White has an isolated d4-pawn between moves 15 and 60, you'll get 2,481 games. Blockade that d4-pawn with a Black Knight on d5 and you'll scare up 492 games. Toss in a Black Knight moving to b4 and you'll knock the number right down to a mere three games.

If your database searches are turning up too many hits (so many that you can't decide which games to view), add some more elements to your search criteria. Conversely, if your searches are yielding too few hits (or no games at all), simplify your search by using fewer elements.

And keep in mind that not every possible opening or chess position will appear in a database, no matter how large it is. I recall one user who searched for a very offbeat variation of the Grob (an opening which is seldom played at the master or GM level) and was astonished that his search yielded no hits. Sorry, but there are some opening variations which just don't get played in top-level tournaments.

As a final tip, please remember that you need to click the search mask's "Reset" button if you've finished one search and want to do another unrelated search. The search mask "remembers" your last set of search criteria until you either click the "Reset" button or exit the program. I recall one user who was absolutely amazed that his copy of a Mega Database "had no Bobby Fischer games"; that is, until we checked his search criteria and saw that he'd previously been searching for Grob games. He had just typed in Fischer's name without first resetting, so the program was searching for all of Fischer's Grob games. Of course, there are none.
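The "remembered criteria" pitfall can be modeled with a small Python sketch. The `SearchMask` class here is a hypothetical toy, not ChessBase's actual implementation, and the three game records are invented; the only point is to show how a leftover criterion silently combines with a new one.

```python
class SearchMask:
    """Toy model of a search mask that keeps its criteria between searches."""

    def __init__(self):
        self.criteria = {}

    def set(self, field, value):
        """Add or overwrite one criterion; older criteria stay in place."""
        self.criteria[field] = value

    def reset(self):
        """Clear every criterion, like clicking a Reset button."""
        self.criteria.clear()

    def search(self, games):
        """Return the games that match ALL of the stored criteria."""
        return [
            g for g in games
            if all(g.get(field) == value for field, value in self.criteria.items())
        ]


# Invented records for illustration only.
GAMES = [
    {"player": "Fischer", "opening": "Sicilian"},
    {"player": "Fischer", "opening": "King's Indian"},
    {"player": "Player X", "opening": "Grob"},
]

mask = SearchMask()
mask.set("opening", "Grob")
grob_hits = mask.search(GAMES)       # one Grob game

mask.set("player", "Fischer")        # forgot to reset first
stale_hits = mask.search(GAMES)      # Fischer AND Grob: nothing

mask.reset()
mask.set("player", "Fischer")
fresh_hits = mask.search(GAMES)      # both Fischer games

print(len(grob_hits), len(stale_hits), len(fresh_hits))  # 1 0 2
```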

Until next week, have fun!


© 2003, Steven A. Lopez. All rights reserved.

