Do Women Play More Beautiful Chess? – A Response to Critics
By Azlan Iqbal, Ph.D.
In response to the original
article, I received a lot of feedback, most of which was negative and
the rest neutral. This was not entirely unexpected given the subject matter.
Regardless, not all the feedback (including the personal attacks) was seemingly
from militant feminists and men who felt as if I had just insulted their
girlfriends or wives. The former may feel that we are still at risk of regressing
to a time when the roles of men and women in society were more clearly defined
and the latter are probably just succumbing to what some might say are protective
or defensive evolutionary instincts (e.g. mother for child, men for ‘defenseless’
women). There is only so much value or credibility one can attach to anonymous
commentators on the Internet. Are they really who they claim to be? One
also has to wonder if they would have responded similarly, or in stark contrast
praised the work, had I found that women played more beautiful
chess (even if just within the scope of mate-in-3 sequences).
Besides, anyone taking single sentences totally out of context and drawing
conclusions from that about the whole body of work or writing a long, angry-sounding
‘rebuttal’ the very next day (especially with admittedly little
or no background in artificial intelligence or computer science) is doing
a poor job of making a cogent argument or providing constructive criticism.
For the record, and to assuage any concerns, I did not doctor any of the
experimental results and ChessBase did not strike some kind of Faustian
bargain with me to exploit women for the sake of a few more mouse clicks.
Our relationship goes back many years and has virtually nothing to do with
women or money. As academics, we are actually used to negative feedback,
especially when it pertains to new or controversial ideas being proposed.
I recall that when I first met Lotfi Zadeh (who introduced the concept of
fuzzy sets) at Cambridge University back in 2008, he said that when he first
proposed some of his ideas, some of his peers said he “should be lynched”
for trying to promote a ‘lack of precision’ in computing. Fortunately,
to my knowledge, none of my peers feel that way about my work.
Anyway, some of the feedback I received nevertheless consisted of genuine
questions and concerns (including about my credentials and credibility)
that I felt I should address in the interest of science; hence this follow-up
article. I will not list the questions individually as many of them overlap.
Instead, I will simply begin to address them collectively based on my interpretation
of what they were trying to get at. Let me begin by saying that the aesthetics
model in chess that I developed for my Ph.D. is considered a seminal piece
of work in then uncharted waters. Therefore a realistic and doable scope
needed to be set, i.e. three-move mate sequences. This is actually how one
goes about doing a Ph.D. You should not bite off more than you can chew
or you will never complete the degree. ‘Aesthetics’ was also
defined as a common ground between the domains of chess problem composition
and real games, since both are not without beauty. The necessary experiments
were performed and the results showed that the computer could indeed apply
the model to recognize aesthetics within the game (given three-movers) in
a way that correlated positively and well with domain-competent human assessment.
This had never been demonstrated before and was a significant contribution
to the pool of knowledge in artificial intelligence. It also had practical
applications, such as allowing the aesthetic analysis of thousands upon
thousands of chess problems and winning move sequences in games that would
be far too difficult for humans to do reliably.
During my study, I worked with many chess experts as well. In order to
understand all this satisfactorily, you will probably have to read
my thesis in its entirety. There are no shortcuts, just as there were
none for me in preparing it. Even though we tend not to read as much as
we used to (we see more pictures and videos now), we are not yet at the
stage where we can depend on a computer to comprehend complex written texts
and answer intelligent questions intelligently thereby saving us a lot of
time and effort. I was awarded my Ph.D. from the University of Malaya which,
when I graduated, was the top university in Malaysia. They also had a policy
that, in addition to my own supervisor, there would be one internal and
two external reviewers, all of whom must be full professors with related
expertise. The external reviewers must also be from overseas, which in my
case happened to be from renowned universities in the UK and Australia.
My thesis was with them for seven months.
Also, unlike at some institutions, all four professors (including my supervisor)
must unanimously agree that the Ph.D. be awarded. It is not unusual or uncouth
for one who has successfully attained their Ph.D. to put the ‘Ph.D.’
title at the end of one’s name in scientific reports or articles,
just as medical doctors would put ‘M.D.’ at the end of theirs.
Chess grandmasters are also known for putting ‘GM’ at the front
of theirs in articles and it is ironic and hypocritical that any of them
should think doing this is cocky or ‘showing off’. Given all
this, for anyone to suggest that the aesthetics model developed lacks credibility
is to declare one’s ignorance or preconceptions about me and the region
I am from. While I have enjoyed visiting the West many times on business
and pleasure, I have never had any intention of actually working or staying
there. Not once at any stage of my education or career have I ever even
applied to do so. I am quite happy living in and serving my own country.
Thanks to the growth of the Internet, we are living in a virtually borderless
world anyway.
The three-mover aesthetics model I developed and tested was furthermore
extended in our
2012 IEEE paper, also with the help of chess experts (who also happen
to be Ph.D. holders), to include not only three-movers but also studies
(and logically, longer mates). The paper, of course, was thoroughly peer-reviewed
and had to be revised before being accepted for publication, so the extended
model is also ‘experimentally validated’. It is true that in
validating the model, the average of three cycles of evaluation by Chesthetica
(as opposed to just one cycle) for each move sequence was tested but using
just one cycle in future experiments is valid as well because, like a human
judge of aesthetics, Chesthetica may or may not deliver exactly the same
evaluation each time it looks at the same sequence (you will have to read
the IEEE paper carefully to learn why this works well). This phenomenon
is of little concern because the program’s consistency and reliability
‘over time’ have already been demonstrated by taking the average
of its evaluations using multiple cycles. It does not imply that multiple
cycles should always be used in the future and that only crisp, unchanging
aesthetic values are acceptable for each sequence. The possibility of slight
variations in an aesthetic evaluation makes the model more dynamic yet still,
on average, consistent and reliable (much like a human judge). As for replication
of experimental results, that depends on the hypothesis. Should the original
hypothesis have stated that a single cycle of evaluation be used, then the
replication of the experiment should use a single cycle as well and the
result accepted, whatever it may be. Analogously, the significance level
(the threshold the p-value is compared against) should also be determined
beforehand and not reset after the experiment to better suit the results
(e.g. changing it from 0.01 to 0.05).
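The relationship between a single evaluation cycle and the average of several cycles can be illustrated with a toy simulation. This is only a sketch under loud assumptions: `noisy_score` is a hypothetical stand-in for one evaluation cycle (a fixed score plus a small random variation), and `TRUE_SCORE` and `NOISE_SD` are invented values. It is not how Chesthetica computes anything.

```python
import random
import statistics

def noisy_score(true_score, noise_sd, rng):
    """Hypothetical stand-in for one evaluation cycle: a score plus small random variation."""
    return true_score + rng.gauss(0.0, noise_sd)

def averaged_score(true_score, noise_sd, cycles, rng):
    """Mean of several independent cycles for the same move sequence."""
    scores = [noisy_score(true_score, noise_sd, rng) for _ in range(cycles)]
    return statistics.fmean(scores)

rng = random.Random(0)
TRUE_SCORE = 2.0  # assumed value, for illustration only
NOISE_SD = 0.1    # assumed size of the cycle-to-cycle variation

# Repeat the "experiment" many times with one cycle and with three cycles.
single = [averaged_score(TRUE_SCORE, NOISE_SD, 1, rng) for _ in range(5000)]
triple = [averaged_score(TRUE_SCORE, NOISE_SD, 3, rng) for _ in range(5000)]

spread_single = statistics.stdev(single)
spread_triple = statistics.stdev(triple)
print(f"mean of single-cycle scores:    {statistics.fmean(single):.3f}")
print(f"mean of three-cycle averages:   {statistics.fmean(triple):.3f}")
print(f"spread of single-cycle scores:  {spread_single:.3f}")
print(f"spread of three-cycle averages: {spread_triple:.3f}")
```

In this toy setting both procedures centre on the same value; the single cycle is simply noisier. Either can serve in an experiment, provided the choice is fixed before the data are examined.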
While I have written many papers, to the extent memory serves, I do not
self-publish at all even though admittedly, some of my publications are
certainly better or more prestigious than others (like any academic). For
example, Britannica invited me to write the entry for “computational
aesthetics” (my PhD field of study) in their encyclopedia, and Springer
recently published our book on the DSNS
approach that my Chesthetica software uses to create original chess problems.
Not to mention many papers published in the ICGA
Journal, a reputable computer games journal with a high standard of
publication. Ken Thompson published there too. I also have papers in high
ranking AI conferences such as the AAAI and IJCAI. I do not ordinarily like
to draw attention to these things but when questioned, I suppose I must
set the record straight. As a side note, it is probably not a good idea
to prepare conference slides at the last minute because typos may show up
and you really cannot tell how seriously some people might take things like
that and use it to draw conclusions about you.
As for ‘impact factor’, academics are well aware of its limitations
and interested readers might care to look those up as well, as
explained here. In short, it is not necessarily a good indicator of
the quality of any particular piece of research work. For instance,
a paper essentially reminding us yet again about the dangers of consuming
too many burgers, fries or sodas could have a high impact factor largely
because it is published in a popular medical journal, because medical science
tends to get the most research funding and because many researchers tend to
study our eating habits (a lot more people than those looking into say,
the computational aesthetics aspect of chess). On the other hand, I write
for ChessBase (with no impact factor) because it is a kind of ‘community
service’. As academics, we are expected to convey our research to
the public in more palatable and widespread forms than just technical papers
which tend to be of rather limited distribution and beyond the layman’s
typical scope of understanding, and in many cases even that of academics from
outside the particular field.
With regard to my chess-playing expertise, I never bothered to obtain an
official chess rating even though I have been playing casually for 30 years
and have won several medals in local tournaments. In fact, I am quite confident
I could last at least 20 moves even against Magnus Carlsen under tournament
conditions. If I had an official Elo rating, my probability of winning that
match (or my ‘expected score’) could indeed be calculated and
would probably be so low one might think I would lose faster than Bill Gates.
I could probably beat him too, by the way. So, yes, I can say that I do
indeed “know how to play” but would not consider myself an ‘official
master’ at the game. The truth is, given my line of work, I simply
do not need to be a chess master as there are many official chess masters
only too happy to assist and work with me on projects. I am frankly quite
amazed at how open-minded and forward-thinking some of them are.
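For readers curious about the ‘expected score’ mentioned above, here is a minimal sketch of the standard Elo formula. The two ratings are hypothetical, chosen only for illustration; they are not anyone's actual ratings.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Hypothetical ratings, for illustration only.
club_player = 1900
world_champion = 2850

e = expected_score(club_player, world_champion)
print(f"expected score per game for the club player: {e:.4f}")
```

Strictly speaking, the expected score counts a win as 1 and a draw as 0.5, so it is an upper bound on the probability of winning a game rather than that probability itself.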
The same can be said about scientists who study say, bodybuilding. They
are not and need not necessarily be renowned bodybuilders themselves (though
they do probably work with a few). Having said all that, I do not think
I am smarter or “more intelligent than everyone else”. It is
not like my IQ is in the 180-range or anything like that. I took a scientifically-accurate
test back in 2003 and it was only 131 with the “unusual distinction
of being equally good at math and verbal skills”. It is unfortunate
that some people interpreted the original article here on ChessBase to be
misogynistic, having “gratuitous sexist content” and claimed that
perhaps I did not even know any women. I was also ‘threatened’
that my academic standing and credibility would be undermined by all this
and that I should think about my future in academia. Untrue on all counts,
I would say. I have known plenty of women in my time. At last count, 52
from 23 different countries, as a matter of fact; and most of them would
only have nice things to say about me, I am fairly confident.
As for academic standing, I am more concerned about scientific truth than
about the effects that revealing it might have on my career. Certainly, not
revealing it (the file-drawer effect) or trying to bury it without a good
enough reason would have a greater effect on society (myself included).
Besides, not all academics are so desperately looking for tenure or its
equivalent and would ‘do or conceal anything’ to get it. Some
of us (though I am not necessarily claiming to be in this group) –
and presumably just like some grandmasters – may also be independently
wealthy and could retire tomorrow if we pleased; never having to work another
day in our lives. So now, after hopefully having set the record straight
on these matters, let us look into some of the other concerns about the
experiments in my
paper that suggested men play more beautiful chess than women.
The first thing one should realize when reading a scientific paper is that
there is probably always a scope specified (e.g. three-move mate sequences).
There ought to be. Scientists do not claim to know everything and
the scope serves as an indicator of the extent to which whatever was
being tested was actually tested or could be tested. This does
not mean nothing useful can be said about the subject matter. For instance,
we may only know how certain parts of the brain function with respect to
certain aspects of human activity, but that does not mean those findings
are useless until and unless neuroscientists know how the whole
brain works with regard to all of human activity. Science is a cumulative
and self-corrective process.
Now, some ‘experts’ may feel their personal or collective intuitions
about certain things trump experimental validation. However, from a scientific
standpoint they are wrong. What you need to trump experimental validation
is more or better experimental validation. ‘Common sense’ is
not a scientific argument and has been known to be wrong or misleading.
One might, for example, be inclined to think that a bowling ball would hit the
ground faster than a feather dropped from the same height in a vacuum chamber.
So if you would like to analyze longer or different types of sequences
in chess using say, some other method, you will first need to develop and
experimentally validate your own aesthetics model for those types of sequences
or all you have is essentially just personal (and quite probably biased)
intuitions. Being a master player does not help you scientifically here.
Let us move on to the perfectly valid question of whether playing strength
correlates with aesthetics. In other words, do stronger players play more
aesthetically? In the original study, that was not taken into account but
the study does contrast the aesthetics of play between two engines, i.e.
Rybka 3 vs. Fritz 8 (10+10) and Rybka 3 vs. Fritz 8 (1+1) scoring, on average,
1.979 and 1.992, respectively. The difference was not statistically
significant. So this would suggest that playing strength is not necessarily
relevant to beauty. However, I did happen to have two older databases with
me with 1,000 randomly selected games that ended in mate between players
with an Elo rating above 2,500 and between players with an Elo rating below
1,500. The games were sourced from Big Database 2011 and gender was irrelevant
here, even though most were likely games between men, especially given the
first set (so perhaps the result does not even apply to games between women).
The average aesthetics scores (using the same statistical approaches as
described in the original paper) were 1.815 and 1.693, respectively, and
the difference was indeed statistically significant. So this suggests further
that playing strength is relevant in the aesthetics of three-move mating
sequences that result from play between humans.
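To make concrete what comparing two average scores for statistical significance involves, here is a minimal sketch. The statistical procedure of the original paper is not restated in this article, so the Welch-style statistic with a normal approximation used below is an assumption made for illustration (it is reasonable for samples of about 1,000). The scores are synthetic, generated from made-up means and spread; they are not the study's data.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance

def welch_z_test(sample_a, sample_b):
    """Two-sided test for a difference in means (Welch statistic, normal approximation)."""
    mean_a, mean_b = fmean(sample_a), fmean(sample_b)
    # Standard error of the difference, allowing unequal variances.
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    z = (mean_a - mean_b) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return mean_a - mean_b, z, p_value

ALPHA = 0.01  # significance level, fixed before looking at the data

# Synthetic scores, NOT the study's data: the means and spread are invented.
rng = random.Random(42)
group_high = [rng.gauss(1.80, 0.60) for _ in range(1000)]
group_low = [rng.gauss(1.70, 0.60) for _ in range(1000)]

diff, z, p = welch_z_test(group_high, group_low)
print(f"difference in means: {diff:.3f}, z = {z:.2f}, p = {p:.4f}")
print("significant" if p < ALPHA else "not significant", f"at alpha = {ALPHA}")
```

The point of the sketch is only that a difference between two averages is judged against the variation within each sample and against a threshold chosen in advance.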
What are the implications of this? Should playing strength have been taken
into account in the original study so that only games between women within
the same Elo range as the games between men were used? Perhaps it should
but unfortunately, this was not possible without tampering with the selection
process which is supposed to be random because there is no automatic (and
unbiased) way to search for players based on their gender and there were
simply not enough games between women in the database ending with ‘exclusivity’
(read the original paper to learn what this means) and mate that were also
within any particular Elo range. Of course, most games between strong players
do not even end in mate (they tend to resign) but again, the aesthetics
of games like that are at present not scientifically testable. Besides,
in comparing samples of the same kind (i.e. three-move mate sequences) drawn
from a normal population (i.e. whatever was randomly obtainable from the
database) the differences between men and women are still valid (within
that scope, obviously).
The original study minimized introducing any kind of bias into the samples
(of both men and women) by assuming that whatever was in the 6+ million
game database used was an unbiased representation of games played throughout
the world by both men and women. If it happens that there was a greater
number of strong male players than female players in that database and therefore
the samples of each used also reflected that and thus the games between
females would necessarily score lower aesthetically... well, that raises the
question, why, in a normal population, are there more games by stronger
male players to begin with? This is not something that can be ‘adjusted
for’ without introducing bias into the samples. If I were to select
only specific, strong female players to compare against specific, strong
male players... that would introduce so much bias I would have to justify
how and why each of those players was chosen. It does not reflect what
is typically found in the real-world population of players and what can
realistically be selected at random from that. Now, imagine the
additional biases introduced if arbitrary, ‘manual’ filters
based on age were also applied.
Similarly, if I were to test a mixture of longer mates and study-like endings
(which Chesthetica can also analyze aesthetically now) along with three-movers,
arguments could be made that one sample had more of one type of mate or
ending than the other sample and that affected the outcome because
experiments also show that studies score, on average, higher aesthetically
using the model than mates. Never mind yet the issue of deciding how far
back one needs to go from the ending of a game to determine where the ‘study’
starts and how that decision was made for each game (talk
about introducing bias!). This is why a doable, testable scope and consistency
in experimentation are paramount. Otherwise, the conclusions and
implications of the research only become more tenuous. So in summary, the original
study assumed that the database used had an ‘as-fair-as-can-get-without-introducing-bias’
distribution of games between men and between women and compensated further
for bias by using the average aesthetics score.
The two games shown in the original ChessBase article, for instance, should
therefore not be seen as comparing apples and oranges but rather what the
aesthetics model thinks of the sequences themselves, independent
of who the players are or the conditions under which those moves were made
(something humans might find very difficult to ignore). Additionally, by
themselves, these two sequences are not ‘proof’ of anything
and were never intended to be. The samples of 1,069 games each that were
used surely also contained some games between women that were of higher
quality than some of the games between men. This is the beauty of random
selection and the bell curve. Hence the necessity for comparing only averages
and not drawing grand conclusions from individual games or sequences. More
games, I suppose, could have been used (e.g. by artificially flipping the
colors where Black mates and treating the position as White mates in games
that never actually occurred in that form) but again, this would have introduced
bias, especially if playing with the white or black pieces influences the
way people play at all. So since both samples featured only White wins (like
the standard for most chess problems), comparisons between them are technically
still valid. Besides, the randomly-selected 1,069 games in each sample were
considered a sufficient number for experimental purposes.
What, then, about games between men and women or games between higher and
lower rated players? How does the aesthetics analysis account for these?
In the original study, games between men and women were scarcer still and
virtually impossible to obtain automatically and randomly, so that is why
they were not used. As was pointed out to me, it is also the case that there
are ‘women only’ tournaments but no ‘men only’ tournaments.
This is a strange (perhaps even sexist) double standard that automatically excludes
men (even low Elo ones) from some tournaments but does not exclude women
from any. So this would further explain the aforementioned scarcity. As
for higher rated players versus lower rated players, this was considered
to introduce more variability (read as ‘lack of consistency’)
into the samples compared to using players of about the same rating. For
instance, if the mate occurred as a result of a 2,500 Elo player defeating
a 1,600 Elo player (I am guessing such games rarely take place to begin
with), the larger gap in rating points (i.e. 900) would inherently introduce
more things to be accounted for than if the difference was only, say, 150
Elo points.
There is also no evidence that a large Elo gap necessarily permits the
stronger player to play more beautiful chess but it is certainly something
I could test in future experiments given sufficient data. I do not know
if there were necessarily more games like this in the female sample used
in the original study but trying to find out and then arbitrarily deciding
which ones to include and which ones to reject (and then doing the same
for the male sample) would, once again, introduce more bias than the source
database itself yielded automatically and with no interference from me.
How about the argument that ‘forced’ three-move mate sequences
undermine creativity and aesthetics? Well, in previous research work (Aesthetics
in Mate-in-3 Combinations: Part II: Normality, ICGA Journal, December 2010),
I have shown that forced mates, on average, are actually no different aesthetically,
according to the experimentally-validated model and in the case of games
between human players, than those that are not forced.
A human player or composer may be influenced to think somewhat less of
a sequence that is not forced upon doing some deeper analysis on the position,
however. This is why forced sequences are typically considered more beautiful
and preferred in experiments because eventually, humans are going to perceive
them. Again, as long as both samples are similar in the sense of being forced
(or unforced) mates, the comparisons between them are more credible than
say, if one sample was forced and the other was not. Now, do not get me
wrong. Overall there are probably several dozen if not hundreds of different
variations or permutations of the original study that could also have been
done by filtering this out and compensating for that in order to test specifically
for this with regard to that, but those, precisely, are other experiments
with different scopes and different sets of constraints
and limitations. I really do hope there are people who can find the funding
and time to do them all; armchair commentators included. I would certainly
be interested to read about the results and happy they have contributed
to the literature on the subject, in however small a way.
Finally, the conclusions of the original study are actually supported by
the fact that in the world of chess problem composition (typically having
the highest aesthetics scores, even according to my model), the best compositions
(if not just about all of them) are by men. It could be that the
male ‘patriarchy’ of the composition world is secretly dismissing
some of the most fantastic compositions ever composed simply because they
are submitted by women, or it could also be that women, in general, are
less interested in the aesthetics of chess for reasons that neuroscientists
might be curious about (assuming learning more about the physiological differences
between male and female brains and their implications is not yet considered
forbidden research). I will leave it to readers to decide for themselves
which explanation is more likely. I have no vested interests in the outcome
and am more interested in the truth. I doubt women are so feeble-minded
and lacking confidence as to be discouraged from chess by findings such
as these and even if they turn out to be true, it is not necessarily something
that cannot be compensated for with the right tutelage from a man (or woman)
with greater skill. If anything, I hope the original study motivates even
more women into playing the game and into the world of chess problem composition
to prove they are indeed equal or even superior to men in this regard as
well.
Wrapping up, let me also congratulate Google’s DeepMind on AlphaGo’s
victory over humanity’s Go champion, Lee Sedol. I had absolutely no
doubt this would happen and can only wonder why it took so long to achieve.
By the way, Google, if you happen to have one of your quantum computers
just lying around doing nothing, I would love to plug Chesthetica into it
for a while for some serious computational creativity DSNS processing if
the two are compatible. Just kidding (well, not really). Anyway, good show
and respect to all the industry big boys out there breaking new ground and
taking board game AI seriously.
This article was first uploaded
to ResearchGate and you may contact Dr Azlan Iqbal via e-mail with any
further questions or concerns you may have, at his official
email address with a c.c. to his private
address. Yes, he does reply to the best of his ability.
Previous ChessBase articles by Prof. Azlan Iqbal
- 2/26/2016 – Do women play more beautiful chess?
Azlan Iqbal, senior lecturer at the Universiti Tenaga Nasional in Malaysia,
has been working for years in the field of Artificial Intelligence,
trying to program machines to evaluate aesthetics. After making the
Chesthetica software that is able to create an unlimited number of problem-like
chess constructs he has turned his attention to gender-based playing
style. Here are first results.
- 2/24/2016 – Azlan Iqbal: Recomposition contest result
Over Christmas we showed you an interesting problem: say you have found
some moves somewhere, in coordinate notation without piece names –
is it possible to reconstruct the original supposedly meaningful position
to which they apply? The author, who has a Ph.D. in artificial intelligence,
tried to do it, but with modest success. A reader presented a more plausible
solution and won a valuable prize.
- 12/29/2015 – ChessBase Christmas Puzzles 2015 (5)
Here's an interesting problem: say you have found some moves somewhere,
in coordinate notation without piece names – e.g. 1.h7g5 d8g5
2.b5d5 d1c2 etc. Can one reconstruct the original supposedly meaningful
position to which they apply? Azlan Iqbal, who has a Ph.D. in artificial
intelligence, retraces his thought processes when he tried, in this
unique exercise in forensic chess. Help him and you can win a special
prize.
- 5/31/2015 – Celebrating 300 machine generated problems
As we reported before, Chesthetica, a program by Azlan Iqbal, is autonomously
generating mate in three problems by the hundreds, and the author is
posting his selections in a very pleasing format on YouTube. The technology
behind the program’s creativity is a new AI approach and Dr. Iqbal
is looking for a substantial research grant for applications in other
fields.
- 4/7/2015 – Switch-Side Chain-Chess Revisited
The search continues for a chess variant which retains the flavour of
the original game but does not succumb to the brute calculating power
of modern computers. AI researcher Azlan Iqbal has proposed his own unique
variant. Now he provides some test games and shows how Carlsen could have
won (instead of lost) WCCh Game 3 against Anand in Sochi had Switch-Side
rules applied.
- 2/6/2015 – Computer generated chess problems for everyone
Now they are composing problems that fulfil basic aesthetic criteria!
Chesthetica, a program written by Azlan Iqbal, is churning out mate
in three constructs by the hundreds, and the author is posting them
in a very pleasing format on YouTube. How long will Chesthetica theoretically
be able to generate new three-movers? Quite possibly for tens of thousands
of years.
- 11/7/2014 – A machine that composes chess problems
Chess problems are an art – positions and solutions, pleasing
to the mind and satisfying high aesthetic standards. Only humans can
compose real chess problems; computers will never understand true beauty.
Really? Dr Azlan Iqbal, an expert on automatic aesthetic evaluation,
imbued his software with enough creativity to generate problems indefinitely.
The results are quite startling.
- 7/26/2014 – Best ‘Chess Constructs’ by ChessBase readers
Chess constructs are basically an intermediate form of composition or
chess problem, lying somewhere between brilliancies from chess history
– and artistic chess problems, between real game sequences and
traditional award-winning compositions. A month ago Dr Azlan Iqbal explained
the concept and asked our readers to submit compositions of their own. Here
are the winners.
- 6/29/2014 – Azlan Iqbal: Introducing ‘Chess Constructs’
People love brilliancies from chess history – and artistic chess
problems. But there is a big gap between the two. Positions from games
demonstrate the natural beauty of actual play, while chess problems
are highly technical, with little practical relevance. The author of
this interesting article suggests an intermediate form, one you can try
your hand at – and win a prize in the process.
- 9/2/2009 – Can computers be made to appreciate beauty?
Or at least to identify and retrieve positions that human beings consider
beautiful? While computers may be able to play at top GM level, they
are not able to tell a beautiful combination from a bland one. This
has left a research gap which Dr Mohammed Azlan Mohamed Iqbal, working
at Universiti Tenaga Nasional, Malaysia, has tried to close. Here's
his delightfully interesting PhD thesis.
- 12/15/2012 – A computer program to identify beauty in problems and studies
Computers today can play chess at the grandmaster level, but cannot
tell a beautiful combination from a bland one. In this research, which
has been on-going for seven years, the authors of this remarkable article
show that a computer can indeed be programmed to recognize and evaluate
beauty or aesthetics, at least in three-move mate problems and more
recently endgame studies. Fascinating.
- 2/2/2014 – A new, challenging chess variant
Ever since desktop computers became able to play chess at its highest levels and beat
practically all humans, the interest of the Artificial Intelligence
community in this game has been sagging. That concerns Dr Azlan Iqbal,
a senior lecturer with a PhD in AI, who has created a variant of the
game that is designed to rekindle the interest of computer scientists
– and be enjoyable to humans as well: Switch-Side Chain-Chess.
- 5/11/2014 – Kasparov in Malaysia
He was mobbed, but in a good way: a large number of chess fans and autograph
hunters sought close contact to the legendary World Champion, who officiated
the opening of the PMB National Age Group Championship 2014, and took
time to discuss a variety of topics with an expert on aesthetics-recognition
technology in chess, our author Dr Azlan Iqbal – who sent us a big
pictorial report.