KnightCap: A chess program that learns by combining TD(λ) with game-tree search

           Jonathan Baxter                           Andrew Tridgell                           Lex Weaver
   Department of Systems Engineering          Department of Computer Science         Department of Computer Science
     Australian National University            Australian National University         Australian National University
       Canberra 0200, Australia                  Canberra 0200, Australia               Canberra 0200, Australia
        Jon.Baxter@anu.edu.au                  Andrew.Tridgell@anu.edu.au               Lex.Weaver@anu.edu.au

                                 Abstract

In this paper we present TDLeaf(λ), a variation on the TD(λ) algorithm that
enables it to be used in conjunction with game-tree search. We present some
experiments in which our chess program "KnightCap" used TDLeaf(λ) to learn its
evaluation function while playing on the Free Internet Chess Server (FICS,
fics.onenet.net). The main success we report is that KnightCap improved from a
1650 rating to a 2150 rating in just 308 games and 3 days of play. As a
reference, a rating of 1650 corresponds to about level B human play (on a scale
from E (1000) to A (1800)), while 2150 is human master level. We discuss some of
the reasons for this success, principal among them being the use of on-line
play rather than self-play.

1 Introduction

Temporal Difference learning, first introduced by Samuel [5] and later extended
and formalized by Sutton [7] in his TD(λ) algorithm, is an elegant technique
for approximating the expected long-term future cost (or cost-to-go) of a
stochastic dynamical system as a function of the current state. The mapping
from states to future cost is implemented by a parameterized function
approximator such as a neural network. The parameters are updated online after
each state transition, or possibly in batch updates after several state
transitions. The goal of the algorithm is to improve the cost estimates as the
number of observed state transitions and associated costs increases.

Perhaps the most remarkable success of TD(λ) is Tesauro's TD-Gammon, a neural
network backgammon player that was trained from scratch using TD(λ) and
simulated self-play. TD-Gammon is competitive with the best human backgammon
players [9]. In TD-Gammon the neural network played a dual role, both as a
predictor of the expected cost-to-go of the position and as a means to select
moves. In any position the next move was chosen greedily by evaluating all
positions reachable from the current state, and then selecting the move leading
to the position with smallest expected cost. The parameters of the neural
network were updated according to the TD(λ) algorithm after each game.

Although the results with backgammon are quite striking, there is lingering
disappointment that despite several attempts, they have not been repeated for
other board games such as othello, Go and the "drosophila of AI" — chess
[10, 12, 6].

Many authors have discussed the peculiarities of backgammon that make it
particularly suitable for Temporal Difference learning with self-play
[8, 6, 4]. Principal among these are speed of play: TD-Gammon learnt from
several hundred thousand games of self-play; representation smoothness: the
evaluation of a backgammon position is a reasonably smooth function of the
position (viewed, say, as a vector of piece counts), making it easier to find a
good neural network approximation; and stochasticity: backgammon is a random
game which forces at least a minimal amount of exploration of the search space.

As TD-Gammon in its original form only searched one ply ahead, we feel this
list should be appended with: shallow search is good enough against humans.
There are two possible reasons for this; either one does not gain a lot by
searching deeper in backgammon (questionable given that recent versions of
TD-Gammon search to three ply and this significantly improves their
performance), or humans are simply incapable of searching deeply and so
TD-Gammon is only competing in a pool of shallow searchers. Although we know of
no psychological studies investigating the depth to which humans search in
backgammon, it is plausible that the combination of high branching factor and
random move generation makes it quite difficult to search more than one or two
ply ahead. In particular, random move generation effectively prevents selective
search or "forward pruning" because it enforces a lower bound on the branching
factor at each move.
In contrast, finding a representation for chess, othello or Go which allows a
small neural network to order moves at one ply with near-human performance is a
far more difficult task. It seems that for these games, reliable tactical
evaluation is difficult to achieve without deep lookahead. As deep lookahead
invariably involves some kind of minimax search, which in turn requires an
exponential increase in the number of positions evaluated as the search depth
increases, the computational cost of the evaluation function has to be low,
ruling out the use of expensive evaluation functions such as neural networks.
Consequently most chess and othello programs use linear evaluation functions
(the branching factor in Go makes minimax search to any significant depth
nearly infeasible).

In this paper we introduce TDLeaf(λ), a variation on the TD(λ) algorithm that
can be used to learn an evaluation function for use in deep minimax search.
TDLeaf(λ) is identical to TD(λ) except that instead of operating on the
positions that occur during the game, it operates on the leaf nodes of the
principal variation of a minimax search from each position (also known as the
principal leaves).

To test the effectiveness of TDLeaf(λ), we incorporated it into our own chess
program—KnightCap. KnightCap has a particularly rich board representation
enabling relatively fast computation of sophisticated positional features,
although this is achieved at some cost in speed (KnightCap is about 10 times
slower than Crafty—the best public-domain chess program—and 6,000 times slower
than Deep Blue). We trained KnightCap's linear evaluation function using
TDLeaf(λ) by playing it on the Free Internet Chess Server (FICS,
fics.onenet.net) and on the Internet Chess Club (ICC, chessclub.com). Internet
play was used to avoid the premature convergence difficulties associated with
self-play¹. The main success story we report is that starting from an
evaluation function in which all coefficients were set to zero except the
values of the pieces, KnightCap went from a 1650-rated player to a 2150-rated
player in just three days and 308 games. KnightCap is an ongoing project with
new features being added to its evaluation function all the time. We use
TDLeaf(λ) and Internet play to tune the coefficients of these features.

¹ Randomizing move choice is another way of avoiding problems associated with
self-play (this approach has been tried in Go [6]), but the advantage of the
Internet is that more information is provided by the opponent's play.

The remainder of this paper is organized as follows. In section 2 we describe
the TD(λ) algorithm as it applies to games. The TDLeaf(λ) algorithm is
described in section 3. Experimental results for internet play with KnightCap
are given in section 4. Section 5 contains some discussion and concluding
remarks.

2 The TD(λ) algorithm applied to games

In this section we describe the TD(λ) algorithm as it applies to playing board
games. We discuss the algorithm from the point of view of an agent playing the
game.

Let S denote the set of all possible board positions in the game. Play proceeds
in a series of moves at discrete time steps t = 1, 2, .... At time t the agent
finds itself in some position x_t ∈ S, and has available a set of moves, or
actions, A_{x_t} (the legal moves in position x_t). The agent chooses an action
a ∈ A_{x_t} and makes a transition to state x_{t+1} with probability
p(x_t, x_{t+1}, a). Here x_{t+1} is the position of the board after the agent's
move and the opponent's response. When the game is over, the agent receives a
scalar reward, typically "1" for a win, "0" for a draw and "-1" for a loss.

For ease of notation we will assume all games have a fixed length of N (this is
not essential). Let r(x_N) denote the reward received at the end of the game.
If we assume that the agent chooses its actions according to some function a(x)
of the current state x (so that a(x) ∈ A_x), the expected reward from each
state x ∈ S is given by

    J*(x) := E_{x_N | x} r(x_N),                                         (1)

where the expectation is with respect to the transition probabilities
p(x_t, x_{t+1}, a(x_t)) and possibly also with respect to the actions a(x_t) if
the agent chooses its actions stochastically.

For very large state spaces it is not possible to store the value of J*(x) for
every x ∈ S, so instead we might try to approximate J* using a parameterized
function class J̃ : S × R^k → R, for example linear functions, splines, neural
networks, etc. J̃(·, w) is assumed to be a differentiable function of its
parameters w = (w_1, ..., w_k). The aim is to find a parameter vector w ∈ R^k
that minimizes some measure of error between the approximation J̃(·, w) and
J*(·). The TD(λ) algorithm, which we describe now, is designed to do exactly
that.

Suppose x_1, ..., x_{N-1}, x_N is a sequence of states in one game. For a given
parameter vector w, define the temporal difference associated with the
transition x_t → x_{t+1} by

    d_t := J̃(x_{t+1}, w) − J̃(x_t, w).                                   (2)
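To make the notation concrete, the short sketch below (our illustration, not
code from the paper) implements a linear member of the class J̃ and the
temporal differences of equation (2); the feature vectors in the usage example
are hypothetical stand-ins for a real board representation.

```python
import numpy as np

def linear_eval(features, w):
    # J~(x, w) = w . phi(x): a linear member of the parameterized class J~.
    return float(np.dot(w, features))

def temporal_differences(position_features, w):
    # d_t = J~(x_{t+1}, w) - J~(x_t, w)   (equation (2))
    evals = [linear_eval(phi, w) for phi in position_features]
    return [evals[t + 1] - evals[t] for t in range(len(evals) - 1)]

# Hypothetical usage: three positions described by four features each.
w = np.array([1.0, 3.0, 0.0, 0.0])
phis = [np.array([1.0, 0.0, 0.2, 0.0]),
        np.array([1.0, 1.0, 0.1, 0.0]),
        np.array([0.0, 1.0, 0.0, 0.3])]
print(temporal_differences(phis, w))   # -> [3.0, -1.0]
```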
Note that d_t measures the difference between the reward predicted by J̃(·, w)
at time t+1 and the reward predicted by J̃(·, w) at time t. The true evaluation
function J* has the property

    E_{x_{t+1}|x_t} [ J*(x_{t+1}) − J*(x_t) ] = 0,

so if J̃(·, w) is a good approximation to J*, E_{x_{t+1}|x_t} d_t should be
close to zero. For ease of notation we will assume that J̃(x_N, w) = r(x_N)
always, so that the final temporal difference satisfies

    d_{N−1} = J̃(x_N, w) − J̃(x_{N−1}, w) = r(x_N) − J̃(x_{N−1}, w).

That is, d_{N−1} is the difference between the true outcome of the game and the
prediction at the penultimate move.

At the end of the game, the TD(λ) algorithm updates the parameter vector w
according to the formula

    w := w + α ∑_{t=1}^{N−1} ∇J̃(x_t, w) [ ∑_{j=t}^{N−1} λ^{j−t} d_j ],   (3)

where α is a positive learning rate and λ ∈ [0, 1] controls the extent to which
temporal differences further in the future contribute to the update of the
prediction at time t.

Successive parameter updates according to the TD(λ) algorithm should, over
time, lead to improved predictions of the expected reward J̃(·, w). Provided
the actions a(x) are independent of the parameter vector w, it can be shown
that for linear J̃(·, w) the TD(λ) algorithm converges to a near-optimal
parameter vector [11]. Unfortunately, there is no such guarantee if J̃(·, w) is
non-linear [11], or if a(x) depends on w [2].
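As a rough illustration of equation (3), the following sketch (again ours, not
KnightCap's code) performs one end-of-game TD(λ) update for a linear evaluation
function, where ∇J̃(x_t, w) is simply the feature vector of x_t; indices are
0-based rather than the 1-based indices used above.

```python
import numpy as np

def td_lambda_update(position_features, outcome, w, alpha=1.0, lam=0.7):
    """One end-of-game TD(lambda) update, equation (3), for a linear J~.

    position_features: feature vectors phi(x_1), ..., phi(x_N) (numpy arrays)
    outcome: scalar reward r(x_N); by convention J~(x_N, w) := r(x_N)
    w: numpy parameter vector; for linear J~(x, w) = w . phi(x) the
       gradient at x_t is just phi(x_t)."""
    evals = [float(np.dot(w, phi)) for phi in position_features]
    evals[-1] = outcome
    n = len(evals)
    d = [evals[t + 1] - evals[t] for t in range(n - 1)]   # temporal differences
    w_new = w.copy()
    for t in range(n - 1):
        # credit assigned to the prediction at time t (inner bracket of eq. (3))
        credit = sum(lam ** (j - t) * d[j] for j in range(t, n - 1))
        w_new += alpha * position_features[t] * credit
    return w_new
```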
3 Minimax Search and TD(λ)

For argument's sake, assume any action a taken in state x leads to a
predetermined state which we will denote by x^a. Once an approximation J̃(·, w)
to J* has been found, we can use it to choose actions in state x by picking the
action a ∈ A_x whose successor state x^a minimizes the opponent's expected
reward:

    a*(x) := argmin_{a ∈ A_x} J̃(x^a, w).                                (6)
Figure 1: Full breadth, 3-ply search tree illustrating the minimax rule for
propagating values. Each of the leaf nodes (H–O) is given a score by the
evaluation function, J̃(·, w). These scores are then propagated back up the
tree by assigning to each opponent's internal node the minimum of its
children's values, and to each of our internal nodes the maximum of its
children's values. The principal variation is then the sequence of best moves
for either side starting from the root node, and this is illustrated by a
dashed line in the figure. Note that the score at the root node A is the
evaluation of the leaf node (L) of the principal variation. As there are no
ties between any siblings, the derivative of A's score with respect to the
parameters w is just ∇J̃(L, w).

Figure 2: A search tree with a non-unique principal variation (PV). In this
case the derivative of the root node A with respect to the parameters of the
leaf-node evaluation function is multi-valued, either ∇J̃(H, w) or ∇J̃(L, w).
Except for transpositions (in which case H and L are identical and the
derivative is single-valued anyway), such "collisions" are likely to be
extremely rare, so in TDLeaf(λ) we ignore them by choosing a leaf node
arbitrarily from the available candidates.

For games such as chess, one-ply lookahead is not sufficient for reliable play,
so position evaluation is combined with deep minimax search. Let J̃_d(x, w)
denote the evaluation obtained for state x by applying J̃(·, w) to the leaf
nodes of a minimax search to depth d from x (figure 1 illustrates how the leaf
values are propagated back to the root). One way to learn a good parameter
vector in this setting is to apply the TD(λ) algorithm to J̃_d(x, w): for the
root positions x_1, ..., x_N in a game we define the temporal differences

    d_t := J̃_d(x_{t+1}, w) − J̃_d(x_t, w)                                (7)

as per equation (2), and then the TD(λ) algorithm (3) for updating the
parameter vector w becomes

    w := w + α ∑_{t=1}^{N−1} ∇J̃_d(x_t, w) [ ∑_{j=t}^{N−1} λ^{j−t} d_j ].  (8)

One problem with equation (8) is that for d > 1, J̃_d(x, w) is not necessarily
a differentiable function of w for all values of w, even if J̃(·, w) is
everywhere differentiable. This is because for some values of w there will be
"ties" in the minimax search, i.e. there will be more than one best move
available in some of the positions along the principal variation, which means
that the principal variation will not be unique (see figure 2). Thus, the
evaluation assigned to the root node, J̃_d(x, w), will be the evaluation of any
one of a number of leaf nodes.

Fortunately, under some mild technical assumptions on the behavior of
J̃(x, w), it can be shown that for each state x, the set of w ∈ R^k for which
J̃_d(x, w) is not differentiable has Lebesgue measure zero. Thus for all states
x and for "almost all" w ∈ R^k, J̃_d(x, w) is a differentiable function of w.
Note that J̃_d(x, w) is also a continuous function of w whenever J̃(x, w) is a
continuous function of w. This implies that even for the "bad" pairs (x, w),
∇J̃_d(x, w) is only undefined because it is multi-valued. Thus we can still
arbitrarily choose a particular value for ∇J̃_d(x, w) if w happens to land on
one of the bad points.

Based on these observations we modified the TD(λ) algorithm to take account of
minimax search in an almost trivial way: instead of working with the root
positions x_1, ..., x_N, the TD(λ) algorithm is applied to the leaf positions
found by minimax search from the root positions. We call this algorithm
TDLeaf(λ). Full details are given in figure 3.
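TDLeaf(λ) needs, for each root position, the leaf node of the principal
variation, so that J̃_d(x, w) = J̃(x^l, w) and the gradient can be taken at the
leaf. The sketch below is a minimal plain-minimax illustration of that idea,
not KnightCap's actual search (which uses MTD(f) with α–β pruning); the
game-interface functions `evaluate`, `legal_moves` and `make_move` are
hypothetical.

```python
def minimax_leaf(position, depth, our_turn, evaluate, legal_moves, make_move):
    """Depth-limited minimax that returns both the backed-up value
    J~_d(position, w) and the principal-variation leaf it came from.
    Ties are broken by keeping the first best move found, which is one way
    of 'choosing a leaf arbitrarily' when the PV is not unique."""
    moves = legal_moves(position)
    if depth == 0 or not moves:
        return evaluate(position), position      # leaf: score it directly
    best_val, best_leaf = None, None
    for m in moves:
        child = make_move(position, m)
        val, leaf = minimax_leaf(child, depth - 1, not our_turn,
                                 evaluate, legal_moves, make_move)
        if best_val is None or (our_turn and val > best_val) \
                or (not our_turn and val < best_val):
            best_val, best_leaf = val, leaf
    return best_val, best_leaf
```

The backed-up value can also drive move selection in the spirit of equation
(6), while the returned leaf is where TDLeaf(λ) takes its gradient.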
4 TDLeaf(λ) and Chess

In this section we describe the outcome of several experiments in which the
TDLeaf(λ) algorithm was used to train the weights of a linear evaluation
function in our chess program "KnightCap". KnightCap is a reasonably
sophisticated computer chess program for Unix systems. It has all the standard
algorithmic features that modern chess programs tend to have, as well as a
number of features that are much less common. For more details on KnightCap,
including the source code, see wwwsyseng.anu.edu.au/lsg.
Let J̃(·, w) be a class of evaluation functions parameterized by w ∈ R^k. Let
x_1, ..., x_N be N positions that occurred during the course of a game, with
r(x_N) the outcome of the game. For notational convenience set
J̃(x_N, w) := r(x_N).

 1. For each state x_i, compute J̃_d(x_i, w) by performing minimax search to
    depth d from x_i and using J̃(·, w) to score the leaf nodes. Note that d
    may vary from position to position.

 2. Let x_i^l denote the leaf node of the principal variation starting at x_i.
    If there is more than one principal variation, choose a leaf node from the
    available candidates at random. Note that

        J̃_d(x_i, w) = J̃(x_i^l, w).                                      (9)

 3. For t = 1, ..., N−1, compute the temporal differences:

        d_t := J̃(x_{t+1}^l, w) − J̃(x_t^l, w).                          (10)

 4. Update w according to the TDLeaf(λ) formula:

        w := w + α ∑_{t=1}^{N−1} ∇J̃(x_t^l, w) [ ∑_{j=t}^{N−1} λ^{j−t} d_j ].  (11)

                      Figure 3: The TDLeaf(λ) algorithm
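Putting the pieces together, a minimal sketch of the update in figure 3 for a
linear evaluation function might look as follows; `phi` and `pv_leaf` are
hypothetical helpers (the latter could return the leaf from the minimax sketch
above), and this is an illustration rather than KnightCap's implementation.

```python
import numpy as np

def tdleaf_update(root_positions, outcome, w, phi, pv_leaf, alpha=1.0, lam=0.7):
    """One end-of-game TDLeaf(lambda) update, equation (11), for linear J~.

    root_positions: the positions x_1, ..., x_N that occurred in the game
    outcome: the reward r(x_N)
    phi(x): feature vector of position x, so J~(x, w) = w . phi(x)
    pv_leaf(x, w): principal-variation leaf of a minimax search from x
        (steps 1-2 of figure 3)."""
    leaves = [pv_leaf(x, w) for x in root_positions]
    evals = [float(np.dot(w, phi(l))) for l in leaves]
    evals[-1] = outcome                               # J~(x_N, w) := r(x_N)
    n = len(evals)
    d = [evals[t + 1] - evals[t] for t in range(n - 1)]       # eq. (10)
    w_new = w.copy()
    for t in range(n - 1):
        credit = sum(lam ** (j - t) * d[j] for j in range(t, n - 1))
        w_new += alpha * phi(leaves[t]) * credit              # eq. (11)
    return w_new
```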

4.1 Experiments with KnightCap

In our main experiment we took KnightCap's evaluation function and set all but
the material parameters to zero. The material parameters were initialized to
the standard "computer" values: 1 for a pawn, 4 for a knight, 4 for a bishop, 6
for a rook and 12 for a queen. With these parameter settings KnightCap (under
the pseudonym "WimpKnight") was started on the Free Internet Chess server
(FICS, fics.onenet.net) against both human and computer opponents. We played
KnightCap for 25 games without modifying its evaluation function so as to get a
reasonable idea of its rating. After 25 games it had a blitz (fast time
control) rating of 1650 ± 50³, which put it at about B-grade human performance
(on a scale from E (1000) to A (1800)), although of course the kind of game
KnightCap plays with just material parameters set is very different to human
play of the same level (KnightCap makes no short-term tactical errors but is
positionally completely ignorant). We then turned on the TDLeaf(λ) learning
algorithm, with λ = 0.7 and the learning rate α = 1.0. The value of λ was
chosen heuristically, based on the typical delay in moves before an error takes
effect, while α was set high enough to ensure rapid modification of the
parameters. A couple of minor modifications to the algorithm were made (see the
sketch following this list):

  - The raw (linear) leaf node evaluations J̃(x_i^l, w) were converted to a
    score between −1 and 1 by computing

        v_i^l := tanh[β J̃(x_i^l, w)].

    This ensured small fluctuations in the relative values of leaf nodes did
    not produce large temporal differences (the values v_i^l were used in place
    of J̃(x_i^l, w) in the TDLeaf(λ) calculations). The outcome of the game
    r(x_N) was set to 1 for a win, −1 for a loss and 0 for a draw. β was set to
    ensure that a value of tanh[β J̃(x_i^l, w)] = 0.25 was equivalent to a
    material superiority of 1 pawn (initially).

  - The temporal differences, d_t = v_{t+1}^l − v_t^l, were modified in the
    following way. Negative values of d_t were left unchanged, as any decrease
    in the evaluation from one position to the next can be viewed as a mistake.
    However, positive values of d_t can occur simply because the opponent has
    made a blunder. To avoid KnightCap trying to learn to predict its
    opponent's blunders, we set all positive temporal differences to zero
    unless KnightCap predicted the opponent's move⁴.

  - The value of a pawn was kept fixed at its initial value so as to allow easy
    interpretation of weight values as multiples of the pawn value (we actually
    experimented with not fixing the pawn value and found it made little
    difference: after 1764 games with an adjustable pawn its value had fallen
    by less than 7 percent).

³ The standard deviation for all ratings reported in this section is about 50.

⁴ In a later experiment we only set positive temporal differences to zero if
KnightCap did not predict the opponent's move and the opponent was rated less
than KnightCap. After all, predicting a stronger opponent's blunders is a
useful skill, although whether this made any difference is not clear.
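As a rough sketch of how the first two modifications combine (under the
assumptions stated here, not KnightCap's actual code), the squashed leaf scores
and the clipped temporal differences could be computed as follows; the
`predicted_move` flags are a hypothetical per-move record of whether KnightCap
predicted the opponent's reply.

```python
import numpy as np

def modified_temporal_differences(leaf_evals, outcome, beta, predicted_move):
    """Squash the leaf evaluations and clip positive temporal differences.

    leaf_evals: raw linear evaluations J~(x_t^l, w) of the PV leaves
    outcome: +1 win, -1 loss, 0 draw
    beta: chosen so that tanh(beta * eval) = 0.25 at one pawn of material
    predicted_move[t]: True if the opponent's reply from position t was
        the one KnightCap expected."""
    v = [float(np.tanh(beta * e)) for e in leaf_evals]   # squash into (-1, 1)
    v[-1] = outcome
    d = []
    for t in range(len(v) - 1):
        dt = v[t + 1] - v[t]
        # A positive difference may just mean the opponent blundered;
        # only keep it if the opponent's move was predicted.
        if dt > 0 and not predicted_move[t]:
            dt = 0.0
        d.append(dt)
    return d
```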
Within 300 games KnightCap's rating had risen to 2150, an increase of 500
points in three days, and to a level comparable with human masters. At this
point KnightCap's performance began to plateau, primarily because it does not
have an opening book and so will repeatedly play into weak lines. We have since
implemented an opening book learning algorithm and with this KnightCap now
plays at a rating of 2400–2500 (peak 2575) on the other major internet chess
server: ICC, chessclub.com⁵. It often beats International Masters at blitz.
Also, because KnightCap automatically learns its parameters we have been able
to add a large number of new features to its evaluation function: KnightCap
currently operates with 5872 features (1468 features in each of four stages:
opening, middle, ending and mating)⁶. With this extra evaluation power
KnightCap easily beats versions of Crafty restricted to search only as deep as
itself. However, a big caveat to all this optimistic assessment is that
KnightCap routinely gets crushed by faster programs searching more deeply. It
is quite unlikely this can be easily fixed simply by modifying the evaluation
function, since for this to work one has to be able to predict tactics
statically, something that seems very difficult to do. If one could find an
effective algorithm for "learning to search selectively" there would be
potential for far greater improvement.

⁵ There appears to be a systematic difference of around 200–250 points between
the two servers, so a peak rating of 2575 on ICC roughly corresponds to a peak
of 2350 on FICS. We transferred KnightCap to ICC because there are more strong
players playing there.

⁶ In reality there are not 1468 independent "concepts" per stage in KnightCap's
evaluation function, as many of the features come in groups of 64, one for each
square on the board (like the value of placing a rook on a particular square,
for example).

Note that we have twice repeated the learning experiment and found a similar
rate of improvement and final performance level. The rating as a function of
the number of games from one of these repeat runs is shown in figure 4 (we did
not record this information in the first experiment). Note that in this case
KnightCap took nearly twice as long to reach the 2150 mark, but this was partly
because it was operating with limited memory (8Mb) until game 500, at which
point the memory was increased to 40Mb (KnightCap's search
algorithm—MTD(f) [3]—is a memory intensive variant of α–β, and when learning
KnightCap must store the whole position in the hash table, so small memory
really hurts the performance). Another reason may also have been that for a
portion of the run we were performing parameter updates after every four games
rather than every game.

Figure 4: KnightCap's rating as a function of games played (second experiment).
Learning was turned on at game 0. [Plot omitted; axes are rating against games
played.]

Plots of various parameters as a function of the number of games played are
shown in Figure 5 (these plots are from the same experiment as figure 4). Each
plot contains three graphs corresponding to the three different stages of the
evaluation function: opening, middle and ending⁷.

⁷ KnightCap actually has a fourth and final stage, "mating", which kicks in
when all the pieces are off, but this stage only uses a few of the coefficients
(opponent's king mobility and proximity of our king to the opponent's king).

Finally, we compared the performance of KnightCap with its learnt weights to
KnightCap's performance with a set of hand-coded weights, again by playing the
two versions on ICC. The hand-coded weights were close in performance to the
learnt weights (perhaps 50–100 rating points worse). We also tested the result
of allowing KnightCap to learn starting from the hand-coded weights, and in
this case it seems that KnightCap performs better than when starting from just
material values (peak performance was 2632 compared to 2575, but these figures
are very noisy). We are conducting more tests to verify these results. However,
it should not be too surprising that learning from a good quality set of
hand-crafted parameters is better than just learning from material parameters.
In particular, some of the handcrafted parameters have very high values (the
value of an "unstoppable pawn", for example) which can take a very long time to
learn under normal playing conditions, particularly if they are rarely active
in the principal leaves. It is not yet clear whether, given a sufficient number
of games, this dependence on the initial conditions can be made to vanish.
Figure 5: Evolution of two parameters (bonus for castling and penalty for a
doubled pawn) as a function of the number of games played. Note that each
parameter appears three times: once for each of the three stages in the
evaluation function. [Plots omitted: DOUBLED_PAWN and CASTLE_BONUS scores
(1 pawn = 10000) for the opening, middle and ending stages against games
played.]

4.2 Discussion

There appear to be a number of reasons for the remarkable rate at which
KnightCap improved.

 1. As all the non-material weights were initially zero, even small changes in
    these weights could cause very large changes in the relative ordering of
    materially equal positions. Hence even after a few games KnightCap was
    playing a substantially better game of chess.

 2. It seems to be important that KnightCap started out life with intelligent
    material parameters. This put it close in parameter space to many far
    superior parameter settings.

 3. Most players on FICS prefer to play opponents of similar strength, and so
    KnightCap's opponents improved as it did. This may have had the effect of
    guiding KnightCap along a path in weight space that led to a strong set of
    weights.

 4. KnightCap was learning on-line, not by self-play. The advantage of on-line
    play is that there is a great deal of information provided by the
    opponent's moves. In particular, against a stronger opponent KnightCap was
    being shown positions that 1) could be forced (against KnightCap's weak
    play) and 2) were mis-evaluated by its evaluation function. Of course, in
    self-play KnightCap can also discover positions which are mis-evaluated,
    but it will not find the kinds of positions that are relevant to strong
    play against other opponents. In this setting, one can view the information
    provided by the opponent's moves as partially solving the "exploration"
    part of the exploration/exploitation tradeoff.

To further investigate the importance of some of these reasons, we conducted
several more experiments.

Good initial conditions.

A second experiment was run in which KnightCap's coefficients were all
initialised to the value of a pawn. The value of a pawn needs to be positive in
KnightCap because it is used in many other places in the code: for example, we
deem the MTD search to have converged once the α and β bounds are within 0.07
of a pawn of each other. Thus, to set all parameters equal to the same value,
that value had to be a pawn.

Playing with the initial weight settings KnightCap had a blitz rating of around
1250. After more than 1000 games on FICS KnightCap's rating has improved to
about 1550, a 300 point gain. This is a much slower improvement than the
original experiment. We do not know whether the coefficients would have
eventually converged to good values, but it is clear from this experiment that
starting near to a good set of weights is important for fast convergence. An
interesting avenue for further exploration here is the effect of λ on the
learning rate. Because the initial evaluation function is completely wrong,
there would be some justification in setting λ = 1 early on so that KnightCap
only tries to predict the outcome of the game and not the evaluations of later
moves (which are extremely unreliable).

Self-Play

Learning by self-play was extremely effective for TD-Gammon,
but a significant reason for this is the randomness of backgammon, which
ensures that with high probability different games have substantially different
sequences of moves, and also the speed of play of TD-Gammon, which ensured that
learning could take place over several hundred thousand games. Unfortunately,
chess programs are slow, and chess is a deterministic game, so self-play by a
deterministic algorithm tends to result in a large number of substantially
similar games. This is not a problem if the games seen in self-play are
"representative" of the games played in practice; however, KnightCap's
self-play games with only the material weights non-zero are very different to
the kind of games humans of the same level would play.

To demonstrate that learning by self-play for KnightCap is not as effective as
learning against real opponents, we ran another experiment in which all but the
material parameters were initialised to zero again, but this time KnightCap
learnt by playing against itself. After 600 games (twice as many as in the
original FICS experiment), we played the resulting version against the good
version that learnt on FICS for a further 100 games with the weight values
fixed. The self-play version scored only 11% against the good FICS version.

Simultaneously with the work presented here, Beal and Smith [1] reported
positive results using essentially TDLeaf(λ) and self-play (with some random
move choice) when learning the parameters of an evaluation function that only
computed material balance. However, they were not comparing performance against
on-line players, but were primarily investigating whether the weights would
converge to "sensible" values at least as good as the naive (1, 3, 3, 5, 9)
values for (pawn, knight, bishop, rook, queen) (they did, within 2000 games,
and using a value of λ = 0.95, which supports the discussion in "good initial
conditions" above).

5 Conclusion

We have introduced TDLeaf(λ), a variant of TD(λ) suitable for training an
evaluation function used in minimax search. The only extra requirement of the
algorithm is that the leaf nodes of the principal variations be stored
throughout the game.

We presented some experiments in which a chess evaluation function was trained
from B-grade to master level using TDLeaf(λ) by on-line play against a mixture
of human and computer opponents. The experiments show both the importance of
"on-line" sampling (as opposed to self-play) for a deterministic game such as
chess, and the need to start near a good solution for fast convergence,
although just how near is still not clear.

On the theoretical side, it has recently been shown that TD(λ) converges for
linear evaluation functions [11] (although only in the sense of prediction, not
control). An interesting avenue for further investigation would be to determine
whether TDLeaf(λ) has similar convergence properties.

Acknowledgements

Thanks to several of the anonymous referees for their helpful remarks. Jonathan
Baxter was supported by an Australian Postdoctoral Fellowship. Lex Weaver was
supported by an Australian Postgraduate Research Award.

References

 [1] D. F. Beal and M. C. Smith. Learning Piece Values Using Temporal
     Differences. Journal of the International Computer Chess Association,
     September 1997.

 [2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena
     Scientific, 1996.

 [3] A. Plaat, J. Schaeffer, W. Pijls, and A. de Bruin. Best-First Fixed-Depth
     Minimax Algorithms. Artificial Intelligence, 87:255–293, 1996.

 [4] J. Pollack, A. Blair, and M. Land. Coevolution of a Backgammon Player. In
     Proceedings of the Fifth Artificial Life Conference, Nara, Japan, 1996.

 [5] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers.
     IBM Journal of Research and Development, 3:210–229, 1959.

 [6] N. Schraudolph, P. Dayan, and T. Sejnowski. Temporal Difference Learning
     of Position Evaluation in the Game of Go. In J. Cowan, G. Tesauro, and
     J. Alspector, editors, Advances in Neural Information Processing Systems
     6, San Francisco, 1994. Morgan Kaufmann.

 [7] R. Sutton. Learning to Predict by the Method of Temporal Differences.
     Machine Learning, 3:9–44, 1988.

 [8] G. Tesauro. Practical Issues in Temporal Difference Learning. Machine
     Learning, 8:257–278, 1992.

 [9] G. Tesauro. TD-Gammon, a Self-Teaching Backgammon Program, Achieves
     Master-Level Play. Neural Computation, 6:215–219, 1994.

[10] S. Thrun. Learning to Play the Game of Chess. In G. Tesauro, D. Touretzky,
     and T. Leen, editors, Advances in Neural Information Processing Systems 7,
     San Francisco, 1995. Morgan Kaufmann.

[11] J. N. Tsitsiklis and B. Van Roy. An Analysis of Temporal-Difference
     Learning with Function Approximation. IEEE Transactions on Automatic
     Control, 42(5):674–690, 1997.

[12] S. Walker, R. Lister, and T. Downs. On Self-Learning Patterns in the
     Othello Board Game by the Method of Temporal Differences. In C. Rowles,
     H. Liu, and N. Foo, editors, Proceedings of the 6th Australian Joint
     Conference on Artificial Intelligence, pages 328–333, Melbourne, 1993.
     World Scientific.