General Video Game AI: a Multi-Track Framework for Evaluating Agents, Games and Content Generation Algorithms

Diego Perez-Liebana, Jialin Liu, Ahmed Khalifa, Raluca D. Gaina, Julian Togelius, Simon M. Lucas
Abstract—General Video Game Playing (GVGP) aims at designing an agent that is capable of playing multiple video games with no human intervention. In 2014, the General Video Game AI (GVGAI) competition framework was created and released with the purpose of providing researchers a common open-source and easy-to-use platform for testing their AI methods on a potentially infinite number of games created using the Video Game Description Language (VGDL). The framework has been expanded into several tracks during the last few years to meet the demand of different research directions. The agents are required either to play multiple unknown games with or without access to game simulations, or to design new game levels or rules. This survey paper presents VGDL, the GVGAI framework and its existing tracks, and reviews the wide use of the GVGAI framework in research, education and competitions five years after its birth. A future plan of framework improvements is also described.

I. INTRODUCTION

Game-based benchmarks and competitions have been used for testing artificial intelligence capabilities since the inception of the research field. For the first four or five decades, such testing was almost exclusively based on board games. However, since the early 2000s a number of competitions and benchmarks based on video games have sprung up. The better known include the Ms Pac-Man competition [1], the Mario AI competition [2], the Simulated Car Racing competition [3], the Arcade Learning Environment [4] and the StarCraft [5] competitions.

So far, most competitions and game benchmarks challenge the agents to play a single game (an exception is the Arcade Learning Environment, which contains several dozens of games, but so far almost all published studies deal with learning agents for one game at a time). This leads to an over-specialization, or overfitting, of agents to individual games. This is reflected in the outcome of individual competitions – for example, over the more than five years the Simulated Car Racing Competition ran, submitted car controllers got better at completing races fast, but incorporated more and more game-specific engineering and arguably less general AI and machine learning. Similarly, the well-performing bots in the StarCraft competitions are generally highly domain- and even strategy-specific, and display very little in the way of learning and adaptation.

It should not come as a surprise that, whenever possible, researchers will incorporate domain knowledge about the game they design an agent for. Yet, this trend threatens to negate the usefulness of game-based AI competitions for spurring and testing the development of stronger and more general AI.

The General Video Game AI competition [6] was founded on the belief that the best way to stop AI researchers from relying on game-specific engineering in their agents is to make it impossible. The idea is that researchers develop their agents without knowing what games they will be playing, and after submitting their agents to the competition all agents are evaluated using an unseen set of games. Every competition event requires the design of a new set of games, as reusing previous games would make it possible to adapt agents to existing games.

In order to make this possible, the Video Game Description Language (VGDL) was developed to easily create new games to be played by agents. VGDL was designed so that it would be easy to create games both for humans and algorithms, eventually allowing for automated generation of testbed games. A game engine was also created to allow games specified in this language to be visualized and played, both by humans and AI agents, forming the basis of the competition.

While the GVGAI competition was initially focused on benchmarking AI algorithms for playing the game, the competition and its associated software have multiple uses. In addition to the competition tracks dedicated to game-playing agents, there are now competition tracks focused on generating game levels or rules. There is also the potential to use VGDL and GVGAI as a game prototyping system, and there is a rapidly growing body of research using this framework for everything from building mixed-initiative design tools to demonstrating new concepts in game design.

The objective of this paper is to provide an overview of the different efforts from the community on the use of the GVGAI framework (and, by extension, of its competition) for General Game Artificial Intelligence. This overview aims at identifying the main approaches that have been used so far for agent AI and PCG, in order to compare them and recognize possible lines of future research within this field. The paper starts with a brief overview of the framework and the different competition tracks, for context and completeness, which summarizes work published in other papers by the same authors. The bulk of the paper is centered on the next few sections, which are devoted to discussing the various kinds of AI methods that have been used in the submissions to each track. Special consideration is given to the single-player planning track, as it has existed for the longest and received the most submissions to date. This is followed by a section cataloguing some of the non-competition research uses of the GVGAI software. The final few sections provide a view on the future use and development of the framework and competition: how it can be used in teaching, open research problems (specifically related to the planning tracks), and the future evolution of the competition and framework itself.
II. VGDL AND THE GVGAI FRAMEWORK

The idea of the Video Game Description Language (VGDL) was initially proposed by Ebner et al. [7] at the Dagstuhl Seminar on Artificial and Computational Intelligence in Games, during which the first VGDL game, Space Invaders, was created. Schaul then continued this work by completing and improving the language in a Python framework for model-based learning, and released the first game engine in 2013 [8], [9]. VGDL is a text description language that allows for the definition of two-dimensional, arcade, single-player, grid-based physics and (generally) stochastic games and levels. With an ontology defined in the framework, VGDL permits the description of objects (sprites), their properties and the consequences of them colliding with each other. Additionally, termination conditions and reward signals (in the form of game scores) can be defined for these games. In total, four different sets are defined: the Sprite Set (which defines what kind of sprites take part in the game), the Interaction Set (rules that govern the effects of two sprites colliding with each other), the Termination Set (which defines how the game ends) and the Mapping Set (which specifies which characters in the level file map to which sprites from the Sprite Set).

Ebner et al. [7] and Levine et al. [10] described, in their Dagstuhl 2013 follow-up, the need and interest for such a framework that could accommodate a competition for researchers to tackle the challenge of General Video Game Playing (GVGP). Perez-Liebana et al. [6] implemented a version of Schaul's initial framework in Java and organized the first General Video Game AI (GVGAI) competition in 2014 [11]. In the following years, this framework was extended to accommodate two-player games [12], [13], level [14] and rule [15] generation, and real-world physics games [16].

This framework provided an API for creating bots that would be able to play any game defined in VGDL, without giving them access to the rules of the games nor to the behaviour of other entities defined in the game. However, the agents do have access to a Forward Model (in order to roll the current state forward given an action) during the thinking time, set to 40ms in the competition settings. Controllers also have access at every game tick to game state variables, such as the game status (winner, time step and score), the state of the player (also referred to in this paper as the avatar: position, orientation, resources, health points), the history of collisions and the positions of the different sprites in the game, identified with a unique type id. Additionally, sprites are grouped in categories according to their general behaviour: Non-Player Characters (NPC), static, movable, portals (which spawn other sprites in the game, or behave as entry or exit points in the levels) and resources (which can be collected by the player).

All this information is also provided to agents in the learning setting of the framework and competition, with the exception of the Forward Model. In its last (to date) modification, the framework included compatibility for creating learning agents, both in Java and in Python, which learn to play any given game in an episodic manner [17]. At the moment of writing, the framework includes 120 single-player and 60 two-player games.
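As an illustration of how this API is used, the sketch below shows a minimal planning-style agent that applies the Forward Model to evaluate each available action one step ahead. The class and method names (StateObservation, copy(), advance(), getGameScore()) follow the GVGAI Java API, but the snippet is a simplified, illustrative sketch rather than the code of any sample controller.

    import core.game.StateObservation;   // GVGAI framework classes (package paths as in the public distribution)
    import ontology.Types;

    // Simplified sketch of a one-step lookahead agent using the Forward Model.
    public class OneStepSketch {

        // Called every game tick; must return an action within the 40ms budget.
        public Types.ACTIONS act(StateObservation stateObs) {
            Types.ACTIONS bestAction = Types.ACTIONS.ACTION_NIL;
            double bestValue = Double.NEGATIVE_INFINITY;
            for (Types.ACTIONS action : stateObs.getAvailableActions()) {
                StateObservation copy = stateObs.copy(); // copy the current state...
                copy.advance(action);                    // ...and roll it forward with the Forward Model
                double value = evaluate(copy);
                if (value > bestValue) {
                    bestValue = value;
                    bestAction = action;
                }
            }
            return bestAction;
        }

        // Heuristic value of a state: prefer wins, avoid losses, otherwise use the game score.
        private double evaluate(StateObservation state) {
            if (state.isGameOver() && state.getGameWinner() == Types.WINNER.PLAYER_WINS)  return  1e9;
            if (state.isGameOver() && state.getGameWinner() == Types.WINNER.PLAYER_LOSES) return -1e9;
            return state.getGameScore();
        }
    }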
III. GVGAI COMPETITION TRACKS

This section summarizes the different tracks featured in the GVGAI competition.

A. Single-player planning

The first competition track is Single-player Planning [11], in which one agent (also referred to as bot or controller) plays single-player games with the aid of the Forward Model. Each controller has 1 second for initialization and 40ms at each game tick as decision time.

All GVGAI tracks follow a similar structure: there are several public sets of games, included in the framework, allowing the participants to train their agents on them. For each edition, there is one validation and one test set. Both sets are private and stored in the competition server¹. Each set contains 10 games, with 5 different levels each.

¹ www.gvgai.net; Intel Core i5 machine, 2.90GHz, and 4GB of memory.

Participants can submit their entries any time before the submission deadline to all training and validation sets, and preliminary rankings are displayed on the competition website (the names of the validation set games are anonymized). All controllers are run on the test set after the submission deadline to determine the final rankings of the competition, executing each agent 5 times on each level.

Rankings are computed by first sorting all entries per game according to victory rates, scores and time steps, in this order. These per-game rankings award points to the first 10 entries, from first to tenth position: 25, 18, 15, 12, 10, 8, 6, 4, 2 and 1. The winner of the competition is the submission that sums the most points across all games in the test set. For a more detailed description of the competition and its rules, the reader is referred to [11].
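The scoring scheme is simple to reproduce; the sketch below (all class and method names are ours, not part of the framework) sums the Formula-1-style points of each per-game ordering to obtain the final championship ranking.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of the competition scoring described above: for every game,
    // entries are already sorted by victory rate, then score, then time steps, and the
    // first ten receive 25, 18, 15, 12, 10, 8, 6, 4, 2 and 1 points.
    public class CompetitionScoring {
        private static final int[] POINTS = {25, 18, 15, 12, 10, 8, 6, 4, 2, 1};

        // perGameOrderings: one list of controller names per game, best entry first.
        public static Map<String, Integer> totalPoints(List<List<String>> perGameOrderings) {
            Map<String, Integer> totals = new HashMap<>();
            for (List<String> ordering : perGameOrderings) {
                for (int rank = 0; rank < ordering.size() && rank < POINTS.length; rank++) {
                    totals.merge(ordering.get(rank), POINTS[rank], Integer::sum);
                }
            }
            return totals; // the winner is the entry with the highest total across the test set
        }
    }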
Table I shows the winners of all editions to date, along with the section of this survey in which the method is included and the paper that describes the approach in more depth.

TABLE I
WINNERS OF ALL EDITIONS OF THE GVGAI PLANNING COMPETITION. 2P INDICATES THE 2-PLAYER TRACK. HYBRID DENOTES 2 OR MORE TECHNIQUES COMBINED IN A SINGLE ALGORITHM. A HYPER-HEURISTIC HAS A HIGH-LEVEL DECISION MAKER THAT DECIDES WHICH SUB-AGENT MUST PLAY (SEE SECTION IV). TABLE EXTENDED FROM [18].

  Contest Leg     Winner      Type                 Section
  CIG-14          OLETS       Tree Search Method   IV-B [11]
  GECCO-15        YOLOBOT     Hyper-heuristic      IV-E [19]
  CIG-15          Return42    Hyper-heuristic      IV-E [18]
  CEEC-15         YBCriber    Hybrid               IV-D [20]
  GECCO-16        YOLOBOT     Hyper-heuristic      IV-E [19]
  CIG-16          MaastCTS2   Tree Search Method   IV-B [21]
  WCCI-16 (2P)    ToVo2       Hybrid               V-A [13]
  CIG-16 (2P)     Number27    Hybrid               V-B [13]
  GECCO-17        YOLOBOT     Hyper-heuristic      IV-E [19]
  CEC-17 (2P)     ToVo2       Hybrid               V-A [13]

B. Two-player planning

The Two-player Planning track [12] was added in 2016, with the aim of testing general AI agents in environments which are more complex and present more direct player interaction. Games in this setting are played by two players in a simultaneous-move fashion. The rules of the games included in the competition are still secret to the players, similar to the Single-player track, but an additional piece of information is hidden: whether the game is competitive or cooperative. Having both types of games ensures that agents are not tuned to only compete against their opponents, instead having to correctly judge various situations and identify what the goal of the other intelligent presence in the game is. However, this question of an appropriate opponent model remains open, as all competition entries so far have employed random or very simple techniques.
Two legs of this track were organized for the first time in 2016, at the IEEE World Congress on Computational Intelligence (WCCI) and the IEEE Conference on Computational Intelligence and Games (CIG). Although ToVo2 won the first leg and Number27 the second, the winner (adrienctx) and runner-up (MaastCTS2) of the overall championship showed that a general agent may not be the best at a particular subset of problems, but will perform at a high level on many different ones. In 2017, ToVo2 won the overall competition, organized at the IEEE Congress on Evolutionary Computation (CEC). The final results of the competition can be seen in Table I. For more details, the reader is referred to [13].

The work done on 2-player GVGAI has inspired other research on Mathematical General Game Playing. D. Ashlock et al. [22] implemented general agents for three different mathematical coordination games, including the Prisoner's Dilemma. Games were presented one at a time, switching between them at certain points, and experiments show that agents can learn to play these games and recognize when the game has changed.

C. Learning track

The GVGAI Single-Player Learning track started in 2017. Among the GVGAI tracks, this is the only one which accepts the submission of agents written not only in Java, but also in Python, in order to accommodate popular machine learning libraries written in this language. The first learning track was organized at the 2017 IEEE Conference on Computational Intelligence and Games (IEEE CIG 2017)² using the GVGAI framework. Then, Torrado et al. [23] interfaced the GVGAI framework to the OpenAI Gym environment to facilitate its usage and enable the application of existing Reinforcement Learning agents to the GVGAI games. From 2018, the more user-friendly GVGAI Gym³ is used in the learning competitions.

² http://www.cig2017.com/
³ https://github.com/rubenrtorrado/GVGAI_GYM

1) The 2017 Single-Player Learning track: The 2017 Single-Player Learning track is briefly introduced below. More technical details of the framework and competition procedure are explained in the track technical manual [17].

The learning track is based on the GVGAI framework, but a key difference is that no forward model is provided to the controllers. Hence, learning needs to be achieved by an episodic repetition of games played. Note that, even while no forward model is accessible during the learning and validation phases, controllers receive an observation of the current game state via a StateObservation object, which is provided as a Gson object or as a screen-shot of the game screen (in png format). The agent is free to select either or both forms of game state observation at any game tick. Similar to the planning tracks, controllers (either in Java or in Python) inherit from an abstract class, and a constructor and three methods can be implemented: INIT (called at the beginning of every game, must finish in no more than 1s of CPU time), ACT (called at every game tick, determines the next action of the controller in 40ms of CPU time) and RESULT (called at the end of every game, with no time limit, to allow for processing the outcome and performing some learning).
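A schematic skeleton of such a learning agent is sketched below. The three methods correspond to INIT, ACT and RESULT as described above; the observation class, package paths and exact signatures are our approximation of the learning-track client and should be checked against the technical manual [17].

    import ontology.Types;                               // action enumeration, as in the main framework
    import serialization.SerializableStateObservation;   // observation class of the learning client (path approximate)

    // Schematic learning-track agent: no Forward Model is available, so decisions are
    // taken from the current observation only, and learning happens episodically.
    public class SimpleLearningAgent {

        // INIT: called at the beginning of every game; must finish within 1s of CPU time.
        public void init(SerializableStateObservation sso) {
            // e.g., reset per-episode statistics, inspect the first observation.
        }

        // ACT: called at every game tick; must return an action within 40ms of CPU time.
        public Types.ACTIONS act(SerializableStateObservation sso) {
            // e.g., pick an action epsilon-greedily from a learned value estimate.
            return Types.ACTIONS.ACTION_NIL;
        }

        // RESULT: called at the end of every game (finished or aborted); no time limit.
        // During the learning phase the returned value can be used to select the next level.
        public int result(SerializableStateObservation sso) {
            // Update the model with the outcome of the episode.
            return 0;
        }
    }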
As in the planning track, game sets consist of 10 games with 5 levels each. The execution in a given set is split in two phases: a learning phase, which allows training in the first 3 levels of each game, and a validation phase, which uses the other 2 available levels. The execution of a controller in a set therefore has 2 phases:

a) Learning phase: In each of the games, each controller has a limited amount of time, 5 minutes, for learning the first 3 levels. It will play each of the 3 levels once, then be free to choose the next level to play if there is time remaining. The controller is able to send an action called ABORT to finish the game at any moment. The method RESULT is called at the end of every game, regardless of whether the game has finished normally or has been aborted by the controller. Thus, the controller can play as many games as desired, potentially choosing the levels to play in, as long as it respects the 5 minute time limit.

b) Validation phase: Immediately after the learning phase, the controller plays 10 times the levels 4 and 5 sequentially. There is no total time limit anymore, but the agent respects the time limits for init, act and result, and can continue learning during game playing.

Besides two sample random agents written in Java and Python and one sample agent using SARSA [24] written in Java, the first GVGAI single-player learning track received three submissions written in Java and one in Python [25]. Table II illustrates the score and ranking of the learning agents on the training and test sets. Table III compares the score achieved by the best planning agent in 2017 on each of the games in the test set and the score achieved by the best learning agent.

TABLE II
SCORE AND RANKING OBTAINED BY THE SUBMITTED LEARNING AGENTS ON THE TRAINING AND TEST SETS. † DENOTES A SAMPLE CONTROLLER.

                               Training set         Test set
  Agent                        Score   Ranking      Score   Ranking
  kkunan                       125     6            184     1
  sampleRandom†                154     2            178     2
  DontUnderestimateUchiha      149     3            158     3
  sampleLearner†               149     4            152     4
  ercumentilhan                179     1            134     5
  YOLOBOT                      132     5            112     6

TABLE III
BEST SCORES OBTAINED BY SINGLE-PLAYER PLANNING AND LEARNING AGENTS ON THE SAME TEST SET. NOTE THAT ONE OF THE GAMES IN THE TEST SET IS REMOVED FROM THE FINAL RANKING DUE TO BUGS IN THE GAME ITSELF. † DENOTES A SAMPLE CONTROLLER.

         1-P Planning           1-P Learning
  Game   Best score             Best score            Agent
  G2     109.00 ± 38.19         31.5 ± 14.65          sampleRandom†
  G3     1.00 ± 0.00            0 ± 0                 *
  G4     1.00 ± 0.00            0.2 ± 0.09            kkunan
  G5     216.00 ± 24.00         1 ± 0                 *
  G6     5.60 ± 0.78            3.45 ± 0.44           DontUnderestimateUchiha
  G7     31696.10 ± 6975.78     29371.95 ± 2296.91    kkunan
  G8     1116.90 ± 660.84       35.15 ± 8.48          kkunan
  G9     1.00 ± 0.00            0.05 ± 0.05           sampleRandom†
  G10    56.70 ± 25.23          2.75 ± 2.04           sampleLearner†

2) GVGAI Gym: Torrado et al. [23] interfaced the GVGAI framework to the OpenAI Gym environment and obtained GVGAI Gym. An agent still receives a screen-shot of the game screen and the score, and returns a valid action at every game tick, while the interface between an agent and the framework is much simpler. Additionally, a longer learning time is provided (two weeks on 3 games, instead of 5 minutes on each of the 10 games) in the competition organized at CIG 2018.
D. Level Generation

The Level Generation track [14] started in 2016. The aim of the track is to allow competitors to develop a general level generation algorithm. To participate in this track, competitors must implement their own level generator, providing at least one function that is responsible for generating the level. The framework provides the generator with all the information needed about the game, such as game sprites, interaction set, termination conditions and level mapping. Additionally, the framework also provides access to the Forward Model, in order to allow testing the generated levels via agent simulation. The levels are generated in the form of a 2D matrix of characters, with each character representing the game sprites at the specific location determined by the matrix. Competitors have the choice to either use the provided level mapping or provide their own one.
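The sketch below illustrates the generator contract in its simplest form: produce a matrix of characters, one per tile, that the framework maps back to sprites through the level mapping. The class name GameDescription and the generateLevel signature approximate the GVGAI generator interface; the symbols are placeholders, and a real generator would also query the game description for its actual level mapping and use the Forward Model to check that the level is playable.

    import java.util.Random;
    import core.game.GameDescription;   // game information provided by the framework (path approximate)

    // Illustrative sketch of a trivial level generator: a bordered room filled with
    // random characters, returned as a 2D matrix of characters (one row per line).
    public class RandomLevelGenerator {
        private final Random rnd = new Random();

        public String generateLevel(GameDescription game, int width, int height) {
            char[] symbols = {'.', '.', '.', '1', 'g'};   // mostly floor, some enemies ('1') and goals ('g')
            StringBuilder level = new StringBuilder();
            for (int y = 0; y < height; y++) {
                for (int x = 0; x < width; x++) {
                    boolean border = (x == 0 || y == 0 || x == width - 1 || y == height - 1);
                    if (border)                level.append('w');   // wall all around the level
                    else if (x == 1 && y == 1) level.append('A');   // place the avatar exactly once
                    else                       level.append(symbols[rnd.nextInt(symbols.length)]);
                }
                level.append('\n');
            }
            return level.toString();
        }
    }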
The first level generation competition was held at the International Joint Conference on Artificial Intelligence (IJCAI) in 2016, and received four participants. Each participant was provided a month to submit a new level generator. Three different level generators were provided in order to help the users get started with the system (see Section VII for a description of these). Three of the four submitted generators were simulation-based, while the remaining one was based on cellular automata. During the competition day, people attending the conference were encouraged to try pairs of generated levels and select which level they liked (one, both, or neither). Finally, the winner was selected based on the generator with more votes. The winner of the contest was the Easablade generator, a cellular automaton described in Section VII-A4.

The competition was run again during IEEE CIG 2017 with the same configuration as the previous year (one month for implementation followed by on-site judging). Unfortunately, only one submission was received, hence the competition was canceled. This submission used an n-gram model to generate new constrained levels using recorded player keystrokes.

E. Rule Generation

The Rule Generation track [15] was introduced and held during CIG 2017. The aim of the track is to generate the interaction set and termination conditions for a certain level with a fixed set of game sprites in a fixed amount of time. To participate in the track, competitors have to provide their own rule generator. The framework provides the competitors with the game sprites, a certain level, and a forward model to simulate generated games, as in the Level Generation case. The generated games are represented as two arrays of strings: the first array contains the interaction set, while the second array contains the termination conditions.
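The output format is illustrated below: two string arrays, one with VGDL-style interaction rules and one with termination conditions. The concrete rules are only an example of what a generator could return for a level containing an avatar, walls and enemies; they are not taken from any submitted generator.

    // Illustrative sketch of the rule-generation output: an array of interaction rules
    // and an array of termination conditions, both expressed as VGDL-style strings.
    public class NaiveRuleGenerator {

        public String[][] generateRules(/* game sprites, level and forward model omitted */) {
            String[] interactionSet = {
                "avatar wall > stepBack",                        // the avatar cannot cross walls
                "enemy  wall > stepBack",                        // neither can enemies
                "avatar enemy > killSprite scoreChange=-1"       // the avatar dies on contact with an enemy
            };
            String[] terminationSet = {
                "SpriteCounter stype=avatar limit=0 win=False",  // lose when the avatar is gone
                "Timeout limit=1000 win=True"                    // survive long enough to win
            };
            return new String[][]{ interactionSet, terminationSet };
        }
    }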
                                                                   and neuro-evolution. Finally, the level and rule generation
simulation. The levels are generated in the form of 2d matrix
                                                                   tracks focus on content creation problems and the algorithms
of characters, with each character representing the game sprites
                                                                   that are traditionally used for this: search-based (evolution-
at the specific location determined by the matrix. Competitors
                                                                   ary algorithms and forward planning methods), solver (SAT,
have the choice to either use the provided level mapping or
                                                                   Answer Set Programming), cellular automata, grammar-based
provide their own one.
                                                                   approaches, noise and fractals.
   The first level generation competition was held at the In-
ternational Joint Conference on Artificial Intelligence (IJCAI)
in 2016, which received four participants. Each participant              IV. M ETHODS FOR S INGLE P LAYER P LANNING
was provided a month to submit a new level generator. Three           This section describes the different methods that have been
different level generators were provided in order to help          implemented for Single Player Planning in GVGAI. All the
the users get started with the system (see Section VII for         controllers that face this challenge have in common the possi-
a description of these). Three out of the four participants        bility of using the forward model to sample future states from
were simulation-based level generators while the remaining         the current game state, plus the fact that they have a limited
was based on cellular automata. During the competition day,        action-decision time. While most attempts abide by the 40ms
people attending the conference were encouraged to try pairs       decision time imposed by the competition, other efforts in the
of generated levels and select which level they liked (one,        literature compel their agents to obey a maximum number of
both, or neither). Finally, the winner was selected based on       uses of the forward model.
Section IV-A briefly introduces the most basic methods that can be found within the framework. Section IV-B then describes the different tree search methods that have been implemented for this setting by the community, followed by Evolutionary Methods in Section IV-C. Often, more than one method is combined into the same algorithm, which gives rise to Hybrid methods (Section IV-D) or Hyper-heuristic algorithms (Section IV-E). Further discussion on these methods and their common take-aways is included in Section X.

A. Basic Methods

The GVGAI framework contains several agents aimed at demonstrating how a controller can be created for the single-player planning track of the competition [11]. Therefore, these methods are not particularly strong.

The simplest of all methods is, without much doubt, doNothing. This agent returns the action NIL at every game tick, without exception. The next agent in complexity is sampleRandom, which returns a random action at each game tick. Finally, onesteplookahead is another sample controller that rolls the model forward for each one of the available actions in order to select the one with the highest action value, determined by a function that tries to maximize score while minimizing distances to NPCs and portals.

B. Tree Search Methods

One of the strongest and most influential sample controllers is sampleMCTS, which implements the Monte Carlo Tree Search [26] algorithm for real-time games. Initially implemented in a closed loop version (the states visited are stored in the tree node, without requiring the use of the Forward Model during the tree policy phase of MCTS), it achieved the third position (out of 18 participants) in the first edition of the competition.
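For reference, sampleMCTS follows the standard UCT scheme: during the tree policy, the child maximizing the UCB1 value is selected. This is the equation that several of the enhancements discussed below (e.g. [32], [37], [38]) modify or replace:

    a^* = \arg\max_{a \in A(s)} \left[ Q(s,a) + C \sqrt{\frac{\ln N(s)}{N(s,a)}} \right]

where Q(s,a) is the empirical average reward of action a in state s, N(s) and N(s,a) are visit counts, and C is an exploration constant (commonly set to the square root of 2).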
The winner of that edition, Couëtoux, implemented Open Loop Expectimax Tree Search (OLETS), which is an open loop (states visited are never stored in the associated tree node) version of MCTS that does not include rollouts and uses Open Loop Expectimax (OLE) for the tree policy. OLE substitutes the empirical average reward by rM, a weighted sum of the empirical average of rewards and the maximum of its children's rM values [11].

Schuster, in his MSc thesis [27], analyzes several enhancements and variations for MCTS in different sets of the GVGAI framework. These modifications included different tree selection, expansion and play-out policies. Results show that combinations of the Move-Average Sampling Technique (MAST) and the N-Gram Selection Technique (NST) with Progressive History provided an overall higher rate of victories than their counterparts without these enhancements, although this result was not consistent across all games (with some simpler algorithms achieving similar results).

In a different study, Soemers [21], [28] explored multiple enhancements for MCTS: Progressive History (PH) and NST for the tree selection and play-out steps, tree re-use (starting at each game tick with the subtree grown in the previous frame that corresponds to the action taken, rather than a new root node), breadth-first tree initialization (direct successors of the root node are explored before MCTS starts), safety pre-pruning (pruning those nodes with a high number of game losses found), loss avoidance (MCTS ignores game loss states when found for the first time by choosing a better alternative), novelty-based pruning (in which states with features rarely seen are less likely to be pruned), knowledge-based evaluation [29] and deterministic game detection. The authors experimented with all these enhancements in 60 games of the framework, showing that most of them improved the performance of MCTS significantly and that their all-in-one combination increased the average win rate of the sample agent by 17 percentage points. The best configuration was the winner of one of the editions of the 2016 competition (see Table I).

F. Frydenberg studied yet another set of enhancements for MCTS. The authors showed that using MixMax backups (weighing average and maximum rewards on each node) improved the performance in only some games, but that their combination with a reversal penalty (to penalize visiting the same location twice in a play-out) offers better results than vanilla MCTS. Other enhancements, such as macro-actions (repeating an action several times in a sequence) and partial expansion (a child node is considered expanded only if its children have also been expanded) did not improve the results obtained.

Perez-Liebana et al. [29] implemented KB-MCTS, a version of MCTS with two main enhancements. First, distances to different sprites were considered as features for a linear combination, where the weights were evolved to bias the MCTS rollouts. Secondly, a Knowledge Base (KB) is kept about how interesting the different sprites are for the player, where interestingness is a measure of curiosity (rollouts are biased towards unknown sprites) and experience (a positive/negative bias for getting closer to/farther from beneficial/harmful entities). The results of applying this algorithm to the first set of games of the framework showed that the combination of these two components gave a boost in performance in most games of the first training set.
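The rollout-biasing idea can be sketched as follows: each rollout step scores every available action with a linear combination of (evolved) weights and sprite-distance features, and samples an action through a softmax. This is an illustrative reconstruction of the general mechanism in [29], not the authors' implementation; feature extraction and the evolutionary loop are omitted.

    import java.util.Random;

    // Sketch of biasing rollout action selection with an evolved weight vector,
    // in the spirit of KB-MCTS [29]. All names are illustrative.
    public class BiasedRolloutPolicy {
        private final Random rnd = new Random();

        // features[a][f]: value of feature f (e.g. distance to a sprite type) after applying action a.
        // weights[f]:     weight vector of the individual currently being evaluated by the EA.
        public int sampleAction(double[][] features, double[] weights) {
            double[] preference = new double[features.length];
            double total = 0;
            for (int a = 0; a < features.length; a++) {
                double score = 0;
                for (int f = 0; f < weights.length; f++) {
                    score += weights[f] * features[a][f];   // linear combination of features
                }
                preference[a] = Math.exp(score);            // softmax numerator
                total += preference[a];
            }
            double r = rnd.nextDouble() * total;            // roulette-wheel sampling
            for (int a = 0; a < preference.length; a++) {
                r -= preference[a];
                if (r <= 0) return a;
            }
            return preference.length - 1;
        }
    }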
The work in [29] has been extended by other researchers in the field, who also put a special effort on biasing the Monte Carlo (MC) simulations. In [30], the authors modified the random action selection in MCTS rollouts by using potential fields, which bias the rollouts by making the agent move in a direction akin to the field. The authors showed that KB-MCTS provides a better performance if this potential field is used instead of the Euclidean distance between sprites implemented in [29]. Additionally, in a similar study [31], the authors substituted the Euclidean distance for a measure calculated by a pathfinding algorithm. This addition achieved some improvements over the original KB-MCTS, although the authors noted in their study that using pathfinding does not provide a competitive advantage in all games.

Another work, by Park and Kim [32], tackles this challenge by a) determining the goodness of the other sprites in the game; b) computing an Influence Map (IM) based on this; and c) using the IM to bias the simulations, on this occasion by adding a third term to the Upper Confidence Bound (UCB) [33] equation for the tree policy of MCTS. Although not compared with KB-MCTS, the resultant algorithm improves the performance of the sample controllers in several games of the framework, albeit performing worse than these in some of the games used in the study.
Biasing rollouts is also attempted by dos Santos et al. [34], who introduced Redundant Action Avoidance (RAA) and a Non-Defeat Policy (NDP). RAA analyzes changes in the state to avoid selecting sequences of actions that do not produce any alteration in position, orientation, properties or new sprites for the avatar. NDP makes the recommendation policy ignore all children of the root node that found at least one game loss in a simulation from that state; if all children are marked with a defeat, the normal (highest number of visits) recommendation is followed. Again, both modifications are able to improve the performance of MCTS in some of the games, but not in all.

de Waard et al. [35] introduced the concept of options, or macro-actions, in GVGAI. Each option is associated with a goal, a policy and a termination condition. The selection and expansion steps in MCTS are modified so the search tree branches only if an option is finished, allowing for a deeper search in the same amount of time. Their results show that Option MCTS outperforms MCTS in games with small levels or a small number of sprites, but loses in the comparison to MCTS when the games are bigger, due to these options becoming too large.

In a similar line, Perez-Liebana et al. [16] employed macro-actions for GVGAI games that use continuous (rather than grid-based) physics. These games have a larger state space, which in turn delays the effects of the player's actions and modifies the way agents navigate through the level. Macro-actions are defined as a sequence or repetition of the same action during M steps, which is arguably the simplest kind of macro-action that can be devised. MCTS performed better without macro-actions on average across games, but there are particular games where MCTS needs macro-actions to avoid losing at every attempt. The authors also concluded that the length M of the macro-actions impacts different games distinctly, although shorter ones seem to provide better results than larger ones, probably due to a finer control of the movement of the agents.

Some studies have brought multi-objective optimization to this challenge. For instance, Perez-Liebana et al. [36] implemented a Multi-Objective version of MCTS, concretely maximizing score and level exploration simultaneously. In the games tested, the rate of victories grew from 32.24% (normal MCTS) to 42.38% in the multi-objective version, showing great promise for this approach. In a different study, Khalifa et al. [37] apply multi-objective concepts to evolving parameters for a tree selection confidence bounds equation. A previous work by Bravi [38] (also discussed later in Section IV-D) provided multiple UCB equations for different games. The work in [37] evolved, using the S-Metric Selection Evolutionary Multi-objective Optimization Algorithm (SMS-EMOA), the linear weights of a UCB equation that results from combining all those from [38] into a single one. All these components respond to different and conflicting objectives, and the results show that it is possible to find good solutions for the games tested.

A significant exception to MCTS with regards to tree search methods for GVGAI is that of T. Geffner and H. Geffner [20] (winner of one of the editions of the 2015 competition, YBCriber, as indicated in Table I), who implemented Iterated Width (IW; concretely IW(1)). IW(1) is a breadth-first search with a crucial alteration: a new state found during search is pruned if it does not make true a new tuple of at most 1 atom, where atoms are boolean variables that refer to position (and orientation, in the case of avatars) changes of certain sprites at specific locations. The authors found that IW(1) performed better than MCTS in many games, with the exception of puzzles, where IW(2) (pruning according to pairs of atoms) showed better performance. This agent was declared winner of the CEEC 2015 edition of the Single-player planning track [6].
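The novelty-1 test at the core of IW(1) can be sketched as follows; the atom encoding and data structures are illustrative, while the agent in [20] defines its atoms over sprite positions and avatar orientations as described above.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the IW(1) pruning rule: a newly generated state is kept only if it
    // makes true at least one atom (here "sprite type t occupies cell (x, y)") that
    // no previously generated state had made true.
    public class NoveltyOneTable {
        private final Set<Long> seenAtoms = new HashSet<>();

        private long atom(int spriteType, int x, int y) {
            return (((long) spriteType) << 40) | (((long) x) << 20) | (long) y;   // simple packing
        }

        // Returns true if the state contains at least one unseen atom (and records all
        // of its atoms); a breadth-first search prunes states for which this is false.
        public boolean isNovelAndRecord(int[][] spriteTypeAtCell) {
            boolean novel = false;
            for (int x = 0; x < spriteTypeAtCell.length; x++) {
                for (int y = 0; y < spriteTypeAtCell[x].length; y++) {
                    if (seenAtoms.add(atom(spriteTypeAtCell[x][y], x, y))) {
                        novel = true;   // this atom had never been true before
                    }
                }
            }
            return novel;
        }
    }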
Babadi [39] implemented several versions of Enforced Hill Climbing (EHC), a breadth-first search method that looks for a successor of the current state with a better heuristic value. EHC obtained similar results to KB-MCTS in the first set of games of the framework, with a few disparities in specific games of the set.

Nelson [40] ran a study on MCTS in order to investigate whether, given a higher time budget (i.e. increasing the number of iterations), MCTS was able to master most of the games; in other words, whether the real-time nature of the GVGAI framework and competition is the reason why different approaches fail to achieve a high victory rate. This study provided up to 30 times more budget to the agent, but the performance of MCTS only increased marginally even at that level. In fact, this improvement was achieved by means of losing less often rather than by winning more games. The paper concludes that the real-time aspect is not the only factor in the challenge, but also the diversity in the games. In other words, increasing the computational budget is not the answer to the problem GVGAI poses, at least for MCTS.

Finally, another study on the uses of MCTS for single-player planning is carried out by I. Bravi et al. [41]. In this work, the focus is set on understanding why and under which circumstances different MCTS agents make different decisions, allowing for a more in-depth description and behavioural logging. This study proposes the analysis of different metrics (recommended action and their probabilities, action values, consumed budget before converging on a decision, etc.) recorded via a shadow proxy agent, used to compare algorithms in pairs. The analysis described in the paper shows that traditional win-rate performance can be enhanced with these metrics in order to compare two or more approaches.

C. Evolutionary Methods

The second big group of algorithms used for single-player planning is that of evolutionary algorithms (EA). Concretely, the use of EAs for this real-time problem is mostly implemented in the form of Rolling Horizon EAs (RHEA). This family of algorithms evolves sequences of actions with the use of the forward model. Each sequence is an individual of an EA whose fitness is the value of the state found at the end of the sequence. Once the time budget is up, the first action of the sequence with the highest fitness is chosen to be applied in that time step.
The GVGAI competition includes SampleRHEA as a sample controller. SampleRHEA has a population size of 10 and an individual length of 10, and implements uniform crossover and mutation, where one action in the sequence is changed for another one (position and new action chosen uniformly at random) [11].
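A compact sketch of this rolling horizon loop, using the sample controller's parameters (population size 10, individual length 10, uniform crossover and single-action mutation), is shown below. The StateObservation calls mirror the GVGAI Forward Model API; the evolutionary bookkeeping is simplified and is not the SampleRHEA source code.

    import java.util.ArrayList;
    import java.util.Random;
    import core.game.StateObservation;
    import ontology.Types;

    // Compact, illustrative Rolling Horizon EA: evolve fixed-length action sequences,
    // evaluate them with the Forward Model, and play the first action of the best one.
    public class RollingHorizonSketch {
        private static final int POP = 10, LEN = 10;
        private final Random rnd = new Random();

        public Types.ACTIONS act(StateObservation stateObs, long budgetMillis) {
            ArrayList<Types.ACTIONS> actions = stateObs.getAvailableActions();
            long deadline = System.currentTimeMillis() + budgetMillis;

            int[][] pop = new int[POP][LEN];
            for (int[] ind : pop)
                for (int g = 0; g < LEN; g++) ind[g] = rnd.nextInt(actions.size());

            int[] best = pop[0].clone();
            double bestFit = Double.NEGATIVE_INFINITY;

            while (System.currentTimeMillis() < deadline) {
                double[] fit = new double[POP];
                for (int i = 0; i < POP; i++) {                        // evaluate each sequence
                    StateObservation copy = stateObs.copy();
                    for (int g = 0; g < LEN && !copy.isGameOver(); g++)
                        copy.advance(actions.get(pop[i][g]));
                    fit[i] = copy.getGameScore();                      // fitness = value of the final state
                    if (fit[i] > bestFit) { bestFit = fit[i]; best = pop[i].clone(); }
                }
                int[][] next = new int[POP][LEN];
                next[0] = best.clone();                                // elitism
                for (int i = 1; i < POP; i++) {                        // tournament parents + uniform crossover
                    int[] p1 = fitter(pop, fit), p2 = fitter(pop, fit);
                    for (int g = 0; g < LEN; g++) next[i][g] = rnd.nextBoolean() ? p1[g] : p2[g];
                    next[i][rnd.nextInt(LEN)] = rnd.nextInt(actions.size());   // mutate one action
                }
                pop = next;
            }
            return actions.get(best[0]);   // first action of the best sequence found
        }

        private int[] fitter(int[][] pop, double[] fit) {
            int a = rnd.nextInt(pop.length), b = rnd.nextInt(pop.length);
            return fit[a] >= fit[b] ? pop[a] : pop[b];
        }
    }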
Gaina et al. [42] analyzed the effects of the RHEA parameters on the performance of the algorithm in 20 games, chosen among the existing ones in order to have a representative set of all games in the framework. The parameters analyzed were population size and individual length, and results showed that higher values for both parameters provided higher victory rates. This study motivated the inclusion of Random Search (SampleRS) as a sample in the framework, which is equivalent to RHEA but with an infinite population size (i.e. only one generation is evaluated until the budget is consumed) and achieves better results than RHEA in some games. [42] also compared RHEA with MCTS, showing better performance for an individual length of 10 and high population sizes.

A different Evolutionary Computation agent was proposed by Jia et al. [43], [44], which consists of a Genetic Programming (GP) approach. The authors extract features from a screen capture of the game, such as the avatar location and the positions and distances to the nearest object of each type. These features are inputs to a GP system that, using arithmetic operands as nodes, determines the action to execute as a result of three trees (horizontal, vertical and action use). The authors report that all the different variations of the inputs provided to the GP algorithm give similar results to those of MCTS, on the three games tested in their study.

D. Hybrids

The previous studies feature agents in which one technique is predominant, albeit they may include enhancements which can place them in the boundary of hybrids. This section describes those approaches that, in the opinion of the authors, would in their own right be considered as techniques that mix more than one approach in the same, single algorithm.

An example of one of these approaches is presented by Gaina et al. [45], who analyzed the effects of seeding the initial population of RHEA using different methods. Part of the decision time budget is dedicated to initializing a population with sequences that are promising, as determined by onesteplookahead and MCTS. Results show that both seeding options provide a boost in victory rate when population size and individual length are small, but the benefits vanish when these parameters are large.

Other enhancements for RHEA proposed in [46] are incorporating a bandit-based mutation, a statistical tree, a shift buffer and rollouts at the end of the sequences. The bandit-based mutation breaks the uniformity of the random mutations in order to choose new values according to suggestions given by a uni-variate armed bandit; however, the authors reported that no improvement in performance was noticed. A statistical tree, previously introduced in [47], keeps a game tree with visit counts and accumulated rewards in the root node, which are subsequently used for recommending the action to take in that time step; this enhancement produced better results with shorter individuals and smaller population sizes. The shift buffer enhancement provided the best improvement in performance, and consists of shifting the sequences of the individuals of the population one action to the left, removing the action from the previous time step. This variation, similar to keeping the tree between frames in MCTS, combined with the addition of rollouts at the end of the sequences, provided an improvement in victory rate (20 percentage points over vanilla RHEA) and scores.
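The shift buffer itself amounts to very little code; an illustrative sketch:

    import java.util.Random;

    // Sketch of the shift buffer described above: when a new game tick starts, every
    // individual drops the action that was just played, shifts the rest one position
    // to the left and appends a fresh random action, so the evolved population is
    // carried over between frames (much like tree re-use in MCTS).
    public class ShiftBuffer {
        private final Random rnd = new Random();

        public void shift(int[][] population, int numActions) {
            for (int[] individual : population) {
                System.arraycopy(individual, 1, individual, 0, individual.length - 1);
                individual[individual.length - 1] = rnd.nextInt(numActions);
            }
        }
    }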
A similar (and previous) study was conducted by Horn et al. [48]. In particular, this study features RHEA with rollouts (as in [46]), RHEA with MCTS for alternative actions (where MCTS can determine any action with the exception of the one recommended by RHEA), RHEA with rollouts and sequence planning (the same approach as the shift buffer in [46]), RHEA with rollouts and occlusion detection (which removes unneeded actions in a sequence that reaches a reward) and RHEA with rollouts and NPC attitude check (which rewards sequences in terms of proximity to sprites that provide a positive or negative reward). Results show that RHEA with rollouts improved performance in many games, although all the other variants and additions performed worse than the sample agents. It is interesting to see that in this case the shift buffer did not provide an improvement in the victory rate, although this may be due to the use of different games.

Schuster [27] proposes two methods that combine MCTS with evolution. One of them, the (1+1)-EA as proposed by [29], evolves a vector of weights for a set of game features in order to bias the rollouts towards more interesting parts of the search space. Each rollout becomes an evaluation for an individual (weight vector), using the value of the final state as fitness. The second algorithm is based on strongly-typed GP (STGP) and uses game features to evolve state evaluation functions that are embedded within MCTS. These two approaches join MAST and NST (see Section IV-B) in a larger comparison, and the study concludes that different algorithms outperform others in distinct games, without an overall winner in terms of superior victory rate, although superior to vanilla MCTS in most cases.

The idea of evolving weight vectors for game features during the MCTS rollouts introduced in [29] (KB-MCTS⁴) was explored further by van Eeden in his MSc thesis [49]. In particular, the author added A* as a pathfinding algorithm to replace the Euclidean distance used in KB-MCTS with a more accurate measure, and changed the evolutionary approach. While KB-MCTS used a weight for each feature-action pair, with the action chosen at each step by a Softmax function, this work combines all move actions into a single weight and picks the action using Gibbs sampling. The author concludes that the improvements achieved by these modifications are marginal, and likely due to the inclusion of pathfinding.

⁴ This approach could also be considered a hybrid. Given its influence on other tree approaches, it has also been partially described in Section IV-B.
Additional improvements on KB-MCTS are proposed by Chu et al. [50]. The authors replace the Euclidean distance feature to sprites with a grid view of the agent's surroundings, and also the (1+1)-EA with a Q-Learning approach to bias the MCTS rollouts, making the algorithm update the weights at each step of the rollout. The proposed modifications improved the victory rate in several sets of games of the framework and also achieved the highest average victory rate among the algorithms it was compared with.

İlhan and Etaner-Uyar [51] implemented a combination of MCTS and true online Sarsa(λ) [52]. The authors use MCTS rollouts as episodes of past experience, executing true online Sarsa at each iteration with an ε-greedy selection policy. Weights are learnt for features, taken as the smallest Euclidean distance to sprites of each type. Results showed that the proposed approach improved the performance of vanilla MCTS in the majority of the 10 games used in the study.
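A heavily simplified sketch of the underlying idea, treating a single rollout transition as Sarsa experience for a linear value function over distance-to-sprite features, could look as follows. The learning rate, discount factor and feature values are assumptions made for the example, and eligibility traces (the "true online" part of [51]) are omitted.

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.9  # assumed values, not taken from the paper

def sarsa_update(w, phi_s, a, reward, phi_next, a_next):
    """One-step Sarsa update for a linear value function Q(s, a) = w[a] . phi(s).
    Each transition observed inside an MCTS rollout is treated as experience."""
    q_sa = np.dot(w[a], phi_s)
    q_next = np.dot(w[a_next], phi_next)
    td_error = reward + GAMMA * q_next - q_sa
    w[a] += ALPHA * td_error * phi_s
    return w

# Example with 3 features (e.g. smallest distance to each sprite type) and 2 actions.
weights = {0: np.zeros(3), 1: np.zeros(3)}
weights = sarsa_update(weights, np.array([0.5, 1.0, 0.2]), 0, 1.0,
                       np.array([0.4, 0.9, 0.1]), 1)
```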
Evolution and MCTS have also been combined in different ways. In one of them, Bravi et al. [53] used a GP system to evolve different tree policies for MCTS. Concretely, the authors evolve a different policy for each one of the (five) games employed in the study, aiming to exploit the particular characteristics of each game. The results showed that the tree policy plays a very important role in the performance of the MCTS agent, although in most cases the performance is poor: none of the evolved heuristics performed better than the default UCB in MCTS.
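For reference, the default UCB1 tree policy that the evolved heuristics are compared against can be sketched as below, assuming a hypothetical node structure with visits, total_reward and children fields and at least one visit per child (unvisited children are normally expanded first).

```python
import math

def ucb1(node, c=math.sqrt(2)):
    """Standard UCB1 tree policy: pick the child maximising average reward
    plus an exploration term scaled by the constant c."""
    return max(
        node.children,
        key=lambda ch: (ch.total_reward / ch.visits)
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )
```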
Finally, Sironi et al. [54] designed three Self-Adaptive MCTS (SA-MCTS) agents that tune the parameters of MCTS (playout depth and exploration factor) online, using Naive Monte Carlo, an Evolutionary Algorithm (λ, µ) and the N-Tuple Bandit Evolutionary Algorithm (NTBEA) [55]. Results show that all tuning algorithms improve the performance of MCTS in games where vanilla MCTS performs poorly, while keeping a similar rate of victories in those where MCTS performs well. In a follow-up study, however, C. Sironi and M. Winands [56] extend the experimental study to show that online parameter tuning impacts performance in only a few GVGP games, with NTBEA improving performance significantly in only one of them. The authors conclude that online tuning is more suitable for games with longer budget times, as it struggles to improve performance in most GVGAI real-time games.

E. Hyper-heuristics / algorithm selection

Several authors have also proposed agents that use several algorithms but, rather than combining them into a single one, employ a higher-level decision process that determines which one of them should be used at each time.

Ross, in his MSc thesis [57], proposes an agent that is a combination of two methods. This approach uses A* with Enforced Hill Climbing to navigate through the game at a high level and switches to MCTS when in close proximity to the goal. The work highlights the problems of computing paths in the short time budget allowed, but indicates that goal targeting with path-finding, combined with local maneuvering using MCTS, does provide good performance in some of the games tested.

Joppen et al. implemented YOLOBOT [19], arguably the most successful GVGAI agent to date, as it has won several editions of the competition. Their approach consists of a combination of two methods: a heuristic Best First Search (BFS) for deterministic environments and MCTS for stochastic games. Initially, the algorithm employs BFS until the game is deemed stochastic, an optimal solution is found or a certain game tick threshold is reached, extending the search through several consecutive frames if needed. Unless the optimal sequence of actions is found, the agent will execute an enhanced MCTS consisting of informed priors and rollout policies, backtracking, early cutoffs and pruning. The resulting agent has consistently shown a good level of play in multiple game sets of the framework.
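The high-level control flow of such an agent can be sketched as follows. This is not YOLOBOT's actual code; the determinism test and the ForwardModel methods used here (copy(), advance(), signature()) are assumptions made for the example.

```python
def appears_deterministic(state, actions, trials=5):
    """Heuristic determinism test: advancing copies of the same state with the
    same action should always produce identical observations. A real agent
    would compare richer state signatures and amortise the test over frames."""
    for a in actions:
        outcomes = set()
        for _ in range(trials):
            nxt = state.copy()
            nxt.advance(a)
            outcomes.add(nxt.signature())
        if len(outcomes) > 1:
            return False
    return True

def choose_planner(state, actions):
    """Hyper-heuristic switch in the spirit of YOLOBOT / Return42:
    exhaustive search for deterministic games, sampling-based search otherwise."""
    return "BFS" if appears_deterministic(state, actions) else "MCTS"
```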
Another hyper-heuristic approach, also winner of one of the 2015 editions of the competition (Return42, see Table I), determines first if the game is deterministic or stochastic. In the case of the former, A* is used to direct the agent to sprites of interest. Otherwise, random walks are employed to navigate through the level [18].

The fact that this type of portfolio agent has shown very promising results has triggered more research into hyper-heuristics and game classification. The work by Bontrager et al. [58] used K-means to cluster games and algorithms according to game features derived from the types of sprites declared in the VGDL description files. The resulting classification seemed to follow a difficulty pattern, with 4 clusters that grouped games won by the agents at different rates.

Mendes et al. [59] built a hyper-agent which automatically selected an agent from a portfolio of agents to play each individual game, and tested it on the GVGAI framework. This approach employed game-based features to train different classifiers (Support Vector Machines - SVM, Multi-layer Perceptrons, Decision Trees - J48, among others) in order to select which agent should be used for playing each game. Results show that the SVM and J48 hyper-heuristics obtained a higher victory rate than the single agents separately.
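A minimal sketch of this kind of hyper-agent is shown below. The game features, the agent portfolio and the labels are invented for the example, and scikit-learn's SVC is used as a stand-in rather than the authors' exact pipeline.

```python
from sklearn.svm import SVC

# Hypothetical training data: one row of game features per game
# (e.g. number of sprite types, number of NPCs, number of resources).
X_train = [[12, 3, 0], [4, 0, 2], [20, 6, 1], [6, 1, 0]]
# Label: which portfolio agent performed best on that game.
y_train = ["MCTS", "BFS", "RHEA", "MCTS"]

selector = SVC(kernel="rbf").fit(X_train, y_train)

def pick_agent(game_features):
    """Return the portfolio member predicted to perform best on an unseen game."""
    return selector.predict([game_features])[0]

print(pick_agent([10, 2, 1]))
```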
Horn et al. [48] (described earlier in Section IV-D) also include an analysis of game features and difficulty estimation. The authors suggest that the multiple enhancements that are constantly attempted in many algorithms could potentially be switched on and off depending on the game that is being played, with the objective of dynamically adapting to the present circumstances.

Ashlock et al. [18] suggest the possibility of creating a classification of games, based on the performance of multiple agents (and their variations: different enhancements, heuristics, objectives) on them. Furthermore, this classification needs to be stable, in order to accommodate the ever-increasing collection of games within the GVGAI framework, but also flexible enough to allow a hyper-heuristic algorithm to choose the version that best adapts to unseen games.

Finally, R. Gaina et al. [60] took a first step towards algorithm selection from a different angle. The authors trained several classifiers on agent log data across 80 games of the GVGAI framework, obtained only from player experience (i.e. features extracted from the way the search was conducted, rather than potentially human-biased game features), to determine if the game will be won or not at the end. Three models are trained, for the early, mid and late game, respectively, and tested on previously unseen games. Results show that these predictors are able to foresee, with high reliability, if the agent is going to lose or win the game. These models would therefore allow to indicate when and if the algorithm used to play the game should be changed.

V. METHODS FOR TWO-PLAYER PLANNING

This section covers agents developed by researchers within the Two-Player Planning setting. Most of these entries have been submitted to the Two-Player Planning track of the competition [12]. Two methods stood out as the base of most entries received so far: Monte Carlo Tree Search (MCTS) and Evolutionary Algorithms (EA) [13]. On the one hand, MCTS performed better in cooperative games, as well as showing the ability to adapt better to asymmetric games, which involved a role switch between matches in the same environment. EAs, on the other hand, excelled in games with long lookaheads, such as puzzle games, which rely on a specific sequence of moves being identified.

Counterparts of the basic methods described in Section IV-A are available in the framework for the Two-Player track as well, the only difference being in the One Step Lookahead agent, which requires an action to be supplied for the opponent when simulating game states. The opponent model used by the sample agent assumes the opponent will perform a random move (with the exception of those actions that would cause a loss of the game).
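A minimal sketch of such an opponent model is shown below, assuming a hypothetical two-player forward-model API (copy(), advance() taking one action per player, loses(player_id)); the real sample agent is implemented in Java within the framework.

```python
import random

def random_non_losing_action(state, opponent_actions, opponent_id):
    """Pick a random opponent action, excluding those that would make the
    opponent lose immediately. Used when simulating game states."""
    safe = []
    for a in opponent_actions:
        nxt = state.copy()
        nxt.advance({opponent_id: a})  # assume the other player stays idle
        if not nxt.loses(opponent_id):
            safe.append(a)
    return random.choice(safe) if safe else random.choice(opponent_actions)
```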
A. Tree Search methods

Most of the competition entries in the 2016 season were based on MCTS (see Section IV-B).

Some entries employed an Open Loop version, which only stores statistics in the nodes of the tree and not game states, therefore needing to simulate through the actions at each iteration for a potentially more accurate evaluation of the possible game states. As this is unnecessarily costly in deterministic games, some entries such as MaasCTS2 and YOLOBOT switched to Breadth-First Search in such games after an initial analysis of the game type, a method which has shown the ability to find the optimal solution if the game lasts long enough.

Enhancements brought to MCTS include generating value maps, either regarding physical positions in the level or higher-level concepts (such as higher values being assigned to states where the agent is closer to objects it has not interacted with before, or to interesting targets as determined by controller-specific heuristics). The winner of the 2016 WCCI leg, ToVo2, also employed dynamic Monte Carlo rollout length adjustments (increased with the number of iterations to encourage further lookahead if the budget allows) and weighted rollouts (with the weights per action generated randomly at the beginning of each rollout).

All agents use online learning in one way or another (the simplest form being the base Monte Carlo Tree Search backups, used to gather statistics about each action through multiple simulations), but only the overall 2016 Championship winner, adrienctx, uses offline learning on the supplied training set to tune the parameters of the Stochastic Gradient Descent function it employs: the learning rate and the mini-batch size.
B. Evolutionary methods

Two of the 2016 competition entries used an EA technique, rather than MCTS, as their base: Number27 and CatLinux [13].

Number27 was the winner of the CIG 2016 leg, with the controller placing 4th overall in the 2016 Championship. Number27 uses a Genetic Algorithm, with one population containing individuals which represent fixed-length action sequences. The main improvement it features on top of the base method is the generation of a value heat-map, used to encourage the agent's exploration towards interesting parts of the level. The heat-map is initialized based on the inverse frequency of each object type (therefore, the higher the object count, the lower the value), including a range of influence on nearby tiles. The event history is used to evaluate game objects during simulations and to update the value map.

CatLinux was not a top controller in either of the individual legs run in 2016, but placed 5th overall in the Championship. This agent uses a Rolling Horizon Evolutionary Algorithm (RHEA). A shift buffer enhancement is used to boost performance: instead of discarding the population evolved during one game tick, it is kept for the next one, with each action sequence shifted one action to the left (therefore removing the previous game step) and a new random action added at the end to complete the individual to its fixed length.
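The shift buffer operation itself is simple; a sketch over lists of action indices (the encoding is an assumption for the example) is given below.

```python
import random

def shift_population(population, actions):
    """Shift buffer: keep the population between game ticks, dropping the action
    that was just played and appending a new random action to preserve length."""
    return [individual[1:] + [random.choice(actions)] for individual in population]

# Example: two individuals of length 5 over actions 0..3
pop = [[0, 1, 2, 3, 0], [2, 2, 1, 0, 3]]
pop = shift_population(pop, actions=[0, 1, 2, 3])
```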
No offline learning is used by any of the EA agents, although there could be scope for improvement through parameter tuning (offline or online).

C. Opponent model

Most agents submitted to the Two-Player competition use completely random opponent models. Some entries have adopted the method integrated within the sample One Step Lookahead controller, choosing a random but non-losing action. In the 2016 competition, webpigeon assumed the opponent would always cooperate, and therefore play a move beneficial to the agent. MaasCTS2 used the only advanced model at the time: it remembered Q-values for the opponent actions during simulations and added them to the statistics stored in the MCTS tree nodes; an ε-greedy policy was used to select opponent actions based on the Q-values recorded. This provided a boost in performance on the games of the WCCI 2016 leg, but it did not improve the controller's position in the rankings for the following CIG 2016 leg.

Opponent models were identified in [13] as an area to explore further, and Gonzalez and Perez-Liebana looked at 9 different models integrated within the sample MCTS agent provided with the framework [61]. Alphabeta builds a tree incrementally, returning the best possible action at each time tick, while Minimum returns the worst possible action. Average uses a similar tree structure, but it computes the average reward over all the actions and returns the action closest to that average. Fallible returns the best possible action with probability p = 0.8, and the action with the minimum reward otherwise. Probabilistic involved offline learning over 20 games of the GVGAI framework in order to determine the probability of an MCTS agent selecting each action, and then using these probabilities to determine the opponent action while playing online. Same Action returns the same action the agent plays, while Mirror returns its opposite. Finally, LimitedBuffer records the last n = 20 actions performed by the player and builds probabilities of selecting the next action based on this data, while UnlimitedBuffer records the entire history of actions during the game. When all 9 opponent models were tested in a round-robin tournament against each other, the probabilistic models achieved the highest win rates, with two models, Probabilistic and UnlimitedBuffer, outperforming a random opponent model.

VI. METHODS FOR SINGLE-PLAYER LEARNING

The GVGAI framework has also been used from an agent learning perspective. In this setting, the agents do not use the forward model to plan ahead the actions to execute in the real game. Instead, the algorithms learn the games by playing them repeatedly (as episodes in Reinforcement Learning), ideally improving their performance progressively.

This section first describes the approaches that tackled the challenge set in the single-player learning track of the competition (described in Section III-C), and then moves to those not adhering to the competition format (Section VI-B).

A. 2017 competition entries

Some of the entries are available in the framework⁵.

⁵ https://github.com/GAIGResearch/GVGAI/tree/master/clients/

1) Random agent: A sample random agent, which selects an action uniformly at random at every game tick, is included in the framework (in both Java and Python) for testing purposes. This agent is also meant to be taken as a baseline: a learner is expected to perform better than an agent which acts randomly and does not undertake any learning.

2) Multi-armed bandit algorithms: DontUnderestimateUchiha by Kunanusont is based on two popular Multi-Armed Bandit (MAB) algorithms: the ε-Decreasing Greedy Algorithm [62] and UCB [33]. At any game tick T, the current best action a∗ is picked with probability 1 − ε_T; otherwise, an action is selected uniformly at random. The best action a∗ at time T is defined as in Equation 1,

$$ a^{*} = \arg\max_{a \in A} \left( \widehat{\Delta score}_a + \sqrt{\frac{2 \log T}{t_a}} \right) \qquad (1) $$

where t_a denotes the number of times that action a has been selected, and \widehat{\Delta score}_a denotes the empirical mean increment in score obtained by applying action a so far.

This is a very interesting combination, as the UCB-style selection (Equation 1) and the ε-Decreasing Greedy Algorithm both aim at balancing the trade-off between exploiting the best-so-far action and exploring others. Additionally, ε_0 is set to 0.5 and decreases slowly over time, formalized as ε_T = ε_0 − 0.0001T. According to the competition setting, no game will last longer than 2,000 game ticks, so ∀T ∈ {1, . . . , 2000}, 0.5 ≥ ε_T ≥ 0.3. As a result, random decisions are made approximately 40% of the time.
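A direct sketch of this selection scheme, combining the ε-decreasing schedule with the UCB-style score of Equation 1, is given below. The bookkeeping of score increments is an assumption about how the statistics could be stored, not a description of the actual entry.

```python
import math
import random

class EpsilonDecreasingUCB:
    """Action selection: exploit the Equation-1 score with probability 1 - eps_T,
    otherwise pick uniformly at random, with eps_T = eps0 - decay * T."""

    def __init__(self, actions, eps0=0.5, decay=0.0001):
        self.actions = actions
        self.eps0, self.decay = eps0, decay
        self.counts = {a: 0 for a in actions}      # t_a
        self.mean_gain = {a: 0.0 for a in actions} # empirical mean score increment
        self.T = 0

    def select(self):
        self.T += 1
        eps = self.eps0 - self.decay * self.T
        if random.random() < 1.0 - eps:
            untried = [a for a in self.actions if self.counts[a] == 0]
            if untried:                            # guard against division by zero
                return random.choice(untried)
            return max(self.actions, key=lambda a: self.mean_gain[a]
                       + math.sqrt(2 * math.log(self.T) / self.counts[a]))
        return random.choice(self.actions)

    def update(self, action, score_increment):
        self.counts[action] += 1
        n = self.counts[action]
        self.mean_gain[action] += (score_increment - self.mean_gain[action]) / n
```

In use, select() would be called once per game tick and update() fed with the observed change in game score after applying the chosen action.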
3) State-Action-Reward-State-Action: sampleLearner and ercumentilhan are based on the State-Action-Reward-State-Action (SARSA) algorithm. Both agents choose to use a subset of the whole game state information to build a new state, reducing the amount of information to be saved and allowing similar situations to be treated alike. The main difference is that the former uses a square region of fixed size centered at the avatar's position, while the latter uses a first-person view with a fixed distance.

4) Q-learning: kkunan, by K. Kunanusont, is a simple Q-learning agent using most of the avatar's current information as features, with a few exceptions (such as the avatar's health and the screen size, as these elements vary greatly from game to game). The reward at game tick t + 1 is defined as the difference between the game score at game tick t + 1 and the one at t. The learning rate α and discount factor γ are manually set to 0.05 and 0.8. During the learning phase, a random action is performed with probability ε = 0.1; otherwise, the best action is selected. During the validation phase, the best action is always selected. Despite its simplicity, it won the only edition of this track to date, being the only entry that ranked above the sample random agent.
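A tabular sketch using the reported hyper-parameters is shown below; the state encoding is deliberately simplified to any hashable tuple of avatar features, whereas the actual entry builds its features from the avatar's observation.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.05, 0.8, 0.1  # values reported for kkunan

class SimpleQLearner:
    """Tabular Q-learning on a hashable tuple of avatar features; the real
    agent's exact feature encoding is not reproduced here."""

    def __init__(self, actions):
        self.actions = actions
        self.q = defaultdict(float)        # (state, action) -> value

    def act(self, state, learning=True):
        if learning and random.random() < EPSILON:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, prev_score, new_score, next_state):
        reward = new_score - prev_score    # score difference between ticks
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + GAMMA * best_next
        self.q[(state, action)] += ALPHA * (target - self.q[(state, action)])
```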
5) Tree search methods: YOLOBOT is an adaptation of the YOLOBOT planning agent (described previously in Section IV-E). As the forward model is no longer accessible in the learning track, MCTS is substituted by a greedy algorithm that picks the action which most reduces the distance to the chosen object. According to the authors, the poor performance of YOLOBOT in the learning track, contrary to its success in the planning tracks, was due to their own collision model, which did not work well.

B. Other learning agents

One of the first works that used this framework as a learning environment was carried out by Samothrakis et al. [63], who employed Neuro-Evolution in 10 games of the benchmark. Concretely, the authors experimented with Separable Natural Evolution Strategies (S-NES) using two different policies (ε-greedy versus softmax) and a linear function approximator versus a neural network as a state evaluation function. Features like score, game status, avatar and other sprite information were used to evolve learners during 1000 episodes. Results show that ε-greedy with a linear function approximator was the better combination to learn how to maximize scores on each game.

Braylan and Miikkulainen [64] performed a study in which the objective was to learn a forward model in 30 games: predicting the next state from the current one plus an action, where the state is defined as a collection of attribute values of the sprites (spawns, directions, movements, etc.), by means of logistic regression. Additionally, the authors transfer the learnt object models from game to game, under the assumption that many mechanics and behaviours are transferable between them. Experiments showed the effective value of object model transfer in the accuracy of learning