Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents

Yunqi Zhao,* Igor Borovikov,* Fernando de Mesentier Silva,* Ahmad Beirami,* Jason Rupert, Caedmon Somers, Jesse Harder, John Kolen, Jervis Pinto, Reza Pourabolghasem, James Pestrak, Harold Chaput, Mohsen Sardari, Long Lin, Sundeep Narravula, Navid Aghdaie, and Kazi Zaman

arXiv:1903.10545v2 [cs.AI] 20 Aug 2019

This paper was presented in part at the 14th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE 18) [1], the NeurIPS 2018 Workshop on Reinforcement Learning under Partial Observability [2], [3], the AAAI 2019 Workshop on Reinforcement Learning in Games (Honolulu, HI) [4], the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering [5], the ICML 2019 Workshop on Human in the Loop Learning [6], the ICML 2019 Workshop on Imitation, Intent, and Interaction [7], the 23rd Annual Signal and Image Sciences Workshop at Lawrence Livermore National Laboratory, 2019 [8], and GitHub, 2019 [9]. J. Rupert and C. Somers are with EA Sports, Electronic Arts, 4330 Sanderson Way, Burnaby, BC V5G 4X1, Canada. The rest of the authors are with EA Digital Platform – Data & AI, Electronic Arts, Redwood City, CA 94065 USA. *Y. Zhao, I. Borovikov, FDM Silva and A. Beirami contributed equally to this paper. E-mails: {yuzhao, iborovikov, fdemesentiersilva, abeirami}@ea.com.

Abstract—Recently, there have been several high-profile achievements of agents learning to play games against humans and beat them. In this paper, we study the problem of training intelligent agents in service of game development. Unlike the agents built to “beat the game”, our agents aim to produce human-like behavior to help with game evaluation and balancing. We discuss two fundamental metrics based on which we measure the human-likeness of agents, namely skill and style, which are multi-faceted concepts with practical implications outlined in this paper. We discuss how this framework applies to multiple games under development at Electronic Arts, followed by some of the lessons learned.

Index Terms—Artificial intelligence; playtesting; non-player character (NPC); multi-agent learning; reinforcement learning; imitation learning; deep learning; A* search.

I. INTRODUCTION

The history of artificial intelligence (AI) can be mapped by its achievements playing and winning various games. From the early days of Chess-playing machines to the most recent accomplishments of Deep Blue [10], AlphaGo [11], and AlphaStar [12], game-playing AI has advanced from competent, to competitive, to champion in even the most complex games. Games have been instrumental in advancing AI, most notably in recent times through tree search and deep reinforcement learning (RL).

Footnote 1: We refer to game-playing AI as any AI solution that powers an agent in the game. This can range from scripted AI solutions all the way to state-of-the-art deep reinforcement learning agents.

Complementary to these great efforts on training high-skill game-playing agents, our primary goal at Electronic Arts is to train agents that assist in the game design process, which is iterative and laborious. The complexity of modern games steadily increases, making the corresponding design tasks even more challenging. To support designers in this context, we train game-playing AI agents to perform tasks ranging from automated playtesting to interaction with human players, tailored to enhance game-play immersion.

Fig. 1. A depiction of the possible ranges of AI agents and the possible tradeoff/balance between skill and style. In this tradeoff, there is a region that captures human-like skill and style. AI agents may not necessarily land in the human-like region. High-skill AI agents land in the green region while their style may fall out of the human-like region.

To approach the challenge of creating agents that generate meaningful interactions that inform game developers, we propose techniques to model different behaviors. Each of these has to strike a different balance between style and skill. We define skill as how efficient the agent is at completing the task it is designed for. Style is loosely defined as how the player engages with the game and what makes the player enjoy their game-play. Defining and gauging skill is usually much easier than doing the same for style. However, in this paper we attempt to evaluate the style of an artificial agent using statistical properties of the underlying model.

One of the most crucial tasks in game design is the process of playtesting. Game designers usually rely on playtesting sessions and the feedback they receive from playtesters to make design choices in the game development process. Playtesting is performed to guarantee quality game-play that is free of game-breaking exceptions (e.g., bugs and glitches) and delivers the experience intended by the designers. Since games are complex entities with many moving parts, solving this multi-faceted optimization problem is even more challenging. An iterative loop, where data is gathered from the game by one or more playtesters and then analyzed by the designers, is repeated
many times throughout the game development process.

To mitigate this expensive process, one of our major efforts is to implement agents that can help automate aspects of playtesting. These agents are meant to play through the game, or a slice of it, trying to explore behaviors that can generate data to assist in answering questions that designers pose. These can range from exhaustively exploring certain sequences of actions, to trying to play a scenario from start to finish in the fewest actions possible. We showcase use-cases focused on creating AI agents to playtest games at Electronic Arts and discuss the related challenges.

Another key task in game development is the creation of in-game characters that are human-facing and interact with real human players. Agents must be trained and delicate tuning has to be performed to guarantee a quality experience. An AI adversary that reacts within a small number of frames can be deemed unfair rather than challenging. On the other hand, a pushover agent might be an appropriate introductory opponent for novice players, but it fails to retain player interest after a handful of matches. While traditional AI solutions are already providing excellent experiences for the players, it is becoming increasingly difficult to scale those traditional solutions up as the game worlds become larger and the content becomes dynamic.

As Fig. 1 shows, there is a range of style/skill pairs that are achievable by human players, and hence called human-like. On the other hand, high-skill game-playing agents may have an unrealistic style rating if they rely on high computational power and memory size, unachievable by humans. Efforts to evaluate techniques that emulate human-like behavior have been presented [13], but measuring non-objective metrics such as fun and immersion is an open research question [14], [15]. Further, we cannot evaluate player engagement prior to the game launch, so we rely on our best approximation: designer feedback. Through an iterative process, the designers evaluate the game-play experience by interacting with the AI agents to measure whether the intended game-play experience is provided.

These challenges each require a unique equilibrium between style and skill. Certain agents could take advantage of superhuman computation to perform exploratory tasks, most likely relying more heavily on skill. Others need to interact with human players, requiring a style that won’t break player immersion. Then there are agents that need to play the game cooperatively with players, which makes them rely on the much more delicate balance that is required to pursue a human-like play style. Each of these individual problems calls for a different approach and poses significant challenges. Pursuing human-like style and skill can be as challenging (if not more so) than achieving high-performance agents.

Finally, training an agent to satisfy a specific need is often more efficient than trying to reach such a solution through high-skill AI agents. This is the case, for example, when using game-playing AI to automatically run multiple playthroughs of a specific in-game scenario to trace the origin of an unintended game-play behavior. In this scenario, an agent that explores the game space would potentially be a better option than one that reaches the goal state of the level more quickly.

Another advantage of creating specialized AI is the cost of implementation and training. The agents needed for these tasks are commonly of smaller complexity than their optimal-play alternatives, making them easier to implement as well as faster to train.

To summarize, we mainly pursue two use-cases for having AI agents enhance the game development process.
  1) playtesting AI agents to provide feedback during game design, particularly when a large number of concurrent players are to interact in huge game worlds.
  2) game-playing AI agents to interact with real human players to shape their game-play experience.

The rest of the paper is organized as follows. In Section II, we review the related work on training agents for playtesting and NPCs. In Section III, we describe our training pipeline. In Sections IV and V, we provide four case studies that cover playtesting and game-playing, respectively. These studies are performed to help with the development process of multiple games at Electronic Arts. These games vary considerably in many aspects, such as the game-play platform, the target audience, and the engagement duration. The solutions in these case studies were created in constant collaboration with the game designers. The first case study, in Section IV-A, which covers game balancing and playtesting, was done in conjunction with the development of The Sims Mobile. The other case studies are performed on games that were still under development at the time this paper was written. Hence, we purposely omit specific details regarding them to comply with company confidentiality. Finally, the concluding remarks are provided in Section VI.

II. RELATED WORK

A. Playtesting AI agents

To validate their design, game designers conduct playtesting sessions. Playtesting consists of having a group of players interact with the game in the development cycle to not only gauge the engagement of players, but also to discover elements and states that result in undesirable outcomes. As a game goes through the various stages of development, it is essential to continuously iterate and improve the relevant aspects of the game-play and its balance. Relying exclusively on playtesting conducted by humans can be costly and inefficient. Artificial agents could perform much faster play sessions, allowing the exploration of much more of the game space in much shorter time. This becomes even more valuable as game worlds grow large enough to hold tens of thousands of simultaneously interacting players. Games at this scale render traditional human playtesting infeasible.

Recent advances in the field of RL, when applied to playing computer games, assume that the goal of a trained agent is to achieve the best possible performance with respect to clearly defined rewards while the game itself remains fixed for the foreseeable future. In contrast, during game development the objectives and the settings are quite different and vary over time. The agents can play a variety of roles with rewards that are not obvious to define formally, e.g., an objective of an agent exploring a game level is different from foraging,
defeating all adversaries, or solving a puzzle. Also, the game environment changes frequently between game builds. In such settings, it is desirable to quickly train agents that help with automated testing, data generation for game balance evaluation, and wider coverage of the game-play features. It is also desirable that the agent be mostly re-usable as the game build is updated with new appearance and game-play features. Following the direct path of throwing computational resources combined with substantial engineering effort at training agents in such conditions is far from practical and calls for a different approach.

The idea of using artificial agents for playtesting is not new. Algorithmic approaches have been proposed to address the issue of game balance in board games [16], [17] and card games [18], [19], [20]. More recently, Holmgard et al. [21], as well as Mugrai et al. [22], built variants of MCTS to create a player model for AI-agent-based playtesting. Guerrero-Romero et al. created different goals for general game-playing agents in order to playtest games emulating players of different profiles [23]. These techniques are relevant to creating rewarding mechanisms for mimicking player behavior. AI and machine learning can also play the role of a co-designer, making suggestions during the development process [24]. Tools for creating game maps [25] and level design [26], [27] have also been proposed. See [28], [29] for a survey of these techniques in game design.

In this paper, we describe our framework that supports game designers with automated playtesting. This also entails a training pipeline that universally applies this framework to a variety of games. We then provide two case studies that entail different solution techniques.

B. Game-playing AI agents

Game-playing AI has been a main constituent of games since the dawn of video gaming. Analogously, games, given their challenging nature, have been a target for AI research [30]. Over the years, AI agents have become more sophisticated and have been providing excellent experiences to millions of players as games have grown in complexity. Scaling traditional AI solutions in ever-growing worlds with thousands of agents and dynamic content is a challenging problem calling for alternative approaches.

The idea of using machine learning for game-playing AI dates back to Arthur Samuel [31], who applied some form of tree search combined with basic reinforcement learning to the game of checkers. His success motivated researchers to target other games using machine learning, and particularly reinforcement learning.

IBM Deep Blue followed the tree search path and was the first artificial game agent to beat the chess world champion, Garry Kasparov [10]. A decade later, Monte Carlo Tree Search (MCTS) [32], [33] was a big leap in AI for training game agents. MCTS agents for playing Settlers of Catan were reported in [34], [35] and shown to beat previous heuristics. Other work compares multiple agent approaches to one another on the two-player variant of Carcassonne and discusses variations of MCTS and Minimax search for playing the game [36]. MCTS has also been applied to the game of 7 Wonders [37] and Ticket to Ride [38]. Furthermore, Baier et al. biased MCTS with a player model, extracted from game-play data, to obtain an agent that was competitive while approximating human-like play [39]. Tesauro [40], on the other hand, used TD-Lambda, a temporal difference RL algorithm, to train Backgammon agents at a superhuman level. The impressive recent progress on RL to solve video games is partly due to the advancements in processing power and AI computing technology. More recently, following the success stories in deep learning, deep Q-networks (DQNs) use deep neural networks as function approximators within Q-learning [42]. DQNs can use convolutional function approximators as a general representation learning framework from the pixels in a frame buffer, without the need for task-specific feature engineering.

Footnote 2: The amount of AI compute has been doubling every 3-4 months in the past few years [41].

DeepMind researchers remarried the two approaches by showing that DQNs combined with MCTS lead to AI agents that play Go at a superhuman level [11], and solely via self-play [43], [44]. Subsequently, OpenAI researchers showed that a policy optimization approach with function approximation, called Proximal Policy Optimization (PPO) [45], leads to training agents at a superhuman level in Dota 2 [46]. Cuccu et al. proposed learning policies and state representations individually, but at the same time, and did so using two novel algorithms [47]. With this approach they were able to play Atari games with neural networks of 18 neurons or fewer. Recently, highly publicized progress was reported by DeepMind on StarCraft II, where AlphaStar was unveiled to play the game at a superhuman level by combining a variety of techniques, including attention networks [12].

III. TRAINING PIPELINE

To train AI agents efficiently, we have developed a unified training pipeline that is applicable to all EA games, regardless of the platform and the genre of the game. In this section, we present the training pipeline that is used for solving the case studies presented in the sections that follow.

Fig. 2. The AI agent training pipeline, which consists of two main components: 1) the game-play environment; and 2) the agent environment. The agent submits actions to the game-play environment and receives back the next state.

A. Gameplay and Agent Environments

The AI agent training pipeline, which is depicted in Fig. 2, consists of two key components:
  • Gameplay environment refers to the simulated game world that executes the game logic with the actions submitted by the agent every timestep and produces the next state.
  • Agent environment refers to the medium where the agent interacts with the game world. The agent observes the game state and produces an action. This is where training occurs. Note that in the case of reinforcement learning, the reward computation and shaping also happen in the agent environment.

In practice, the game architecture can be complex, and it might be too costly for the game to directly communicate the complete state space information to the agent at every timestep. To train artificial agents, we create a universal interface between the game-play environment and the learning environment. The interface extends OpenAI Gym [48] and supports actions that take arguments, which is necessary to encode action functions and is consistent with PySC2 [49], [50]. In addition, our training pipeline enables creating new players on the game server, logging an existing player in/out, and gathering data from expert demonstrations. We also adapt Dopamine [51] to this pipeline to make DQN [42] and Rainbow [52] agents available for training in the game. Additionally, we add support for preprocessing more complex than the usual frame-buffer stacking, which we explicitly exclude following the motivation presented in the next section.

Footnote 3: Note that the reward is usually defined by the user as a function of the state and action outside of the game-play environment.

Footnote 4: These environments may be physically separated, and hence, we prefer a thin (i.e., headless) client that supports fast cloud execution and is not tied to frame rendering.
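To make this interface concrete, the following is a minimal sketch of what a Gym-style wrapper around a remote game server could look like. The GameServerClient, its methods, and the (function_id, args) action encoding are hypothetical stand-ins chosen for illustration, not the actual EA implementation.

```python
import gym
from gym import spaces

class GameServerEnv(gym.Env):
    """Sketch of a Gym-style wrapper around a remote game-play environment.

    Actions are (function_id, args) pairs, similar in spirit to PySC2's
    action functions, so that actions can take arguments.
    """

    def __init__(self, client, feature_keys):
        self.client = client              # hypothetical RPC client to the game server
        self.feature_keys = feature_keys  # engineered features exposed by the game engine
        self.action_space = spaces.Tuple((
            spaces.Discrete(client.num_action_classes),   # which action function to call
            spaces.Box(low=-1.0, high=1.0, shape=(4,)),    # its (normalized) arguments
        ))
        self.observation_space = spaces.Box(
            low=-float("inf"), high=float("inf"), shape=(len(feature_keys),))

    def reset(self):
        raw_state = self.client.create_player_and_start_episode()
        return self._encode(raw_state)

    def step(self, action):
        function_id, args = action
        raw_state, events = self.client.submit_action(function_id, args)
        reward = self._reward(raw_state, events)   # reward shaping lives on the agent side
        done = events.get("episode_done", False)
        return self._encode(raw_state), reward, done, {"events": events}

    def _encode(self, raw_state):
        # Compact, engineered state abstraction instead of frame buffers.
        return [float(raw_state[k]) for k in self.feature_keys]

    def _reward(self, raw_state, events):
        # Placeholder: task-specific reward defined outside the game-play environment.
        return 0.0
```

The split mirrors Fig. 2: the wrapper only ferries compact state and parameterized actions between the two environments, while all reward computation stays on the agent side.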
B. State Abstraction

The use of the frame buffer as an observation of the game state has proved advantageous in eliminating the need for manual feature engineering in Atari games [42]. However, to achieve the objectives of RL in a fast-paced game development process, the drawbacks of using the frame buffer outweigh its advantages. The main considerations that we take into account when deciding in favor of a lower-dimensional engineered representation of the game state are:
 (a) During almost all stages of game development, the game parameters are evolving on a daily basis. In particular, the art may change at any moment and the look of already learned environments can change overnight. Hence, it is desirable to train agents using features that are more stable, to minimize the need for retraining agents.
 (b) Another important advantage of state abstraction is that it allows us to train much smaller models (networks) because of the smaller input size and the use of carefully engineered features. This is critical for deployment in real-time applications on console and consumer PC environments, where rendering, animation, and physics occupy much of the GPU and CPU power.
 (c) In playtesting, the game-play environment and the learning environment may reside on physically separate nodes. Naturally, closing the RL state-action-reward loop in such environments requires a lot of network communication. The presence of frame buffers as the representation of the game state would significantly increase this communication cost, whereas derived game-state features enable more compact encodings.
 (d) Obtaining an artificial agent in a reasonable time (a few hours at most) usually requires that the game be clocked at a rate much higher than the usual game-play speed. As rendering each frame takes a significant portion of every frame's time, overclocking with rendering enabled is not practical. Additionally, moving large amounts of data from GPU to main memory drastically slows down the game execution and can potentially introduce simulation artifacts by interfering with the target timestep rate.
 (e) Last but not least, we can leverage the advantage of having privileged access to the game code to let the game engine distill a compact state representation that could be inferred by a human player from the game, and pass it to the agent environment. By doing so we also have a better hope of learning in environments where the pixel frames only contain partial information about the state space.

The compact state representation could include the inventory, resources, buildings, the state of neighboring players, and the distance to the target. In an open-world shooter game, the features may include the distance to the adversary, the angle at which the agent approaches the adversary, the presence of a line of sight to the adversary, the direction to the nearest waypoint generated by the game navigation system, and other features. Feature selection may require some engineering effort, but it is logically straightforward after the initial familiarization with the game-play mechanics, and often similar to that of traditional game-playing AI, which will be informed by the game designer. We remind the reader that our goal is not to train agents that win but to simulate human-like behavior, so we train on information that would be accessible to a human player.
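As an illustration, a compact observation for the open-world shooter example above could be assembled as a small, named feature vector. The specific fields and their encodings below are hypothetical and only meant to show the shape of such an abstraction.

```python
from dataclasses import dataclass, fields

@dataclass
class ShooterObservation:
    """Engineered, human-inferable features used in place of a frame buffer."""
    distance_to_adversary: float
    angle_to_adversary: float        # radians, relative to the agent's heading
    has_line_of_sight: float         # 1.0 or 0.0
    direction_to_waypoint: float     # radians, from the game navigation system
    health: float                    # normalized to [0, 1]
    ammunition: float                # normalized to [0, 1]

    def to_vector(self):
        # Flat list consumed by the agent environment / function approximator.
        return [getattr(self, f.name) for f in fields(self)]

# Example: the game engine fills this in each timestep and sends it to the agent.
obs = ShooterObservation(12.5, 0.3, 1.0, -1.2, 0.8, 0.4)
assert len(obs.to_vector()) == 6
```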
IV. PLAYTESTING AI AGENTS

A. Measuring player experience for different player styles

In this section, we consider the early development of The Sims Mobile, whose game-play is about “emulating life”: players create avatars, called Sims, and conduct them through a variety of everyday activities. In this game, there is no single predetermined goal to achieve. Instead, players craft their own experiences, and the designer's objective is to evaluate different aspects of that experience. In particular, each player can pursue different careers and, as a result, will have a different experience and trajectory in the game. In this specific case study, the designer's goal is to evaluate whether the current tuning of the game achieves the intended balanced game-play experience across different careers. For example, different careers should prove similarly difficult to complete. We refer the interested reader to [1] for a more comprehensive study of this problem.

The game is single-player, deterministic, real-time, and fully observable, and its dynamics are fully known. We also have access to the complete game state, which is composed mostly of character and on-going action attributes. This simplified
case allows for the extraction of a lightweight model of the game (i.e., state transition probabilities). While this requires some additional development effort, we can achieve a dramatic speedup in training agents by avoiding (reinforcement) learning and resorting to planning techniques instead.

In particular, we use the A* algorithm for the simplicity of proposing a heuristic that can be tailored to the specific designer need by exploring the state transition graph, instead of the more expensive iterative processes, such as dynamic programming, and even more expensive Monte Carlo search based algorithms. The customizable heuristics and the target states corresponding to different game-play objectives, which represent the style we are trying to achieve, provide sufficient control to conduct various experiments and explore multiple aspects of the game.

Footnote 5: Unfortunately, in dynamic programming every node will participate in the computation, while it is often true that most of the nodes are not relevant to the shortest path problem, in the sense that they are unlikely candidates for inclusion in a shortest path [53].

Our heuristic for the A* algorithm is the weighted sum of the three main parameters that contribute to career progression: career level, current career experience points, and the number of completed career events. These parameters are directly related: to gain career levels, players have to accumulate career experience points, and to obtain experience, players have to complete career events. The weights are attributed based on the order of magnitude of each parameter. Since levels are the most important, career level receives the highest weight. The number of completed career events has the lowest weight because it is already partially factored into the amount of career points received so far.
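A minimal sketch of one way to implement such a heuristic is shown below: it scores the remaining gap to the career goal with order-of-magnitude weights, so states closer to completing the career are expanded first. The weight values and state fields are illustrative assumptions, not the tuning used for The Sims Mobile.

```python
# Order-of-magnitude weights: career level dominates, then career experience
# points, then completed career events (illustrative values only).
W_LEVEL, W_XP, W_EVENTS = 1000.0, 10.0, 1.0

def career_heuristic(state, goal):
    """A*-style cost-to-go estimate: weighted progress still missing.

    `state` and `goal` are dicts with 'level', 'xp' and 'events' entries.
    """
    return (W_LEVEL * max(goal["level"] - state["level"], 0)
            + W_XP * max(goal["xp"] - state["xp"], 0)
            + W_EVENTS * max(goal["events"] - state["events"], 0))
```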
We also compare the A* results to the results from an optimization problem over a subspace of utility-based policies, approximately solved with an evolution strategy (ES) [54]. Our goal, in this case, is to achieve a high environment reward against a selected objective, e.g., reach the end of a career track while maximizing earned career event points. We design the ES objective accordingly. The agent performs an action a based on a probabilistic policy by taking a softmax over the utility U(a, s) of the actions in a game state s. Utility here serves as an action selection mechanism to compactly represent a policy. In a sense, it is a proxy to a state-action value Q-function Q(a, s) in RL. However, we do not attempt to derive the utility from Bellman's equation and the actual environment reward R. Instead, we learn parameters that define U(a, s) to optimize the environment reward using the black-box ES optimization technique. In that sense, optimizing R by learning the parameters of U is similar to Proximal Policy Optimization (PPO), however in much more constrained settings. To this end, we design the utility of an action as a weighted sum of the immediate action rewards r(a) and costs c(a). These are vector-valued quantities and are explicitly present in the game tuning describing the outcome of executing such actions. The parameters evolved by the ES are the linear weights of the utility function U explained below and the temperature of the softmax function. An additional advantage of the proposed linear design of the utility function is a certain level of interpretability of the weights, which correspond to the utilities the agent perceives for the individual components of the resources or the immediate rewards. Such interpretability can guide changes to the tuning data.

Concretely, given the game state s, we design the utility U of an action a as

  U(s, a) = r(a) · v(s) + c(a) · w(s).

The immediate reward r(a) here is a vector that can include quantities like the amount of experience, the amount of career points earned for the action, and the events triggered by it. The cost c(a) is a vector defined similarly. The action costs specify the amounts of resources required to execute such an action, e.g., how much time, energy, hunger, etc. a player needs to spend to successfully trigger and complete the action. The design of the tuning data makes both quantities r and c depend only on the action itself. Since both the immediate reward r(a) and the cost c(a) are vector-valued, the products in the definition of U above are dot products. The vectors v(s) and w(s) introduce the dependence of the utility on the current game state and are the weights defining the relative contribution of the immediate resource costs and immediate rewards towards the current goals of the agent.

The inferred utilities of the actions depend on the state, since some actions are more beneficial in certain states than in others. E.g., triggering a career event while not having enough resources to complete it successfully may be wasteful, and an optimal policy should avoid it. The relevant state components s = (s_1, ..., s_k) include available commodities like energy and hunger, and a categorical event indicator (0 if outside of the event and 1 otherwise), wrapped into a vector. The total number of relevant dimensions here is k. We design the weights v_a(s) and w_a(s) as bi-linear functions with the coefficients p = (p_1, ..., p_k) and q = (q_1, ..., q_k) that we are learning:

  v_a(s) = Σ_{i=1..k} p_i s_i   and   w_a(s) = Σ_{i=1..k} q_i s_i.

To define the optimization objective J, we construct it as a function of the number of successfully completed events N and the number of attempted events M. We aim to maximize the ratio of successful to attempted events times the total number of successful events in the episode, as follows:

  J(N, M) = N (N + ε) / (M + ε),

where ε is a small number less than 1, eliminating division by zero when the policy fails to attempt any events. The overall optimization problem is:

  max_{p,q} J(N, M),

subject to the policy defined by actions selected with a softmax over their utilities parameterized with the (learned) parameters p and q.
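The following sketch shows how the utility-based softmax policy and the ES objective fit together. It interprets the bi-linear weights as evolved coefficient matrices P and Q (one row per reward/cost component); the action tuning dictionaries and this particular parameterization are illustrative assumptions, not the production setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def action_utility(P, Q, state, action):
    # U(s, a) = r(a) · v(s) + c(a) · w(s), with bi-linear state weights
    # v(s) = P @ s and w(s) = Q @ s (P, Q are the evolved coefficients).
    v = P @ state
    w = Q @ state
    return action["r"] @ v + action["c"] @ w

def softmax_policy(P, Q, temperature, state, actions):
    """Probabilistic action selection via a softmax over action utilities."""
    u = np.array([action_utility(P, Q, state, a) for a in actions]) / temperature
    u -= u.max()                                  # numerical stability
    probs = np.exp(u) / np.exp(u).sum()
    return rng.choice(len(actions), p=probs)

def episode_objective(num_successful, num_attempted, eps=1e-3):
    """J(N, M) = N (N + eps) / (M + eps): favor many successful events
    while penalizing wasted attempts (eps < 1 avoids division by zero)."""
    return num_successful * (num_successful + eps) / (num_attempted + eps)
```

A black-box ES loop would then perturb (P, Q, temperature), roll out episodes with softmax_policy, score them with episode_objective, and keep the best-performing parameters.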
The utility-based ES, as we describe it here, captures the design intention of driving career progression in the game-play by successful completion of career events. Due to the emphasis on event completion, our evolution strategy setup does not necessarily result in an optimization problem equivalent to the one we solve with A*. However, as we discuss below, it has a similar optimum most of the time, supporting the design view on the progression. A similar approach works for evaluating relationship progression, which is another important element of the game-play.

Fig. 3. Comparison of the average number of career actions (appointments) taken to complete the career using A* search and the evolution strategy, adapted from [1].

We compare the number of actions that it takes to reach the goal for each career in Fig. 3, as computed by the two approaches. We emphasize that our goal is to show that a simple planning method, such as A*, can sufficiently satisfy the designer's goal in this case. We can see that the more expensive optimization-based evolution strategy performs similarly to the much simpler A* search.

The largest discrepancy arises for the Barista career, which might be explained by the fact that this career has an action that does not reward experience by itself, but rather enables another action that does. This action can be repeated often and can explain the high numbers despite the career having half the number of levels. Also, we observe that in the case of the medical career, the 2,000-node A* cutoff was potentially responsible for the underperformance of that solution.

When running the two approaches, another point of comparison can be made: how many sample runs are required to obtain statistically significant results? We performed 2,000 runs for the evolution strategy, while it is notable that the A* agent learns a deterministic playstyle, which has no variance. On the other hand, the agent trained using an evolution strategy has a high variance and requires a sufficiently high number of runs of the simulation to approach a final reasonable strategy [1].

In this use case, we were able to use a planning algorithm, A*, to explore the game space and gather data for the game designers to evaluate the current tuning of the game. This was possible because the goal was straightforward: to evaluate progression in the different careers. As such, the skill and style requirements for the agent were achievable and simple to model. Over the next use cases, we analyze scenarios that call for different approaches as a consequence of having more complex requirements and subjective agent goals.

B. Measuring competent player progression

In the next case study, we consider a real-time multi-player mobile game with a stochastic environment and sequential actions. The game is designed to engage a large number of players for months. The game dynamics are governed by a complex physics engine, which makes it impractical to apply planning methods. This game is much more complex than The Sims Mobile in the sense that it requires the players to exhibit strategic decision making in order to progress in the game. When the game dynamics are unknown or complex, most recent success stories are based on model-free RL (and particularly variants of DQN and PPO). In this section, we show how such model-free control techniques fit into the paradigm of playtesting modern games.

In this game, the goal of the player (and subsequently the agent) is to level up and reach a particular milestone in the game. To this end, the player needs to make smart decisions in terms of resource mining and resource management for different tasks. In the process, the agent also needs to upgrade some buildings. Each upgrade requires a certain level of resources. If the player's resources are insufficient, the upgrade is not possible. A human player is able to visually discern the validity of such an action by clicking on the particular building for the upgrade. The designer's primary concern in this case study is to measure how a competent player would progress in the early stages of this game. In particular, the competent player is required to balance resources and make other strategic choices that the agent needs to discern as well.

We consider a simplified version of the state space that contains information about this early stage of the game, ignoring the full state space. The relevant part of the state space consists of ∼50 continuous and ∼100 discrete state variables. The set of possible actions α is a subset of a space A, which consists of ∼25 action classes, some of which are from a continuous range of possible action values, and some are from a discrete set of action choices. The agent has the ability to generate actions a ∈ A, but not all of them are valid at every game state, since α = α(s, t), i.e., α depends on the timestep and the game state. Moreover, the subset α(s, t) of valid actions may only partially be known to the agent. If the agent attempts to take an unavailable action, such as a building upgrade without sufficient resources, the action will be deemed invalid and no actual action will be taken by the game server.

While the problems of a huge state space [55], [56], [57], a continuous action space [58], and a parametric action space [59] could be dealt with, these techniques are not directly applicable to our problem. This is because, as we shall see, some actions will be invalid at times and inferring that information may not be fully possible from the observation space. Finally, the game is designed to last tens of millions of timesteps, taking the problem of training a functional agent in such an environment outside of the domain of previously explored problems.

We study game progression while taking only valid actions. As we already mentioned, the set of valid actions α may not be fully determined by the current observation, and hence, we deal with a partially observable Markov decision process (POMDP). Given the practical constraints outlined above, it is infeasible to apply deep reinforcement learning to train agents in the game in its entirety. In this game, we show progress toward training an artificial agent that takes valid actions and progresses in the game like a competent human player. To this end, we wrap this game in the game-play environment and connect it to our training pipeline with DQN and Rainbow agents. In the agent environment, we use a feedforward neural network with two fully connected hidden layers, each with 256 neurons, followed by ReLU activation.
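For reference, a function approximator of this size takes only a few lines to express. The snippet below is a generic Keras sketch of such a two-hidden-layer Q-network, not the exact Dopamine configuration used in our pipeline; num_state_features and num_actions are placeholders.

```python
import tensorflow as tf

def build_q_network(num_state_features: int, num_actions: int) -> tf.keras.Model:
    """Two fully connected hidden layers of 256 ReLU units, mapping the
    engineered state vector to one Q-value per discrete action."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(num_state_features,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_actions),  # linear output layer: Q(s, a) estimates
    ])
```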
As a first step in measuring game progression, we define an episode by setting an early goal state in the game that takes an expert human player ∼5 minutes to reach. We let the agent submit actions to the game server every second. We may have to revisit this assumption for longer episodes where the human player is expected to interact with the game more periodically.

We use a simple rewarding mechanism, where we reward the agent with ‘+1’ when it reaches the goal state, ‘-1’ when it submits an invalid action, ‘0’ when it takes a valid action, and ‘-0.1’ when it chooses the “do nothing” action. The game is such that at times the agent has no other valid action to choose, and hence it should choose “do nothing”, but such periods do not last more than a few seconds in the early stages of the game, which is the focus of this case study.
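A direct transcription of this rewarding mechanism, as it could be implemented on the agent-environment side, is sketched below; the boolean flags are hypothetical names for whatever the game server reports back after each submitted action.

```python
def step_reward(reached_goal: bool, action_was_valid: bool, did_nothing: bool) -> float:
    """Reward shaping used in this case study:
    +1 for reaching the goal, -1 for an invalid action,
    -0.1 for the "do nothing" action, and 0 for any other valid action."""
    if reached_goal:
        return 1.0
    if not action_was_valid:
        return -1.0
    if did_nothing:
        return -0.1
    return 0.0
```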
We consider two different versions of the observation space, both extracted from the game engine (state abstraction). The first is what we call the “complete” state space. The complete state space contains information that is not straightforward to infer from the real observation in the game and is only used as a baseline for the agent. In particular, the complete state space also includes the list of available actions at each state. The polar opposite of this state space could be called the “naive” state space, which only contains straightforward information that is always shown on the screen of the player. The second state space we consider is what we call the “augmented” observation space, which contains information from the “naive” state space and information the agent would reasonably infer and retain from current and previous game observations. For example, this includes the amount of resources needed for an upgrade after the agent has checked a particular building for an upgrade. The augmented observation space does not include the set of all available actions, and hence, we rely on the game server to validate whether a submitted action is available, because it is not possible to encode and pass the set α of available actions. Hence, if an invalid action is chosen by the agent, the game server will ignore the action and flag it so that we can provide a ‘-1’ reward.

Fig. 4. Average cumulative reward (return) in training and evaluation for the agents as a function of the number of iterations (Section IV-B). Each iteration is worth ∼60 minutes of game-play. The trained agents are: (1) a DQN agent with complete state space, (2) a Rainbow agent with complete state space, (3) a DQN agent with augmented observation space, and (4) a Rainbow agent with augmented observation space.

We trained four types of agents as shown in Fig. 4, where we plot the average undiscounted return in each training episode. By design, this quantity is upper bounded by ‘+1’, which is achieved if the agent keeps taking valid actions until reaching the final goal state. In reality, this may not always be achievable, as there are periods of time where no action is available and the agent has to choose the “do nothing” action and be rewarded with ‘-0.1’. Hence, the best a competent human player would achieve on these episodes would be around zero.

We see that after a few iterations, both the Rainbow and DQN agents converge to their asymptotic performance values. The Rainbow agent converges to a better asymptotic performance level as compared to the DQN agent. However, in the transient behavior we observe that the DQN agent reaches the asymptotic behavior faster than the Rainbow agent. We believe this might be due to the fact that we did not tune the hyperparameters of prioritized experience replay [60] and distributional RL [61]. We used the default values that worked best on Atari games with the frame buffer as the state space. Extra hyperparameter tuning would have been costly in terms of cloud infrastructure for this particular problem, as the game server does not allow speedup and training already takes a few hours.

Footnote 6: This is consistent with the results of Section V-C, where Rainbow with default hyperparameters does not outperform DQN either.

As expected, we see in Fig. 4 that the augmented observation space makes the training slower and also results in worse performance of the final strategy. In addition, at evaluation time, the agent keeps attempting invalid actions in some cases, as the state remains mostly unchanged after each attempt and the policy is (almost) deterministic. This results in accumulating large negative returns in such episodes, which accounts for the dips in the right-hand-side panel of Fig. 4 at evaluation time.

The observed behavior drew our attention to the question of whether it is too difficult to discern and keep track of the set of valid actions for a human player as well. In fact, after seeking more extensive human feedback, the game designers concluded that better visual cues were needed for a human player on
information about valid actions at each state, so that human players could progress more smoothly without being blocked by invalid actions. As next steps, we intend to experiment with shaping the reward function to achieve different play styles, to be able to better model different player clusters. We also intend to investigate augmenting the replay buffer with expert demonstrations for faster training, and also for generative adversarial imitation learning [62] once the game is released and human play data is available.

We remark that without state abstraction (streamlined access to the game state), the neural network function approximator used for Q-learning would have needed to discern all such information from the pixels in the frame buffer, and hence we would not have been able to get away with such a simple two-layer feedforward function approximator to solve this problem. However, we observe that training within the paradigm of model-free RL remains costly. Specifically, even using the complete state space, it takes several hours to train an agent that achieves a level of performance expected of a competent human player on this relatively short episode of ∼5 minutes. This calls for the exploration of complementary approaches to augment the training process. In particular, we would also like to streamline this process by training reusable agents and capitalizing on existing human data through imitation learning.

V. GAME-PLAYING AI

We have shown the value of simulated agents in a fully modeled game, and the potential of training agents in a complex game to model player progression for game balancing. We can take these techniques a step further and make use of agent training to help build the game itself. Instead of applying RL to capture player behaviors, we consider an approach to game-play design where the player agents learn behavior policies from the game designers. The primary motivation for that is to give direct control into the designer's hands and enable easy interactive creation of various behavior types, e.g., aggressive, exploratory, stealthy, etc. At the same time, we aim to complement organic demonstrations with bootstrapping and heuristics to eliminate the need for a human to train an agent on states not normally encountered by humans, e.g., unblocking an agent using obstacle avoidance.

A. Human-Like Exploration in an Open-World Game

To bridge the gap between the agent and the designer, we introduce imitation learning (IL) to our system [62], [63], [64], [65]. In the present application, IL allows us to translate the intentions of the game designer into a primer and a target for our agent learning system. Learning from expert demonstrations has traditionally proved very helpful in training agents, including in games [66]. In particular, there are also other cases where the preferred solution for training agents would utilize a few relatively short demonstration episodes played by the software developers or designers at the end of the current development cycle [67].

In this application, we consider training artificial agents in an open-world video game, where the game designer is interested in training non-player characters that exhibit certain behavioral styles. The game we are exploring is a shooter with contextual game-play and a destructible environment. While the game can run in multi-player mode, we focus on single-player, which provides us with an environment that is tractable yet rich enough to test the approach we discuss in this section. An agent in such a game would have its state composed of a 3D location vector, velocity, animation pose, health, weapon type, ammunition, scope on/off, cone of sight, collision state, distance to the obstacles in the principal directions, etc. Overall, the dimensionality of the agent state can grow to several dozens of variables, some of them continuous and the others categorical. We construct similar vectors for the NPCs with which the agent needs to engage.
to game-play design where the player agents learn behavior
                                                                    training is to provide designer with a tool to playtest the game
policies from the game designers. The primary motivation
                                                                    by interacting with the game in a number of particular styles
of that is to give direct control into the designer hands and
                                                                    to emulate different players. The styles can include:
enable easy interactive creation of various behavior types,
                                                                       • Aggressive, when an agent tries to find, approach and
e.g., aggressive, exploratory, stealthy, etc. At the same time,
                                                                          defeat any adversarial NPC,
we aim to complement organic demonstrations with bootstrap
                                                                       • Sniper, when an agent finds a good sniping spot and waits
and heuristics to eliminate the need for a human to train an
                                                                          for adversarial NPCs to appear in the cone of sight to
agent on the states normally not encountered by humans, e.g.,
                                                                          shoot them,
unblocking an agent using obstacle avoidance.
   In this application, we consider training artificial agents in an open-world video game, where the game designer is interested in training non-player characters that exhibit certain behavioral styles. The game we are exploring is a shooter with contextual game-play and a destructible environment. While the game can run in multi-player mode, we focus on single-player, which provides us with an environment tractable yet rich enough to test the approach we discuss in this section. An agent in such a game would have its state composed of a 3D location vector, velocity, animation pose, health, weapon type, ammunition, scope on/off, cone of sight, collision state, distance to the obstacles in the principal directions, etc. Overall, the dimensionality of the agent state can grow to several dozen variables, with some of them continuous and the others categorical. We construct similar vectors for the NPCs with which the agent needs to engage.
   The NPC state variables account for partial observability and line-of-sight constraints imposed by the level layout, its destruction state, and the current location of the agent relative to the NPCs. The NPCs in this environment represent adversarial entities, trying to eliminate the agent by attacking it until the agent runs out of health. Additionally, the environment can contain objects of interest, like health and ammo boxes, dropped weapons, etc. The environment itself is non-deterministic (stochastic), i.e., there is no single random seed which we can set to control all random choices in the environment. Additionally, frequent saving and reloading of the game state is not practical due to relatively long loading times.
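   To make the shape of these observations concrete, the following is a minimal sketch of how the mixed continuous and categorical agent and NPC state variables described above might be organized. The field names and types are our own illustrative assumptions, not the game's actual data layout.

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Tuple

    class WeaponType(Enum):
        LONG_RANGE = 0
        HAND_TO_HAND = 1

    @dataclass
    class AgentObservation:
        # Continuous variables.
        position: Tuple[float, float, float]    # 3D location
        velocity: Tuple[float, float, float]
        health: float
        obstacle_distances: List[float] = field(default_factory=list)  # per principal direction
        # Categorical / discrete variables.
        animation_pose: int = 0
        weapon: WeaponType = WeaponType.LONG_RANGE
        ammunition: int = 0
        scope_on: bool = False
        in_collision: bool = False

    @dataclass
    class NPCObservation:
        # Subject to partial observability: populated only while the NPC is
        # inside the agent's cone of sight and not occluded by the level.
        visible: bool
        relative_position: Tuple[float, float, float]
        health: float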
   The main objective for us in this application of agent training is to provide the designer with a tool to playtest the game by interacting with it in a number of particular styles so as to emulate different players. The styles can include:
   • Aggressive, when an agent tries to find, approach, and defeat any adversarial NPC,
   • Sniper, when an agent finds a good sniping spot and waits for adversarial NPCs to appear in its cone of sight to shoot them,
   • Exploratory, when an agent attempts to visit as many locations and uncover as many objects of interest in the level as possible, without engaging in combat unless encountering an adversarial NPC,
   • Sneaky, when an agent actively avoids combat while trying to achieve its objectives, like reaching particular points on the map.
Additionally, the combat style can vary by deploying different weapons, e.g., long-range or hand-to-hand; a toy sketch of encoding such styles as a playtest configuration is given at the end of this discussion.
   Obviously, manually testing the game while following the outlined styles, and evaluating the results, is very time consuming and tedious. Having an agent that can replace a designer in this process would be a great time saver. Conceivably, an agent trained as the design helper can also play as a stand-in or "avatar" of an actual human player, replacing the player when she is not online or the connection drops out for a short period of time. An agent trained in a specific style can also fill a vacant spot in a squad, or help with the cold start problem for multi-player games.
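   The following toy configuration is purely illustrative: the enum values mirror the style list above, while the request structure and its field names are our own assumptions rather than an existing tool.

    from dataclasses import dataclass
    from enum import Enum, auto
    from typing import Optional, Tuple

    class PlayStyle(Enum):
        AGGRESSIVE = auto()
        SNIPER = auto()
        EXPLORATORY = auto()
        SNEAKY = auto()

    @dataclass
    class PlaytestRequest:
        style: PlayStyle
        # Optional objective, e.g., a map point a Sneaky agent should reach.
        target_point: Optional[Tuple[float, float, float]] = None
        prefer_long_range_weapons: bool = True
        max_episode_seconds: float = 300.0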

Fig. 5. Model performance measures the probability of the event that the Markov agent finds at least one previous action from human-played demonstration
episodes in the current game state. The goal of interactive learning is to add support for new game features to the already trained model or improve its
performance in underexplored game states. Plotted is the model performance during interactive training from demonstrations in a proprietary open-world game
as a function of time measured in milliseconds (with the total duration around 10 minutes). The corresponding section of the paper offers additional details
for the presented experiment.

   In terms of the skill/style tradeoff laid out earlier in the paper, these agents are not designed to have any specific level of performance (e.g., a certain kill-death ratio), and they may not necessarily follow any long-term goals. These agents are intended to explore the game and also be able to interact with human players at a relatively shallow level of engagement. Hence, the problem boils down to efficiently training an agent using demonstrations capturing only certain elements of the game-play. The training process has to be computationally inexpensive, and the agent has to imitate the behavior of the teacher(s) by mimicking their relevant style (in a statistical sense) for an implicit representation of the teacher's objectives.
   Casting this problem directly into the RL framework is complicated by two issues. First, it is not straightforward how to design a reward mechanism for imitating the style of the expert. While inverse RL aims at solving this problem, its applicability is not obvious given the reduced representation of the huge state-action space that we deal with and the ill-posed nature of the inverse RL problem [68], [69]. Second, the RL training loop often requires thousands of episodes to learn useful policies, directly translating to a high cost of training in terms of time and computational resources. Hence, rather than using more complex solutions such as generative adversarial imitation learning [62], which uses an RL network as its generator, we propose a solution to the stated problem based on an ensemble of multi-resolution Markov models. One of the major benefits of the proposed model is the ability to perform interactive training within the same episode. As a useful byproduct of our formulation, we can also sketch a mechanism for the numerical evaluation of the style associated with the agents we train. We outline the main elements of the approach next and, for additional details, point to [6], [2], [5], [8].
                                                                               agent and the environment takes place at discrete moments
   Casting this problem directly into the RL framework is                      t = 1, . . . , T with the value of t trivially observable by the
complicated by two issues. First, it is not straightforward how                agent. The agent, after receiving an observation st at time
to design a rewarding mechanism for imitating the style of                     t, can take an action at ∈ A(s, t) from the set of allowed
the expert. While inverse RL aims at solving this problem, its                 actions A(s, t) using policy π : s → a. Executing an action in
applicability is not obvious given the reduced representation                  the environment results in a new state st+1 . Since we focus
of the huge state-action space that we deal with and the ill-                  on the stylistic elements of the agent behavior, the rewards
posed nature of the inverse RL problem [68], [69]. Second, the                 are are inconsequential for the model we build, and we drop
RL training loop often requires thousands of episodes to learn                 them from the discussion. Next, we consider the episode-
useful policies, directly translating to a high cost of training in            based environment, i.e., after reaching a certain condition, the
terms of time and computational resources. Hence, rather than                  execution of the described state-action loop ends. A complete
using more complex solutions such as generative adversarial                    episode is a sequence E = {(st , at )}t∈1,...,T . The fundamental
imitation learning [62] which use an RL network as their                       assumption regarding the described decision process is that it
generator, we propose a solution to the stated problem based                   has the Markov property.
on an ensemble of multi-resolution Markov models. One of                          Besides the most recent action taken before time t, i.e.,
the major benefits of the proposed model is the ability to                     action at−1 , we also consider a recent history of the past n
perform an interactive training within the same episode. As                    actions, where 1 ≤ n < T , αt,n := at−1  t−n = {at−n , . . . , at−1 },
useful byproduct of our formulation, we can also sketch a                      whenever it is defined in an episode E. For n = 0, we define
mechanism for numerical evaluation of the style associated                     at,0 as the empty sequence. We augment the directly observed
with the agents we train. We outline the main elements of the                  state st with the action history αt,n , to obtain an extended state
approach next and for additional details point at [6], [2], [5],               St,n = (st , αt,n ).
10
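   As a minimal sketch of this construction (our own toy encoding, not the paper's implementation: the State, Action, and Episode aliases and the extended_state helper are assumptions introduced only for illustration), the extended state S_{t,n} = (s_t, α_{t,n}) can be built from an episode as follows; later sketches in this section reuse these definitions.

    from typing import Any, List, Tuple

    State = Tuple[Any, ...]                # observed state s_t, encoded as a tuple
    Action = Tuple[Any, ...]               # action a_t
    Episode = List[Tuple[State, Action]]   # E = {(s_t, a_t)}

    def extended_state(episode: Episode, t: int, n: int) -> Tuple[State, Tuple[Action, ...]]:
        """Return S_{t,n} = (s_t, alpha_{t,n}); indices here are 0-based,
        and alpha_{t,0} is the empty history."""
        if t - n < 0:
            raise ValueError("alpha_{t,n} is undefined this early in the episode")
        s_t = episode[t][0]
        history = tuple(a for _, a in episode[t - n:t])  # (a_{t-n}, ..., a_{t-1})
        return s_t, history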

   The purpose of including the action history is to capture additional information (i.e., stylistic features and the elements of the objective-driven behavior of the teacher) from the human controlling the input during interactive demonstrations. An extended policy π_n, which operates on the extended states, π_n : S_{t,n} → a_t, is useful for modeling human actions in a manner similar to n-gram text models in natural language processing (NLP) (e.g., [70], [71], [72]). Of course, the analogy with n-gram models in NLP works only if both the state and action spaces are discrete. We will address this restriction in the next subsection using multi-resolution quantization.
   For a discrete state-action space and various n, we can compute the probabilities P{a_t | S_{t,n}} of the transitions S_{t,n} → a_t occurring in (human) demonstrations and use them as a Markov model M_n of order n of (human) actions. We say that the model M_n is defined on an extended state S_{·,n} if the demonstrations contain at least one occurrence of S_{·,n}. When a model M_n is defined on S, we can use P{a_t | S_{t,n}} to sample the next action from all ever-observed next actions in state S_{·,n}. Hence, M_n defines a partial stochastic mapping M_n : S_{·,n} → A from extended states to the action space A.
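   For a discrete (or, as discussed below, quantized) state-action space, such a model is essentially a hash map from extended states to the actions observed after them. A minimal sketch, reusing the toy definitions introduced above (again, an illustration rather than the authors' implementation):

    import random
    from collections import defaultdict
    from typing import Dict, List, Optional

    class MarkovModel:
        """Order-n model M_n: maps an extended state S_{t,n} to the list of
        actions observed right after it in the demonstrations; sampling from
        that list realizes P{a_t | S_{t,n}} empirically."""

        def __init__(self, n: int):
            self.n = n
            self.transitions: Dict[tuple, List[Action]] = defaultdict(list)

        def fit(self, episode: Episode) -> None:
            for t in range(self.n, len(episode)):
                key = extended_state(episode, t, self.n)
                self.transitions[key].append(episode[t][1])

        def sample(self, key: tuple) -> Optional[Action]:
            # Expected O(1) hash-table lookup; None when M_n is not defined on key.
            actions = self.transitions.get(key)
            return random.choice(actions) if actions else None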
   2) Stacked Markov models: We call a sequence of Markov models M_n = {M_i}_{i=0,...,n} a stack of models. A (partial) policy defined by the stack computes the next action at a state s_t; see [6] for the pseudo-code of the corresponding algorithm. Such a policy performs simple behavior cloning. The policy is partial since it may not be defined on all possible extended states and needs a fallback policy π_* to provide a functional agent acting in the environment.
   Note that it is possible to implement sampling from a Markov model using an O(1) complexity operation with hash tables, making the inference very efficient and suitable for real-time execution in a video game or other interactive application where the expected inference time has to be on the scale of 1 ms or less.^7
   ^7 A modern video game runs at least at 30 frames per second, with many computations happening within the roughly 33 ms allowed per frame, drastically limiting the "budget" allocated for inference.
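   Continuing the sketch above, the stack can be queried from the highest-order model down to order 0, falling back to a default policy π_* when no model is defined on the current extended state. This is our own simplified rendering, not the pseudo-code of [6].

    from typing import Callable, List

    class StackedPolicy:
        """Queries models of decreasing order n; the first defined match wins."""

        def __init__(self, models: List[MarkovModel], fallback: Callable[[State], Action]):
            # Higher-order (more specific) models are consulted first.
            self.models = sorted(models, key=lambda m: m.n, reverse=True)
            self.fallback = fallback  # default policy pi_*

        def act(self, s_t: State, past_actions: List[Action]) -> Action:
            for model in self.models:
                if model.n > len(past_actions):
                    continue  # alpha_{t,n} is not defined yet
                history = tuple(past_actions[-model.n:]) if model.n else ()
                action = model.sample((s_t, history))
                if action is not None:
                    return action
            return self.fallback(s_t)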
   3) Quantization: Quantization (aka discretization) allows us to work around the limitation of a discrete state-action space, enabling the application of the Markov Ensemble approach to environments with continuous dimensions. Quantization is commonly used in solving MDPs [73] and has been extensively studied in the signal processing literature [74], [75]. Using quantization schemes that have been optimized for specific objectives can lead to significant gains in model performance, improving various metrics vs. ad-hoc quantization schemes, e.g., [73], [76].
   Instead of trying to pose and solve the problem of optimal quantization, we use a set of quantizers covering a range of schemes from coarse to fine. At the conceptual level, such an approach is similar to multi-resolution methods in image processing, and to mip-mapping and Level-of-Detail (LoD) representations in computer graphics [77]. The simplest quantization is a uniform one with step σ:

                          Q_σ(x) = σ ⌊x/σ⌋.

For illustration purposes, it is sufficient to consider only the uniform quantization Q_σ. In practice, most variables have naturally defined limits which are at least approximately known. Knowing the environment scale gives an estimate of the smallest step size σ_0 at which we will have complete information loss, i.e., all observed values map to a single bin. For each continuous variable in the state-action space, we consider a sequence of quantizers with decreasing step size, Q = {Q_{σ_j}}_{j=0,...,K}, σ_j > σ_{j+1}, which naturally gives a quantization sequence Q̄ for the entire state-action space, provided K is fixed across the continuous dimensions. To simplify notation, we collapse the sub-index and write Q_j to stand for Q_{σ_j}. For more general quantization schemes, the main requirement is a decreasingly smaller reconstruction error for Q_{j+1} in comparison to Q_j.
   For an episode E, we compute its quantized representation in an obvious component-wise manner:

           E_j = Q̄_j(E) = {(Q̄_j(s_t), Q̄_j(a_t))}_{t=1,...,T},     (1)

which defines a multi-resolution representation of the episode as a corresponding ordered set {E_j}_{j∈{0,...,K}} of quantized episodes, where Q̄ is the vector version of the quantization Q.
   In the quantized Markov model M_{n,j} = Q̄_j(M_n), which we construct from the episode E_j, we compute extended states using the corresponding quantized values. Hence, the extended state is Q̄_j(S_{t,n}) = (Q̄_j(s_t), Q̄_j(α_{t,n})). Further, we define the model Q̄_j(M_n) to contain the probabilities P{a_t | Q̄_j(S_{t,n})} for the original action values. In other words, we do not rely on the reconstruction mapping Q̄_j^{-1} to recover actions but store the original actions explicitly. In practice, continuous action values tend to be unique, and the model samples from the set of values observed after the occurrences of the corresponding extended state. Our experiments show that replaying the original actions instead of their quantized representation provides better continuity and a natural, true-to-the-demonstration look of the cloned behavior.
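   The following sketch shows one way to realize the uniform quantizer and the quantized model M_{n,j} described above, continuing our toy encoding. Here we additionally assume that states and actions are numeric tuples; the helper names are ours.

    import math
    from typing import Sequence

    def quantize(x: float, sigma: float) -> float:
        # Uniform quantizer Q_sigma(x) = sigma * floor(x / sigma).
        return sigma * math.floor(x / sigma)

    def quantize_vector(v: Sequence[float], sigma: float) -> tuple:
        # Component-wise quantization, i.e., the vector quantizer Q-bar_j.
        return tuple(quantize(x, sigma) for x in v)

    def fit_quantized_model(episode: Episode, n: int, sigma: float) -> MarkovModel:
        """Build M_{n,j}: keys use the quantized extended state Q-bar_j(S_{t,n}),
        while the stored values keep the original, unquantized actions."""
        model = MarkovModel(n)
        for t in range(n, len(episode)):
            q_state = quantize_vector(episode[t][0], sigma)
            q_history = tuple(quantize_vector(a, sigma) for _, a in episode[t - n:t])
            model.transitions[(q_state, q_history)].append(episode[t][1])
        return model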
   4) Markov Ensemble: Combining stacking and multi-resolution quantization of Markov models, we obtain a Markov Ensemble E as an array of Markov models parameterized by the model order n and the quantization schema Q_j:

           E_{N,K} = {M_{i,j}},   i = 0, ..., N,   j = 0, ..., K.     (2)

   The policy defined by the ensemble (2) computes each next action in an obvious manner. The Markov Ensemble technique, together with the policy defined by it, are our primary tools for cloning behavior from demonstrations.
   Note that with the coarsest quantization σ_0 present in the multi-resolution schema, the policy should always return an action sampled using one of the quantized models, which at level 0 always finds a match. Hence, such models always "generalize" by resorting to simple sampling of actions when no better match is found in the observations. Excluding overly coarse quantizers and Markov order 0 will result in executing some "default policy" π_*, which we discuss in the next section. The agent execution with the outlined ensemble of quantized stacked Markov models is easy to express as an algorithm, which in essence boils down to look-up tables [6].
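   As a final sketch under the same toy assumptions, the ensemble policy reduces to nested hash-table look-ups; the query order used here, finest quantization and highest order first, is one natural choice and not necessarily the exact scheme of [6].

    from typing import Dict, List, Optional, Sequence, Tuple

    class MarkovEnsemblePolicy:
        """Ensemble E_{N,K} = {M_{i,j}}: i is the Markov order and j the
        quantization level (j = 0 is the coarsest step sigma_0, j = K the finest)."""

        def __init__(self, models: Dict[Tuple[int, int], MarkovModel], sigmas: Sequence[float]):
            self.models = models    # keyed by (order i, quantization level j)
            self.sigmas = sigmas    # sigma_0 > sigma_1 > ... > sigma_K

        def act(self, s_t: Sequence[float], past_actions: List[Action]) -> Optional[Action]:
            # Most specific first: finest quantization, then highest order; the
            # order-0 model at the coarsest level always matches if included.
            for i, j in sorted(self.models, key=lambda k: (k[1], k[0]), reverse=True):
                if i > len(past_actions):
                    continue
                sigma = self.sigmas[j]
                key = (quantize_vector(s_t, sigma),
                       tuple(quantize_vector(a, sigma) for a in past_actions[-i:]) if i else ())
                action = self.models[(i, j)].sample(key)
                if action is not None:
                    return action
            return None  # only possible if M_{0,0} was excluded; defer to pi_*

With the coarsest, order-0 model present, the loop always terminates with a sampled action, mirroring the "generalization by simple sampling" behavior described above; excluding it hands control to the default policy π_*.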