Encoding Human Domain Knowledge to Warm Start Reinforcement Learning

Andrew Silva, Matthew Gombolay
Institute for Robotics and Intelligent Machines
Georgia Institute of Technology
andrew.silva@gatech.edu, matthew.gombolay@cc.gatech.edu

Abstract

Deep reinforcement learning has been successful in a variety of tasks, such as game playing and robotic manipulation. However, attempting to learn tabula rasa disregards the logical structure of many domains as well as the wealth of readily available knowledge from domain experts that could help "warm start" the learning process. We present a novel reinforcement learning technique that allows for intelligent initialization of a neural network's weights and architecture. Our approach permits the encoding of domain knowledge directly into a neural decision tree and improves upon that knowledge with policy gradient updates. We empirically validate our approach on two OpenAI Gym tasks and two modified StarCraft 2 tasks, showing that our novel architecture outperforms multilayer-perceptron and recurrent architectures. Our knowledge-based framework finds superior policies compared to imitation learning-based and prior knowledge-based approaches. Importantly, we demonstrate that our approach can be used by untrained humans to initially provide a > 80% increase in expected reward relative to baselines prior to training (p < 0.001), which results in a > 60% increase in expected reward after policy optimization (p = 0.011).

1    Introduction

As reinforcement learning (RL) is applied to increasingly complex domains, such as real-time strategy games or robotic manipulation, RL and imitation learning (IL) approaches fail to quickly capture the wealth of expert knowledge that already exists for many domains. Existing approaches to using IL as a warm start require large datasets or tedious human labeling as the agent learns everything, from vision to control to policy, all at once. Unfortunately, these large datasets often do not exist, as collecting these data is impractical or expensive, and humans will not patiently label data for IL-based agents (Amershi et al. 2014). While humans may not label enough state-action pairs to train IL-based agents, there is an opportunity to improve warm starts by soliciting expertise from a human once and then leveraging this expertise to initialize an RL agent's neural network architecture and policy. With this approach, we circumvent the need for IL and instead directly imbue human expertise into an RL agent.

To achieve this blending of human domain knowledge with the strengths of RL, we propose Propositional Logic Nets (ProLoNets), a new approach to directly encode domain knowledge as a set of propositional rules into a neural network, as depicted in Figure 1. Our approach leverages decision tree policies from humans to directly initialize a neural network (Figure 2). We use decision trees to allow humans to specify behaviors to guide the agent through a given domain, such as high-level instructions for keeping a pole balanced on the cart pole problem. Importantly, this policy specification does not require the human to demonstrate the balancing act in all possible states, nor does it require the human to label actions as being "good" or "bad."

By directly imbuing logical propositions from the tree into neural network weights, an RL agent can immediately begin learning productive strategies. This approach leverages readily available domain knowledge while still retaining the ability to learn and improve over time, eventually outperforming the expertise with which it was initialized. By exploiting the structural and logical rules inherent to many tasks to which RL is applied, we can bypass early random exploration and expedite an agent's learning in a new domain.

We demonstrate that our approach can outperform standard deep RL across two OpenAI Gym domains (Brockman et al. 2016) and two modified StarCraft II domains (Vinyals et al. 2017), and that our framework is superior to state-of-the-art, IL-based RL, even with observation of that same domain expert knowledge. Finally, in a wildfire simulation domain, we show that our framework can work with untrained human participants. Our three primary contributions include:

1. We formulate a novel approach for capturing human domain expertise in a trainable RL framework via our architecture, ProLoNets, which we show outperforms baseline RL approaches, including IL-based (Cheng et al. 2018) and knowledge-based techniques (Humbird, Peterson, and McClarren 2018), obtaining > 100% more average reward on a StarCraft 2 mini-game.

2. We introduce dynamic growth to ProLoNets, enabling greater expressivity over time to surpass original initializations and yielding twice as much average reward in the lunar lander domain.

3. We conduct a user study in which non-expert humans leveraged ProLoNets to specify policies that resulted in higher cumulative rewards, both before and after training, relative to all baselines (p < 0.05).
Figure 1: A visualization of our approach as it applies to our user study. Participants interact with a UI of state-checks and actions to construct a decision tree policy that is then used to directly initialize a ProLoNet's architecture and parameters. The ProLoNet can then begin reinforcement learning in the given domain, outgrowing its original specification.

2    Related work

Warm starts have been used for RL (Cheng et al. 2018; Zhang and Sridharan 2019; Zhu and Liao 2017) as well as in supervised learning for many tasks (Garcez, Broda, and Gabbay 2012; Hu et al. 2016; Kontschieder et al. 2015; Wang et al. 2017). While these warm start or knowledge-based systems have provided interesting insight into the efficacy of warm starts or human-in-the-loop learning in various domains, these systems typically involve either large labeled datasets with tedious human labeling and feedback, or they require some automated oracle to label actions as "good" or "bad." In highly challenging domains or problems, building such an oracle is rarely feasible. Moreover, it is not always possible to acquire a large labeled dataset for new domains. However, it is often possible to solicit a policy from a human in the form of a high-level series of if-then checks in critical states. These decisions can be collected as a decision tree. Our research seeks to convert such a decision tree into a neural network for RL.

Researchers have previously sought to bridge the gap between decision trees and deep networks (Humbird, Peterson, and McClarren 2018; Kontschieder et al. 2015; Laptev and Buhmann 2014). This work has focused on partitioning a subspace of the data for more efficient inference (Tanno et al. 2018), enabling more explicit interpretability by visualizing a network's classification policy (Frosst and Hinton 2017; Silva et al. 2020), or warm starting through supervised pre-training on labeled data. As discussed, these data may not be available, creating a need for methods that can solicit this initialization tree directly from a human.

Most closely related to our work is deep jointly-informed neural networks (DJINN) (Humbird, Peterson, and McClarren 2018), which is the latest in a long line of knowledge-based neural network research (França, Zaverucha, and Garcez 2014; Garcez, Broda, and Gabbay 2012; Maclin and Shavlik 1996; Richardson and Domingos 2006; Towell and Shavlik 1994). DJINN uses a decision tree learned over a training set in order to initialize the structure of a network's hidden layers and to route input data appropriately. However, DJINN does not explicitly initialize rules, nor does it leverage rules solicited from humans. This distinction means that DJINN creates an architecture for routing information appropriately, but the decision criteria in each layer must be learned from scratch. Our work, on the other hand, directly initializes both the structure and the rules of a neural network, meaning that the human's expertise is more completely leveraged for a more useful warm start in RL domains. We build on decades of research demonstrating the value of human-in-the-loop learning (Towell and Shavlik 1994; Zhang et al. 2019) to leverage logical rules solicited from humans in the form of a decision tree to intelligently initialize the structure and rules of a deep network.

Our work is related to IL and to knowledge-based or human-in-the-loop RL frameworks (Zhang et al. 2019; Zhang and Sridharan 2019; MacGlashan et al. 2017), as well as to apprenticeship learning and inverse reinforcement learning (Abbeel and Ng 2004; Knox and Stone 2009). Importantly, however, our approach does not require demonstrations or datasets to mimic human behavior. While our approach directly initializes with a human-specified policy, IL methods require large labeled datasets (Edwards et al. 2018) or an oracle to label data before transitioning to RL, as in the LOKI (Cheng et al. 2018) framework. Our approach translates human expertise directly into an RL agent's policy and begins learning immediately, sidestepping the IL and labeling phase.

3    Preliminaries

Within RL, we consider problems presented as a Markov decision process (MDP), which is a 5-tuple ⟨S, A, T, R, λ⟩, where s ∈ S are states drawn from the state space or domain, a ∈ A are possible actions drawn from the action space, T(s′, a, s) is the transition function representing the likelihood of reaching a next state s′ by taking some action a in a given state s, R(s) is the reward function which determines the reward for each state, and λ is a discount factor. In this work, we examine discrete action spaces and semantically meaningful state spaces; intelligent initialization for continuous outputs and unstructured inputs is left to future work. The goal of our RL agent is to find a policy, π(a|s), that selects actions in states to maximize the agent's expected long-term cumulative reward. IL approaches, such as ILPO (Edwards et al. 2018), operate under a similar framework, though they do not make use of the reward signal and instead perform supervised learning according to oracle data.
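To make the discrete-action setting concrete, the minimal sketch below (our illustration, not the authors' code) samples actions from a softmax policy π(a|s) and computes the discounted return that the agent optimizes.

```python
# A minimal sketch of the discrete-action MDP setting above, assuming a
# softmax policy over per-action scores; names here are illustrative only.
import numpy as np

def discounted_return(rewards, discount=0.99):
    # Expected long-term cumulative reward: sum_t discount^t * R(s_t).
    return sum(r * discount ** t for t, r in enumerate(rewards))

def sample_action(action_scores, rng):
    # pi(a|s): softmax over the scores, then sample a discrete action.
    probs = np.exp(action_scores - action_scores.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
print(sample_action(np.array([0.88, 0.12]), rng))
print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.99^2
```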
4    Approach

We provide a visual overview of the ProLoNet architecture in Figure 2. To intelligently initialize a ProLoNet, a human user first provides a policy in the form of some hierarchical set of decisions. These policies are solicited through simple user interactions for specifying instructions, as in Section 6. The user's decision-making process is then translated into a set of weights w_n ∈ W and comparator values c_n ∈ C representing each rule, as shown in Algorithm 1. Each weight w_n determines which input features to consider and, optionally, how to weight them, as there is a unique weight value for each input feature (i.e., |w_n| = |S| for an input space S). The comparator c_n is used as a threshold for the weighted features.

Figure 2: A traditional decision tree and a ProLoNet. Decision nodes become linear layers, leaves become action weights, and the final output is a sum of the leaves weighted by path probabilities.

Each decision node D_n throughout the network is represented as D_n = σ(α(w_n^T · X − c_n)), where X is the input data, σ is the sigmoid function, and α serves to throttle the confidence of decision nodes. Less confidence in the tree allows for more uncertainty in decision making (Yuan and Shaw 1995), leading to more exploration, even from an expert initialization. High values of α emphasize the difference between the comparator and the weighted input, thus pushing the tree to be more Boolean. Lower values of α encourage a smoother tree, with α = 0 producing uniformly random decisions. We allow α to be a learned parameter.

Algorithm 1 Intelligent Initialization
 1: Input: Expert propositional rules R_d
 2: Input: Input size I_S, output size O_S
 3: W, C, L = {}
 4: for r ∈ R_d do
 5:   if r is a state check then
 6:     s = feature index in r
 7:     w = 0-vector of length I_S; w[s] = 1
 8:     c = comparison value in r
 9:     W = W ∪ w, C = C ∪ c
10:   end if
11:   if r is an action then
12:     a = action index in r
13:     l = 0-vector of length O_S; l[a] = 1
14:     L = L ∪ l
15:   end if
16: end for
17: Return: W, C, L
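As a concrete illustration of Algorithm 1, the sketch below builds the weight, comparator, and leaf sets from a small rule list. The tuple-based rule format and function name are our own assumptions for illustration, not the paper's implementation.

```python
# A sketch of Algorithm 1, assuming each expert rule is either
# ("check", feature_index, comparison_value) or ("action", action_index).
import numpy as np

def init_prolonet_params(rules, input_size, output_size):
    weights, comparators, leaves = [], [], []
    for rule in rules:
        if rule[0] == "check":                 # state check -> decision node
            _, feature_idx, compare_value = rule
            w = np.zeros(input_size)
            w[feature_idx] = 1.0               # attend to the named feature
            weights.append(w)
            comparators.append(compare_value)
        elif rule[0] == "action":              # action -> leaf prior
            _, action_idx = rule
            l = np.zeros(output_size)
            l[action_idx] = 1.0                # favour the named action
            leaves.append(l)
    return weights, comparators, leaves

# The cart pole tree of Example 1 below: one check on x-position, two leaves.
W, C, L = init_prolonet_params(
    [("check", 0, 0.0), ("action", 0), ("action", 1)],
    input_size=4, output_size=2)
```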
Example 1 (ProLoNet Initialization). Assume we are in the cart pole domain (Barto, Sutton, and Anderson 1983) and have solicited the following from a human: "If the cart's x position is right of center, move left; otherwise, move right," and that the user indicates x position is the first input feature of four and that the center is at 0. We therefore initialize our primary node D_0 with w_0 = [1, 0, 0, 0] and c_0 = 0, following lines 5-8 in Alg. 1. Following lines 11-13, we create a new leaf l_0 = [1, 0] (Move Left) and a new leaf l_1 = [0, 1] (Move Right). Finally, we set the paths Z(l_0) = D_0 and Z(l_1) = (¬D_0). The resulting probability distribution over the agent's actions is a softmax over (D_0 * l_0 + (1 − D_0) * l_1).

After all decision nodes are processed, the values of D_n from each node represent the likelihood of that condition being TRUE. In contrast, (1 − D_n) represents the likelihood of the condition being FALSE. With these likelihoods, the network then multiplies out the probabilities for different paths to all leaf nodes. Every leaf l ∈ L contains a path z ∈ Z, a set of decision nodes which should be TRUE or FALSE in order to reach l, as well as a prior set of weights for each output action a. For example, in Figure 2, z_1 = D_1 * D_2 and z_3 = (1 − D_1) * D_3. The likelihood of each action a in leaf l_i is determined by multiplying the probability of reaching leaf l_i by the prior weight of the outputs within leaf l_i. After calculating the outputs for every leaf, the leaves are summed and passed through a softmax function to provide the final output distribution.
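The minimal PyTorch sketch below illustrates this forward pass for the single-node tree of Example 1; deeper trees simply multiply the decision probabilities along each leaf's path before the weighted sum. The module and variable names are ours, and this is an illustrative sketch under those assumptions, not the authors' released implementation.

```python
# A minimal sketch of the ProLoNet forward pass for Example 1's tree.
import torch
import torch.nn as nn

class TinyProLoNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Decision node D_0: "is the cart's x position right of center?"
        self.w0 = nn.Parameter(torch.tensor([1.0, 0.0, 0.0, 0.0]))  # attend to x
        self.c0 = nn.Parameter(torch.tensor(0.0))                   # comparator
        self.alpha = nn.Parameter(torch.tensor(1.0))                # confidence
        # Leaf priors over [move left, move right].
        self.l0 = nn.Parameter(torch.tensor([1.0, 0.0]))  # path: D_0 true
        self.l1 = nn.Parameter(torch.tensor([0.0, 1.0]))  # path: D_0 false

    def forward(self, x):
        d0 = torch.sigmoid(self.alpha * (self.w0 @ x - self.c0))
        # Sum the leaves weighted by their path probabilities, then softmax.
        return torch.softmax(d0 * self.l0 + (1.0 - d0) * self.l1, dim=-1)

policy = TinyProLoNet()
state = torch.tensor([2.0, 1.0, 0.0, 3.0])
# The pre-softmax mixture here is [0.88, 0.12], matching Example 2 below.
print(policy(state))
```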
Example 2 (ProLoNet Inference). Consider an example cart pole state, X = [2, 1, 0, 3], passed to the ProLoNet from Example 1. Following D_n = σ(α(w_n^T · X − c_n)), the network arrives at σ([1, 0, 0, 0] · [2, 1, 0, 3] − 0) = 0.88 for D_0, meaning "mostly true." This probability propagates to the two leaf nodes using their respective paths, making the output of the network a probability given by (0.88 * [1, 0] + (1 − 0.88) * [0, 1]) = [0.88, 0.12]. Accordingly, the agent selects the first action with probability 0.88 and the second action otherwise. An algorithmic expression of the forward pass is provided in the supplementary material.
Dynamic Growth – ProLoNets are able to follow expert strategies immediately, but they may lack the expressive capacity to learn more optimal policies once they are deployed into a domain. If an expert policy involves a small number of decisions, the network will have a small number of weight vectors and comparators to use for its entire existence. To enable the ProLoNet architecture to continue to grow beyond its initial definition, we introduce a dynamic growth procedure, which is outlined in Algorithm 2 and Figure 3.

Upon initialization, a ProLoNet agent maintains two copies of its actor. The first is the shallower, unaltered initialized version, and the second is a deeper version in which each leaf is transformed into a randomly initialized decision node with two new randomly initialized leaves (line 1 of Alg. 2). This deeper agent has more parameters to potentially learn more complex policies, but at the cost of added randomness and uncertainty, reducing the utility of the intelligent initialization.

As the agent interacts with its environment, it relies on the shallower network to generate actions, as the shallow network represents the human's domain knowledge. After each episode, the off-policy update is run over the shallower and deeper networks. Finally, after the off-policy updates, the agent compares the entropy of the shallower actor's leaves to the entropy of the deeper actor's leaves and selectively deepens when the leaves of the deeper actor are less uniform than those of the shallower actor (lines 3-7). We find that this dynamic growth mechanism improves stability and average cumulative reward.

Algorithm 2 Dynamic Growth
 1: Input: ProLoNet P_d
 2: Input: Deeper ProLoNet P_{d+1}
 3: Input: ε = minimum confidence
 4: H(l_i) = entropy of leaf l_i
 5: for l_i ∈ L ∈ P_d do
 6:   Calculate H(l_i)
 7:   Calculate H(l_{d1}), H(l_{d2}) for leaves under l_i in P_{d+1}
 8:   if H(l_i) > (H(l_{d1}) + H(l_{d2}) + ε) then
 9:     Deepen P_d at l_i using l_{d1} and l_{d2}
10:     Deepen P_{d+1} at l_{d1} and l_{d2} randomly
11:   end if
12: end for

Figure 3: The dynamic growth process, with a deeper ProLoNet shown in paler colors and dashed lines. When H(L_3) + H(L_4) < H(L_1), the agent replaces L_1 with D_2, L_3, and L_4 and adds a new level to the deeper actor.

Example 3 (ProLoNet Dynamic Growth). Assume the cart pole agent's shallower actor has found a local minimum with l_1 = [0.5, 0.5], while the deeper actor has l_3 = [0.9, 0.1] and l_4 = [0.1, 0.9]. Seeing that l_1 is offering little benefit to the current policy, and D_2 in the deeper actor is able to make a decision about which action offers the most reward, the agent would dynamically deepen at l_1, copying over the deeper actor's parameters and becoming more decisive in that area of its policy. The deeper actor would also grow with a random set of new parameters, as shown in Figure 3.
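The sketch below illustrates the deepening test of Algorithm 2 on the numbers from Example 3. Treating each leaf's weights as an unnormalized action distribution is our assumption; the paper does not pin down the exact entropy computation.

```python
# An illustrative sketch of the dynamic-growth test in Algorithm 2.
import numpy as np

def leaf_entropy(leaf_weights):
    # Entropy of the action distribution implied by a leaf's weights
    # (normalized directly; an assumption, not the paper's exact recipe).
    probs = np.asarray(leaf_weights, dtype=float)
    probs = probs / probs.sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

def should_deepen(shallow_leaf, deep_left, deep_right, epsilon=0.01):
    # Deepen when the shallow leaf is noticeably less certain (higher entropy)
    # than the pair of leaves beneath it in the deeper actor.
    return leaf_entropy(shallow_leaf) > (
        leaf_entropy(deep_left) + leaf_entropy(deep_right) + epsilon)

# Example 3: an indecisive shallow leaf vs. two decisive deeper leaves.
print(should_deepen([0.5, 0.5], [0.9, 0.1], [0.1, 0.9]))  # True
```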
5    Experimental evaluation

We conduct two complementary evaluations of the ProLoNet as a framework for RL with human initialization. The first is a controlled investigation with expert initialization in which an author designs heuristics for a set of domains with varying complexity; this allows us to confirm that our architecture is competitive with baseline learning frameworks. We also perform an ablation of intelligent initialization and dynamic growth in this set of experiments. The second evaluation is a user study to support our claim that untrained users can specify policies that serve to improve RL.

In our first evaluation, we assess our algorithm in StarCraft II (SC2) for macro and micro battles as well as the OpenAI Gym (Brockman et al. 2016) lunar lander and cart pole environments. Optimization details, hyperparameters, and code are all provided in the supplementary material.

To evaluate the impact of dynamic growth and intelligent initialization, we perform an ablation study and include results from these experiments in Table 1. For each N-mistake agent, weights, comparators, and leaves are randomly negated according to N, up to a maximum of 2N for each category.
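The sketch below is one plausible reading of the N-mistake perturbation (negate each parameter with probability N, capping the flipped fraction at 2N); the exact bookkeeping is our assumption, not a detail given in the paper.

```python
# An illustrative sketch of the N-mistake perturbation described above;
# the cap interpretation (at most a 2N fraction of flips) is our assumption.
import random

def negate_with_mistakes(params, n, rng=random.Random(0)):
    budget = int(2 * n * len(params)) or 1     # at most a 2N fraction flipped
    flipped, out = 0, []
    for p in params:
        if flipped < budget and rng.random() < n:
            out.append(-p)                     # negate this weight/comparator/leaf
            flipped += 1
        else:
            out.append(p)
    return out

print(negate_with_mistakes([1.0, -0.5, 2.0, 0.0, 3.0], n=0.1))
```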
5.1    Agent formulations

We compare several agents across our experimental domains. The first is a ProLoNet agent as described above and with expert initialization. We also evaluate a multi-layer perceptron (MLP) agent and a long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) agent, both using ReLU activations (Nair and Hinton 2010). We include comparisons to a ProLoNet with random initialization (Random ProLoNet) as well as the Heuristic used to initialize our agents. We compare to an IL agent trained with the LOKI framework, in which the agent imitates for the first N episodes (Cheng et al. 2018), where N is a tuned hyperparameter, and then transitions to RL. The LOKI agent supervises with the same heuristic that is used to initialize the ProLoNet agent. Finally, although the original DJINN framework (Humbird, Peterson, and McClarren 2018) requires a decision tree learned over a labeled dataset, we extend the DJINN architecture to allow for initialization with a hand-crafted decision tree in order to compare to a DJINN agent that is initialized using the same heuristic as LOKI and the ProLoNet but built with the DJINN architecture.
5.2    Environments

We consider four environments to empirically evaluate ProLoNets: cart pole, lunar lander, the FindAndDefeatZerglings minigame from the SC2LE (Vinyals et al. 2017), and a full game of SC2 against the in-game artificial intelligence (AI). These environments provide us with a steady increase in difficulty, from a toy problem to the challenging game of full SC2. These evaluations also showcase the ability of the ProLoNet framework to compete with state-of-the-art approaches in simple domains and excel in more complex domains. For the SC2 and SC2LE problems, we use the SC2 API (https://github.com/Blizzard/s2client-api) to manufacture 193D and 37D state spaces, respectively, and 44D and 10D action spaces, respectively. In the full SC2 domain, making the right parameter update is a significant challenge for RL agents. As such, we verify that the agent's parameter updates increase its probability of victory, and if a new update has decreased the agent's chances of success, then the update is rolled back and the agent gathers experience for a different step, similar to the checkpointing approach in Hosu and Rebedea (2016).
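A rough sketch of that accept-or-roll-back update guard is shown below; the helper names, the evaluation budget, and the use of PyTorch-style state_dict checkpoints are our assumptions rather than details from the paper.

```python
# An illustrative sketch of the checkpoint-style update guard described above.
import copy

def guarded_update(agent, optimizer_step, estimate_win_rate, eval_episodes=20):
    snapshot = copy.deepcopy(agent.state_dict())   # checkpoint the current policy
    before = estimate_win_rate(agent, eval_episodes)
    optimizer_step(agent)                          # e.g., one PPO update
    after = estimate_win_rate(agent, eval_episodes)
    if after < before:                             # the update hurt the policy:
        agent.load_state_dict(snapshot)            # roll it back
        return False                               # caller gathers new experience
    return True
```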
Figure 4: A comparison of architectures on (a) cart pole, (b) lunar lander, and (c) FindAndDefeatZerglings (Vinyals et al. 2017). As the domain complexity increases, we see that intelligent initialization is increasingly important and that ProLoNets are the most effective method for leveraging domain expertise, performing well even when domain expertise is unnecessary, as in cart pole.

OpenAI Gym – As depicted in Figures 4a and 4b, ProLoNets are able to either match or exceed the performance of standard reinforcement learning and imitation learning architectures. Furthermore, we find that the ProLoNet architecture, even without intelligent initialization, is competitive with baseline architectures in the OpenAI Gym. Running reward in these domains is averaged across five runs, as recommended by Henderson et al. (2018). MLP and LSTM agents use 1-layer architectures which maintain the input dimension until the output layer. We find success with intelligent initializations using as few as three nodes for the cart pole domain and as few as 10 nodes for the lunar lander. These results show that ProLoNets can leverage user knowledge to achieve superior results, and our ablation study results (Table 1) show that the architecture is robust to sub-optimal initialization in these domains.

Even where intelligent initialization is not always necessary or where high-level instruction is difficult to provide, as in cart pole, it does not hinder RL from finding solutions to the problem. Further, while baselines appear unstable in these domains, potentially owing to missing implementation hacks and tricks (Engstrom et al. 2019), we observe that the ProLoNet approaches are able to succeed with the same PPO implementation and learning environment.

StarCraft II: FindAndDefeatZerglings – For this problem, we assign an agent to each individual allied unit. The best-performing initialization in this domain has 6 decision nodes and 7 leaves. Running reward is depicted in Figure 4c, again averaged over 5 runs. Intelligent initialization is crucial in this more complex domain, and the Random ProLoNet fails to find much success despite having the same architecture as the ProLoNet. LOKI performs on par with the Heuristic used to supervise actions, but LOKI is unable to generalize beyond the Heuristic. MLP and LSTM agents use a 7-layer architecture after a hyperparameter search, and we extend this to the full game of SC2. Importantly, this result (Figure 4c) shows that user-initialized ProLoNets can outperform our baselines and that this initialization is key to efficient exploration and learning. The importance of the initialization policy is again shown in Table 1, where even negating 10% of the agent's parameters results in a significantly lower average reward.

StarCraft II: Full Game – After 5,000 episodes, no agent other than the ProLoNet is able to win a single game against the in-game AI at the easiest setting. Even the LOKI and DJINN agents, which have access to the same heuristics used by the ProLoNet, are unable to win one game. The ProLoNet, on the other hand, is able to progress to the "hard" in-game AI, achieving 100% win rates against easier opponents as it progresses. Even against the "hard" in-game AI, the ProLoNet agent is able to double its win rate from initialization. This result demonstrates the importance of an intelligent initialization in complex domains, where only a very narrow and specific set of actions yield successful results. Access to oracle labeling (LOKI) or a knowledge-based architecture (DJINN) does not suffice; the agent requires the actual warm start of having intelligent rules built in. Thus, we believe these results demonstrate that our novel formulation is singularly capable of harnessing domain knowledge.
Domain     | ProLoNet   | Random ProLoNet | Shallow ProLoNet | N = 0.05  | N = 0.1   | N = 0.15
Cart Pole  | 449 ± 15   | 401 ± 26        | 415 ± 27         | 426 ± 30  | 369 ± 28  | 424 ± 29
Lunar      | 86 ± 33    | 55 ± 19         | 49 ± 20          | 50 ± 22   | 45 ± 22   | 45 ± 22
Zerglings  | 8.9 ± 1.5  | -1.3 ± 0.6      | 8.8 ± 1.5        | 5.1 ± 1.1 | 5.9 ± 1.2 | 4.1 ± 1.1

Table 1: ProLoNet ablation study of average cumulative reward. Units are in thousands.

AI Difficulty | ProLoNet (Ours) | ProLoNet at Initialization | All Others
Very Easy     | 100%            | 14.1%                      | 0%
Easy          | 100%            | 10.9%                      | 0%
Medium        | 82.2%           | 11.3%                      | 0%
Hard          | 26%             | 10.7%                      | 0%

Table 2: Win rates against the StarCraft II in-game AI. "All Others" includes all agents in Section 5.1.
6    User study with non-experts

Our second evaluation investigates the utility of our framework with untrained humans providing the expert initialization for ProLoNets. As presented in Section 6.2, our user study shows that untrained users can leverage ProLoNets to train RL policies with superior performance. These results provide evidence that our approach can help democratize RL.

Hypotheses – We seek to investigate whether an untrained user can provide a useful initial policy for ProLoNets. Hypothesis 1 (H1): Expert initializations may be solicited from average users, requiring no particular training of the user, and these initializations are superior to random initializations. Hypothesis 2 (H2): RL can improve significantly upon these initializations, yielding superior policies after training.

Metrics – To test H1, we measure the reward over time for our best participant, all participants, and baseline methods. Testing H2, we measure the average reward for the first 50 and final 50 episodes for all agents specified by participants and our strongest baseline. Our metrics allow us to effectively examine our hypotheses in the context of expert initialization in our study domain.

Domain: Wildfire Tracking – We develop a Python simulator for a domain that is both suited to RL and of relevance to a wider audience: wildfire tracking. We randomly instantiate two fires and two drones in a 500x500 grid. The drones receive a 6D state as input, containing distances to fire centroids and Boolean flags for which drone is the closest to each fire. The action space for the drones is a 4D discrete decision of which cardinal direction to move in. Pre-made state checks include statements such as "If I am the closest drone to Fire 2" and "If Fire 1 is to my west." The two drones are controlled by separate agents without communication, and network weights are shared. The reward function is the negative distance between drones and fire centroids, encouraging drones to follow the fire as closely as possible.
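A minimal sketch of that reward signal is below; the pairing of each drone with one fire centroid and the per-step summation are simplifying assumptions on our part, not details from the paper's simulator.

```python
# An illustrative sketch of the wildfire-tracking reward described above.
import numpy as np

def wildfire_reward(drone_positions, fire_centroids):
    # drone_positions, fire_centroids: (2, 2) arrays of x, y in a 500x500 grid.
    distances = np.linalg.norm(drone_positions - fire_centroids, axis=1)
    return -float(distances.sum())   # closer tracking -> reward nearer zero

print(wildfire_reward(np.array([[100.0, 100.0], [400.0, 350.0]]),
                      np.array([[120.0, 130.0], [390.0, 360.0]])))
```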
6.1    Study details

To solicit policy specifications from users, we designed a user interface (UI) that enabled participants to select from a set of pre-made state checks and actions. Participants were first briefed on the domain, shown a visualization, and then asked to talk through a strategy for monitoring the fires with two independent drones. After describing a solution and seeing the domain, participants were presented with the UI to build out their policies. As the participant selected options, those rules were composed into a decision tree. Once participants completed the study, we leveraged their policy specifications to initialize the structure and parameters of a ProLoNet. The ProLoNet was then deployed to the wildfire domain, where it further improved through RL. Our results are presented in Figure 6 and described below. We present both the highest-performing participant ("Best") and the median over all participants ("Median"), and we compare against the agents presented in Section 5.1. LOKI and DJINN agents use the "Best" participant policy specification as a heuristic.

6.2    Study results

Our IRB-approved study involved 15 participants (nine male, six female) between 21 and 29 years old (M = 24, SD = 2). The study took approximately 45 minutes, and participants were compensated for their time. Our pre-study survey revealed varying degrees of experience with robots and games, though we note that our participants were mostly computer science students. Importantly, we found that their prior experience with robots, learning from demonstration, or strategy games did not impact their ability to specify useful policies for our agents.

Nearly all participants provided policy specifications that were superior to random exploration. After performing RL over participant specifications, we can see in Figure 6 that intelligent initialization yields the most successful RL agents, even from non-experts. We compare to the best-performing baseline, the Random ProLoNet, in Figure 5. We can again see that the participants' initializations are not only better than random initialization but are also better than the trained RL agent. A Wilcoxon signed-rank test shows that our participants' initializations (Median = -23, IQR = 19) were significantly better than a baseline initialization (Median = -87, IQR = 26), W(15) = 1.0, p < .001. Our participants' agents (Median = -7.9, IQR = 29) were also significantly better than a baseline (Median = -52, IQR = 7.9) after training, W(15) = 15.0, p = 0.011. These results are significant after applying a Bonferroni correction to test the relative performance both before and after training. This result supports hypothesis H1, showing that average users can specify useful policies for RL agents to explore more efficiently than random search and significantly outperform baselines.
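For readers unfamiliar with the test, the sketch below shows how such a paired comparison is computed with SciPy; the arrays are random placeholders, not the study's measurements.

```python
# An illustrative paired signed-rank comparison; data are placeholders only.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
participant_rewards = rng.normal(-23, 10, size=15)   # one value per participant
baseline_rewards = rng.normal(-87, 10, size=15)      # paired baseline rollouts
statistic, p_value = wilcoxon(participant_rewards, baseline_rewards)
print(statistic, p_value)
```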
Furthermore, our participants' agents are significantly better post-training than at initialization, as shown by a Wilcoxon signed-rank test (W(15) = 4.0, p < 0.01). This finding supports hypothesis H2, showing that RL improves on human specifications rather than merely repeating what the humans have demonstrated. By combining human intuition and expertise with computation and optimization for long-term expected reward, we are able to produce agents that outperform both humans and traditional RL approaches.

Finally, we qualitatively demonstrate the utility of intelligent initialization and the ProLoNet architecture by deploying the top-performing agents from each method to two drones with simulated fires to track. Videos of the top four agents are included as supplementary material.

Figure 5: Initial and final distance between drones and wildfire centroids in our user study domain, where lower distance is better. Participant initializations are significantly better at tracking fires than random, showing that untrained users can leverage our approach to provide useful warm starts.

Figure 6: Wildfire tracking results, again demonstrating the importance of direct intelligent initialization (ProLoNet) rather than IL or random initialization.

7    Discussion

We conducted two complementary evaluations of our proposed architecture, demonstrating the significance of our contribution. Through our first set of experiments on an array of RL benchmarks with a domain expert building heuristics, we empirically validated that ProLoNets are competitive with baseline methods when initialized randomly and, with a human initialization, outperform state-of-the-art imitation and RL baselines. As we see in Figure 4, ProLoNets are as fast as or faster than baseline methods to learn an optimal policy over the same environments and optimization frameworks. In our more complex domains, we identify the importance of an intelligent initialization. While the IL baseline performs well in the FindAndDefeatZerglings minigame, LOKI cannot improve on the imitated policy. In the full game of SC2, no approach apart from our intelligently initialized ProLoNet wins even a single game. The ability to leverage domain knowledge to initialize rules as well as structure, rather than only architecture and routing information, as in DJINN, is a key difference that enables the success of our approach.

Through our user study, we demonstrated the practicality of our approach and showed that average participants, even those with no prior experience in the given domain, can produce policy specifications which significantly exceed random initialization (p < 0.05). Furthermore, we have demonstrated that RL can significantly improve upon these policies, learning to refine "good enough" solutions into optimal ones for a given domain. This result shows us that our participants did not simply provide our agents with optimal solutions iterated upon needlessly. Instead, our participants provided good but sub-optimal starting points for optimization. These starting policies were then refined into a solution that was more robust than either the human's solution or the best baseline solution. Our study confirms that our approach can leverage readily available human initializations for success in deep RL and, moreover, that the combination of human initialization and RL yields the best of both worlds.

8    Conclusion

We present a new architecture for deep RL agents, ProLoNets, which permits intelligent initialization of agents. ProLoNets grant agents the ability to grow their network capacity as necessary and are surprisingly capable even with random initialization. We show that ProLoNets permit initialization from average users and achieve a high-performing policy as a result of the blend of human instruction and RL. We demonstrate, first, that our approach is superior to imitation and reinforcement learning on traditional architectures and, second, that intelligent initialization allows deep RL agents to explore and learn in environments that are too complex for randomly initialized agents. Further, we have confirmed that we can solicit these useful warm starts from average participants and still develop policies superior to baseline approaches in the given domains, paving the way for reinforcement learning to become a more collaborative enterprise across a variety of complex domains.
Ethical Considerations

Our work is a contribution targeted at democratizing reinforcement learning in complex domains. The current state of the art in reinforcement learning in complex domains requires compute time and power beyond the capacity of many labs, hand-engineering which is rarely explained publicly, or large labeled datasets which are not always shared. By providing a means for intelligent initialization by practitioners and improved exploration in many domains, we attempt to lower the barrier to entry for research in reinforcement learning and to broaden the number of potential applications of reinforcement learning to more grounded, real-world problems. While there are risks with any technology being misused, we believe the benefits of democratizing RL outweigh the risks. We posit that giving everyone the ability to use RL, rather than just large corporations and select universities, is a positive contribution to society.

Beneficiaries – Our work seeks to improve and simplify reinforcement learning research for all labs and to take steps toward democratizing reinforcement learning for non-experts. We feel that the computational and dataset savings of our work stand to benefit all researchers within reinforcement learning.

Negatively affected parties – We do not feel that any group of people or research direction is negatively impacted by this work. Our work is complementary to other explorations within reinforcement learning, and insights from imitation learning translate naturally into insights on the qualities of useful or harmful intelligent initializations.

Implications of failure – While our method seeks to simplify reinforcement learning, in the worst case the initialization falls back to random and the learning agent is again faced with an intractable random exploration problem. Adversarial agents using our approach would be able to instantiate a worse-than-random agent, though our results imply that it is possible to overcome such an initialization in simple domains.

Bias and fairness – Our work does rely on the "bias" of its initialization: that is, it is biased towards the actions which a human has pre-specified. While this biased exploration may fail to accurately explore or understand the intricacies of a complex domain, the alternative (years of compute with random exploration) is simply unavailable to many researchers. This bias may be overcome through diversification of intelligent initializations, which may lead to a diversity of final strategies. However, the unification of such diverse policies into a single agent and the thorough study of diverse initializations is left to future work.

References

Abbeel, P.; and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, 1.

Amershi, S.; Cakmak, M.; Knox, W. B.; and Kulesza, T. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35(4): 105–120.

Barto, A. G.; Sutton, R. S.; and Anderson, C. W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13(5): 834–846.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym.

Cheng, C.-A.; Yan, X.; Wagener, N.; and Boots, B. 2018. Fast Policy Learning through Imitation and Reinforcement. arXiv preprint arXiv:1805.10413.

Edwards, A. D.; Sahni, H.; Schroeker, Y.; and Isbell, C. L. 2018. Imitating Latent Policies from Observation. arXiv preprint arXiv:1805.07914.

Engstrom, L.; Ilyas, A.; Santurkar, S.; Tsipras, D.; Janoos, F.; Rudolph, L.; and Madry, A. 2019. Implementation Matters in Deep RL: A Case Study on PPO and TRPO. In International Conference on Learning Representations.

França, M. V.; Zaverucha, G.; and Garcez, A. S. d. 2014. Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning 94(1): 81–104.

Frosst, N.; and Hinton, G. 2017. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784.

Garcez, A. S. d.; Broda, K. B.; and Gabbay, D. M. 2012. Neural-symbolic learning systems: foundations and applications. Springer Science & Business Media.

Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2018. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.

Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8): 1735–1780.

Hosu, I.-A.; and Rebedea, T. 2016. Playing Atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077.

Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; and Xing, E. 2016. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318.

Humbird, K. D.; Peterson, J. L.; and McClarren, R. G. 2018. Deep Neural Network Initialization With Decision Trees. IEEE Transactions on Neural Networks and Learning Systems.

Knox, W. B.; and Stone, P. 2009. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, 9–16.

Kontschieder, P.; Fiterau, M.; Criminisi, A.; and Rota Bulo, S. 2015. Deep neural decision forests. In Proceedings of the IEEE International Conference on Computer Vision, 1467–1475.
Laptev, D.; and Buhmann, J. M. 2014. Convolutional deci-
sion trees for feature learning and segmentation. In German
Conference on Pattern Recognition, 95–106. Springer.
MacGlashan, J.; Ho, M. K.; Loftin, R.; Peng, B.; Wang, G.;
Roberts, D. L.; Taylor, M. E.; and Littman, M. L. 2017. Inter-
active learning from policy-dependent human feedback. In
Proceedings of the 34th International Conference on Machine
Learning-Volume 70, 2285–2294. JMLR. org.
Maclin, R.; and Shavlik, J. W. 1996. Creating advice-taking
reinforcement learners. Machine Learning 22(1-3): 251–281.
Nair, V.; and Hinton, G. E. 2010. Rectified linear units im-
prove restricted Boltzmann machines. In Proceedings of the
27th international conference on machine learning (ICML-
10), 807–814.
Richardson, M.; and Domingos, P. 2006. Markov logic net-
works. Machine learning 62(1-2): 107–136.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and
Klimov, O. 2017. Proximal policy optimization algorithms.
arXiv preprint arXiv:1707.06347 .
Silva, A.; Killian, T.; Rodriguez, I. D. J.; Son, S.-H.; and
Gombolay, M. 2020. Optimization Methods for Interpretable
Differentiable Decision Trees in Reinforcement Learning.
In International Conference on Artificial Intelligence and
Statistics.
Tanno, R.; Arulkumaran, K.; Alexander, D. C.; Criminisi, A.;
and Nori, A. V. 2018. Adaptive Neural Trees. arXiv preprint
arXiv:1807.06699 .
Tieleman, T.; and Hinton, G. 2012. Lecture 6.5-rmsprop:
Divide the gradient by a running average of its recent mag-
nitude. COURSERA: Neural networks for machine learning
4(2): 26–31.
Towell, G. G.; and Shavlik, J. W. 1994. Knowledge-based
artificial neural networks. Artificial intelligence 70(1-2): 119–
165.
Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhn-
evets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.;
Schrittwieser, J.; et al. 2017. StarCraft II: A new challenge for
reinforcement learning. arXiv preprint arXiv:1708.04782 .
Wang, J.; Wang, Z.; Zhang, D.; and Yan, J. 2017. Combining
knowledge with deep convolutional neural networks for short
text classification. In Proceedings of IJCAI, volume 350.
Yuan, Y.; and Shaw, M. J. 1995. Induction of fuzzy decision
trees. Fuzzy Sets and systems 69(2): 125–139.
Zhang, R.; Torabi, F.; Guan, L.; Ballard, D. H.; and Stone, P.
2019. Leveraging human guidance for deep reinforcement
learning tasks. arXiv preprint arXiv:1909.09906 .
Zhang, S.; and Sridharan, M. 2019. AAAI Tutorial:
Knowledge-based Sequential Decision-Making under Un-
certainty. AAAI Workshop: Knowledge-based Sequential
Decision-Making under Uncertainty .
Zhu, F.; and Liao, P. 2017. Effective warm start for the online
actor-critic reinforcement learning based mhealth interven-
tion. arXiv preprint arXiv:1704.04866 .
A    ProLoNet Forward Pass

An algorithmic step-through of the forward pass for the ProLoNet is provided in Algorithm 3. The example from the main paper is included here:

Example 4 (ProLoNet Inference). Consider an example cart pole state, X = [2, 1, 0, 3]. Following the equation in Line 3 of Algorithm 3, the network arrives at σ([1, 0, 0, 0] · [2, 1, 0, 3] − 0) = 0.88 for D_0, meaning "mostly true." This decision probability propagates to the two leaf nodes using their respective paths (Lines 9-15 in Algorithm 3), making the output of the network a probability given by (0.88 * [1, 0] + (1 − 0.88) * [0, 1]) = [0.88, 0.12]. Accordingly, the agent selects the first action with probability 0.88 and the second action otherwise.
Algorithm 3 P RO L O N ET Forward Pass                            episode, and updating their policy parameters. For the LOKI
                                                                  agent, we set N =200. All agents are updated according to
  Input: Input Data X, P RO L O N ET P                            the standard PPO loss function. We selected all parameters
  for dn ∈ D ∈ P do                                               empirically to produce the best results for each method.
     σn = σ[α(w~n T ∗ X ~ − cn )]
  end for                                                         C.2       Lunar Lander
  A~ OU T = Output Actions
                                                                  Lunar lander is the second domain we use from the OpenAI
  for ~li ∈ L do                                                  Gym (Brockman et al. 2016), and is based on the classic
     Path to ~li = Z(L)                                           Atari game of the same name. Lunar lander is a game where
     z=1                                                          the player attempts to land a small ship (the lander) safely on
     for σi ∈ Z(L) do                                             the ground, keeping the lander upright and touching down
         if σi should be T RU E ∈ Z(L) then                       slowly. The 8D state consists of the lander’s {x, y} position
            z = z ∗ σi                                            and velocity, the lander’s angle and angular velocity, and two
         else                                                     binary flags which are true when the left or right legs have
            z = z ∗ (1 − σi )                                     touched down.
         end if                                                      We use the discrete lunar lander domain, and so the 4D
     end for                                                      action space contains {do nothing, left engine, main engine,
     A~ OU T = A ~ OU T + ~li ∗ z
                                                                  right engine}. For the lunar lander domain, we set most
  end for                                                         hyperparameters to the same values as in the cart pole domain.
  Return: A   ~ OU T
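For readers who want to trace Example 4 programmatically, the following is a minimal sketch of Algorithm 3 in Python/PyTorch. The function and argument names are our own, and the one-node, two-leaf tree below simply mirrors the example; it is not the initialization used in our experiments.

```python
import torch

def prolonet_forward(x, weights, comparators, alpha, leaf_paths, leaf_actions):
    """Soft decision-tree forward pass (sketch of Algorithm 3).

    weights:      (num_nodes, state_dim) tensor, one w_n per decision node
    comparators:  (num_nodes,) tensor of c_n values
    leaf_paths:   list of (true_nodes, false_nodes) index lists, one pair per leaf
    leaf_actions: (num_leaves, num_actions) tensor of leaf action priors
    """
    # Line 3: sigma_n = sigmoid(alpha * (w_n . x - c_n)) for every decision node
    sigmas = torch.sigmoid(alpha * (weights @ x - comparators))

    out = torch.zeros(leaf_actions.shape[1])
    for (true_nodes, false_nodes), leaf in zip(leaf_paths, leaf_actions):
        z = torch.tensor(1.0)
        # Lines 9-15: multiply the decision probabilities along the path to this leaf
        for n in true_nodes:
            z = z * sigmas[n]
        for n in false_nodes:
            z = z * (1.0 - sigmas[n])
        out = out + leaf * z   # Line 16: weight the leaf by its path probability
    return out

# Example 4: one decision node, two leaves, alpha = 1
x = torch.tensor([2.0, 1.0, 0.0, 3.0])
weights = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
comparators = torch.tensor([0.0])
leaf_paths = [([0], []), ([], [0])]    # leaf 0 if D_0 is true, leaf 1 otherwise
leaf_actions = torch.tensor([[1.0, 0.0], [0.0, 1.0]])

print(prolonet_forward(x, weights, comparators, 1.0, leaf_paths, leaf_actions))
# -> approximately [0.88, 0.12]
```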
B    Hyperparameters and Optimization Details
All actors are updated with proximal policy optimization (PPO) (Schulman et al. 2017). Notably, for the two SC2 domains, we find that weighting the PPO update by the Kullback-Leibler divergence between the old and new policies (Equation 1) yields superior performance. The critic's loss function is the mean-squared error between the output of the critic and the reward from the state-action pair. All approaches are trained with RMSProp (Tieleman and Hinton 2012). We set our reward discount factor to 0.99 and our learning rates to 1e-2 for Gym environments and 1e-4 for the SC2 domains, following a hyperparameter search between 1e-2 and 1e-5. Update batch sizes dynamically grow as more replay experience becomes available. In all domains, the ProLoNet α parameter is initialized to 1. Our agents utilize two separate networks: one for the actor and one for the critic. For our approach, the critic network is initialized as a copy of the actor, as we do not solicit intelligent value predictions, only policies. Our dynamic-growth hyperparameter is set to 0.1 based upon experimental observation.
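As a concrete illustration of this setup, the snippet below sketches how the paired actor and critic might be constructed; the helper name and module handling are illustrative assumptions rather than the exact code used in our experiments.

```python
import copy
import torch

GAMMA = 0.99              # reward discount factor
LR_GYM, LR_SC2 = 1e-2, 1e-4

def build_agent(actor_net, domain="gym"):
    """Pair an actor with a critic and RMSProp optimizers (illustrative sketch)."""
    lr = LR_GYM if domain == "gym" else LR_SC2
    # For our approach, the critic starts as a copy of the actor network.
    critic_net = copy.deepcopy(actor_net)
    actor_opt = torch.optim.RMSprop(actor_net.parameters(), lr=lr)
    critic_opt = torch.optim.RMSprop(critic_net.parameters(), lr=lr)
    return critic_net, actor_opt, critic_opt
```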
C    Experimental Domain Details

C.1    Cart Pole
Cart pole is an RL domain (Barto, Sutton, and Anderson 1983) where the objective is to balance an inverted pendulum on a cart that moves left or right. The state space is a 4D vector representing {cart position, cart velocity, pole angle, pole velocity}, and the action space is {left, right}. We use the cart pole domain from the OpenAI Gym (Brockman et al. 2016).
   For the cart pole domain, we set all agents' learning rates to 0.01, the batch size is set to grow dynamically as more replay experience becomes available, we initialize α = 1, and each agent trains on all data gathered after each episode, then empties its replay buffer. All agents train on 2 simulations concurrently, pooling replay experience after each episode and updating their policy parameters. For the LOKI agent, we set N = 200. All agents are updated according to the standard PPO loss function. We selected all parameters empirically to produce the best results for each method.
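For reference, a minimal interaction loop with this environment looks roughly as follows; the environment id and the 4-tuple step API are assumptions that depend on the installed Gym version, and a random policy stands in for the learning agent.

```python
import gym

env = gym.make("CartPole-v1")    # 4D observation, 2 discrete actions
state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()         # stand-in for the agent's policy
    state, reward, done, info = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```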
C.2    Lunar Lander
Lunar lander is the second domain we use from the OpenAI Gym (Brockman et al. 2016), and is based on the classic Atari game of the same name. Lunar lander is a game where the player attempts to land a small ship (the lander) safely on the ground, keeping the lander upright and touching down slowly. The 8D state consists of the lander's {x, y} position and velocity, the lander's angle and angular velocity, and two binary flags which are true when the left or right legs have touched down.
   We use the discrete lunar lander domain, and so the 4D action space contains {do nothing, left engine, main engine, right engine}. For the lunar lander domain, we set most hyperparameters to the same values as in the cart pole domain. The two exceptions are the number of concurrent processes, which we set to 4, and the LOKI agent's N, which is set to 300. All agents use the standard PPO loss function.

C.3    FindAndDefeatZerglings
FindAndDefeatZerglings is a minigame from the SC2LE designed to challenge RL agents to learn how to effectively micromanage their individual attacking units in SC2. The agent controls three attacking units on a small, partially-observable map, and must explore the map while killing enemy units. The agent receives +1 reward for each enemy unit that is killed, and -1 for each allied unit that is killed. Enemy units respawn in random locations, and so the best agents are ones that continuously explore and kill enemy units until the three-minute timer has elapsed.
   We leverage the SC2 API (https://github.com/Blizzard/s2client-api) to construct a 37D state which contains {x position, y position, health, weapon cooldown} for three allied units, and {x position, y position, health, weapon cooldown, is baneling} for the five nearest visible enemy units. Missing information is filled with -1. Our action space is 10D, containing move commands for north, east, south, and west, attack commands for each of the five nearest visible enemies, and a "do nothing" command. For this problem, we assign an agent to each individual allied unit, which generates actions for only that unit. Experience from each agent stops accumulating when the unit dies. All experience is pooled for policy updates after each episode, and parameters are shared between agents.
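As an illustration of how such a fixed-length state can be assembled, the sketch below pads missing units with -1; the feature ordering and helper names are our own and are only meant to mirror the description above.

```python
def build_state(allied_units, enemy_units):
    """Flatten unit features into the 37D vector described above.

    allied_units: up to 3 dicts with keys x, y, health, cooldown
    enemy_units:  up to 5 dicts with keys x, y, health, cooldown, is_baneling
    """
    state = []
    for i in range(3):                      # 3 allied units x 4 features = 12
        if i < len(allied_units):
            u = allied_units[i]
            state += [u["x"], u["y"], u["health"], u["cooldown"]]
        else:
            state += [-1] * 4               # missing information is filled with -1
    for i in range(5):                      # 5 enemy units x 5 features = 25
        if i < len(enemy_units):
            e = enemy_units[i]
            state += [e["x"], e["y"], e["health"], e["cooldown"], float(e["is_baneling"])]
        else:
            state += [-1] * 5
    assert len(state) == 37
    return state
```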
   For the SC2LE minigame, we set all agents' learning rates to 0.001, we again initialize α = 1, and we set the batch size to 4. Each agent trains on replay data for 50 update iterations per episode, and pools experience from 2 concurrent processes. The LOKI agent's N is set to 500. The agents in this domain update according to the loss function in Equation 1.

   L(a, s, π_new, π_old) = (A · log(a | π_new)) / KL(P(a | π_new, s), P(a | π_old, s))     (1)

where A is the advantage gained by taking action a in state s, π_new is the current set of model parameters, and π_old is the set of model parameters used during the episode which generated this state-action pair. P(a | π, s) denotes the probability distribution over all actions that a policy π yields given state s. As in prior work, the advantage A is calculated by subtracting the value prediction for taking action a in state s, given by a critic network, from the reward obtained by taking action a in state s.
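A sketch of how this loss might be computed for a batch of transitions is given below; the KL term is the categorical KL divergence between the current and stored action distributions, and the clamping floor that guards against division by a near-zero KL is our own addition for illustration.

```python
import torch
import torch.nn.functional as F

def kl_weighted_loss(new_logits, old_logits, actions, advantages, eps=1e-3):
    """Equation 1, per transition: A * log pi_new(a) / KL(pi_new || pi_old).

    new_logits, old_logits: (batch, num_actions); actions: (batch,) long tensor.
    """
    new_log_probs = F.log_softmax(new_logits, dim=-1)
    old_log_probs = F.log_softmax(old_logits, dim=-1).detach()

    # Log-probability of the actions that were actually taken.
    chosen = new_log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # KL(pi_new || pi_old) per state, floored to avoid dividing by ~0 (our guard).
    kl = (new_log_probs.exp() * (new_log_probs - old_log_probs)).sum(dim=-1)
    kl = kl.clamp(min=eps)

    # Negative sign: optimizers minimize, but we want to maximize the objective.
    return -(advantages * chosen / kl).mean()
```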
C.4    SC2 Full Game
Our simplified StarCraft 2 state contains:
• Allied Unit Counts: A 36x1 vector in which each index corresponds to a type of allied unit, and the value corresponds to how many of those units exist.
• Pending Unit Counts: As above, but for units that are currently in production and do not exist yet.
• Enemy Unit Counts: A 112x1 vector in which each index corresponds to a type of unit, and the value corresponds to how many of those types are visible.
• Player State: A 9x1 vector of specific player state information, including minerals, vespene gas, supply, etc.
   Concatenated, these vectors yield the 193D state our agents consume. The disparity between allied unit counts and enemy unit counts is due to the fact that we only play as the Protoss race, but we can play against any of the three races.
   The number of actions in SC2 can be well into the thousands if one considers every individual unit's abilities. As we seek to encode a high-level strategy, rather than rules for moving every individual unit, we restrict the action space for our agent. Rather than using exact mouse and camera commands for individual units, we abstract actions out to high-level commands such as "Build Pylon." As such, our agents have 44 available actions, including 35 building and unit production commands, 4 research commands, and 5 commands for attack, defend, harvest resources, scout, and do nothing.
   For the full SC2 game, we set all agents' learning rates to 0.0001, we again initialize α = 1, set the batch size to 4, and set updates per episode to 8. We run 4 episodes between updates, and set the LOKI agent's N to 1000. Agents train for as long as necessary to achieve a > 80% win-rate against the easiest AI, then move up to successive levels of difficulty as they achieve > 80% win-rates. The agents in this domain update according to the loss function in Equation 1.

C.5    User Study Domain: Wildfire Tracking
The objective in the wildfire tracking domain is to keep two drones on top of two fire centroids as they progress through the map. The task is complicated by the fact that the two drones do not communicate and do not have complete access to the state of the world. Instead, they have access to a 6D vector containing {D_N(F1), D_W(F1), D_N(F2), D_W(F2), C(F1), C(F2)}, where D_N is the "distance to the north" function, D_W is the "distance to the west" function, and C(F1) is the "closer to fire 1" boolean flag.
   The actions available to the drones include move commands in four directions: north, east, south, and west.

D    Initialization Heuristics in Experimental Evaluation

D.1    Cart Pole Heuristics
We use a simple set of heuristics for the cart pole problem, visualized in Figure 7. If the cart is close enough to the center, we move in the direction opposite to the lean of the pole, as long as that motion will not push us too far from the center. If the cart is close to an edge, the agent attempts to account for the cart's velocity and recenter the cart, though this is often an unrecoverable situation for the heuristic. The longest run we saw for a ProLoNet with no training was about 80 timesteps.
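The sketch below expresses this heuristic as an ordinary decision rule; the thresholds, the sign conventions, and the action encoding are illustrative assumptions rather than the exact values encoded in our initialization.

```python
LEFT, RIGHT = 0, 1    # assumed action encoding

def cart_pole_heuristic(state, center_margin=0.5, edge_push=0.1, edge=2.0):
    """Hand-coded cart pole policy in the spirit of Figure 7 (thresholds are illustrative)."""
    cart_pos, cart_vel, pole_angle, pole_vel = state

    if abs(cart_pos) < center_margin:
        # Near the center: move opposite to the lean of the pole...
        action = LEFT if pole_angle > 0 else RIGHT
        # ...unless that motion would push the cart too far from the center.
        predicted = cart_pos + (edge_push if action == RIGHT else -edge_push)
        if abs(predicted) > edge:
            action = RIGHT if action == LEFT else LEFT
        return action

    # Near an edge: try to cancel the cart's velocity and recenter.
    return LEFT if cart_vel > 0 else RIGHT
```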
D.2    Lunar Lander Heuristics
For the lunar lander problem, the heuristic rules are split into two primary phases. The first phase is engaged at the beginning of an episode while the lander is still high above the surface. In this phase, the heuristic focuses on keeping the lander's angle as close to 0 as possible. Phase two occurs when the lander gets closer to the surface, and the agent then focuses on keeping the y velocity lower than 0.2. As is depicted in Figure 8, there are many checks for both lander legs being down. We found that both LOKI and ProLoNets were prone to landing successfully but continuing to fire their left or right boosters. In an attempt to ameliorate this problem, we added the extra "legs down" checks.
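A rough translation of these two phases into code is given below; the altitude and angle cutoffs and the action encoding are assumptions for illustration and do not reproduce Figure 8 exactly.

```python
NOOP, LEFT_ENGINE, MAIN_ENGINE, RIGHT_ENGINE = 0, 1, 2, 3    # assumed encoding

def lunar_lander_heuristic(state, high_altitude=0.5, max_descent=0.2, angle_tol=0.05):
    """Two-phase lander heuristic in the spirit of Figure 8 (thresholds are illustrative)."""
    x, y, vx, vy, angle, v_angle, left_leg, right_leg = state

    # Once both legs are down, stop firing the side boosters.
    if left_leg and right_leg:
        return NOOP

    if y > high_altitude:
        # Phase one: keep the lander's angle as close to 0 as possible.
        if angle > angle_tol:
            return RIGHT_ENGINE
        if angle < -angle_tol:
            return LEFT_ENGINE
        return NOOP

    # Phase two: keep the descent speed below the threshold.
    if vy < -max_descent:
        return MAIN_ENGINE
    return NOOP
```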
D.3    FindAndDefeatZerglings Heuristics
For the SC2LE minigame, the overall strategy of our heuristic is to stay grouped up and fight or explore as a group. As such, the first four checks are all in place to ensure that the marines are all close to each other. After they pass the proximity checks, they attack whatever is nearest. If nothing is nearby, they will move in a counter-clockwise sweep around the periphery of the map, searching for more zerglings. Our heuristic is shown in Figure 9.

D.4    SC2 Full Game Heuristics
The SC2 full game heuristic first checks for important actions that should always be high priority, such as attacking, defending, harvesting resources, and scouting. Once initial checks for these are all passed, the heuristic descends into the build order, where it simply uses building or unit count checks to determine when certain units should be built or trained. After enough attacking units are trained, the heuristic indicates that it is time to attack. The SC2 full game heuristic is depicted in Figure 10.

E    Architectures for Algorithms in Experimental Evaluation
In this section we briefly describe the MLP and LSTM action network architectures. The LOKI agent maintained the same architecture as the MLP agent.

E.1    Cart Pole
The cart pole MLP network is a 3-layer network following the sequence:
   4x4 – 4x4 – 4x2.
   We experimented with sizes ranging from 4 to 64 and numbers of hidden layers from 1 to 10, and found that the small network performed the best.
   The LSTM network for cart pole is the same as the MLP network, though with an LSTM unit inserted between the first and second layers. The LSTM unit's hidden size is 4, so the final sequence is:
   4x4 – LSTM(4x4) – 4x4 – 4x2.
   We experimented with hidden sizes for the LSTM unit from 4 to 64, though none were overwhelmingly successful, and we varied the number of layers after the LSTM unit from 1 to 10.
   The ProLoNet agent for this task used 9 decision nodes and 11 leaves. For the deepening experiment, we tested an agent with only a single node and 2 leaves, and found that it still solved the task very quickly. We tested randomly initialized architectures from 1 to 9 nodes and from 2 to 11 leaves, and we found that all combinations successfully solved the task.
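For concreteness, the two cart pole baselines can be sketched as follows; the activation choices and the handling of the LSTM's hidden state are assumptions, since only the layer widths are specified above.

```python
import torch
import torch.nn as nn

# 3-layer MLP baseline: 4x4 - 4x4 - 4x2 (ReLU between layers is an assumption)
mlp_actor = nn.Sequential(
    nn.Linear(4, 4), nn.ReLU(),
    nn.Linear(4, 4), nn.ReLU(),
    nn.Linear(4, 2),
)

class LSTMActor(nn.Module):
    """4x4 - LSTM(4x4) - 4x4 - 4x2, with the recurrent unit after the first layer."""
    def __init__(self, obs_dim=4, hidden=4, n_actions=2):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc_mid = nn.Linear(hidden, hidden)
        self.fc_out = nn.Linear(hidden, n_actions)

    def forward(self, x, hidden_state=None):
        # x: (batch, time, obs_dim)
        h = torch.relu(self.fc_in(x))
        h, hidden_state = self.lstm(h, hidden_state)
        h = torch.relu(self.fc_mid(h))
        return self.fc_out(h), hidden_state

logits = mlp_actor(torch.randn(1, 4))    # action logits for one cart pole state
```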
E.2    Lunar Lander
The lunar lander MLP network is a 3-layer network, following the sequence:
   8x8 – 8x8 – 8x4.
   We again experimented with sizes from 8 to 64 and numbers of hidden layers from 2 to 11.
   The LSTM network for lunar lander mimics the architecture from cart pole. The LSTM unit's hidden size is 8, so the final sequence is:
   8x8 – LSTM(8x8) – 8x4.
   We experimented with hidden sizes for the LSTM unit from 8 to 64, and again we varied the number of layers succeeding the LSTM unit from 1 to 10.
   The ProLoNet agent for this task featured 14 decision nodes and 15 leaves. We experimented with intelligent initialization architectures ranging from 10 to 14 nodes and from 10 to 15 leaves, and found little difference between their performances. The additional nodes were an attempt to encourage the agent to "do nothing" once successfully landed, as the agent had a tendency to continue shuffling left-right after successfully touching down.

E.3    FindAndDefeatZerglings
We failed to find an MLP architecture that succeeded in this task, and so we chose one that compromised between the depth of the ProLoNet and the simplicity that MLP agents seemed to prefer for toy domains. The final network is a 7-layer network with the following sequence:
   37x37 – 37x37 – 37x37 – 37x37 – 37x37 – 37x37 – 37x10.
   We chose to keep the size at 37 after testing 37 and 64 as sizes, and deciding that staying as close to the ProLoNet architecture as possible was the best bet.
   The LSTM network for FindAndDefeatZerglings features more hidden layers than the LSTM for lunar lander and cart pole. The hidden size is set to 37, and the LSTM unit is followed by 5 layers. The final sequence is:
   37x37 – LSTM(37x37) – 37x37 – 37x37 – 37x37 – 37x37 – 37x10.
   We experimented with hidden sizes for the LSTM unit from 37 to 64 and varied the number of successive layers from 4 to 10.
   The ProLoNet agent for FindAndDefeatZerglings featured 10 nodes and 11 leaves. We tested architectures from 6 to 15 nodes and from 7 to 13 leaves, and found that the initialized policy and architecture had more of an immediate impact for this task. The 7-node policy allowed agents to spread out too much, and they died quickly, whereas the 15-node policy had agents moving more than shooting, and they would walk around while being overrun.

E.4    SC2 Full Game
We again failed to find an MLP architecture that succeeded in this task, and so we used a similar architecture to that of the FindAndDefeatZerglings task. The 7-layer network is of the sequence:
   193x193 – 193x193 – 193x193 – 193x193 – 193x193 – 193x193 – 193x44.
   We again experimented with a variety of shapes and numbers of layers, though none succeeded.
   Again, the LSTM network shadows the MLP network for this task. As in the FindAndDefeatZerglings task, we experimented with a variety of LSTM hidden unit sizes, hidden layer sizes, and hidden layer numbers. The final architecture reflects the FindAndDefeatZerglings sequence:
   193x193 – LSTM(193x193) – 193x193 – 193x193 – 193x193 – 193x193 – 193x44.
   The ProLoNet agent for the SC2 full game featured 10 nodes and 11 leaves. We tested architectures from 10 to 16 nodes and from 1 to 17 leaves, and found that the initialized policy and architecture were not as important for this task as they were for the FindAndDefeatZerglings task. As long as we included a basic build order and the "attack" command, the agent would manage to defeat the VeryEasy in-game AI at least 10% of the time. We found that constraining the policy to fewer nodes and leaves provided less noise as updates progressed and kept the policy close to initialization while also providing improvements. An initialization with too many parameters often seemed to degrade quickly, presumably due to small changes over many parameters having a larger impact than small changes over few parameters.