Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration

Ritesh Noothigattu§, Djallel Bouneffouf†, Nicholas Mattei†, Rachita Chandra†, Piyush Madan†, Kush Varshney†, Murray Campbell†, Moninder Singh†, Francesca Rossi†∗

†IBM Research, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
§Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA
{rachitac, krvarshn, mcam, moninder}@us.ibm.com, {djallel.bouneffouf, n.mattei, piyush.madan1, francesca.rossi2}@ibm.com, riteshn@cmu.edu

∗ On leave from the University of Padova.

arXiv:1809.08343v1 [cs.LG] 21 Sep 2018
Abstract

Autonomous cyber-physical agents and systems play an increasingly large role in our lives. To ensure that agents behave in ways aligned with the values of the societies in which they operate, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. These constraints and norms can come from any number of sources including regulations, business process guidelines, laws, ethical principles, social norms, and moral values. We detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations of the task, and reinforcement learning to learn to maximize the environment rewards. More precisely, we assume that an agent can observe traces of behavior of members of the society but has no access to the explicit set of constraints that give rise to the observed behavior. Inverse reinforcement learning is used to learn such constraints, which are then combined with a possibly orthogonal value function through the use of a contextual-bandit-based orchestrator that picks a contextually appropriate choice between the two policies (constraint-based and environment reward-based) when taking actions. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constrained policy. In addition, the orchestrator is transparent about which policy is being employed at each time step. We test our algorithms in a Pac-Man domain and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.

1 Introduction

Concerns about the ways in which cyber-physical and/or autonomous decision making systems behave when deployed in the real world are growing: what various stakeholders are worried about is that these systems may achieve their goals in ways that are not considered acceptable according to the values and norms of the impacted community, also called "specification gaming" behaviors. Thus, there is a growing need to understand how to constrain the actions of an AI system by providing boundaries within which the system must operate.

To tackle this problem, we may take inspiration from humans, who often constrain the decisions and actions they take according to a number of exogenous priorities, be they moral, ethical, religious, or business values (Sen 1974), and we may want the systems we build to be restricted in their actions by similar principles (Arnold et al. 2017). The overriding concern is that the autonomous agents we construct may not obey these values on their way to maximizing some objective function (Simonite 2018).

The idea of teaching machines right from wrong has become an important research topic in both AI (Yu et al. 2018) and farther afield (Wallach and Allen 2008). Much of the research at the intersection of artificial intelligence and ethics falls under the heading of machine ethics, i.e., adding ethics and/or constraints to a particular system's decision making process (Anderson and Anderson 2011). One popular technique to handle these issues is called value alignment, i.e., the idea that an agent can only pursue goals that follow values that are aligned to human values and thus beneficial to humans (Russell, Dewey, and Tegmark 2015).

Another important notion for these autonomous decision making systems is the idea of transparency or interpretability, i.e., being able to see why the system made the choices it did. Theodorou, Wortham, and Bryson (2016) observe that the Engineering and Physical Science Research Council (EPSRC) Principles of Robotics dictates the implementation of transparency in robotic systems. The authors go on to define transparency in a robotic or autonomous decision making system as "... a mechanism to expose the decision making of the robot".

This still leaves open the question of how to provide the behavioral constraints to the agent. A popular technique is called the bottom-up approach, i.e., teaching a machine what is right and wrong by example (Allen, Smit, and Wallach 2005). In this paper, we adopt this approach as we consider the case where only examples of the correct behavior are available to the agent, and it must therefore learn from only these examples.
We propose a framework which enables an agent to learn two policies: (1) πR, a reward-maximizing policy obtained through direct interaction with the world, and (2) πC, obtained via inverse reinforcement learning over demonstrations, by humans or other agents, of how to obey a set of behavioral constraints in the domain. Our agent then uses a contextual-bandit-based orchestrator to learn to blend the policies in a way that maximizes a convex combination of the rewards and constraints. Within the RL community this can be seen as a particular type of apprenticeship learning (Abbeel and Ng 2004) where the agent is learning how to be safe, rather than only maximizing reward (Leike et al. 2017).

One may argue that we should employ πC for all decisions as it will be more "safe" than employing πR. Indeed, although one could use only πC for the agent, there are a number of reasons to employ the orchestrator. First, the humans or other demonstrators may be good at demonstrating what not to do in a domain but may not provide examples of how best to maximize reward. Second, the demonstrators may not be as creative as the agent when mixing the two policies (Ventura and Gates 2018). By allowing the orchestrator to learn when to apply which policy, the agent may be able to devise better ways to blend the policies, leading to behavior which both follows the constraints and achieves higher reward than any of the human demonstrations. Third, we may not want to obtain demonstrations of what to do in all parts of the domain, e.g., there may be dangerous or hard-to-model regions, or there may be mundane parts of the domain in which human demonstrations are too costly to obtain. In this case, having the agent learn through RL what to do in the non-demonstrated parts is of value. Finally, as we have argued, interpretability is an important feature of our system. Although the policies themselves may not be directly interpretable (though there is recent work in this area (Verma et al. 2018; Liu et al. 2018)), our system does capture the notion of transparency and interpretability as we can see which policy is being applied in real time.

Contributions. We propose and test a novel approach to teach machines to act in ways that achieve and compromise multiple objectives in a given environment. One objective is the desired goal and the other is a set of behavioral constraints, learnt from examples. Our technique uses aspects of both traditional reinforcement learning and inverse reinforcement learning to identify policies that both maximize rewards and follow particular constraints within an environment. Our agent then blends these policies in novel and interpretable ways using an orchestrator based on the contextual bandits framework. We demonstrate the effectiveness of these techniques on the Pac-Man domain where the agent is able to learn both a reward-maximizing and a constrained policy, and select between these policies in a transparent way based on context, to employ a policy that achieves high reward and obeys the demonstrated constraints.

2 Related Work

Ensuring that our autonomous systems act in line with our values while achieving their objectives is a major research topic in AI. These topics have gained popularity among a broad community including philosophers (Wallach and Allen 2008) and non-profits (Russell, Dewey, and Tegmark 2015). Yu et al. (2018) provide an overview of much of the recent research at major AI conferences on ethics in artificial intelligence.

Agents may need to balance objectives and feedback from multiple sources when making decisions. One prominent example is the case of autonomous cars. There is extensive research from multidisciplinary groups into the questions of when autonomous cars should make lethal decisions (Bonnefon, Shariff, and Rahwan 2016), how to aggregate societal preferences to make these decisions (Noothigattu et al. 2017), and how to measure distances between these notions (Loreggia et al. 2018a; Loreggia et al. 2018b). In a recommender systems setting, a parent or guardian may want the agent to not recommend certain types of movies to children, even if this recommendation could lead to a high reward (Balakrishnan et al. 2018a; Balakrishnan et al. 2018b). Recently, as a complement to their concrete problems in AI safety, which include reward hacking and unintended side effects (Amodei et al. 2016), a DeepMind study has compiled a list of specification gaming examples, where very different agents game the given specification by behaving in unexpected (and undesired) ways.1

1 38 AI "specification gaming" examples are available at: https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml

Within the field of reinforcement learning there has been specific work on ethical and interpretable RL. Wu and Lin (2017) detail a system that is able to augment an existing RL system to behave ethically. In their framework, the assumption is that, given a set of examples, most of the examples follow ethical guidelines. The system updates the overall policy to obey the ethical guidelines learned from demonstrations using IRL. However, in this system only one policy is maintained, so it has no transparency. Laroche and Feraud (2017) introduce a system that is capable of selecting among a set of RL policies depending on context. They demonstrate an orchestrator that, given a set of policies for a particular domain, is able to assign a policy to control the next episode. However, this approach uses the classical multi-armed bandit, so the state context is not considered in the choice of the policy.

Interpretable RL has received significant attention in recent years. Luss and Petrik (2016) introduce action constraints over states to enhance the interpretability of policies. Verma et al. (2018) present a reinforcement learning framework, called Programmatically Interpretable Reinforcement Learning (PIRL), that is designed to generate interpretable and verifiable agent policies. PIRL represents policies using a high-level, domain-specific programming language. Such programmatic policies have the benefit of being more easily interpreted than neural networks, and being amenable to verification by symbolic methods.
Additionally, Liu et al. (2018) introduce Linear Model U-trees (LMUTs) to approximate neural network predictions. An LMUT is learned using a novel on-line algorithm that is well-suited for an active play setting, where the mimic learner observes an ongoing interaction between the neural net and the environment. Empirical evaluation shows that an LMUT mimics a Q function substantially better than five baseline methods. The transparent tree structure of an LMUT facilitates understanding the learned knowledge by analyzing feature influence, extracting rules, and highlighting the super-pixels in image inputs.

3 Background

3.1 Reinforcement Learning

Reinforcement learning defines a class of algorithms solving problems modeled as a Markov decision process (MDP) (Sutton and Barto 1998). A Markov decision problem is usually denoted by the tuple (S, A, T, R, γ), where
• S is a set of possible states
• A is a set of actions
• T is a transition function defined by T(s, a, s') = Pr(s' | s, a), where s, s' ∈ S and a ∈ A
• R : S × A × S → R is a reward function
• γ is a discount factor that specifies how much long-term reward is kept.

The goal in an MDP is to maximize the discounted long-term reward received. Usually the infinite-horizon objective is considered:

    max Σ_{t=0}^∞ γ^t R(s_t, a_t, s_{t+1}).    (1)

Solutions come in the form of policies π : S → A, which specify what action the agent should take in any given state, deterministically or stochastically. One way to solve this problem is through Q-learning with function approximation (Bertsekas and Tsitsiklis 1996). The Q-value of a state-action pair, Q(s, a), is the expected future discounted reward for taking action a ∈ A in state s ∈ S. A common method to handle very large state spaces is to approximate the Q function as a linear function of some features. Let ψ(s, a) denote relevant features of the state-action pair ⟨s, a⟩. Then, we assume Q(s, a) = θ · ψ(s, a), where θ is an unknown vector to be learned by interacting with the environment. Every time the reinforcement learning agent takes action a from state s, obtains immediate reward r and reaches new state s', the parameter θ is updated using

    difference = [ r + γ max_{a'} Q(s', a') ] − Q(s, a),
    θ_i ← θ_i + α · difference · ψ_i(s, a),    (2)

where α is the learning rate. ε-greedy is a common strategy used for exploration: during the training phase, a random action is played with probability ε, and the action with maximum Q-value is played otherwise. The agent follows this strategy and updates the parameter θ according to Equation (2) until the Q-values converge or for a large number of time steps.
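To make the update in Equation (2) concrete, the following is a minimal sketch of linear Q-learning with ε-greedy exploration. The feature map psi, the action list, and the way transitions are obtained are generic placeholders for domain-specific code, not part of the paper's implementation.

```python
import random
import numpy as np

def q_value(theta, psi, s, a):
    # Linear approximation: Q(s, a) = theta . psi(s, a)
    return float(np.dot(theta, psi(s, a)))

def q_learning_update(theta, psi, actions, s, a, r, s_next, alpha=0.01, gamma=0.99):
    # One application of Equation (2) after observing the transition (s, a, r, s_next).
    max_next = max(q_value(theta, psi, s_next, a2) for a2 in actions)
    difference = (r + gamma * max_next) - q_value(theta, psi, s, a)
    return theta + alpha * difference * psi(s, a)

def epsilon_greedy(theta, psi, actions, s, epsilon=0.05):
    # Exploration strategy: random action with probability epsilon, greedy otherwise.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_value(theta, psi, s, a))
```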
3.2 Inverse Reinforcement Learning

IRL seeks to find the most likely reward function R_E which an expert E is executing (Abbeel and Ng 2004; Ng and Russell 2000). IRL methods assume the presence of an expert that solves an MDP, where the MDP is fully known and observable by the learner except for the reward function. Since the states and actions of the expert are fully observable by the learner, it has access to trajectories executed by the expert. A trajectory consists of a sequence of state-action pairs, Tr = (s_0, a_0, s_1, a_1, . . . , s_{L−1}, a_{L−1}, s_L), where s_t is the state of the environment at time t, a_t is the action played by the expert at the corresponding time, and L is the length of this trajectory. The learner is given access to m such trajectories {Tr^(1), Tr^(2), . . . , Tr^(m)} to learn the reward function. Since the space of all possible reward functions is extremely large, it is common to represent the reward function as a linear combination of ℓ > 0 features: R̂_w(s, a, s') = Σ_{i=1}^ℓ w_i φ_i(s, a, s'), where the w_i are weights to be learned, and φ_i(s, a, s') → R is a feature function that maps a state-action-state tuple to a real value, denoting the value of a specific feature of this tuple (Abbeel and Ng 2004). Current state-of-the-art IRL algorithms use feature expectations as a way of evaluating the quality of the learned reward function (Abbeel and Ng 2004). For a policy π, the feature expectations starting from state s_0 are defined as

    μ(π) = E[ Σ_{t=0}^∞ γ^t φ(s_t, a_t, s_{t+1}) | π ],

where the expectation is taken with respect to the state sequence s_1, s_2, . . . achieved on taking actions according to π starting from s_0. One can compute an empirical estimate of the feature expectations of the expert's policy with the help of the trajectories {Tr^(1), Tr^(2), . . . , Tr^(m)}, using

    μ̂_E = (1/m) Σ_{i=1}^m Σ_{t=0}^{L−1} γ^t φ(s_t^(i), a_t^(i), s_{t+1}^(i)).    (3)

Given a weight vector w, one can compute the optimal policy π_w for the corresponding reward function R̂_w, and estimate its feature expectations μ̂(π_w) in a way similar to (3). IRL compares this μ̂(π_w) with the expert's feature expectations μ̂_E to learn the best-fitting weight vector w. Instead of a single weight vector, the IRL algorithm by Abbeel and Ng (2004) learns a set of possible weight vectors, and asks the agent designer to pick the most appropriate weight vector among these by inspecting their corresponding policies.
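As a concrete illustration of Equation (3), the following is a minimal sketch of computing the empirical feature expectations of the expert from a set of trajectories. The trajectory and feature-function interfaces are assumptions made for the example, not code from the paper.

```python
import numpy as np

def empirical_feature_expectations(trajectories, phi, gamma=0.99):
    # mu_hat_E of Equation (3): discounted sum of features, averaged over trajectories.
    # Each trajectory is a list of (s, a, s_next) transitions; phi maps a transition
    # to a length-l numpy feature vector.
    total = None
    for trajectory in trajectories:
        for t, (s, a, s_next) in enumerate(trajectory):
            contribution = (gamma ** t) * phi(s, a, s_next)
            total = contribution if total is None else total + contribution
    return total / len(trajectories)
```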
3.3 Contextual Bandits

Following Langford and Zhang (2008), the contextual bandit problem is defined as follows. At each time t ∈ {0, 1, . . . , (T − 1)}, the player is presented with a context vector c(t) ∈ R^d and must choose an arm k ∈ [K] = {1, 2, . . . , K}. Let r = (r_1(t), . . . , r_K(t)) denote a reward vector, where r_k(t) is the reward at time t associated with the arm k ∈ [K]. We assume that the expected reward is a linear function of the context, i.e., E[r_k(t) | c(t)] = μ_k^⊤ c(t), where μ_k is an unknown weight vector (to be learned from the data) associated with the arm k.

The purpose of a contextual bandit algorithm A is to minimize the cumulative regret. Let H : C → [K], where C is the set of possible contexts and c(t) is the context at time t, let h_t ∈ H be a hypothesis computed by the algorithm A at time t, and let h_t^* = argmax_{h_t ∈ H} r_{h_t(c(t))}(t) be the optimal hypothesis at the same round. The cumulative regret is

    R(T) = Σ_{t=1}^T ( r_{h_t^*(c(t))}(t) − r_{h_t(c(t))}(t) ).

One widely used way to solve the contextual bandit problem is the Contextual Thompson Sampling algorithm (CTS) (Agrawal and Goyal 2013), given as Algorithm 1. In CTS, the reward r_k(t) for choosing arm k at time t follows a parametric likelihood function Pr(r(t) | μ̃). Following Agrawal and Goyal (2013), the posterior distribution at time t + 1, Pr(μ̃ | r(t)) ∝ Pr(r(t) | μ̃) Pr(μ̃), is given by a multivariate Gaussian distribution N(μ̂_k(t + 1), v^2 B_k(t + 1)^{-1}), where B_k(t) = I_d + Σ_{τ=1}^{t−1} c(τ) c(τ)^⊤, d is the size of the context vectors c, v = R √((24/z) · d · ln(1/γ)), we have constants R > 0, z ∈ [0, 1], γ ∈ [0, 1], and μ̂_k(t) = B_k(t)^{-1} ( Σ_{τ=1}^{t−1} c(τ) r_k(τ) ).

Algorithm 1 Contextual Thompson Sampling Algorithm
1: Initialize: B_k = I_d, μ̂_k = 0_d, f_k = 0_d for k ∈ [K].
2: Foreach t = 0, 1, 2, . . . , (T − 1) do
3:   Sample μ̃_k(t) from N(μ̂_k, v^2 B_k^{-1}).
4:   Play arm k_t = argmax_{k ∈ [K]} c(t)^⊤ μ̃_k(t)
5:   Observe r_{k_t}(t)
6:   B_{k_t} = B_{k_t} + c(t) c(t)^⊤, f_{k_t} = f_{k_t} + c(t) r_{k_t}(t), μ̂_{k_t} = B_{k_t}^{-1} f_{k_t}
7: End

Every step t consists of generating a d-dimensional sample μ̃_k(t) from N(μ̂_k(t), v^2 B_k(t)^{-1}) for each arm. We then decide which arm k to pull by solving for argmax_{k ∈ [K]} c(t)^⊤ μ̃_k(t). This means that at each time step we are selecting the arm that we expect to maximize the observed reward given a sample of our current beliefs over the distribution of rewards, c(t)^⊤ μ̃_k(t). We then observe the actual reward of pulling arm k, r_k(t), and update our beliefs.
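A minimal sketch of Algorithm 1 is given below. The per-arm Gaussian-posterior bookkeeping follows line 6; the exploration scale v is exposed as a plain hyperparameter, and the class and method names are illustrative rather than taken from the paper.

```python
import numpy as np

class ContextualThompsonSampling:
    # Sketch of Algorithm 1: one Gaussian posterior per arm over the weight vector mu_k.

    def __init__(self, n_arms, d, v=0.25):
        self.v = v
        self.B = [np.eye(d) for _ in range(n_arms)]       # B_k = I_d
        self.f = [np.zeros(d) for _ in range(n_arms)]     # f_k = 0_d
        self.mu_hat = [np.zeros(d) for _ in range(n_arms)]

    def select_arm(self, c):
        # Sample mu_tilde_k ~ N(mu_hat_k, v^2 B_k^{-1}) and play the argmax arm (lines 3-4).
        scores = []
        for B, mu in zip(self.B, self.mu_hat):
            mu_tilde = np.random.multivariate_normal(mu, self.v ** 2 * np.linalg.inv(B))
            scores.append(c @ mu_tilde)
        return int(np.argmax(scores))

    def update(self, k, c, reward):
        # Posterior update for the played arm k (line 6).
        self.B[k] += np.outer(c, c)
        self.f[k] += reward * c
        self.mu_hat[k] = np.linalg.solve(self.B[k], self.f[k])
```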
3.4 Problem Setting

In our setting, the agent acts in a multi-objective Markov decision process (MOMDP): instead of the usual scalar reward function R(s, a, s'), a reward vector R⃗(s, a, s') is present. The vector R⃗(s, a, s') consists of l dimensions or components representing the different objectives, i.e., R⃗(s, a, s') = (R_1(s, a, s'), . . . , R_l(s, a, s')). However, not all components of the reward vector are observed in our setting. There is an objective v ∈ [l] that is hidden, and the agent is only allowed to observe expert demonstrations to learn this objective. These demonstrations are given in the form of trajectories {Tr^(1), Tr^(2), . . . , Tr^(m)}. To summarize, for some objectives the agent has rewards observed from interaction with the environment, and for some objectives the agent has only expert demonstrations. The aim is still the same as in single-objective reinforcement learning: to maximize Σ_{t=0}^∞ γ^t R_i(s_t, a_t, s_{t+1}) for each i ∈ [l].
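The learning problem can be summarized by the two kinds of input the agent receives. The small container below is only an illustrative way of organizing them; the names are not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Transition = Tuple[object, object, object]  # (s, a, s_next)

@dataclass
class MOMDPLearningProblem:
    # Reward functions R_i(s, a, s') for the objectives observable through interaction.
    observed_rewards: List[Callable[[object, object, object], float]]
    # Expert trajectories that implicitly encode the hidden objective v.
    demonstrations: List[List[Transition]]
```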
4 Approach

4.1 Domain

We demonstrate the applicability of our approach using the classic game of Pac-Man. The layout of Pac-Man we use for this is given in Figure 1, and the following are the rules used for the environment (adopted from Berkeley AI Pac-Man2). The goal of the agent (which controls Pac-Man's motion) is to eat all the dots in the maze, known as Pac-Dots, as soon as possible while simultaneously avoiding collision with ghosts. On eating a Pac-Dot, the agent obtains a reward of +10. On successfully winning the game (which happens on eating all the Pac-Dots), the agent obtains a reward of +500. In the meantime, the ghosts in the game roam the maze trying to kill Pac-Man. On collision with a ghost, Pac-Man loses the game and gets a reward of −500. The game also has two special dots called capsules or Power Pellets in the corners of the maze, which, on consumption, give Pac-Man the temporary ability of "eating" ghosts. During this phase, the ghosts are in a "scared" state for 40 frames and move at half their speed. On eating a ghost, the agent gets a reward of +200, and the ghost returns to the center box and reverts to its normal "unscared" state. Finally, there is a constant time-penalty of −1 for every step taken.

[Figure 1: Layout of Pac-Man]

For the sake of demonstration of our approach, we define not eating ghosts as the desirable constraint in the game of Pac-Man. However, recall that this constraint is not given explicitly to the agent, but only through examples. To play optimally in the original game one should eat ghosts to earn bonus points, but doing so is being demonstrated as undesirable. Hence, the agent has to combine the goal of collecting the most points with not eating ghosts if possible.

2 http://ai.berkeley.edu/project_overview.html
4.2 Overall Approach

The overall approach we follow is depicted in Figure 2. It has three main components. The first is the inverse reinforcement learning component to learn the desirable constraints (depicted in green in Figure 2). We apply inverse reinforcement learning to the demonstrations depicting desirable behavior, to learn the underlying constraint rewards being optimized by the demonstrations. We then apply reinforcement learning on these learned rewards to learn a strongly constraint-satisfying policy πC.

Next, we augment this with a pure reinforcement learning component (depicted in red in Figure 2). For this, we directly apply reinforcement learning to the original environment rewards (like Pac-Man's unmodified game) to learn a domain reward maximizing policy πR. To recall, the reason we have this second component is that the inverse reinforcement learning component may not be able to pick up the original environment rewards very well, since the demonstrations were intended mainly to depict desirable behavior. Further, since these demonstrations are given by humans, they are prone to error, amplifying this issue. Hence, the constraint-obeying policy πC is likely to exhibit strong constraint-satisfying behavior, but may not be optimal in terms of maximizing environment rewards. Augmenting with the reward maximizing policy πR will help the system in this regard.

So now we have two policies, the constraint-obeying policy πC and the reward-maximizing policy πR. To combine these two, we use the third component, the orchestrator (depicted in blue in Figure 2). This is a contextual bandit algorithm that orchestrates the two policies, picking one of them to play at each point of time. The context is the state of the environment (the state of the Pac-Man game); the bandit decides which arm (policy) to play at the corresponding point of time.

[Figure 2: Overview of our system. At each time step the Orchestrator selects between two policies, πC and πR, depending on the observations from the Environment. The two policies are learned before engaging with the environment. πC is obtained using IRL on the demonstrations to learn a reward function that captures the particular constraints demonstrated. The second, πR, is obtained by the agent through RL on the environment directly.]

4.3 Alternative Approaches

Observe that in our approach, we combine or "aggregate" the two objectives (environment rewards and desired constraints) at the policy stage. Alternative approaches to doing this are combining the two objectives at the reward stage or the demonstrations stage itself:

• Aggregation at reward phase. As before, we can perform inverse reinforcement learning to learn the underlying rewards capturing the desired constraints. Now, instead of learning a policy for each of the two reward functions (environment rewards and constraint rewards) followed by aggregating them, we could just combine the reward functions themselves. We could then learn a policy on this "aggregated" reward to perform well on both objectives, environment reward and favorable constraints. (This captures the intuitive idea of "incorporating the constraints into the environment rewards" if we were explicitly given the penalty of violating constraints; a sketch is given at the end of this subsection.)

• Aggregation at data phase. Moving another step backward, we could aggregate the two objectives at the data phase. This could be performed as follows. We perform pure reinforcement learning as in the original approach given in Figure 2 (depicted in red). Once we have our reward maximizing policy πR, we use it to generate numerous reward-maximizing demonstrations. Then, we combine these environment reward trajectories with the original constrained demonstrations, aggregating the two objectives in the process. Once we have the combined data, we can perform inverse reinforcement learning to learn the appropriate rewards, followed by reinforcement learning to learn the corresponding policy.

Aggregating at the policy phase is where we go all the way to the end of the pipeline, learning a policy for each of the objectives and then aggregating them. This is the approach we follow, as mentioned in Section 4.2. Note that we have a parameter λ (described in more detail in Section 5.3) that trades off environmental rewards and rewards capturing constraints. A similar parameter can be used by the reward aggregation and data aggregation approaches, to decide how to weigh the two objectives while performing the corresponding aggregation.

The question now is: which of these aggregation procedures is the most useful? The reason we use aggregation at the policy stage is to gain interpretability. Using an orchestrator to pick a policy at each point of time helps us identify which policy is being played at each point of time and also the reason it is being chosen (when the orchestrator is interpretable, as it is in our case). More details on this are mentioned in Section 6.
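The reward-phase aggregation mentioned in the list above can be made concrete with a small sketch: a single scalar reward built from the environment reward and the IRL-learned constraint reward, using a trade-off weight analogous to the λ of Section 5.3. The function names are illustrative assumptions, not code from the paper.

```python
def aggregate_rewards(env_reward, constraint_reward, lam):
    # Reward-phase aggregation: one scalar reward trading off the two objectives.
    # env_reward and constraint_reward both map (s, a, s_next) to a float;
    # lam plays the same role as the lambda used by the orchestrator.
    def combined(s, a, s_next):
        return lam * constraint_reward(s, a, s_next) + (1.0 - lam) * env_reward(s, a, s_next)
    return combined
```

A single policy trained by ordinary RL on such a combined reward would optimize both objectives, but would lose the per-step transparency that the orchestrator provides.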
5 Concretizing Our Approach

Here we describe the exact algorithms we use for each of the components of our approach.

5.1 Details of the Pure RL

For the reinforcement learning component, we use Q-learning with linear function approximation as described in Section 3.1. For Pac-Man, some of the features we use for an ⟨s, a⟩ pair (for the ψ(s, a) function) are: "whether food will be eaten", "distance of the next closest food", "whether a scared (unscared) ghost collision is possible" and "distance of the closest scared (unscared) ghost".

For the layout of Pac-Man we use (shown in Figure 1), an upper bound on the maximum score achievable in the game is 2170. This is because there are 97 Pac-Dots, each ghost can be eaten at most twice (because of the two capsules in the layout), Pac-Man can win the game only once, and winning requires more than 100 steps in the environment. On playing a total of 100 games, our reinforcement learning algorithm (the reward maximizing policy πR) achieves an average game score of 1675.86, and the maximum score achieved is 2144. We mention this here so that the results in Section 6 can be seen in the appropriate light.
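The following sketch shows one way the listed features could be assembled into the ψ(s, a) vector used by the linear Q-function. The game-query helpers (distances, collision checks) are hypothetical stand-ins for Pac-Man-specific code and are not taken from the paper.

```python
import numpy as np

def psi(state, action, game):
    # Hand-crafted state-action features for linear Q-learning (illustrative only).
    next_pos = game.successor_position(state, action)
    return np.array([
        1.0,                                                    # bias term
        float(game.food_at(next_pos)),                          # whether food will be eaten
        game.distance_to_closest_food(next_pos),                # distance of the next closest food
        float(game.scared_ghost_collision_possible(next_pos)),
        float(game.unscared_ghost_collision_possible(next_pos)),
        game.distance_to_closest_scared_ghost(next_pos),        # distance of the closest scared ghost
        game.distance_to_closest_unscared_ghost(next_pos),      # distance of the closest unscared ghost
    ])
```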
5.2 Details of the IRL

For inverse reinforcement learning, we use the linear IRL algorithm as described in Section 3.2. For Pac-Man, observe that the original reward function R(s, a, s') depends only on the following factors: "number of Pac-Dots eaten in this step (s, a, s')", "whether Pac-Man has won in this step", "number of ghosts eaten in this step" and "whether Pac-Man has lost in this step". For our IRL algorithm, we use exactly these as the features φ(s, a, s'). As a sanity check, when IRL is run on environment-reward-optimal trajectories (generated from our policy πR), we recover something very similar to the original reward function R. In particular, the learned weights of the reward features are

    (1/1000) [+2.44, +138.80, +282.49, −949.17],

which when scaled is almost equivalent to the true weights [+10, +500, +200, −500] in terms of their optimal policies. The number of trajectories used for this is 100.

Ideally, we would prefer to have the constrained demonstrations given to us by humans. But for our domain of Pac-Man, we generate them synthetically as follows. We learn a policy πC* by training it on the game with the original reward function R augmented with a very high negative reward (−1000) for eating ghosts. This causes πC* to play well in the game while avoiding eating ghosts as much as possible.3 Now, to emulate erroneous human behavior, we use πC* with an error probability of 3%. That is, at every time step, with 3% probability we pick a completely random action, and otherwise follow πC*. This gives us our constrained demonstrations, on which we perform inverse reinforcement learning to learn the rewards capturing the constraints. The learned weights of the reward function are

    (1/1000) [+2.84, +55.07, −970.59, −234.34],

and it is evident that it has learned that eating ghosts strongly violates the favorable constraints. The number of demonstrations used for this is 100. We scale these weights to have a similar L1 norm as the original reward weights [+10, +500, +200, −500], and denote the corresponding reward function by R̂_C.

Finally, running reinforcement learning on these rewards R̂_C gives us our constraint policy πC. On playing a total of 100 games, πC achieves an average game score of 1268.52 and eats just 0.03 ghosts on average. Note that, when eating ghosts is prohibited in the domain, an upper bound on the maximum score achievable is 1370.

3 We do this only for generating demonstrations. In real domains, we would not have access to the exact constraints that we want to be satisfied, and hence a policy like πC* cannot be learned; learning from human demonstrations would then be essential.
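The rescaling step mentioned above (matching the L1 norm of the learned constraint weights to that of the original reward weights) is simple; a minimal sketch follows, with the numbers taken from the text (the 1/1000 factor cancels in the ratio, so the unscaled weights can be used directly).

```python
import numpy as np

def rescale_to_l1(learned_w, reference_w):
    # Scale the IRL-learned weights so their L1 norm matches the reference weights.
    scale = np.abs(reference_w).sum() / np.abs(learned_w).sum()
    return scale * learned_w

constraint_w = rescale_to_l1(np.array([+2.84, +55.07, -970.59, -234.34]),
                             np.array([+10.0, +500.0, +200.0, -500.0]))
```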
5.3 Orchestration with Contextual Bandits

We use contextual bandits to pick one of the policies (πR and πC) to play at each point of time. These two policies act as the two arms of the bandit, and we use a modified CTS algorithm to train the bandit. The context of the bandit is given by features of the current state (for which we want to decide which policy to choose), i.e., c(t) = Υ(s_t) ∈ R^d. For the game of Pac-Man, the features of the state we use for the context c(t) are: (i) a constant 1 to represent the bias term, and (ii) the distance of Pac-Man from the closest scared ghost in s_t. One could use a more sophisticated context with many more features, but we use this restricted context to demonstrate a very interesting behavior (shown in Section 6).

The exact algorithm used to train the orchestrator is given in Algorithm 2. Apart from the fact that arms are policies (instead of atomic actions), the main difference from the CTS algorithm is the way rewards are fed into the bandit. For simplicity, we call the constraint policy πC arm 0 and the reward policy πR arm 1.

Algorithm 2 Orchestrator Based Algorithm
1: Initialize: B_k = I_d, μ̂_k = 0_d, f_k = 0_d for k ∈ {0, 1}.
2: Observe start state s_0.
3: Foreach t = 0, 1, 2, . . . , (T − 1) do
4:   Sample μ̃_k(t) from N(μ̂_k, v^2 B_k^{-1}).
5:   Pick arm k_t = argmax_{k ∈ {0,1}} Υ(s_t)^⊤ μ̃_k(t).
6:   Play corresponding action a_t = π_{k_t}(s_t).
7:   Observe rewards r^C_{a_t}(t) and r^R_{a_t}(t), and the next state s_{t+1}.
8:   Define r_{k_t}(t) = λ ( r^C_{a_t}(t) + γ V^C(s_{t+1}) ) + (1 − λ) ( r^R_{a_t}(t) + γ V^R(s_{t+1}) ).
9:   Update B_{k_t} = B_{k_t} + Υ(s_t) Υ(s_t)^⊤, f_{k_t} = f_{k_t} + Υ(s_t) r_{k_t}(t), μ̂_{k_t} = B_{k_t}^{-1} f_{k_t}.
10: End

We now go over Algorithm 2. First, all the parameters are initialized as in the CTS algorithm (Line 1). For each time-step in the training phase (Line 3), we do the following. Pick an arm k_t according to the Thompson Sampling algorithm and the context Υ(s_t) (Lines 4 and 5). Play the action according to the chosen policy π_{k_t} (Line 6). This takes us to the next state s_{t+1}. We also observe two rewards (Line 7): (i) the original reward in the environment, r^R_{a_t}(t) = R(s_t, a_t, s_{t+1}), and (ii) the constraint reward according to the rewards learnt by inverse reinforcement learning, i.e., r^C_{a_t}(t) = R̂_C(s_t, a_t, s_{t+1}). r^C_{a_t}(t) can intuitively be seen as the predicted reward (or penalty) for any constraint satisfaction (or violation) in this step.

To train the contextual bandit to choose arms that perform well on both metrics (environment rewards and constraints), we feed it a reward that is a linear combination of r^R_{a_t}(t) and r^C_{a_t}(t) (Line 8). Another important point to note is that r^R_{a_t}(t) and r^C_{a_t}(t) are immediate rewards achieved on taking action a_t from s_t; they do not capture long-term effects of this action. In particular, it is important to also look at the "value" of the next state s_{t+1} reached, since we are in the sequential decision making setting. Precisely for this reason, we also incorporate the value functions of the next state s_{t+1} according to both the reward maximizing component and the constraint component (which encapsulate the long-term rewards and constraint satisfaction possible from s_{t+1}). This gives exactly Line 8, where V^C is the value function according to the constraint policy πC, and V^R is the value function according to the reward maximizing policy πR. In this equation, λ is a hyperparameter chosen by a user to decide how much to trade off environment rewards for constraint satisfaction. For example, when λ is set to 0, the orchestrator would always play the reward policy πR, while for λ = 1, the orchestrator would always play the constraint policy πC. For any value of λ in between, the orchestrator is expected to pick policies at each point of time that would perform well on both metrics (weighted according to λ). Finally, for the desired reward r_{k_t}(t) and the context Υ(s_t), the parameters of the bandit are updated according to the CTS algorithm (Line 9).
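A minimal sketch of one training step of Algorithm 2 is shown below, reusing the ContextualThompsonSampling class sketched in Section 3.3. The environment interface, the value functions V_C and V_R, and the context map upsilon are assumptions made for the example rather than code from the paper.

```python
def orchestrator_step(bandit, policies, env, s, upsilon,
                      env_reward, constraint_reward, V_R, V_C,
                      lam=0.4, gamma=0.99):
    # One iteration of Algorithm 2: choose a policy, act, and feed back the blended reward.
    c = upsilon(s)                                  # context features of the current state
    k = bandit.select_arm(c)                        # arm 0 -> pi_C, arm 1 -> pi_R
    a = policies[k](s)                              # action chosen by the selected policy
    s_next, done = env.step(s, a)                   # assumed environment interface

    r_C = constraint_reward(s, a, s_next)           # from the IRL-learned rewards R_hat_C
    r_R = env_reward(s, a, s_next)                  # original environment reward
    blended = (lam * (r_C + gamma * V_C(s_next))
               + (1.0 - lam) * (r_R + gamma * V_R(s_next)))   # line 8 of Algorithm 2

    bandit.update(k, c, blended)                    # CTS posterior update (line 9)
    return s_next, done
```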
6 Evaluation and Test

We test our approach on the Pac-Man domain given in Figure 1, and measure its performance on two metrics: (i) the total score achieved in the game (the environment rewards) and (ii) the number of ghosts eaten (the constraint violation). We also vary λ, and observe how these metrics are traded off against each other. For each value of λ, the orchestrator is trained for 100 games. The results are shown in Figure 3. Each point in the graph is averaged over 100 test games.

[Figure 3: Both performance metrics as λ is varied. The red curve depicts the average game score achieved, and the blue curve depicts the average number of ghosts eaten.]

The graph shows a very interesting pattern. When λ is at most 0.215, the agent eats a lot of ghosts, but when it is above 0.22, it eats almost no ghosts. In other words, there is a value λo ∈ [0.215, 0.22] which behaves as a tipping point, across which there is a drastic change in behavior. Beyond the threshold, the agent learns that eating ghosts is not worth the score it is getting, and so it avoids eating them as much as possible. On the other hand, when λ is smaller than this threshold, it learns the reverse and eats as many ghosts as possible.

Policy-switching. As mentioned before, one of the most important properties of our approach is interpretability: we know exactly which policy is being played at each point of time. For moderate values of λ, the orchestrator learns a very interesting policy-switching technique: whenever at least one of the ghosts in the domain is scared, it plays πC, but if no ghosts are scared, it plays πR. In other words, it starts off the game by playing πR until a capsule is eaten. As soon as the first capsule is eaten, it switches to πC and plays it until the scared timer runs out. Then it switches back to πR until another capsule is eaten, and so on.4 It has learned a very intuitive behavior: when there is no scared ghost in the domain, there is no possibility of violating constraints, and hence the agent is as greedy as possible (i.e., it plays πR); but when there are scared ghosts, it is better to be safe (i.e., play πC).

4 A video of our agent demonstrating this behavior is uploaded in the Supplementary Material. The agent playing the game in this video was trained with λ = 0.4.

7 Discussion

In this paper, we have considered the problem of autonomous agents learning policies that are constrained by implicitly-specified norms and values while still optimizing their policies with respect to environmental rewards. We have taken an approach that combines IRL to determine constraint-satisfying policies from demonstrations, RL to determine reward-maximizing policies, and a contextual bandit to orchestrate between these policies in a transparent way. This proposed architecture and approach for the problem is novel. It also requires a novel technical contribution in the contextual bandit algorithm because the arms are policies rather than atomic actions, thereby requiring rewards to account for sequential decision making. We have demonstrated the algorithm on the Pac-Man video game and found it to perform interesting switching behavior among policies.

We feel that the contribution herein is only the starting point for research in this direction. We have identified several avenues for future research, especially with regard to IRL. We can pursue deep IRL to learn constraints without hand-crafted features, develop an IRL that is robust to noise in the demonstrations, and research IRL algorithms to learn from just one or two demonstrations (perhaps in concert with knowledge and reasoning). In real-world settings, demonstrations will likely be given by different users with different versions of abiding behavior; we would like to exploit the partition of the set of traces by user to improve the policy or policies learned via IRL. Additionally, the current orchestrator selects a single policy at each time, but more sophisticated policy aggregation techniques for combining or mixing policies are possible. Lastly, it would be interesting to investigate whether the policy aggregation rule (λ in the current proposal) can be learned from demonstrations.

Acknowledgments

We would like to thank Gerald Tesauro and Aleksandra Mojsilovic for their helpful feedback and comments on this project.
References

Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML).

Agrawal, S., and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. In ICML (3), 127–135.

Allen, C.; Smit, I.; and Wallach, W. 2005. Artificial morality: Top-down, bottom-up, and hybrid approaches. Ethics and Information Technology 7(3):149–155.

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Anderson, M., and Anderson, S. L. 2011. Machine Ethics. Cambridge University Press.

Arnold, T.; Kasenberg, D.; and Scheutz, M. 2017. Value alignment or misalignment - what will keep systems accountable? In AI, Ethics, and Society, Papers from the 2017 AAAI Workshop.

Balakrishnan, A.; Bouneffouf, D.; Mattei, N.; and Rossi, F. 2018a. Using contextual bandits with behavioral constraints for constrained online movie recommendation. In Proc. IJCAI.

Balakrishnan, A.; Bouneffouf, D.; Mattei, N.; and Rossi, F. 2018b. Incorporating behavioral constraints in online AI systems. arXiv preprint arXiv:1809.05720.

Bertsekas, D., and Tsitsiklis, J. 1996. Neuro-Dynamic Programming. Athena Scientific.

Bonnefon, J.-F.; Shariff, A.; and Rahwan, I. 2016. The social dilemma of autonomous vehicles. Science 352(6293):1573–1576.

Langford, J., and Zhang, T. 2008. The epoch-greedy algorithm for contextual multi-armed bandits. In Proc. 21st NIPS.

Laroche, R., and Feraud, R. 2017. Reinforcement learning algorithm selection. In Proceedings of the 6th International Conference on Learning Representations (ICLR).

Leike, J.; Martic, M.; Krakovna, V.; Ortega, P.; Everitt, T.; Lefrancq, A.; Orseau, L.; and Legg, S. 2017. AI safety gridworlds. arXiv preprint arXiv:1711.09883.

Liu, G.; Schulte, O.; Zhu, W.; and Li, Q. 2018. Toward interpretable deep reinforcement learning with linear model U-trees. CoRR abs/1807.05887.

Loreggia, A.; Mattei, N.; Rossi, F.; and Venable, K. B. 2018a. Preferences and ethical principles in decision making. In Proceedings of the 1st AAAI/ACM Conference on AI, Ethics, and Society (AIES).

Loreggia, A.; Mattei, N.; Rossi, F.; and Venable, K. B. 2018b. Value alignment via tractable preference distance. In Yampolskiy, R. V., ed., Artificial Intelligence Safety and Security. CRC Press. Chapter 16.

Luss, R., and Petrik, M. 2016. Interpretable policies for dynamic product recommendations. In Proc. Conf. Uncertainty Artif. Intell., 74.

Ng, A. Y., and Russell, S. J. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, 663–670. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Noothigattu, R.; Gaikwad, S.; Awad, E.; Dsouza, S.; Rahwan, I.; Ravikumar, P.; and Procaccia, A. D. 2017. A voting-based system for ethical decision making. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI).

Russell, S.; Dewey, D.; and Tegmark, M. 2015. Research priorities for robust and beneficial artificial intelligence. AI Magazine 36(4):105–114.

Sen, A. 1974. Choice, ordering and morality. In Körner, S., ed., Practical Reason. Oxford: Blackwell.

Simonite, T. 2018. When bots teach themselves to cheat. Wired.

Sutton, R. S., and Barto, A. G. 1998. Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1st edition.

Theodorou, A.; Wortham, R. H.; and Bryson, J. J. 2016. Why is my robot behaving like that? Designing transparency for real time inspection of autonomous robots. In AISB Workshop on Principles of Robotics. University of Bath.

Ventura, D., and Gates, D. 2018. Ethics as aesthetic: A computational creativity approach to ethical behavior. In Proc. Int. Conf. Comput. Creativity, 185–191.

Verma, A.; Murali, V.; Singh, R.; Kohli, P.; and Chaudhuri, S. 2018. Programmatically interpretable reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 5052–5061.

Wallach, W., and Allen, C. 2008. Moral Machines: Teaching Robots Right from Wrong. Oxford University Press.

Wu, Y.-H., and Lin, S.-D. 2017. A low-cost ethics shaping approach for designing reinforcement learning agents. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI).

Yu, H.; Shen, Z.; Miao, C.; Leung, C.; Lesser, V. R.; and Yang, Q. 2018. Building ethics into artificial intelligence. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 5527–5533.