Mitigating Negative Side Effects via Environment Shaping

Page created by Alvin Carpenter
 
CONTINUE READING
Mitigating Negative Side Effects via Environment Shaping
Mitigating Negative Side Effects via Environment Shaping
                                                                            Sandhya Saisubramanian and Shlomo Zilberstein
                                                                                     College of Information and Computer Sciences,
                                                                                          University of Massachusetts Amherst
arXiv:2102.07017v1 [cs.AI] 13 Feb 2021

                                                                     Abstract
                                           Agents operating in unstructured environments often produce
                                           negative side effects (NSE), which are difficult to identify at
                                           design time. While the agent can learn to mitigate the side
                                           effects from human feedback, such feedback is often expen-
                                           sive and the rate of learning is sensitive to the agent’s state
                                           representation. We examine how humans can assist an agent,
                                           beyond providing feedback, and exploit their broader scope
                                           of knowledge to mitigate the impacts of NSE. We formulate
                                           this problem as a human-agent team with decoupled objec-
                                           tives. The agent optimizes its assigned task, during which its         (a) Initial Environment      (b) Protective sheet over rug
                                           actions may produce NSE. The human shapes the environ-
                                           ment through minor reconfiguration actions so as to mitigate
                                           the impacts of the agent’s side effects, without affecting the     Figure 1: Example configurations for boxpushing domain:
                                           agent’s ability to complete its assigned task. We present an al-   (a) denotes the initial setting in which the actor dirties the
                                           gorithm to solve this problem and analyze its theoretical prop-    rug when pushing the box over it; (b) denotes a modification
                                           erties. Through experiments with human subjects, we assess         that avoids the NSE, without affecting the actor’s policy.
                                           the willingness of users to perform minor environment modi-
                                           fications to mitigate the impacts of NSE. Empirical evaluation
                                           of our approach shows that the proposed framework can suc-
                                           cessfully mitigate NSE, without affecting the agent’s ability
                                                                                                              edge about NSE, and therefore they do not have the ability
                                           to complete its assigned task.                                     to minimize NSE. How can we leverage human assistance
                                                                                                              and the broader scope of human knowledge to mitigate neg-
                                                                                                              ative side effects, when agents are unaware of the side effects
                                                                 Introduction                                 and the associated penalties?
                                         Deployed AI agents often require complex design choices                 A common solution approach in the existing literature
                                         to support safe operation in the open world. During design           is to update the agent’s model and policy by learning
                                         and initial testing, the system designer typically ensures that      about NSE through feedbacks (Hadfield-Menell et al. 2017;
                                         the agent’s model includes all the necessary details relevant        Zhang, Durfee, and Singh 2018; Saisubramanian, Kamar,
                                         to its assigned task. Inherently, many other details of the en-      and Zilberstein 2020). The knowledge about the NSE can
                                         vironment that are unrelated to this task may be ignored.            be encoded by constraints (Zhang, Durfee, and Singh 2018)
                                         Due to this model incompleteness, the deployed agent’s ac-           or by updating the reward function (Hadfield-Menell et al.
                                         tions may create negative side effects (NSE) (Amodei et al.          2017; Saisubramanian, Kamar, and Zilberstein 2020). These
                                         2016; Saisubramanian, Zilberstein, and Kamar 2020). For              approaches have three main drawbacks. First, the agent’s
                                         example, consider an agent that needs to push a box to a             state representation is assumed to have all the necessary fea-
                                         goal location as quickly as possible (Figure 1(a)). Its model        tures to learn to avoid NSE (Saisubramanian, Kamar, and
                                         includes accurate information essential to optimizing its as-        Zilberstein 2020; Zhang, Durfee, and Singh 2018). In prac-
                                         signed task, such as reward for pushing the box. But details         tice, the model may include only the features related to the
                                         such as the impact of pushing the box over a rug may not be          agent’s task. This may affect the agent’s learning and adapta-
                                         included in the model if the issue is overlooked at system de-       tion since the NSE could potentially be non-Markovian with
                                         sign. Consequently, the agent may push a box over the rug,           respect to this limited state representation. Second, exten-
                                         dirtying it as a side effect. Mitigating such NSE is critical to     sive model revisions will likely require suspension of op-
                                         improve trust in deployed AI systems.                                eration and exhaustive evaluation before the system can be
                                            It is practically impossible to identify all such NSE dur-        redeployed. This is inefficient when the NSE is not safety-
                                         ing system design since agents are deployed in varied set-           critical, especially for complex systems that have been care-
                                         tings. Deployed agents often do not have any prior knowl-            fully designed and tested for critical safety aspects. Third,
Mitigating Negative Side Effects via Environment Shaping
updating the agent’s policy may not mitigate NSE that are                      Actor-Designer Framework
unavoidable. For example, in the boxpushing domain, NSE            The problem of mitigating NSE is formulated as a collabo-
are unavoidable if the entire floor is covered with a rug.         rative actor-designer framework with decoupled objectives.
   The key insight of this paper is that agents often operate in   The problem setting is as follows. The actor agent operates
environments that are configurable, which can be leveraged         based on a Markov decision process (MDP) Ma in an envi-
to mitigate NSE. We propose environment shaping for de-            ronment that is configurable and described by a configura-
ployed agents: a process of applying modest modifications          tion file E, such as a map. The agent’s model Ma includes
to the current environment to make it more agent-friendly          the necessary details relevant to its assigned task, which is
and minimize the occurrence of NSE. Real world examples            its primary objective oP . A factored state representation is
show that modest modifications can substantially improve           assumed. The actor computes a policy π that optimizes oP .
agents’ performance. For example, Amazon improved the              Executing π may lead to NSE, unknown to the actor. The
performance of its warehouse robots by optimizing the ware-        environment designer, typically the user, measures the im-
house layouts, covering skylights to reduce the glare, and         pact of NSE associated with the actor’s π and shapes the en-
repositioning the air conditioning units to blow out to the        vironment, if necessary. The actor and the environment de-
side (Simon 2019). Recent studies show the need for infras-        signer share the configuration file of the environment, which
tructure modifications such as traffic signs with barcodes and     is updated by the environment designer to reflect the modi-
redesigned roads to improve the reliability of autonomous          fications. Optimizing oP is prioritized over minimizing the
vehicles (Agashe and Chapman 2019; Hendrickson, Biehler,           NSE. Hence, shaping is performed in response to the actor’s
and Mashayekh 2014; McFarland 2020). Simple modifica-              policy. In the rest of the paper, we refer to the environment
tions to the environment have also accelerated agent learn-        designer simply as designer—not to be confused with the
ing (Randløv 2000) and goal recognition (Keren, Gal, and           designer of the agent itself.
Karpas 2014). These examples show that environment shap-              Each modification is a sequence of design actions. An ex-
ing can improve an agent’s performance with respect to its         ample is {move(table, l1 , l2 ), remove(rug)}, which moves
assigned task. We target settings in which environment shap-       the table from location l1 to l2 and removes the rug in the
ing can reduce NSE, without significantly degrading the per-       current setting. We consider the set of modifications to be
formance of the agent in completing its assigned task.             finite since the environment is generally optimized for the
   Our formulation consists of an actor and a designer. The        user and the agent’s primary task (Shah et al. 2019) and the
actor agent computes a policy that optimizes the completion        user may not be willing to drastically modify it. Addition-
of its assigned task and has no knowledge about NSE of its         ally, the set of modifications for an environment is included
actions. The designer agent, typically a human, shapes the         in the problem specification since it is typically controlled
environment through minor modifications so as to mitigate          by the user and rooted in the NSE they want to mitigate.
the NSE of the actor’s actions, without affecting the actor’s         We make the following assumptions about the nature of
ability to complete its task. The environment is described us-     NSE and the modifications: (1) NSE are undesirable but not
ing a configuration file, such as a map, which is shared be-       safety-critical, and its occurrence does not affect the actor’s
tween the actor and the designer. Given an environment con-        ability to complete its task; (2) the start and goal conditions
figuration, the actor computes and shares sample trajectories      of the actor are fixed and cannot be altered, so that the mod-
of its optimal policy with the designer. The designer mea-         ifications do not alter the agent’s task; and (3) modifications
sures the NSE associated with the actor’s policy and mod-          are applied tentatively for evaluation purposes and the en-
ifies the environment by updating the configuration file, if       vironment is reset if the reconfiguration affects the actor’s
necessary. We target settings in which the completion of the       ability to complete its task or the actor’s policy in the modi-
actor’s task is prioritized over minimizing NSE. The sample        fied setting does not minimize the NSE.
trajectories and the environment configuration are the only        Definition 1. An actor-designer framework to miti-
knowledge shared between the actor and the designer.               gate negative side effects (AD-NSE) is defined by
   The advantages of this decoupled approach to minimize           hE0 , E, Ma , Md , δA , δD i with:
NSE are: (1) robustness—it can handle settings in which            • E0 denoting the initial environment configuration;
the actor is unaware of the NSE and does not have the nec-
                                                                   • E denoting a finite set of possible reconfigurations of E0 ;
essary state representation to effectively learn about NSE;
and (2) bounded-performance—by controlling the space of            • Ma = hS, A, T, R, s0 , sG i is the actor’s MDP with a dis-
valid modifications, bounded-performance of the actor with           crete and finite state space S, discrete actions A, transi-
respect to its assigned task can be guaranteed.                      tion function T , reward function R, start state s0 ∈ S,
                                                                     and a goal state sG ∈ S;
   Our primary contributions are: (1) introducing a novel
framework for environment shaping to mitigate the unde-            • Md = hΩ, Ψ, C, N i is the model of the designer with
sirable side effects of agent actions; (2) presenting an algo-       – Ω denoting a finite set of valid modifications that
rithm for shaping and analyzing its properties; (3) providing           are available for E0 , including ∅ to indicate that no
the results of a human subjects study to assess the attitudes           changes are made;
of users towards negative side effects and their willingness         – Ψ : E × Ω → E determines the resulting environment
to shape the environment; and (4) empirical evaluation of the           configuration after applying a modification ω ∈ Ω to
approach on two domains.                                                the current configuration, and is denoted by Ψ(E, ω);
Mitigating Negative Side Effects via Environment Shaping
– C : E × Ω → R is a cost function that specifies the cost
    of applying a modification to an environment, denoted
    by C(E, ω), with C(E, ∅) = 0, ∀E ∈ E;
  – N = hπ, E, ζi is a model specifying the penalty for
    negative side effects in environment E ∈ E for the
    actor’s policy π, with ζ mapping states in π to E for             Figure 2: Actor-designer approach for mitigating NSE.
    severity estimation.
• δA ≥ 0 is the actor’s slack, denoting the maximum al-             Definition 4. A robust shaping results in an environment
  lowed deviation from the optimal value of oP in E0 , when         configuration E where all valid policies of the actor, given
  recomputing its policy in a modified environment; and             δA , produce NSE within δD . That is, NπE ≤ δD , ∀π :
• δD ≥ 0 indicates the designer’s NSE tolerance threshold.          VP∗ (s0 |E0 ) − VPπ (s0 |E) ≤ δA .

Decoupled objectives The actor’s objective is to compute            Shared knowledge The actor and the designer do not have
a policy π that maximizes its expected reward for oP , from         details about each other’s model. Since the objectives are de-
the start state s0 in the current environment configuration E:      coupled, knowledge about the exact model parameters of the
                                                                    other agent is not required. Instead, the two key details that
                    max VPπ (s0 |E),
                    π∈Π                                             are necessary to achieve collaboration are shared: configu-
                      X                                             ration file describing the environment and the actor’s policy.
 π
VP (s) = R(s, π(s)) +     T (s, π(s), s0 )VPπ (s0 ), ∀s ∈ S.
                                                                    The shared configuration file, such as the map of the envi-
                           s0 ∈S
                                                                    ronment, allows the designer to effectively communicate the
When the environment is modified, the actor recomputes its          modifications to the actor. The actor’s policy is required for
policy and may end up with a longer path to its goal. The           the designer to shape the environment. Compact represen-
slack δA denotes the maximum allowed deviation from the             tation of the actor’s policy is a practical challenge in large
optimal expected reward in E0 , VP∗ (s0 |E0 ), to facilitate min-   problems (Guestrin, Koller, and Parr 2001). Therefore, in-
imizing the NSE via shaping. A policy π 0 in a modified en-         stead of sharing the complete policy, the actor provides a
vironment E 0 satisfies the slack δA if                             finite number of demonstrations D = {τ1 , τ2 , . . . , τn } of
                                   0                                its optimal policy for oP in the current environment config-
              VP∗ (s0 |E0 ) − VPπ (s0 |E 0 ) ≤ δA .                 uration. Each demonstration τ is a trajectory from start to
   Given the actor’s π and environment configuration E, the         goal, following π. Using D, the designer can extract the ac-
designer first estimates its corresponding NSE and the asso-        tor’s policy by associating states with actions observed in D,
ciated penalty, denoted by NπE . The environment is modified        and measure its NSE. The designer does not have knowledge
if NπE > δD , where δD is the designer’s tolerance threshold        about the details of the agent’s model but is assumed to be
for NSE, assuming π is fixed. Given π and a set of valid mod-       aware of the agent’s objectives and its general capabilities
ifications Ω, the designer selects a modification that maxi-        of the agent. This makes it straightforward to construe the
mizes its utility:                                                  agent’s trajectories. Naturally, increasing the number and di-
                                                                    versity of sample trajectories that cover the actor’s reachable
                     max Uπ (ω)                                     states helps the designer improve the accuracy of estimating
                     ω∈Ω
                                                                    the actor’s NSE and select an appropriate modification. If
          Uπ (ω) = NπE − NπΨ(E,ω) −C(E, ω).
                                 
                                                             (1)    D does not starve any reachable state, following π, then the
                  |      {z      }
                        reduction in NSE                            designer can extract the actor’s complete policy.
   Typically the designer is the user of the system and their                         Solution Approach
preferences and tolerance for NSE is captured by the util-
ity of a modification. The designer’s objective is to balance       Our solution approach for solving AD-NSE, described in Al-
the tradeoff between minimizing the NSE and the cost of             gorithm 1, proceeds in two phases: a planning phase and a
applying the modification. It is assumed that the cost of           shaping phase. In the planning phase (Line 4), the actor com-
the modification is measured in the same units as the NSE           putes its policy π for oP in the current environment config-
penalty. The cost of a modification may be amortized over           uration and generates a finite number of sample trajectories
episodes of the actor performing the task in the environment.       D, following π. The planning phase ends with disclosing D
The following properties are used to guarantee bounded-             to the designer. The shaping phase (Lines 7-15) begins with
performance of the actor when shaping the environment to            the designer associating states with actions observed in D
minimize the impacts of NSE.                                        to extract a policy π̂, and estimating the corresponding NSE
                                                                    penalty, denoted by Nπ̂E . If Nπ̂E > δD , the designer applies
Definition 2. Shaping is admissible if it results in an envi-       a utility-maximizing modification and updates the configu-
ronment configuration in which (1) the actor can complete           ration file. The planning and shaping phases alternate until
its assigned task, given δA and (2) the NSE does not in-            the NSE is within δD or until all possible modifications have
crease, relative to E0 .                                            been tested. Figure 2 illustrates this approach.
Definition 3. Shaping is proper if (1) it is admissible and            The actor returns D = {} when the modification affects
(2) reduces the actor’s NSE to be within δD .                       its ability to reach the goal, given δA . Modifications are ap-
Algorithm 1 Environment shaping to mitigate NSE                     Algorithm 2 Diverse modifications(b, Ω, Md , E0 )
Require: hE0 , E, Ma , Md , δA , δD i: AD-NSE                        1: Ω̄ ← Ω
Require: d: Number of sample trajectories                            2: if b ≥ |Ω| then return Ω
Require: b: Budget for evaluating modifications                      3: for each ω1 , ω2 ∈ Ω̄ do
 1: E ∗ ← E0                   . Initialize best configuration       4:      if similarity(ω1 , ω2 ) ≤  then
 2: E ← E0                                                           5:           Ω̄ = Ω̄ \ arg maxω (C(E0 , ω1 ), C(E0 , ω2 ))
 3: n ← ∞                                                            6:           if |Ω̄| = b then return Ω̄
 4: D ← Solve Ma for E0 and sample d trajectories                    7:      end if
 5: Ω̄ ← Diverse modifications(b, Ω, Md , E0 )                       8: end for
 6: while |Ω̄| > 0 do
 7:     π̂ ← Extract policy from D
 8:     if Nπ̂E < n then                                            ing to evaluate. When b < |Ω|, it is beneficial to evaluate b di-
 9:         n ← Nπ̂E                                                verse modifications. Algorithm 2 presents a greedy approach
10:         E∗ ← E                                                  to select b diverse modifications. If two modifications result
11:     end if                           Shaping phase              in similar environment configurations, the algorithm prunes
12:     if Nπ̂E ≤ δD then break                                     the modification with a higher cost. The similarity threshold
13:     ω ∗ ← arg maxω∈Ω̄ Uπ̂ (ω)                                   is controlled by . This process is repeated until b modifica-
14:     if ω ∗ = ∅ then break                                       tions are identified. Measures such as the Jaccard distance or
15:     Ω̄ ← Ω̄ \ ω ∗                                               embeddings may be used to estimate the similarity between
16:     D0 ← Solve Ma for Ψ(E0 , ω ∗ ) and sample d trajec-         two environment configurations.
    tories
17:     if D0 6= {} then                                                            Theoretical Properties
18:         D ← D0
19:         E ← Ψ(E0 , ω ∗ )                                        For the sake of clarity of the theoretical analysis, we assume
20:     end if                                                      that the designer shapes the environment based on the actor’s
21: end while                                                       exact π, without having to extract it from D, and with b =
22: return E ∗                                                      |Ω|. These assumptions allow us to examine the properties
                                                                    without approximation errors and sampling biases.
                                                                    Proposition 1. Algorithm 1 produces admissible shaping.

plied tentatively for evaluation and the environment is reset       Proof. Algorithm 1 ensures that a modification does not
if the actor returns D = {} or if the reconfiguration does not      negatively impact the actor’s ability to complete its task,
minimize the NSE. Therefore, all modifications are applied          given δA (Lines 16-19), and stores the configuration with
to E0 and it suffices to test each ω without replacement as         least NSE (Line 10). Therefore, Algorithm 1 is guaranteed
the actor always calculates the corresponding optimal pol-          to return an E ∗ with NSE equal to or lower than that of the
icy. When Ma is solved approximately, without bounded-              initial configuration E0 . Hence, shaping using Algorithm 1
guarantees, it is non-trivial to verify if δA is violated but the   is admissible.
designer selects a utility-maximizing ω corresponding to this
policy. Algorithm 1 terminates when at least one of the fol-           Shaping using Algorithm 1 is not guaranteed to be proper
lowing conditions is satisfied: (1) NSE impact is within δD ;       because (1) no ω ∈ Ω may be able to reduce the NSE below
(2) every ω has been tested; or (3) the utility-maximizing op-      δD ; or (2) the cost of such a modification may exceed the
tion is no modification, ω ∗ = ∅, indicating that cost exceeds      corresponding reduction in NSE. With each reconfiguration,
the reduction in NSE or no modification can reduce the NSE          the actor is required to recompute its policy. We now show
further. When no modification reduces the NSE to be within          that there exists a class of problems for which the actor’s
δD , configuration with the least NSE is returned.                  policy is unaffected by shaping.
    Though the designer calculates the utility for all modifi-      Definition 5. A modification is policy-preserving if the ac-
cations, using N , there is an implicit pruning in the shap-        tor’s policy is unaltered by environment shaping.
ing phase since only utility-maximizing modifications are              This property induces policy invariance before and af-
evaluated (Line 16). However, when multiple modifications           ter shaping and is therefore considered to be backward-
have the same cost and produce similar environment con-             compatible. Additionally, the actor is guaranteed to com-
figurations, the algorithm will alternate multiple times be-        plete its task with δA = 0 when a modification is policy-
tween planning and shaping to evaluate all these modifica-          preserving with respect to the actor’s optimal policy in E0 .
tions, which is |Ω| in the worst case. To minimize the num-         Figure 3 illustrates the dynamic Bayesian network for this
ber of evaluations in settings with large Ω, consisting of mul-     class of problems, where rP denotes the reward associated
tiple similar modifications, we present a greedy approach to        with oP and rN denotes the penalty for NSE. Let F denote
identify and evaluate diverse modifications.                        the set of features in the environment, with the actor’s policy
Selecting diverse modifications Let 0 < b ≤ |Ω| denote              depending on FP ⊂ F and FD ⊂ F altered by the designer’s
the maximum number of modifications the designer is will-           modifications. Let f~ and f~0 be the feature values before and
The actor’s best response is its optimal policy for oP in the
                                                                    current environment configuration, given δA :
                                                                                                                     0
                                                                       bA (E) = {π ∈ Π|∀π 0 ∈ Π, VPπ (s0 ) ≥ VPπ (s0 ) ∧
                                                                                            VPπ0 (s0 ) − VPπ (s0 ) ≤ δA },      (3)
                                                                    where π0 denotes the optimal policy in E0 before initiating
                                                                    environment shaping. These best responses are pure strate-
                                                                    gies since there exists a deterministic optimal policy for a
                                                                    discrete MDP and a modification is selected deterministi-
                                                                    cally. Proposition 3 shows that AD-NSE is an extensive-
                                                                    form game, which is useful to understand the link between
Figure 3: A dynamic Bayesian network description of a               two often distinct fields of decision theory and game the-
policy-preserving modification.                                     ory. This also opens the possibility for future work on game-
                                                                    theoretic approaches for mitigating NSE.
                                                                    Proposition 3. An AD-NSE with policy constraints induced
after environment shaping. Given the actor’s π, a policy-
                                                                    by Equations 2 and 3 induces an equivalent extensive form
preserving ω follows FD = F \ FP , ensuring f~P = f~P0 .            game with incomplete information, denoted by N .
Proposition 2. Given an environment configuration E and
actor’s policy π, a modification ω ∈ Ω is guaranteed to be          Proof Sketch. Our solution approach alternates between the
policy-preserving if FD = F \ FP .                                  planning phase and the shaping phase. This induces an ex-
                                                                    tensive form game N with strategy profiles Ω for the de-
Proof. We prove by contradiction. Let ω ∈ Ω be a modifi-            signer and Π for the actor, given start state s0 . However,
cation consistent with FD ∩ FP = ∅ and F = FP ∪ FD . Let            each agent selects a best response based on the information
π and π 0 denote the actor’s policy before and after shaping        available to it and its payoff matrix, and is unaware of the
with ω such that π 6= π 0 . Since the actor always computes         strategies and payoffs of the other agent. The designer is un-
an optimal policy for oP (assuming fixed strategy for tie-          aware of how its modifications may impact the actor’s policy
breaking), a difference in the policy indicates that at least       until the actor recomputes its policy and the actor is unaware
one feature in FP is affected by ω, f~P 6= f~P0 . This is a con-    of the NSE of its actions. Hence this is an extensive form
tradiction since FD ∩ FP = ∅. Therefore f~P = f~P0 and π = π 0 .    game with incomplete information.
Thus, ω ∈ Ω is policy-preserving when FD = F \ FP .
                                                                       Using Harsanyi transformation, N can be converted to
Corollary 1. A policy-preserving modification identified by         a Bayesian extensive-form game, assuming the players are
Algorithm 1 guarantees admissible shaping.                          rational (Harsanyi 1967). This converts N to a game with
   Furthermore, a policy-preserving modification results in         complete but imperfect information as follows. Nature se-
                                                                    lects a type τi , actor or designer, for each player, which de-
robust shaping when P r(FD = f~D ) = 0, ∀f~ = f~P ∪ f~D . The
                                                                    termines its available actions and its payoffs parameterized
policy-preserving property is sensitive to the actor’s state
                                                                    by θA and θD for the actor and designer, respectively. Each
representation. However, many real-world problems exhibit
                                                                    player knows its type but does not know τ and θ of the other
this feature independence. In the boxpushing example, the
                                                                    player. If the players maintain a probability distribution over
actor’s state representation may not include details about the
                                                                    the possible types of the other player and update it as the
rug since it is not relevant to the task. Moving the rug to
                                                                    game evolves, then the players select a best response, given
a different part of the room that is not on the actor’s path to
                                                                    their strategies and beliefs. This satisfies consistency and se-
the goal is an example of a policy-preserving and admissible
                                                                    quential rationality property, and therefore there exists a per-
shaping. Modifications such as removing the rug or covering
                                                                    fect Bayesian equilibrium for N .
it with a protective sheet are policy-preserving and robust
shaping since there is no direct contact between the box and
the rug, for all policies of the actor (Figure 1(b)).                           Extension to Multiple Actors
                                                                    Our approach can be extended to settings with multiple ac-
Relation to Game Theory Our solution approach with de-              tors and one designer, with slight modifications to Equa-
coupled objectives can be viewed as a game between the ac-          tion 1 and Algorithm 1. This models large environments
tor and the designer. The action profile or the set of strategies   such as infrastructure modifications to cities. The actors are
for the actor is the policy space Π, with respect to oP and E,      assumed to be homogeneous in their capabilities, tasks, and
with payoffs defined by its reward function R. The action           models, but may have different start and goal states. For ex-
profile for the designer is the set of all modifications Ω with     ample, consider multiple self-driving cars that have the same
payoffs defined by its utility function U . In each round of        capabilities, task, and model but are navigating between dif-
the game, the designer selects a modification that is a best        ferent locations in the environment. When optimizing travel
response to the actor’s policy π:                                   time, the vehicles may travel at a high velocity through pot-
                                                                    holes, resulting in a bumpy ride for the passenger as a side
       bD (π) = {ω ∈ Ω|∀ω 0 ∈ Ω, Uπ (ω) ≥ Uπ (ω 0 )}. (2)           effect. Driving fast through deep potholes may also damage
the car. If the road is rarely used, the utility-maximizing de-
sign may be to reduce the speed limit for that segment. If the
road segment with potholes is frequently used by multiple
vehicles, filling the potholes reduces the NSE for all actors.
The designer’s utility for an ω, given K actors, is:
                   XK                    
        U~π (ω) =         NπEi − NπΨ(E,ω)
                                    i
                                             − C(E, ω),
                  i=1
   with ~π denoting policies of the K actors. More complex
ways of estimating the utility may be considered. The pol-                       Figure 4: User tolerance score.
icy is recomputed for all the actors when the environment
is modified. It is assumed that shaping does not introduce a
multi-agent coordination problem that did not pre-exist. If a     such bumpiness when the AV drives fast through potholes
modification impacts the agent coordination for oP , it will      and 57.22% were willing to tolerate relatively severe NSE
be reflected in the slack violation. Shaping for multiple ac-     such as hard braking at a stop sign. Participants were also
tors requires preserving the slack guarantees of all actors. A    required to enter a tolerance score on a scale of 1 to 5, with 5
policy-preserving modification induces policy invariance for      being the highest. Figure 4 shows the distribution of the user
all the actors in this setting.                                   tolerance score for the two domains in our human subjects
                                                                  study. The mean, along with the 95% confidence interval for
                        Experiments                               the tolerance scores are 3.06±0.197 for the household robot
We present two sets of experimental results. First, we report     domain and 3.17 ± 0.166 for the AV domain. For household
the results of our user study conducted specifically to vali-     robot domain, 66.11% voted a score of ≥ 3 and 75% voted
date two key assumptions of our work: (1) users are willing       a score of ≥ 3 for the AV domain. Furthermore, 52.22%
to engage in shaping and (2) users want to mitigate NSE           respondents of the household robot survey voted that their
even when they are not safety-critical. Second, we evalu-         trust in the system’s abilities is unaffected by the NSE and
ate the effectiveness of shaping in mitigating NSE on two         34.44% indicated that their trust may be affected if the NSE
domains in simulation. The user study complements the ex-         are not mitigated over time. Similarly for the AV driving do-
periments that evaluate our algorithmic contribution to solve     main, 50.55% voted that their trust is unaffected by the NSE
this problem.                                                     and 42.78% voted that NSE may affect their trust if the NSE
                                                                  are not mitigated over time. These results suggest that (1)
User Study                                                        individual preferences and tolerance of NSE varies and de-
Setup We conducted two IRB-approved surveys focus-                pends on the severity of NSE; (2) users are generally willing
ing on two domains similar to the settings in our simula-         to tolerate NSE that are not severe or safety-critical, but pre-
tion experiments. The first is a household robot with capa-       fer to reduce them as much as possible; and (3) mitigating
bilities such as cleaning the floor and pushing boxes. The        NSE is important to improve trust in AI systems.
second is an autonomous driving domain. We surveyed the           Willingness to perform shaping We surveyed participants
participants to understand their attitudes to NSE such as the     to determine their willingness to perform environment shap-
household robot spraying water on the wall when cleaning          ing to mitigate the impacts of NSE. In the household robot
the floor, the autonomous vehicle (AV) driving fast through       domain, shaping involved adding a protective sheet on the
potholes which results in a bumpy ride for the users, and the     surface. In the driving domain, shaping involved installing a
AV slamming the brakes to halt at stop signs which results        pothole-detection sensor that detects potholes and limits the
in sudden jerks for the passengers. We recruited 500 partici-     vehicle’s speed. Our results show that 74.45% participants
pants on Amazon Mturk to complete a pre-survey question-          are willing to engage in shaping for the household robot do-
naire to assess their familiarity with AI systems and fluency     main and 92.5% for the AV domain. Among the respondents
in English. Based on the pre-survey responses, we invited         who were willing to install the sheet in the household robot
300 participants aged above 30 to complete the main survey,       domain, 64.39% are willing to purchase the sheet ($10) if it
since they are less likely to game the system (Downs et al.       not provided by the manufacturer. Similarly, 62.04% were
2010). Participants were required to select the option that       willing to purchase the sensor for the AV, which costs $50.
best describes their attitude towards NSE occurrence. Re-         These results indicate that (1) users are generally willing to
sponses that were incomplete or with a survey completion          engage in environment shaping to mitigate the impacts of
time of less than one minute were discarded and 180 valid         NSE; and (2) many users are willing to pay for environment
responses for each domain are analyzed.                           shaping as long as it is generally affordable, relative to the
Tolerance to NSE We surveyed participants to understand           cost of the AI system.
their attitude towards NSE that are not safety-critical and
whether such NSE affects the user’s trust in the system.          Evaluation in Simulation
   For the household robot setting, 76.66% of the partici-        We evaluate the effectiveness of shaping in two domains:
pants indicated willingness to tolerate the NSE. For the driv-    boxpushing (Figure 1) and AV driving (Figure 5). The ef-
ing domain, 87.77% were willing to tolerate milder NSE            fectiveness of shaping is studied in settings with avoidable
agent’s location represented by hx, yi. We consider a grid of
                                                                  size 25 × 15. The actor can move in all four directions at
                                                                  low and high speeds, with costs 2 and 1 respectively. Driv-
                                                                  ing fast through shallow potholes results in a bumpy ride for
                                                                  the user, which is a mild NSE with a penalty of 2. Driving
                                                                  fast through a deep pothole may damage the car in addi-
                                                                  tion to the unpleasant experience for the rider and therefore
                                                                  it is a severe NSE with a penalty of 5. Infrastructure man-
                                                                  agement authorities are the designers for this setting and the
  Figure 5: AV environment shaping with multiple actors.          state space is divided into four zones, similar to geographical
                                                                  divisions of cities for urban planning.
NSE, unavoidable NSE, and with multiple actors.                   Available Modifications        We consider 24 modifications
Baselines The performance of shaping with budget is com-          for the boxpushing domain, such as adding a protective sheet
pared with the following baselines. First is the Initial ap-      over the rug, moving the vase to corners of the room, remov-
proach in which the actor’s policy is naively executed and        ing the rug, block access to the rug and vase area, among
does not involve shaping or any form of learning to mitigate      others. Removing the rug costs 0.4/unit area covered by the
NSE. Second is the shaping with exhaustive search to select       rug, moving the vase costs 1, and all other modifications cost
a modification, b = |Ω|. Third is the Feedback approach in        0.2/unit. Except for blocking access to the rug and vase area,
which the gent performs a trajectory of its optimal policy        all the other modifications are policy-preserving for the ac-
for oP and the human approves or disapproves the observed         tor state representation described earlier. The modifications
actions based on the NSE occurrence (Saisubramanian, Ka-          considered for the driving domain are: reduce the speed limit
mar, and Zilberstein 2020). The agent then disables all the       in zone i, 1 ≤ i ≤ 4; reduce speed limits in all four zones;
disapproved actions and recomputes a policy for execution.        fill all potholes; fill deep potholes in all zones; reduce the
If the updated policy violates the slack, the actor ignores the   speed limit in zones with shallow potholes; and fill deep
feedbacks and executes its initial policy since oP is priori-     potholes. Reducing the speed costs 0.5 units per pothole in
tized. We consider two variants of the feedback approach:         that zone and filling each pothole costs 2 units. Reducing
with and without generalizing the gathered information to         the speed limit disables the ‘move fast’ action for the actor.
unseen situations. In Feedback w/ generalization, the actor       Filling the potholes is a policy-preserving modification.
uses the human feedback as training data to learn a predic-       Effectiveness of shaping         The effectiveness of shaping
tive model of NSE occurrence. The actor disables the actions      is evaluated in terms of the average NSE penalty incurred
disapproved by the human and those predicted by the model,        and the expected value of oP after shaping. Figure 6 plots
and recomputes a policy.                                          the results with δA = 0, δD = 0, and b = 3, as the num-
   We use Jaccard distance to measure the similarity between      ber of observed actor trajectories is increased. The results
environment configurations. A random forest classifier from       are plotted for the boxpushing domain with one actor in set-
the sklearn Python package is used for learning a predic-         tings with avoidable and unavoidable NSE. Feedback budget
tive model. The actor’s MDP is solved using value iteration.      is 500. Feedback without generalization does not minimize
The algorithms were implemented in Python and tested on           NSE and performs similar to initial baseline, because the ac-
a computer with 16GB of RAM. We optimize action costs,            tor has no knowledge about the undesirability of actions not
which are negation of rewards. Values averaged over 100 tri-      included in the demonstration. As a result, when an action is
als of planning and execution, along with the standard errors,    disapproved, the actor may execute an alternate action that
are reported for the following domains.                           is equally bad or worse than the initial action in that state,
Boxpushing We consider a boxpushing domain (Saisub-               in terms of NSE. Generalizing the feedback overcomes this
ramanian, Kamar, and Zilberstein 2020) in which the actor         drawback and mitigates the NSE relatively. With at most five
is required to minimize the expected time taken to push a         trajectories, the designer is able to select a policy-preserving
box to the goal location (Figure 1). Each action costs one        modification that avoids NSE. Shaping w/ budget b = 3
time unit. The actions succeed with probability 0.9 and may       performs similar to shaping by evaluating all modifications.
slide right with probability 0.1. Each state is represented as    The trend in the relative performances of the different tech-
hx, y, bi where x, y denote the agent’s location and b is a       niques is similar for both avoidable and unavoidable NSE.
boolean variable indicating if the agent is pushing the box.      Table 6(f) shows that shaping w/ budget reduces the number
Pushing the box over the rug or knocking over a vase on its       of shaping evaluations by 50%.
way results in NSE, incurring a penalty of 5. The designers
are the system users. We experiment with grid size 15× 15.        Effect of slack on shaping        We vary δA and δD , and
                                                                  plot the resulting NSE penalty in Figure 7. We report re-
Driving     Our second domain is based on simulated au-           sults for the driving domain with a single actor and designer,
tonomous driving (Saisubramanian, Kamar, and Zilberstein          and shaping based on 100 observed trajectories of the actor.
2020) in which the actor’s objective is to minimize the ex-       We vary δA between 0-25% of VP∗ (s0 |E0 ) and δD between
pected cost of navigation from start to a goal location, dur-     0-25% of the NSE penalty of the actor’s policy in E0 . Fig-
ing which it may encounter some potholes. Each state is the       ure 7 shows that increasing the slack helps reduce the NSE,
(a) Average NSE penalty.                (b) Expected cost of oP .                (c) Average NSE penalty.

                                                                                     # Trajectories    Shaping    Shaping w/ budget
                                                                                            2             1               1
                                                                                            5             2               1
                                                                                           10             2               1
                                                                                           20             2               1
                                                                                           50             6               3
                                                                                          100             6               3
       (d) Expected cost of oP .              (e) Legend for plots (a)-(d)          (f) # Modifications evaluated when NSE are avoidable

                Figure 6: Results on boxpushing domain with avoidable NSE (a-b) and unavoidable NSE (c-d).

Figure 7: Effect of δA and δD on the average NSE penalty.

as expected. In particular, when δD ≥ 15%, the NSE penalty
is considerably reduced with δA = 15%. Overall, increasing                           Figure 8: Results with multiple actors.
δA is most effective in reducing the NSE. We also tested the
effect of slack on the cost for oP . The results showed that the
cost was predominantly affected by δA . We observed similar             practical. Overall, as we increase the number of agents, sub-
performance for all values of δD for a fixed δA and therefore           stantial reduction in NSE is observed with shaping. The time
do not include that plot. This is an expected behavior since            taken for shaping and shaping w/ budget are comparable up
oP is prioritized and shaping is performed in response to π.            to 100 actors, beyond which we see considerable time sav-
                                                                        ings of using shaping w/ budget. For 300 actors, the average
Shaping with multiple actors We measure the cumula-                     time taken for shaping is 752.27 seconds and the time taken
tive NSE penalty incurred by varying the number of actors               for shaping w/budget is 490.953 seconds.
in the environment from 10 to 300. Figure 8 shows the re-
sults for the driving domain, with δD = 0 and δA = 25% of                            Discussion and Future Work
the optimal cost for oP , for each actor. The start and goal lo-
cations were randomly generated for the actors. Shaping and             We present an actor-designer framework to mitigate the im-
feedbacks are based on 100 observed trajectories of each ac-            pacts of NSE by shaping the environment. The general idea
tor. Shaping w/ budget results are plotted with b = 4, which            of environment modification to influence the behaviors of
is 50%|Ω|. The feedback budget for each actor is 500. Al-               the acting agents has been previously explored in other con-
though the feedback approach is sometimes comparable to                 texts (Zhang, Chen, and Parkes 2009), such as to accelerate
shaping when there are fewer actors, it requires overseeing             agent learning (Randløv 2000), to quickly infer the goals of
each actor and providing individual feedback, which is im-              the actor (Keren, Gal, and Karpas 2014), and to maximize
the agent’s reward (Keren et al. 2017). We study how en-           man, J.; and Mané, D. 2016. Concrete Problems in AI
vironment shaping can mitigate NSE. We also identify the           Safety. arXiv preprint arXiv:1606.06565 .
conditions under which the actor’s policy is unaffected by         Downs, J. S.; Holbrook, M. B.; Sheng, S.; and Cranor,
shaping and show that our approach guarantees bounded-             L. F. 2010. Are Your Participants Gaming the System?
performance of the actor.                                          Screening Mechanical Turk Workers. In Proceedings of the
   Our approach is well-suited for settings where the agent is     SIGCHI Conference on Human Factors in Computing Sys-
well-trained to perform its assigned task and must perform         tems, 2399–2402.
it repeatedly, but NSE are discovered after deployment. Al-
                                                                   Guestrin, C.; Koller, D.; and Parr, R. 2001. Max-norm Pro-
tering the agent’s model after deployment requires the envi-
                                                                   jections for Factored MDPs. In Proceedings of the 17th In-
ronment designer to be (or have the knowledge of) the de-
                                                                   ternational Joint Conference on Artificial Intelligence.
signer of the AI system. Otherwise, the system will have to
be suspended and returned to the manufacturer. Our frame-          Hadfield-Menell, D.; Milli, S.; Abbeel, P.; Russell, S. J.; and
work provides a tool for the users to mitigate NSE without         Dragan, A. 2017. Inverse Reward Design. In Advances in
making any assumptions about the agent’s model, its learn-         Neural Information Processing Systems, 6765–6774.
ing capabilities, the representation format for the agent to ef-   Harsanyi, J. C. 1967. Games with Incomplete Informa-
fectively use new information, or the designer’s knowledge         tion Played by “Bayesian” Players, Part I. The Basic Model.
about the agent’s model. We argue that is many contexts, it is     Management science 14(3): 159–182.
much easier for users to reconfigure the environment than to       Hendrickson, C.; Biehler, A.; and Mashayekh, Y. 2014. Con-
accurately update the agent’s model. Our algorithm guides          nected and Autonomous Vehicles 2040 Vision. Pennsylva-
the users in the shaping process.                                  nia Department of Transportation .
   In addition to the greedy approach described in Algo-
rithm 2, we also tested a clustering-based approach to iden-       Keren, S.; Gal, A.; and Karpas, E. 2014. Goal Recognition
tify diverse modifications. In our experiments, the clustering     Design. In Proceedings of the 24th International Conference
approach performed comparably to the greedy approach in            on Automated Planning and Scheduling.
identifying diverse modifications but had a longer run time.       Keren, S.; Pineda, L.; Gal, A.; Karpas, E.; and Zilberstein, S.
It is likely that benefits of the clustering approach would be     2017. Equi-Reward Utility Maximizing Design in Stochas-
more evident in settings with a much larger Ω, since it can        tic Environments. In Proceedings of the 26th International
quickly group similar modifications. In the future, we aim to      Joint Conference on Artificial Intelligence, 4353–4360.
experiment with a large set of modifications, say |Ω| > 100,       McFarland, M. 2020. Michigan Plans to Redesign a Stretch
and compare the performance of the clustering approach             of Road for Self-Driving Cars. URL https://www.cnn.com/
with that of Algorithm 2.                                          2020/08/13/cars/michigan-self-driving-road/index.html.
   When the actor’s model Ma has a large state space, ex-          Randløv, J. 2000. Shaping in Reinforcement Learning by
isting algorithms for solving large MDPs may be leveraged.         Changing the Physics of the Problem. In Proceedings of the
When Ma is solved approximately, without bounded guar-             17th International Conference on Machine Learning.
antees, slack guarantees cannot be established. Extending
our approach to multiple actors with different models is an        Saisubramanian, S.; Kamar, E.; and Zilberstein, S. 2020.
interesting future direction.                                      A Multi-Objective Approach to Mitigate Negative Side Ef-
                                                                   fects. In Proceedings of the 29th International Joint Confer-
   We conducted experiments with human subjects to val-
                                                                   ence on Artificial Intelligence, 354–361.
idate their willingness to perform environment shaping. In
our user study, we focused on scenarios that are more acces-       Saisubramanian, S.; Zilberstein, S.; and Kamar, E. 2020.
sible to participants from diverse backgrounds so that we can      Avoiding Negative Side Effects due to Incomplete Knowl-
obtain reliable responses regarding their overall attitude to      edge of AI Systems. CoRR abs/2008.12146.
NSE and shaping. In the future, we aim to evaluate whether         Shah, R.; Krasheninnikov, D.; Alexander, J.; Abbeel, P.; and
Algorithm 1 aligns with human judgment in selecting the            Dragan, A. 2019. Preferences Implicit in the State of the
best modification. Additionally, we will examine ways to au-       World. In Proceedings of the 7th International Conference
tomatically identify useful modifications in a large space of      on Learning Representations (ICLR).
valid modifications.                                               Simon, M. 2019. Inside the Amazon Warehouse Where Hu-
                                                                   mans and Machines Become One. URL https://www.wired.
Acknowledgments Support for this work was provided                 com/story/amazon-warehouse-robots/.
in part by the Semiconductor Research Corporation under            Zhang, H.; Chen, Y.; and Parkes, D. C. 2009. A General Ap-
grant #2906.001.                                                   proach to Environment Design With One Agent. In Proceed-
                                                                   ings of the 21st International Joint Conference on Artificial
                        References                                 Intelligence, 2002–2014.
                                                                   Zhang, S.; Durfee, E. H.; and Singh, S. P. 2018. Minimax-
Agashe, N.; and Chapman, S. 2019. Traffic Signs in the
                                                                   Regret Querying on Side Effects for Safe Optimality in
Evolving World of Autonomous Vehicles. Technical report,
                                                                   Factored Markov Decision Processes. In Proceedings of
Avery Dennison.
                                                                   the 27th International Joint Conference on Artificial Intel-
Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schul-       ligence, 4867–4873.
You can also read