SWIRL: A Sequential Windowed Inverse Reinforcement Learning Algorithm for Robot Tasks With Delayed Rewards - Stanford University

Page created by Deborah Sherman

Society

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

SWIRL: A Sequential Windowed Inverse Reinforcement Learning Algorithm for Robot Tasks With Delayed Rewards - Stanford University

Journal Title
XX(X):1–18
SWIRL: A Sequential Windowed Inverse ©The Author(s) 0000
Reprints and permission:
Reinforcement Learning Algorithm for sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/ToBeAssigned

Robot Tasks With Delayed Rewards www.sagepub.com/

Sanjay Krishnan1 , Animesh Garg1,3 , Richard Liaw1 , Brijen Thananjeyan1 ,
Lauren Miller1 , Florian T. Pokorny2 , Ken Goldberg1

Abstract
We present Sequential Windowed Inverse Reinforcement Learning (SWIRL), a three-phase algorithm that automatically
partitions a task into shorter-horizon sub-tasks based on transitions that occur consistently across demonstrations.
SWIRL then learns a sequence of local reward functions using Maximum Entropy Inverse Reinforcement Learning.
Once these reward functions are learned, SWIRL applies Q-learning to compute a policy that maximizes the rewards.
SWIRL leverages both expert demonstrations and exploration to find policies for robotic tasks with delayed rewards.
Experiments suggest that SWIRL requires significantly fewer rollouts than pure RL and fewer expert demonstrations
than behavioral cloning to learn a policy. We evaluate SWIRL in two simulated control tasks, parallel parking and a
two-link pendulum. On the parallel parking task, SWIRL achieves the maximum reward on the task with 85% fewer
rollouts than Q-Learning, and 33% fewer rollouts than Q-Learning where the rewards were shaped by IRL. We also
consider physical experiments on surgical tensioning and cutting deformable sheets using the da Vinci surgical robot.
On the deformable tensioning task, SWIRL achieves a 36% relative improvement in reward compared to a baseline of
behavioral cloning with segmentation.

1 Introduction et al. 2014). Often these segments are manually designed or
derived from a dictionary of motion primitives, but recently,
An important problem in robot learning is defining a reward there are several approaches for learning segmentation
function that accurately reflects a robot’s ability to perform criteria automatically by identifying locally similar structure
a task. However, in many cases, the natural reward function in demonstration data (Barbič et al. 2004; Chiappa and
for a task is delayed, where the consequences of an action are Peters 2010; Alvarez et al. 2010; Calinon et al. 2010; Krüger
only observed long after it is taken. Such reward functions et al. 2012; Niekum et al. 2012; Wächter and Asfour 2015;
are difficult to directly optimize with exploration-based Lee et al. 2015). While prior work has mostly considered
techniques like Reinforcement Learning (RL). For example, segmentation to reduce the complexity of deterministic
in a multi-step assembly task, one might have a classifier that planning problems, this paper explores how segmentation
can determine if the full part is correctly assembled. In this can inform reward derivation in the Markov Decision Process
problem, RL would have to rely on random exploration to setting.
achieve the assembled state by chance at least once, before it We model a task as a sequence of quadratic reward
can learn a more efficient policy. functions Rseq = [R1 , ..., Rk ] and transition regions G =
One approach is to use expert demonstrations to learn [ρ1 , ..., ρk ] such that R1 is the reward function until ρ1
a smoother reward function that gives the robot a stronger is reached, after which R2 becomes the reward and so
reward signal at intermediate steps. This idea is related to on. We assume that we have access to a supervisor that
apprenticeship learning (Kolter et al. 2007a; Coates et al. provides demonstrations that are optimal w.r.t an unknown
2008; Abbeel and Ng 2004). In apprenticeship learning, a Rseq , and reach each ρ ∈ G (also unknown) in the same
supervisor provides a small number of initial demonstrations, sequence. Sequential Windowed Inverse Reinforcement
and there is a two-phase approach that first applies Inverse Learning (SWIRL) is an algorithm to recover Rseq and
Reinforcement Learning (IRL) to infer the supervisor’s G from the demonstration trajectories. SWIRL applies
implicit reward function, and then optimizes for this reward to tasks with a discrete or continuous state-space and a
function using RL. We explore whether we can leverage discrete action-space. The state space can represent spatial,
the same basic apprenticeship learning framework to learn kinematic, or sensory states (e.g., visual features), as long as
reward functions for tasks with a sequence of state-space
sub-goals that must be reached.
1
Segmentation is a well-studied problem, and it facilitates AUTOLAB, UC Berkeley automation.berkeley.edu
2
learning localized control policies (Murali* et al. 2015; RPL/CSC, KTH Royal Institute of Technology, Stockholm, Sweden
3
Stanford University
Niekum et al. 2012; Konidaris et al. 2011), adaptation to
unseen scenarios (Ijspeert et al. 2002; Ude et al. 2010), Corresponding author:
and demonstrator skill-assessment (Reiley et al. 2010; Gao Sanjay Krishnan sanjay@eecs.berkeley.edu

Prepared using sagej.cls [Version: 2016/06/24 v1.10]

2                                                                                                           Journal Title XX(X)

the trajectories are smooth and not very high-dimensional.               one such algorithm, and we describe extensions that
The discrete actions are not a fundamental restriction, but              account for non-linearities and discontinuities. For this
relaxing that constraint is deferred to future work. Finally,            class of segmentation algorithms, policy learning can
Rseq and G can be used in an RL algorithm to find an optimal             be efficiently done on an augmented state-space with
policy for a task.                                                       indicators tracking the previously reached segments.
   SWIRL segments the demonstrations using a variant
of a segmentation algorithm proposed in our prior                     3. We apply SWIRL to two simulated control tasks, a
work (Krishnan* et al. 2015; Murali* et al. 2016), called                noisy non-holonomic car and a two-link pendulum,
Transition State Clustering (TSC). TSC identifies locally                and two physical experiments on the da Vinci surgical
similar dynamical segments in a trajectory and fits a                    robot.
Gaussian Mixture Model to the endpoints of the segments
to learn a model to determine when and where a segment
terminates. While our original motivation was to improve          2    Related Work and Background
the robustness of kinematic segmentation algorithms by
pruning sparse clusters, TSC can also be interpreted as           Apprenticeship Learning:         Abbeel and Ng (2004)
inferring the subtask transition regions G. SWIRL extends         argue that the reward function is often a more concise
TSC by (1) formalizing a broader class of Markov                  representation of task than a policy. As such, a concise
segmentation algorithms that apply to the IRL setting where       reward function is more likely to be robust to small
TSC is a special case, and (2) combining TSC with a               perturbations in the task description. The downside is that
kernel embedding to handle certain types of non-linearities       the reward function is not useful on its own, and ultimately
and discontinuities in the state-space. Once the segments         a policy must be retrieved. In the most general case, an
are found, SWIRL applies Maximum Entropy Inverse                  RL algorithm must be used to optimize for that reward
Reinforcement Learning (MaxEnt-IRL) (Ziebart et al. 2008)         function (Abbeel and Ng 2004). It is well-established that
to each segment to find Rseq . Segmentation further simplifies    RL problems often converge slowly in complex tasks when
the estimation of dynamics models, which are required             rewards are sparse and not “shaped” appropriately (Ng et al.
for inference in MaxEnt-IRL, since locally many complex           1999; Judah et al. 2014). Our work re-visits this two-phase
systems can be approximated linearly in a short time horizon.     algorithm in the context of sequential tasks and techniques
   Learning a policy from Rseq and G is nontrivial because        to scale such an approach to longer time horizons. Related
solving k independent problems neglects any shared structure      to SWIRL, Kolter et al. studied Hierarchical Apprenticeship
in the value function during the policy learning phase (e.g.,     Learning to learn bipedal locomotion (Kolter et al. 2007a),
a common failure state). Jointly learning over all segments       where the algorithm is provided with a hierarchy sub-
introduces a dependence on history, namely, any policy            tasks. We explore automatically inferring a sequential task
must complete step i before step i + 1. Learning a memory-        structure from data.
dependent policy could lead to an exponential overhead of         Motion Primitives: The LfD and planning communities
additional states. SWIRL exploits the fact that TSC is a          studied the problem of leveraging motion primitives,
Markov segmentation algorithm and shows that the problem          or libraries of temporally extended action sequences,
can be posed as a proper MDP in a lifted state-space that         to improve generalization. Dynamic Motion Primitives
includes an indicator variable of the highest-index {1, ..., k}   construct new motions through a composition of dynamical
transition region that has been reached so far.                   building blocks (Ijspeert et al. 2002; Pastor et al. 2009;
   The basic model follows from a special case of                 Manschitz et al. 2015). Much of the work in motion
Hierarchical Reinforcement Learning (Sutton et al. 1999).         primitives considered manually identified segments, but
In hierarchical reinforcement learning, multi-step skills are     recently, Niekum et al. (Niekum et al. 2012) proposed
composed of local policies called “options”. Each option          learning the set of primitives from demonstrations using the
executes until a termination condition, and a meta-policy         Beta-Process Autoregressive Hidden Markov Model (BP-
selects the next option to execute. SWIRL is an IRL               AR-HMM). Similarly, Calinon (2014) proposed the task-
framework for inferring termination conditions (G) and local      parametrized movement model with GMMs for automatic
reward functions that guide the agent to these termination        action segmentation. Both Niekum and Calinon considered
states Rseq , where the meta-policy is deterministic and          the motion planning setting in which analytical planning
sequential.                                                       methods are used to solve a task. To the best of our
   In summary, our contributions are:                             knowledge, SWIRL is the first to consider segmentation in
                                                                  the IRL setting, where the dynamics can be stochastic.
    1. We describe a model for sequential robot tasks, where
       rewards sequentially switch upon arrival in a transition   Segmentation: Trajectory segmentation is a well-studied
       region, and an IRL algorithm called Sequential             area of research dating back to early biomechanics and
       Windowed Inverse Reinforcement Learning to infer           robotics research. For example, Viviani and Cenzato (1985)
       the rewards and transitions from demonstrations. The       explored using the “two-thirds” power law coefficient to
       algorithm has three phases: segmentation, inverse          determine segment boundaries in handwriting. Morasso
       reinforcement learning, and policy learning.               (1983) showed that rhythmic 3d motions of a human arm
                                                                  could be modeled as piecewise linear. In a seminal paper,
    2. We describe a class of segmentation algorithms,            Sternad and Schaal (1999) provided a formal framework
       Markov segmentation algorithms, which can be used          for control-theoretic segmentation of trajectories. Botvinick
       to partition a task. Transition State Clustering is        et al. (2009) explored the reinforcement learning analog

Prepared using sagej.cls

of the control-theoretic models. Concurrently, temporal- S × A 7→ R. Associated with each Ri is a transition region
segmentation was developing in the motion capture ρi ⊆ S, which is a subset of the state-space. Each trajectory
community (Moeslund and Granum 2001). Recently, accumulates a reward Ri until it reaches the transition ρi ,
some Bayesian approaches have been proposed for the then the robot switches to the next reward and transition pair.
segmentation problem (Asfour et al. 2006; Calinon and This process continues until ρk is reached. A robot is deemed
Billard 2004; Kruger et al. 2010; Vakanski et al. 2012; successful when all of the ρi ∈ G are reached in sequence.
Tanwani and Calinon 2016). One challenge is collecting Inverse Reinforcement Learning (IRL) Ng and Russell
enough data to employ these techniques and tuning the (2000) describes the problem of observing an agent’s
hyperparameters. In prior work, we observed that under behavior and inferring a reward function that best explains
the assumption that the task is sequential (same order of the agent’s actions (assuming the agent is behaving
primitives in each demonstration) the inference could be optimally). Let D = {di } be a set of demonstrations of a
modeled as a two-level clustering problem (Krishnan* et al. robotic task. Each demonstration of a task d is a discrete-time
2015). This results in improved accuracy and robustness for sequence of T feature vectors. In this paper, we consider the
a small number of demonstrations. Another relevant result is sequential version of this problem, where we have to infer k
from Ranchod et al. (2015), who use an IRL model to define reward functions and k transition regions.
the primitives, in contrast to the problem of learning a policy
Sequential IRL Problem: Given observations of a
after IRL.
successful robot through a set of demonstration trajectories
Hierarchical Reinforcement Learning: The field of hier- D = {d1 , ..., dk }, infer Rseq and G.
archical reinforcement learning has a long history (Sutton
et al. 1999; Barto and Mahadevan 2003) in AI and in the Most IRL algorithms implicitly learn an optimal policy.
analysis of biological systems (Botvinick 2008; Botvinick However, the dynamics of the demonstration environment
et al. 2009; Solway et al. 2014; Zacks et al. 2011; Whiten can differ from the dynamics of the execution environment
et al. 2006). Early work in hierarchical control demonstrated in unknown ways–and the rewards might transfer but the
the advantages of hierarchical structures by handcrafting hi- policies might not. Furthermore, we observed that in practice
erarchical policies (Brooks 1986) and by learning them given the reward function is often more concise than the policy and
various manual specifications: state abstractions (Dayan and more tolerant to estimation errors. If we have only observed
Hinton 1992; Hengst 2002; Kolter et al. 2007b; Konidaris a small number of demonstrations, one might have enough
and Barto 2007), a set of waypoints (Kaelbling 1993), low- data to learn a reasonable reward function but not a viable
level skills (Huber and Grupen 1997; Bacon and Precup policy. Therefore, after learning Rseq and G, there needs to
2015), a set of finite-state meta-controllers (Parr and Russell be a policy learning phase. In the most general case, we will
1997), a set of subgoals (Sutton et al. 1999; Dietterich 2000), have to apply a technique like RL to learn a policy.
or intrinsic reward (Kulkarni et al. 2016). The key abstraction Sequential RL: Given a new instance, Rseq , and G, learn an
used in hierarchical RL is the “options” framework, where optimal policy π ∗ .
sub-skills are represented by local policies, termination con-
ditions, and initialization conditions. A high-level policy
switches between these options and composes them into a 3.2 Modeling Assumptions
larger task policy. In this framework, per sub-skill reward To make these problem statements more formal and compu-
functions are called sub-goals. SWIRL is an algorithm to tationally tractable, we make some modeling assumptions.
learn quadratic sub-goals and termination conditions, where
the high-level policy is deterministic and sequentially iterates Assumption 1. Reward Transitions are Identifiable: The
through the sub-skills. key challenge in this problem is determining when a
transition occurs–identifying the points in time in each
trajectory at which the robot reaches a ρi and transitions
3 Model and Problem Statement the reward function. The natural first question is whether
3.1 Basic Setup this is identifiable, that is, whether it is even theoretically
possible to determine whether a transition ρi → ρi+1 has
Consider a finite-horizon Markov Decision Process (MDP):
occurred after obtaining an infinite number of observations.
M = hS, A, P(·, ·), R, T i, Trivially, this is not guaranteed when Ri+1 = Ri , where it
would be impossible to identify a transition purely from
where S is the set of states (continuous or discrete), A is
the supervisor’s behavior (i.e., no change in reward, implies
the set of actions (finite and discrete), P : S × A 7→ Pr(S)
no change in behavior). Perhaps surprisingly, this is still
is the dynamics model that maps states and actions to a
not guaranteed even if Ri+1 6= Ri due to policy invariance
probability density over subsequent states, T is the time-
classes Ng et al. (1999). Consider a reward function Ri+1 =
horizon, and R is a reward function that maps trajectories
2Ri , which functionally induce the same optimal behavior.
of length T to scalar values. At every state s, we also observe
Therefore, we consider a setting where all of the rewards in
a vector of perceptual features x ∈ X . The feature space
Rseq are distinct and are not equivalent w.r.t optimal policies.
can be a concatenation of kinematic features Xk (e.g., robot
There are known necessary and sufficient conditions, see
position) and sensory features Xs (e.g., visual features from
Theorem 1 in Ng et al. (1999).
the environment). We assume that this feature space is low-
dimensional. Assumption 2. Myopic Optimality: Next, to be able
Sequential tasks are tasks defined in terms of a sequence to infer the reward function we have to assume that the
of reward functions, Rseq = [R1 , ..., Rk ], where each Ri : supervisor is behaving optimally. However, in the sequential

Prepared using sagej.cls

4                                                                                                              Journal Title XX(X)

problem, the globally optimal solution (maximizes the                During the online phase (Policy Learning), the algorithm
cumulative reward of all sub-tasks) is not necessarily locally    only observes the partial trajectory up to the current time-step
optimal. For example, it might be advantageous to be sub-         and does not observe which segment is active. In this sense,
optimal in an earlier sub-task if it leads to a much higher       segmentation introduces a problem of partial observation
reward in a later sub-task. We make the assumption that the       even if the original task is fully observed. The segmentation
supervisor’s behavior is myopic, i.e., the supervisor applies     needs to be estimated from the history of the process.
the optimal stationary policy with respect to its current            Trivially, some algorithms are not applicable since they
reward function ignoring all future rewards.                      might require knowledge of future data (e.g., a forward-
                                                                  backward HMM algorithm). Even if the algorithm is causal,
Assumption 3. Successful Demonstrations: We also need
                                                                  it might have an arbitrary dependence on the past leading
conditions on the demonstrations to be able to infer G. We
                                                                  to inefficient estimation of the currently active segment. To
assume that all demonstrations are successful, that is, they
                                                                  address this problem, we formalize the following condition:
visit each ρi ∈ G in the same sequence.
Assumption 4. Quadratic Rewards: We assume that each              Definition 1. Segmentation. A segmentation of a task is
reward function Ri can be expressed as a quadratic of the         a function F that maps every state-time tuple to an index
form (x − x0 )T Q(x − x0 ) for some positive semi-definite Q,     {1, ..., k}:
some feature vector x that is a function of the current state,                     F : X × Z+ 7→ {1, ..., k}
and a center point x0 with x0T Qx0 = 0. This means that for          A Markov segmentation function is a task segmentation
a d-dimensional feature space there are O(kd 2 ) parameters       where the segmentation index of time t + 1 can be completely
that describe the reward function.                                determined by the featurized state xt at time t and the index
Assumption 5. Ellipsoidal Approximation: Finally, we              it at time t:
assume that the transition regions in G can be approximated                             it+1 = M(xt , it )
by a set of disjoint ellipsoids over the perceptual features.
                                                                  4.2    General Framework
3.3      Algorithm Description                                    We now describe a general framework that takes a
Let D be a set of demonstration trajectories {d1 , ..., dN } of   segmentation algorithm and extracts a Markov segmentation
a task with a delayed reward. SWIRL can be described in           criteria. Suppose, we are given a function that does the
terms of three sub-algorithms:                                    following:
Inputs: Demonstrations D                                          Definition 2. Transition Indicator Function. A transition
1. Sequence Learning: Given D, SWIRL segments the task            indicator function T is a function that maps each featurized
   into k sub-tasks whose start and end are defined by arrival    state x ∈ X in a demonstration d to {0, 1}:
   at a sequence of transitions G = [ρ1 , ..., ρk ].
2. Reward Learning: Given G and D, SWIRL associates                                        T : X 7→ {0, 1}
   a local reward function with each segment resulting in a
   sequence of rewards Rseq .                                        This function just marks candidate segment endpoints,
3. Policy Learning: Given Rseq and G, SWIRL applies               called transitions, in a trajectory. The above definition
   reinforcement learning for I iterations to learn a policy      naturally leads to a notion of transition states, the states and
   for the task π.                                                times at which transitions occur.
Outputs: Policy π                                                 Definition 3. Transition States. For a demonstration set D,
   The transition regions G provide a way to verify that          Transition States are the set of state-time tuples where the
the learned policy is viable. We can rollout the policy and       indicator is 1:
observe whether it reaches all of the ρi ∈ G in the right
sequence. If this is not the case, we can report a failure.                         Γ = {(x,t) ∈ D : T(x) = 1}

                                                                     We model the set Γ as samples from an underlying
4     Phase 1: Sequence Learning
                                                                  distribution over the state-space and time.
The first phase of SWIRL is to segment the demonstrations.
                                                                                             Γ ∼ f (x,t)
4.1      Formalizing Segmentation
                                                                  We approximate this distribution with a GMM:
While there are some different algorithms for segmenting
a set of demonstrations into sub-sequences, not all are                    f (x,t) ≈ GMM(π, {µ1 , ..., µk }, {Σ1 , ..., Σk })
directly applicable to the Sequential IRL problem setting.
In our problem, segments are used in two different ways.          This approximation works in practice when the state-space
During the offline phases of the algorithm (Sequence              is low dimensional and the densities are often relatively
Learning and Reward Learning), the algorithm observes             smooth. The interpretation of this distribution is π describes
the full demonstration trajectory. These segments are used        the fraction of transitions assigned to each mixture compo-
to generate the local reward functions. Since it is offline,      nent, µi describes the centroid of the mixture component,
the segmentation is fully observed, and all of the learning       and Σ describes the covariance. While for some distributions
components know which segment is active at any given time         GMMs are a poor approximation, they have shown empirical
step.                                                             success for trajectory segmentation (Calinon 2014). Our

Prepared using sagej.cls

5

prior work (Krishnan* et al. 2015; Murali* et al. 2016), de-            Algorithm 1: Sequence Learning
scribes a number of practical optimizations such as pruning              Data: Demonstration D
low-confidence mixture components.                                  1    Fit a DP-GMM model to D and identify the set of transitions
   For each mixture component, we can define ellipsoids                   Θ, defined as all (xt ,t) where (xt+1 ,t + 1) has a different
by taking the confidence level-sets in the state-space and                cluster.
time that characterize regions where transitions occur.             2    Fit a DP-GMM to the states in Θ.
These regions are ordered since they are also defined over          3    Prune clusters that do not have one transition from all
                                                                          demonstrations.
time, since we make the assumption that the confidence
                                                                    4    The result of is G = [ρ1 , ρ2 , ..., ρm ] where each ρ is a disjoint
threshold for the level sets is tuned so that the regions                 ellipsoidal region of the state-space and time interval.
are disjoint. Thus, reaching one of these regions defines                Result: G
a testable condition based on the current state, time, and
previously reached regions–which is a Markov Segmentation
Function. The result is exactly the set of transition regions:      et al. 1998). By changing the kernel function (i.e., the
G = [ρ1 , ρ2 , ..., ρk ], and segmentation of each demonstration    similarity metric between states), we can essentially change
trajectory into k segments.                                         the definition of local linearity.
   In typical GMM formulations, one must specify the                   Let κ(xi , x j ) define a kernel function over the states.
number of mixture components k before hand. However, we             For example, if κ is the radial basis function (RBF),
apply results from Bayesian non-parametric statistics and                                    −kxi −x j k2
                                                                                                        2
jointly solve for the component parameters and the number           then: κ(xi , x j ) = e 2σ . κ naturally defines a matrix M
of components using a Dirichlet Process (Kulis and Jordan           where: Mi j = κ(xi , x j ). The top p0 eigenvalues define a new
                                                                                                                    0
2011). The DP places a soft-prior on the number of clusters.        embedded feature vector for each ω in R p . We can now
During inference, the number of components grows with               apply the algorithm above in this embedded feature space.
the complexity of the observed data (we denote this as DP-
GMM). The DP has hyper-parameters which we tune once
for all domains, we use a uniform base measure and a prior          5      Phase 2: Reward Learning
weight of 0.1.                                                      After the sequence learning phase, each demonstration is
                                                                    partitioned into k segments. The reward learning phase
4.3      GMM-based Segmentation                                     uses the learned [ρ1 , ..., ρk ] to construct the local rewards
As an instance of the general framework, we use                     [R1 , ..., Rk ] for the task. Each Ri is a quadratic cost
Gaussian Mixture Models to segment demonstrations in our            parametrized by a positive semi-definite matrix Q. The
experiments. This technique is quite general and applies to a       Algorithm is summarized in below in Phase 2.
large class of linear and non-linear systems.
   A popular approach for transition identification is to use       5.1       Primer on Maximum Entropy Inverse
Gaussian Mixture Models (Calinon 2014), namely, cluster                       Reinforcement Learning
all state observations and identify times at which xt is in a
different cluster than xt+1 . For a given time t, we can define     To fit the local rewards, we apply Maximum Entropy Inverse
a window of length ` as:                                            Reinforcement Learning (MaxEnt-IRL) (Ziebart et al. 2008).
                                                                    The goal of MaxEnt-IRL is to find a reward function such
                            (`)                                     that an optimal policy w.r.t that reward function is close to
                           nt = [xt−` , ..., xt ]|
                                                                    the expert demonstration. In the MaxEnt-IRL model, “close”
Then, for each demonstration trajectory we can also generate        is defined as matching the first moment of the expert feature
a trajectory of Ti − ` windowed states:                             distribution:
                                                                                                        N
                            (`)      (`)      (`)
                                                                                                   1 XX
                           di = [n` , ..., nTi ]                                         γexpert =         xi ,
                                                                                                   Z
                                                                                                            d∈D i=1
Over the entire set of windowed demonstrations, we collect a        where Z is an appropriate normalization constant (total
                        (`)
dataset of all of the nt vectors. We fit GMM model to these         number of states in all demonstrations). MaxEnt-IRL uses
vectors. The GMM model defines m multivariate Gaussian              the following linear parametrized representation:
                                                            (`)
distributions and a probability that each observation nt is
sampled from each of the m distributions. We annotate each                                           R(x) = xT θ ,
observation with the most likely mixture component. Times
            (`)      (`)
such that nt and nt+1 have different most likely components         where x is a feature vector representing the state of the
are marked as transitions. This has the interpretation of fitting   system. The agent is modeled as nosily optimal, where it
a locally linear regression to the data (refer to (Moldovan         takes actions from a policy π:
et al. 2015; Khansari-Zadeh and Billard 2011; Kruger et al.
2010; Krishnan* et al. 2015; Murali* et al. 2016) for details).                           π(a | s, θ ) ∝ exp{Aθ (s, a)}.
   If the system’s local dynamics are non-linear or
discontinuous, we can smooth out the dynamics with a                Aθ is the advantage function (Q function minus the Value
kernel embedding of the trajectories. The basic idea is             function) for the reward parameterized by θ . The objective
to apply Kernelized PCA to the features before learning             is to maximize the log-likelihood that the demonstration
the transitions–a technique used in Computer Vision (Mika           trajectories were generated by θ .

Prepared using sagej.cls

6                                                                                                                     Journal Title XX(X)

  Under the exponential distribution model, it can be shown             Algorithm 2: Reward Learning
that the gradient for this likelihood optimization is:                   Data: Demonstration D and sub-goals [ρ1 , ..., ρk ]
                           ∂L                                       1    Based on the transition states, segment each demonstration di
                              = γexpert − γθ ,                            into k sub-sequences where the jth is denoted by di [ j].
                           dθ
                                                                    2    Apply MaxEnt-IRL or Equation 1 to each set of sub-sequences
where γθ is the first moment of the feature distribution of an            1...k.
optimal policy under θ .                                                 Result: Rseq
   SWIRL applies MaxEnt-IRL to each segment of the task
but with a small modification to learn quadratic rewards
instead of linear ones. Let µi be the centroid of the next          applies MaxEnt-IRL to the sub-sequences of demonstrations
transition region. We want to learn a reward function of the        between 0 and ρ1 , and then from ρ1 to ρ2 and so on. The
form:                                                               result is an estimated local reward function Ri modeled as a
                Ri (x) = −(x − µi )T Q(x − µi ).                    linear function of states that is associated with each ρi .
for a positive semi-definite Q (negated since this is a negative    5.2.3 Model-free: Local Quadratic Rewards Sometimes
quadratic cost). With some re-parametrization (dropping µi          estimating the local dynamics can be unreliable if there
for convenience and without loss of generality), this reward        isn’t sufficient demonstration data. As a baseline, we also
function can be written as:                                         considered a much simpler reward learning approach that
                                  d X
                                  X d                               just estimates the covariance in each feature. Interestingly
                     Ri (x) = −             qi j x[ j]x[l].         enough, this approach worked reasonably well empirically
                                  j=1 l=1                           in many problems.
                                                                       The role of the reward function is to guide the robot to the
which is linear in the feature-space y = x[ j]x[l]:                 next transition region ρi . A straight forward thing approach
                               Ri (x) = θ T y.                      is for each segment i, we can define a reward function as
                                                                    follows:
5.2      Two Inference Settings: Discrete and                                            Ri (x) = −kx − µi k22 ,
         Continuous                                                 which is just the Euclidean distance to the centroid.
In MaxEnt-IRL gradient can be estimated reliably in two               A problem with using Euclidean distance directly is that it
cases, discrete and linear-gaussian systems, since it requires      uniformly penalizes disagreement with µ in all dimensions.
an efficient forward search of the policy given a particular        During different stages of a task, some features will likely
reward parametrized by θ . In both these cases, we have to          naturally vary more than others–this is learned through
estimate the system dynamics within each segment.                   IRL. To account for this, we derive a reasonable Q that is
                                                                    independent of the dynamics:
5.2.1    Discrete Consider the case when the state-space is
discrete (with cardinality N) and the action-space is discrete.                                Q[ j, l] = Σ−1
                                                                                                           x ,
To estimate the dynamics, we construct an N × N matrix
of zeros for each action and each of the components of              which is the inverse of the covariance matrix of all of the
this matrix corresponds to the transition probability of a          state vectors in the segment:
pair of states. For each, (s, a, s0 ) observation in the segment,                                        end
we increment (+1) the appropriate element of the matrix.
                                                                                                         X
                                                                                          Q[ j, l] = (         xxT )−1 ,              (1)
Finally, we normalize the elements to sum to one across the                                          t=start
set of actions. An additional optimization could be to add
smoothing to this estimate (i.e., initialize the matrix with        which is a p × p matrix defined as the covariance of all of
some non-zero constant value), we found that this was not           the states in the segment i − 1 to i. Intuitively, if a feature has
necessary on the sparse domains in our experiments. The             low variance during this segment, deviation in that feature
result is an estimate for the P(s0 | s, a). Given this estimate,    from the desired target it gets penalized. This is exactly the
γθ can be efficiently calculated with the forward-backward          Mahalonabis distance to the next transition.
technique described in (Ziebart et al. 2008).                          For example, suppose one of the features j measures the
                                                                    distance to a reference trajectory ut . Further, suppose in
5.2.2  Linear The discrete model is difficult to scale to           step one of the task the demonstrator’s actions are perfectly
continuous state-spaces. If we discretize, the number of            correlated with the trajectory (Qi [ j, j] is low where variance
bins required would be exponential in the dimensionality.           is in the distance) and in step two the actions are uncorrelated
However, linear models are another class of dynamics                with the reference trajectory (Qi [ j, j] is high). Thus, Q will
models for which the estimation and inference is tractable.         respectively penalize deviation from µi [ j] more in step one
We can fit local linear models to each of the segments              than in step two.
discovered in the previous section:
                               N seg
                               X   X j end
                                                    (i)       (i)
                                                                    6      Phase 3: Policy Learning
              A j = arg min                      kAxt − xt+1 k
                           A                                        SWIRL uses the learned transitions [ρ1 , ..., ρk ] and Rseq as
                               i=1 seg j start
                                                                    rewards for a Reinforcement Learning algorithm. In this
With A j known, γθ can be analytically solved with                  section, we describe learning a policy π given rewards Rseq
techniques proposed in (Ziebart et al. 2012). SWIRL                 and an ordered sequence of transitions G. However, this

Prepared using sagej.cls

7

problem is not trivial since solving k independent problems         lifted space, the problem is a fully observed MDP. Then, the
neglects potential shared value structure between the local         additional complexity of representing the reward with history
problems (e.g., a common failure state). Furthermore,               over S × [k] is only O(k) instead of exponential in the time
simply taking the aggregate of the rewards can lead to              horizon.
inconsistencies since there is nothing enforcing the order
of operations. We show that a single policy can be learned          6.3    Segmented Q-Learning
jointly over all segments over a modified problem where the         At a high-level, the objective of standard Q-Learning is to
state-space with additional variables that keep track of the        learn the function Q(s, a) of the optimal policy, which is the
previously achieved segments.                                       expected reward the agent will receive taking action a in state
                                                                    s, assuming future behavior is optimal. Q-Learning works
6.1      Off Policy RL Algorithms                                   by first initializing a random Q function. Then, it samples
There are two classes of RL algorithms, on-policy algorithms        rollouts from an exploration policy collecting (s, a, r, s0 )
(e.g., Policy Gradients, Trust Region Policy Optimization)          tuples. From these tuples, one can calculate the following
and off-policy algorithms (e.g., Q-Learning). An on-policy          value:
algorithm learns the value of the policy being carried out by                         yi = R(s, a) + arg max Q(s0 , a)
                                                                                                           a
the agent and incrementally optimizes this policy. On policy
are often more efficient since the robot learns to optimize the     Each of the yi can be used to define a loss function since if
reward function in states that it is likely to visit, however,      Q were the true Q function, then the following recurrence
it requires that exploration is done with a specific policy         would hold:
that is continuously updated. On the other hand, off-policy
                                                                                  Q(s, a) = R(s, a) + arg max Q(s0 , a)
algorithms learn the value of the optimal policy regardless                                                    a
of the policy used to collect the data, as long the robot
                                                                    So, Q-Learning defines a loss:
sufficiently explores the space. This is highly beneficial for
our problem setting. A single fixed exploration policy can be                               X
                                                                                   L(Q) =       kyi − Q(s, a)k22
used to collect a large batch of data up front, which we can
                                                                                                  i
use to refine our model. This is the motivation for using a
Q-Learning approach in SWIRL.                                       This loss can be optimized with gradient descent. When
                                                                    the state and action space is discrete, the representation
6.2      Segmentation Introduces Memory                             of the Q function is a table, and we get the familiar Q-
                                                                    Learning algorithm–where each gradient step updates the
In our sequential task definition, we cannot transition to
                                                                    table with the appropriate value. When Q function needs
reward Ri+1 unless all previous transition regions ρ1 , ...ρi are
                                                                    to be approximated, then we get the Deep Q Network
reached in sequence. This introduces a dependence on the
                                                                    algorithm.
history which violates the MDP structure.
                                                                       SWIRL applies a variant of Q-Learning to optimize the
   Naively addressing this problem can lead to an exponential
                                                                    policy over the sequential rewards. This is summarized
cost in the state-representation. Given a finite-horizon MDP
                                                                    in Algorithm 3. The basic change to the algorithm is to
M as defined in Section 3, we can define an MDP MH
                                                                    augment the state-space with indicator vector that indicates
as follows. Let H denote set of all dynamically feasible
                                                                    the transition regions that have been reached. So each of the
sequences of length smaller than T comprised of the
                                                                    rollouts, now records a tuple (s, v, a, r, s0 , v’) that additionally
elements of S. Therefore, for an agent at any time t, there
                                                                    stores this information. The Q function is now defined over
is a sequence of previously visited states Ht ∈ H. The MDP
                                                                    states, actions, and segment index–which also selects the
MH is defined as:
                                                                    appropriate local reward function:
                MH = hS × H, A, P0 (·, ·), R(·, ·), T i.
                                                                               Q(s, a, v) = Rv (s, a) + arg max Q(s0 , a, v0 )
                                                                                                               a
For this MDP,        P0
                     not only defines the transitions from the
current state s 7→ s0 , but also increments the history sequence    We also need to define an exploration policy, i.e., a stochastic
Ht+1 = Ht ts. Accordingly, the parametrized reward function         policy with which we will collect rollouts. To initialize the
R is defined over S, A, and Ht+1 . MH allows us to address          Q-Learning, we apply Behavioral Cloning locally for each
the sequentiality problem since the reward is a function of         of the segments to get a policy πi . We apply an ε-greedy
the state and the history sequence. However, without some           version of these policies to collect rollouts.
parametrization of Ht , directly solving this MDPs with RL is       Remarks: Initializing with a Behavioral Cloning policy
impractical since it adds an overhead of O(eT ) states.             is not strictly necessary, and a random initialization would
   We can leverage the definition of the Markov Segmen-             suffice. In practice, we found that this was much more
tation function formalized earlier to avoid this exponential        efficient on problems where the difference between the
complexity. We know that the reward transitions (Ri to Ri+1 )       demonstration domain and execution domain was small.
only depend on an arrival at the transition state ρi and not
any other aspect of the history. Therefore, we can store an
index v, that indicates whether a transition state i ∈ 0, ..., k    7     Experiments
has been reached. This index can be efficiently incremented         We evaluate SWIRL on two standard RL benchmarks and in
when the current state s ∈ ρi+1 . The result is an augmented       deformable cutting and tensioning on the da Vinci surgical
state-space vs to account for previous progress. In this            robot.

Prepared using sagej.cls

8 Journal Title XX(X)

Algorithm 3: Policy Learning
Data: Transition States G, Reward Sequence Rseq , exploration
policy π
1 Initialize Q( vs , a) randomly
2 foreach iter ∈ 0, ..., I do
3 Draw s0 from initial conditions
4 Initialize v to be [0, ..., 0]
5 Initialize j to be 1
6 foreach t ∈ 0, ..., T do
7 Choose best action a based on π.
8 Observe Reward R j
9 Update state to s0 and Q via Q-Learning update
10 If s0 is ∈ ρ j update v[ j] = 1 and j = j + 1
Result: Policy π Figure 3. For a fixed number of demonstrations 5, we vary the
number of rollouts and measure the average reward at each
rollout. (QL) denotes Q-learning, (SVM) denotes a baseline of
behavioral cloning with a SVM policy representation, (IRL-E)
denotes MaxEnt-IRL with estimated dynamics, (IRL-G) denotes
MaxEnt-IRL with ground truth dynamics, (SWIRL-E) denotes
SWIRL with local MaxEnt-IRL and estimated dynamics,
(SWIRL-G) denotes SWIRL with local MaxEnt-IRL and ground
truth dynamics, and (SWIRL-MF) denotes the model-free
version of SWIRL. SWIRL achieves the same reward as QL with
15% of the rollouts, and the same reward as IRL with 66% of
Figure 1. (A) Simulated control task with a car with noisy the rollouts.
non-holonomic dynamics. The car (A1 ) is controlled by
accelerating and turning in discrete increments. The task is to
park the car between two obstacles. hyperparameters k = 5, σ = 0.1 respectively. The radial basis
function hyper-parameters were tuned manually to achieve
the fastest convergence in the experimental task.
Behavioral Cloning (SVM): We generated N demonstra-
tions using an RRT motion planner (assuming deterministic
dynamics). The next baseline is to directly learn a policy
from the generated plans using behavioral cloning. We use
an L1 hinge-loss SVM with L2 regularization α = 5e − 3
Figure 2. (Left) the 5 demonstration trajectories for the parallel
parking task, and (Right) the sub-goals learned by SWIRL.
to predict the action from the state. The hyper-parameters
There are two intermediate goals corresponding to positioning were tuned manually using cross-validation by holding out
the car and orienting the car correctly before reversing. trajectories.
Single-Step IRL (MaxEnt-IRL): We generated N demon-
strations using an RRT motion planner (assuming determin-
7.1 Fully Observed Parallel Parking istic dynamics). We use the collected demonstrations and
We constructed a parallel parking scenario for a robot car infer a quadratic reward function using MaxEnt-IRL (both
with non-holonomic dynamics and two obstacles (Figure 1a). using estimated dynamics and ground truth dynamics). The
The car can accelerate or decelerate in discrete ±0.1 meters learned reward function is optimized using Q-learning with
per second increments (the car can reverse), and change its a radial basis function representation with the same hyper-
heading by 5◦ degree increments. The car’s speed (kẋk+kẏk) parameters as the RL approach.
and heading (θ ) are inputs to a bicycle steering model which SWIRL: Finally, we apply SWIRL to the N demonstrations,
computes the next state. The car observes its x position, y learn segmentation, and quadratic rewards (Figure 2). We
position, orientation, and speed in a global coordinate frame. apply SWIRL with a DP-GMM based segmentation step
The car’s dynamics are noisy and with probability 0.1 will with no kernel transformation (as described in Section
randomly add or subtract 2.5◦ degrees to the steering angle. 4.3). For the local IRL approach, we consider three
If the car parks between the obstacles, i.e., 0 speed within approaches: MaxEnt with ground truth dynamics, MaxEnt
a 15◦ tolerance and a positional tolerance of 5 meters, the with locally estimated dynamics, Model-Free. The learned
task is a success and the car receives a reward of 1. If the reward functions and transition regions are used in the policy
car collides with one of the obstacles or does not park in 200 learning phase with Q-learning with a radial basis function
timesteps, the episode ends with a reward of 0. representation with the same hyper-parameters as the RL
We call this domain Parallel Parking with Full Observation approach.
(PP-FO). We consider the following approaches:
RL (Q-Learning): The baseline approach is modeling the 7.1.1 Fixed Demonstrations, Varying Rollouts In the first
entire problem as an MDP with the sparse delayed reward. experiment, we fix the number of initial demonstrations
We apply Q-Learning to learn a policy for this problem with N = 5, and vary the number of rollouts (Figure 3). The
a radial basis function representation for the Q-function with baseline line Q-Learning approach (QL) is very slow because

Prepared using sagej.cls

it relies on random exploration to achieve the goal at least
once before it can start estimating the value of states and
actions. However, given enough exploration (1250 rollouts),
Q-Learning converges to a solution with a 95% success rate.
In this problem, there will always be some failure cases
to the noise in the system. We collect five demonstrations
and directly learn a policy with an SVM. This policy has
a very poor success rate of 13%. Q-Learning and the SVM
define two extremes, no demonstrations, and no rollouts,
respectively.
Next, we consider combinations of rollouts and demon-
strations. We apply MaxEnt-IRL to the five demonstrations
and learn reward functions. Since the MaxEnt-IRL inference
Figure 4. For 500 rollouts, we vary the number of
procedure requires a dynamics model, we consider two
demonstration trajectories given to each technique. (QL)
variants: (1) estimate the dynamics from the demonstrations, denotes Q-learning, (SVM) denotes a baseline of behavioral
and (2) use the known dynamics model of the car directly. cloning with a SVM policy representation, (IRL-E) denotes
We found that both IRL methods surpassed the SVM policy MaxEnt-IRL with estimated dynamics, (IRL-G) denotes
after only 250 rollouts, and attained the same final reward MaxEnt-IRL with ground truth dynamics, (SWIRL-E) denotes
as Q-Learning in 250 fewer rollouts. Surprisingly, we found SWIRL with local MaxEnt-IRL and estimated dynamics,
that there was little difference between using the estimated (SWIRL-G) denotes SWIRL with local MaxEnt-IRL and ground
truth dynamics, and (SWIRL-MF) denotes the model-free
dynamics model and the ground truth model.
version of SWIRL. SWIRL is less sensitive to the number of
Finally, we considered three variants of SWIRL. (SWIRL- demonstrations observed than the SVM. With only 5
E) is SWIRL with local MaxEnt-IRL and estimated demonstrations, SWIRL is within 10% of its reward if it observed
dynamics, (SWIRL-G) is SWIRL with local MaxEnt-IRL 100 demonstrations.
and ground truth dynamics, and (SWIRL-MF) denotes the
model-free version of SWIRL. SWIRL achieves the same
reward as QL with 15% of the rollouts, and the same
reward as IRL with 66% of the rollouts. SWIRL learns
three segments for this task (Figure 2), and places quadratic
rewards that guide the car to each of these segments. There
are two intermediate goals corresponding to positioning the
car and orienting the car correctly before reversing. With a
single quadratic reward (as in IRL), the car has to learn to
make a sequence of actions that move away from the goal
(pulling up). In the segmented problem, the car can always
move monotonically towards each of the goals.

7.1.2 Fixed Rollouts, Varying Demonstrations Next, we
fix the number of rollouts to 500 and vary the number Figure 5. For 500 rollouts and 100 demonstrations, we
of demonstration trajectories each approach observes. measure the robustness of the approaches to changes in the
The baseline line Q-Learning approach (QL) takes no execution dynamics. (QL) denotes Q-learning, (SVM) denotes a
baseline of behavioral cloning with a SVM policy representation,
demonstrations and has a success rate of 17% after 500
(IRL-E) denotes MaxEnt-IRL with estimated dynamics, (IRL-G)
rollouts. The behavioral cloning approach (SVM) is sensitive denotes MaxEnt-IRL with ground truth dynamics, (SL-E)
to the number of demonstrations it observes. For five denotes SWIRL with local MaxEnt-IRL and estimated dynamics,
demonstrations, it achieves a success rate of only 13%. But (SL-G) denotes SWIRL with local MaxEnt-IRL and ground truth
if it observes 100 demonstrations it can achieve nearly the dynamics, and (SL-MF) denotes the model-free version of
maximum 95% success rate. SWIRL. While the SVM is 95% successful on the original
domain, its success does not transfer to the perturbed setting.
On the other hand, the IRL approaches and SWIRL are
On the other hand, SWIRL learns rewards and segments that
comparatively less sensitive–where they perform nearly as transfer to the new dynamics since they are state-space goals.
well with a small number of demonstrations as they do with
a larger data set. With only five demonstrations, SWIRL is
within 10% of its reward if it observed 100 demonstrations. demonstrations and learn to refine the initial policy through
In this task, the policy is more complex than the reward exploration.
function, which is just a quadratic. It potentially requires
much less data to estimate a quadratic function. 7.1.3 Varying Task Parameters We also explored how
The SVM approach does have the advantage that it doesn’t well the approaches handle transfer if the dynamics
require any further exploration. However, SWIRL and the change between demonstration and execution. We collect
SVM approach are not mutually exclusive. As we show in demonstrations N = 100 on the original task, and then used
our physical experiments, we can initialize Q-learning with the learned rewards or policies on a perturbed task. In the
a behavioral cloning policy. The combination of the two perturbed task, the system dynamics are coupled in a way
approaches allows us to take advantage of a small number of that turning right causes the car to accelerate forward by 0.05

Prepared using sagej.cls

10                                                                                                              Journal Title XX(X)

                                                                    (x, y, θ ). As before, if the car collides with one of the obstacle
                                                                    or does not park in 200 timesteps the episode ends. We call
                                                                    this domain Parallel Parking with Partial Observation (PP-
                                                                    PO).
                                                                       This form of partial observation creates an interesting
                                                                    challenge. There is no longer a stationary policy that can
                                                                    achieve the reward. During the reversing phase of parallel
                                                                    parking, the car does not know that it is currently reversing.
                                                                    So there is ambiguity in that state whether to pull up or
                                                                    reverse. We will see that segmentation can help disambiguate
                                                                    the action in this state.
                                                                       As before, we generated 5 demonstrations using an
                                                                    RRT motion planner (assuming deterministic dynamics) and
Figure 6. We hid the velocity state from the robot, so the robot
                                                                    applied each of the approaches. The techniques that model
only sees (x, y, θ ). For a fixed number of demonstrations 5, we
vary the number of rollouts and measure the average reward at       this problem with a single MDP all fail to converge. The
each rollout. (QL) denotes Q-learning, (SVM) denotes a              Q-Learning approach achieves some non-zero rewards by
baseline of behavioral cloning with a SVM policy representation,    chance. The learned segments in SWIRL help disambiguate
(IRL-E) denotes MaxEnt-IRL with estimated dynamics, (IRL-G)         dependence on history, since the segment indicator tells the
denotes MaxEnt-IRL with ground truth dynamics, (SWIRL-E)            car which stage of the task is currently active (pulling up or
denotes SWIRL with local MaxEnt-IRL and estimated dynamics,         reversing) After 250000 time-steps, the policy learned with
(SWIRL-G) denotes SWIRL with local MaxEnt-IRL and ground
                                                                    model-based SWIRL has a 95% success rate in comparison
truth dynamics, and (SWIRL-MF) denotes the model-free
version of SWIRL. SWIRL converges will the other approaches         to a

Figure 7. We plot the centroids of the learned segments in SWIRL to visualize how SWIRL is partitioning the task. Qualitatively,
SWIRL constructs evenly-spaced way points along the swing up trajectory.

Figure 8. For a fixed number of demonstrations 5, we vary the Figure 9. For a number of rollouts 3000, we vary the number of
number of rollouts and measure the average reward at each demonstration trajectories given to each technique. QL)
rollout. (QL) denotes Q-learning, (KSVM) denotes a baseline of denotes Q-learning, (KSVM) denotes a baseline of behavioral
behavioral cloning with a Kernel SVM policy representation, cloning with a Kernel SVM policy representation, (IRL) denotes
(IRL) denotes MaxEnt-IRL using linearized dynamics learned MaxEnt-IRL using linearized dynamics learned from the
from the demonstrations, (SWIRL) denotes SWIRL with local demonstrations, (SWIRL) denotes SWIRL with local
MaxEnt-IRL and estimated linear dynamics, and (SWIRL-MF) MaxEnt-IRL and estimated linear dynamics, and (SWIRL-MF)
denotes the model-free version of SWIRL. SWIRL converges denotes the model-free version of SWIRL. SWIRL is less
with 2000 fewer rollouts than Q-learning and IRL. senstive to the number of demonstrations observed than the
SVM. With only 15 demonstrations, SWIRL is able to achieve
the maximum reward. In comparison, the SVM requires 250
α = 5e − 3 to predict the action from the state. The hyper- demonstrations.
parameters were tuned manually using cross-validation by
holding out trajectories.
significantly faster than Q-Learning. This is because it mod-
Single-Step IRL (MaxEnt-IRL): We generated N els the reward function as a single quadratic, but there are
demonstrations using the Q-Learning baseline (i.e., run to multiple steps required to swing the pendulum up. The single
convergence and sample from the learned policy). We use quadratic potentially misleads learner in early episodes.
the collected demonstrations and infer a quadratic reward Finally, we see that SWIRL converges with 2000 fewer
function using MaxEnt-IRL. In the acrobot, we only use rollouts than Q-learning and IRL. The model-free method
estimated dynamics because the underlying system is non- converges with 1000 more rollouts. This experiment suggests
linear. The estimated dynamics are a linearization. The that SWIRL is applicable to certain types of non-linear
learned reward function is optimized using Q-learning with systems. We defer a more formal study of this problem to
a radial basis function representation with the same hyper- future work.
parameters as the RL approach.
SWIRL: Finally, we apply SWIRL to the N demonstrations, 7.2.2 Varying Demonstrations, Fixed Rollouts Next, we
learn segmentation, and quadratic rewards (Figure 2). We fix the number of rollouts to 3000, and vary the number of
apply SWIRL with a DP-GMM based segmentation step demonstration trajectories each approach observes (Figure
with a kernel transformation σ = 0.1 (as described in 9). This task is substantially harder to learn than the
Section 4.3). For the local IRL approach, we consider parallel parking task. More demonstration data is required
two approaches: MaxEnt with locally estimated dynamics, to learn the segments, rewards, and policies. The basline
Model-Free. The learned reward functions and transition line Q-Learning approach (QL) takes no demonstrations
regions are used in the policy learning phase with Q-learning and has a success rate of 58% after 3000 rollouts. As
with a radial basis function representation with the same before, the behavioral cloning approach (KSVM) is sensitive
hyper-parameters as the RL approach. to the number of demonstrations it observes. For 50
demonstrations, it acheives a success rate of 0%. It requires
7.2.1 Fixed Demonstrations, Varying Rollouts We gen- 250 demonstrations to have a 100% success rate. Again,
erated N = 15 demonstrations for the Acrobot task and the IRL approaches and SWIRL are less sensitive–where
compared the different approaches (Figure 8). The baseline they perform nearly as well with a small number of
behavioral cloning policy with a kernel svm failed to suc- demonstrations as they do with a larger dataset. With only
ceed. Q-learning required 5000 rollouts to acheive a policy 15 demonstrations, SWIRL is able to achieve the maximum
that was successful 100% of the time. IRL did not converge reward.

Prepared using sagej.cls

12 Journal Title XX(X)

7.3 Physical Experiments with the da Vinci
Surgical Robot
In the next set of experiments, we evaluate SWIRL on
two tasks on the da Vinci Surgical Robot. The da Vinci
Research Kit is a surgical robot originally designed for tele-
operation, and we consider autonomous execution of surgical
subtasks. Based on a chessboard calibration, we found that
the robot has an RMSE kinematic error of 3.5 mm, and thus,
requires feedback from vision for accurate manipulation. In
our robotic setup, there is an overhead endoscopic stereo
camera that can be used to find visual features for learning,
and it is located 650mm above the workspace. This camera
Figure 10. For a number of rollouts 3000 and 250 is registered to the workspace with a RSME calibration error
demonstrations, we measure the transfer as a function of of 2.2 mm.
varying the link size. (QL) denotes Q-learning, (KSVM) denotes
a baseline of behavioral cloning with a Kernel SVM policy 7.3.1 Deformable Sheet Tensioning: In the first experi-
representation, (IRL) denotes MaxEnt-IRL using linearized ment, we consider the task of deformable sheet tensioning.
dynamics learned from the demonstrations, (SWIRL) denotes The experimental setup is pictured in Figure 11. A sheet
SWIRL with local MaxEnt-IRL and estimated linear dynamics, of surgical gauze is fixtured at the two far corners using a
and (SWIRL-MF) denotes the model-free version of SWIRL.
pair of clips. The unclipped part of the gauze is allowed to
The KSVM policy fails as soon the link size is changed. SWIRL
is robust until the change becomes very large. rest on soft silicone padding. The robot’s task is to reach
for the unclipped part, grasp it, lift the gauze, and tension
the sheet to be as planar as possible. An open-loop policy
typically fails on this task because it requires some feedback
of whether gauze is properly grasped, how the gauze has
deformed after grasping, and visual feedback of whether the
gauze is planar. The task is sequential as some grasps pick
up more or less of the material and the flattening procedure
has to be accordingly modified.
The state-space is the 6 DoF end-effector position of the
robot, the current load on the wrist of the robot, and a visual
feature measuring the flatness of the gauze. This is done by a
set of fiducial markers on the gauze which are segmented
Figure 11. A sheet of surgical gauze is fixtured at the two far by color using the stereo camera. Then, we correspond
corners using a pair of clips. The unclipped part of the gauze is
the segmented contours and estimate a z position for each
allowed to rest on soft silicone padding. The robot’s task is to
reach for the unclipped part, grasp it, lift the gauze, and tension marker (relative to the horizontal plane). The variance in
the sheet to be as planar as possible. An open-loop policy the z position is a proxy for flatness and we include this
typically fails on this task because it requires some feedback of as a feature for learning (we call this disparity). The action
whether gauze is properly grasped, how the gauze has space is discretized into an 8 dimensional vector (±x, ±y,
deformed after grasping, and visual feedback of whether the ±z, open/close gripper) where the robot moves in 2mm
gauze is planar.The fiducial markers used to track the gauze are increments.
seen in red.
We provided 15 demonstrations through a keyboard-based
tele-operation interface. The average length of the demon-
strations was 48.4 actions (although we sampled observa-
7.2.3 Varying Task Parameters As in the parallel parking tions at a higher frequency about 10 observations for every
scenario, we evaluate how the different approaches handle action). From these 15 demonstrations, SWIRL identifies
transfer if the dynamics change between demonstration and four segments. Figure 12 illustrates the segmentation of a
execution. With N = 250 demonstrations, we learn the representative demonstration with important states plotted
rewards, policies, and segments on the standard pendulum, over time. One of the segments corresponds to moving to
and then during learning, we vary the size of the second link the correct grasping position, one corresponds to making the
in the pendulum. We plot the success rate (after a fixed 3000 grasp, one lifting the gauze up again, and one corresponds
rollouts) as a function of the increasing link size (Figure 10). to straightening the gauze. One of the interesting aspects of
As the link size increases the even the basline Q-learning this task is that the segmentation requires multiple features.
becomes less successful. This is because the system becomes Figure 12 plots three signals (current load, disparity, and
more unstable and it is harder to learn a policy. The z position), and segmenting any single signal may miss an
behavioral cloning SVM policy immediately fails as the link important feature.
size is increased. IRL is more robust but does not offer much Then, we tried to learn a policy from the rewards
of an advantage in this problem. SWIRL is robust until the constructed by SWIRL. In this experiment, we initialized
change in the link size becomes large. This is because for the the policy learning phase of SWIRL with the Behavioral
larger link size, SWIRL might require different segments (or Cloning policy. We define a Q-Network with a single-layer
one of the learned segments in unreachable). Multi-Layer Perceptron with 32 hidden units and sigmoid

Prepared using sagej.cls

You can also read