Reset-Free Reinforcement Learning via Multi-Task Learning: Learning Dexterous Manipulation Behaviors without Human Intervention

Abhishek Gupta∗1   Justin Yu∗1   Tony Z. Zhao∗1   Vikash Kumar∗2   Aaron Rovinsky1   Kelvin Xu1   Thomas Devlin1   Sergey Levine1
1 UC Berkeley   2 University of Washington

arXiv:2104.11203v1 [cs.LG] 22 Apr 2021

Abstract— Reinforcement Learning (RL) algorithms can in principle acquire complex robotic skills by learning from large amounts of data in the real world, collected via trial and error. However, most RL algorithms use a carefully engineered setup in order to collect data, requiring human supervision and intervention to provide episodic resets. This is particularly evident in challenging robotics problems, such as dexterous manipulation. To make data collection scalable, such applications require reset-free algorithms that are able to learn autonomously, without explicit instrumentation or human intervention. Most prior work in this area handles single-task learning. However, we might also want robots that can perform large repertoires of skills. At first, this would appear to only make the problem harder. However, the key observation we make in this work is that an appropriately chosen multi-task RL setting actually alleviates the reset-free learning challenge, with minimal additional machinery required. In effect, solving a multi-task problem can directly solve the reset-free problem, since different combinations of tasks can serve to perform resets for other tasks. By learning multiple tasks together and appropriately sequencing them, we can effectively learn all of the tasks together reset-free. This type of multi-task learning can effectively scale reset-free learning schemes to much more complex problems, as we demonstrate in our experiments. We propose a simple scheme for multi-task learning that tackles the reset-free learning problem, and show its effectiveness at learning to solve complex dexterous manipulation tasks in both hardware and simulation without any explicit resets. This work shows the ability to learn dexterous manipulation behaviors in the real world with RL without any human intervention.

Fig. 1: Reset-free learning of dexterous manipulation behaviors by leveraging multi-task learning. When multiple tasks are learned together, different tasks can serve to reset each other, allowing for uninterrupted continuous learning of all of the tasks. This allows for the learning of dexterous manipulation tasks like in-hand manipulation and pipe insertion with a 4-fingered robotic hand, without any human intervention, with over 60 hours of uninterrupted training.

I. INTRODUCTION

Reinforcement learning algorithms have shown promise in enabling robotic tasks in simulation [1], [2], and even some tasks in the real world [3], [4]. RL algorithms in principle can learn generalizable behaviors with minimal manual engineering, simply by collecting their own data via trial and error. This approach works particularly well in simulation, where data is cheap and abundant. Success in the real world has been restricted to settings where significant environment instrumentation and engineering is available to enable autonomous reward calculation and episodic resets [5], [6], [7], [8]. To fully realize the promise of robotic RL in the real world, we need to be able to learn even in the absence of environment instrumentation.

In this work, we focus specifically on reset-free learning for dexterous manipulation, which presents an especially clear lens on the reset-free learning problem. For instance, a dexterous hand performing in-hand manipulation, as shown in Figure 1 (right), must delicately balance the forces on the object to keep it in position. Early on in training, the policy will frequently drop the object, necessitating a particularly complex reset procedure. Prior work has addressed this manually by having a human involved in the training process [9], [10], [11], instrumenting a reset in the environment [12], [13], or even by programming a separate robot to place the object back in the hand [7]. Though some prior techniques have sought to learn some form of a "reset" controller [14], [15], [16], [8], [17], [18], none of these are able to successfully scale to solve a complex dexterous manipulation problem without hand-designed reset systems, due to the challenge of learning robust reset behaviors.

However, general-purpose robots deployed in real-world settings will likely be tasked with performing many different behaviors. While multi-task learning algorithms in these settings have typically been studied in the context of improving sample efficiency and generalization [19], [20], [21], [22], [23], [24], in this work we make the observation that multi-task algorithms naturally lend themselves to the reset-free learning problem.

∗ Authors contributed equally to this work.
https://sites.google.com/view/mtrf
We hypothesize that the reset-free RL problem can be addressed by reformulating it as a multi-task problem, and appropriately sequencing the tasks commanded and learned during online reinforcement learning. As outlined in Figure 6, solving a collection of tasks simultaneously presents the possibility of using some tasks as a reset for others, thereby removing the need for explicit per-task resets. For instance, if we consider the problem of learning a variety of dexterous hand manipulation behaviors, such as in-hand reorientation, then learning and executing behaviors such as recenter and pickup can naturally reset the other tasks in the event of failure (as we describe in Section IV and Figure 6). We show that by learning multiple different tasks simultaneously and appropriately sequencing behavior across different tasks, we can learn all of the tasks without requiring any episodic resets at all. This allows us to effectively learn a "network" of reset behaviors, each of which is easier than learning a complete reset controller, but which together can execute and learn more complex behavior.

The main contribution of this work is to propose a learning system that can learn dexterous manipulation behaviors without the need for episodic resets. We do so by leveraging the paradigm of multi-task reinforcement learning to make the reset-free problem less challenging. The system accepts a diverse set of tasks that are to be learned, and then trains reset-free, leveraging progress in some tasks to provide resets for other tasks. To validate this algorithm for reset-free robotic learning, we perform both simulated and hardware experiments on a dexterous manipulation system with a four-fingered anthropomorphic robotic hand. To our knowledge, these results demonstrate the first instance of a combined hand-arm system learning dexterous in-hand manipulation with deep RL entirely in the real world with minimal human intervention during training, simultaneously acquiring both the in-hand manipulation skill and the skills needed to retry the task. We also show the ability of this system to learn other dexterous manipulation behaviors like pipe insertion via uninterrupted real-world training, as well as several tasks in simulation.

II. RELATED WORK

RL algorithms have been applied to a variety of robotics problems in simulation [25], [1], [26], [27], and have also seen application to real-world problems, such as locomotion [28], [29], [30], grasping [6], [31], [32], [33], manipulation of articulated objects [34], [35], [13], [36], [37], and even dexterous manipulation [38], [10]. Several prior works [39] have shown how to acquire dexterous manipulation behaviors with optimization [40], [41], [42], [43], [44], reinforcement learning in simulation [1], [45], [46], and even in the real world [47], [9], [48], [7], [15], [12], [10], [49], [50], [51]. These techniques have leaned on highly instrumented setups to provide episodic resets and rewards. For instance, prior work uses a scripted second arm [7] or separate servo motors [12] to perform resets. Contrary to these, our work focuses on removing the need for explicit environment resets, by leveraging multi-task learning.

Our work is certainly not the first to consider the problem of reset-free RL [52]. [14] proposes a scheme that interleaves attempts at the task with episodes from an explicitly learned reset controller, trained to reach the initial state. Building on this work, [8] shows how to learn simple dexterous manipulation tasks without instrumentation using a perturbation controller exploring for novelty instead of a reset controller. [53], [54] demonstrate learning of multi-stage tasks by progressively sequencing a chain of forward and backward controllers. Perhaps the most closely related work to ours algorithmically is the framework proposed in [28], where the agent learns locomotion by leveraging multi-task behavior. However, this work studies tasks with cyclic dependencies specifically tailored towards the locomotion domain. Our work shows that having a variety of tasks and learning them all together via multi-task RL can allow solutions to challenging reset-free problems in dexterous manipulation domains.

Our work builds on the framework of multi-task learning [19], [20], [21], [22], [23], [24], [55], [56], which has been leveraged to learn a collection of behaviors, improve generalization as well as sample efficiency, and has even been applied to real-world robotics problems [57], [58], [59]. In this work, we take a different view on multi-task RL, as a means to solve reset-free learning problems.

III. PRELIMINARIES

We build on the framework of Markov decision processes for reinforcement learning. We refer the reader to [60] for a more detailed overview. RL algorithms consider the problem of learning a policy π(a|s) such that the expected sum of rewards R(s_t, a_t) obtained under such a policy is maximized when starting from an initial state distribution µ_0 and dynamics P(s_{t+1}|s_t, a_t). This objective is given by:

J(\pi) = \mathbb{E}_{s_0 \sim \mu_0,\, a_t \sim \pi(a_t|s_t),\, s_{t+1} \sim P(s_{t+1}|s_t, a_t)} \left[ \sum_{t=0}^{T} \gamma^t R(s_t, a_t) \right]    (1)

While many algorithms exist to optimize this objective, in this work we build on the framework of actor-critic algorithms [61]. Although we build on the actor-critic framework, we emphasize that our framework can be effectively used with many standard reinforcement learning algorithms with minimal modifications.

As we note in the following section, we address the reset-free RL problem via multi-task RL. Multi-task RL attempts to learn multiple tasks simultaneously. Under this setting, each of K tasks involves a separate reward function R_i, a different initial state distribution µ_0^i, and a potentially different optimal policy π_i. Given a distribution over the tasks p(i), the multi-task problem can then be described as

J(\pi_0, \dots, \pi_{K-1}) = \mathbb{E}_{i \sim p(i)}\, \mathbb{E}_{s_0 \sim \mu_0^i,\, a_t \sim \pi_i(a_t|s_t),\, s_{t+1} \sim P(s_{t+1}|s_t, a_t)} \left[ \sum_{t} \gamma^t R_i(s_t, a_t) \right]    (2)
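To make objective (2) concrete, the following minimal sketch estimates it with Monte Carlo rollouts: sample a task i ~ p(i), roll out π_i from µ_0^i, and average the discounted returns. The gym-style environment and policy interfaces used here are illustrative assumptions for exposition, not a description of our actual implementation.

    import random
    from typing import Callable, Sequence

    # Minimal sketch of a Monte Carlo estimate of the multi-task objective in Eq. (2).
    # The environment/policy interfaces below are hypothetical, gym-style stand-ins.
    def estimate_multitask_objective(
        envs: Sequence,                  # envs[i].reset() samples s0 ~ mu_0^i; envs[i].step(a) -> (s, r, done, info)
        policies: Sequence[Callable],    # policies[i](s) samples a ~ pi_i(a|s)
        task_probs: Sequence[float],     # p(i), a distribution over the K tasks
        gamma: float = 0.99,
        horizon: int = 200,
        num_samples: int = 100,
    ) -> float:
        """Average discounted return over sampled tasks i ~ p(i) and rollouts of pi_i."""
        total = 0.0
        for _ in range(num_samples):
            i = random.choices(range(len(envs)), weights=task_probs, k=1)[0]
            s = envs[i].reset()                      # s0 ~ mu_0^i
            ret, discount = 0.0, 1.0
            for _ in range(horizon):
                a = policies[i](s)                   # a_t ~ pi_i(a_t | s_t)
                s, r, done, _ = envs[i].step(a)      # s_{t+1} ~ P(. | s_t, a_t), r = R_i(s_t, a_t)
                ret += discount * r
                discount *= gamma
                if done:
                    break
            total += ret
        return total / num_samples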
In the following section, we will discuss how viewing reset-free learning through the lens of multi-task learning can naturally address the challenges in reset-free RL.

IV. LEARNING DEXTEROUS MANIPULATION BEHAVIORS RESET-FREE VIA MULTI-TASK RL

One of the main advantages of dexterous robots is their ability to perform a wide range of different tasks. Indeed, we might imagine that a real-world dexterous robotic system deployed in a realistic environment, such as a home or office, would likely need to perform a repertoire of different behaviors, rather than just a single skill. While this may at first seem like it would only make the problem of learning without resets more difficult, the key observation we make in this work is that the multi-task setting can actually facilitate reset-free learning without manually provided instrumentation. When a large number of diverse tasks are being learned simultaneously, some tasks can naturally serve as resets for other tasks during learning. Learning each of the tasks individually without resets is made easier by appropriately learning and sequencing together other tasks in the right order. By doing so, we can replace the simple forward/reset dichotomy with a more natural "network" of multiple tasks that can perform complex reset behaviors between each other.

Let us ground this intuition in a concrete example. Given a dexterous table-top manipulation task, shown in Fig 6 and Fig 2, our reset-free RL procedure might look like this: let us say the robot starts with the object in the palm and is trying to learn how to manipulate it in-hand so that it is oriented in a particular direction (in-hand reorient). While doing so, it may end up dropping the object. When learning with resets, a person would need to pick up the object and place it back in the hand to continue training. However, since we would like the robot to learn without such manual interventions, the robot itself needs to retrieve the object and resume practicing. To do so, the robot must first re-center the object so that it is suitable for grasping, and then actually lift it and flip it up so that it is in the palm to resume practicing. In case any of the intermediate tasks (say lifting the object) fails, the recenter task can be deployed to attempt picking up again, practicing these tasks in the process. Appropriately sequencing the execution and learning of different tasks allows for autonomous practicing of the in-hand manipulation behavior, without requiring any human or instrumented resets.

A. Algorithm Description

In this work, we directly leverage this insight to build a dexterous robotic system that learns in the absence of resets. We assume that we are provided with K different tasks that need to be learned together. These tasks each represent some distinct capability of the agent. As described above, in the dexterous manipulation domain the tasks might involve re-centering, picking up the object, and reorienting an object in-hand. Each of these K different tasks is provided with its own reward function R_i(s_t, a_t), and at test-time is evaluated against its distinct initial state distribution µ_0^i.

Our proposed learning system, which we call Multi-Task learning for Reset-Free RL (MTRF), attempts to jointly learn K different policies π_i, one for each of the defined tasks, by leveraging off-policy RL and continuous data collection in the environment. The system is initialized by randomly sampling a task and a state s_0 from that task's initial state distribution. The robot collects a continuous stream of data without any subsequent resets in the environment by sequencing the K policies according to a meta-controller (referred to as a "task-graph") G(s) : S → {0, 1, . . . , K − 1}. Given the current state of the environment and the learning process, the task-graph makes a decision once every T time steps on which of the tasks should be executed and trained for the next T time steps. This task-graph decides in what order the tasks should be learned and which of the policies should be used for data collection. The learning proceeds by iteratively collecting data with a policy π_i chosen by the task-graph for T time steps, after which the collected data is saved to a task-specific replay buffer B_i and the task-graph is queried again for which task to execute next, and the whole process repeats.

We assume that, for each task, successful outcomes for tasks that lead into that task according to the task graph (i.e., all incoming edges) will result in valid initial states for that task. This assumption is reasonable: intuitively, it states that an edge from task A to task B implies that successful outcomes of task A are valid initial states for task B. This means that, if task B is triggered after task A, it will learn to succeed from these valid initial states under µ_0^B. While this does not always guarantee that the downstream controller for task B will see all of the initial states from µ_0^B, since the upstream controller is not explicitly optimizing for coverage, in practice we find that this still performs very well. However, we expect that it would also be straightforward to introduce coverage into this method by utilizing state marginal matching methods [62]. We leave this for future work.

The individual policies can continue to be trained by leveraging the data collected in their individual replay buffers B_i via off-policy RL. As individual tasks become more and more successful, they can start to serve as effective resets for other tasks, forming a natural curriculum. The proposed framework is general and capable of learning a diverse collection of tasks reset-free when provided with a task graph that leverages the diversity of tasks. This leaves open the question of how to actually define the task-graph G to effectively sequence tasks. In this work, we assume that a task-graph defining the various tasks and the associated transitions is provided to the agent by the algorithm designer. In practice, providing such a graph is simple for a human user, although it could in principle be learned from experiential data. We leave this extension for future work. Interestingly, many other reset-free learning methods [14], [53], [54] can be seen as special cases of the framework we have described above. In our experiments we incorporate one of the tasks as a "perturbation" task. While prior work considered doing this with a single forward controller [8], we show that this type of perturbation can generally be applied by simply viewing it as another task. We incorporate this perturbation task in our instantiation of the algorithm, but we do not show it in the task graph figures for simplicity.
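As an illustration of the task-graph interface, one possible instantiation of G(s) for the tabletop in-hand reorientation example above might look like the following sketch. The task indices, state representation, and thresholds here are hypothetical choices made for exposition; the actual graphs we use are specified per domain (Section V).

    # Illustrative sketch of a user-provided task-graph G(s) for the in-hand
    # reorientation example discussed above. The state predicates and thresholds
    # are hypothetical, chosen only to show the structure of such a graph.

    RECENTER, PICKUP, FLIPUP, REORIENT = 0, 1, 2, 3

    def task_graph(state) -> int:
        """Map the current environment state to the index of the task to run next."""
        if not object_in_hand(state):
            # Object dropped (or never grasped): bring it back to a graspable pose,
            # then attempt the pickup again.
            return RECENTER if not object_centered(state) else PICKUP
        if not object_in_palm(state):
            # Object is grasped but not yet resting in the palm.
            return FLIPUP
        # Object is in the palm: practice the "forward" in-hand reorientation task.
        return REORIENT

    # --- illustrative state predicates (assumed thresholds) -----------------
    def object_centered(state, tol=0.05) -> bool:
        x, y = state["object_xy"]
        return abs(x) < tol and abs(y) < tol

    def object_in_hand(state, min_height=0.10) -> bool:
        return state["object_height"] > min_height

    def object_in_palm(state) -> bool:
        return state["object_upright"] and object_in_hand(state)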
Fig. 2: Depiction of some steps of reset-free training for the in-hand manipulation task family on hardware. Reset-free training uses the
task graph to choose which policy to execute and train at every step. For executions that are not successful (e.g., the pickup in step 1),
other tasks (recenter in step 2) serve to provide a reset so that pickup can be attempted again. Once pickup is successful, the next task (flip
up) can be attempted. If the flip-up policy is successful, then the in-hand reorientation task can be attempted and if this drops the object
then the re-centering task is activated to continue training.

B. Practical Instantiation

To instantiate the algorithmic idea described above as a deep reinforcement learning framework that is capable of solving dexterous manipulation tasks without resets, we can build on the framework of actor-critic algorithms. We learn separate policies π_i for each of the K provided tasks, with separate critics Q_i and replay buffers B_i for each of the tasks. Each of the policies π_i is a deep neural network Gaussian policy with parameters θ_i, which is trained using a standard actor-critic algorithm, such as soft actor-critic [63], using data sampled from its own replay buffer B_i. The task graph G is represented as a user-provided state machine, as shown in Fig 6, and is queried every T steps to determine which task policy π_i to execute and update next. Training proceeds by starting execution from a particular state s_0 in the environment, querying the task-graph G to determine which policy i = G(s_0) to execute, and then collecting T time-steps of data using the policy π_i, transitioning the environment to a new state s_T (Fig 6). The task-graph is then queried again and the process is repeated until all the tasks are learned.
Algorithm 1 MTRF
 1: Given: K tasks with rewards R_i(s_t, a_t), along with a task graph mapping states to a task index G(s) : S → {0, 1, . . . , K − 1}
 2: Let î represent the task index associated with the forward task that is being learned.
 3: Initialize π_i, Q_i, B_i ∀i ∈ {0, 1, . . . , K − 1}
 4: Initialize the environment in task î with initial state s_î ∼ µ_î(s_î)
 5: for iteration n = 1, 2, . . . do
 6:     Obtain the current task i to execute by querying the task graph at the current environment state: i = G(s_curr)
 7:     for iteration j = 1, 2, . . . , T do
 8:         Execute π_i in the environment, receiving task-specific rewards R_i and storing data in the buffer B_i
 9:         Train the current task's policy and value functions π_i, Q_i by sampling a batch from the replay buffer containing this task's experience B_i, according to SAC [63]
10:     end for
11: end for
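The following sketch shows one way the control flow of Algorithm 1 could be realized in code. The environment, agent, and replay buffer objects are assumed placeholders with simple gym/SAC-style interfaces (any standard off-policy implementation could be substituted); this is an illustrative sketch rather than our actual implementation.

    # Illustrative sketch of the MTRF training loop in Algorithm 1. `env`,
    # `agents`, and `buffers` are hypothetical placeholders, not the paper's codebase.

    from typing import Callable, Sequence

    def train_mtrf(
        env,                              # reset-free environment: step(a) -> (s, info), get_state() -> s
        agents: Sequence,                 # agents[i]: SAC-style learner with act(s) and update(batch)
        buffers: Sequence,                # buffers[i]: replay buffer with add(...) and sample(batch_size)
        reward_fns: Sequence[Callable],   # reward_fns[i](s, a) = R_i(s, a)
        task_graph: Callable[[object], int],  # G(s) -> task index in {0, ..., K-1}
        num_iterations: int = 10_000,
        steps_per_task: int = 200,        # T in Algorithm 1
        batch_size: int = 256,
    ):
        s = env.get_state()               # environment starts in the forward task's initial state
        for _ in range(num_iterations):
            i = task_graph(s)             # query the task graph: which task to run and train next
            for _ in range(steps_per_task):
                a = agents[i].act(s)
                s_next, _ = env.step(a)                       # no resets anywhere in the loop
                r = reward_fns[i](s, a)                       # task-specific reward R_i
                buffers[i].add(s, a, r, s_next)               # data goes to task i's own buffer
                agents[i].update(buffers[i].sample(batch_size))  # off-policy (SAC) update
                s = s_next
            # After T steps, control returns to the task graph; the environment is
            # never reset, so the next task starts from wherever this one ended.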
V. TASK AND SYSTEM SETUP

To study MTRF in the context of challenging robotic tasks, such as dexterous manipulation, we designed an anthropomorphic manipulation platform in both simulation and hardware. Our system (Fig 5) consists of a 22-DoF anthropomorphic hand-arm system. We use a self-designed and manufactured four-fingered, 16 DoF robot hand called the D'Hand, mounted on a 6 DoF Sawyer robotic arm to allow it to operate in an extended workspace in a table-top setting. We built this hardware to be particularly amenable to our problem setting due to its robustness and ease of long-term operation. The D'Hand can operate for upwards of 100 hours in contact-rich tasks without any breakages, whereas previous hand-based systems are much more fragile. Given the modular nature of the hand, even if a particular joint malfunctions, it is quick to repair and continue training. In our experimental evaluation, we use two different sets of dexterous manipulation tasks in simulation and two different sets of tasks in the real world. Details can be found in Appendix A and at https://sites.google.com/view/mtrf.

A. Simulation Domains

Fig. 3: Tasks and transitions for lightbulb insertion in simulation. The goal is to recenter a lightbulb, lift it, flip it over, and then insert it into a lamp.

Lightbulb insertion tasks. The first family of tasks involves inserting a lightbulb into a lamp in simulation with the dexterous hand-arm system. The tasks consist of centering the object on the table, pickup, in-hand reorientation, and insertion into the lamp. The multi-task transition task graph is shown in Fig 3. These tasks all involve coordinated finger and arm motion and require precise movement to insert the lightbulb.

Basketball tasks. The second family of tasks involves dunking a basketball into a hoop. This consists of repositioning the ball, picking it up, positioning the hand over the basket, and dunking the ball. This task has a natural cyclic nature, and allows tasks to reset each other as shown in Fig 4, while requiring fine-grained behavior to manipulate the ball midair.

Fig. 4: Tasks and transitions for the basketball domain in simulation. The goal here is to reposition a basketball object, pick it up and then dunk it in a hoop.

B. Hardware Tasks

Fig. 5: Real-world hand-arm manipulation platform. The system comprises a 16 DoF hand mounted on a 6 DoF Sawyer arm. The goal of the task is to perform in-hand reorientation, as illustrated in Fig 10, or pipe insertion, as shown in Fig 11.

We also evaluate MTRF on the real-world hand-arm robotic system, training a set of tasks in the real world, without any simulation or special instrumentation. We considered two different task families: in-hand manipulation of a 3-pronged valve object, as well as pipe insertion of a cylindrical pipe into a hose attachment mounted on the wall. We describe each of these setups in detail below.

In-Hand Manipulation: For the first task on hardware, we use a variant of the in-hand reorienting task, where the goal is to pick up an object and reorient it in the palm into a desired configuration, as shown in Fig 6. This task not only requires mastering the contacts required for a successful pickup, but also fine-grained finger movements to reorient the object in the palm, while at the same time balancing it so as to avoid dropping. The task graph corresponding to this domain is shown in Fig 6. A frequent challenge in this domain stems from dropping the object during the in-hand reorientation, which ordinarily would require some sort of reset mechanism (as seen in prior work [7]). However, MTRF enables the robot to utilize such "failures" as an opportunity to practice the tabletop re-centering, pickup, and flip-up tasks, which serve to "reset" the object into a pose where the reorientation can be attempted again.¹ The configuration of the 22-DoF hand-arm system mirrors that in simulation. The object is tracked using a motion capture system. Our policy directly controls each joint of the hand and the position of the end-effector. The system is set up to allow for extended uninterrupted operation, allowing for over 60 hours of training without any human intervention. We show how our proposed technique allows the robot to learn this task in the following section.

Fig. 6: Tasks and transitions for the in-hand manipulation domain on hardware. The goal here is to rotate a 3-pronged valve object to a particular orientation in the palm of the hand, picking it up if it falls down to continue practicing.

Pipe insertion: For the second task on hardware, we set up a pipe insertion task, where the goal is to pick up a cylindrical pipe object and insert it into a hose attachment on the wall, as shown in Fig 7. This task not only requires mastering the contacts required for a successful pickup, but also accurate and fine-grained arm motion to insert the pipe into the attachment in the wall. The task graph corresponding to this domain is shown in Fig 7. In this domain, the agent learns to pick up the object and then insert it into the attachment in the wall. If the object is dropped, it is then re-centered and picked up again to allow for another attempt at insertion.² As in the previous domain, our policy directly controls each joint of the hand and the position of the end-effector. The system is set up to allow for extended uninterrupted operation, allowing for over 30 hours of training without any human intervention.

Fig. 7: Tasks and transitions for the pipe insertion domain on hardware. The goal here is to reposition a cylindrical pipe object, pick it up and then insert it into a hose attachment on the wall.

¹ For the pickup task, the position of the arm's end-effector is scripted and only D'Hand controls are learned to reduce training time.
² For the pickup task, the position of the arm's end-effector is scripted and only D'Hand controls are learned to reduce training time. For the insertion task, the fingers are frozen, since the task largely involves accurate motion of the arm.
VI. EXPERIMENTAL EVALUATION

We focus our experiments on the following questions:
1) Are existing off-policy RL algorithms effective when deployed under reset-free settings to solve dexterous manipulation problems?
2) Does simultaneously learning a collection of tasks under the proposed multi-task formulation with MTRF alleviate the need for resets when solving dexterous manipulation tasks?
3) Does learning multiple tasks simultaneously allow for reset-free learning of more complex tasks than previous reset-free algorithms?
4) Does MTRF enable real-world reinforcement learning without resets or human interventions?

A. Baselines and Prior Methods

We compare MTRF (Section IV) to three prior baseline algorithms. Our first comparison is to a state-of-the-art off-policy RL algorithm, soft actor-critic [63] (labeled as SAC). The actor is executed continuously and reset-free in the environment, and the experienced data is stored in a replay pool. This algorithm is representative of efficient off-policy RL algorithms. We next compare to a version of a reset controller [14] (labeled as Reset Controller), which trains a forward controller to perform the task and a reset controller to reset the state back to the initial state. Lastly, we compare with the perturbation controller technique [8] introduced in prior work, which alternates between learning and executing a forward task-directed policy and a perturbation controller trained purely with novelty bonuses [64] (labeled as Perturbation Controller). For all the experiments we used the same RL algorithm, soft actor-critic [63], with default hyperparameters. To evaluate a task, we roll out its final policy starting from states randomly sampled from the distribution induced by all the tasks that can transition to the task under evaluation, and report performance in terms of their success in solving the task. Additional details, videos of all tasks and hyperparameters can be found at [https://sites.google.com/view/mtrf].
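As a concrete reading of this evaluation protocol, the sketch below rolls out a task's final policy from start states produced by the tasks with incoming edges into it and reports the resulting success rate. The environment interface and success predicate are hypothetical placeholders used only for illustration.

    # Illustrative sketch of the evaluation protocol described above: a task is
    # evaluated from start states induced by its predecessor tasks in the task
    # graph. The interfaces here are assumed placeholders, not our implementation.

    import random
    from typing import Callable, Sequence

    def evaluate_task(
        env,                              # env.set_state(s); env.step(a) -> s
        policy: Callable,                 # final policy pi_i(s) -> a for the task under evaluation
        incoming_final_states: Sequence,  # states produced by successful runs of predecessor tasks
        success_fn: Callable[[object], bool],  # task-specific success check
        num_episodes: int = 20,
        horizon: int = 200,
    ) -> float:
        """Return the success rate of the task's final policy."""
        successes = 0
        for _ in range(num_episodes):
            s = random.choice(incoming_final_states)  # sample a start state from incoming tasks
            env.set_state(s)
            for _ in range(horizon):
                s = env.step(policy(s))
            successes += int(success_fn(s))
        return successes / num_episodes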
B. Reset-Free Learning Comparisons in Simulation

We present results for reset-free learning, using our algorithm and prior methods, in Fig 8, corresponding to each of the tasks in simulation in Section V. We see that MTRF is able to successfully learn all of the tasks jointly, as evidenced by Fig 8. We measure evaluation performance after training by loading saved policies and running the policy corresponding to the "forward" task for each of the task families (i.e., lightbulb insertion and basketball dunking). This indicates that we can solve all the tasks, and as a result can learn reset-free more effectively than prior algorithms.

In comparison, we see that the prior algorithms for off-policy RL – the reset controller [14] and the perturbation controller [8] – are not able to learn the more complex of the tasks as effectively as our method. While these methods are able to make some amount of progress on tasks that are constructed to be very easy, such as the pincer task family shown in Appendix B, we see that they struggle to scale well to the more challenging tasks (Fig 8). Only MTRF is able to learn these tasks, while the other methods never actually reach the later tasks in the task graph.

To understand how the tasks are being sequenced during the learning process, we show the task transitions experienced during training for the basketball task in Fig 9. We observe that early in training the transitions are mostly between the recenter and perturbation tasks. As MTRF improves, the transitions add in pickup and then basketball dunking, cycling between re-centering, pickup, and basketball placement in the hoop.

C. Learning Real-World Dexterous Manipulation Skills

Next, we evaluate the performance of MTRF on the real-world robotic system described in Section V, studying the dexterous manipulation tasks described in Section V-B.

a) In-Hand Manipulation: Let us start by considering the in-hand manipulation tasks shown in Fig 6. This task is challenging because it requires delicate handling of finger-object contacts during training, the object is easy to drop during the flip-up and in-hand manipulation, and the reorientation requires a coordinated finger gait to rotate the object. In fact, most prior work that aims to learn similar in-hand manipulation behaviors either utilizes simulation or employs a hand-designed reset procedure, potentially involving human interventions [7], [12], [10]. To the best of our knowledge, our work is the first to show a real-world robotic system learning such a task entirely in the real world and without any manually provided or hand-designed reset mechanism. We visualize a sequential execution of the tasks (after training) in Fig 10, and encourage the reader to watch a video of this task, as well as the training process, on the project website: [https://sites.google.com/view/mtrf].
Fig. 8: Comparison of MTRF with baseline methods in simulation when run without resets. In comparison to the prior reset-free RL
methods, MTRF is able to learn the tasks more quickly and with higher average success rates, even in cases where none of the prior
methods can master the full task set. MTRF is able to solve all of the tasks without requiring any explicit resets.

Fig. 9: Visualization of task frequency in the basketball task family. While initially recentering and pickup are common, as these get better they are able to provide resets for other tasks.

Over the course of training, the robot must first learn to recenter the object on the table, then learn to pick it up (which requires learning an appropriate grasp and delicate control of the fingers to maintain grip), then learn to flip up the object so that it rests in the palm, and finally learn to perform the reorientation. Dropping the object at any point in this process requires going back to the beginning of the sequence, and initially most of the training time is spent on re-centering, which provides resets for the pickup. The entire training process takes about 60 hours of real time, learning all of the tasks simultaneously. Although this time requirement is considerable, it is entirely autonomous, making this approach scalable even without any simulation or manual instrumentation. The user only needs to position the objects for training, and switch on the robot.

For a quantitative evaluation, we plot the success rate of sub-tasks including re-centering, lifting, flipping over, and in-hand reorientation. For lifting and flipping over, success is defined as lifting the object to a particular height above the table, and for reorientation, success is defined by the difference between the current object orientation and the target orientation of the object. As shown in Fig 12, MTRF is able to autonomously learn all tasks in the task graph in 60 hours, and achieves a 70% success rate for the in-hand reorient task. This experiment illustrates how MTRF can enable a complex real-world robotic system to learn an in-hand manipulation behavior while at the same time autonomously retrying the task during a lengthy unattended training run, without any simulation, special instrumentation, or manual interventions. This experiment suggests that, when MTRF is provided with an appropriate set of tasks, learning of complex manipulation skills can be carried out entirely autonomously in the real world, even for highly complex robotic manipulators such as multi-fingered hands.

b) Pipe Insertion: We also considered the second task variant, which involves manipulating a cylindrical pipe to insert it into a hose attachment on the wall, as shown in Fig 7. This task is challenging because it requires accurate grasping and repositioning of the object during training in order to accurately insert it into the hose attachment, requiring coordination of both the arm and the hand. Training this task without resets requires a combination of repositioning, lifting, insertion, and removal to continually keep training and improving. We visualize a sequential execution of the tasks (after training) in Fig 11, and encourage the reader to watch a video of this task, as well as the training process, on the project website: [https://sites.google.com/view/mtrf]. Over the course of training, the robot must first learn to recenter the object on the table, then learn to pick it up (which requires learning an appropriate grasp), then learn to actually move the arm accurately to insert the pipe into the attachment, and finally learn to remove it to continue training and practicing. Initially most of the training time is spent on re-centering, which provides resets for the pickup, which then provides resets for the insertion and removal. The entire training process takes about 25 hours of real time, learning all of the tasks simultaneously.
Fig. 10: Film strip illustrating partial training trajectory of hardware system for in-hand manipulation of the valve object. This shows
various behaviors encountered during the training - picking up the object, flipping it over in the hand and then in-hand manipulation to get
it to a particular orientation. As seen here, MTRF is able to successfully learn how to perform in-hand manipulation without any human
intervention.

Fig. 11: Film strip illustrating partial training trajectory of hardware system for pipe insertion. This shows various behaviors encountered
during the training - repositioning the object, picking it up, and then inserting it into the wall attachment. As seen here, MTRF is able to
successfully learn how to do pipe insertion without any human intervention.

Fig. 12: Success rate of various tasks in the dexterous manipulation task families on hardware. Left: In-hand manipulation. We can see that all of the tasks are able to learn successfully, with more than 70% success rate. Right: Pipe insertion. We can see that the final pipe insertion task is able to learn successfully with more than 60% success rate.

VII. DISCUSSION

In this work, we introduced a technique for learning dexterous manipulation behaviors reset-free, without the need for any human intervention during training. This was done by leveraging the multi-task RL setting for reset-free learning. When learning multiple tasks simultaneously, different tasks serve to reset each other and assist in uninterrupted learning. This algorithm allows a dexterous manipulation system to learn manipulation behaviors uninterrupted, and also to learn the behaviors that allow it to continue practicing. Our experiments show that this approach can enable a real-world hand-arm robotic system to learn an in-hand reorientation task, including pickup and repositioning, in about 60 hours of training, as well as a pipe insertion task in around 25 hours of training, without human intervention or special instrumentation.

VIII. ACKNOWLEDGEMENTS

The authors would like to thank Anusha Nagabandi, Greg Kahn, Archit Sharma, Aviral Kumar, Benjamin Eysenbach, Karol Hausman, and Corey Lynch for helpful discussions and comments. This research was supported by the Office of Naval Research, the National Science Foundation, and Berkeley DeepDrive.

REFERENCES

[1] A. Rajeswaran, V. Kumar, A. Gupta, J. Schulman, E. Todorov, and S. Levine, "Learning complex dexterous manipulation with deep reinforcement learning and demonstrations," CoRR, vol. abs/1709.10087, 2017. [Online]. Available: http://arxiv.org/abs/1709.10087
[2] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, "Deepmimic: Example-guided deep reinforcement learning of physics-based character skills," ACM Trans. Graph., vol. 37, no. 4, pp. 143:1–143:14, Jul. 2018. [Online]. Available: http://doi.acm.org/10.1145/3197517.3201311
[3] S. Levine, N. Wagener, and P. Abbeel, "Learning contact-rich manipulation skills with guided policy search," in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015.
[4] L. Pinto and A. Gupta, "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours," in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 3406–3413.
[5] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020.
[6] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., "Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation," arXiv preprint arXiv:1806.10293, 2018.
[7] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, "Deep dynamics models for learning dexterous manipulation," in Conference on Robot Learning, 2020, pp. 1101–1112.
[8] H. Zhu, J. Yu, A. Gupta, D. Shah, K. Hartikainen, A. Singh, V. Kumar, and S. Levine, "The ingredients of real-world robotic reinforcement learning," arXiv preprint arXiv:2004.12570, 2020.
[9] K. Ploeger, M. Lutter, and J. Peters, "High acceleration reinforcement learning for real-world juggling with binary rewards," 2020.
[10] V. Kumar, E. Todorov, and S. Levine, "Optimal control with learned local models: Application to dexterous manipulation," in 2016 IEEE International Conference on Robotics and Automation, ICRA 2016, Stockholm, Sweden, May 16-21, 2016, D. Kragic, A. Bicchi, and A. D. Luca, Eds. IEEE, 2016, pp. 378–383. [Online]. Available: https://doi.org/10.1109/ICRA.2016.7487156
[11] A. Ghadirzadeh, A. Maki, D. Kragic, and M. Björkman, "Deep predictive policy training using reinforcement learning," CoRR, vol. abs/1703.00727, 2017. [Online]. Available: http://arxiv.org/abs/1703.00727
        APPENDIX A. REWARD FUNCTIONS AND ADDITIONAL TASK DETAILS

A. In-Hand Manipulation Tasks on Hardware

   The rewards for this family of tasks are defined as follows. $\theta_x$ represents the Sawyer end-effector's wrist Euler angle, $x, y, z$ represent the object's 3D position, and $\hat{\theta}_z$ represents the object's z Euler angle. $x_{goal}$ and the other goal-subscripted quantities represent the task's goal position or angle. The threshold for determining whether the object has been lifted depends on the real-world arena size; we set it to 0.1 m in our experiments.

$R_{recenter} = -3\,\big\| [x, y] - [x_{goal}, y_{goal}] \big\| - \big\| [x, y, z] - [x_{hand}, y_{hand}, z_{hand}] \big\|$

$R_{lift} = -\,|z - z_{goal}|$

$R_{flipup} = -5\,|\theta_x - \theta_x^{goal}| - 50\cdot\mathbb{1}\{z < \text{threshold}\} + 10\cdot\mathbb{1}\{|\theta_x - \theta_x^{goal}| < 0.15 \ \text{AND}\ z > \text{threshold}\}$

$R_{reorient} = -\,|\hat{\theta}_z - \hat{\theta}_z^{goal}|$

   A rollout is considered a success (as reported in the figures) if it reaches a state where the valve is in-hand and flipped facing up:

$z > \text{threshold} \ \text{AND}\ |\theta_x - \theta_x^{goal}| < 0.15$
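   For concreteness, the following is a minimal Python sketch of these rewards and the success check, assuming the object position, hand position, and wrist angle are available as numpy arrays and floats; all function and variable names are illustrative placeholders rather than identifiers from any released implementation.

import numpy as np

THRESHOLD = 0.1  # lift threshold in meters, matching the 0.1 m used in our experiments

def r_recenter(obj_xyz, hand_xyz, goal_xy):
    # -3 * ||(x, y) - (x_goal, y_goal)|| - ||(x, y, z) - (x_hand, y_hand, z_hand)||
    return (-3.0 * np.linalg.norm(obj_xyz[:2] - goal_xy)
            - np.linalg.norm(obj_xyz - hand_xyz))

def r_lift(obj_z, goal_z):
    # -|z - z_goal|
    return -abs(obj_z - goal_z)

def r_flipup(wrist_angle, goal_angle, obj_z, threshold=THRESHOLD):
    # Angle shaping, a penalty for sitting below the lift threshold,
    # and a bonus for being flipped up while lifted.
    reward = -5.0 * abs(wrist_angle - goal_angle)
    reward -= 50.0 * float(obj_z < threshold)
    reward += 10.0 * float(abs(wrist_angle - goal_angle) < 0.15 and obj_z > threshold)
    return reward

def r_reorient(obj_angle_z, goal_angle_z):
    # -|theta_hat_z - theta_hat_z_goal|
    return -abs(obj_angle_z - goal_angle_z)

def flipup_success(wrist_angle, goal_angle, obj_z, threshold=THRESHOLD):
    # Success: the object is lifted and the wrist angle is within 0.15 of the goal.
    return obj_z > threshold and abs(wrist_angle - goal_angle) < 0.15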
B. Pipe Insertion Tasks on Hardware

   The rewards for this family of tasks are defined as follows. $x, y, z$ represent the object's 3D position, and $q$ represents the joint positions of the D'Hand. $x_{goal}$ and the other goal-subscripted quantities represent the task's goal position or angle. The threshold for determining whether the object has been lifted depends on the real-world arena size; we set it to 0.1 m in our experiments. To reduce collisions in real-world experiments, we use two tasks for insertion: one approaches the peg and the other attempts the insertion.

$R_{recenter} = -3\,\big\| [x, y] - [x_{goal}, y_{goal}] \big\| - \big\| [x, y, z] - [x_{hand}, y_{hand}, z_{hand}] \big\|$

$R_{lift} = -2\,|z - z_{goal}| - 2\,|q - q_{goal}|$

$R_{Insert1} = -d_1 + 10\cdot\mathbb{1}\{d_1 < 0.1\}$

$R_{Insert2} = -d_2 + 10\cdot\mathbb{1}\{d_2 < 0.1\}$

where

$d_1 = \big\| [x, y, z] - [x_{goal1}, y_{goal1}, z_{goal1}] \big\|, \qquad d_2 = \big\| [x, y, z] - [x_{goal2}, y_{goal2}, z_{goal2}] \big\|$

$R_{Remove} = -\big\| [x, y, z] - [x_{arena\_center}, y_{arena\_center}, z_{arena\_center}] \big\|$

   A rollout is considered a success (as reported in the figures) if it reaches a state where the pipe is in-hand and inserted:

$z > \text{threshold} \ \text{AND}\ d_2 < 0.05$
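   A sketch of the staged insertion rewards and the success check follows, under the same placeholder conventions as above; the distance helper, the reading of $|q - q_{goal}|$ as a norm over the hand joints, and the names of the two stage goals are our own assumptions.

import numpy as np

THRESHOLD = 0.1  # lift threshold in meters

def distance(obj_xyz, goal_xyz):
    # d_i = ||(x, y, z) - (x_goal_i, y_goal_i, z_goal_i)||
    return np.linalg.norm(obj_xyz - goal_xyz)

def r_lift_pipe(obj_z, goal_z, q, q_goal):
    # -2|z - z_goal| - 2|q - q_goal|; the joint-position term is taken as a norm here.
    return -2.0 * abs(obj_z - goal_z) - 2.0 * np.linalg.norm(q - q_goal)

def r_insert_stage(obj_xyz, stage_goal_xyz):
    # -d + 10 * 1{d < 0.1}, used for both the approach stage and the insertion stage.
    d = distance(obj_xyz, stage_goal_xyz)
    return -d + 10.0 * float(d < 0.1)

def r_remove(obj_xyz, arena_center_xyz):
    # -||(x, y, z) - arena_center||, pulling the object back toward the arena center.
    return -distance(obj_xyz, arena_center_xyz)

def insertion_success(obj_xyz, insert_goal_xyz, threshold=THRESHOLD):
    # Success: object lifted above the threshold and within 0.05 m of the insertion goal.
    return obj_xyz[2] > threshold and distance(obj_xyz, insert_goal_xyz) < 0.05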
C. Lightbulb Insertion Tasks in Simulation

   The rewards for this family of tasks include $R_{recenter}$, $R_{pickup}$, $R_{flipup}$ defined in the previous section, as well as the following bulb insertion reward:

$R_{bulb} = -\big\| [x, y] - [x_{goal}, y_{goal}] \big\| - 2\,|z - z_{goal}| + \mathbb{1}\{\| [x, y] - [x_{goal}, y_{goal}] \| < 0.1\} + 10\cdot\mathbb{1}\{\| [x, y] - [x_{goal}, y_{goal}] \| < 0.1 \ \text{AND}\ |z - z_{goal}| < 0.1\} - \mathbb{1}\{z < \text{threshold}\}$

   A rollout is considered a success (as reported in the figures) if it reaches a state where the bulb is positioned very close to the goal position in the lamp:

$\| [x, y] - [x_{goal}, y_{goal}] \| < 0.1 \ \text{AND}\ |z - z_{goal}| < 0.1$
D. Basketball Tasks in Simulation

   The rewards for this family of tasks include $R_{recenter}$, $R_{pickup}$ defined in the previous section, as well as the following basket dunking reward:

$R_{basket} = -\big\| [x, y, z] - [x_{goal}, y_{goal}, z_{goal}] \big\| + 20\cdot\mathbb{1}\{\| [x, y, z] - [x_{goal}, y_{goal}, z_{goal}] \| < 0.2\} + 50\cdot\mathbb{1}\{\| [x, y, z] - [x_{goal}, y_{goal}, z_{goal}] \| < 0.1\} - \mathbb{1}\{z < \text{threshold}\}$

   A rollout is considered a success (as reported in the figures) if it reaches a state where the ball is positioned very close to the goal position above the basket:

$\| [x, y] - [x_{goal}, y_{goal}] \| < 0.1 \ \text{AND}\ |z - z_{goal}| < 0.15$
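   As before, a minimal sketch of the dunking reward and success check, with placeholder names and the lift threshold assumed to match the 0.1 m used in the other tasks:

import numpy as np

THRESHOLD = 0.1  # assumed lift threshold, matching the other tasks

def r_basket(ball_xyz, goal_xyz, threshold=THRESHOLD):
    # Distance shaping toward the goal above the basket, bonuses at 0.2 and 0.1,
    # and a penalty when the ball drops below the lift threshold.
    d = np.linalg.norm(ball_xyz - goal_xyz)
    reward = -d
    reward += 20.0 * float(d < 0.2)
    reward += 50.0 * float(d < 0.1)
    reward -= 1.0 * float(ball_xyz[2] < threshold)
    return reward

def basket_success(ball_xyz, goal_xyz):
    # Success: ball positioned very close to the goal above the basket.
    xy_err = np.linalg.norm(ball_xyz[:2] - goal_xyz[:2])
    return xy_err < 0.1 and abs(ball_xyz[2] - goal_xyz[2]) < 0.15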
                        APPENDIX B. ADDITIONAL DOMAINS

   In addition to the test domains described in Section V-A, we also tested our method in simulation on simpler tasks such as a 2D "pincer" and a simplified lifting task on the Sawyer and "D'Hand" setup. The pincer task is described in Figure 13, which also shows the performance of our method as well as baseline comparisons.

[Figure 13 plots: "Pincer Fill Task Success Rate Comparisons" (Ours, Perturbation, Reset Controller, SAC) and "All Tasks Success Rate for Pincer Family" (Fill Success, Pick Success, Pull Success), each showing success rate against training iterations.]

Fig. 13: Pincer domain - object grasping, filling the drawer with the object, pulling open the drawer. These tasks naturally form a cycle - once an object is picked up, it can be filled in the drawer, following which the drawer can be pulled open and grasping and filling can be practiced again.
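   The cycle described in the caption is what makes reset-free practice possible in this domain: each task leaves the scene in a state from which the next task can start. The sketch below illustrates the idea with a deliberately simplified fixed round-robin over hypothetical task ids; it is not the task-sequencing scheme used by our method, and the environment interface (observe/step with no reset) is assumed.

import itertools

# Hypothetical task ids for the pincer family, ordered to mirror the cycle in
# the caption: pick up the object, fill it into the drawer, pull the drawer open.
PINCER_CYCLE = ("pick", "fill", "pull_open")

def reset_free_loop(env, policies, num_episodes, horizon):
    """Run episodes back-to-back so that each task sets up the next one.

    `env` is assumed to expose observe() and step(task, action) without any
    reset between episodes; `policies[task]` maps an observation to an action.
    """
    task_iter = itertools.cycle(PINCER_CYCLE)
    obs = env.observe()
    for _ in range(num_episodes):
        task = next(task_iter)
        for _ in range(horizon):
            action = policies[task](obs)
            obs = env.step(task, action)  # note: no environment reset anywhere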
                        APPENDIX C. HYPERPARAMETER DETAILS

SAC
    Learning Rate          3 × 10^-4
    Discount Factor γ      0.99
    Policy Type            Gaussian
    Policy Hidden Sizes    (512, 512)
    RL Batch Size          1024
    Reward Scaling         1
    Replay Buffer Size     500,000
    Q Hidden Sizes         (512, 512)
    Q Hidden Activation    ReLU
    Q Weight Decay         0
    Q Learning Rate        3 × 10^-4
    Target Network τ       5 × 10^-3

TABLE I: Hyperparameters used across all domains.
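   For reproducibility, these values can be bundled into a single configuration object; the sketch below uses our own field names and does not assume any particular SAC implementation.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class SACConfig:
    # Hyperparameters from Table I; the same values are used across all domains.
    learning_rate: float = 3e-4               # policy learning rate
    discount: float = 0.99                    # gamma
    policy_type: str = "gaussian"
    policy_hidden_sizes: Tuple[int, int] = (512, 512)
    batch_size: int = 1024                    # RL batch size
    reward_scale: float = 1.0
    replay_buffer_size: int = 500_000
    q_hidden_sizes: Tuple[int, int] = (512, 512)
    q_hidden_activation: str = "relu"
    q_weight_decay: float = 0.0
    q_learning_rate: float = 3e-4
    target_network_tau: float = 5e-3          # soft target update coefficient

config = SACConfig()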