Mutual Information Based Knowledge Transfer Under State-Action Dimension Mismatch


Michael Wan, Tanmay Gangwani, Jian Peng
Computer Science Dept., UIUC
mw3@illinois.edu, gangwan2@illinois.edu, jianpeng@illinois.edu

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.
arXiv:2006.07041v1 [stat.ML] 12 Jun 2020

Abstract

Deep reinforcement learning (RL) algorithms have achieved great success on a wide variety of sequential decision-making tasks. However, many of these algorithms suffer from high sample complexity when learning from scratch using environmental rewards, due to issues such as credit-assignment and high-variance gradients, among others. Transfer learning, in which knowledge gained on a source task is applied to more efficiently learn a different but related target task, is a promising approach to improve the sample complexity in RL. Prior work has considered using pre-trained teacher policies to enhance the learning of the student policy, albeit with the constraint that the teacher and the student MDPs share the state-space or the action-space. In this paper, we propose a new framework for transfer learning where the teacher and the student can have arbitrarily different state- and action-spaces. To handle this mismatch, we produce embeddings which can systematically extract knowledge from the teacher policy and value networks, and blend it into the student networks. To train the embeddings, we use a task-aligned loss and show that the representations can be enriched further by adding a mutual information loss. Using a set of challenging simulated robotic locomotion tasks involving many-legged centipedes, we demonstrate successful transfer learning in situations where the teacher and student have different state- and action-spaces.

1   INTRODUCTION

Deep reinforcement learning (RL), which combines the rigor of RL algorithms with the flexibility of universal function approximators such as deep neural networks, has demonstrated a plethora of success stories in recent times. These include computer and board games (Mnih et al., 2015; Silver et al., 2016), continuous control (Lillicrap et al., 2015), and robotics (Rajeswaran et al., 2017), to name a few. Crucially though, these methods have been shown to be performant in the regime where an agent can accumulate vast amounts of experience in the environment, usually modeled with a simulator. For real-world environments such as autonomous navigation and industrial processes, data generation is an expensive (and sometimes risky) procedure. To make deep RL algorithms more sample-efficient, there is great interest in designing techniques for knowledge transfer, which enable accelerating agent learning by leveraging either existing trained policies (referred to as teachers) or task demonstrations for imitation learning (Abbeel & Ng, 2004). One promising idea for knowledge transfer in RL is policy distillation (Rusu et al., 2015; Parisotto et al., 2015; Hinton et al., 2015), where information from the teacher policy network is transferred to a student policy network to improve the learning process.

Prior work has incorporated policy distillation in a variety of settings (Czarnecki et al., 2019). Some examples include the transfer of knowledge from simple to complex agents while following a curriculum over agents (Czarnecki et al., 2018), learning a centralized policy that captures shared behavior across tasks for multi-task RL (Teh et al., 2017), distilling information from parent policies into a child policy for a genetically-inspired RL algorithm (Gangwani & Peng, 2017), and speeding up large-scale population-based training using multiple teachers (Schmitt et al., 2018). A common motif in these approaches is the use of the Kullback-Leibler (KL) divergence between the state-conditional action distributions of the teacher and student networks as the minimization objective for knowledge transfer. While simple and intuitive, this restricts learning to teachers that have the same output (action) space as the student, since the KL divergence is only defined for distributions over a common space. An alternative to knowledge sharing in the action-space is information transfer through the embedding-space formed via the different layers of a deep neural network. Liu et al. (2019) provide an example of this; they utilize learned lateral connections between intermediate layers of the teacher and student networks. Although the action-spaces can now be different, the state-space is still required to be identical between the teacher and the student, since the same input observation is fed to both networks (Liu et al., 2019).

In our work, we present a transfer learning approach to accelerate the training of the student policy by leveraging teacher policies trained in an environment with a different state- and action-space. Arguably, there is huge potential for data-efficient student learning by tapping into teachers trained on dissimilar, but related, tasks. For instance, consider an available teacher policy for locomotion of a quadruped robot, where the (97-dimensional) state-space is the set of joint-angles and joint-velocities and the (10-dimensional) action-space is the torques applied to the joints. If we wish to learn locomotion for a hexapod robot (state dimension 139, action dimension 16), we conjecture that the learning could be kick-started by harnessing the information stored in the trained neural network for the quadruped, since both tasks are locomotion for legged robots and therefore share an inherent structure. However, the dissimilar state- and action-spaces preclude the use of the knowledge transfer mechanisms proposed in prior work.

Our approach deals with the mismatch in the state- and action-space of the teacher and student in the following manner. To handle disparate actions, rather than using divergence minimization in the action-space, we transfer knowledge by augmenting representations in the layers of the student network with representations from the layers of the teacher network. This is similar to the knowledge flow in Liu et al. (2019) using lateral connections, but with the important difference that we do not employ learnable matrices to transform the teacher representation. The mismatch in the observation- or state-space has not been considered in prior literature, to the best of our knowledge. We manage this by learning an embedding space which can be used to extract the necessary information from the available teacher policy network. These embeddings are trained to adhere to two properties. Firstly, they must be task-aligned. Our RL objective is the maximization of cumulative discounted rewards in the student environment, and therefore, the embeddings must be aligned to serve that goal. Secondly, we would like the embeddings to be correlated with the states encountered by the student policy. The embeddings are used to deterministically draw out knowledge from the teacher network; therefore, a high correlation ensures that the most suitable teacher guidance is derived for each student state. We achieve this by maximizing the mutual information between the embeddings and the student states. We evaluate our method on a set of challenging robotic locomotion tasks modeled using the MuJoCo simulator. We demonstrate the successful transfer of knowledge from trained teachers to students in the scenario of mismatched state- and action-spaces. This leads to appreciable gains in sample-efficiency compared to RL from scratch using only the environmental rewards.

2   BACKGROUND

We consider the RL setting where the environment is modeled as an infinite-horizon discrete-time Markov Decision Process (MDP). The MDP is characterized by the tuple (S, A, R, T, γ, p₀), where S and A are the continuous state- and action-space, respectively, γ ∈ [0, 1) is the discount factor, and p₀ is the initial state distribution. Given an action a_t ∈ A, the next state is sampled from the transition dynamics distribution, s_{t+1} ∼ T(s_{t+1} | s_t, a_t), and the agent receives a scalar reward r(s_t, a_t) determined by the reward function R. A policy π_θ(a_t | s_t) defines the state-conditioned distribution over actions. The RL objective is to learn the policy parameters θ to maximize the expected discounted sum of rewards, η(π_θ) = E_{p₀,T,π} [ Σ_{t=0..∞} γ^t r(s_t, a_t) ].

Policy-gradient algorithms (Sutton et al., 2000) are widely used to estimate the gradient of the RL objective. Proximal policy optimization (PPO, Schulman et al. (2017)) is a model-free policy-gradient algorithm that serves as an efficient approximation to trust-region methods (Schulman et al., 2015a). In each iteration of PPO, the rollout policy π_{θ_old} is used to collect sample trajectories τ, and the following surrogate loss is minimized over multiple epochs:

L^θ_PPO = −E_τ [ min( r_t(θ) Â_t , clip( r_t(θ), 1 − ε, 1 + ε ) Â_t ) ]

where r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) is the ratio of the action probabilities under the current policy and the rollout policy, and Â_t is the estimated advantage.
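To make the surrogate concrete, the following is a minimal PyTorch sketch of the clipped policy loss L^θ_PPO. The tensor names and the default clipping constant are illustrative assumptions rather than details taken from the paper.

```python
import torch

def ppo_policy_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, negated so it can be minimized.

    log_probs:      log pi_theta(a_t|s_t) under the current policy
    old_log_probs:  log pi_theta_old(a_t|s_t) under the rollout policy (detached)
    advantages:     estimated advantages A_hat_t
    """
    ratio = torch.exp(log_probs - old_log_probs)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))                   # L^theta_PPO
```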
Variance in the policy-gradient estimates is reduced by employing the state-value function as a control variate (Mnih et al., 2016). This is usually modeled as a neural network V_ψ and updated using temporal-difference learning:

L^ψ_PPO = E_τ [ ( V_ψ(s_t) − V_t^targ )² ]

where V_t^targ is the bootstrapped target value obtained with TD(λ). To further reduce variance, Generalized Advantage Estimation (GAE, Schulman et al. (2015b)) is used when estimating the advantage. The overall PPO minimization objective then is:

L_PPO(θ, ψ) = L^θ_PPO + L^ψ_PPO        (1)

Although we use the PPO objective for our experiments, our method can be readily combined with any on-policy or off-policy actor-critic RL algorithm.
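As an illustration of how the advantage estimates Â_t and value targets V_t^targ above can be computed, below is a standard GAE recursion written in Python. The array-based interface (rewards, values, termination flags) is an assumption made for the sketch and is not the authors' implementation.

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation and bootstrapped value targets.

    rewards, dones: arrays of length T from a rollout
    values:         V_psi(s_t) for t = 0..T-1
    last_value:     V_psi(s_T) used to bootstrap the final step
    Returns (advantages A_hat_t, value targets V_t^targ = A_hat_t + V(s_t)).
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae                  # GAE recursion
        advantages[t] = gae
        next_value = values[t]
    targets = advantages + np.asarray(values)
    return advantages, targets
```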
3   METHOD

In this section, we outline our method for distilling knowledge from a pre-trained teacher policy to a student policy, in the hope that such knowledge sharing improves the sample-efficiency of the student learning process. Our problem setting is as follows. We assume that the teacher and the student policies operate in two different MDPs. All the MDP properties (S, A, R, T, γ, p₀) could be different, provided that some high-level structural commonality exists between the MDPs, such as the example of transfer from a quadruped robot to a hexapod robot introduced in Section 1. Henceforth, for notational convenience, we refer to the MDP of the teacher as the source MDP, and that of the student as the target MDP. We assume the availability of a teacher policy network pre-trained in the source MDP. Crucially though, we do not assume access to the source MDP for any further exploration, or for obtaining demonstration trajectories that could be used for training in the target MDP using cross-domain imitation-learning techniques. We instead focus on extracting representations from the teacher policy network which are useful for learning in the target MDP.

In this work, we address knowledge transfer when S_src ≠ S_targ, where S_src and S_targ denote the state-spaces of the source and target MDPs, respectively. To handle the mismatch, we introduce a learned embedding-space parameterized by an encoder function φ(·), and defined as S_emb := {φ(s) | s ∈ S_targ}. Data points from this embedding space are used to extract useful information from the teacher policy network. Therefore, we further enforce that the dimension of the embedding space matches the dimension of the state-space in the source MDP, i.e., |S_emb| = |S_src|. Note that this does not necessitate that any embedding vector s ∈ S_emb be a feasible input state in the source MDP. To learn the encoder function φ(·), we consider the following two desiderata. Firstly, the embeddings must be learned to facilitate our objective of maximizing the cumulative discounted rewards in the target MDP. In subsection 3.1, we show how to achieve this by utilizing the policy gradient to update the embedding parameters. Secondly, we wish for a high correlation between the input states of the target MDP and the embedding vectors produced from them. The embeddings are used to deterministically derive representations from the teacher network, and hence a high correlation helps to obtain the most appropriate teacher guidance for each of the states encountered by the target policy. To this end, we propose a mutual information maximization objective; this is detailed in subsection 3.2.

3.1   TASK-ALIGNED EMBEDDING SPACE

This section describes our approach for training the encoder parameters (φ) such that the generated embeddings are aligned with the RL objective. We begin by detailing the architecture that we use for the transfer of knowledge from a teacher, pre-trained in the source MDP, to a student policy in the target MDP with a different state- and action-space. Inspired by the concept of knowledge-flow used in Liu et al. (2019), we employ lateral connections between the student and teacher networks, which augment the representations in the layers of the student with useful representations from the layers of the teacher. A crucial benefit of this approach is that, since information sharing happens through the hidden layers, the output (action) space of the source and target MDPs can be disparate, as is the scenario in our experiments. It is also quite straightforward to include multiple teachers in this architecture to distill diverse knowledge into a student; we leave this to future work.

We draw out knowledge from both the teacher policy and state-value networks. We denote the teacher policy and value network with π_θ′ and V_ψ′, respectively, where the parameters (θ′, ψ′) are held fixed throughout the training. Analogously, (θ, ψ) are the trainable parameters for the student policy and value networks. Let N_π denote the number of hidden layers in the teacher (and student) policy network, and N_V be the number of hidden layers in the teacher (and student) value network. In general, the teacher and student networks could have a different number of layers, but we assume them to be the same for ease of exposition.

In the target MDP, the student policy observes a state s_targ ∈ S_targ, which is fed to the encoder to produce the embedding φ(s_targ) ∈ S_emb. Since |S_emb| = |S_src|, this embedding can be readily passed through the teacher networks to extract {z^j_θ′, 1 ≤ j ≤ N_π}, the pre-activation outputs of the N_π hidden layers of the teacher policy network, and {z^j_ψ′, 1 ≤ j ≤ N_V}, the pre-activation outputs of the N_V hidden layers of the teacher value function network. To obtain the pre-activation representations in the student networks, we feed in the state s_targ and perform a weighted linear combination of the appropriate outputs with the corresponding pre-activations from the teacher networks.
Concretely, to obtain the hidden-layer outputs h^j_{π_θ} and h^j_{V_ψ} at layer j in the student networks, we have the following:

h^j_{π_θ} = σ( p^j_θ z^j_θ + (1 − p^j_θ) z^j_{θ′} )
h^j_{V_ψ} = σ( p^j_ψ z^j_ψ + (1 − p^j_ψ) z^j_{ψ′} )        (2)

where σ is the activation function, and p^j_θ, p^j_ψ ∈ [0, 1] are layer-specific learnable parameters denoting the mixing weights. In the target MDP, the student network is optimized for the RL objective L_PPO(θ, ψ), mentioned in Equation 1. The outputs of the student policy and value networks, and hence L_PPO, depend on the encoder parameters φ through the representation sharing (Equation 2) enabled by the lateral connections stemming from the pre-trained teacher network. Therefore, an intuitive objective for shaping the embeddings such that they become task-aligned is to optimize them using the original RL loss gradient: φ ← φ − α ∇_φ L_PPO(θ, ψ, φ, θ′, ψ′). Note that L_PPO(·) now also depends on the fixed teacher parameters (θ′, ψ′).
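The following is a minimal sketch of how the lateral mixing of Equation 2 could look for the policy branch, assuming two-hidden-layer MLPs, tanh activations, and a sigmoid parameterization of the mixing weights. The class and attribute names (StudentPolicyWithTeacher, encoder, mix_logits) are hypothetical and only illustrate the equation, not the authors' code.

```python
import torch
import torch.nn as nn

class StudentPolicyWithTeacher(nn.Module):
    """Student policy whose hidden layers mix in fixed teacher pre-activations (Eq. 2)."""

    def __init__(self, s_targ_dim, s_src_dim, act_dim, hidden=64, n_layers=2):
        super().__init__()
        # Encoder phi: target state -> embedding with the source state dimension.
        self.encoder = nn.Sequential(nn.Linear(s_targ_dim, 128), nn.Tanh(),
                                     nn.Linear(128, s_src_dim))
        dims = [s_targ_dim] + [hidden] * n_layers
        t_dims = [s_src_dim] + [hidden] * n_layers
        self.student_layers = nn.ModuleList(nn.Linear(dims[j], dims[j + 1]) for j in range(n_layers))
        # Frozen teacher layers (in practice, loaded from the pre-trained teacher policy).
        self.teacher_layers = nn.ModuleList(nn.Linear(t_dims[j], t_dims[j + 1]) for j in range(n_layers))
        for p in self.teacher_layers.parameters():
            p.requires_grad_(False)
        # Layer-specific mixing weights p^j_theta, kept in [0, 1] via a sigmoid.
        self.mix_logits = nn.Parameter(torch.zeros(n_layers))
        self.head = nn.Linear(hidden, act_dim)            # e.g. the action mean

    def forward(self, s_targ):
        h_student, h_teacher = s_targ, self.encoder(s_targ)
        for j, (lin_s, lin_t) in enumerate(zip(self.student_layers, self.teacher_layers)):
            z_student = lin_s(h_student)                  # z^j_theta
            z_teacher = lin_t(h_teacher)                  # z^j_theta' (fixed teacher pre-activation)
            p_j = torch.sigmoid(self.mix_logits[j])
            h_student = torch.tanh(p_j * z_student + (1.0 - p_j) * z_teacher)   # Equation 2
            h_teacher = torch.tanh(z_teacher)             # teacher's own forward pass on the embedding
        return self.head(h_student)
```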
The learnable mixing weights p^j_θ, p^j_ψ ∈ [0, 1] control the influence of the teacher's representation on the student outputs: the higher the value, the smaller the teacher's impact. We argue that a low value for these coefficients helps in the early phases of the training process by providing the necessary information to kick-start learning. At the end of training, however, we desire that the student become completely independent of the teacher, since this helps in faster test-time deployment of the agent. To encourage this, we introduce additional coupling-loss terms that drive p^j_θ, p^j_ψ towards 1 as the training progresses:

L_coupling = − (1/N_π) Σ_{j=1}^{N_π} log p^j_θ − (1/N_V) Σ_{j=1}^{N_V} log p^j_ψ        (3)

Experimentally, we observe that although the student becomes independent of the teacher in the final stages of training, it is able to achieve the same level of performance that it would if it could still rely on the teacher.
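Under the same sigmoid parameterization of the mixing weights used in the sketch above, Equation 3 could be written as follows. This is only an illustrative reading of the loss, not the authors' implementation.

```python
import torch

def coupling_loss(policy_mix_logits, value_mix_logits, eps=1e-6):
    """L_coupling from Eq. 3: pushes the mixing weights p^j towards 1,
    weaning the student off the teacher as training progresses."""
    p_theta = torch.sigmoid(policy_mix_logits)   # p^j_theta in (0, 1)
    p_psi = torch.sigmoid(value_mix_logits)      # p^j_psi in (0, 1)
    return -(torch.log(p_theta + eps).mean() + torch.log(p_psi + eps).mean())
```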
3.2   ENRICHED EMBEDDINGS WITH MUTUAL INFORMATION MAXIMIZATION

As outlined in the previous section, at each timestep of the discrete-time target MDP, the representation distilled from the teacher networks is a fixed function f of the embedding vector generated from the current input state: f(θ′, ψ′, φ(s_targ)), where (θ′, ψ′) are fixed. It is desirable to have a high degree of correlation between s_targ and f(θ′, ψ′, φ(s_targ)) because, intuitively, the teacher representation that is the most useful for the student should be different at different input states. To aid with this, we utilize a surrogate objective that instead maximizes the correlation between s_targ and the embeddings φ(s_targ), defined using the principle of mutual information (MI). If we view s_targ as a stochastic input s, the encoder output is then also a random variable e, and the mutual information between the two is defined as

I(s; e) = H(s) − H(s|e)

where H denotes the differential entropy. Directly maximizing the MI is intractable due to the unknown conditional densities. However, it is possible to obtain a lower bound on the MI using a variational distribution q_ω(s|e) that approximates the true conditional distribution p(s|e), as follows:

I(s; e) = H(s) − H(s|e)
        = H(s) + E_{s,e}[ log p(s|e) ]
        = H(s) + E_{s,e}[ log q_ω(s|e) ] + E_e[ D_KL( p(s|e) || q_ω(s|e) ) ]
        ≥ H(s) + E_{s,e}[ log q_ω(s|e) ]

where the last inequality is due to the non-negativity of the KL divergence. This is known as the variational information maximization algorithm (Agakov & Barber, 2004). Re-writing in terms of the target-MDP states and the encoder parameters, the surrogate objective jointly optimizes over the variational and encoder parameters:

max_{ω,φ} E_{s_targ} [ log q_ω( s_targ | φ(s_targ) ) ]

where H(s) is omitted since it is a constant w.r.t. the concerned parameters. In terms of the loss function to minimize, we can succinctly write:

L_MI(φ, ω) = − E_{s∼ρ_{π_θ}} [ log q_ω( s | φ(s) ) ]        (4)

where ρ_{π_θ} is the state-visitation distribution of the student policy in the target MDP. In our experiments, we use a multivariate Gaussian distribution (with a learned diagonal covariance matrix) to model the variational distribution q_ω. Although this simple model yields good performance, more expressive model classes, such as mixture density networks and flow-based models (Rezende & Mohamed, 2015), could be readily incorporated as well, to learn complex and multi-modal distributions.
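A possible instantiation of L_MI with a diagonal-Gaussian variational distribution is sketched below. The network sizes and the names GaussianVariational and mi_loss are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GaussianVariational(nn.Module):
    """q_omega(s_targ | e): diagonal Gaussian over target states, conditioned on the embedding."""

    def __init__(self, emb_dim, s_targ_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, s_targ_dim)
        self.log_std = nn.Linear(hidden, s_targ_dim)

    def log_prob(self, s_targ, embedding):
        h = self.net(embedding)
        dist = torch.distributions.Normal(self.mean(h), self.log_std(h).clamp(-5, 2).exp())
        return dist.log_prob(s_targ).sum(dim=-1)          # log q_omega(s | e)

def mi_loss(q_omega, encoder, s_targ):
    """L_MI (Eq. 4): negative variational lower bound on I(s; e),
    averaged over states visited by the student policy."""
    return -q_omega.log_prob(s_targ, encoder(s_targ)).mean()
```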
3.3   OVERALL ALGORITHM

Figure 1 shows the schematic diagram of our complete architecture and gradient flows, along with a description of the implemented neural networks. We refer to our algorithm as MIKT, for Mutual Information based Knowledge Transfer. Algorithm 1 outlines the main steps of the training procedure.
Figure 1: Schematic diagram of our complete architecture, with a policy branch and a value-function branch (best viewed in color). The encoder parameters φ (blue) receive gradients from three sources: the policy-gradient loss L^θ_PPO, the value-function loss L^ψ_PPO, and the mutual information loss L_MI. The teacher networks (yellow) remain fixed throughout training and do not receive any gradients. In the student networks (green), the pre-activation representations are linearly combined (using learnt mixing weights) with the corresponding representations from the teacher (Equation 2). Note that this knowledge-flow occurs at all layers, although we show it only once for clarity of exposition.

Algorithm 1: Mutual Information based Knowledge Transfer (MIKT)

Input: θ′, ψ′: fixed teacher policy and value networks
       θ, ψ: student policy and value networks
       {p}: set of coupling parameters for the policy and value networks
       φ: encoder parameters
       ω: variational distribution parameters

for each iteration do
1    Run π_θ in the target MDP and collect a few trajectories τ
2    for each minibatch m ∈ τ do
3        Update θ, ψ with ∇_{θ,ψ} L_PPO(θ, ψ, φ, θ′, ψ′)
4        Update φ with ∇_φ [ L_MI(φ, ω) + L_PPO(θ, ψ, φ, θ′, ψ′) ]
5        Update ω with ∇_ω L_MI(φ, ω)
6        Update {p} using ∇_{p} [ L_coupling + L_PPO ]
7    end
8 end

In each iteration, we run the policy in the target MDP and collect a batch of trajectories. This experience is then used to compute the RL loss (Equation 1) and the mutual information loss (Equation 4), enabling the calculation of gradients for the different parameters (Lines 3–6). Using both losses to update the encoder φ helps us satisfy the desiderata on the embeddings: that they should be task-aligned and correlated with the states in the target MDP. The coupling parameters {p^j}, used for the weighted combination of the representations in the teacher and student networks, are updated with the coupling loss (Equation 3) along with the RL loss. In each iteration of the algorithm, the PPO update ensures that the state-action visitation distribution of the policy π_θ is modified by only a small amount, because of the clipping on the importance-sampling ratio (Section 2) when obtaining the PPO gradient. In addition to this, we experimentally found that enforcing an explicit KL-regularization on the policy further stabilizes learning. Let π_θ and π_{θ_old} denote the current and the rollout policy, respectively. The loss is then formalized as:

L_KL(θ, θ_old) = E_{s∼ρ_{π_{θ_old}}} [ D_KL( π_θ(·|s) || π_{θ_old}(·|s) ) ]
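The sketch below wires one minibatch update of Algorithm 1 together, reusing the hypothetical mi_loss and coupling_loss helpers from the earlier sketches and assuming an analogous student value module with its own mixing weights. The loss coefficients, optimizer grouping, the distribution(·) helper on the student policy, and the simple sample-based stand-in for the KL regularizer are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def mikt_update(batch, student_policy, student_value, q_omega,
                policy_opt, value_opt, encoder_opt, q_opt, mix_opt,
                mi_coef=1.0, couple_coef=0.01, kl_coef=0.1, clip_eps=0.2):
    """One minibatch update of MIKT (Lines 3-6 of Algorithm 1).

    `batch` is assumed to hold target-MDP states, actions, advantages,
    value targets, and rollout log-probabilities collected with pi_theta_old.
    """
    s, a, adv, v_targ, old_logp = (batch[k] for k in
        ("states", "actions", "advantages", "value_targets", "old_log_probs"))

    # RL loss: clipped policy surrogate + value regression (Eq. 1) + KL regularizer.
    dist = student_policy.distribution(s)                  # assumed helper returning a torch distribution
    logp = dist.log_prob(a).sum(-1)
    ratio = torch.exp(logp - old_logp)
    policy_loss = -torch.min(ratio * adv,
                             torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()
    value_loss = (student_value(s).squeeze(-1) - v_targ).pow(2).mean()
    kl_reg = (old_logp - logp).mean()                      # crude sample-based proxy for L_KL
    rl_loss = policy_loss + value_loss + kl_coef * kl_reg

    # Mutual information loss for the encoder and variational distribution (Eq. 4).
    mi = mi_loss(q_omega, student_policy.encoder, s)
    # Coupling loss pushing the mixing weights towards 1 (Eq. 3).
    couple = coupling_loss(student_policy.mix_logits, student_value.mix_logits)

    # A single backward pass reproduces Lines 3-6: each parameter group only
    # receives gradients from the losses that actually depend on it.
    total = rl_loss + mi_coef * mi + couple_coef * couple
    for opt in (policy_opt, value_opt, encoder_opt, q_opt, mix_opt):
        opt.zero_grad()
    total.backward()
    for opt in (policy_opt, value_opt, encoder_opt, q_opt, mix_opt):
        opt.step()
```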
(a) CentipedeFour    (b) CentipedeSix   (c) CentipedeEight (d) CpCentipedeSix (e) CpCentipedeEight           (f) Ant

Figure 2: Our MuJoCo locomotion environments. The centipede agents are configured using the details in (Wang et al., 2018),
while Ant-v2 is a popular OpenAI Gym task.

4   RELATED WORK

The concepts of knowledge transfer and information sharing between deep neural networks have been extensively researched for a wide variety of tasks in machine learning. In the context of reinforcement learning, the popular paradigms for knowledge transfer include imitation learning, meta-RL, and policy distillation, each of which is applicable under different settings and assumptions. Imitation learning algorithms (Ng et al., 2000; Ziebart et al., 2008) utilize teacher demonstrations to extract useful information (such as the teacher reward function in inverse-RL methods) and use that to accelerate student learning. In meta-RL approaches (Duan et al., 2016; Finn et al., 2017), we are generally provided with a distribution of tasks that share some structural similarity, and the objective is to discover this generalizable knowledge for accelerating the process of learning on a new task. Our work is most closely related to policy distillation methods (Rusu et al., 2015; Parisotto et al., 2015; Czarnecki et al., 2019), where pre-trained teacher networks are available and can expedite learning in dissimilar (but related) student tasks.

Prior work has considered teachers in various capacities. Rusu et al. (2016) and Liu et al. (2019) utilize learned cross-connections between the intermediate layers of teacher networks (pre-trained on various source tasks) and a student network to effectively transfer knowledge and enable more efficient learning on a target task. Ahn et al. (2019) use an objective based on the mutual information between the corresponding layers of teacher and student networks, and show gains in image classification tasks. In Hinton et al. (2015), information from a large model (teacher) is compressed into a smaller model (student) using a distillation process that uses the temperature-regulated softmax outputs from the teacher as targets to train the student. Schmitt et al. (2018) propose a large-scale population-based training pipeline that allows a student policy to leverage multiple teachers specialized in different tasks. All these aforementioned methods work in the setting where the teacher and student share the input state (observation) space.

Different from these, our approach handles the mismatch in the state-space by training an embedding space which is utilized for efficient knowledge transfer. Rozantsev et al. (2018) employ layer-wise weight regularization and evaluate on (un-)supervised tasks where the input distributions for the source and target domains have semantic similarity and are static. For RL tasks, the input distributions change dynamically as the student policy updates; it is unclear if enforcing similarity between the networks for all inputs by coupling the weights is ideal. Gamrian & Goldberg (2018) use GANs to learn a mapping from target states to source states. In addition to requiring that the source and the target domains have the same action-space, their method also relies on exploratory samples collected in the source MDP for training the GAN. In contrast, we handle the action-space mismatch and do not assume access to the source MDP for exploration.

Our work also has connections to policy distillation methods that use implicit teachers, rather than external pre-trained models. In Czarnecki et al. (2018), the authors recommend a curriculum over agents, rather than the usual curriculum over tasks. Such a curriculum trains simple agents first, the knowledge of which is then distilled into more complex agents over time. Akkaya et al. (2019) iterate on policy architectures by utilizing behavior cloning with DAgger; the new architecture (student) is trained using the old architecture (teacher). Distillation has been used in multi-task RL (Teh et al., 2017) to learn a centralized policy that captures generalizable information from policies trained on individual tasks. Gangwani & Peng (2017) combine ideas from the genetic-algorithms literature and distillation to train offspring policies that inherit the best traits of both parent policies. Since all these approaches transfer information in the action-space by minimizing the KL-divergence between state-conditional action distributions, they share the limitation that the student can only leverage a teacher with the same output (action) space. Our approach avoids this by using the representations in the different layers of the neural network for knowledge sharing, enabling transfer learning in many diverse scenarios, as shown in our experiments.
Figure 3: Performance of our transfer learning algorithm (MIKT) and the baselines (VPG, MLPP) on the MuJoCo locomotion tasks. Each plot shows the episodic reward (mean and standard deviation) against environment timesteps (×10⁶), and is titled "x to y", where x is the source (teacher) MDP and y is the target (student) MDP. (a) CentipedeFour to CentipedeEight, (b) CentipedeSix to CentipedeEight, (c) CentipedeFour to CpCentipedeSix, (d) CentipedeSix to CpCentipedeEight, (e) CentipedeFour to Ant, (f) CentipedeSix to Ant.

5   EXPERIMENTS

In this section, we perform experiments to quantify the efficacy of our algorithm, MIKT, for transfer learning in RL, and also do some qualitative analysis. We address the following questions: a) Can we do successful knowledge transfer between a teacher and a student with different state- and action-spaces? b) Are both the losses {L_PPO, L_MI} important for learning useful embeddings φ? c) How does task-similarity affect the benefits that can be reaped from MIKT?

Table 1: MuJoCo locomotion environments.

    Environment         State Dimension    Action Dimension
    CentipedeFour             97                  10
    CentipedeSix             139                  16
    CentipedeEight           181                  22
    CpCentipedeSix           139                  12
    CpCentipedeEight         181                  18
    Ant                      111                   8

Environments: We evaluate using locomotion tasks for legged robots, modeled in OpenAI Gym (Brockman et al., 2016) using the MuJoCo physics simulator. Specifically, we use the environments provided by Wang et al. (2018), where the agent structure resembles that of a centipede: it consists of repetitive torso bodies, each having two legs. Figure 2 shows an illustration of the different centipede agents. Please see (Wang et al., 2018) for a detailed description of the environment generation process. The agent is rewarded for running fast in a particular direction. Table 1 includes the state and action dimensions of all the agents. Centipede-x refers to a centipede with x legs; we use x ∈ {4, 6, 8}. We use additional environments where the centipede is crippled (some legs disabled) and denote this by Cp-Centipede-x. Finally, we include the standard Ant-v2 task from the MuJoCo suite. Note that all robots have separate state and action dimensions. Intuitively though, these locomotion tasks share an inherent structure that could be exploited for transfer learning between the centipedes of various types. We now demonstrate that our algorithm achieves this successfully.

Baselines: We compare MIKT with two baselines: a) Vanilla Policy Gradient (VPG), which learns the task in the target MDP from scratch using only the environmental rewards. Any transfer learning algorithm which effectively leverages the available teacher networks should be able to outperform this baseline that does not receive any prior knowledge it can use. We use the standard PPO (Schulman et al., 2017) algorithm for this baseline. b) MLP Pre-trained (MLPP): In our setting, the teacher and the student networks have dissimilar input and output dimensions (because the MDPs have different state- and action-spaces). A natural transfer-learning strategy is to remove the input and output layers from the pre-trained teacher and replace them with new learnable layers that match the dimensions required of the student policy (analogously, value) network. The middle stack of the deep neural network is then fine-tuned with the RL loss. Prior work has shown that such a transfer is effective in certain computer vision tasks.
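One plausible way to construct this baseline, assuming the teacher policy is a plain nn.Sequential MLP, is sketched below; the layer layout and function name are hypothetical and only illustrate the idea of reusing the teacher's middle stack.

```python
import copy
import torch.nn as nn

def build_mlpp_policy(teacher_mlp, s_targ_dim, act_targ_dim, hidden=64):
    """MLPP baseline: reuse the teacher's hidden stack, replace the input and
    output layers with new ones matching the student's dimensions.

    `teacher_mlp` is assumed to be an nn.Sequential of the form
    [Linear(src_in, hidden), Tanh, Linear(hidden, hidden), Tanh, Linear(hidden, src_out)].
    """
    hidden_stack = copy.deepcopy(teacher_mlp[1:-1])      # keep the middle layers (fine-tuned later)
    return nn.Sequential(
        nn.Linear(s_targ_dim, hidden),                   # new input layer for the target state
        *hidden_stack,
        nn.Linear(hidden, act_targ_dim),                 # new output layer for the target actions
    )
```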
Figure 4: Ablation on the importance of each of {L_PPO, L_MI} for training the encoder φ. MIKT (blue) is compared with two variants: MIKT w/o MI (L_MI not used) and MIKT w/o RL gradients (L_PPO not used). Panels (a)–(f) show the same source-to-target transfer settings as in Figure 3, with episodic reward plotted against environment timesteps (×10⁶).

and the student networks have dissimilar input and out-                                                trained and randomly initialized parameters of the stu-
put dimensions (because the MDPs have different state-                                                 dent networks. This indicates that the MLPP strategy is
and action-spaces). A natural transfer learning strategy                                               not productive for transfer learning across the RL loco-
is to remove the input and output layers from the pre-                                                 motion tasks considered. Finally, we note that our al-
trained teacher and replace them with new learnable lay-                                               gorithm (MIKT) vastly outperforms the two baselines,
ers that match the dimensions required of the student pol-                                             both achieving higher returns in earlier stages of train-
icy (analogously value) network. The middle stack of the                                               ing and reaching much higher final performance. This
deep neural network is then fine-tuned with the RL loss.                                               proves that firstly, these tasks do have a structural com-
Prior work has shown that such a transfer is effective in                                              monality such that a teacher policy trained in one task
certain computer vision tasks.                                                                         could be used advantageously to accelerate learning in a
                                                                                                       different task; and secondly, that MIKT is a successful
                                                                                                       approach for achieving such a knowledge transfer. This

5.1    EXPERIMENTAL RESULTS

Figure 3 plots the learning curves for MIKT and our two baselines in different transfer learning experiments. Each plot is titled "x to y", where x is the source (teacher) MDP and y is the target (student) MDP. We run each experiment with 5 different random seeds and plot the average episodic returns (mean and standard deviation) on the y-axis against the number of timesteps of environment interaction (2 million in total) on the x-axis. VPG does not utilize the pre-trained teachers. We observe that its performance improves with the training iterations, albeit at a sluggish pace. MLPP uses the middle stack of the pre-trained teacher network as an initialization and trains the input and output layers from scratch. It performs only on par with VPG, potentially due to the non-constructive interaction between the pre-trained and the randomly initialized parameters of the student networks. This indicates that the MLPP strategy is not productive for transfer learning across the RL locomotion tasks considered. Finally, we note that our algorithm (MIKT) vastly outperforms both baselines, achieving higher returns in the earlier stages of training and reaching a much higher final performance. This shows, firstly, that these tasks share a structural commonality, so that a teacher policy trained on one task can be used advantageously to accelerate learning on a different task; and secondly, that MIKT is a successful approach for achieving such a knowledge transfer. This works even when the teacher and student MDPs have different state- and action-spaces, and is realized by learning embeddings that are task-aligned and optimized with a mutual information loss (Algorithm 1).
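
For completeness, the per-curve aggregation behind these plots, i.e., the mean and standard deviation of episodic return across the 5 seeds at each evaluation point, can be computed as in the following minimal sketch; the array shapes and names are illustrative stand-ins for whatever logging format is used.

import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curve(timesteps, returns, label):
    # returns: array of shape (num_seeds, num_evals) with episodic returns,
    # timesteps: array of shape (num_evals,) of environment steps (up to 2e6).
    returns = np.asarray(returns)
    mean, std = returns.mean(axis=0), returns.std(axis=0)
    plt.plot(timesteps, mean, label=label)
    plt.fill_between(timesteps, mean - std, mean + std, alpha=0.3)
    plt.xlabel("Timesteps")
    plt.ylabel("Average episodic return")
    plt.legend()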

5.2    ABLATION STUDIES

Are gradients from both {LPPO, LMI} to the encoder beneficial? To quantify this, we experiment with two variants of our algorithm, each of which removes one of the components: MIKT w/o MI, which does not update φ with the mutual information loss proposed in Section 3.2, and MIKT w/o RL gradients, which omits the policy-gradient and the value-function TD-error gradient for the encoder. Figure 4 plots the performance of these variants and compares it to MIKT (which includes both losses). We note that MIKT w/o MI generally struggles to learn in the early stages of training; see, for instance, Figure 4 (c), (d). MIKT w/o RL gradients does comparatively better early in training, but it is evident that MIKT is the most performant, both in terms of early training efficiency and the average episodic returns of the final policy. This supports our design choice of using both {LPPO, LMI} to update the encoder φ.
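
For clarity, the design choice being ablated can be summarized by the following sketch of the encoder update; beta, ppo_loss and mi_loss are placeholders for the weighting and the loss terms defined earlier in the paper, so the snippet shows the structure of the update rather than an exact implementation.

import torch

def update_encoder(encoder_opt: torch.optim.Optimizer,
                   ppo_loss: torch.Tensor, mi_loss: torch.Tensor,
                   beta: float = 1.0) -> None:
    # encoder_opt is assumed to hold the encoder parameters (phi). The RL term
    # (policy-gradient plus value-function TD error) and the mutual-information
    # term both send gradients to the encoder; the two ablations correspond to
    # dropping one of the terms below.
    total = ppo_loss + beta * mi_loss
    encoder_opt.zero_grad()
    total.backward()
    encoder_opt.step()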

How sensitive is MIKT to the task-similarity? It is reasonable to assume that the benefits of transfer learning depend on the task-similarity between the teacher and the student. To better understand this in the context of our algorithm, we consider learning in the CentipedeEight environment using different types of teachers – CentipedeFour, CentipedeSix, and Hopper. In Figure 5a, we notice that the influence of the Centipede-{Four,Six} teachers is much more significant than that of the Hopper teacher. This is likely because the motion of the centipedes shares similarity, whereas the Hopper (which is trained to hop) is a dissimilar task and therefore less useful for transfer learning. In Figure 5b we plot the value of the weight on the student representation when doing a weighted linear combination with the teacher (Section 3.1). We observe with the Hopper teacher that, very early in training, the student learns to trust its own learned representations rather than incorporate knowledge from the dissimilar teacher.
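
The weight plotted in Figure 5b comes from the weighted linear combination of student and teacher representations described in Section 3.1. A minimal sketch of one such combination is given below; the sigmoid parameterization of the scalar weight is an assumption made for this example, and h_student and h_teacher stand for same-sized embeddings from the student encoder and the (encoded) teacher.

import torch
import torch.nn as nn

class WeightedCombination(nn.Module):
    # Blends student and teacher representations with a learnable weight in [0, 1];
    # the reported "normalized student weight" corresponds to w below.
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))  # w = 0.5 at initialization

    def forward(self, h_student: torch.Tensor, h_teacher: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.logit)
        return w * h_student + (1.0 - w) * h_teacher

With a dissimilar teacher such as Hopper, w rising toward 1 early in training means the combined representation is dominated by the student's own features.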

[Figure 5, panel (a): "CentipedeEight with Various Teachers", reward vs. timesteps (x1e6) for MIKT with CentipedeFour, CentipedeSix, and Hopper teachers; panel (b): "MIKT Normalized Student Weight", normalized student weight vs. timesteps for the same three teachers.]

Figure 5: Training on CentipedeEight with different teachers. (a) Transfer from a dissimilar teacher (Hopper) is less effective compared to using Centipede teachers. (b) Value of the weight on the student representation in the weighted linear combination. With the Hopper teacher, the value rises sharply in the early stages, indicating a low teacher contribution.

6    CONCLUSION

In this paper, we proposed an algorithm for transfer learning in RL where the teacher (source) and the student (target) agents can have arbitrarily different state- and action-spaces. We achieve this by learning an encoder to produce embeddings that draw out useful representations from the teacher networks. We argue that training the encoder with both the RL loss and the mutual-information loss yields rich representations, and we provide empirical validation for this as well. Our experiments on a set of challenging locomotion tasks involving many-legged centipedes show that MIKT is a successful approach for achieving knowledge transfer when the teacher and student MDPs have mismatched state- and action-spaces.

References

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first International Conference on Machine Learning, pp. 1, 2004.
Felix Agakov and David Barber. Variational information maximization for neural coding. In International Conference on Neural Information Processing, pp. 543–548. Springer, 2004.
Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9163–9171, 2019.
Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
Wojciech Marian Czarnecki, Siddhant M Jayakumar, Max Jaderberg, Leonard Hasenclever, Yee Whye Teh, Simon Osindero, Nicolas Heess, and Razvan Pascanu. Mix&match - agent curricula for reinforcement learning. arXiv preprint arXiv:1806.01780, 2018.
Wojciech Marian Czarnecki, Razvan Pascanu, Simon Osindero, Siddhant M Jayakumar, Grzegorz Swirszcz, and Max Jaderberg. Distilling policy distillation. arXiv preprint arXiv:1902.02186, 2019.
Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1126–1135. JMLR.org, 2017.
Shani Gamrian and Yoav Goldberg. Transfer learning for related reinforcement learning tasks via image-to-image translation. arXiv preprint arXiv:1806.07377, 2018.
Tanmay Gangwani and Jian Peng. Policy optimization by genetic distillation. arXiv preprint arXiv:1711.01012, 2017.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Iou-Jen Liu, Jian Peng, and Alexander G Schwing. Knowledge flow: Improve upon your teachers. arXiv preprint arXiv:1904.05878, 2019.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pp. 663–670, 2000.
Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):801–814, 2018.
Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
Simon Schmitt, Jonathan J Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M Czarnecki, Joel Z Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, et al. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835, 2018.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.
Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. NerveNet: Learning structured policy with graph neural networks. 2018.
Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.

APPENDIX

A                                   Hyper-parameters
                                  Table 2: Hyper-parameters used for all experiments.

                         Hyperparameter            Value
                         Hidden Layers             2
                         Hidden Units              64
                         Activation                tanh
                         Optimizer                 Adam
                         Learning Rate             3 x 10^-4
                         Epochs per Iteration      10
                         Minibatch Size            64
                         Discount (γ)              0.99
                         GAE parameter (λ)         0.95
                         Clip range (ε)            0.2
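
For reproducibility, the same settings can be written as a plain configuration dictionary; the key names below are chosen for readability, and the values simply restate Table 2.

ppo_config = {
    "hidden_layers": 2,
    "hidden_units": 64,
    "activation": "tanh",
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "epochs_per_iteration": 10,
    "minibatch_size": 64,
    "discount_gamma": 0.99,
    "gae_lambda": 0.95,
    "clip_range_epsilon": 0.2,
}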

B                                   Normalized Student Weights

[Figure 6, panels (a)-(f): normalized student weight vs. timesteps (x1e6) for CentipedeFour to CentipedeEight, CentipedeSix to CentipedeEight, CentipedeFour to CpCentipedeSix, CentipedeSix to CpCentipedeEight, CentipedeFour to Ant, and CentipedeSix to Ant.]

Figure 6: Plots of the normalized student weight throughout the course of training. Lower values indicate heavier
dependence on teacher representations. A value of 1 indicates the student is completely independent of the teacher.
C               KL-Regularization Ablation

[Figure 7, panels (a)-(f): reward vs. timesteps (x1e6) for the same six transfer settings (CentipedeFour to CentipedeEight, CentipedeSix to CentipedeEight, CentipedeFour to CpCentipedeSix, CentipedeSix to CpCentipedeEight, CentipedeFour to Ant, CentipedeSix to Ant), comparing MIKT with MIKT w/o KL.]

Figure 7: MIKT with KL-regularization (blue) vs. MIKT without KL-regularization (green). MIKT still works well
without the KL-regularization.