# Mutual Information Based Knowledge Transfer Under State-Action Dimension Mismatch

←

→

**Page content transcription**

If your browser does not render page correctly, please read the page content below

Mutual Information Based Knowledge Transfer Under State-Action Dimension Mismatch Michael Wan Tanmay Gangwani Jian Peng Computer Science Dept. Computer Science Dept. Computer Science Dept. UIUC UIUC UIUC mw3@illinois.edu gangwan2@illinois.edu jianpeng@illinois.edu arXiv:2006.07041v1 [stat.ML] 12 Jun 2020 Abstract 1 INTRODUCTION Deep reinforcement learning (RL), which combines the Deep reinforcement learning (RL) algorithms rigor of RL algorithms with the flexibility of universal have achieved great success on a wide variety function approximators such as deep neural networks, of sequential decision-making tasks. However, has demonstrated a plethora of success stories in recent many of these algorithms suffer from high times. These include computer and board games (Mnih sample complexity when learning from scratch et al., 2015; Silver et al., 2016), continuous control (Lilli- using environmental rewards, due to issues crap et al., 2015), and robotics (Rajeswaran et al., 2017), such as credit-assignment and high-variance to name a few. Crucially though, these methods have gradients, among others. Transfer learning, in been shown to be performant in the regime where an which knowledge gained on a source task is agent can accumulate vast amounts of experience in the applied to more efficiently learn a different but environment, usually modeled with a simulator. For real- related target task, is a promising approach to world environments such as autonomous navigation and improve the sample complexity in RL. Prior industrial processes, data generation is an expensive (and work has considered using pre-trained teacher sometimes risky) procedure. To make deep RL algo- policies to enhance the learning of the stu- rithms more sample-efficient, there is great interest in dent policy, albeit with the constraint that the designing techniques for knowledge transfer, which en- teacher and the student MDPs share the state- ables accelerating agent learning by leveraging either ex- space or the action-space. In this paper, we isting trained policies (referred to as teachers), or using propose a new framework for transfer learn- task demonstrations for imitation learning (Abbeel & Ng, ing where the teacher and the student can have 2004). One promising idea for knowledge transfer in RL arbitrarily different state- and action-spaces. is policy distillation (Rusu et al., 2015; Parisotto et al., To handle this mismatch, we produce embed- 2015; Hinton et al., 2015), where information from the dings which can systematically extract knowl- teacher policy network is transferred to a student policy edge from the teacher policy and value net- network to improve the learning process. works, and blend it into the student networks. Prior work has incorporated policy distillation in a va- To train the embeddings, we use a task-aligned riety of settings (Czarnecki et al., 2019). Some ex- loss and show that the representations could amples include the transfer of knowledge from simple be enriched further by adding a mutual in- to complex agents while following a curriculum over formation loss. Using a set of challenging agents (Czarnecki et al., 2018), learning a centralized simulated robotic locomotion tasks involving policy that captures shared behavior across tasks for many-legged centipedes, we demonstrate suc- multi-task RL (Teh et al., 2017), distilling information cessful transfer learning in situations when the from parent policies into a child policy for a genetically- teacher and student have different state- and inspired RL algorithm (Gangwani & Peng, 2017), and action-spaces. speeding-up large-scale population-based training using multiple teachers (Schmitt et al., 2018). A common mo- tif in these approaches is the use of Kullback-Leibler (KL) divergence between the state-conditional action Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.

distributions of the teacher and student networks, as the embeddings must be aligned to serve that goal. Secondly, minimization objective for knowledge transfer. While we would like the embeddings to be correlated with the simple and intuitive, this restricts learning from teach- states encountered by the student policy. The embed- ers that have the same output (action) space as the stu- dings are used to deterministically draw out knowledge dent, since KL divergence is only defined for distribution from the teacher network. Therefore, a high correlation over a common space. An alternative to knowledge shar- ensures that the most suitable teacher guidance is derived ing in the action-space is information transfer through for each student state. We achieve this by maximizing the embedding-space formed via the different layers of a the mutual information between the embeddings and stu- deep neural network. (Liu et al., 2019) provides an ex- dent states. We evaluate our method on a set of challeng- ample of this; it utilizes learned lateral connections be- ing robotic locomotion tasks modeled using the MuJoCo tween intermediate layers of the teacher and student net- simulator. We demonstrate the successful transfer of works. Although the action-spaces can now be different, knowledge from trained teachers to students, in the sce- the state-space is still required to be identical between the nario of mismatched state- and action-space. This leads teacher and the student, since the same input observation to appreciable gains in sample-efficiency, compared to is fed to both the networks (Liu et al., 2019). RL from scratch using only the environmental rewards. In our work, we present a transfer learning approach to accelerate the training of the student policy, by leverag- 2 BACKGROUND ing teacher policies trained in an environment with differ- ent state- and action-space. Arguably, there is a huge po- We consider the RL setting where the environment is tential for data-efficient student learning by tapping into modeled as an infinite-horizon discrete-time Markov De- teachers trained on dissimilar, but related tasks. For in- cision Process (MDP). The MDP is characterized by the stance, consider an available teacher policy for locomo- tuple (S, A, R, T , γ, p0 ), where S and A are the con- tion of a quadruped robot, where the (97-dimensional) tinuous state- and action-space, respectively, γ ∈ [0, 1) state-space is the set of joint-angles and joint-velocities is the discount factor, and p0 is the initial state distri- and the (10-dimensional) action-space is the torques to bution. Given an action at ∈ A, the next state is sam- the joints. If we wish to learn locomotion for a hexa- pled from the transition dynamics distribution, st+1 ∼ pod robot (state-dimension 139, action-dimension 16), T (st+1 |st , at ), and the agent receives a scalar reward we conjecture that the learning could be kick-started by r(st , at ) determined by the reward function R. A policy harnessing the information stored in the trained neural πθ (at |st ) defines the state-conditioned distribution over network for the quadruped, since both the tasks are lo- actions. The RL objective is to learn the policy parame- comotion for legged robots and therefore share an inher- ters (θ) to maximize theexpected P∞ t discounted sum of re- ent structure. However, the dissimilar state- and action- wards, η(πθ ) = Ep0 ,T ,π t=0 γ r(st , at ) . space preclude the use of the knowledge transfer mecha- Policy-gradient algorithms (Sutton et al., 2000) are nisms proposed in prior work. widely used to estimate the gradient of the RL objec- Our approach deals with the mismatch in the state- and tive. Proximal policy optimization (PPO, Schulman et al. action-space of the teacher and student in the following (2017)) is a model-free policy-gradient algorithm that manner. To handle disparate actions, rather than using serves as an efficient approximation to trust-region meth- divergence minimization in the action-space, we transfer ods (Schulman et al., 2015a). In each iteration of PPO, knowledge by augmenting representations in the layers the rollout policy (πθold ) is used to collect sample trajec- of the student network with representations from the lay- tories τ and the following surrogate loss is minimized ers of the teacher network. This is similar to the knowl- over multiple epochs: edge flow in (Liu et al., 2019) using lateral connections, h i but with the important difference that we do not employ LθPPO = −Eτ min rt (θ) Ât , clip (rt (θ) , 1 − , 1 + ) Ât learnable matrices to transform the teacher representa- tion. The mismatch in the observation- or state-space where rt (θ) = ππθθ (a(at |s t) t |st ) is the ratio of the action prob- old has not been considered in prior literature, to the best abilities under the current policy and rollout policy, and of our knowledge. We manage this by learning an em- Ât is the estimated advantage. Variance in the policy- bedding space which can be used to extract the neces- gradient estimates is reduced by employing the state- sary information from the available teacher policy net- value function as a control variate (Mnih et al., 2016). work. These embeddings are trained to adhere to two This is usually modeled as a neural network Vψ and up- properties. Firstly, they must be task-aligned. Our RL dated using temporal difference learning: objective is the maximization of cumulative discounted h i rewards in the student environment, and therefore, the targ 2 Lψ PPO = −E τ V ψ (st ) − V t

where Vttarg is the bootstrapped target value obtained with tween the input states of the target MDP and the embed- TD(λ). To further reduce variance, Generalized Advan- ding vectors produced from them. The embeddings are tage Estimation (GAE, Schulman et al. (2015b)) is used used to deterministically derive representations from the when estimating advantage. The overall PPO minimiza- teacher network, and hence a high correlation helps to tion objective then is: obtain the most appropriate teacher guidance for each of the states encountered by the target policy. To this end, LPPO (θ, ψ) = LθPPO + Lψ PPO (1) we propose a mutual information maximization objec- tive; this is detailed in subsection 3.2. Although we use the PPO objective for our experiments, our method can be readily combined with any on-policy or off-policy actor-critic RL algorithm. 3.1 TASK-ALIGNED EMBEDDING SPACE This section describes our approach for training the en- 3 METHOD coder parameters (φ) such that the generated embeddings are aligned with the RL objective. We begin by detail- In this section, we outline our method for distilling ing the architecture that we use for transfer of knowledge knowledge from a pre-trained teacher policy to a stu- from a teacher, pre-trained in source MDP, to a student dent policy, in the hope that such knowledge sharing im- policy in the target MDP with different state- and action- proves the sample-efficiency of the student learning pro- space. Inspired by the concept of knowledge-flow used cess. Our problem setting is as follows. We assume that in (Liu et al., 2019), we employ lateral connections be- the teacher and the student policies operate in two differ- tween the student and teacher networks, which augment ent MDPs. All the MDP properties (S, A, R, T , γ, p0 ) the representations in the layers of the student with useful could be different, provided that some high-level struc- representations from the layers of the teacher. A crucial tural commonality exists between the MDPs, such as the benefit of this approach is that since information sharing example of transfer from a quadruped robot to hexapod happens through the hidden layers, the output (action) robot introduced in Section 1. Henceforth, for notational space of the source and target MDPs can be disparate, convenience, we refer to the MDP of the teacher as the as is the scenario in our experiments. It is also quite source MDP, and that of the student as the target MDP. straightforward to include multiple teachers in this ar- We assume the availability of a teacher policy network chitecture to distill diverse knowledge into a student; we pre-trained in the source MDP. Crucially though, we do leave this to future work. not assume access to the source MDP for any further ex- ploration, or for obtaining demonstration trajectories that We draw out knowledge from both the teacher policy and could be used for training in the target MDP using cross- state-value networks. We denote the teacher policy and domain imitation-learning techniques. We instead focus value network with πθ0 and Vψ0 , respectively, where the on extracting representations from the teacher policy net- parameters (θ0 , ψ 0 ) are held fixed throughout the train- work which are useful for learning in the target MDP. ing. Analogously, (θ, ψ) are the trainable parameters for the student policy and value networks. Let Nπ denote In this work, we address knowledge transfer when Ssrc 6= the number of hidden layers in the teacher (and student) Starg , where Ssrc and Starg denote the state-space of the policy network, and NV be the number of hidden layers source and target MDPs, respectively. To handle the in the teacher (and student) value network. In general, mismatch, we introduce a learned embedding-space pa- the teacher and student networks could have a different rameterized by an encoder function φ(·), and defined as number of layers, but we assume them to be the same for Semb := {φ(s) | s ∈ Starg }. Data points from this embed- ease of exposition. ding space are used to extract useful information from the teacher policy network. Therefore, we further en- In the target MDP, the student policy observes a state force that the dimension of the embedding space matches starg ∈ Starg , which is fed to the encoder to produce the dimension of the state-space in the source MDP, i.e., the embedding φ(starg ) ∈ Semb . Since |Semb | = |Ssrc |, |Semb | = |Ssrc |. Note that this does not necessitate that this embedding can be readily passed through the teacher any embedding vector s ∈ Semb be a feasible input state networks to extract {zθj0 , 1 ≤ j ≤ Nπ }, representing in the source MDP. To learn the encoder function φ(·), the pre-activation outputs of the Nπ hidden layers of the we consider the following two desiderata. Firstly, the teacher policy network, and {zψj 0 , 1 ≤ j ≤ NV }, rep- embeddings must be learned to facilitate our objective of resenting the pre-activation outputs of the NV hidden maximizing the cumulative discount rewards in the tar- layers of the teacher value function network. To ob- get MDP. In subsection 3.1, we show how to achieve this tain the pre-activation representations in the student net- by utilizing the policy gradient to update embedding pa- works, we feed in the state starg and perform a weighted rameters. Secondly, we wish for a high correlation be- linear combination of the appropriate outputs with the

corresponding pre-activations from the teacher networks. be different at different input states. To aid with this, Concretely, to obtain the hidden layer outputs hjπθ and we utilize a surrogate objective that instead maximizes hjVψ at layer j in the student networks, we have the fol- the correlation between starg and the embeddings φ(starg ), lowing: defined using the principle of mutual information (MI). If we view starg as a stochastic input s, the encoder output hjπθ = σ pjθ zθj + (1 − pjθ )zθj0 is then also a random variable e, and the mutual informa- (2) tion between the two is defined as: hjVψ = σ pjψ zψj + (1 − pjψ )zψj 0 I(s; e) = H(s) − H(s|e) where σ is the activation function, and pjθ , pjψ ∈ [0, 1] are layer-specific learnable parameters denoting the mix- where H denotes the differential entropy. Direct maxi- ing weights. In the target MDP, the student network is mizing of the MI is intractable due to the unknown condi- optimized for the RL objective LPPO (θ, ψ), mentioned in tional densities. However, it is possible to obtain a lower Equation 1. The outputs of the student policy and value bound to the MI using a variational distribution qω (s|e) networks, and hence LPPO , depend on the encoder pa- that approximates the true conditional distribution p(s|e) rameters (φ) through the representation sharing (Equa- as follows: tion 2) enabled by the lateral connections stemming from I(s; e) = H(s) − H(s|e) the pre-trained teacher network. Therefore, an intuitive = H(s) + Es,e [log p(s|e)] objective for shaping the embeddings such that they be- come task-aligned is to optimize them using the original = H(s) + Es,e [log qω (s|e)] RL loss gradient: φ ← φ − α∇φ LPPO (θ, ψ, φ, θ0 , ψ 0 ). + Ee DKL (p(s|e)||qω (s|e)) Note that LPPO (·) now also depends on the fixed teacher ≥ H(s) + Es,e [log qω (s|e)] parameters (θ0 , ψ 0 ). where the last inequality is due to the non-negativity of The learnable mixing weights pjθ , pjψ ∈ [0, 1] control the the KL divergence. This is known as the variational in- influence of the teacher’s representation on the student formation maximization algorithm (Agakov & Barber, outputs – higher the value, lesser the impact. We ar- 2004). Re-writing in terms of target-MDP states and the gue that a low value for these coefficients helps in the encoder parameters, the surrogate objective jointly opti- early phases of the training process by providing nec- mizes over the variational and encoder parameters: essary information to kick-start learning. At the end of the training, however, we desire that the student becomes max Estarg [log qω (starg |φ(starg ))] completely independent of the teacher, since this helps ω,φ in faster test-time deployment of the agent. To encour- where H(s) is omitted since it is a constant w.r.t the con- age this, we introduce additional coupling-loss terms that cerned parameters. In terms of the loss function to mini- drive pjθ , pjψ towards 1 as the training progresses: mize, we can succinctly write: Nπ NV 1 X 1 X j LMI (φ, ω) = −Es∼ρπθ [log qω (s|φ(s))] (4) Lcoupling =− log pθ − log pjψ (3) Nπ j=1 NV j=1 where ρπθ is the state-visitation distribution of the stu- Experimentally, we observe that although the student be- dent policy in the target MDP. In our experiments, we use comes independent in the final stages of training, it is a multivariate Gaussian distribution (with a learned diag- able to achieve the same level of performance that it onal covariance matrix) to model the variational distribu- would if it could still rely on the teacher. tion qω . Although this simple model yields good perfor- mance, more expressive model classes, such as mixture 3.2 ENRICHED EMBEDDINGS WITH MUTUAL density networks and flow-based models (Rezende & INFORMATION MAXIMIZATION Mohamed, 2015) could be readily incorporated as well, to learn complex and multi-modal distributions. As outlined in the previous section, at each timestep of the discrete-time target MDP, the representation distilled 3.3 OVERALL ALGORITHM from the teacher networks is a fixed function f of the embedding vector generated from the current input state: Figure 1 shows the schematic diagram of our complete f (θ0 , ψ 0 , φ(starg )), where (θ0 , ψ 0 ) are fixed. It is desirable architecture and gradient flows, along with a description to have a high degree of correlation between starg and of the implemented neural networks. We refer to our al- f (θ0 , ψ 0 , φ(starg )) because, intuitively, the teacher repre- gorithm as MIKT, for Mutual Information based Knowl- sentation that is the most useful for the student should edge Transfer. Algorithm 1 outlines the main steps of the

Policy Value-fn Branch Branch Val e Ac i (F d) + + Figure 1: Schematic diagram of our complete architecture (best viewed in color). The encoder parameters φ (blue) receive gradients from three sources: the policy-gradient loss LθPPO , the value function loss Lψ PPO , and the mutual information loss LMI . The teacher networks (yellow) remain fixed throughout training and do not receive any gradients. In the student networks (green), the pre-activation representations are linearly combined (using learnt mixing weights) with the corresponding representations from the teacher (Equation 2). Note that this knowledge-flow occurs at all layers, although we show it only once for clarity of exposition. Algorithm 1: Mutual Information based Knowledge experience is then used to compute the RL loss (Equa- Transfer (MIKT) tion 1) and the mutual information loss (Equation 4), en- Input : θ0 , ψ 0 fixed teacher policy and value networks abling the calculation of gradients for the different pa- θ, ψ: student policy and value networks rameters (Lines 3–6). Using both the losses to update {p}: set of coupling parameters for policy and value the encoder (φ) helps us to satisfy the desiderata on the networks embeddings – that they should be task-aligned and corre- φ: encoder parameters lated with the states in the target MDP. The coupling pa- ω: variational distribution parameters rameters {pj }, used for the weighted combination of the representations in the teacher and student networks, are for each iteration do updated with the coupling-loss (Equation 3) along with 1 Run πθ in target MDP and collect few trajectories τ the RL loss. In each iteration of the algorithm, the PPO 2 for each minibatch m ∈ τ do update ensures that the state-action visitation distribution 3 Update θ, ψ with ∇θ,ψ LPPO (θ, ψ, φ, θ0 , ψ 0 ) of the policy πθ is modified by only a small amount. This 4 Update φ with is because of the clipping on the importance-sampling ra- ∇φ LMI (φ, ω) + LPPO (θ, ψ, φ, θ0 , ψ 0 ) tio (Section 2) when obtaining the PPO gradient. In ad- 5 Update ω with ∇ω LMI (φ, ω) dition to this, we experimentally found that enforcing an 6 Update {p} using [Lcoupling + LPPO ] explicit KL-regularization on the policy further stabilizes 7 end learning. Let πθ , πθold denote the current and the rollout 8 end policy, respectively. The loss is then formalized as: LKL (θ, θold ) = Es∼ρπθ DKL (πθ (·|s)||πθold (·|s)) old training procedure. In each iteration, we run the policy in the target MDP and collect a batch of trajectories. This

(a) CentipedeFour (b) CentipedeSix (c) CentipedeEight (d) CpCentipedeSix (e) CpCentipedeEight (f) Ant Figure 2: Our MuJoCo locomotion environments. The centipede agents are configured using the details in (Wang et al., 2018), while Ant-v2 is a popular OpenAI Gym task. 4 RELATED WORK Different from these, our approach handles the mismatch in the state-space by training an embedding space which is utilized for efficient knowledge transfer. Rozantsev The concepts of knowledge transfer and information et al. (2018) employ layer-wise weight regularization and sharing between deep neural networks have been ex- evaluate on (un-)supervised tasks where the input dis- tensively researched for a wide variety of tasks in ma- tributions for source and target domains have semantic chine learning. In the context of reinforcement learning, similarity and are static. For RL tasks, the input distribu- the popular paradigms for knowledge transfer include tions change dynamically as the student policy updates; imitation-learning, meta-RL, and policy distillation; each it is unclear if enforcing similarity between the networks of these being applicable under different settings and for all inputs by coupling the weights is ideal. Gamrian assumptions. Imitation learning algorithms (Ng et al., & Goldberg (2018) use GANs to learn a mapping from 2000; Ziebart et al., 2008) utilize teacher demonstrations target states to source states. In addition to requiring that to extract useful information (such as the teacher reward the source and the target domains have the same action- function in inverse-RL methods) and use that to acceler- space, their method also relies on the exploratory sam- ate student learning. In meta-RL approaches (Duan et al., ples collected in the source MDP for training the GAN. 2016; Finn et al., 2017), we are generally provided with a In contrast, we handle the action-space mismatch and do distribution of tasks that share some structural similarity, not assume access to the source MDP for exploration. and the objective is to discover this generalizable knowl- edge for accelerating the process of learning on a new Our work also has connections to policy distillation task. Our work is most closely related to policy distil- methods that use implicit teachers, rather than external lation methods (Rusu et al., 2015; Parisotto et al., 2015; pre-trained models. In Czarnecki et al. (2018), the au- Czarnecki et al., 2019), where pre-trained teacher net- thors recommend a curriculum over agents, rather than works are available and can expedite learning in dissim- the usual curriculum over tasks. Such a curriculum ilar (but related) student tasks. trains simple agents first, the knowledge of which is then distilled into more complex agents over time. Akkaya Prior work has considered teachers in various capac- et al. (2019) iterate on policy architectures by utiliz- ities. Rusu et al. (2016) and Liu et al. (2019) utilize ing behavior-cloning with DAgger; the new architecture learned cross-connections between intermediate layers (student) is trained using the old architecture (teacher). of teacher networks—that have been pre-trained on var- Distillation has been used in multi-task RL (Teh et al., ious source tasks—and a student network to effectively 2017) to learn a centralized policy that captures gen- transfer knowledge and enable more efficient learning on eralizable information from policies trained on individ- a target task. Ahn et al. (2019) use an objective based on ual tasks. (Gangwani & Peng, 2017) combine ideas from the mutual information between the corresponding lay- the genetic-algorithms literature and distillation to train ers of teacher and student networks, and show gains in offspring policies that inherit the best traits of both image classification tasks. In Hinton et al. (2015), in- the parent policies. Since all these approaches trans- formation from a large model (teacher) is compressed fer information in the action-space by minimizing the into a smaller model (student) using a distillation process KL-divergence between state-conditional action distribu- that uses the temperature-regulated softmax outputs from tions, they share the limitation that the student can only the teacher as targets to train the student. Schmitt et al. leverage a teacher with the same output (action) space. (2018) propose a large-scale population-based training Our approach avoids this by using the representations in pipeline that allows a student policy to leverage multiple the different layers of the neural network for knowledge teachers specialized in different tasks. All these afore- sharing, enabling transfer-learning in many diverse sce- mentioned methods work in the setting where the teacher narios as shown in our experiments. and student share the input state (observation) space.

CentipedeFour to CentipedeEight CentipedeSix to CentipedeEight CentipedeFour to CpCentipedeSix 3500 3500 2500 3000 3000 2000 2500 2500 2000 2000 1500 Reward Reward Reward 1500 1500 1000 1000 1000 500 500 500 MIKT MIKT MIKT 0 MLPP MLPP MLPP 0 VPG VPG 0 VPG 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 Timesteps 1e6 Timesteps 1e6 Timesteps 1e6 (a) (b) (c) CentipedeSix to CpCentipedeEight CentipedeFour to Ant CentipedeSix to Ant 3000 3000 3000 2500 2500 2500 2000 2000 2000 Reward Reward Reward 1500 1500 1500 1000 1000 1000 500 500 500 MIKT 0 MIKT MIKT 0 MLPP MLPP MLPP 0 VPG VPG VPG −500 −500 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 Timesteps 1e6 Timesteps 1e6 Timesteps 1e6 (d) (e) (f) Figure 3: Performance of our transfer learning algorithm (MIKT) and the baselines (VPG, MLPP) on the MuJoCo locomotion tasks. Each plot is titled “x to y”, where x is the source (teacher) MDP and y is the target (student) MDP. (a) CentipedeFour to CentipedeEight, (b) CentipedeSix to CentipedeEight, (c) CentipedeFour to CpCentipedeSix, (d) CentipedeSix to CpCentipedeEight, (e) CentipedeFour to Ant, (f) CentipedeSix to Ant. 5 EXPERIMENTS a centipede – it consists of repetitive torso bodies, each having two legs. Figure 2 shows an illustration of the dif- In this section, we perform experiments to quantify the ferent centipede agents. Please see (Wang et al., 2018) efficacy of our algorithm, MIKT, for transfer learning in for a detailed description of the environment generation RL, and also do some qualitative analysis. We address process. The agent is rewarded for running fast in a par- the following questions: a) Can we do successful knowl- ticular direction. Table 1 includes the state and action edge transfer between a teacher and a student with dif- dimensions of all the agents. Centipede-x refers to a ferent state- and action-space? b) Are both the losses centipede with x legs; we use x ∈ {4, 6, 8}. We use {LPPO , LMI } important for learning useful embeddings additional environments where the centipede is crippled φ? c) How does task-similarity affect the benefits that (some legs disabled) and denote this by Cp-Centipede- can be reaped from MIKT? x. Finally, we include the standard Ant-v2 task from the MuJoCo suite. Note that all robots have separate state Table 1: MuJoCo locomotion environments. and action dimensions. Intuitively though, these loco- motion tasks share an inherent structure that could be ex- Environment State Dimension Action Dimension ploited for transfer learning between the centipedes of CentipedeFour 97 10 various types. We now demonstrate that our algorithm CentipedeSix 139 16 achieves this successfully. CentipedeEight 181 22 CpCentipedeSix 139 12 Baselines: We compare MIKT with two baselines: a) CpCentipedeEight 181 18 Vanilla Policy Gradient (VPG), which learns the task in Ant 111 8 the target MDP from scratch using only the environmen- tal rewards. Any transfer learning algorithm which ef- Environments: We evaluate using locomotion tasks fectively leverages the available teacher networks should for legged robots, modeled in OpenAI Gym (Brock- be able to outperform this baseline that does not receive man et al., 2016) using the MuJoCo physics simulator. any prior knowledge it can use. We use the standard Specifically, we use the environments provided by Wang PPO (Schulman et al., 2017) algorithm for this baseline. et al. (2018), where the agent structure resembles that of b) MLP Pre-trained (MLPP) In our setting, the teacher

CentipedeFour to CentipedeEight CentipedeSix to CentipedeEight CentipedeFour to CpCentipedeSix 3500 3500 3000 3000 3000 2500 2500 2500 2000 2000 2000 Reward Reward Reward 1500 1500 1500 1000 1000 1000 500 500 MIKT MIKT 500 MIKT MIKT w/o MI MIKT w/o MI MIKT w/o MI 0 0 MIKT w/o RL gradients MIKT w/o RL gradients 0 MIKT w/o RL gradients 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 Timesteps 1e6 Timesteps 1e6 Timesteps 1e6 (a) (b) (c) CentipedeSix to CpCentipedeEight CentipedeFour to Ant CentipedeSix to Ant 3000 3000 3000 2500 2500 2500 2000 2000 2000 Reward Reward Reward 1500 1500 1500 1000 1000 1000 500 500 500 MIKT 0 MIKT 0 MIKT MIKT w/o MI MIKT w/o MI MIKT w/o MI 0 MIKT w/o RL gradients MIKT w/o RL gradients MIKT w/o RL gradients −500 −500 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 Timesteps 1e6 Timesteps 1e6 Timesteps 1e6 (d) (e) (f) Figure 4: Ablation on the importance of each of {LPPO , LMI } for training the encoder φ. MIKT (blue) is compared with two variants: MIKT w/o MI (LMI not used) and MIKT w/o RL gradients (LPPO not used). and the student networks have dissimilar input and out- trained and randomly initialized parameters of the stu- put dimensions (because the MDPs have different state- dent networks. This indicates that the MLPP strategy is and action-spaces). A natural transfer learning strategy not productive for transfer learning across the RL loco- is to remove the input and output layers from the pre- motion tasks considered. Finally, we note that our al- trained teacher and replace them with new learnable lay- gorithm (MIKT) vastly outperforms the two baselines, ers that match the dimensions required of the student pol- both achieving higher returns in earlier stages of train- icy (analogously value) network. The middle stack of the ing and reaching much higher final performance. This deep neural network is then fine-tuned with the RL loss. proves that firstly, these tasks do have a structural com- Prior work has shown that such a transfer is effective in monality such that a teacher policy trained in one task certain computer vision tasks. could be used advantageously to accelerate learning in a different task; and secondly, that MIKT is a successful approach for achieving such a knowledge transfer. This 5.1 EXPERIMENTAL RESULTS works even when the teacher and student MDPs have dif- ferent state- and action-spaces, and is realized by learn- Figure 3 plots the learning curves for MIKT and our two ing embeddings that are task-aligned and are optimized baselines in different transfer learning experiments. Each with a mutual information loss (Algorithm 1). plot is titled “x to y”, where x is the source (teacher) MDP and y is the target (student) MDP. We run each experiment with 5 different random seeds and plot the 5.2 ABLATION STUDIES average episodic returns (mean and standard deviation) on the y-axis, against the number of timesteps of envi- Are gradients from both {LPPO , LMI } to the encoder ronment interaction (2 million total) on the x-axis. VPG beneficial? To quantify this, we experiment with two does not use utilize the pre-trained teachers. We ob- variants of our algorithm, each of which removes one serve that its performance improves with the training it- of the components: MIKT w/o MI, which does not up- erations, albeit at a sluggish pace. MLPP uses the mid- date φ with the mutual information loss proposed in Sec- dle stack of the pre-trained teacher network as an ini- tion 3.2, and MIKT w/o RL gradients, which omits using tialization and trains the input and output layers from the policy-gradient and the value function TD-error gra- scratch. It only performs on par with VPG, potentially dient for the encoder. Figure 4 plots the performance of due to the non-constructive interaction between the pre- these variants and compares it to MIKT (which includes

both the losses). We note that MIKT w/o MI generally action-spaces. We achieve this by learning an encoder to struggles to learn in the early stages of training; see for produce embeddings that draw out useful representations instance Figure 4 (c), (d). MIKT w/o RL gradients does from the teacher networks. We argue that training the en- comparatively better early on in training, but it is evi- coder with both the RL-loss and the mutual information- dent that MIKT is the most performant, both in terms of loss yields rich representations; we provide empirical early training efficiency and the average episodic returns validation for this as well. Our experiments on a set of the final policy. This supports our design choice of of challenging locomotion tasks involving many-legged using both {LPPO , LMI } to update the encoder φ. centipedes show that MIKT is a successful approach for achieving knowledge transfer when the teacher and stu- How sensitive is MIKT to the task-similarity? It is dent MDPs have mismatched state- and action-space. reasonable to assume that the benefits of transfer learning depend on the task-similarity between the teacher and the student. To better understand this in the context of our al- References gorithm, we consider learning in the CentipedeEight en- Pieter Abbeel and Andrew Y Ng. Apprenticeship learn- vironment using different types of teachers – Centipede- ing via inverse reinforcement learning. In Proceed- Four, CentipedeSix, Hopper. In Figure 5a, we notice ings of the twenty-first international conference on that the influence of the Centipede-{Four,Six} teachers Machine learning, pp. 1, 2004. is much more significant than the Hopper teacher. This is likely because the motion of the centipedes shares sim- Felix Agakov and David Barber. Variational information ilarity, whereas the Hopper (which is trained to hop) is maximization for neural coding. In International Con- a dissimilar task and therefore less useful for transfer ference on Neural Information Processing, pp. 543– learning. In Figure 5b we plot the value of the weight 548. Springer, 2004. on the student representation, when doing a weighed lin- Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D ear combination with the teacher (Section 3.1). We ob- Lawrence, and Zhenwen Dai. Variational information serve with the Hopper teacher that, very early in training, distillation for knowledge transfer. In Proceedings of the student learns to trust its own learned representations the IEEE Conference on Computer Vision and Pattern rather than incorporate knowledge from the dissimilar Recognition, pp. 9163–9171, 2019. teacher. Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, CentipedeEight with Various Teachers MIKT Normalized Student Weight Mateusz Litwin, Bob McGrew, Arthur Petron, Alex 3500 1.0 Paino, Matthias Plappert, Glenn Powell, Raphael 3000 0.9 Ribas, et al. Solving rubik’s cube with a robot hand. Normalized Student Weight 2500 arXiv preprint arXiv:1910.07113, 2019. 0.8 2000 Greg Brockman, Vicki Cheung, Ludwig Pettersson, Reward 1500 0.7 Jonas Schneider, John Schulman, Jie Tang, and Wo- 1000 0.6 jciech Zaremba. Openai gym, 2016. 500 MIKT w/ CentipedeFour MIKT w/ CentipedeFour Teacher 0 MIKT w/ CentipedeSix MIKT w/ Hopper 0.5 MIKT w/ CentipedeSix Teacher MIKT w/ Hopper Teacher Wojciech Marian Czarnecki, Siddhant M Jayakumar, 0.0 0.5 1.0 Timesteps 1.5 2.0 1e6 0.00 0.25 0.50 0.75 1.00 1.25 Timesteps 1.50 1.75 2.00 1e6 Max Jaderberg, Leonard Hasenclever, Yee Whye Teh, Simon Osindero, Nicolas Heess, and Razvan Pascanu. (a) (b) Mix&match-agent curricula for reinforcement learn- ing. arXiv preprint arXiv:1806.01780, 2018. Figure 5: Training on CentipedeEight with different teachers. (a) Transfer from a dissimilar teacher (Hop- Wojciech Marian Czarnecki, Razvan Pascanu, Simon per) is less effective compared to using Centipede teach- Osindero, Siddhant M Jayakumar, Grzegorz Swirszcz, ers. (b) Value of the weight on the student representa- and Max Jaderberg. Distilling policy distillation. tion in the weighed linear combination. With the Hopper arXiv preprint arXiv:1902.02186, 2019. teacher, the value rises sharply in the early stages, indi- Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, cating a low teacher contribution. Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforce- ment learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016. 6 CONCLUSION Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model- agnostic meta-learning for fast adaptation of deep net- In this paper, we proposed an algorithm for transfer works. In Proceedings of the 34th International Con- learning in RL where the teacher (source) and the stu- ference on Machine Learning-Volume 70, pp. 1126– dent (task) agents can have arbitrarily different state- and 1135. JMLR. org, 2017.

Shani Gamrian and Yoav Goldberg. Transfer learning Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gul- for related reinforcement learning tasks via image-to- cehre, Guillaume Desjardins, James Kirkpatrick, Raz- image translation. arXiv preprint arXiv:1806.07377, van Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, 2018. and Raia Hadsell. Policy distillation. arXiv preprint Tanmay Gangwani and Jian Peng. Policy optimization by arXiv:1511.06295, 2015. genetic distillation. arXiv preprint arXiv:1711.01012, Andrei A Rusu, Neil C Rabinowitz, Guillaume Des- 2017. jardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Had- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- sell. Progressive neural networks. arXiv preprint ing the knowledge in a neural network. arXiv preprint arXiv:1606.04671, 2016. arXiv:1503.02531, 2015. Simon Schmitt, Jonathan J Hudson, Augustin Zidek, Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Simon Osindero, Carl Doersch, Wojciech M Czar- Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, necki, Joel Z Leibo, Heinrich Kuttler, Andrew Zisser- and Daan Wierstra. Continuous control with deep rein- man, Karen Simonyan, et al. Kickstarting deep rein- forcement learning. arXiv preprint arXiv:1509.02971, forcement learning. arXiv preprint arXiv:1803.03835, 2015. 2018. Iou-Jen Liu, Jian Peng, and Alexander G Schwing. John Schulman, Sergey Levine, Pieter Abbeel, Michael Knowledge flow: Improve upon your teachers. arXiv Jordan, and Philipp Moritz. Trust region policy op- preprint arXiv:1904.05878, 2019. timization. In International Conference on Machine Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Learning, pp. 1889–1897, 2015a. Andrei A Rusu, Joel Veness, Marc G Bellemare, John Schulman, Philipp Moritz, Sergey Levine, Michael Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Jordan, and Pieter Abbeel. High-dimensional contin- Georg Ostrovski, et al. Human-level control through uous control using generalized advantage estimation. deep reinforcement learning. Nature, 518(7540):529– arXiv preprint arXiv:1506.02438, 2015b. 533, 2015. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Radford, and Oleg Klimov. Proximal policy optimiza- Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, tion algorithms. arXiv preprint arXiv:1707.06347, David Silver, and Koray Kavukcuoglu. Asynchronous 2017. methods for deep reinforcement learning. In Inter- David Silver, Aja Huang, Chris J Maddison, Arthur national Conference on Machine Learning, pp. 1928– Guez, Laurent Sifre, George Van Den Driessche, Ju- 1937, 2016. lian Schrittwieser, Ioannis Antonoglou, Veda Panneer- Andrew Y Ng, Stuart J Russell, et al. Algorithms for shelvam, Marc Lanctot, et al. Mastering the game of inverse reinforcement learning. In Icml, pp. 663–670, go with deep neural networks and tree search. nature, 2000. 529(7587):484–489, 2016. Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdi- Richard S Sutton, David A McAllester, Satinder P Singh, nov. Actor-mimic: Deep multitask and transfer rein- and Yishay Mansour. Policy gradient methods for re- forcement learning. arXiv preprint arXiv:1511.06342, inforcement learning with function approximation. In 2015. Advances in neural information processing systems, Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, pp. 1057–1063, 2000. Giulia Vezzani, John Schulman, Emanuel Todorov, Yee Teh, Victor Bapst, Wojciech M Czarnecki, John and Sergey Levine. Learning complex dexterous Quan, James Kirkpatrick, Raia Hadsell, Nicolas manipulation with deep reinforcement learning and Heess, and Razvan Pascanu. Distral: Robust multitask demonstrations. arXiv preprint arXiv:1709.10087, reinforcement learning. In Advances in Neural Infor- 2017. mation Processing Systems, pp. 4496–4506, 2017. Danilo Jimenez Rezende and Shakir Mohamed. Vari- Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. ational inference with normalizing flows. arXiv Nervenet: Learning structured policy with graph neu- preprint arXiv:1505.05770, 2015. ral networks. 2018. Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Beyond sharing weights for deep domain adaptation. Anind K Dey. Maximum entropy inverse reinforce- IEEE transactions on pattern analysis and machine ment learning. In AAAI, volume 8, pp. 1433–1438. intelligence, 41(4):801–814, 2018. Chicago, IL, USA, 2008.

APPENDIX A Hyper-parameters Table 2: Hyper-parameters used for all experiments. Hyperparameter Value Hidden Layers 2 Hidden Units 64 Activation tanh Optimizer Adam Learning Rate 3 x 10−4 Epochs per Iteration 10 Minibatch Size 64 Discount (γ) 0.99 GAE parameter (λ) 0.95 Clip range () 0.2 B Normalized Student Weights CentipedeFour to CentipedeEight CentipedeSix to CentipedeEight CentipedeFour to CpCentipedeSix 1.0 1.0 1.0 0.9 0.9 0.9 Normalized Student Weight Normalized Student Weight Normalized Student Weight 0.8 0.8 0.8 0.7 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.5 MIKT Normalized Student Weight MIKT Normalized Student Weight MIKT Normalized Student Weight 0.4 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Timesteps 1e6 Timesteps 1e6 Timesteps 1e6 (a) (b) (c) CentipedeSix to CpCentipedeEight CentipedeFour to Ant CentipedeSix to Ant 1.0 1.0 1.0 0.9 0.9 0.9 Normalized Student Weight Normalized Student Weight Normalized Student Weight 0.8 0.8 0.8 0.7 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.5 0.4 0.4 MIKT Normalized Student Weight MIKT Normalized Student Weight MIKT Normalized Student Weight 0.3 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Timesteps 1e6 Timesteps 1e6 Timesteps 1e6 (d) (e) (f) Figure 6: Plots of the normalized student weight throughout the course of training. Lower values indicate heavier dependence on teacher representations. A value of 1 indicates the student is completely independent of the teacher.

C KL-Regularization Ablation CentipedeFour to CentipedeEight CentipedeSix to CentipedeEight CentipedeFour to CpCentipedeSix 3500 3500 3000 3000 3000 2500 2500 2500 2000 2000 2000 Reward Reward Reward 1500 1500 1500 1000 1000 1000 500 500 500 MIKT MIKT MIKT 0 0 MIKT w/o KL MIKT w/o KL 0 MIKT w/o KL 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 Timesteps 1e6 Timesteps 1e6 Timesteps 1e6 (a) (b) (c) CentipedeSix to CpCentipedeEight CentipedeFour to Ant CentipedeSix to Ant 3500 3000 3000 3000 2500 2500 2500 2000 2000 2000 Reward Reward Reward 1500 1500 1500 1000 1000 1000 500 500 500 0 0 MIKT MIKT MIKT 0 MIKT w/o KL MIKT w/o KL MIKT w/o KL −500 −500 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 Timesteps 1e6 Timesteps 1e6 Timesteps 1e6 (d) (e) (f) Figure 7: MIKT with KL-regularization (blue) vs. MIKT without KL-regularization (green). MIKT still works well without the KL-regularization.

You can also read