Leveraging class abstraction for commonsense reinforcement learning via residual policy gradient methods

Niklas Höpner1∗, Ilaria Tiddi2 and Herke van Hoof1
1 University of Amsterdam
2 Vrije Universiteit Amsterdam
{n.r.hopner, h.c.vanhoof}@uva.nl, i.tiddi@vu.nl

arXiv:2201.12126v2 [cs.AI] 1 May 2022

Abstract

Enabling reinforcement learning (RL) agents to leverage a knowledge base while learning from experience promises to advance RL in knowledge-intensive domains. However, it has proven difficult to leverage knowledge that is not manually tailored to the environment. We propose to use the subclass relationships present in open-source knowledge graphs to abstract away from specific objects. We develop a residual policy gradient method that is able to integrate knowledge across different abstraction levels in the class hierarchy. Our method results in improved sample efficiency and generalisation to unseen objects in commonsense games, but we also investigate failure modes, such as excessive noise in the extracted class knowledge or environments with little class structure.1

1 Introduction

Deep reinforcement learning (DRL) has enabled us to optimise control policies in MDPs with high-dimensional state and action spaces such as in game-play [Silver et al., 2016] and robotics [Lillicrap et al., 2016]. Two main hindrances in bringing deep reinforcement learning to the real world are the sample inefficiency and poor generalisation performance of current methods [Kirk et al., 2021]. Amongst other approaches, including prior knowledge in the learning process of the agent promises to alleviate these obstacles and move reinforcement learning (RL) from a tabula rasa method to more human-like learning. Depending on the area of research, prior knowledge representations can vary from pretrained embeddings or weights [Devlin et al., 2019] to symbolic knowledge representations such as logics [Vaezipoor et al., 2021] and knowledge graphs (KGs) [Zhang et al., 2020b]. While the former are easier to integrate into deep neural network-based algorithms, they lack specificity, abstractness, robustness and interpretability [van Harmelen and ten Teije, 2019].

One type of prior knowledge that is hard to obtain for purely data-driven methods is commonsense knowledge. Equipping reinforcement learning agents with commonsense or world knowledge is an important step towards improved human-machine interaction [Akata et al., 2020], as interesting interactions demand machines to access prior knowledge not learnable from experience. Commonsense games [Jiang et al., 2020; Murugesan et al., 2021] have emerged as a testbed for methods that aim at integrating commonsense knowledge into an RL agent. Prior work has focused on augmenting the state by extracting subparts of ConceptNet [Murugesan et al., 2021]. Performance only improved when the extracted knowledge was tailored to the environment. Here, we focus on knowledge that is automatically extracted and should be useful across a range of commonsense games.

Humans abstract away from specific objects using classes, which allows them to learn behaviour at class level and generalise to unseen objects [Yee, 2019]. Since commonsense games deal with real-world entities, we look at the problem of leveraging subclass knowledge from open-source KGs to improve the sample efficiency and generalisation of an agent in commonsense games. We use subclass knowledge to formulate a state abstraction that aggregates states depending on which classes are present in a given state. This state abstraction might not preserve all information necessary to act optimally in a state. Therefore, a method is needed that learns to integrate useful knowledge over a sequence of more and more fine-grained state representations. We show how a naive ensemble approach can fail to correctly integrate information from imperfectly abstracted states, and design a residual learning approach that is forced to learn the difference between policies over adjacent abstraction levels. The properties of both approaches are first studied in a toy setting where the effectiveness of class-based abstractions can be controlled. We then show that if a commonsense game is governed by class structure, the agent is more sample efficient and generalises better to unseen objects, outperforming embedding approaches and methods augmenting the state with subparts of ConceptNet. However, learning might be hampered if the extracted class knowledge aggregates objects incorrectly. To summarise, our key contributions are:

• we use the subclass relationship from open-source KGs to formulate a state abstraction for commonsense games;
• we propose a residual learning approach that can be integrated with policy gradient algorithms to leverage imperfect state abstractions;
• we show that in environments with class structure our method leads to more sample efficient learning and better generalisation to unseen objects.

∗ Contact Author
1 Code can be found at https://github.com/NikeHop/CSRL
Figure 1: (Left) Visualisation of a state and its abstractions in Wordcraft based on classes extracted from WordNet. Colours indicate that an object is mapped to a class, where the same colour refers to the same class within an abstraction. (Right) Subgraph of the class tree that is used to determine the state abstractions.

2 Related Work

We introduce the resources available to include commonsense knowledge and the attempts that have been made by prior work to leverage these resources. Since the class knowledge is extracted from knowledge graphs, work on including KGs in deep neural network based architectures is discussed. The setting considered here also offers a new perspective on state abstraction in reinforcement learning.

Commonsense KGs. KGs store facts in the form of entity-relation-entity triplets. Often a KG is constructed to capture either a general area of knowledge such as commonsense [Ilievski et al., 2021], or more domain-specific knowledge like the medical domain [Huang et al., 2017]. While ConceptNet [Speer et al., 2017] tries to represent all commonsense knowledge, others focus on specific parts such as cause and effect relations [Sap et al., 2019]. Manually designed KGs [Miller, 1995] are less error prone, but provide less coverage and are more costly to design, making hybrid approaches popular [Speer et al., 2017]. Here, we focus on WordNet [Miller, 1995], ConceptNet and DBpedia [Lehmann et al., 2015] and study how the quality of their class knowledge affects our method. Representing KGs in vector form can be achieved via knowledge graph embedding techniques [Nickel and Kiela, 2017], where embeddings are either trained from scratch or finetuned from word embeddings [Speer and Lowry-Duda, 2017]. Hyperbolic embeddings [Nickel and Kiela, 2017] capture the hierarchical structure of WordNet given by the hypernym relation between two nouns and are investigated as an alternative method to include class prior knowledge.

Commonsense Games. To study the problem of integrating commonsense knowledge into an RL agent, commonsense games have recently been introduced [Jiang et al., 2020; Murugesan et al., 2021]. Prior methods have focused on leveraging knowledge graph embeddings [Jiang et al., 2020] or augmenting the state representation via an extracted subpart of ConceptNet [Murugesan et al., 2021]. While knowledge graph embeddings improve performance more than GloVe word embeddings [Pennington et al., 2014], the knowledge graph they are based on is privileged game information. Extracting task-relevant knowledge from ConceptNet automatically is challenging: if the knowledge is manually specified, sample efficiency improves, but heuristic extraction rules hamper learning. No method that learns to extract useful knowledge exists yet. The class knowledge we use here is not tailored to the environment, and therefore should hold across a range of environments.

Integrating KGs into deep learning architectures. The problem of integrating knowledge present in a knowledge graph into a learning algorithm based on deep neural networks has mostly been studied by the natural language community [Xie and Pu, 2021]. Use-cases include, but are not limited to, open-dialog [Moon et al., 2019], task-oriented dialogue [Gou et al., 2021] and story completion [Zhang et al., 2020b]. Most methods are based on an attention mechanism over parts of the knowledge base [Moon et al., 2019; Gou et al., 2021]. There are two key differences to our setting: the knowledge graphs used are curated for the task, and therefore contain little noise; additionally, most tasks are framed as supervised learning problems where annotations about correct reasoning patterns are given. Here, we extract knowledge from open-source knowledge graphs and have to deal with errors in the class structure due to problems with entity reconciliation and incompleteness of knowledge.

State abstraction in RL. State abstraction aims to partition the state space of a base Markov Decision Process (MDP) into abstract states to reduce the complexity of the state space on which a policy is learnt [Li et al., 2006]. Different criteria for aggregating states have been proposed [Givan et al., 2003]. They guarantee that an optimal policy learnt for the abstract MDP remains optimal for the base MDP. To leverage state abstraction, an aggregation function has to be learned [Zhang et al., 2020a], which either needs additional samples or is performed on-policy, leading to a potential collapse of the aggregation function [Kemertas and Aumentado-Armstrong, 2021]. The case in which an approximate state abstraction is given as prior knowledge has not been looked at yet. The abstraction given here need not satisfy any consistency criteria and can consist of multiple abstraction levels. A method that is capable of integrating useful knowledge from each abstraction level is needed.

3 Problem Setting

Reinforcement learning enables us to learn optimal behaviour in an MDP M = (S, A, R, T, γ) with state space S, action space A, reward function R : S × A → ℝ, discount factor γ and transition function T : S × A → ∆_S, where ∆_S represents the set of probability distributions over the space S. The goal is to learn from experience a policy π : S → ∆_A that optimises the objective

    J(π) = E_π[ Σ_{t=0}^{∞} γ^t R_t ] = E_π[G],    (1)

where G is the discounted return. A state abstraction function φ : S → S′ aggregates states into abstract states with the goal of reducing the complexity of the state space.
Given an arbitrary weighting function w : S → [0, 1] such that Σ_{s ∈ φ⁻¹(s′)} w(s) = 1 for all s′ ∈ S′, one can define an abstract reward function R′ and a transition function T′ on the abstract state space S′:

    R′(s′, a) = Σ_{s ∈ φ⁻¹(s′)} w(s) R(s, a),    (2)

    T′(s̄′ | s′, a) = Σ_{s̄ ∈ φ⁻¹(s̄′)} Σ_{s ∈ φ⁻¹(s′)} w(s) T(s̄ | s, a),    (3)

to obtain an abstract MDP M′ = (S′, A, R′, T′, γ). If the abstraction φ satisfies consistency criteria (see Section 2, State abstraction in RL), a policy learned over M′ allows for optimal behaviour in M, which we refer to from now on as the base MDP [Li et al., 2006]. Here, we assume that we are given abstraction functions φ_1, ..., φ_n with φ_i : S_{i−1} → S_i, where S_0 corresponds to the state space of the base MDP. Since φ_i need not satisfy any consistency criteria, learning a policy over one of the corresponding abstract MDPs M_i can result in a non-optimal policy. The goal is to learn a policy π or action value function Q that takes as input a hierarchy of state abstractions s = (s_1, ..., s_n). Here, we want to make use of the more abstract states s_2, ..., s_n for more sample efficient learning and better generalisation.
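To make the construction of the abstract MDP concrete, the following is a minimal NumPy sketch, not taken from the paper's released code, that builds R′ and T′ from Equations (2) and (3) for a small tabular MDP under a uniform weighting w; the toy MDP and all variable names are illustrative assumptions.

```python
import numpy as np

# Toy tabular MDP: 4 base states, 2 actions (illustrative values).
n_states, n_actions = 4, 2
R = np.random.rand(n_states, n_actions)             # R[s, a]
T = np.random.dirichlet(np.ones(n_states),          # T[s, a, s_next]
                        size=(n_states, n_actions))

# State abstraction phi: base state index -> abstract state index.
phi = np.array([0, 0, 1, 1])                        # aggregate {0,1} and {2,3}
n_abs = phi.max() + 1

# Uniform weighting w over the preimage of each abstract state.
w = np.array([1.0 / np.sum(phi == phi[s]) for s in range(n_states)])

R_abs = np.zeros((n_abs, n_actions))
T_abs = np.zeros((n_abs, n_actions, n_abs))
for s in range(n_states):
    s_abs = phi[s]
    R_abs[s_abs] += w[s] * R[s]                     # Equation (2)
    for s_next in range(n_states):
        T_abs[s_abs, :, phi[s_next]] += w[s] * T[s, :, s_next]   # Equation (3)
```

Because the weights over each preimage sum to one, every row of T_abs remains a probability distribution over abstract next states.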
4 Methodology

The method can be separated into two components: (i) constructing the abstraction functions φ_1, ..., φ_n from the subclass relationships in open-source knowledge graphs; (ii) learning a policy over the hierarchy of abstract states s = (s_1, ..., s_n) given the abstraction functions.

Constructing the abstraction functions φ_i. A state in a commonsense game features real-world entities and their relations, which can be modelled as a set, sequence or graph of entities. The idea is to replace each entity with its superclass, so that states that contain objects with the same superclass are aggregated into the same abstract state. Let E be the vocabulary of symbols that can appear in any of the abstract states s_i, i.e. s_i = {e_1, ..., e_k | e_l ∈ E}. The symbols that refer to real-world objects are denoted by O ⊆ E, and C_tree represents their class tree. A class tree is a rooted tree in which the leaves are objects and the parent of each node is its superclass. The root is a generic entity class of which every object/class is a subclass (see Appendix A for an example). To help define φ_i, we introduce an entity-based abstraction φ^E_i : E → E. Let C_k represent the objects/classes with depth k in C_tree and L be the depth of C_tree; then we can define φ^E_i and φ_i:

    φ^E_i(e) = Pa(e)  if e ∈ C_{L+1−i},  and  φ^E_i(e) = e  otherwise,    (4)

    φ_i(s) = {φ^E_i(e) | e ∈ s},    (5)

where Pa(e) are the parents of entity e in the class tree C_tree. This abstraction process is visualised in Figure 1.
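As an illustration of Equations (4) and (5), here is a minimal Python sketch that applies the entity-level abstraction to a Wordcraft-style example in the spirit of Figure 1; the child-to-parent dictionary and helper names are illustrative and not taken from the released code.

```python
# Class tree as a child -> parent map; "entity" is the root (illustrative example).
parent = {
    "goat": "ungulate", "sheep": "ungulate", "ungulate": "mammal",
    "mammal": "vertebrate", "vertebrate": "animal", "animal": "entity",
    "barn": "farm_building", "farm_building": "building", "building": "entity",
}

def depth(e):
    """Depth of an entity/class in the class tree (the root has depth 0)."""
    d = 0
    while e in parent:
        e, d = parent[e], d + 1
    return d

L = max(depth(e) for e in parent)      # depth of the class tree

def phi_E(e, i):
    """Entity-level abstraction of Equation (4): lift entities at depth L+1-i."""
    return parent[e] if depth(e) == L + 1 - i else e

def phi(state, i):
    """State-level abstraction of Equation (5): apply phi_E to every entity."""
    return {phi_E(e, i) for e in state}

s1 = phi({"goat", "barn"}, 1)   # {'ungulate', 'barn'}
s2 = phi(s1, 2)                 # {'mammal', 'barn'}: barn is lifted only at later levels
```

The example also shows the effect of an imbalanced tree: the leaf "barn" sits at a smaller depth than "goat", so it is left untouched by the first abstraction steps.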
In practice, we need to be able to extract the set of relevant objects from the game state and construct a class tree from open-source KGs. If the game state is not a set of entities but rather text, we use spaCy2 to extract all nouns as the set of objects. The class tree is extracted from either DBpedia, ConceptNet or WordNet. For the detailed algorithms of each KG extraction, we refer to Appendix A. Here, we discuss some of the caveats that arise when extracting class trees from open-source KGs, and how to tackle them. Class trees can become imbalanced, i.e. the depths of the leaves, which represent the objects, can differ (Figure 2). As each additional layer with a small number of classes adds computational overhead but provides little abstraction, we collapse layers depending on their contribution towards abstraction (Figure 2). While in DBpedia or WordNet the found entities are mapped to a unique superclass, entities in ConceptNet are associated with multiple superclasses. To handle the case of multiple superclasses, each entity is mapped to the set of all i-step superclasses for i = 1, ..., n. To obtain representations for these class sets, the embeddings of each element of the set are averaged.

2 https://spacy.io/

Figure 2: (Left) The number of different objects/classes that can appear for each state abstraction level in the Wordcraft environment with the superclass relation extracted from WordNet. (Right) Visualisation of collapsing two layers in a class tree.
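The detailed extraction algorithms are given in the paper's Appendix A; the sketch below only shows the two off-the-shelf ingredients used for the text case, namely spaCy for noun extraction and NLTK's WordNet interface for hypernym chains. It assumes the small English spaCy pipeline and the NLTK WordNet corpus are installed, takes the first synset without disambiguation, and is not the authors' pipeline.

```python
import spacy
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

nlp = spacy.load("en_core_web_sm")      # any spaCy pipeline with a POS tagger works

def extract_objects(observation: str) -> set:
    """Nouns of a textual game state, used as the object set O."""
    doc = nlp(observation)
    return {tok.lemma_.lower() for tok in doc if tok.pos_ == "NOUN"}

def hypernym_chain(obj: str) -> list:
    """Superclass chain of an object from WordNet's hypernym relation
    (first synset only; a real extraction needs disambiguation and cleaning)."""
    synsets = wn.synsets(obj, pos=wn.NOUN)
    if not synsets:
        return []
    syn, chain = synsets[0], []
    while syn.hypernyms():
        syn = syn.hypernyms()[0]
        chain.append(syn.name())
    return chain

objects = extract_objects("There is a dirty plate on the kitchen table.")
print({o: hypernym_chain(o) for o in objects})   # output depends on tagger and corpus
```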
Learning policies over a hierarchy of abstract states. Since prior methods in commonsense games are policy gradient-based, we will focus on this class of algorithms, while providing a similar analysis for value-based methods in Appendix B. First, we look at a naive method to learn a policy in our setting, discuss its potential weaknesses, and then propose a novel gradient update to overcome these weaknesses.

A simple approach to learning a policy π over s is an ensemble: a network with separate parameters for each abstraction level predicts logits, which are then summed and converted via the softmax operator into a final policy π. Let s_{i,t} denote the abstract state on the i-th level at timestep t; then π is computed via

    π(a_t | s_t) = Softmax( Σ_{i=1}^{n} NN_{θ_i}(s_{i,t}) ),    (6)

where NN_{θ_i} is a neural network, parameterised by θ_i, that processes the abstract state s_i. This policy can then be trained via any policy gradient algorithm [Schulman et al., 2015; Mnih et al., 2016; Espeholt et al., 2018]. From here on, we will refer to this approach as the sum-method.
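A minimal PyTorch-style sketch of the sum-method in Equation (6); the per-level feed-forward encoders stand in for whatever architecture processes each abstraction level in practice (multi-head attention in Wordcraft, a recurrent encoder in TWC), and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class SumPolicy(nn.Module):
    """Ensemble policy: one network per abstraction level, logits are summed (Eq. 6)."""

    def __init__(self, state_dims, n_actions, hidden=64):
        super().__init__()
        self.levels = nn.ModuleList([
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))
            for d in state_dims  # one encoder NN_theta_i per abstraction level
        ])

    def forward(self, states):
        # states: list of tensors, states[i] encodes the abstract state s_{i,t}
        logits = [net(s) for net, s in zip(self.levels, states)]
        return torch.distributions.Categorical(logits=sum(logits))

# Usage: sample an action and get log-probabilities for any policy gradient method.
policy = SumPolicy(state_dims=[32, 16, 8], n_actions=5)
states = [torch.randn(1, d) for d in (32, 16, 8)]
dist = policy(states)
action = dist.sample()
log_prob = dist.log_prob(action)   # plug into a REINFORCE / IMPALA loss
```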
There is no mechanism that forces the sum approach to learn on the most abstract level possible, potentially leading to worse generalisation to unseen objects. At train time, making all predictions solely based on the lowest level (ignoring all higher levels) can be a solution that maximises discounted return, though it will not generalise well to unseen objects. To circumvent this problem, we adapt the policy gradient so that the parameters θ_i at each abstraction level are optimised to approximate an optimal policy for the i-th abstraction level given the computed logits on abstraction level i − 1. Let s^n_{i,t} = (s_{i,t}, ..., s_{n,t}) denote the hierarchy of abstract states at timestep t down to the i-th level. Define the policy on the i-th level as

    π_i(a | s^n_{i,t}) = Softmax( Σ_{k=i}^{n} NN_{θ_k}(s_{k,t}) ),    (7)

and notice that π = π_1. To obtain a policy gradient expression that contains the abstract policies π_i, we write π as a product of abstract policies:

    π(a | s_t) = [ Π_{i=1}^{n−1} π_i(a | s^n_{i,t}) / π_{i+1}(a | s^n_{i+1,t}) ] π_n(a | s_{n,t}),    (8)

and plug it into the policy gradient expression for an episodic task with discounted return G:

    ∇_θ J(θ) = Σ_{i=1}^{n} E_π[ Σ_{t=1}^{T} ∇_θ log( π_i(a | s^n_{i,t}) / π_{i+1}(a | s^n_{i+1,t}) ) G ],    (9)

where π_{n+1} ≡ 1. Notice that in Equation 9, the gradient of the parameters θ_i depends on the values of all policies of a level equal to or lower than i. The idea is to take the gradient for each abstraction level i only with respect to θ_i and not the full set of parameters θ. This optimises the parameters θ_i not with respect to their effect on the overall policy, but with respect to their effect on the abstract policy on level i. The residual policy gradient is given by:

    ∇_θ J_res(θ) = Σ_{i=1}^{n} E_π[ Σ_{t=1}^{T} ∇_{θ_i} log(π_i(a | s^n_{i,t})) G ].    (10)

Each element of the sum over abstraction levels resembles the policy gradient loss of a policy over the abstract state s_i. However, the sampled trajectories come from the overall policy π and not the policy π_i, and the policy π_i inherits a bias from the more abstract levels in the form of their logits. We refer to the method based on the update in Equation 10 as the residual approach. An advantage of both the residual and the sum approach is that the computation of the logits for each layer can be done in parallel; any sequential processing of levels would have a prohibitively large computational overhead.
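One way to realise the residual update of Equation (10) with automatic differentiation is to give each level-i policy its own logits plus a detached sum of the logits from the more abstract levels, so that the loss term for level i only produces gradients for θ_i. The following is a minimal sketch of this idea, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def residual_pg_loss(level_logits, action, ret):
    """Residual policy gradient loss (Eq. 10).

    level_logits: list of tensors [logits_1, ..., logits_n], ordered from the
                  base level (i = 1) to the most abstract level (i = n).
    action:       sampled action (LongTensor), drawn from pi = pi_1.
    ret:          discounted return G.
    """
    loss = 0.0
    upper = torch.zeros_like(level_logits[0])   # sum of logits from levels > i
    for logits_i in reversed(level_logits):     # i = n, n-1, ..., 1
        # pi_i uses its own logits plus the detached logits of the more abstract
        # levels, so gradients of this term flow only into theta_i.
        log_pi_i = F.log_softmax(logits_i + upper.detach(), dim=-1)
        loss = loss - log_pi_i.gather(-1, action.unsqueeze(-1)).squeeze(-1) * ret
        upper = upper + logits_i                # becomes the fixed bias for level i-1
    return loss.mean()
```

At acting time the behaviour policy remains π = π_1, i.e. the softmax of the plain, non-detached sum of all level logits, exactly as in the sum-method.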
5 Experimental Evaluation

Our methodology is based on the assumption that abstraction via class knowledge extracted from open-source KGs is useful in commonsense game environments. This need not be true. It is therefore sensible to first study the workings of our methodology in an idealised setting, where we control whether and how abstraction is useful for generalisation and sample efficiency. Then, we evaluate the method on two commonsense games, namely a variant of Textworld Commonsense [Murugesan et al., 2021] and Wordcraft.

Toy environment. We start with a rooted tree, where each node is represented via a random embedding. The leaves of the tree represent the states of a base MDP. Each depth level of the tree represents one abstraction level. Every inner node of the tree represents an abstract state that aggregates the states of its children. The optimal action for each leaf (base state) is determined by first fixing an abstraction level l and randomly sampling one of five possible actions for each abstract state on that level. Then, the optimal action for a leaf is given by the sampled action for its corresponding abstract state on level l. The time horizon is one step, i.e. for each episode a leaf state is sampled, and if the agent chooses the correct action, it receives a reward of one. We test on new leaves with unseen random embeddings, but keep the same abstraction hierarchy and the same pattern of optimal actions. The sum and residual approach are compared to a policy trained only on the base states and a policy given only the optimal abstract states (oracle). To study the effect of noise in the abstraction, we replace the optimal action as determined by the abstract state with noise probability σ (noise setting), or construct the abstraction such that not every abstract state aggregates multiple base states (ambiguity setting). All policies are trained via REINFORCE [Williams, 1992] with a value function as baseline and entropy regularisation. More details on the chosen trees and the policy network architecture can be found in Appendix C.
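A sketch of how such a toy environment could be generated, assuming a complete tree with a fixed branching factor; the parameters and function names are illustrative and the released code may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_toy_env(depth=4, branching=3, n_actions=5, level=2, emb_dim=8):
    """Toy MDP: leaves of a complete tree are base states; the optimal action of a
    leaf is the action sampled for its ancestor at the chosen abstraction level."""
    n_leaves = branching ** depth
    leaf_emb = rng.normal(size=(n_leaves, emb_dim))        # random leaf embeddings
    n_abstract = branching ** level                        # abstract states on `level`
    abstract_action = rng.integers(n_actions, size=n_abstract)
    leaves_per_abstract = branching ** (depth - level)
    # A leaf's ancestor on `level` follows from integer division of its index.
    optimal_action = abstract_action[np.arange(n_leaves) // leaves_per_abstract]
    return leaf_emb, optimal_action

def step(leaf_idx, action, optimal_action):
    """One-step episode: reward 1 iff the chosen action is optimal for the leaf."""
    return float(action == optimal_action[leaf_idx])
```

Noise and ambiguity can then be injected by resampling a fraction of the leaf-level optimal actions, or by giving some inner nodes only a single child.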
Textworld Commonsense. In text-based games, the state and action space are given as text. In Textworld Commonsense (TWC), an agent is located in a household and has to put objects in their correct commonsense location, i.e. the location one would expect these objects to be in in a normal household. The agent receives a reward of one for each object that is put into the correct location. The agent is evaluated by the achieved normalised reward and the number of steps needed to solve the environment, where 50 steps is the maximum number of steps allowed. To make abstraction necessary, we use a large number of objects per class and increase the number of games the agent sees during training from 5 to 90, which raises the exposure to different objects at training time. The agent is evaluated on a validation set where it encounters previously seen objects and a test set where it does not. The difficulty level of a game is determined by the number of rooms, the number of objects to move and the number of distractors (objects already in the correct location), which are here two, two and three respectively. To study the effect of inaccuracies in the extracted class trees from WordNet, ConceptNet and DBpedia, we compare them to a manual aggregation of objects based on their reward and transition behaviour. Murugesan et al. [2021] benchmark different architectures that have been proposed to solve text-based games [He et al., 2016]. Here, we focus on their proposed method that makes use of a recurrent neural network architecture and numberbatch embeddings [Speer and Lowry-Duda, 2017], which performed best in terms of normalised reward and number of steps needed. For more details on the architecture and the learning algorithm, we refer to the initial paper. As baselines, we choose: (i) adding class information via hyperbolic embeddings trained on WordNet [Nickel and Kiela, 2017] by concatenating them to the numberbatch embeddings of objects; (ii) extracting the LocatedAt relation from ConceptNet for game objects, encoding it via graph attention and combining it with the textual embedding [Murugesan et al., 2021].

Wordcraft. An agent is given a goal entity and ingredient entities, and has to combine the ingredient entities to obtain the goal entity. A set of recipes determines which combination of input objects leads to which output entity. One can create different settings depending on the number of distractors (irrelevant entities in the set of recipe entities) and the set of recipes available at training time. Generalisation to unseen recipes means that during training not all recipes are available, and generalisation to unseen goals ensures that at test time the goal entity has not been encountered before as a goal entity. Jiang et al. [2020] trained a policy using the IMPALA algorithm [Espeholt et al., 2018]. We retain the training algorithm and the multi-head attention architecture [Vaswani et al., 2017] used for the policy network to train our policy components NN_{θ_1}, ..., NN_{θ_n}.

Figure 3: Training and generalisation results in the toy environment. The abstract state perfectly determines the action to take in the base state (top). Noise is added to this relation, so that 50% of the time the optimal action is determined randomly beforehand (middle). In the ambiguous setting, not every abstract state has multiple substates (bottom). Experiments are run over five seeds.

6 Results

After checking whether the results in the ideal environment are as expected, we discuss the results in TWC and Wordcraft.

6.1 Toy Environment

From Figure 3 (top), one can see that the sum and residual approach are both more sample efficient than the baseline without abstraction, on par with the sample efficiency of the oracle. The same holds for the generalisation performance. As the base method faces unseen random embeddings at test time, there is no possibility for generalisation. In the noise setting (Figure 3 middle), the policy trained on the abstract state reaches its performance limit at the percentage of states whose action is determined via their abstract state; the base, residual and sum methods instead reach optimal performance. The achieved reward at generalisation time decreases for both the sum and residual method, but the sum approach suffers more from overfitting to the noise at training time when compared to the residual approach. In the case of ambiguous abstraction, we see that the residual approach outperforms the sum approach. This can be explained by the residual gradient update, which forces the residual method to learn everything on the most abstract level, while the sum approach distributes the contribution to the final policy over the abstraction layers. At test time, the sum approach puts too much emphasis on the uninformative base state, causing the policy to take wrong actions.

6.2 Textworld Commonsense

The results suggest that both the sum and residual approach are able to leverage the abstract states. It remains an open question whether this transfers to more complex environments with a longer time horizon, where the abstraction is derived by extracting knowledge from an open-source KG.

Figure 4: Training performance of the baseline (base), the sum and residual abstraction with knowledge extracted from WordNet in the difficult environment. Experiments are run over ten seeds.

Sample efficiency. From Figure 4, one can see that the residual and sum approach learn with fewer samples compared to the base approach. The peak mean difference is 20 episodes out of a total of 100 training episodes. While the sum approach learns as efficiently towards the beginning of training, it collapses for some random seeds towards the end of training, resulting in higher volatility of the achieved normalised reward. This is similar to what was previously observed in the noisy setting of the ideal environment. That the performance of the sum approach collapses for certain seeds suggests that the learned behaviour for more abstract states depends on the random initialisation of weights. This stability issue is not present for the residual approach. Figure 5 (right) shows that adding hyperbolic embeddings does not improve sample efficiency and even decreases it.
The abstractions derived from DBpedia and ConceptNet hamper learning, particularly for the residual approach (Figure 5). By investigating the DBpedia abstraction, we notice that many entities are not resolved correctly, so objects with completely different semantics get aggregated. The poor performance of the ConceptNet abstraction hints at a problem with averaging class embeddings over multiple superclasses: although two objects may have an overlap in their set of superclasses, the resulting embeddings can still differ heavily due to the non-overlapping classes.

Method    Reward Valid.   Reward Test    Steps Valid.    Steps Test
Base      0.91 (0.04)     0.85 (0.05)    25.98 (3.14)    29.36 (3.42)
Base-H    0.83 (0.06)     0.75 (0.14)    30.59 (5.39)    33.92 (6.08)
Base-L    0.90 (0.04)     0.86 (0.06)    24.83 (2.31)    28.71 (3.46)
M-R       0.96 (0.03)     0.96 (0.02)    21.26 (2.65)    20.00 (1.57)
M-S       0.97 (0.02)     0.96 (0.02)    20.95 (1.38)    19.55 (1.69)
W-R       0.93 (0.02)     0.94 (0.02)    23.25 (1.46)    24.04 (1.75)
W-S       0.88 (0.10)     0.87 (0.17)    25.69 (4.78)    26.21 (6.81)

Table 1: Generalisation results for the easy and difficult level w.r.t. the validation set and the test set, in terms of mean number of steps taken (standard deviation in brackets) for each type of knowledge graph (M = manual, W = WordNet) and each approach (R = residual, S = sum). Base refers to the baseline with no abstraction, Base-H to the baseline with hyperbolic embeddings and Base-L to the baseline with added LocatedAt relations. In bold, the best performing method without manually specified knowledge; highlighted in red, the conditions in which the manual class graph outperforms the other methods. Experiments are run over ten seeds.

Figure 5: Training performance in the difficult environment using the residual approach when the class knowledge comes from different knowledge graphs (left) and when the baseline is augmented with hyperbolic embeddings or the LocatedAt relation from ConceptNet (right). Experiments are run over ten seeds.

Figure 6: Generalisation results for the Wordcraft environment with respect to unseen goal entities. Experiments are run over ten seeds.

6.3 Wordcraft

Figure 6 shows the generalisation performance in the Wordcraft environment. For the case of random embeddings, abstraction can help to improve generalisation results. Replacing random embeddings with GloVe embeddings improves generalisation beyond the class abstraction, and combining class abstraction with GloVe embeddings does not result in any additional benefit. Looking at the recipe structure in Wordcraft, objects are combined based on their semantic similarity
performance towards the end of training. Adding the Locate-                                                                               ilarity, which can be better captured via word embeddings
dAt relations improves sample efficiency to a similar degree                                                                              rather than through classes. Although no generalisation gains
as adding class abstractions.                                                                                                             can be identified, adding abstraction in the presence of GloVe
Generalisation. Table 1 shows that, without any additional                                                                                embeddings leads to improved sample efficiency. A more
knowledge the performance of the baseline, measured via                                                                                   detailed analysis discussing the difference in generalisation
mean normalised reward and mean number of steps, drops                                                                                    between class abstraction and pretrained embeddings in the
when faced with unseen objects at test time. Adding hyper-                                                                                Wordcraft environment can be found in Appendix D.
bolic embeddings does not alleviate that problem but rather
hampers generalisation performance on the validation and
test set. When the LocatedAt relation from ConceptNet is                                                                                  7                  Conclusion
added to the state representation the drop in performance is                                                                              We show how class knowledge, extracted from open-source
reduced marginally. Given a manually defined class abstrac-                                                                               KGs, can be leveraged to learn behaviour for classes instead
tion, generalisation to unseen objects is possible without any                                                                            of individual objects in commonsense games. To force the
drop in performance for the residual and the sum approach.                                                                                RL agent to make use of class knowledge even in the pres-
This remains true for the residual approach when the man-                                                                                 ence of noise, we propose a novel residual policy gradient up-
ually defined class knowledge is exchanged with the Word-                                                                                 date based on an ensemble learning approach. If the extracted
Net class knowledge. The sum approach based on WordNet                                                                                    class structure approximates relevant classes in the environ-
performs worse on the validation and training set due to the                                                                              ment, the sample efficiency and generalisation performance
collapse of performance at training time.                                                                                                 to unseen objects are improved. Future work could look at
KG ablation. From Figure 5 and Table 1 it is evident that                                                                                 other settings where imperfect prior knowledge is given and
for the residual approach, the noise and additional abstrac-                                                                              whether this can be leveraged for learning. Finally, KGs con-
tion layers introduced by moving from the manually defined                                                                                tain more semantic information than classes, which future
classes to the classes given by WordNet, only cause a small                                                                               work could try to leverage for an RL agent in commonsense
drop in generalisation performance and no additional sam-                                                                                 games.
Acknowledgements
This research was (partially) funded by the Hybrid Intelligence Center, a 10-year programme funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research, https://hybrid-intelligence-centre.nl.

References
[Akata et al., 2020] Z. Akata, D. Balliet, M. de Rijke, F. Dignum, and V. Dignum et al. A research agenda for hybrid intelligence: Augmenting human intellect with collaborative, adaptive, responsible, and explainable artificial intelligence. 2020.
[Côté et al., 2018] M. Côté, A. Kádár, X. Yuan, B. Kybartas, and T. Barnes et al. Textworld: A learning environment for text-based games. In Workshop on Computer Games, 2018.
[Devlin et al., 2019] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[Espeholt et al., 2018] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, and V. Mnih et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML, 2018.
[Givan et al., 2003] R. Givan, T. Dean, and M. Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 2003.
[Gou et al., 2021] Y. Gou, Y. Lei, L. Liu, Y. Dai, and C. Shen. Contextualize knowledge bases with transformer for end-to-end task-oriented dialogue systems. In EMNLP, 2021.
[He et al., 2016] J. He, J. Chen, X. He, J. Gao, and L. Li et al. Deep reinforcement learning with a natural language action space. In ACL, 2016.
[Huang et al., 2017] Z. Huang, J. Yang, F. van Harmelen, and Q. Hu. Constructing knowledge graphs of depression. In International Conference on Health Information Science. Springer, 2017.
[Ilievski et al., 2021] F. Ilievski, P. A. Szekely, and B. Zhang. CSKG: The commonsense knowledge graph. In ESWC, 2021.
[Jiang et al., 2020] M. Jiang, J. Luketina, N. Nardelli, P. Minervini, and P. HS Torr et al. Wordcraft: An environment for benchmarking commonsense agents. arXiv preprint arXiv:2007.09185, 2020.
[Kemertas and Aumentado-Armstrong, 2021] M. Kemertas and T. Aumentado-Armstrong. Towards robust bisimulation metric learning. CoRR, 2021.
[Kirk et al., 2021] R. Kirk, A. Zhang, E. Grefenstette, and T. Rocktäschel. A survey of generalisation in deep reinforcement learning. CoRR, abs/2111.09794, 2021.
[Lehmann et al., 2015] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, and D. Kontokostas et al. DBpedia: A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2015.
[Li et al., 2006] L. Li, T. J. Walsh, and M. L. Littman. Towards a unified theory of state abstraction for MDPs. ISAIM, 4:5, 2006.
[Lillicrap et al., 2016] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. Continuous control with deep reinforcement learning. In ICLR, 2016.
[Miller, 1995] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.
[Mnih et al., 2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, and J. Veness et al. Human-level control through deep reinforcement learning. Nature, 2015.
[Mnih et al., 2016] V. Mnih, A. P. Badia, M. Mirza, A. Graves, and T. P. Lillicrap et al. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
[Moon et al., 2019] S. Moon, P. Shah, A. Kumar, and R. Subba. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In ACL, 2019.
[Murugesan et al., 2021] K. Murugesan, M. Atzeni, P. Kapanipathi, P. Shukla, and S. Kumaravel et al. Text-based RL agents with commonsense knowledge: New challenges, environments and baselines. In AAAI, 2021.
[Nickel and Kiela, 2017] M. Nickel and D. Kiela. Poincaré embeddings for learning hierarchical representations. In NeurIPS, 2017.
[Pennington et al., 2014] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[Sap et al., 2019] M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, and N. Lourie et al. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI, 2019.
[Schulman et al., 2015] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, 2015.
[Silver et al., 2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, and L. Sifre et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
[Speer and Lowry-Duda, 2017] R. Speer and J. Lowry-Duda. ConceptNet at SemEval-2017 Task 2: Extending word embeddings with multilingual relational knowledge. In SemEval@ACL, 2017.
[Speer et al., 2017] R. Speer, J. Chin, and C. Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, 2017.
[Vaezipoor et al., 2021] P. Vaezipoor, A. C. Li, R. T. Icarte, and S. A. McIlraith. LTL2Action: Generalizing LTL instructions for multi-task RL. In ICML, 2021.
[van Harmelen and ten Teije, 2019] F. van Harmelen and A. ten Teije. A boxology of design patterns for hybrid learning and reasoning systems. In BNAIC, 2019.
[Vaswani et al., 2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, and L. Jones et al. Attention is all you need. In NeurIPS, 2017.
[Watkins and Dayan, 1992] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 1992.
[Williams, 1992] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[Xie and Pu, 2021] Y. Xie and P. Pu. How commonsense knowledge helps with natural language tasks: A survey of recent resources and methodologies. CoRR, 2021.
[Yee, 2019] E. Yee. Abstraction and concepts: When, how, where, what and why? 2019.
[Zhang et al., 2020a] A. Zhang, C. Lyle, S. Sodhani, A. Filos, and M. Kwiatkowska et al. Invariant causal prediction for block-MDPs. In ICML, 2020.
[Zhang et al., 2020b] M. Zhang, K. Ye, R. Hwa, and A. Kovashka. Story completion with explicit modeling of commonsense knowledge. In CVPR, 2020.
A   Extracting class trees from commonsense KGs
To construct the abstraction function φ_i presented in Section 4 from open-source knowledge graphs, we need a list of objects that can appear in the game. We collect trajectories via random play and extract the objects out of each state appearing in these trajectories. The extraction method depends on the representation of the state. In Wordcraft [Jiang et al., 2020] (see Section 5), the state is a set of objects, such that no extraction is needed. In Textworld [Côté et al., 2018] the state consists of text; we use spaCy (https://spacy.io/) for POS-tagging and take all nouns to be relevant objects in the game. Next, we look at how to extract class knowledge from WordNet [Miller, 1995], DBpedia [Lehmann et al., 2015] and ConceptNet [Speer et al., 2017] given a set of relevant objects. An overview of the characteristics of the extracted class trees for each knowledge graph and each environment is presented in Table 2.
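To make this extraction step concrete, the following is a minimal sketch under our own naming (extract_objects) and with a hypothetical observation string; it assumes the spaCy model en_core_web_sm is installed and simply collects the nouns appearing in a list of textual states.

    # Minimal sketch of the object-extraction step for textual states (assumes the
    # spaCy model was installed via `python -m spacy download en_core_web_sm`).
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_objects(observations):
        """Return the set of nouns appearing in a list of textual game states."""
        objects = set()
        for text in observations:
            doc = nlp(text)
            objects.update(tok.text.lower() for tok in doc if tok.pos_ == "NOUN")
        return objects

    # Hypothetical observation gathered during random play.
    print(extract_objects(["You see a wooden table and a red apple on the counter."]))
    # e.g. {'table', 'apple', 'counter'}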
        Env.         KG           Miss. Ent.   #Layers

        TWC          WordNet      2/148        9
        TWC          DBpedia      2/148        6
        TWC          ConceptNet   6/148        2
        Wordcraft    WordNet      13/700       16

Table 2: Characteristics of the extracted class trees for each KG in the difficult TWC games and Wordcraft. Missing entities (Miss. Ent.) represents the number of entities that could not be matched to an entity in the KG, relative to the total number of game entities. The number of layers (#Layers) does not include the base level.

A.1   WordNet
We use WordNet's hypernym relation between two nouns as the superclass relation. To query WordNet's synsets and the hypernym relation between them, we use NLTK (https://www.nltk.org/). For each object, we recursively query the superclass relation until the entity synset is reached. Having obtained this chain of superclasses for each object, we merge it into a class tree by aggregating superclasses appearing in multiple superclass chains of different objects into one node. An example of the resulting class tree can be found in Figure 7. One problem that arises is entity resolution, i.e. picking the semantically correct synset for a given word in case of multiple available senses. Here, we simply take the first synset/hypernym returned; more complex resolution methods are possible. Objects that have no corresponding synsets in WordNet have the noun "entity" as a parent in the class tree.
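A minimal sketch of this chain extraction, assuming NLTK's WordNet corpus has been downloaded; the helper name hypernym_chain and the example word are ours. It resolves a word to its first noun synset and follows the first hypernym at each step until no hypernym remains.

    # Minimal sketch of the WordNet chain extraction (assumes nltk.download("wordnet")
    # has been run). Chains of different objects can then be merged into a class tree
    # by aggregating shared superclasses into a single node.
    from nltk.corpus import wordnet as wn

    def hypernym_chain(word):
        """Superclass chain of `word`, from its first noun synset up to the root."""
        synsets = wn.synsets(word, pos=wn.NOUN)
        if not synsets:                   # unmatched objects get "entity" as parent
            return ["entity.n.01"]
        chain, synset = [], synsets[0]    # entity resolution: take the first synset
        while True:
            chain.append(synset.name())
            hypernyms = synset.hypernyms()
            if not hypernyms:             # reached the root of the noun hierarchy
                break
            synset = hypernyms[0]         # take the first hypernym
        return chain

    print(hypernym_chain("pan"))  # e.g. ['pan.n.01', 'cooking_utensil.n.01', ..., 'entity.n.01']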
A.2   DBpedia
We use the SPARQL language (https://www.w3.org/TR/rdf-sparql-query/) to first query all the URIs that have the object's name as a label (using the rdfs:label property). If a URI that is part of DBpedia's ontology can be found, we query the rdfs:subClassOf relation recursively until the URI owl:Thing is reached. If the URI corresponding to the object name is not part of the ontology, we instead query the rdf:type relation recursively until the URI owl:Thing is reached. This again gives us a chain of classes connected via the superclass relation for each object, which can be merged into a class tree as described in Section A.1. Any entity that cannot be mapped to a URI in DBpedia is directly mapped to the owl:Thing node of the class tree.
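The following is a minimal sketch of these lookups against the public DBpedia SPARQL endpoint using the standard SPARQL JSON results format; the helper names (sparql, resolve_uri, superclass_chain) and the depth cap are ours, and the ontology membership check and the rdf:type fallback are only indicated in comments.

    # Minimal sketch of the DBpedia lookups described above, against the public
    # SPARQL endpoint. Error handling, the check whether the URI belongs to the
    # DBpedia ontology, and the rdf:type fallback are omitted for brevity.
    import requests

    ENDPOINT = "https://dbpedia.org/sparql"
    PREFIXES = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    """

    def sparql(query):
        resp = requests.get(ENDPOINT, params={"query": PREFIXES + query,
                                              "format": "application/sparql-results+json"})
        resp.raise_for_status()
        return resp.json()["results"]["bindings"]

    def resolve_uri(name):
        """First URI whose rdfs:label matches the object name."""
        rows = sparql(f'SELECT ?uri WHERE {{ ?uri rdfs:label "{name}"@en }} LIMIT 1')
        return rows[0]["uri"]["value"] if rows else None

    def superclass_chain(uri, max_depth=20):
        """Follow rdfs:subClassOf from `uri` towards owl:Thing."""
        chain = [uri]
        for _ in range(max_depth):
            rows = sparql(f"SELECT ?c WHERE {{ <{chain[-1]}> rdfs:subClassOf ?c }} LIMIT 1")
            if not rows:
                break
            chain.append(rows[0]["c"]["value"])
            if chain[-1].endswith("owl#Thing"):
                break
        return chain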
A.3   ConceptNet
We use ConceptNet's API (https://conceptnet.io/) to query the IsA relation for a list of objects. In ConceptNet, there can be multiple nodes with different descriptions for the same semantic concept connected to an object via the IsA relation. For example, the object airplane is connected via the IsA relation to, among other things, heavier-than-air craft, aircraft and fixed-wing aircraft. This causes two problems: (i) sampling only one superclass could lead to a class tree where objects of the same superclass are not aggregated due to minor differences in the class description; (ii) recursively querying the superclass relation until a general entity class is reached leads to a prohibitively large class tree, and is not guaranteed to terminate due to possible cycles in the IsA relation. To tackle the first problem, we maintain a set of superclasses for each object per abstraction level, instead of a single superclass per level. The representative embedding of the superclass of an object is then calculated by averaging the embeddings of the elements in the set of superclasses. To tackle the second problem, we restrict the superclasses to the two-step neighbourhood of the IsA relation, i.e. we have two abstraction levels. Further simplifications are that we only consider IsA relations with a weight greater than or equal to one (the weight of a relation in ConceptNet is a measure of confidence in whether the relation is correct) and superclasses with a description length of at most three words.
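A minimal sketch of such a query, assuming the public api.conceptnet.io /query endpoint and its JSON edge format; the helper name isa_superclasses and the result limit are ours.

    # Minimal sketch: IsA neighbours of an object from the public ConceptNet API,
    # keeping only superclasses with weight >= 1 and a label of at most three words.
    import requests

    def isa_superclasses(obj, limit=50):
        params = {"start": f"/c/en/{obj}", "rel": "/r/IsA", "limit": limit}
        edges = requests.get("https://api.conceptnet.io/query", params=params).json().get("edges", [])
        classes = set()
        for edge in edges:
            label = edge["end"]["label"]
            if edge.get("weight", 0) >= 1 and len(label.split()) <= 3:
                classes.add(label)
        return classes

    # The superclass embedding of an object at a level is then the average of the
    # embeddings of the elements in this set.
    print(isa_superclasses("airplane"))  # e.g. {'aircraft', 'fixed-wing aircraft', ...}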
[Figure 7: class tree over entities such as pan, pot, kettle, spatula, grater, whisk, corn and spinach, up to the root entity.n.01.]

Figure 7: Visualisation of the class tree constructed from a subset of entities appearing in the Textworld Commonsense environment. The subclass relations are extracted from WordNet.

B   Residual Q-learning
First, the adaptation of Q-learning to the setting described in Section 3 is explained. Then results of the algorithm in the toy environment from Section 5 are presented.

B.1   Algorithm
In Section 4 we focused on how to adapt policy gradient methods to learn a policy over a hierarchy of abstract states s = (s_1, ..., s_n). Here we provide a similar adaptation of Q-learning [Watkins and Dayan, 1992] and present results from the toy environment. The sum approach naturally extends to Q-learning, but instead of summing over logits from each layer, we sum over action values. Let s_t = (s_{1,t}, ..., s_{n,t}) be the hierarchy of abstract states at time t; then the action value is given by

    Q(s_t, a) = Σ_{i=1}^{n} NN_{θ_i}(s_i, a),        (11)

where NN_{θ_i} is a neural network parameterised by θ_i.
The idea of the residual Q-learning method is similar to the residual policy gradient, i.e. we want to learn an action value function on the most abstract level and then adapt it on lower levels as more information on the state becomes available. Let s_t = (s_{1,t}, ..., s_{n,t}) be a hierarchy of abstract states at time t and R_i the reward function of the abstract MDP on the i-th level as defined in Section 3. Further define R_{n+1} ≡ 0. Then one can write the action value function for an episodic task with discount factor γ as

    Q(s, a) = E[ Σ_{t=0}^{T} γ^t R_{1,t} | S = s, A = a ]
            = E[ Σ_{t=0}^{T} γ^t Σ_{i=1}^{n} (R_{i,t} - R_{i+1,t}) | S = s, A = a ]
            = Σ_{i=1}^{n} Q^res_i(s_{i,t}, s_{i+1,t}, a),

where Q^res_i is defined as

    Q^res_i(s_{i,t}, s_{i+1,t}, a) = E[ Σ_{t=0}^{T} γ^t (R_{i,t} - R_{i+1,t}) | S = s, A = a ].

The dependency of the reward function R_{i,t} on the state s_{i,t} and action a is omitted for brevity. At each layer of the abstraction, the residual action value function Q^res_i learns how to adapt the action values of the previous layer to approximate the discounted return for the now more detailed state-action pair. This information difference is characterised by the difference in the reward functions of both layers, R_i - R_{i+1}.

To learn Q^res_i we use an adapted version of the Q-learning update rule. Given a transition (s_t, a, r_t, s_{t+1}), where s_t represents the full hierarchy of abstract states, the update is given by

    Q^res_i = (1 - α) Q^res_i + α (R_i - R_{i+1} + γ max_{a∈A} Q^res_i).

This can be seen as performing Q-learning on the following residual MDP M^res_i = (S_i × S_{i+1}, A, R^res_i, T^res_i, γ) with

    R^res_i = R_i(s_{i,t}, a) - R_{i+1}(s_{i+1,t}, a)   and
    T^res_i = T_i(s_{i,t+1} | s_{i,t}, a) · T_{i+1}(s_{i+1,t+1} | s_{i+1,t}, a).

We cannot apply the update rule directly, since each transition only contains a single sample for both reward functions, i.e. r_t is a sample of both R_i and R_{i+1}. As a result, the difference would be 0 for every transition. To be able to apply the learning rule, one solution would be to maintain averages of the reward functions for each state; this is infeasible in high-dimensional state spaces. Instead, we approximate R_{i+1} via the difference between the current and next state action value of the previous abstraction level, i.e.

    R_{i+1} = Q_{i+1}(s_{i+1,t}, a) - γ Q_{i+1}(s_{i+1,t+1}, a).        (12)

The adaptations made by Mnih et al. [2015] to leverage deep neural networks for Q-learning can be used to scale residual Q-learning to high-dimensional state spaces.
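To make the update concrete, here is a minimal tabular sketch under simplifying assumptions of ours: a small discrete action space, residual tables keyed only by the abstract state at each level (the definition above conditions Q^res_i on both s_i and s_{i+1}), and the reward difference replaced by the bootstrapped estimate of Eq. (12). The names q_res, q_at_level and update are ours.

    # Minimal tabular sketch of residual Q-learning for a hierarchy of abstract
    # states s = (s[0], ..., s[n-1]), where s[0] is the base state and s[n-1] the
    # most abstract one. Hyperparameters and state/action spaces are illustrative.
    from collections import defaultdict

    n_levels, n_actions = 3, 4
    alpha, gamma = 0.1, 0.99
    q_res = [defaultdict(float) for _ in range(n_levels)]  # one residual table per level

    def q_at_level(i, s, a):
        """Action value of abstraction level i: sum of residuals from level i upwards."""
        return sum(q_res[j][(s[j], a)] for j in range(i, n_levels))

    def update(s, a, r, s_next):
        """One residual Q-learning update for the transition (s, a, r, s_next)."""
        for i in reversed(range(n_levels)):
            if i == n_levels - 1:
                # Most abstract level: R_{n+1} = 0, so the reward difference is r itself.
                reward_diff = r
            else:
                # Approximate R_{i+1} with the bootstrapped estimate of Eq. (12).
                r_upper = q_at_level(i + 1, s, a) - gamma * q_at_level(i + 1, s_next, a)
                reward_diff = r - r_upper
            target = reward_diff + gamma * max(
                q_res[i][(s_next[i], b)] for b in range(n_actions))
            q_res[i][(s[i], a)] = (1 - alpha) * q_res[i][(s[i], a)] + alpha * target

    def q_value(s, a):
        """Full action value of the hierarchy: the sum of all residuals."""
        return q_at_level(0, s, a)

Updating from the most abstract level downwards mirrors the idea of first learning behaviour at the class level and only then correcting it with more specific information.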
[Figure 8: three rows of panels plotting training reward (left) and test reward (right) against the number of episodes (in 100) for the residual, sum, base and oracle methods.]

Figure 8: Training and generalisation results in the ideal environment. The abstract state perfectly determines the action to take in the base state (top). Noise is added to this relation, such that 40% of the time the optimal action is determined randomly beforehand (middle). In the ambiguous setting, not every abstract state has multiple substates (bottom). Experiments are run over five seeds.

B.2   Results in the Toy Environment
The results resemble those already discussed for the policy gradient case (cfr. Section 6.1). Without any noise or ambiguity in the state abstraction, the residual and sum approaches learn as efficiently with respect to the number of samples, and generalise as well, as if the perfect abstraction were given (Figure 8, top). The sum approach struggles to learn the action values on the correct abstraction level if the abstraction is ambiguous, hampering its generalisation performance (Figure 8, bottom). Both the sum and the residual approach generalise worse in the presence of noise (Figure 8, middle); however, the achieved reward at test time is less influenced by noise for the residual approach. We did not evaluate the residual Q-learning approach on the more complex environments, as policy gradient methods have shown better performance in previous work [Murugesan et al., 2021].

C   Toy environment
Here, a visualisation of the toy environment, details on the tree representing the environment, and the network architecture used for the policy are given.

C.1   Environment Details
As explained in Section 5, the environment is determined by a tree where each node represents a state of the base MDP (leaves) or an abstract state (inner nodes). In a noise-free setting the optimal action for a base state is determined by a corresponding abstract state, while with noise the optimal action is sampled at random with probability σ or else determined by the corresponding abstract state. A visualisation of the environment and what changes for each condition can be seen in Figure 9 and is summarised in Table 3.

        Setting   Branching factor   σ     Abstraction ambiguity

        Basic     [7,10,8,8]         0     No
        Noise     [7,10,8,8]         0.5   No
        Ambig.    [7,10,8,1]         0     Yes

Table 3: Properties of the environment tree for the three settings: basic, noise and ambiguous, with noise probability σ.

[Figure 9: three copies of the environment tree, with the optimal action (a number) shown at each node and the agent's position marked at a leaf; top: basic setting, middle: noise, bottom: ambiguous abstraction.]

Figure 9: Simplified version of the tree used to define the toy environment. In the base method the agent can only observe the leaf states. The optimal action for each state is displayed by the number in each node. At the top the environment in the basic setting is displayed, in the middle with noise, and at the bottom with ambiguous abstraction.

C.2   Network architecture and Hyperparameters
The policy is trained via REINFORCE [Williams, 1992] with a value function as baseline and entropy regularisation. The value function and policy are parameterised by two separate networks and the entropy regularisation parameter is β. For the baseline and the optimal abstraction method, the value network is a one-layer MLP of size 128 and the policy is a three-layer MLP of size (256, 128, 64). In case abstraction is available, the value network is kept the same but the policy network is duplicated (with a separate set of weights) for each abstraction level to parameterise the NN_{θ_i} (Section 4). The hyperparameters used for training are listed in Table 4 below.
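As an illustration of this setup, the following is a generic sketch of a REINFORCE loss with a value baseline and entropy regularisation, using the layer sizes stated above; it is not the authors' implementation, and the helper mlp, the action-space size and the way returns are supplied are our own simplifications.

    # Generic sketch of REINFORCE with a value baseline and entropy regularisation,
    # using the MLP sizes given above (not the authors' exact implementation).
    import torch
    import torch.nn as nn

    def mlp(sizes):
        layers = []
        for i in range(len(sizes) - 2):
            layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
        layers.append(nn.Linear(sizes[-2], sizes[-1]))
        return nn.Sequential(*layers)

    state_dim, n_actions, beta = 300, 8, 1.0              # state embedding size 300, beta = 1
    policy = mlp([state_dim, 256, 128, 64, n_actions])    # three-layer policy MLP
    value = mlp([state_dim, 128, 1])                      # one-layer value baseline

    def reinforce_loss(states, actions, returns):
        """Policy-gradient loss with value baseline and entropy bonus for one batch."""
        dist = torch.distributions.Categorical(logits=policy(states))
        baseline = value(states).squeeze(-1)
        advantage = returns - baseline.detach()
        policy_loss = -(dist.log_prob(actions) * advantage).mean()
        entropy_bonus = dist.entropy().mean()
        value_loss = (returns - baseline).pow(2).mean()
        return policy_loss - beta * entropy_bonus + value_loss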
        Hyperparameter                 Value

        number of episodes             5000
        evaluation frequency           100
        number of evaluation samples   20
        batch size                     32
        state embedding size           300
        learning rate                  0.0001
        β                              1

Table 4: Hyperparameters for training in the toy environment.

D   Wordcraft environment
D.1   Network architecture and Hyperparameters
We build upon the network architecture and training algorithm used by Jiang et al. [2020]. Here we only present the changes necessary to compute the abstract policies π_i defined in Section 4 and the chosen hyperparameters.

To process a hierarchy of states s_t = (s_{t,1}, ..., s_{t,n}), one can simply add a dimension to the state tensor representing the abstraction level. Classical network layers such as a multi-layer perceptron (MLP) can be adapted to process such a state tensor by adding a weight dimension representing the weights for each abstraction level and using batched matrix multiplication.
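A minimal PyTorch sketch of such a layer under our own naming (LevelwiseLinear) and initialisation: one weight matrix per abstraction level, applied to a state tensor of shape (batch, levels, features) via batched matrix multiplication.

    # Minimal sketch of a linear layer with a separate weight matrix per abstraction
    # level, applied with batched matrix multiplication (torch.bmm).
    import torch
    import torch.nn as nn

    class LevelwiseLinear(nn.Module):
        def __init__(self, n_levels, in_dim, out_dim):
            super().__init__()
            self.weight = nn.Parameter(0.01 * torch.randn(n_levels, in_dim, out_dim))
            self.bias = nn.Parameter(torch.zeros(n_levels, out_dim))

        def forward(self, x):                               # x: (batch, n_levels, in_dim)
            h = torch.bmm(x.transpose(0, 1), self.weight)   # (n_levels, batch, out_dim)
            return h.transpose(0, 1) + self.bias            # (batch, n_levels, out_dim)

    # Stacking such layers with nonlinearities in between gives one MLP per
    # abstraction level while sharing a single forward pass over the level dimension.
    x = torch.randn(32, 4, 300)                             # batch of 32, 4 abstraction levels
    print(LevelwiseLinear(4, 300, 128)(x).shape)            # torch.Size([32, 4, 128])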
                                                                                       0.0                 0.5          1.0              1.5           2.0                   2.5
cation. We tuned the learning rate and entropy regularisation                                                            Environment Steps                                 1e6

parameter (β) by performing a grid search over possible pa-
rameter values. We decided to collapse layers based on the          Figure 10: Generalisation results for the Wordcraft environment
additional abstraction each layer contributed (Section 4). For      with respect to unseen recipes. Experiments are run over 5 seeds.
the sum and residual approach a learning rate of 0.0001 and
β = 0.1 were used across experiments and layer 9-15 of the                           0.6
class hierarchy were collapsed. For the baseline a learning
rate of 0.001 and the β = 0.1 were used. Only in the unseen                          0.5

                                                                    Average Reward
goal condition the baseline with Glove embedding had to be                           0.4
trained with a learning rate of 0.0001.
                                                                                     0.3
                                                                                                                                                  No abstraction ­ GloVe
D.2    Additional Results                                                                                                                         Wordnet ­ residual ­ random
                                                                                     0.2                                                          Wordnet ­ sum ­ random
In the Wordcraft environment the results with respect to gen-                                  0.0   0.2          0.4   0.6      0.8       1.0   1.2            1.4         1.6
eralisation to unseen recipes (Figure 10) show similar pat-                                                              Environment Steps                                 1e6
terns to the generalisation results presented in Section 6.3.
With random embeddings adding class abstraction improves            Figure 11: Generalisation results for the Wordcraft environment
generalisation, but not as much if random embeddings are re-        with respect to unseen goal entities trained based on 10% of all avail-
placed with Glove embeddings [Pennington et al., 2014]. The         able recipes instead of 80%. Experiments are run over 5 seeds.
combination of class abstraction with Glove embeddings does
not lead to additional improvements. The objects that need
                                                                    tional overhead. If the method is tested in an environment
to be combined in Wordcraft rarely follow any class rules,
                                                                    where sampling is not the computational bottleneck it re-
i.e. objects are not combined based on their class. Word em-
                                                                    mains an open question whether the improvement in sample
beddings capture more fine-grained semantics rather than just
                                                                    efficiency also translates into higher reward after a fixed time
class knowledge [Pennington et al., 2014], therefore it is pos-
                                                                    budget. To test this we measured the averaged time needed
sible that class prior knowledge is less prone to overfitting. In
                                                                    for one timestep over multiple seeds and fixed hardware and
a low data setting it is possible that word embeddings contain
                                                                    plotted the reward after time passed for the baseline and for
spurious correlations which a policy can leverage to perform
                                                                    the residual method with abstract states based on WordNet.
well on during training but do not generalize well.
                                                                    All experiments were run on an a workstation with an Intel
   To test this hypothesis we created a low data setting in
                                                                    Xeon W2255 processor and a GeForce RTX 3080 graphics
which training was performed when only 10% of all possible
                                                                    card. From Figure 12 we can see that the baseline as well as
goal entities were part of the training set instead of the usual
                                                                    the extension via abstraction achieve the same reward after a
80%. In Figure 11 one can see that all methods have a drop
                                                                    fixed time. Even if computation time is an issue the residual
in performance (cfr. Section 6.3), but the drop is substanially
                                                                    approach takes no more time than the baseline and its gener-
smaller for the methods with abstraction and random embed-
                                                                    alisation performance in terms of normalised reward is better.
dings compared to the baseline with Glove embeddings. The
generalisation performance reverses, i.e. the baseline with
Glove embeddings now performs the worst.
   To understand whether and how the learned policy makes
use of the class hierarchy we visualise a policy trained with                            1.0
                                                                                                     WordNet­residual
the residual approach, random embeddings and access to the                               0.8         base
                                                                     Normalized Reward

the full class hierarchy extracted from WordNet. For the vi-
sualisation we choose recipes that follow a class rule. In                               0.6
Wordcraft one can create an airplane by combining a bird and
                                                                                         0.4
something that is close to a machine. From Figure 13 one can
see that the policy learns to pick the correct action on level                           0.2
nine by recognising that a vertebrate and machine need to be
combined.                                                                                        0               2000       4000         6000                   8000
                                                                                                                         Time in seconds
E     Runtime Analysis
Often an improved sample efficiency is achieved via an extension of an existing baseline, and this extension introduces a computational overhead. It is therefore not guaranteed that the improved sample efficiency also translates into higher reward after a fixed time budget. To test this, we measured the average time needed for one timestep over multiple seeds on fixed hardware and plotted the reward against elapsed time for the baseline and for the residual method with abstract states based on WordNet. All experiments were run on a workstation with an Intel Xeon W-2255 processor and a GeForce RTX 3080 graphics card. From Figure 12 we can see that the baseline and the extension via abstraction achieve the same reward after a fixed time budget. Hence, even when computation time is a concern, the residual approach takes no more time than the baseline, while its generalisation performance in terms of normalised reward is better.

Figure 12: Normalised reward over time (in seconds) for the baseline and for the residual approach with abstraction added from WordNet.
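The reward-versus-wall-clock comparison in Figure 12 can be reproduced by timing environment steps and re-plotting the learning curve against elapsed seconds. The sketch below is a minimal illustration; the env/agent interfaces and function names are placeholders rather than the actual training code:

import time
import numpy as np

def seconds_per_step(env, agent, n_steps=1000):
    """Estimate the average wall-clock time of one environment step
    (in practice averaged over multiple seeds on fixed hardware)."""
    obs = env.reset()
    start = time.perf_counter()
    for _ in range(n_steps):
        action = agent.act(obs)
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()
    return (time.perf_counter() - start) / n_steps

def reward_over_time(rewards_per_step, step_seconds):
    """Convert a per-step reward curve into a reward-vs-seconds curve,
    i.e. the kind of curve plotted in Figure 12."""
    elapsed = np.arange(1, len(rewards_per_step) + 1) * step_seconds
    return elapsed, np.asarray(rewards_per_step)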
[Figure 13 panels: Abstraction-Level 16 down to Abstraction-Level 0. Each panel lists the abstracted state entities (all "entity" at level 16, e.g. "tool vertebrate machine" among the state entities at level 9, and concrete entities such as "tyrannosaurus rex", "wizard", "vulture" and "glass" at the lowest levels), the current selection (-1), the abstracted goal (entity → physical_entity → object → ... → conveyance → vehicle → craft → heavier-than-air_craft → airplane) and a bar plot of the action probabilities.]

Figure 13: Visualisation of the policy trained via the residual approach, random embeddings and a class hierarchy extracted from WordNet. For each abstraction level we show the state representation, the selected entity (-1 means no entity has been selected yet) and the abstracted goal entity. The probability with which each entity is picked is plotted, and the red bars indicate the optimal actions.