
Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory

Arun Balajee Vasudevan · Dengxin Dai · Luc Van Gool

arXiv:1910.02029v2 [cs.CV] 3 Feb 2020


Abstract The role of robots in society keeps expanding, bringing with it the necessity of interacting and communicating with humans. In order to keep such interaction intuitive, we provide automatic wayfinding based on verbal navigational instructions. Our first contribution is the creation of a large-scale dataset with verbal navigation instructions. To this end, we have developed an interactive visual navigation environment based on Google Street View; we further design an annotation method to highlight mined anchor landmarks and local directions between them in order to help annotators formulate typical, human references to those. The annotation task was crowdsourced on the AMT platform to construct a new Talk2Nav dataset with 10,714 routes. Our second contribution is a new learning method. Inspired by spatial cognition research on the mental conceptualization of navigational instructions, we introduce a soft dual attention mechanism defined over the segmented language instructions to jointly extract two partial instructions – one for matching the next upcoming visual landmark and the other for matching the local directions to the next landmark. Along similar lines, we also introduce a spatial memory scheme to encode the local directional transitions. Our work takes advantage of the advances in two lines of research: mental formalization of verbal navigational instructions and training neural network agents for automatic wayfinding. Extensive experiments show that our method significantly outperforms previous navigation methods. For the demo video, dataset and code, please refer to our project page.

Keywords Vision-and-language Navigation · Long-range Navigation · Spatial Memory · Dual Attention

Arun Balajee Vasudevan · Dengxin Dai · Luc Van Gool
ETH Zurich, Switzerland
E-mail: {arunv, dai, vangool}@vision.ee.ethz.ch

Luc Van Gool
KU Leuven, Belgium

1 Introduction

Consider that you are traveling as a tourist in a new city and are looking for a destination that you would like to visit. You ask the locals and get a directional description: "go ahead for about 200 meters until you hit a small intersection, then turn left and continue along the street before you see a yellow building on your right". People give indications that are not purely directional, let alone metric. They mix in referrals to landmarks that you will find along your route. This may seem like a trivial ability, as humans do this routinely. Yet, it is a complex cognitive task that relies on the development of an internal, spatial representation that includes visual landmarks (e.g. "the yellow building") and possible, local directions (e.g. "going forward for about 200 meters"). Such a representation can support continuous self-localization as well as conveying a sense of direction towards the goal.

Just as a human can navigate in an environment when provided with navigational instructions, our aim is to teach an agent to perform the same task, to make human-robot interaction more intuitive. The problem has recently been tackled as a Vision-and-Language Navigation (VLN) problem [5]. Although important progress has been made, e.g. in constructing good datasets [5, 16] and proposing effective learning methods [5, 23, 73, 54], this stream of work mainly focuses on synthetic worlds [35, 19, 15] or indoor room-to-room navigation [5, 23, 73]. Synthetic environments limit the complexity of the visual scenes, while room-to-room navigation comes with challenges different from those of outdoors.

[Figure 1: example navigational instruction ("go straight and move forward till tri intersection near to the glendale shop turn right and reach the intersection road, on the right we have park entrance, go straight and move forward four blocks, we have a thai restaurant on our left, proceed forward 4 more blocks, reached the front of an atm"), the agent's panoramic observation, and a pie chart of the nine-way action space.]
(a) Top view of the navigation route from source to destination. (b) Street view of the route and the agent's status at the current location.
Fig. 1: (a) An illustration of an agent finding its way from a source point to a destination. (b) Status of the agent at the
current location including the segmented navigational instructions with red indicating visual landmark descriptions and blue
the local directional instructions, and the agent’s panoramic view and a pie chart depicting the action space. The bold text
represents the currently attended description. The predicted action at the current location is highlighted. The landmark that the
agent is looking for at the moment is indicated by a yellow box.

The first challenge to learning a long-range wayfinding model lies in the creation of large-scale datasets. In order to be fully effective, the annotators providing the navigation instructions ought to know the environment like locals would. Training annotators to reach the same level of understanding for a large number of unknown environments is inefficient – to create one verbal navigation instruction, an annotator needs to search through hundreds of street-view images, remember their spatial arrangement, and summarize them into a sequence of route instructions. This straightforward annotation approach would be very time-consuming and error-prone. Because of this challenge, the state-of-the-art work uses synthetic directional instructions [36] or works mostly on indoor room-to-room navigation. For indoor room-to-room navigation, this challenge is less severe, for two reasons: 1) the paths in indoor navigation are shorter; and 2) indoor environments have a higher density of familiar 'landmarks'. This makes self-localization, route remembering and describing easier.

To our knowledge, there is only one other work, by Chen et al. [16], on natural language based outdoor navigation, which also proposes an outdoor VLN dataset. Although they designed a clever data annotation method based on gaming – finding a hidden object at the goal position – the method is difficult to apply to longer routes (see the discussion in Section 3.2.2). Again, it is very hard for the annotators to remember, summarize and describe long routes in an unknown environment.

To address this challenge, we draw on studies in cognition and psychology on human visual navigation, which state the significance of using visual landmarks in route descriptions [39, 43, 66]. It has been found that route descriptions consist of descriptions for visual landmarks and local directional instructions between consecutive landmarks [58, 56]. Similar techniques – a combination of visual landmarks, as rendered icons, and highlighted routes between consecutive landmarks – are constantly used for making efficient maps [67, 74, 26]. We divide the task of generating the description for a whole route into a number of sub-tasks consisting of generating descriptions for visual landmarks and generating local directional instructions between consecutive landmarks. This way, the annotation tasks become simpler.

We develop an interactive visual navigation environment based on Google Street View, and more importantly design a novel annotation method which highlights selected landmarks and the spatial transitions in between. This enhanced annotation method makes it feasible to crowdsource this complicated annotation task. By hosting the tasks on the Amazon Mechanical Turk (AMT) platform, this work has constructed a new dataset, Talk2Nav, with 10,714 long-range routes within New York City (NYC).

The second challenge lies in training a long-range wayfinding agent. This learning task requires accurate visual attention and language attention, accurate self-localization and a good sense of direction towards the goal. Inspired by the research on the mental conceptualization of navigational instructions in spatial cognition [67, 56, 49], we introduce a soft attention mechanism defined over the segmented language instructions to jointly extract two partial instructions – one for matching the next upcoming visual landmark and the other for matching the spatial transition to the next landmark. Furthermore, the spatial transitions of the agent are encoded by an explicit memory framework which can be read from and written to as the agent navigates. One example of the outdoor VLN task can be found in Figure 1. Our work connects two lines of research that have been less explored together so far: mental formalization of verbal navigational instructions [67, 56, 49] and training neural network agents for automatic wayfinding [5, 36].

Extensive experiments show that our method outperforms previous methods by a large margin. We also show the contributions of the sub-components of our method, accompanied by detailed ablation studies. The collected dataset will be made publicly available.

2 Related Works

Vision & Language. Research at the intersection of language and vision has been conducted extensively in the last few years. The main topics include image captioning [44, 78], visual question answering (VQA) [2, 6], object referring expressions [8, 10], and grounded language learning [35, 37]. Although the goals are different from ours, some of the fundamental techniques are shared. For example, it is common practice to represent visual data with CNNs pre-trained for image recognition and to represent textual data with word embeddings pre-trained on large text corpora. The main difference is that the perceptual input to those systems is static while ours is active, i.e. the system's behavior changes the perceived input.

Vision Based Navigation. Navigation based on vision and reinforcement learning (RL) has become a very interesting research topic recently. The technique has proven quite successful in simulated environments [60, 83] and is being extended to more sophisticated real environments [59]. There has been active research on navigation-related tasks, such as localizing from only an image [75], finding the direction to the closest McDonald's using Google Street View images [46, 14], goal-based visual navigation [30] and others. Gupta et al. [30] use a differentiable mapper which writes into a latent spatial memory corresponding to an egocentric map of the environment, and a differentiable planner which uses this memory and the given goal to produce navigational actions in novel environments. There are a few other recent works on vision based navigation [64, 76]. Thoma et al. [64] formulate compact map construction and accurate self-localization for image-based navigation via a careful selection of suitable visual landmarks. Recently, Wortsman et al. [76] proposed a meta-reinforcement learning approach for visual navigation, where the agent learns to adapt to unseen environments in a self-supervised manner. There is another stream of work, called end-to-end driving, which aims to learn vehicle navigation directly from video inputs [33, 34].

Vision-and-Language Navigation. The task is to navigate an agent in an environment to a particular destination based on language instructions. Several recent works address the Vision-and-Language Navigation (VLN) [5, 73, 23, 72, 61, 54, 45] task. The general goal of these works is similar to ours – to navigate from a starting point to a destination in a visual environment with language directional descriptions. Anderson et al. [5] created the R2R dataset for indoor room-to-room navigation and proposed a learning method based on sequence-to-sequence neural networks. Subsequent methods [73, 72] apply reinforcement learning and cross-modal matching techniques on the same dataset. The same task was tackled by Fried et al. [23] using a speaker-follower technique to generate synthetic instructions for data augmentation and pragmatic inference. While sharing similarities, our work differs significantly from them. The environment domain is different; as discussed in Section 1, long-range navigation raises more challenges both in data annotation and model training. There are concurrent works aiming at extending Vision-and-Language Navigation to city environments [36, 16, 47]. The difference to Hermann et al. [36] lies in that our method works with real navigational instructions, instead of the synthetic ones summarized by Google Maps. This difference leads to different tasks and in turn to different solutions. Kim et al. [47] propose an end-to-end driving model that takes natural language advice to predict control commands for navigating in a city environment. Chen et al. [16] propose an outdoor VLN dataset similar to ours, where real instructions are created from Google Street View1 images. However, our method differs in the way that it decomposes the navigational instructions to make the annotation easier. It draws inspiration from the spatial cognition field to specifically promote annotators' memory and thinking, making the task less energy-consuming and less error-prone. More details can be found in Section 3.

Attention for Language Modeling. Attention mechanisms have been used widely for language [55] and visual modeling [78, 71, 4]. Language attention mechanisms have been shown to produce state-of-the-art results in machine translation [9] and other natural language processing tasks like VQA [42, 79, 40], image captioning [7], grounding referential expressions [40, 41] and others. The attention mechanism is one of the main components of top-performing algorithms such as the Transformer [68] and BERT [21] in NLP tasks. The MAC network by Hudson et al. [42] has a control unit which performs a weighted average of the question words based on a soft attention for the VQA task. Hu et al. [40] also proposed a similar language attention mechanism, but they decomposed the reasoning into sub-tasks/modules and predicted modular weights from the input text. In our model, we adapt the soft attention proposed by Kumar et al. [50] by applying it over segmented language instructions to put the attention over two different sub-instructions: a) for landmarks and b) for local directions. The attention mechanism is named dual attention.

Memory. There are generally two kinds of memory used in the literature: a) implicit memory and b) explicit memory. Implicit memory learns to memorize knowledge in the hidden state vectors via back-propagation of errors. Typical examples include RNNs [44] and LSTMs [22]. Explicit memory, however, features explicit read and write modules with an attention mechanism. Notable examples include Neural Turing Machines [28] and Differentiable Neural Computers (DNCs) [29]. In our work, we use external explicit memory in the form of a memory image which can be read from and written to by its read and write modules. Training a soft attention mechanism over language segments coupled with an explicit memory scheme makes our method more suitable for long-range navigation, where the reward signals are sparse.

1 https://developers.google.com/streetview/

[Figure 2: a route from source node S to destination node D through landmarks 1-4, with road nodes and road segments marked.]

Fig. 2: An illustrative route from a source node to a destination node with all landmarks labelled along the route. In our case, the destination is landmark 4. Sub-routes are defined as the routes between two consecutive landmarks.

3 Talk2Nav Dataset

The target is to navigate using language descriptions in a real outdoor environment. Recently, a few datasets on the language based visual navigation task have been released, both for indoor [5] and outdoor [13, 16] environments. Existing datasets typically annotate one overall language description for the entire path/route. This poses challenges for annotating longer routes, as stated in Section 1. Furthermore, these annotations lack the correspondence between language descriptions and sub-units of a route. This in turn poses challenges in learning long-range vision-and-language navigation (VLN). To address the issue, this work proposes a new annotation method and uses it to create a new dataset, Talk2Nav.

Talk2Nav contains navigation routes at the city level. A navigational city graph is created with nodes as locations in the city and connecting edges as the road segments between the nodes, similar to Mirowski et al. [59]. A route from a source node to a destination node is composed of densely sampled atomic unit nodes. Each node contains a) a street-view panoramic image, b) the GPS coordinates of that location, and c) the bearing angles to connecting edges (roads). Furthermore, we enrich the routes with intermediary visual landmarks, language descriptions for these visual landmarks and the local directional instructions for the sub-routes connecting the landmarks. An illustration of a route is shown in Figure 2.

3.1 Data Collection

To retrieve city route data, we used OpenStreetMap to obtain the metadata of locations and Google's APIs for maps and street-view images.

Path Generation. The path from a source to a destination is sampled from a city graph. To that aim, we used OpenStreetMap (OSM), which provides the latitudes, longitudes and bearing angles of all the locations (waypoints) within a pre-defined region of the map. We use the Overpass API to extract all the locations from OSM, given a region. A city graph is defined by taking the locations of the atomic units as the nodes, and the directional connections between neighbourhood locations as the edges. The K-means clustering algorithm (k = 5) is applied to the GPS coordinates of all nodes. We then randomly pick two nodes from different clusters to ensure that the source node and the destination node are not too close. We then use the open-source OpenStreet Routing Machine to extract a route plan from every source node to the destination node, which are sampled from the set of all compiled nodes. The A* search algorithm is then used to generate a route by finding the shortest traversal path from the source to the destination in the city graph.
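To make the path-generation step concrete, here is a minimal, illustrative sketch (not the authors' released code) using networkx and scikit-learn: it clusters the compiled OSM nodes with K-means (k = 5), samples a source and a destination from different clusters, and extracts the shortest traversal path with A*. The graph construction, edge weights and data layout are assumptions for illustration.

```python
# Illustrative sketch of path generation: cluster OSM road nodes with K-means (k = 5),
# sample a source and a destination from different clusters, and connect them with a
# shortest path. Not the authors' pipeline; data layout and weights are assumptions.
import random
import networkx as nx
from sklearn.cluster import KMeans

def sample_route(nodes, edges, k=5):
    """nodes: {node_id: (lat, lon)}, edges: [(u, v, length_m)] mined from OSM."""
    G = nx.DiGraph()
    for u, v, length in edges:
        G.add_edge(u, v, weight=length)

    ids = list(nodes)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict([nodes[i] for i in ids])
    cluster_of = dict(zip(ids, labels))

    # Pick source and destination from different clusters so they are not too close.
    src, dst = random.sample(ids, 2)
    while cluster_of[src] == cluster_of[dst]:
        src, dst = random.sample(ids, 2)

    # A* search (with a trivial zero heuristic here) finds the shortest traversal path.
    return nx.astar_path(G, src, dst, heuristic=lambda a, b: 0.0, weight="weight")
```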
Street View Images. Once the routes are generated using OSM, we densely interpolate the latitudes and longitudes of the locations along the routes to extract more road locations. We then use the Google Street View APIs to get the closest Street View panorama and omit the locations which do not have a street-view image. We collect the 360° street-view images along with their heading angles by using the Google Street View APIs and the metadata2. The API allows for downloading tiles of a street-view panoramic image, which are then stitched together to get the equirectangular projection image. We use the heading angle to re-render the street-view panorama image such that the image is centre-aligned with the heading direction of the route.

We use OSM and open-source OSM-related APIs for the initial phase of extracting road nodes and route plans. Later, however, we move to Google Street View for mining street-view images and their corresponding metadata. We noted that it was not straightforward to mine the locations/road nodes from Google Street View.

2 http://maps.google.com/cbk?output=xml&ll=40.735357,-73.918551&dm=1
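For an equirectangular panorama, the heading-based re-rendering described above amounts to a horizontal circular shift. The sketch below is a simplified assumption of that step (the exact Street View conventions for the panorama heading may differ) and is not the authors' implementation.

```python
# Minimal sketch: re-centre an equirectangular street-view panorama so that the route's
# heading appears at the image centre. Assumes the stitched panorama spans 0-360 degrees
# horizontally and that `pano_heading` is the compass heading of the panorama centre as
# reported by the metadata; exact conventions are an assumption.
import numpy as np

def center_on_heading(pano: np.ndarray, pano_heading: float, route_heading: float) -> np.ndarray:
    """pano: H x W x 3 equirectangular image; headings in degrees."""
    h, w, _ = pano.shape
    # Horizontal pixel shift proportional to the angular difference.
    delta = (route_heading - pano_heading) % 360.0
    shift = int(round(delta / 360.0 * w))
    return np.roll(pano, -shift, axis=1)
```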
3.2 Directional Instruction Annotation

The main challenge of data annotation for automatic language-based wayfinding lies in the fact that the annotators need to play the role of an instructor, as local people do for tourists. This is especially challenging when the annotators do not know the environment well. The number of street-view images for a new environment is tremendous and searching through them can be costly, not to mention remembering and summarizing them into verbal directional instructions.

Inspired by the large body of work in cognitive science on how people mentally conceptualize route information and convey routes [48, 56, 67], our annotation method is designed to specifically promote the annotators' memory and thinking.

[Figure 3: the annotation interface, showing the left/front/right/rear perspective views, the interactive Street View panel, the route map with landmark markers, and the text box with the prompt "Describe the route description in a short sentence. E.g: Go straight for 50m and then turn right."]
Fig. 3: The annotation interface used on Mechanical Turk. Box 1 represents the text box where the annotators write the descriptions. Box 2 denotes the perspective-projected images of the left, front, right and rear views at the current location. Box 3 shows the Street View environment, which the annotator can drag and rotate to get a 360° view. Box 4 represents the path to be annotated, with markers for the landmarks. The red line in Box 4 denotes the current sub-route. The annotator can navigate forward and backward along the marked route (lines drawn in red and green), perceiving the Street View simultaneously on the left. Text boxes are provided for describing the marked route, i.e. the landmarks and the directions between them.

For the route directions, people usually refer to visual landmarks [39, 56, 65] along with a local directional instruction [67, 69]. Visualizing the route with highlighted salient landmarks and local directional transitions compensates for limited familiarity with or understanding of the environment.

3.2.1 Landmark Mining

The ultimate goal of this work is to make human-to-robot interaction more intuitive, thus the instructions need to be similar to those used in daily human communication. Simple methods of mining visual landmarks based on generic CNN features may yield images which are hard for humans to distinguish and describe, and hence may lead to low-quality or unnatural instructions.

We frame the landmark mining task as a summarization problem using sub-modular optimization to create summaries that take into account multiple objectives. In this work, three criteria are considered: 1) the selected images are encouraged to spread out along the route to support continuous localization and guidance; 2) images close to road intersections, and on the approaching side of the intersections, are preferred for better guidance through the intersections; and 3) images which are easy to describe, remember and identify are preferred for effective communication.

Given the set of all images I along a path P, the problem is formulated as a subset selection problem that maximizes a linear combination of the three submodular objectives:

L = \arg\max_{L' \subseteq \wp(I)} \sum_{i=1}^{3} w_i f_i(L', P), \quad \text{s.t. } |L'| = l    (1)

where \wp(I) is the power set of I, L' ranges over all possible subsets of size l, w_i are non-negative weights, and f_i are the sub-modular objective functions. More specifically, f_1 is the minimum travel distance between any two consecutive selected images along the route P; f_2 = 1/(d + \sigma), with d the distance to the closest approaching intersection and \sigma set to 15 meters to avoid an infinitely large value at intersection nodes; and f_3 is a learned ranking function which signals the ease of describing and remembering the selected images. The weights w_i in Equation 1 are set empirically: w_1 = 1, w_2 = 1 and w_3 = 3. l is set to 3 in this work, as the source and the destination nodes are fixed and we choose three intermediate landmarks, as shown in Figure 2. Routes of this length, with 3 intermediate landmarks, are already long enough for current navigation algorithms. The function f_3 is a learned ranking model, which is presented in the next paragraph.
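Exactly maximizing the cardinality-constrained objective in Equation 1 is combinatorial; a standard way to approximate it for submodular objectives is greedy selection. The sketch below illustrates such a greedy maximizer under the stated weights; the objective callables and the choice of optimizer are assumptions, since the solver is not spelled out here.

```python
# Greedy maximization sketch for the subset-selection objective in Equation 1.
# The terms f1 (spread along the route), f2 = 1/(d + sigma) (closeness to the approaching
# side of intersections) and f3 (learned ranking score) are passed in as callables.
# This generic greedy approximation is illustrative, not necessarily the authors' optimizer.
def select_landmarks(images, route, objectives, weights=(1.0, 1.0, 3.0), l=3):
    def score(subset):
        return sum(w * f(subset, route) for w, f in zip(weights, objectives))

    selected = []
    for _ in range(l):
        candidates = [img for img in images if img not in selected]
        best = max(candidates, key=lambda img: score(selected + [img]))
        selected.append(best)
    return selected
```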
    (a) the annotated region in NYC      (b) the created city graph     (c) a route with landmarks and sub-routes     (d) the structure of a node

                            Fig. 4: An illustration of our Talk2Nav dataset at multiple granularity levels.

Ranking model. In order to train the ranking model that scores images as 'visual landmarks' which are easy to describe and remember, we compile images from three cities: New York City (NYC), San Francisco (SFO) and London, covering different scenes such as high buildings, open fields, downtown areas, etc. We select 20,000 pairs of images from the compiled set. A pairwise comparison is performed over the 20,000 pairs to choose one image over the other as the preferred landmark. We crowd-sourced the annotation with the following criteria: a) Describability – how easy it is to describe for the annotator, b) Memorability – how easy it is to remember for the agent (e.g. a traveler such as a tourist), and c) Recognizability – how easy it is to recognize for the agent. Again, our ultimate goal is to make human-to-robot interaction more intuitive, so these criteria are inspired by how humans select landmarks when formulating navigational descriptions in daily life.

We learn the ranking model with a Siamese network [17] by following [31]. The model takes a pair of images and scores the selected image higher than the other one. We use the Huber rank loss as the cost function. In the inference stage, the model (f_3 in Equation 1) outputs the averaged score over all selected images, signalling their suitability as visual landmarks for the VLN task.
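A possible realization of such a pairwise ranker is sketched below in PyTorch: a shared backbone scores both images of a pair, and training pushes the preferred landmark to score higher. The ResNet-18 backbone, the margin, and the exact form of the Huber rank loss (here, smooth-L1 on the violated margin) are illustrative assumptions, not the authors' specification.

```python
# Sketch of a pairwise Siamese ranking model: the same scorer is applied to both images
# of a pair and trained so that the preferred landmark scores higher. Backbone choice,
# feature dimension, margin and loss form are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models

class LandmarkRanker(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()           # expose the 512-d pooled features
        self.backbone = backbone
        self.score = nn.Linear(512, 1)

    def forward(self, img):
        return self.score(self.backbone(img)).squeeze(-1)

def huber_rank_loss(model, preferred, other, margin=1.0):
    # Penalize pairs where the preferred image does not outscore the other by `margin`.
    s_pref, s_other = model(preferred), model(other)
    violation = torch.relu(margin - (s_pref - s_other))
    return nn.functional.smooth_l1_loss(violation, torch.zeros_like(violation))
```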
                                                                         tion. To minimize the effort of understanding the street-view
                                                                         scenes, we also provide four perspective images for the left-
3.2.2 Annotation and Dataset Statistics
                                                                         view, the front-view, the right-view, and the rare-view.
In the literature, a few datasets have been created for similar              The annotator can navigate through the complete route
tasks [5, 16, 70, 80]. For instance, Anderson et al. [5] anno-           comprising of m landmark nodes and m intermediate sub-
tated the language description for the route by asking the               routes and is asked to provide descriptions for all landmarks
user to navigate the entire path in egocentric perspective. In-          and descriptions for all sub-routes. In this work, we use m =
corporation of overhead map of navigated route as an aid for             4, which means that we collect 4 landmark descriptions and
describing the route can be seen in [16, 70, 80].                        4 local directional descriptions for each route as shown in
    In the initial stages of our annotation process, we fol-             Figure 2. We then append the landmark descriptions and the
lowed the above works. We allowed the annotators to nav-                 directional descriptions for the sub-routes one after the other
igate the entire route and describe a single instruction for             to yield the complete route description. The whole route de-
the complete navigational route. We observed that a) the de-             scription is generated in a quite formulaic way which may
scriptions are mostly about the ending part of the routes in-            lead to not very fluent descriptions. The annotation method
dicating that the annotators forget the earlier stages of the            offers an efficient and reliable solution at a modest price of
routes; b) the annotators took a lot of time to annotate as they         language fluency.

Table 1: A comparison of our Talk2Nav dataset with the R2R dataset [5] and the TouchDown dataset [16]. Avg stands for Average, desc for description, and (w) for (words).

Criteria                                R2R [5]   Touchdown [16]   Talk2Nav
# of navigational routes                 7,189         9,326         10,714
# of panoramic views                    10,800        29,641         21,233
# of panoramic views per path              6.0          35.2           40.0
# of navigational descriptions          21,567         9,326         10,714
# of navigational desc per path              3             1              1
Length of navigational desc (w)             29          89.6           68.8
# of landmark desc                           -             -         34,930
Length of landmark desc (w)                  -             -              8
# of local directional desc                  -             -         27,944
Length of local directional desc (w)         -             -            7.2
Vocabulary size                          3,156         4,999          5,240
Sub-unit correspondence                     No            No            Yes

Fig. 5: The length distribution of the landmark instructions (a), local directional instructions (b), and complete navigational instructions (c).

Statistics. We gathered 43,630 locations, including GPS coordinates (latitudes and longitudes), bearing angles, etc., in New York City (NYC), covering an area of 10 km × 10 km as shown in Figure 4. Out of all those locations, we managed to compile 21,233 street-view images (outdoor) – one for each road node. We constructed a city graph with 43,630 nodes and 80,000 edges. The average number of navigable directions per node is 3.4.

We annotated 10,714 navigational routes. These route descriptions are composed of 34,930 node descriptions and 27,944 local directional descriptions. Each navigational instruction comprises 5 landmark descriptions (we use only 4 in our work) and 4 directional instructions. The 5 landmark descriptions include the descriptions of the starting road node, the destination node and the three intermediate landmarks. Since the agent starts the VLN task from the starting node, we use only 4 landmark descriptions along with the 4 directional instructions. The starting point description can be used for automatic localization, which is left for future work. The average lengths of the navigational instructions, the landmark descriptions and the local directional instructions of the sub-routes are 68.8 words, 8 words and 7.2 words, respectively. In total, our dataset Talk2Nav contains 5,240 unique words. Figure 5 shows the distribution of the length (number of words) of the landmark descriptions, the local directional instructions and the complete navigational instructions.

In Table 1, we show a detailed comparison with the R2R dataset [5] and the TouchDown dataset [16] under different criteria. While the three datasets share similarities in multiple aspects, the R2R and TouchDown datasets only annotate one overall language description for the entire route. It is costly to scale that annotation method to longer routes and to a large number of unknown environments. Furthermore, the annotations in these two datasets lack the correspondence between language descriptions and sub-units of a route. Our dataset offers it without increasing the annotation effort. The detailed correspondence of the sub-units facilitates the learning task and enables an interesting cross-difficulty evaluation, as presented in Section 5.2.

The average time taken for the landmark ranking task is 5 seconds, and for the route description task around 3 minutes. In terms of payment, we paid $0.01 and $0.5 for the landmark ranking task and the route description task, respectively. Hence, the hourly rate of payment for annotation is $10. We have a qualification test; only the annotators who passed it were employed for the real annotation. In total, 150 qualified workers were employed. During the whole course of the annotation process, we regularly sample a subset of the annotations from each worker and check their annotated route descriptions manually. If the annotation is of low quality, we reject the particular jobs and give feedback to the workers for improvement.

We restrict the natural language to English in our work. However, there are still biases in the collected annotations, as generally studied in the works of Bender et al. [11, 12]. The language variety in our route descriptions comprises social dialects from the United States, India, Canada, the United Kingdom, Australia and others, showing diverse annotator demographics. Annotators are instructed to follow guidelines for the descriptions, such as i) do not describe based on street names or road names, and ii) do not rely on non-stationary objects as visual landmarks. We also later apply certain curation techniques to the provided descriptions, such as a) removing all non-unicode characters, and b) correcting misspelled words using the open-source LanguageTool [1].

4 Approach

Our system is a single agent traveling in an environment represented as a directed connected graph. We create a street-view environment on top of the compiled city graph of our Talk2Nav dataset. The city graph consists of 21,233 nodes/locations with the corresponding street-view images (outdoor) and 80,000 edges representing the travelable road segments between the nodes. Nodes are points on the road belonging to the annotated region of NYC. Each node contains a) a 360° street-view panoramic image, b) the GPS coordinates of the corresponding location (defined by latitude and longitude), and c) the bearing angles of this node in the road network. A valid route is a sequence of connected edges from a source node to a destination node. Please see Figure 4 for a visualization of these concepts.

During the training and testing stages, we use a simulator to navigate the agent in the street-view environment of the city graph. Based on the predicted action at each node, the agent moves to the next node in the environment. A successful navigation is defined as the agent correctly following the route referred to by the natural language instruction. For the sake of tractability, we assume no uncertainty (such as dynamic changes in the environment or noise in the actuators) in the environment, hence it is a deterministic goal setting.

Our underlying simulator is the same as the ones used by previous methods [5, 16, 36]. It, however, has a few different features. The simulator of [5] is built for indoor navigation. The simulators of [16, 36] are developed for outdoor (city) navigation but differ in the region used for annotation. The action spaces are also different. The simulators of [5, 16, 36] use an action space consisting of left, right, up, down, forward and stop actions. We employ an action space of nine actions: eight moving actions in the eight equi-spaced angular directions between 0° and 360°, and the stop action. For each moving action, we select the road which has the minimum angular distance to the predicted moving direction. In the following sections, we present the method in detail.
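The action execution rule described above (take the outgoing road with the minimum angular distance to the predicted direction, or stop) can be sketched as follows; the function and variable names are hypothetical.

```python
# Sketch of the action mapping: the agent predicts one of nine actions (eight headings
# spaced 45 degrees apart, plus stop); a moving action is executed by taking the outgoing
# road whose bearing is angularly closest to the predicted direction. Names are illustrative.
STOP = 8  # actions 0..7 correspond to headings 0, 45, ..., 315 degrees

def execute_action(action_id, outgoing_bearings):
    """outgoing_bearings: list of (neighbor_node, bearing_deg) for the current node."""
    if action_id == STOP:
        return None
    target = action_id * 45.0

    def angular_distance(bearing):
        d = abs(bearing - target) % 360.0
        return min(d, 360.0 - d)

    neighbor, _ = min(outgoing_bearings, key=lambda nb: angular_distance(nb[1]))
    return neighbor
```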
in details.                                                          We segment the given instruction to two classes: a) visual
                                                                     landmark descriptions and b) local directional instructions
                                                                     between the landmarks. We employ the BERT transformer
4.1 Route Finding Task                                               model [21] to classify a given language instruction X into a
                                                                     sequence of token classes {ci }ni=1 where ci ∈ {0, 1}, with
The task defines the goal of an embodied agent to navigate           0 denoting the landmark descriptions and and 1 the local
from a source node to a destination node based on a naviga-          directional instructions. By grouping consecutive segments
tional instruction. Specifically, given a natural language in-       of the same token class, the whole instruction is segmented
struction X = {x1 , x2 , ..., xn }, the agent needs to perform       into interleaving segments. An example of the segmenta-
a sequence of actions {a1 , a2 , ..., am } from the action space     tion is shown in Figure 1 and multiple more in Figure 8.
A to hop over nodes in the environment space to reach the            Those segments are used as the basic units of our attention
destination node. Here xi represents individual word in an           scheme rather than the individual words used by previous
instruction while ai denotes an element from the set of nine         methods [5]. This segmentation converts the language de-
actions (Detailed in Section 4.1.5). When the agent takes            scription into a more structured representation, aiming to fa-
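To make the segmentation step concrete, the sketch below shows how a BERT token classifier could be applied to an instruction and its per-token labels grouped into segments. It assumes the HuggingFace transformers API; the checkpoint name and the untrained classification head are placeholders, since the actual classifier is fine-tuned on Talk2Nav alignments (see Section 4.3).

```python
# Sketch: label each word as landmark (0) or direction (1) with a BERT token
# classifier, then group adjacent words with the same label into segments.
# The classification head here is random; a Talk2Nav fine-tuned checkpoint is assumed.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def segment_instruction(instruction: str):
    words = instruction.split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        piece_labels = model(**enc).logits.argmax(-1)[0].tolist()
    # One label per word: take the label of the word's first word-piece.
    word_labels = {}
    for idx, wid in enumerate(enc.word_ids(0)):
        if wid is not None and wid not in word_labels:
            word_labels[wid] = piece_labels[idx]
    # Group consecutive words with the same class into segments.
    segments, cur_label, cur_words = [], None, []
    for i, w in enumerate(words):
        label = word_labels.get(i, 1)
        if cur_words and label != cur_label:
            segments.append((cur_label, " ".join(cur_words)))
            cur_words = []
        cur_label = label
        cur_words.append(w)
    if cur_words:
        segments.append((cur_label, " ".join(cur_words)))
    return segments  # e.g. [(0, "we have restaurant on our left"), (1, "go ahead until ...")]
```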
Fig. 6: The illustrative diagram of our model. We have a soft attention mechanism over segmented language instructions,
which is controlled by the Indicator Controller. The segmented landmark description and local directional instruction are
matched with the visual image and the memory image respectively by two matching modules. The action module fetches
features from the visual observation, the memory image, the features of the language segments, and the two matching scores
to predict an action at each step. The agent then moves to the next node, updates the visual observation and the memory
image, continues the movement, and so on until it reaches the destination.

Denote by T(X) the sequence of landmark and local directional instruction segments:

T(X) = ((L^1, D^1), (L^2, D^2), ..., (L^J, D^J)),    (2)

where L^j denotes the feature representation of the j-th landmark description segment, D^j denotes the feature representation of the j-th directional instruction segment, and J is the total number of segment pairs in the route description.

As it moves in the environment, the agent maintains a reference position in the language description in order to put the focus on the most relevant landmark description and the most relevant local directional instruction. This is modeled by a differentiable soft attention map. Let us denote by η_t the reference position at time step t. The relevant landmark description at time step t is extracted as

L̄_{η_t} = Σ_{j=1}^{J} L^j e^{−|η_t − j|}    (3)

and the relevant directional instruction as

D̄_{η_t} = Σ_{j=1}^{J} D^j e^{−|η_t − j|},    (4)

where η_{t+1} = η_t + φ_t(·), η_0 = 1, and φ_t(·) is an Indicator Controller learned to output 1 when a landmark is reached and 0 otherwise. The controller is shown in Figure 6 and is defined in Section 4.1.4. After a landmark is reached, η_t is incremented by 1 and the attention map then centers around the next pair of landmark and directional instructions. The agent successfully ends the navigation task when φ_t(·) outputs 1 and there is no further pair of landmark and directional instructions left. We initialize η_0 to 1 so as to position the language attention around the first pair of landmark and directional instructions.
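The sketch below illustrates Eqs. (3) and (4) together with the indicator-driven update of η_t; the segment features are assumed to be precomputed tensors, and the indicator output is a placeholder for the learned controller of Section 4.1.4.

```python
# Sketch of Eqs. (3) and (4): softly attend over the segment features around
# eta_t, then advance eta_t by the indicator output (eta_{t+1} = eta_t + phi_t).
import torch

def attend(segments: torch.Tensor, eta: float) -> torch.Tensor:
    """segments: (J, d) tensor of landmark (or directional) segment features."""
    j = torch.arange(1, segments.size(0) + 1, dtype=torch.float32)
    weights = torch.exp(-(eta - j).abs())            # e^{-|eta_t - j|}
    return (weights.unsqueeze(1) * segments).sum(0)  # (d,) softly attended feature

J, d = 4, 512
L = torch.randn(J, d)    # landmark segment features L^1 ... L^J
D = torch.randn(J, d)    # directional segment features D^1 ... D^J

eta = 1.0                # eta_0 = 1: start at the first (landmark, direction) pair
for t in range(10):
    L_bar, D_bar = attend(L, eta), attend(D, eta)
    # ... match L_bar against the current view and D_bar against the memory ...
    phi = 0              # placeholder for the Indicator Controller output (0 or 1)
    eta += phi           # the attention re-centres on the next pair once phi = 1
```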
4.1.2 Visual Observation

The agent perceives the environment with an equipped 360° camera which obtains the visual observation I_t of the environment at time step t. From the image, a feature ψ1(I_t) is extracted and passed to the matching module as shown in Figure 6, which estimates the similarity (matching score) between the visual observation I_t and a softly attended landmark description L̄_{η_t}, which is modulated by our attention module. The visual feature is also passed to the action module to predict the action for the next step, as shown in Section 4.1.5.

Fig. 7: Exemplar memory images. The blue square denotes the source node, and the blue disk denotes the current location of the agent. The red lines indicate the traversed paths. Each memory image is associated with its map scale (metres per pixel), shown in the text box below it.

4.1.3 Spatial Memory

Inspired by [25, 29], an external memory M_t explicitly memorizes the agent's traversed path from the latest visited landmark. When the agent reaches a landmark, the memory M_t is reinitialized so that it stores only the path traversed since that landmark. This reinitialization can be understood as a form of attention that focuses on the recently traversed path, in order to better localize the agent and to better match against the relevant directional instruction D̄_{η_t} selected by the learned language attention module defined in Section 4.1.1.

As the agent navigates in the environment, a write module writes the traversed path into the memory image. For a real system navigating in the physical world, the traversed path can be obtained from an odometry system; in this work, we compute the traversed path directly from the road network of our simulated environment. The write module traces the path travelled from the latest visited landmark to the current position. The path is rasterized and written into the memory image, where it is indicated by red lines; the starting point is marked by a blue square and the current location by a blue disk. The write module always writes from the centre of the memory image to make sure that there is room for all directions. Whenever the coordinates of a newly rasterized pixel fall outside the image, the module incrementally increases the scale of the memory image until the new pixel lies inside the image with a distance of at least 10 pixels to the boundary. An image of 200 × 200 pixels is used and the initial scale of the map is set to 5 meters per pixel. Examples of memory images are shown in Figure 7.

Each memory image is associated with the value of its scale (meters per pixel). Deep features ψ2(M_t) are extracted from the memory image M_t and concatenated with the scale value to form the representation of the memory. The concatenated features are passed to the matching module, which verifies the semantic similarity between the traversed path and the provided local directional instruction. The concatenated features are also provided to the action module, along with the local directional instruction features, to predict the action for the next step.
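A rough sketch of the write-module behaviour described above is given below: it rasterizes the path with PIL, marks the start with a blue square and the current position with a blue disk, and enlarges the map scale until the path fits with a 10-pixel margin. The growth factor and the drawing sizes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the memory-image write module: rasterize the path traversed since
# the latest landmark into a 200x200 image, rescaling so everything fits.
from PIL import Image, ImageDraw

SIZE, MARGIN, INIT_SCALE = 200, 10, 5.0     # pixels, pixels, metres per pixel

def write_memory_image(path_xy):
    """path_xy: (x, y) offsets in metres from the latest visited landmark."""
    def to_px(p, s):                          # the latest landmark sits at the centre
        return (SIZE / 2 + p[0] / s, SIZE / 2 + p[1] / s)

    scale = INIT_SCALE
    # Grow the scale until every rasterized point keeps a 10-pixel margin.
    while any(not (MARGIN <= c <= SIZE - MARGIN)
              for p in path_xy for c in to_px(p, scale)):
        scale *= 1.1                          # growth factor: an assumption

    img = Image.new("RGB", (SIZE, SIZE), "white")
    draw = ImageDraw.Draw(img)
    px = [to_px(p, scale) for p in path_xy]
    if len(px) > 1:
        draw.line(px, fill="red", width=2)    # traversed path
    x0, y0 = px[0]
    draw.rectangle([x0 - 3, y0 - 3, x0 + 3, y0 + 3], fill="blue")   # start marker
    x1, y1 = px[-1]
    draw.ellipse([x1 - 3, y1 - 3, x1 + 3, y1 + 3], fill="blue")     # current node
    return img, scale                         # the scale value is appended to the features

memory, scale = write_memory_image([(0, 0), (40, 0), (40, 120), (-60, 120)])
```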
4.1.4 Matching Module

The matching module is used to determine whether the aimed landmark has been reached. As shown in Figure 6, the matching score is determined by two complementary matching modules: 1) between the visual scene ψ1(I_t) and the extracted landmark description L̄_{η_t}, and 2) between the spatial memory ψ2(M_t) and the extracted directional instruction D̄_{η_t}. For both cases, we use a generative image captioning model based on the transformer model proposed in [68] and compute the probability of reconstructing the language description given the image. The score is the generative probability averaged over all words of the instruction. Let s^1_t be the score for the pair (ψ1(I_t), L̄_{η_t}) and s^2_t the score for the pair (ψ2(M_t), D̄_{η_t}). We then form the score feature s_t by concatenating the two: s_t = (s^1_t, s^2_t). The score feature s_t is fed into a controller φ(·) to decide whether the aimed landmark has been reached:

φ_t(s_t, h_{t−1}) ∈ {0, 1},    (5)

where 1 indicates that the aimed landmark is reached and 0 otherwise. φ_t(·) is an Adaptive Computation Time (ACT) LSTM [27], which allows the controller to learn to make decisions at variable time steps; h_{t−1} is the hidden state of the controller. In this work, φ_t(·) learns to identify the landmarks despite the variable number of intermediate navigation steps between them.
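In simplified form, the scoring and the controller interface could look like the sketch below. The captioning models are assumed to expose per-word log-probabilities, and the ACT mechanism of [27] is replaced by a plain LSTM cell with a binary head, so this is only a stand-in for the actual controller.

```python
# Simplified sketch of the matching scores s_t = (s^1_t, s^2_t) and of the
# indicator controller phi_t; the real controller is an ACT LSTM [27].
import torch
import torch.nn as nn

def matching_score(word_logprobs: torch.Tensor) -> torch.Tensor:
    """Average generative probability of the instruction words given the image."""
    return word_logprobs.exp().mean()

class IndicatorController(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.cell = nn.LSTMCell(2, hidden)    # input: the two matching scores
        self.head = nn.Linear(hidden, 1)

    def forward(self, s_t, state):
        h, c = self.cell(s_t, state)
        reached = (torch.sigmoid(self.head(h)) > 0.5).float()  # 1: landmark reached
        return reached, (h, c)

controller = IndicatorController()
state = (torch.zeros(1, 64), torch.zeros(1, 64))
s1 = matching_score(torch.log(torch.tensor([0.4, 0.6, 0.5])))  # view vs. landmark text
s2 = matching_score(torch.log(torch.tensor([0.7, 0.2, 0.6])))  # memory vs. directions
reached, state = controller(torch.stack([s1, s2]).view(1, 2), state)
```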
4.1.5 Action Module

The action module takes the following inputs to decide the moving action: a) the pair (ψ1(I_t), L̄_{η_t}), b) the pair (ψ2(M_t), D̄_{η_t}), and c) the matching score feature s_t. The illustrative diagram is shown in Figure 6.

The inputs in group a) are used to predict a^e_t, a probability vector over the action space; the inputs in group b) are used to predict a^m_t, a second probability vector over the action space. The two probability vectors are adaptively averaged, with weights learned from the score feature s_t. Specifically, s_t is fed to a fully-connected (FC) network to output the weights w_t ∈ R^2. For both a) and b), we use an encoder LSTM as in [5] to encode the language segments; we then concatenate the encoder's hidden states with the image encodings (i.e. ψ1(I_t) and ψ2(M_t)) and pass them through a FC network to predict the probability distributions a^e_t and a^m_t. By adaptively fusing the two, we get the final action prediction a_t:

a_t = (1 / Σ_i w^i_t) (w^0_t a^e_t + w^1_t a^m_t).    (6)

The action a_t at time t is thus the weighted average of a^e_t and a^m_t. For different parts of the trajectory, either one of the two predictions (a^e_t or a^m_t) or both may be reliable; this depends heavily on the situation. For instance, when the next landmark is not visible, the prediction should rely more on a^m_t; when the landmark is clearly recognizable, the opposite holds. The learned matching scores decide adaptively, at each time step, which prediction is to be trusted and by how much. This adaptive fusion can be understood as a calibration system for the two complementary sub-systems of action prediction. The calibration needs to be time- or situation-dependent: a simple summation or normalization of the two predictions is rigid and cannot distinguish between a confident true prediction and a confident false prediction, and confident false predictions are very common in deeply learned models.
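A minimal sketch of this adaptive fusion, following Eq. (6), is given below; the use of a Softplus to keep the learned weights positive is an assumption, since the paper only specifies a fully-connected network producing w_t ∈ R^2.

```python
# Sketch of the adaptive fusion of Eq. (6): the score feature s_t yields two
# weights that mix the vision-based and memory-based action distributions.
import torch
import torch.nn as nn

NUM_ACTIONS = 9                                              # eight directions + stop
weight_net = nn.Sequential(nn.Linear(2, 2), nn.Softplus())   # positivity: an assumption

def fuse(a_e, a_m, s_t):
    w = weight_net(s_t)                                # w_t in R^2
    return (w[0] * a_e + w[1] * a_m) / w.sum()         # Eq. (6)

a_e = torch.softmax(torch.randn(NUM_ACTIONS), dim=0)  # from (view, landmark text)
a_m = torch.softmax(torch.randn(NUM_ACTIONS), dim=0)  # from (memory, directions)
s_t = torch.tensor([0.35, 0.72])                       # the two matching scores
action = fuse(a_e, a_m, s_t).argmax().item()           # index of the chosen action
```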
The action space A is defined as follows. We divide A into eight moving actions in eight directions plus the stop action. The directions are centred at {i · 45° : i ∈ {0, ..., 7}}, each with a ±22.5° offset, as illustrated in Figure 6. When an action angle is predicted, the agent turns to the road that has the minimum angular distance to the predicted angle and moves forward to the next node along that road. This turning plus moving forward constitutes one atomic action. The agent comes to a stop when it encounters the final landmark, as stated in Section 4.1.1.
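The mapping from a predicted direction to a road can be made explicit with the small sketch below, which simply picks the outgoing road whose bearing is closest to the predicted angle.

```python
# Sketch: map a predicted direction index (0..7, i.e. i * 45 degrees) to the
# outgoing road whose bearing has the smallest angular distance.
def angular_distance(a: float, b: float) -> float:
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def select_road(action_index: int, road_bearings):
    """road_bearings: bearings (in degrees) of the roads leaving the current node."""
    target = action_index * 45.0
    return min(road_bearings, key=lambda b: angular_distance(b, target))

# Predicted action 2 (90 degrees) at a node with three outgoing roads:
next_road = select_road(2, [10.0, 95.0, 250.0])   # -> 95.0
```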
It is worth noting that the implementation of our stop action differs from that of previous VLN methods [5, 16, 36]. As pointed out in the deep learning book [24, p. 384], a stop action can be realized in various ways. One can add a special symbol corresponding to the end of a sequence, i.e. the STOP token used by previous VLN methods; when STOP is generated, the sampling process stops. Another option is to introduce an extra Bernoulli output that represents, at each time step, the decision to either continue or halt. In our work, we adopt the second approach. The agent uses the Indicator Controller (IC) to learn to determine whether it has reached a landmark: the IC outputs 1 when the agent reaches a landmark and updates the language attention to find the next landmark, until the destination is reached. The destination is the final landmark described in the instruction. The agent stops when the IC predicts 1 and there is no more language description left to continue with. Thus, our model has a stop mechanism, albeit an implicit one.

Fig. 8: Examples of given instructions from the Talk2Nav dataset. In the original figure, colours denote the word token segmentation: red marks landmark descriptions and blue the local directional instructions. The three examples read:

"On right we have a warehouse Move forward till four way intersection there is a wellness center on the right near to the shop turn right and go forward on the left, we have citi bank Turn right and go ahead for two blocks we see a road bridge straight ahead turn right again and move forward reached the traffic light with orange polls on road"

"we have restaurant on our left go ahead until you see first road intersection there is a post office in the road corner and a restaurant on the right turn left and move forward there is red building of studios on our right turn left and go ahead till a tri junction we see a glass building which is a boutique turn right and move forward for two junctions reached the stoned building of bank on the right"

"There is blue color building on our right go ahead for two blocks there is a red stoned building on left near to huge electric pole turn left and go to next road junction we see a green bridge on the top turn right and go ahead for two blocks two restaurants on the road intersection and a shopping mall on the left take left and move forward for two junctions you reach metro station on the right"

4.2 Learning

The model is trained in a supervised way. We follow the student-forcing approach proposed in [5] to train our model. At each step, the action module is trained with a supervisory signal of the action pointing in the direction of the next landmark. This is in contrast with previous methods [5], in which the supervisory action points in the direction of the final destination.

We use the cross-entropy loss to train the action module and the matching modules, as they are formulated as classification tasks. For the ACT model, we use a weighted binary cross-entropy loss at every step; the supervision with the positive label ('1') for the ACT model comes into effect only when the agent reaches a landmark. The total loss is the sum of all module losses:

Loss = Loss_action + Loss_matching^landmark + Loss_matching^direction + Loss_ACT.    (7)

The losses of the two matching modules take effect only at landmarks, which are much sparser than the road nodes at which the action loss and the ACT loss are computed. Because of this, we first train the matching networks individually on the matching tasks, and then integrate them with the other components for the overall training.
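A schematic version of this training objective is sketched below; the tensor shapes, the positive-class weight and the exact form of the matching losses are placeholders, since Eq. (7) only specifies how the four terms are summed.

```python
# Sketch of the total loss of Eq. (7): action and ACT losses at every step,
# matching losses only at landmark nodes. Shapes and weights are placeholders.
import torch
import torch.nn.functional as F

def total_loss(action_logits, action_gt, land_logits, land_gt, dir_logits, dir_gt,
               act_logit, reached_gt, at_landmark: bool):
    loss = F.cross_entropy(action_logits, action_gt)              # Loss_action
    loss = loss + F.binary_cross_entropy_with_logits(             # Loss_ACT (weighted)
        act_logit, reached_gt, pos_weight=torch.tensor(5.0))      # weight: an assumption
    if at_landmark:                                               # sparse supervision
        loss = loss + F.cross_entropy(land_logits, land_gt)       # landmark matching
        loss = loss + F.cross_entropy(dir_logits, dir_gt)         # direction matching
    return loss
```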
4.3 Implementation Details

Language Module. We use the BERT [21] transformer model, pretrained on the BooksCorpus [82] and the English Wikipedia (https://en.wikipedia.org/wiki/English_Wikipedia), for modelling language. This yields contextual word representations, which is different from classical models such as
word2vec [57] and GloVe [62]. We use a word-piece tokenizer to tokenize the sentences, following [21, 77]; out-of-vocabulary words are split into sub-words based on the available vocabulary. For word token classification, we first train a BERT transformer with a token-classification head. Here, we use the alignment between the given language instructions X and their corresponding sets of landmark and directional instruction segments in the training set of the Talk2Nav dataset. We then train the transformer model to classify each word token of a navigational instruction as belonging to a landmark description or to a local directional instruction.

At the inference stage, the model predicts a binary label for each word token. We then convert the sequence of word token classes into segments T(X) by simply grouping adjacent tokens that have the same class. This model reaches a classification accuracy of 91.4% on the test set. A few word token segmentation results are shown in Figure 8.

Visual inputs. The Google Street View images are acquired in the form of an equirectangular projection. We experimented with the SphereNet [18] architecture pretrained on MNIST [51] and with ResNet-101 [32] pretrained on ImageNet [20] to extract ψ1(I). Since the SphereNet pretrained on MNIST is a relatively shallow network, we adapted the architecture of SphereNet to suit Street View images by adding more convolutional blocks. We then define a pretext task on the Street View images of the Talk2Nav dataset to learn the weights of SphereNet: given two street-view images with overlapping views, the task is to predict the difference between their bearing angles and the projection of the line joining the two locations onto the bearing angle of the second location. We frame this as a regression task over the two angles, which encourages SphereNet to learn the semantics of the scene. We compiled the training set for this pre-training task from our own training split of the dataset. For the memory image, we use ResNet-101 pretrained on ImageNet to extract ψ2(M) from the memory image M.

Other modules. We perform image captioning in the matching modules using the transformer model, following [52, 81, 68]. For landmark matching, we pretrain the captioning transformer on the landmark street-view images and their corresponding descriptions from the training split of Talk2Nav. For the other matching module, on local directions, we pretrain the transformer using ground-truth memory images and their corresponding directional instructions; the ground-truth memory images are synthesized in the same way as our write module writes to the memory image (see Section 4.1.3). Both matching modules are fine-tuned during the overall training stage. All the other models, such as the Indicator Controller and the action module, are trained in an end-to-end fashion.

Network details. The SphereNet model consists of five blocks of convolution and max-pooling layers, followed by a fully-connected layer. We use 32, 64, 128, 256 and 128 filters in the five convolutional layers, and each layer is followed by max pooling and a ReLU activation. This is the backbone structure of the SphereNet. For ResNet, we use ResNet-101 as the backbone. On top of the backbone, we have two branches: in one branch, a fully-connected layer with 8 neurons and a softmax activation serves the action module; in the other, a fully-connected layer with 512 neurons and a ReLU activation serves the matching module. Hence, the image feature size obtained from these models is 512, and the attention feature size over the image is also 512. The convolutional kernels are of size 5x5 and are applied with stride 1. Max pooling uses 3x3 kernels with a stride of 2. For language, we use an input encoding of size 512 for each token in the vocabulary. We use the Adam optimizer with a learning rate of 0.01, and set the Adam parameters to 0.999 and 10^-8. We train for 20 epochs with a mini-batch size of 16.

5 Experiments

Our experiments focus on 1) the overall performance of our method compared to state-of-the-art (s-o-t-a) navigation algorithms, and 2) multiple ablation studies to further understand our method. The ablation studies cover a) the importance of using an explicit external memory, b) an evaluation of navigation performance at different levels of difficulty, c) a comparison of visual features, and d) a comparison of the performance with and without the Indicator Controller (IC) and across variants of the matching modules. We train and evaluate our model on the Talk2Nav dataset, using 70% of the dataset for training, 10% for validation and the rest for testing. Following [3], we evaluate our method under three metrics:

 – SPL: the success rate weighted by the normalized inverse path length. It penalizes successes achieved with longer paths.
 – Navigation Error: the distance to the goal after finishing the episode.
 – Total Steps: the total number of steps required to reach the goal successfully.

We compare with the following s-o-t-a navigation methods: Student Forcing [5], RPA [73], Speaker-Follower [23] and Self-Monitoring [53]. We trained their models on the same data that our model uses. The work of [72] yields very good results on the Room-to-Room dataset for indoor navigation; however, we have not found publicly available code for this method and thus can only compare with the other top-performing methods on our dataset.

We also study the performance of our complete model and of the other s-o-t-a methods at varying difficulty levels: from