
Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory

Arun Balajee Vasudevan · Dengxin Dai · Luc Van Gool

arXiv:1910.02029v2 [cs.CV] 3 Feb 2020


Abstract The role of robots in society keeps expanding, bringing with it the necessity of interacting and communicating with humans. In order to keep such interaction intuitive, we provide automatic wayfinding based on verbal navigational instructions. Our first contribution is the creation of a large-scale dataset with verbal navigation instructions. To this end, we have developed an interactive visual navigation environment based on Google Street View; we further design an annotation method to highlight mined anchor landmarks and local directions between them in order to help annotators formulate typical, human references to those. The annotation task was crowdsourced on the AMT platform to construct a new Talk2Nav dataset with 10,714 routes. Our second contribution is a new learning method. Inspired by spatial cognition research on the mental conceptualization of navigational instructions, we introduce a soft dual attention mechanism defined over the segmented language instructions to jointly extract two partial instructions – one for matching the next upcoming visual landmark and the other for matching the local directions to the next landmark. Along similar lines, we also introduce a spatial memory scheme to encode the local directional transitions. Our work takes advantage of the advances in two lines of research: mental formalization of verbal navigational instructions and training neural network agents for automatic wayfinding. Extensive experiments show that our method significantly outperforms previous navigation methods. For the demo video, dataset and code, please refer to our project page.

Keywords Vision-and-language Navigation · Long-range Navigation · Spatial Memory · Dual Attention

Arun Balajee Vasudevan · Dengxin Dai · Luc Van Gool
ETH Zurich, Switzerland
E-mail: {arunv, dai, vangool}@vision.ee.ethz.ch

Luc Van Gool
KU Leuven, Belgium

1 Introduction

Consider that you are traveling as a tourist in a new city and are looking for a destination that you would like to visit. You ask the locals and get a directional description: "go ahead for about 200 meters until you hit a small intersection, then turn left and continue along the street before you see a yellow building on your right". People give indications that are not purely directional, let alone metric. They mix in referrals to landmarks that you will find along your route. This may seem like a trivial ability, as humans do this routinely. Yet, it is a complex cognitive task that relies on the development of an internal, spatial representation that includes visual landmarks (e.g. "the yellow building") and possible, local directions (e.g. "going forward for about 200 meters"). Such a representation can support continuous self-localization as well as conveying a sense of direction towards the goal.

Just as a human can navigate in an environment when provided with navigational instructions, our aim is to teach an agent to perform the same task, to make human-robot interaction more intuitive. The problem has recently been tackled as a Vision-and-Language Navigation (VLN) problem [5]. Although important progress has been made, e.g. in constructing good datasets [5, 16] and proposing effective learning methods [5, 23, 73, 54], this stream of work mainly focuses on synthetic worlds [35, 19, 15] or indoor room-to-room navigation [5, 23, 73]. Synthetic environments limit the complexity of the visual scenes, while room-to-room navigation comes with challenges different from those of outdoors.

[Figure 1: example navigational instruction ("go straight and move forward till tri intersection near to the glendale shop turn right and reach the intersection road, on the right we have park entrance, go straight and move forward four blocks, we have a thai restaurant on our left, proceed forward 4 more blocks, reached the front of an atm"), the agent's panoramic observation, and a pie chart of the nine-way action space.]
(a) Top view of the navigation route from source to destination. (b) Street view of the route and the agent's status at the current location.
Fig. 1: (a) An illustration of an agent finding its way from a source point to a destination. (b) Status of the agent at the
current location including the segmented navigational instructions with red indicating visual landmark descriptions and blue
the local directional instructions, and the agent’s panoramic view and a pie chart depicting the action space. The bold text
represents the currently attended description. The predicted action at the current location is highlighted. The landmark that the
agent is looking for at the moment is indicated by a yellow box.

The first challenge to learning a long-range wayfinding model lies in the creation of large-scale datasets. In order to be fully effective, the annotators providing the navigation instructions ought to know the environment like locals would. Training annotators to reach the same level of understanding for a large number of unknown environments is inefficient – to create one verbal navigation instruction, an annotator needs to search through hundreds of street-view images, remember their spatial arrangement, and summarize them into a sequence of route instructions. This straightforward annotation approach would be very time-consuming and error-prone. Because of this challenge, the state-of-the-art work uses synthetic directional instructions [36] or works mostly on indoor room-to-room navigation. For indoor room-to-room navigation, this challenge is less severe, for two reasons: 1) the paths in indoor navigation are shorter; and 2) indoor environments have a higher density of familiar 'landmarks'. This makes self-localization, route remembering and describing easier.

To our knowledge, there is only one other work, by Chen et al. [16], on natural language based outdoor navigation, which also proposes an outdoor VLN dataset. Although they designed a clever data annotation method based on gaming – finding a hidden object at the goal position – the method is difficult to apply to longer routes (see the discussion in Section 3.2.2). Again, it is very hard for the annotators to remember, summarize and describe long routes in an unknown environment.

To address this challenge, we draw on studies in cognition and psychology on human visual navigation, which state the significance of using visual landmarks in route descriptions [39, 43, 66]. It has been found that route descriptions consist of descriptions for visual landmarks and local directional instructions between consecutive landmarks [58, 56]. Similar techniques – a combination of visual landmarks, as rendered icons, and highlighted routes between consecutive landmarks – are constantly used for making efficient maps [67, 74, 26]. We divide the task of generating the description for a whole route into a number of sub-tasks consisting of generating descriptions for visual landmarks and generating local directional instructions between consecutive landmarks. This way, the annotation tasks become simpler.

We develop an interactive visual navigation environment based on Google Street View, and more importantly design a novel annotation method which highlights selected landmarks and the spatial transitions in between. This enhanced annotation method makes it feasible to crowdsource this complicated annotation task. By hosting the tasks on the Amazon Mechanical Turk (AMT) platform, this work has constructed a new dataset, Talk2Nav, with 10,714 long-range routes within New York City (NYC).

The second challenge lies in training a long-range wayfinding agent. This learning task requires accurate visual attention and language attention, accurate self-localization and a good sense of direction towards the goal. Inspired by the research on the mental conceptualization of navigational instructions in spatial cognition [67, 56, 49], we introduce a soft attention mechanism defined over the segmented language instructions to jointly extract two partial instructions – one for matching the next upcoming visual landmark and the other for matching the spatial transition to the next landmark. Furthermore, the spatial transitions of the agent are encoded by an explicit memory framework which can be read from and written to as the agent navigates. One example of the outdoor VLN task can be found in Figure 1. Our work connects two lines of research that have been less explored together so far: mental formalization of verbal navigational instructions [67, 56, 49] and training neural network agents for automatic wayfinding [5, 36].

Extensive experiments show that our method outperforms previous methods by a large margin. We also show the contributions of the sub-components of our method, accompanied by detailed ablation studies. The collected dataset will be made publicly available.

2 Related Works

Vision & Language. Research at the intersection of language and vision has been conducted extensively in the last few years. The main topics include image captioning [44, 78], visual question answering (VQA) [2, 6], object referring expressions [8, 10], and grounded language learning [35, 37]. Although the goals are different from ours, some of the fundamental techniques are shared. For example, it is common practice to represent visual data with CNNs pre-trained for image recognition and to represent textual data with word embeddings pre-trained on large text corpora. The main difference is that the perceptual input to those systems is static while ours is active, i.e. the system's behavior changes the perceived input.

Vision Based Navigation. Navigation based on vision and reinforcement learning (RL) has become a very interesting research topic recently. The technique has proven quite successful in simulated environments [60, 83] and is being extended to more sophisticated real environments [59]. There has been active research on navigation-related tasks, such as localizing from only an image [75], finding the direction to the closest McDonald's using Google Street View images [46, 14], goal-based visual navigation [30] and others. Gupta et al. [30] use a differentiable mapper which writes into a latent spatial memory corresponding to an egocentric map of the environment, and a differentiable planner which uses this memory and the given goal to produce navigational actions in novel environments. There are a few other recent works on vision based navigation [64, 76]. Thoma et al. [64] formulate compact map construction and accurate self-localization for image-based navigation via a careful selection of suitable visual landmarks. Recently, Wortsman et al. [76] proposed a meta-reinforcement learning approach for visual navigation, where the agent learns to adapt to unseen environments in a self-supervised manner. There is another stream of work, called end-to-end driving, which aims to learn vehicle navigation directly from video inputs [33, 34].

Vision-and-Language Navigation. The task is to navigate an agent in an environment to a particular destination based on language instructions. Several recent works address the Vision-and-Language Navigation (VLN) [5, 73, 23, 72, 61, 54, 45] task. The general goal of these works is similar to ours – to navigate from a starting point to a destination in a visual environment with language directional descriptions. Anderson et al. [5] created the R2R dataset for indoor room-to-room navigation and proposed a learning method based on sequence-to-sequence neural networks. Subsequent methods [73, 72] apply reinforcement learning and cross-modal matching techniques on the same dataset. The same task was tackled by Fried et al. [23] using a speaker-follower technique to generate synthetic instructions for data augmentation and pragmatic inference. While sharing similarities, our work differs significantly from them. The environment domain is different; as discussed in Section 1, long-range navigation raises more challenges both in data annotation and model training. There are concurrent works aiming at extending Vision-and-Language Navigation to city environments [36, 16, 47]. The difference to Hermann et al. [36] lies in that our method works with real navigational instructions, instead of the synthetic ones summarized by Google Maps. This difference leads to different tasks and in turn to different solutions. Kim et al. [47] propose an end-to-end driving model that takes natural language advice to predict control commands for navigating in a city environment. Chen et al. [16] propose an outdoor VLN dataset similar to ours, where real instructions are created from Google Street View1 images. However, our method differs in the way that it decomposes the navigational instructions to make the annotation easier. It draws inspiration from the spatial cognition field to specifically promote annotators' memory and thinking, making the task less energy-consuming and less error-prone. More details can be found in Section 3.

Attention for Language Modeling. Attention mechanisms have been used widely for language [55] and visual modeling [78, 71, 4]. Language attention mechanisms have been shown to produce state-of-the-art results in machine translation [9] and other natural language processing tasks like VQA [42, 79, 40], image captioning [7], grounding referential expressions [40, 41] and others. The attention mechanism is one of the main components of top-performing algorithms such as the Transformer [68] and BERT [21] in NLP tasks. The MAC network by Hudson et al. [42] has a control unit which performs a weighted average of the question words based on a soft attention for the VQA task. Hu et al. [40] also proposed a similar language attention mechanism, but they decomposed the reasoning into sub-tasks/modules and predicted modular weights from the input text. In our model, we adapt the soft attention proposed by Kumar et al. [50] by applying it over segmented language instructions to put the attention over two different sub-instructions: a) for landmarks and b) for local directions. The attention mechanism is named dual attention.

Memory. There are generally two kinds of memory used in the literature: a) implicit memory and b) explicit memory. Implicit memory learns to memorize knowledge in the hidden state vectors via back-propagation of errors. Typical examples include RNNs [44] and LSTMs [22]. Explicit memory, however, features explicit read and write modules with an attention mechanism. Notable examples include Neural Turing Machines [28] and Differentiable Neural Computers (DNCs) [29]. In our work, we use external explicit memory in the form of a memory image which can be read from and written to by its read and write modules. Training a soft attention mechanism over language segments coupled with an explicit memory scheme makes our method more suitable for long-range navigation, where the reward signals are sparse.

1 https://developers.google.com/streetview/

[Figure 2: a route from source node S to destination node D through landmarks 1-4, with road nodes and road segments marked.]

Fig. 2: An illustrative route from a source node to a destination node with all landmarks labelled along the route. In our case, the destination is landmark 4. Sub-routes are defined as the routes between two consecutive landmarks.

3 Talk2Nav Dataset

The target is to navigate using language descriptions in a real outdoor environment. Recently, a few datasets on the language based visual navigation task have been released, both for indoor [5] and outdoor [13, 16] environments. Existing datasets typically annotate one overall language description for the entire path/route. This poses challenges for annotating longer routes, as stated in Section 1. Furthermore, these annotations lack the correspondence between language descriptions and sub-units of a route. This in turn poses challenges in learning long-range vision-and-language navigation (VLN). To address the issue, this work proposes a new annotation method and uses it to create a new dataset, Talk2Nav.

Talk2Nav contains navigation routes at the city level. A navigational city graph is created with nodes as locations in the city and connecting edges as the road segments between the nodes, similar to Mirowski et al. [59]. A route from a source node to a destination node is composed of densely sampled atomic unit nodes. Each node contains a) a street-view panoramic image, b) the GPS coordinates of that location, and c) the bearing angles to connecting edges (roads). Furthermore, we enrich the routes with intermediary visual landmarks, language descriptions for these visual landmarks and the local directional instructions for the sub-routes connecting the landmarks. An illustration of a route is shown in Figure 2.

3.1 Data Collection

To retrieve city route data, we used OpenStreetMap to obtain the metadata of locations and Google's APIs for maps and street-view images.

Path Generation. The path from a source to a destination is sampled from a city graph. To that aim, we used OpenStreetMap (OSM), which provides the latitudes, longitudes and bearing angles of all the locations (waypoints) within a pre-defined region of the map. We use the Overpass API to extract all the locations from OSM, given a region. A city graph is defined by taking the locations of the atomic units as the nodes, and the directional connections between neighbourhood locations as the edges. The K-means clustering algorithm (k = 5) is applied to the GPS coordinates of all nodes. We then randomly pick two nodes from different clusters to ensure that the source node and the destination node are not too close. We then use the open-source OpenStreet Routing Machine to extract a route plan from every source node to the destination node, which are sampled from the set of all compiled nodes. The A* search algorithm is then used to generate a route by finding the shortest traversal path from the source to the destination in the city graph.
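To make the path-generation step concrete, here is a minimal, illustrative sketch (not the authors' released code) using networkx and scikit-learn: it clusters the compiled OSM nodes with K-means (k = 5), samples a source and a destination from different clusters, and extracts the shortest traversal path with A*. The graph construction, edge weights and data layout are assumptions for illustration.

```python
# Illustrative sketch of path generation: cluster OSM road nodes with K-means (k = 5),
# sample a source and a destination from different clusters, and connect them with a
# shortest path. Not the authors' pipeline; data layout and weights are assumptions.
import random
import networkx as nx
from sklearn.cluster import KMeans

def sample_route(nodes, edges, k=5):
    """nodes: {node_id: (lat, lon)}, edges: [(u, v, length_m)] mined from OSM."""
    G = nx.DiGraph()
    for u, v, length in edges:
        G.add_edge(u, v, weight=length)

    ids = list(nodes)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict([nodes[i] for i in ids])
    cluster_of = dict(zip(ids, labels))

    # Pick source and destination from different clusters so they are not too close.
    src, dst = random.sample(ids, 2)
    while cluster_of[src] == cluster_of[dst]:
        src, dst = random.sample(ids, 2)

    # A* search (with a trivial zero heuristic here) finds the shortest traversal path.
    return nx.astar_path(G, src, dst, heuristic=lambda a, b: 0.0, weight="weight")
```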
Street View Images. Once the routes are generated using OSM, we densely interpolate the latitudes and longitudes of the locations along the routes to extract more road locations. We then use the Google Street View APIs to get the closest Street View panorama and omit the locations which do not have a street-view image. We collect the 360° street-view images along with their heading angles by using the Google Street View APIs and the metadata2. The API allows for downloading tiles of a street-view panoramic image, which are then stitched together to get the equirectangular projection image. We use the heading angle to re-render the street-view panorama image such that the image is centre-aligned with the heading direction of the route.

We use OSM and open-source OSM-related APIs for the initial phase of extracting road nodes and route plans. Later, however, we move to Google Street View for mining street-view images and their corresponding metadata. We noted that it was not straightforward to mine the locations/road nodes from Google Street View.

2 http://maps.google.com/cbk?output=xml&ll=40.735357,-73.918551&dm=1
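For an equirectangular panorama, the heading-based re-rendering described above amounts to a horizontal circular shift. The sketch below is a simplified assumption of that step (the exact Street View conventions for the panorama heading may differ) and is not the authors' implementation.

```python
# Minimal sketch: re-centre an equirectangular street-view panorama so that the route's
# heading appears at the image centre. Assumes the stitched panorama spans 0-360 degrees
# horizontally and that `pano_heading` is the compass heading of the panorama centre as
# reported by the metadata; exact conventions are an assumption.
import numpy as np

def center_on_heading(pano: np.ndarray, pano_heading: float, route_heading: float) -> np.ndarray:
    """pano: H x W x 3 equirectangular image; headings in degrees."""
    h, w, _ = pano.shape
    # Horizontal pixel shift proportional to the angular difference.
    delta = (route_heading - pano_heading) % 360.0
    shift = int(round(delta / 360.0 * w))
    return np.roll(pano, -shift, axis=1)
```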
3.2 Directional Instruction Annotation

The main challenge of data annotation for automatic language-based wayfinding lies in the fact that the annotators need to play the role of an instructor, as local people do for tourists. This is especially challenging when the annotators do not know the environment well. The number of street-view images for a new environment is tremendous and searching through them can be costly, not to mention remembering and summarizing them into verbal directional instructions.

Inspired by the large body of work in cognitive science on how people mentally conceptualize route information and convey routes [48, 56, 67], our annotation method is designed to specifically promote the annotators' memory and thinking.

[Figure 3: the annotation interface, showing the left/front/right/rear perspective views, the interactive Street View panel, the route map with landmark markers, and the text box with the prompt "Describe the route description in a short sentence. E.g: Go straight for 50m and then turn right."]
Fig. 3: The annotation interface used on Mechanical Turk. Box 1 represents the text box where the annotators write the descriptions. Box 2 denotes the perspective-projected images of the left, front, right and rear views at the current location. Box 3 shows the Street View environment, which the annotator can drag and rotate to get a 360° view. Box 4 represents the path to be annotated, with markers for the landmarks. The red line in Box 4 denotes the current sub-route. The annotator can navigate forward and backward along the marked route (lines drawn in red and green), perceiving the Street View simultaneously on the left. Text boxes are provided for describing the marked route, i.e. the landmarks and the directions between them.

For the route directions, people usually refer to visual landmarks [39, 56, 65] along with a local directional instruction [67, 69]. Visualizing the route with highlighted salient landmarks and local directional transitions compensates for limited familiarity with or understanding of the environment.

3.2.1 Landmark Mining

The ultimate goal of this work is to make human-to-robot interaction more intuitive, thus the instructions need to be similar to those used in daily human communication. Simple methods of mining visual landmarks based on generic CNN features may yield images which are hard for humans to distinguish and describe, and hence may lead to low-quality or unnatural instructions.

We frame the landmark mining task as a summarization problem using sub-modular optimization to create summaries that take into account multiple objectives. In this work, three criteria are considered: 1) the selected images are encouraged to spread out along the route to support continuous localization and guidance; 2) images close to road intersections, and on the approaching side of the intersections, are preferred for better guidance through the intersections; and 3) images which are easy to describe, remember and identify are preferred for effective communication.

Given the set of all images I along a path P, the problem is formulated as a subset selection problem that maximizes a linear combination of the three submodular objectives:

L = \arg\max_{L' \subseteq \wp(I)} \sum_{i=1}^{3} w_i f_i(L', P), \quad \text{s.t. } |L'| = l    (1)

where \wp(I) is the power set of I, L' ranges over all possible subsets of size l, w_i are non-negative weights, and f_i are the sub-modular objective functions. More specifically, f_1 is the minimum travel distance between any two consecutive selected images along the route P; f_2 = 1/(d + \sigma), with d the distance to the closest approaching intersection and \sigma set to 15 meters to avoid an infinitely large value at intersection nodes; and f_3 is a learned ranking function which signals the ease of describing and remembering the selected images. The weights w_i in Equation 1 are set empirically: w_1 = 1, w_2 = 1 and w_3 = 3. l is set to 3 in this work, as the source and the destination nodes are fixed and we choose three intermediate landmarks, as shown in Figure 2. Routes of this length, with 3 intermediate landmarks, are already long enough for current navigation algorithms. The function f_3 is a learned ranking model, which is presented in the next paragraph.
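Exactly maximizing the cardinality-constrained objective in Equation 1 is combinatorial; a standard way to approximate it for submodular objectives is greedy selection. The sketch below illustrates such a greedy maximizer under the stated weights; the objective callables and the choice of optimizer are assumptions, since the solver is not spelled out here.

```python
# Greedy maximization sketch for the subset-selection objective in Equation 1.
# The terms f1 (spread along the route), f2 = 1/(d + sigma) (closeness to the approaching
# side of intersections) and f3 (learned ranking score) are passed in as callables.
# This generic greedy approximation is illustrative, not necessarily the authors' optimizer.
def select_landmarks(images, route, objectives, weights=(1.0, 1.0, 3.0), l=3):
    def score(subset):
        return sum(w * f(subset, route) for w, f in zip(weights, objectives))

    selected = []
    for _ in range(l):
        candidates = [img for img in images if img not in selected]
        best = max(candidates, key=lambda img: score(selected + [img]))
        selected.append(best)
    return selected
```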
    (a) the annotated region in NYC      (b) the created city graph     (c) a route with landmarks and sub-routes     (d) the structure of a node

                            Fig. 4: An illustration of our Talk2Nav dataset at multiple granularity levels.

Ranking model. In order to train the ranking model that scores images as 'visual landmarks' which are easy to describe and remember, we compile images from three cities: New York City (NYC), San Francisco (SFO) and London, covering different scenes such as high buildings, open fields, downtown areas, etc. We select 20,000 pairs of images from the compiled set. A pairwise comparison is performed over the 20,000 pairs to choose one image over the other as the preferred landmark. We crowd-sourced the annotation with the following criteria: a) Describability – how easy it is to describe for the annotator, b) Memorability – how easy it is to remember for the agent (e.g. a traveler such as a tourist), and c) Recognizability – how easy it is to recognize for the agent. Again, our ultimate goal is to make human-to-robot interaction more intuitive, so these criteria are inspired by how humans select landmarks when formulating navigational descriptions in daily life.

We learn the ranking model with a Siamese network [17] by following [31]. The model takes a pair of images and scores the selected image higher than the other one. We use the Huber rank loss as the cost function. In the inference stage, the model (f_3 in Equation 1) outputs the averaged score over all selected images, signalling their suitability as visual landmarks for the VLN task.
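A possible realization of such a pairwise ranker is sketched below in PyTorch: a shared backbone scores both images of a pair, and training pushes the preferred landmark to score higher. The ResNet-18 backbone, the margin, and the exact form of the Huber rank loss (here, smooth-L1 on the violated margin) are illustrative assumptions, not the authors' specification.

```python
# Sketch of a pairwise Siamese ranking model: the same scorer is applied to both images
# of a pair and trained so that the preferred landmark scores higher. Backbone choice,
# feature dimension, margin and loss form are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models

class LandmarkRanker(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()           # expose the 512-d pooled features
        self.backbone = backbone
        self.score = nn.Linear(512, 1)

    def forward(self, img):
        return self.score(self.backbone(img)).squeeze(-1)

def huber_rank_loss(model, preferred, other, margin=1.0):
    # Penalize pairs where the preferred image does not outscore the other by `margin`.
    s_pref, s_other = model(preferred), model(other)
    violation = torch.relu(margin - (s_pref - s_other))
    return nn.functional.smooth_l1_loss(violation, torch.zeros_like(violation))
```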
                                                                         tion. To minimize the effort of understanding the street-view
                                                                         scenes, we also provide four perspective images for the left-
3.2.2 Annotation and Dataset Statistics
                                                                         view, the front-view, the right-view, and the rare-view.
In the literature, a few datasets have been created for similar              The annotator can navigate through the complete route
tasks [5, 16, 70, 80]. For instance, Anderson et al. [5] anno-           comprising of m landmark nodes and m intermediate sub-
tated the language description for the route by asking the               routes and is asked to provide descriptions for all landmarks
user to navigate the entire path in egocentric perspective. In-          and descriptions for all sub-routes. In this work, we use m =
corporation of overhead map of navigated route as an aid for             4, which means that we collect 4 landmark descriptions and
describing the route can be seen in [16, 70, 80].                        4 local directional descriptions for each route as shown in
    In the initial stages of our annotation process, we fol-             Figure 2. We then append the landmark descriptions and the
lowed the above works. We allowed the annotators to nav-                 directional descriptions for the sub-routes one after the other
igate the entire route and describe a single instruction for             to yield the complete route description. The whole route de-
the complete navigational route. We observed that a) the de-             scription is generated in a quite formulaic way which may
scriptions are mostly about the ending part of the routes in-            lead to not very fluent descriptions. The annotation method
dicating that the annotators forget the earlier stages of the            offers an efficient and reliable solution at a modest price of
routes; b) the annotators took a lot of time to annotate as they         language fluency.

Table 1: A comparison of our Talk2Nav dataset with the R2R dataset [5] and the TouchDown dataset [16]. Avg stands for Average, desc for description, and (w) for (words).

Criteria                                R2R [5]   Touchdown [16]   Talk2Nav
# of navigational routes                 7,189         9,326         10,714
# of panoramic views                    10,800        29,641         21,233
# of panoramic views per path              6.0          35.2           40.0
# of navigational descriptions          21,567         9,326         10,714
# of navigational desc per path              3             1              1
Length of navigational desc (w)             29          89.6           68.8
# of landmark desc                           -             -         34,930
Length of landmark desc (w)                  -             -              8
# of local directional desc                  -             -         27,944
Length of local directional desc (w)         -             -            7.2
Vocabulary size                          3,156         4,999          5,240
Sub-unit correspondence                     No            No            Yes

Fig. 5: The length distribution of the landmark instructions (a), local directional instructions (b), and complete navigational instructions (c).

Statistics. We gathered 43,630 locations, including GPS coordinates (latitudes and longitudes), bearing angles, etc., in New York City (NYC), covering an area of 10 km × 10 km as shown in Figure 4. Out of all those locations, we managed to compile 21,233 street-view images (outdoor) – one for each road node. We constructed a city graph with 43,630 nodes and 80,000 edges. The average number of navigable directions per node is 3.4.

We annotated 10,714 navigational routes. These route descriptions are composed of 34,930 node descriptions and 27,944 local directional descriptions. Each navigational instruction comprises 5 landmark descriptions (we use only 4 in our work) and 4 directional instructions. The 5 landmark descriptions include the descriptions of the starting road node, the destination node and the three intermediate landmarks. Since the agent starts the VLN task from the starting node, we use only 4 landmark descriptions along with the 4 directional instructions. The starting point description can be used for automatic localization, which is left for future work. The average lengths of the navigational instructions, the landmark descriptions and the local directional instructions of the sub-routes are 68.8 words, 8 words and 7.2 words, respectively. In total, our dataset Talk2Nav contains 5,240 unique words. Figure 5 shows the distribution of the length (number of words) of the landmark descriptions, the local directional instructions and the complete navigational instructions.

In Table 1, we show a detailed comparison with the R2R dataset [5] and the TouchDown dataset [16] under different criteria. While the three datasets share similarities in multiple aspects, the R2R and TouchDown datasets only annotate one overall language description for the entire route. It is costly to scale that annotation method to longer routes and to a large number of unknown environments. Furthermore, the annotations in these two datasets lack the correspondence between language descriptions and sub-units of a route. Our dataset offers it without increasing the annotation effort. The detailed correspondence of the sub-units facilitates the learning task and enables an interesting cross-difficulty evaluation, as presented in Section 5.2.

The average time taken for the landmark ranking task is 5 seconds, and for the route description task around 3 minutes. In terms of payment, we paid $0.01 and $0.5 for the landmark ranking task and the route description task, respectively. Hence, the hourly rate of payment for annotation is $10. We have a qualification test; only the annotators who passed it were employed for the real annotation. In total, 150 qualified workers were employed. During the whole course of the annotation process, we regularly sample a subset of the annotations from each worker and check their annotated route descriptions manually. If the annotation is of low quality, we reject the particular jobs and give feedback to the workers for improvement.

We restrict the natural language to English in our work. However, there are still biases in the collected annotations, as generally studied in the works of Bender et al. [11, 12]. The language variety in our route descriptions comprises social dialects from the United States, India, Canada, the United Kingdom, Australia and others, showing diverse annotator demographics. Annotators are instructed to follow guidelines for the descriptions, such as i) do not describe based on street names or road names, and ii) do not rely on non-stationary objects as visual landmarks. We also later apply certain curation techniques to the provided descriptions, such as a) removing all non-unicode characters, and b) correcting misspelled words using the open-source LanguageTool [1].

4 Approach

Our system is a single agent traveling in an environment represented as a directed connected graph. We create a street-view environment on top of the compiled city graph of our Talk2Nav dataset. The city graph consists of 21,233 nodes/locations with the corresponding street-view images (outdoor) and 80,000 edges representing the travelable road segments between the nodes. Nodes are points on the road belonging to the annotated region of NYC. Each node contains a) a 360° street-view panoramic image, b) the GPS coordinates of the corresponding location (defined by latitude and longitude), and c) the bearing angles of this node in the road network. A valid route is a sequence of connected edges from a source node to a destination node. Please see Figure 4 for a visualization of these concepts.

During the training and testing stages, we use a simulator to navigate the agent in the street-view environment of the city graph. Based on the predicted action at each node, the agent moves to the next node in the environment. A successful navigation is defined as the agent correctly following the route referred to by the natural language instruction. For the sake of tractability, we assume no uncertainty (such as dynamic changes in the environment or noise in the actuators) in the environment, hence it is a deterministic goal setting.

Our underlying simulator is the same as the ones used by previous methods [5, 16, 36]. It, however, has a few different features. The simulator of [5] is built for indoor navigation. The simulators of [16, 36] are developed for outdoor (city) navigation but differ in the region used for annotation. The action spaces are also different. The simulators of [5, 16, 36] use an action space consisting of left, right, up, down, forward and stop actions. We employ an action space of nine actions: eight moving actions in the eight equi-spaced angular directions between 0° and 360°, and the stop action. For each moving action, we select the road which has the minimum angular distance to the predicted moving direction. In the following sections, we present the method in detail.
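The action execution rule described above (take the outgoing road with the minimum angular distance to the predicted direction, or stop) can be sketched as follows; the function and variable names are hypothetical.

```python
# Sketch of the action mapping: the agent predicts one of nine actions (eight headings
# spaced 45 degrees apart, plus stop); a moving action is executed by taking the outgoing
# road whose bearing is angularly closest to the predicted direction. Names are illustrative.
STOP = 8  # actions 0..7 correspond to headings 0, 45, ..., 315 degrees

def execute_action(action_id, outgoing_bearings):
    """outgoing_bearings: list of (neighbor_node, bearing_deg) for the current node."""
    if action_id == STOP:
        return None
    target = action_id * 45.0

    def angular_distance(bearing):
        d = abs(bearing - target) % 360.0
        return min(d, 360.0 - d)

    neighbor, _ = min(outgoing_bearings, key=lambda nb: angular_distance(nb[1]))
    return neighbor
```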
in details.                                                          We segment the given instruction to two classes: a) visual
                                                                     landmark descriptions and b) local directional instructions
                                                                     between the landmarks. We employ the BERT transformer
4.1 Route Finding Task                                               model [21] to classify a given language instruction X into a
                                                                     sequence of token classes {ci }ni=1 where ci ∈ {0, 1}, with
The task defines the goal of an embodied agent to navigate           0 denoting the landmark descriptions and and 1 the local
from a source node to a destination node based on a naviga-          directional instructions. By grouping consecutive segments
tional instruction. Specifically, given a natural language in-       of the same token class, the whole instruction is segmented
struction X = {x1 , x2 , ..., xn }, the agent needs to perform       into interleaving segments. An example of the segmenta-
a sequence of actions {a1 , a2 , ..., am } from the action space     tion is shown in Figure 1 and multiple more in Figure 8.
A to hop over nodes in the environment space to reach the            Those segments are used as the basic units of our attention
destination node. Here xi represents individual word in an           scheme rather than the individual words used by previous
instruction while ai denotes an element from the set of nine         methods [5]. This segmentation converts the language de-
actions (Detailed in Section 4.1.5). When the agent takes            scription into a more structured representation, aiming to fa-
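To make the segmentation step concrete, the sketch below shows how a BERT token classifier could be applied to an instruction and its per-token labels grouped into segments. It assumes the HuggingFace transformers API; the checkpoint name and the untrained classification head are placeholders, since the actual classifier is fine-tuned on Talk2Nav alignments (see Section 4.3).

```python
# Sketch: label each word as landmark (0) or direction (1) with a BERT token
# classifier, then group adjacent words with the same label into segments.
# The classification head here is random; a Talk2Nav fine-tuned checkpoint is assumed.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def segment_instruction(instruction: str):
    words = instruction.split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        piece_labels = model(**enc).logits.argmax(-1)[0].tolist()
    # One label per word: take the label of the word's first word-piece.
    word_labels = {}
    for idx, wid in enumerate(enc.word_ids(0)):
        if wid is not None and wid not in word_labels:
            word_labels[wid] = piece_labels[idx]
    # Group consecutive words with the same class into segments.
    segments, cur_label, cur_words = [], None, []
    for i, w in enumerate(words):
        label = word_labels.get(i, 1)
        if cur_words and label != cur_label:
            segments.append((cur_label, " ".join(cur_words)))
            cur_words = []
        cur_label = label
        cur_words.append(w)
    if cur_words:
        segments.append((cur_label, " ".join(cur_words)))
    return segments  # e.g. [(0, "we have restaurant on our left"), (1, "go ahead until ...")]
```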
Fig. 6: The illustrative diagram of our model. We have a soft attention mechanism over segmented language instructions,
which is controlled by the Indicator Controller. The segmented landmark description and local directional instruction are
matched with the visual image and the memory image respectively by two matching modules. The action module fetches
features from the visual observation, the memory image, the features of the language segments, and the two matching scores
to predict an action at each step. The agent then moves to the next node, updates the visual observation and the memory
image, continues the movement, and so on until it reaches the destination.

Denote by T(X) the sequence of landmark and local directional instruction segments:

T(X) = ((L^1, D^1), (L^2, D^2), ..., (L^J, D^J)),    (2)

where L^j denotes the feature representation of the j-th landmark description segment, D^j denotes the feature representation of the j-th directional instruction segment, and J is the total number of segment pairs in the route description.

As it moves in the environment, the agent maintains a reference position in the language description in order to put the focus on the most relevant landmark description and the most relevant local directional instruction. This is modeled by a differentiable soft attention map. Let us denote by η_t the reference position at time step t. The relevant landmark description at time step t is extracted as

L̄_{η_t} = Σ_{j=1}^{J} L^j e^{−|η_t − j|}    (3)

and the relevant directional instruction as

D̄_{η_t} = Σ_{j=1}^{J} D^j e^{−|η_t − j|},    (4)

where η_{t+1} = η_t + φ_t(·), η_0 = 1, and φ_t(·) is an Indicator Controller learned to output 1 when a landmark is reached and 0 otherwise. The controller is shown in Figure 6 and is defined in Section 4.1.4. After a landmark is reached, η_t is incremented by 1 and the attention map then centers around the next pair of landmark and directional instructions. The agent successfully ends the navigation task when φ_t(·) outputs 1 and there is no further pair of landmark and directional instructions left. We initialize η_0 to 1 so as to position the language attention around the first pair of landmark and directional instructions.
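The sketch below illustrates Eqs. (3) and (4) together with the indicator-driven update of η_t; the segment features are assumed to be precomputed tensors, and the indicator output is a placeholder for the learned controller of Section 4.1.4.

```python
# Sketch of Eqs. (3) and (4): softly attend over the segment features around
# eta_t, then advance eta_t by the indicator output (eta_{t+1} = eta_t + phi_t).
import torch

def attend(segments: torch.Tensor, eta: float) -> torch.Tensor:
    """segments: (J, d) tensor of landmark (or directional) segment features."""
    j = torch.arange(1, segments.size(0) + 1, dtype=torch.float32)
    weights = torch.exp(-(eta - j).abs())            # e^{-|eta_t - j|}
    return (weights.unsqueeze(1) * segments).sum(0)  # (d,) softly attended feature

J, d = 4, 512
L = torch.randn(J, d)    # landmark segment features L^1 ... L^J
D = torch.randn(J, d)    # directional segment features D^1 ... D^J

eta = 1.0                # eta_0 = 1: start at the first (landmark, direction) pair
for t in range(10):
    L_bar, D_bar = attend(L, eta), attend(D, eta)
    # ... match L_bar against the current view and D_bar against the memory ...
    phi = 0              # placeholder for the Indicator Controller output (0 or 1)
    eta += phi           # the attention re-centres on the next pair once phi = 1
```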
4.1.2 Visual Observation

The agent perceives the environment with an equipped 360° camera which obtains the visual observation I_t of the environment at time step t. From the image, a feature ψ1(I_t) is extracted and passed to the matching module as shown in Figure 6, which estimates the similarity (matching score) between the visual observation I_t and a softly attended landmark description L̄_{η_t}, which is modulated by our attention module. The visual feature is also passed to the action module to predict the action for the next step, as shown in Section 4.1.5.

Fig. 7: Exemplar memory images. The blue square denotes the source node, and the blue disk denotes the current location of the agent. The red lines indicate the traversed paths. Each memory image is associated with its map scale (metres per pixel), shown in the text box below it.

4.1.3 Spatial Memory

Inspired by [25, 29], an external memory M_t explicitly memorizes the agent's traversed path from the latest visited landmark. When the agent reaches a landmark, the memory M_t is reinitialized so that it stores only the path traversed since that landmark. This reinitialization can be understood as a form of attention that focuses on the recently traversed path, in order to better localize the agent and to better match against the relevant directional instruction D̄_{η_t} selected by the learned language attention module defined in Section 4.1.1.

As the agent navigates in the environment, a write module writes the traversed path into the memory image. For a real system navigating in the physical world, the traversed path can be obtained from an odometry system; in this work, we compute the traversed path directly from the road network of our simulated environment. The write module traces the path travelled from the latest visited landmark to the current position. The path is rasterized and written into the memory image, where it is indicated by red lines; the starting point is marked by a blue square and the current location by a blue disk. The write module always writes from the centre of the memory image to make sure that there is room for all directions. Whenever the coordinates of a newly rasterized pixel fall outside the image, the module incrementally increases the scale of the memory image until the new pixel lies inside the image with a distance of at least 10 pixels to the boundary. An image of 200 × 200 pixels is used and the initial scale of the map is set to 5 meters per pixel. Examples of memory images are shown in Figure 7.

Each memory image is associated with the value of its scale (meters per pixel). Deep features ψ2(M_t) are extracted from the memory image M_t and concatenated with the scale value to form the representation of the memory. The concatenated features are passed to the matching module, which verifies the semantic similarity between the traversed path and the provided local directional instruction. The concatenated features are also provided to the action module, along with the local directional instruction features, to predict the action for the next step.
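A rough sketch of the write-module behaviour described above is given below: it rasterizes the path with PIL, marks the start with a blue square and the current position with a blue disk, and enlarges the map scale until the path fits with a 10-pixel margin. The growth factor and the drawing sizes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the memory-image write module: rasterize the path traversed since
# the latest landmark into a 200x200 image, rescaling so everything fits.
from PIL import Image, ImageDraw

SIZE, MARGIN, INIT_SCALE = 200, 10, 5.0     # pixels, pixels, metres per pixel

def write_memory_image(path_xy):
    """path_xy: (x, y) offsets in metres from the latest visited landmark."""
    def to_px(p, s):                          # the latest landmark sits at the centre
        return (SIZE / 2 + p[0] / s, SIZE / 2 + p[1] / s)

    scale = INIT_SCALE
    # Grow the scale until every rasterized point keeps a 10-pixel margin.
    while any(not (MARGIN <= c <= SIZE - MARGIN)
              for p in path_xy for c in to_px(p, scale)):
        scale *= 1.1                          # growth factor: an assumption

    img = Image.new("RGB", (SIZE, SIZE), "white")
    draw = ImageDraw.Draw(img)
    px = [to_px(p, scale) for p in path_xy]
    if len(px) > 1:
        draw.line(px, fill="red", width=2)    # traversed path
    x0, y0 = px[0]
    draw.rectangle([x0 - 3, y0 - 3, x0 + 3, y0 + 3], fill="blue")   # start marker
    x1, y1 = px[-1]
    draw.ellipse([x1 - 3, y1 - 3, x1 + 3, y1 + 3], fill="blue")     # current node
    return img, scale                         # the scale value is appended to the features

memory, scale = write_memory_image([(0, 0), (40, 0), (40, 120), (-60, 120)])
```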
4.1.4 Matching Module

The matching module is used to determine whether the aimed landmark has been reached. As shown in Figure 6, the matching score is determined by two complementary matching modules: 1) between the visual scene ψ1(I_t) and the extracted landmark description L̄_{η_t}, and 2) between the spatial memory ψ2(M_t) and the extracted directional instruction D̄_{η_t}. For both cases, we use a generative image captioning model based on the transformer model proposed in [68] and compute the probability of reconstructing the language description given the image. The score is the generative probability averaged over all words of the instruction. Let s^1_t be the score for the pair (ψ1(I_t), L̄_{η_t}) and s^2_t the score for the pair (ψ2(M_t), D̄_{η_t}). We then form the score feature s_t by concatenating the two: s_t = (s^1_t, s^2_t). The score feature s_t is fed into a controller φ(·) to decide whether the aimed landmark has been reached:

φ_t(s_t, h_{t−1}) ∈ {0, 1},    (5)

where 1 indicates that the aimed landmark is reached and 0 otherwise. φ_t(·) is an Adaptive Computation Time (ACT) LSTM [27], which allows the controller to learn to make decisions at variable time steps; h_{t−1} is the hidden state of the controller. In this work, φ_t(·) learns to identify the landmarks despite the variable number of intermediate navigation steps between them.
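In simplified form, the scoring and the controller interface could look like the sketch below. The captioning models are assumed to expose per-word log-probabilities, and the ACT mechanism of [27] is replaced by a plain LSTM cell with a binary head, so this is only a stand-in for the actual controller.

```python
# Simplified sketch of the matching scores s_t = (s^1_t, s^2_t) and of the
# indicator controller phi_t; the real controller is an ACT LSTM [27].
import torch
import torch.nn as nn

def matching_score(word_logprobs: torch.Tensor) -> torch.Tensor:
    """Average generative probability of the instruction words given the image."""
    return word_logprobs.exp().mean()

class IndicatorController(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.cell = nn.LSTMCell(2, hidden)    # input: the two matching scores
        self.head = nn.Linear(hidden, 1)

    def forward(self, s_t, state):
        h, c = self.cell(s_t, state)
        reached = (torch.sigmoid(self.head(h)) > 0.5).float()  # 1: landmark reached
        return reached, (h, c)

controller = IndicatorController()
state = (torch.zeros(1, 64), torch.zeros(1, 64))
s1 = matching_score(torch.log(torch.tensor([0.4, 0.6, 0.5])))  # view vs. landmark text
s2 = matching_score(torch.log(torch.tensor([0.7, 0.2, 0.6])))  # memory vs. directions
reached, state = controller(torch.stack([s1, s2]).view(1, 2), state)
```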
4.1.5 Action Module

The action module takes the following inputs to decide the moving action: a) the pair (ψ1(I_t), L̄_{η_t}), b) the pair (ψ2(M_t), D̄_{η_t}), and c) the matching score feature s_t. The illustrative diagram is shown in Figure 6.

The inputs in group a) are used to predict a^e_t, a probability vector over the action space; the inputs in group b) are used to predict a^m_t, a second probability vector over the action space. The two probability vectors are adaptively averaged, with weights learned from the score feature s_t. Specifically, s_t is fed to a fully-connected (FC) network to output the weights w_t ∈ R^2. For both a) and b), we use an encoder LSTM as in [5] to encode the language segments; we then concatenate the encoder's hidden states with the image encodings (i.e. ψ1(I_t) and ψ2(M_t)) and pass them through a FC network to predict the probability distributions a^e_t and a^m_t. By adaptively fusing the two, we get the final action prediction a_t:

a_t = (1 / Σ_i w^i_t) (w^0_t a^e_t + w^1_t a^m_t).    (6)

The action a_t at time t is thus the weighted average of a^e_t and a^m_t. For different parts of the trajectory, either one of the two predictions (a^e_t or a^m_t) or both may be reliable; this depends heavily on the situation. For instance, when the next landmark is not visible, the prediction should rely more on a^m_t; when the landmark is clearly recognizable, the opposite holds. The learned matching scores decide adaptively, at each time step, which prediction is to be trusted and by how much. This adaptive fusion can be understood as a calibration system for the two complementary sub-systems of action prediction. The calibration needs to be time- or situation-dependent: a simple summation or normalization of the two predictions is rigid and cannot distinguish between a confident true prediction and a confident false prediction, and confident false predictions are very common in deeply learned models.
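A minimal sketch of this adaptive fusion, following Eq. (6), is given below; the use of a Softplus to keep the learned weights positive is an assumption, since the paper only specifies a fully-connected network producing w_t ∈ R^2.

```python
# Sketch of the adaptive fusion of Eq. (6): the score feature s_t yields two
# weights that mix the vision-based and memory-based action distributions.
import torch
import torch.nn as nn

NUM_ACTIONS = 9                                              # eight directions + stop
weight_net = nn.Sequential(nn.Linear(2, 2), nn.Softplus())   # positivity: an assumption

def fuse(a_e, a_m, s_t):
    w = weight_net(s_t)                                # w_t in R^2
    return (w[0] * a_e + w[1] * a_m) / w.sum()         # Eq. (6)

a_e = torch.softmax(torch.randn(NUM_ACTIONS), dim=0)  # from (view, landmark text)
a_m = torch.softmax(torch.randn(NUM_ACTIONS), dim=0)  # from (memory, directions)
s_t = torch.tensor([0.35, 0.72])                       # the two matching scores
action = fuse(a_e, a_m, s_t).argmax().item()           # index of the chosen action
```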
The action space A is defined as follows. We divide A into eight moving actions in eight directions plus the stop action. The directions are centred at {i · 45° : i ∈ {0, ..., 7}}, each with a ±22.5° offset, as illustrated in Figure 6. When an action angle is predicted, the agent turns to the road that has the minimum angular distance to the predicted angle and moves forward to the next node along that road. This turning plus moving forward constitutes one atomic action. The agent comes to a stop when it encounters the final landmark, as stated in Section 4.1.1.
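The mapping from a predicted direction to a road can be made explicit with the small sketch below, which simply picks the outgoing road whose bearing is closest to the predicted angle.

```python
# Sketch: map a predicted direction index (0..7, i.e. i * 45 degrees) to the
# outgoing road whose bearing has the smallest angular distance.
def angular_distance(a: float, b: float) -> float:
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def select_road(action_index: int, road_bearings):
    """road_bearings: bearings (in degrees) of the roads leaving the current node."""
    target = action_index * 45.0
    return min(road_bearings, key=lambda b: angular_distance(b, target))

# Predicted action 2 (90 degrees) at a node with three outgoing roads:
next_road = select_road(2, [10.0, 95.0, 250.0])   # -> 95.0
```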
It is worth noting that the implementation of our stop action differs from that of previous VLN methods [5, 16, 36]. As pointed out in the deep learning book [24, p. 384], a stop action can be realized in various ways. One can add a special symbol corresponding to the end of a sequence, i.e. the STOP token used by previous VLN methods; when STOP is generated, the sampling process stops. Another option is to introduce an extra Bernoulli output that represents, at each time step, the decision to either continue or halt. In our work, we adopt the second approach. The agent uses the Indicator Controller (IC) to learn to determine whether it has reached a landmark: the IC outputs 1 when the agent reaches a landmark and updates the language attention to find the next landmark, until the destination is reached. The destination is the final landmark described in the instruction. The agent stops when the IC predicts 1 and there is no more language description left to continue with. Thus, our model has a stop mechanism, albeit an implicit one.

Fig. 8: Examples of given instructions from the Talk2Nav dataset. In the original figure, colours denote the word token segmentation: red marks landmark descriptions and blue the local directional instructions. The three examples read:

"On right we have a warehouse Move forward till four way intersection there is a wellness center on the right near to the shop turn right and go forward on the left, we have citi bank Turn right and go ahead for two blocks we see a road bridge straight ahead turn right again and move forward reached the traffic light with orange polls on road"

"we have restaurant on our left go ahead until you see first road intersection there is a post office in the road corner and a restaurant on the right turn left and move forward there is red building of studios on our right turn left and go ahead till a tri junction we see a glass building which is a boutique turn right and move forward for two junctions reached the stoned building of bank on the right"

"There is blue color building on our right go ahead for two blocks there is a red stoned building on left near to huge electric pole turn left and go to next road junction we see a green bridge on the top turn right and go ahead for two blocks two restaurants on the road intersection and a shopping mall on the left take left and move forward for two junctions you reach metro station on the right"

4.2 Learning

The model is trained in a supervised way. We follow the student-forcing approach proposed in [5] to train our model. At each step, the action module is trained with a supervisory signal of the action pointing in the direction of the next landmark. This is in contrast with previous methods [5], in which the supervisory action points in the direction of the final destination.

We use the cross-entropy loss to train the action module and the matching modules, as they are formulated as classification tasks. For the ACT model, we use a weighted binary cross-entropy loss at every step; the supervision with the positive label ('1') for the ACT model comes into effect only when the agent reaches a landmark. The total loss is the sum of all module losses:

Loss = Loss_action + Loss_matching^landmark + Loss_matching^direction + Loss_ACT.    (7)

The losses of the two matching modules take effect only at landmarks, which are much sparser than the road nodes at which the action loss and the ACT loss are computed. Because of this, we first train the matching networks individually on the matching tasks, and then integrate them with the other components for the overall training.
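A schematic version of this training objective is sketched below; the tensor shapes, the positive-class weight and the exact form of the matching losses are placeholders, since Eq. (7) only specifies how the four terms are summed.

```python
# Sketch of the total loss of Eq. (7): action and ACT losses at every step,
# matching losses only at landmark nodes. Shapes and weights are placeholders.
import torch
import torch.nn.functional as F

def total_loss(action_logits, action_gt, land_logits, land_gt, dir_logits, dir_gt,
               act_logit, reached_gt, at_landmark: bool):
    loss = F.cross_entropy(action_logits, action_gt)              # Loss_action
    loss = loss + F.binary_cross_entropy_with_logits(             # Loss_ACT (weighted)
        act_logit, reached_gt, pos_weight=torch.tensor(5.0))      # weight: an assumption
    if at_landmark:                                               # sparse supervision
        loss = loss + F.cross_entropy(land_logits, land_gt)       # landmark matching
        loss = loss + F.cross_entropy(dir_logits, dir_gt)         # direction matching
    return loss
```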
4.3 Implementation Details

Language Module. We use the BERT [21] transformer model, pretrained on the BooksCorpus [82] and the English Wikipedia (https://en.wikipedia.org/wiki/English_Wikipedia), for modelling language. This yields contextual word representations, which is different from classical models such as
word2vec [57] and GloVe [62]. We use a word-piece tokenizer to tokenize the sentences, following [21, 77]; out-of-vocabulary words are split into sub-words based on the available vocabulary. For word token classification, we first train a BERT transformer with a token-classification head. Here, we use the alignment between the given language instructions X and their corresponding sets of landmark and directional instruction segments in the training set of the Talk2Nav dataset. We then train the transformer model to classify each word token of a navigational instruction as belonging to a landmark description or to a local directional instruction.

At the inference stage, the model predicts a binary label for each word token. We then convert the sequence of word token classes into segments T(X) by simply grouping adjacent tokens that have the same class. This model reaches a classification accuracy of 91.4% on the test set. A few word token segmentation results are shown in Figure 8.

Visual inputs. The Google Street View images are acquired in the form of an equirectangular projection. We experimented with the SphereNet [18] architecture pretrained on MNIST [51] and with ResNet-101 [32] pretrained on ImageNet [20] to extract ψ1(I). Since the SphereNet pretrained on MNIST is a relatively shallow network, we adapted the architecture of SphereNet to suit Street View images by adding more convolutional blocks. We then define a pretext task on the Street View images of the Talk2Nav dataset to learn the weights of SphereNet: given two street-view images with overlapping views, the task is to predict the difference between their bearing angles and the projection of the line joining the two locations onto the bearing angle of the second location. We frame this as a regression task over the two angles, which encourages SphereNet to learn the semantics of the scene. We compiled the training set for this pre-training task from our own training split of the dataset. For the memory image, we use ResNet-101 pretrained on ImageNet to extract ψ2(M) from the memory image M.

Other modules. We perform image captioning in the matching modules using the transformer model, following [52, 81, 68]. For landmark matching, we pretrain the captioning transformer on the landmark street-view images and their corresponding descriptions from the training split of Talk2Nav. For the other matching module, on local directions, we pretrain the transformer using ground-truth memory images and their corresponding directional instructions; the ground-truth memory images are synthesized in the same way as our write module writes to the memory image (see Section 4.1.3). Both matching modules are fine-tuned during the overall training stage. All the other models, such as the Indicator Controller and the action module, are trained in an end-to-end fashion.

Network details. The SphereNet model consists of five blocks of convolution and max-pooling layers, followed by a fully-connected layer. We use 32, 64, 128, 256 and 128 filters in the five convolutional layers, and each layer is followed by max pooling and a ReLU activation. This is the backbone structure of the SphereNet. For ResNet, we use ResNet-101 as the backbone. On top of the backbone, we have two branches: in one branch, a fully-connected layer with 8 neurons and a softmax activation serves the action module; in the other, a fully-connected layer with 512 neurons and a ReLU activation serves the matching module. Hence, the image feature size obtained from these models is 512, and the attention feature size over the image is also 512. The convolutional kernels are of size 5x5 and are applied with stride 1. Max pooling uses 3x3 kernels with a stride of 2. For language, we use an input encoding of size 512 for each token in the vocabulary. We use the Adam optimizer with a learning rate of 0.01, and set the Adam parameters to 0.999 and 10^-8. We train for 20 epochs with a mini-batch size of 16.

5 Experiments

Our experiments focus on 1) the overall performance of our method compared to state-of-the-art (s-o-t-a) navigation algorithms, and 2) multiple ablation studies to further understand our method. The ablation studies cover a) the importance of using an explicit external memory, b) an evaluation of navigation performance at different levels of difficulty, c) a comparison of visual features, and d) a comparison of the performance with and without the Indicator Controller (IC) and across variants of the matching modules. We train and evaluate our model on the Talk2Nav dataset, using 70% of the dataset for training, 10% for validation and the rest for testing. Following [3], we evaluate our method under three metrics:

 – SPL: the success rate weighted by the normalized inverse path length. It penalizes successes achieved with longer paths.
 – Navigation Error: the distance to the goal after finishing the episode.
 – Total Steps: the total number of steps required to reach the goal successfully.

We compare with the following s-o-t-a navigation methods: Student Forcing [5], RPA [73], Speaker-Follower [23] and Self-Monitoring [53]. We trained their models on the same data that our model uses. The work of [72] yields very good results on the Room-to-Room dataset for indoor navigation; however, we have not found publicly available code for this method and thus can only compare with the other top-performing methods on our dataset.

We also study the performance of our complete model and of the other s-o-t-a methods at varying difficulty levels: from