Low-Resource Task-Oriented Semantic Parsing via Intrinsic Modeling

 
Shrey Desai    Akshat Shrivastava    Alexander Zotov    Ahmed Aly
Facebook
{shreyd, akshats, azotov, ahhegazy}@fb.com

Abstract

Task-oriented semantic parsing models typically have high resource requirements: to support new ontologies (i.e., intents and slots), practitioners crowdsource thousands of samples for supervised fine-tuning. Partly, this is due to the structure of de facto copy-generate parsers; these models treat ontology labels as discrete entities, relying on parallel data to extrinsically derive their meaning. In our work, we instead exploit what we intrinsically know about ontology labels; for example, the fact that SL:TIME_ZONE has the categorical type "slot" and language-based span "time zone". Using this motivation, we build our approach with offline and online stages. During preprocessing, for each ontology label, we extract its intrinsic properties into a component, and insert each component into an inventory as a cache of sorts. During training, we fine-tune a seq2seq, pre-trained transformer to map utterances and inventories to frames, parse trees comprised of utterance and ontology tokens. Our formulation encourages the model to consider ontology labels as a union of their intrinsic properties, therefore substantially bootstrapping learning in low-resource settings. Experiments show our model is highly sample efficient: using a low-resource benchmark derived from TOPv2 (Chen et al., 2020), our inventory parser outperforms a copy-generate parser by +15 EM absolute (44% relative) when fine-tuning on 10 samples from an unseen domain.

1   Introduction

Task-oriented conversational assistants face an increasing demand to support a wide range of domains (e.g., reminders, messaging, weather) as a result of their emerging popularity (Chen et al., 2020; Ghoshal et al., 2020). For practitioners, enabling these capabilities first requires training semantic parsers which map utterances to frames executable by assistants (Gupta et al., 2018; Einolghozati et al., 2018; Pasupat et al., 2019; Aghajanyan et al., 2020; Li et al., 2020; Chen et al., 2020; Ghoshal et al., 2020). However, current methodology typically requires crowdsourcing thousands of samples for each domain, which can be both time-consuming and cost-ineffective at scale (Wang et al., 2015; Jia and Liang, 2016; Herzig and Berant, 2019; Chen et al., 2020). One step towards reducing these data requirements is improving the sample efficiency of current, de facto copy-generate parsers. However, even when leveraging pre-trained transformers, these models are often ill-equipped to handle low-resource settings, as they fundamentally lack inductive bias and cross-domain reusability.

In this work, we explore a task-oriented semantic parsing model which leverages the intrinsic properties of an ontology to improve generalization. To illustrate, consider the ontology label SL:TIME_ZONE: a slot representing the time zone in a user's query. Copy-generate models typically treat this label as a discrete entity, relying on parallel data to extrinsically learn its semantics. In contrast, our model exploits what we intrinsically know about this label, such as its categorical type (e.g., "slot") and language-based span (e.g., "time zone"). Guided by this principle, we extract the properties of each label in a domain's ontology, building a component with these properties and inserting each component into an inventory. By processing this domain-specific inventory through strong language models, we effectively synthesize an inductive bias useful in low-resource settings.
Figure 1: Illustration of our low-resource, task-oriented semantic parser. Let x represent an utterance and d the utterance's domain, which consists of an ontology (list of intents and slots) ℓ^d_1, · · · , ℓ^d_m. We create an inventory I_d, where each component contains intrinsic properties (index, type, span) derived from its respective label. Then, we fine-tune a seq2seq transformer to input a linearized inventory I_d and utterance x and output a frame y. Here, frames are composed of utterance and ontology tokens; ontology tokens, specifically, are referenced naturally via encoder self-attention rather than an augmented decoder vocabulary, as with copy-generate mechanisms.

Concretely, we build our model on top of a seq2seq, pre-trained transformer, namely BART (Lewis et al., 2020), and fine-tune it to map utterances to frames, as depicted in Figure 1. Our model operates in two stages: (1) the encoder inputs a domain-specific utterance and inventory and (2) the decoder outputs a frame composed of utterance and ontology tokens. Here, instead of performing vocabulary augmentation, the standard way of "generating" ontology tokens in copy-generate

parsers, we treat ontology tokens as pointers to inventory components. This is particularly useful in low-resource settings; our model is encouraged to represent ontology tokens as a union of their intrinsic properties, and as a result, does not require many labeled examples to achieve strong performance.

We view sample efficiency as one of the advantages of our approach. As such, we develop a comprehensive low-resource benchmark derived from TOPv2 (Chen et al., 2020), a task-oriented semantic parsing dataset spanning 8 domains. Using a leave-one-out setup on each domain, models are fine-tuned on a high-resource, source dataset (other domains; 100K+ samples), then fine-tuned and evaluated on a low-resource, target dataset (this domain; 10-250 samples). We also randomly sample target subsets to make the transfer task more difficult. In aggregate, our benchmark provides 32 experiments, each varying in domain and number of target samples.

Both coarse-grained and fine-grained experiments show our approach outperforms baselines by a wide margin. Overall, when averaging across all domains, our inventory model outperforms a copy-generate model by +15 EM absolute (44% relative) in the most challenging setting, where only 10 target samples are provided. Notably, our base inventory model (139M parameters) also outperforms a large copy-generate model (406M parameters) in most settings, suggesting usability in resource-constrained environments. We also show systematic improvements on a per-domain basis, even for challenging domains with high compositionality and ontology size, such as reminders and navigation. Finally, through error analysis, we show our model's predicted frames are largely precise and linguistically consistent; even when inaccurate, our frames do not require substantial modifications to achieve gold quality.

2   Background and Motivation

Task-oriented semantic parsers typically cast parsing as transduction, utilizing seq2seq transformers to map utterances to frames comprised of intents and slots (Aghajanyan et al., 2020; Li et al., 2020; Chen et al., 2020; Ghoshal et al., 2020). Because frames are a composition of utterance and ontology tokens, these models are often equipped with copy-generate mechanisms; at each timestep, the decoder either copies from the utterance or generates from the ontology (See et al., 2017; Aghajanyan et al., 2020; Chen et al., 2020; Ghoshal et al., 2020). These parsers are empirically effective in high-resource settings, achieving state-of-the-art performance on numerous benchmarks (Aghajanyan et al., 2020), but typically lack inductive bias in low-resource settings.

To illustrate, consider a hypothetical domain adaptation scenario where a copy-generate parser adapts to the weather domain. In standard methodology, a practitioner augments the decoder's vocabulary with weather ontology labels, then fine-tunes the parser on weather samples. This subsequently trains the copy-generate mechanism to generate these labels as deemed appropriate.
But the efficacy of this process scales with the amount of training data, as these ontology labels (though, more specifically, their embeddings) iteratively derive extrinsic meaning. Put another way, before training, there exists no correspondence between an ontology label (e.g., SL:LOCATION) and an utterance span (e.g., "Menlo Park"). Such an alignment is only established once the model has seen enough parallel data where the two co-occur.

In contrast, we focus on reducing data requirements by exploiting the intrinsic properties of ontology labels. These labels typically have several core elements, such as their label types and spans. For example, by teasing apart SL:LOCATION, we can see it is composed of the type "slot" and span "location". These properties, when pieced together and encoded by strong language models, provide an accurate representation of what the ontology label is. While these properties are learned empirically with sufficient training data, as we see with the copy-generate parser, our goal is to train high-quality parsers with as little data as possible by explicitly supplying this information. Therefore, a central question of our work is whether we can build a parser which leverages the intrinsic nature of an ontology space while retaining the flexibility of seq2seq modeling; we detail our approach in the next section.

3   Semantic Parsing via Label Inventories

Illustrated in Figure 1, we develop a seq2seq parser which uses inventories—that is, tables enumerating the intrinsic properties of ontology labels—to map utterances to frames. Inventories are domain-specific, and each component carries the intrinsic properties of a single label in the domain's ontology. On the source-side, a pre-trained encoder consumes both an utterance and an inventory (corresponding to the utterance's domain). Then, on the target-side, a pre-trained decoder mimics a copy-generate mechanism by either selecting from the utterance or ontology. Instead of selecting ontology labels from an augmented vocabulary, as in copy-generate methods, our decoder naturally references these labels in the source-side inventory through self-attention. The sequence ultimately decoded during generation represents the frame.

As alluded to earlier, we focus on two intrinsic properties of ontology labels: types and spans. The type is particularly useful for enforcing syntactic structure; for example, the rule "slots cannot be nested in other slots" (Gupta et al., 2018) is challenging to meet unless a model can delineate between intents and slots. Furthermore, the span is effectively a natural language description, which provides a general overview of what the label aligns to. Though inventory components can incorporate other intrinsic properties, types and spans do not require manual curation and can be automatically sourced from existing annotations.

Despite the reformulation of the semantic parsing task with inventories, our approach inherits the flexibility and simplicity of copy-generate models (Aghajanyan et al., 2020; Li et al., 2020; Chen et al., 2020; Ghoshal et al., 2020); we also treat parsing as transduction, leverage pre-trained modules, and fine-tune with log loss. However, a key difference is that our parser is entirely text-to-text and does not require extra parameters, which we show promotes reusability in low-resource settings. In the following sub-sections, we elaborate on our approach in more detail and comment on several design decisions.

3.1   Inventories

Task-oriented semantic parsing datasets typically have samples of the form (d, x, y) (Gupta et al., 2018; Li et al., 2020; Chen et al., 2020); that is, a domain d, utterance x, and frame y. There exist many domains d_1, · · · , d_n, where each domain defines an ontology ℓ^d_1, · · · , ℓ^d_m, or list of intents and slots. For a given domain d, we define its inventory as a table I_d = [c^d_1, · · · , c^d_m], where each component c^d_i = (i, t, s) is a tuple storing the intrinsic properties of a corresponding label ℓ^d_i.

Specifically, these components consist of: (1) an index i ∈ ℤ_{≥1} representing the label's position in the (sorted) ontology; (2) a type t ∈ {intent, slot} denoting whether the label is an intent or slot; and (3) a span s ∈ V* representing an ontology description, formally represented as a string from a vocabulary V. The index is a unique referent to each component and is largely used as an optimization trick during generation; we elaborate on this in the next section. In Table 1, we show an example inventory for the alarm domain.

Ontology Label       Index   Type     Span
IN:CREATE_ALARM      1       intent   create alarm
IN:UPDATE_ALARM      2       intent   update alarm
SL:ALARM_NAME        3       slot     alarm name
SL:DATE_TIME         4       slot     date time
SL:TIME_ZONE         5       slot     time zone

Table 1: Example inventory I_alarm (right half) with components c_{1:5}, each corresponding to ontology labels (left half) from the alarm domain.
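To make this construction concrete, the sketch below derives Table 1's inventory from raw ontology labels. This is our own illustration rather than code from the paper: the helper name build_inventory, and the assumption that the type can be read off the IN:/SL: prefix and the span off the underscored label name, are ours.

    from typing import List, Tuple

    # A component is an (index, type, span) tuple, mirroring Section 3.1.
    Component = Tuple[int, str, str]

    def build_inventory(ontology: List[str]) -> List[Component]:
        """Derive an inventory from raw ontology labels (cf. Table 1).

        The type comes from the label prefix (IN: -> intent, SL: -> slot);
        the span is the lowercased, space-separated label name.
        """
        components = []
        for i, label in enumerate(sorted(ontology), start=1):  # 1-indexed, sorted
            prefix, name = label.split(":", 1)
            label_type = "intent" if prefix == "IN" else "slot"
            span = name.lower().replace("_", " ")
            components.append((i, label_type, span))
        return components

    alarm = ["IN:CREATE_ALARM", "IN:UPDATE_ALARM",
             "SL:ALARM_NAME", "SL:DATE_TIME", "SL:TIME_ZONE"]
    print(build_inventory(alarm))
    # [(1, 'intent', 'create alarm'), (2, 'intent', 'update alarm'),
    #  (3, 'slot', 'alarm name'), (4, 'slot', 'date time'), (5, 'slot', 'time zone')]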
3.2   Seq2Seq Model

Our model is built on top of a pre-trained, seq2seq transformer architecture (Vaswani et al., 2017) with vocabulary V.

Encoder. The input to the model is a concatenation of an utterance x ∈ V* and its domain d's inventory I_d ∈ V*. Following recent work in tabular understanding (Yin et al., 2020), we encode our tabular inventory I_d as a linearized string. As shown in Figure 1, for each component, the index is preceded by [ and the remaining elements are demarcated by |.

Decoder. The decoder generates a token at each timestep, conditioning on the utterance, inventory, and previous timesteps; we fine-tune with log loss:

    L(θ) = − Σ_{(x, y)} Σ_t log P(y_t | x, I_d, y_{<t})

where the sums range over training pairs (x, y) and output timesteps t.
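A minimal sketch of the encoder input follows. The paper states only that each index is preceded by [ and that the remaining elements are demarcated by |; the exact spacing, the absence of a closing bracket, and the order in which the utterance and linearized inventory are concatenated are assumptions on our part.

    def linearize_inventory(components):
        # One bracketed group per component: "[<index> | <type> | <span>".
        return " ".join(f"[{i} | {t} | {s}" for i, t, s in components)

    def encoder_input(utterance, components):
        # Assumption: the utterance and linearized inventory are concatenated.
        return f"{utterance} {linearize_inventory(components)}"

    inventory = [(1, "intent", "create alarm"), (4, "slot", "date time")]
    print(encoder_input("wake me up at 6pm", inventory))
    # wake me up at 6pm [1 | intent | create alarm [4 | slot | date time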
4   TOPv2-DA Benchmark

To evaluate sample efficiency, we construct TOPv2-DA, a low-resource benchmark derived from TOPv2 (Chen et al., 2020). Following a leave-one-out setup, models are first fine-tuned on a high-resource source dataset, then fine-tuned and evaluated on a low-resource target dataset, where the number of samples available for target fine-tuning depends on the subset used. In aggregate, our benchmark provides 32 experiments (8 scenarios × 4 subsets), offering a rigorous evaluation of sample efficiency.

Domains       Source     Target (SPIS)
                         1     2     5     10
Alarm         104,167    13    25    56    107
Event         115,427    22    33    81    139
Messaging     114,579    23    44    89    158
Music         113,034    19    41    92    187
Navigation    103,599    33    63    141   273
Reminder      106,757    34    59    130   226
Timer         113,073    13    27    62    125
Weather       101,543    10    22    47    84

Table 2: TOPv2-DA benchmark training splits. Models are initially fine-tuned on a source dataset, then fine-tuned on an SPIS subset from a target dataset.

Creating Experiments. We use a leave-one-out algorithm to create source and target datasets. Given domains {d_1, · · · , d_n}, we create n scenarios where the ith scenario uses domains {d_j : d_j ≠ d_i} as the source dataset and domain {d_i} as the target dataset. For each target dataset, we also create m subsets using a random sampling algorithm, each with an increasing number of samples.
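A rough sketch of this leave-one-out construction (before subsampling) is given below; the function name and the domain-to-samples mapping are our own scaffolding.

    def make_scenarios(data_by_domain):
        """Build n scenarios: the i-th treats domain d_i as the low-resource
        target and pools all remaining domains into the source dataset."""
        scenarios = []
        for target in data_by_domain:
            source = [ex for dom, exs in data_by_domain.items()
                      if dom != target for ex in exs]
            scenarios.append({"target_domain": target,
                              "source": source,
                              "target": data_by_domain[target]})
        return scenarios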
Algorithm 1 SPIS algorithm (Chen et al., 2020)
 1:  Input: dataset D = {(d^(i), x^(i), y^(i))}_{i=1}^n, subset cardinality k
 2:  procedure SPIS(D, k)
 3:      Shuffle D using a fixed seed
 4:      S ← ∅                          ▷ subset of dataset samples
 5:      C ← empty counter of ontology tokens
 6:      for (d^(i), x^(i), y^(i)) ∈ D do
 7:          for ontology token t ∈ y^(i) do
 8:              if C[t] < k then
 9:                  S ← S ∪ {(d^(i), x^(i), y^(i))}
10:                  add y^(i)'s ontology token counts to C
11:                  break
12:              end if
13:          end for
14:      end for
15:      return S
16:  end procedure
For our random sampling algorithm, we use samples per intent slot (SPIS), shown above, which ensures each ontology label (i.e., intent and slot) appears in at least k samples of the resulting subset (Chen et al., 2020). Unlike a traditional algorithm which selects exactly k samples, SPIS guarantees coverage over the entire ontology, but as a result, the number of samples per subset is typically much greater than k (Chen et al., 2020). Therefore, we use conservative values of k; for each scenario, we sample target subsets of 1, 2, 5, and 10 SPIS. Our most extreme setting of 1 SPIS is still 10× smaller than the equivalent setting in prior work (Chen et al., 2020; Ghoshal et al., 2020).

5   Experimental Setup

We seek to answer three questions in our experiments: (1) How sample efficient is our model when benchmarked on TOPv2-DA? (2) Does our model perform well on average or does it selectively work on particular domains? (3) How do the intrinsic properties of an inventory component (e.g., types and spans) contribute to performance?

Systems for Comparison. We chiefly experiment with CopyGen and Inventory, a classical copy-generate parser and our proposed inventory parser, as discussed in Sections 2 and 3, respectively. Though both models are built on top of off-the-shelf, seq2seq transformers, the copy-generate parser requires special tweaking; to prepare its "generate" component, we augment the decoder vocabulary with dataset-specific ontology tokens and initialize their embeddings randomly, as is standard practice (Aghajanyan et al., 2020; Chen et al., 2020; Li et al., 2020). In addition, both models are initialized with pre-trained weights. We use BART (Lewis et al., 2020), a seq2seq transformer pre-trained with a denoising objective for generation tasks. Specifically, we use the BART-Base (139M parameters; 12L, 768H, 16A) and BART-Large (406M parameters; 24L, 1024H, 16A) checkpoints.

We benchmark the sample efficiency of these models on TOPv2-DA. Following the methodology outlined in Section 4, for each scenario and subset experiment, each model undergoes two rounds of fine-tuning: it is initially fine-tuned on a high-resource, source dataset, then fine-tuned again on a low-resource, target dataset using the splits in Table 2. The resulting model is then evaluated on the target domain's TOPv2 test set; note that this set is not subsampled, for accurate evaluation. We report the exact match (EM) between the predicted and gold frames. To account for variance, we average EM across three runs, each with a different seed.

Hyperparameters. We use BART checkpoints from fairseq (Ott et al., 2019) and elect to use most hyperparameters out-of-the-box. However, during initial experimentation, we find the batch size, learning rate, and dropout settings to heavily impact performance, especially for target fine-tuning. For source fine-tuning, our models use a batch size of 16, dropout in [0, 0.5], and learning rate in [1e-5, 3e-5]. Each model is fine-tuned on a single 32GB GPU given the size of the source datasets. For target fine-tuning, our models use a batch size in [1, 2, 4, 8], dropout in [0, 0.5], and learning rate in [1e-6, 3e-5]. Each model is fine-tuned on a single 16GB GPU. Finally, across both source and target fine-tuning, we optimize models with Adam (Kingma and Ba, 2015).
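As a reference for the evaluation metric above: exact match reduces to an equality check between predicted and gold frames, averaged over the test set. The whitespace normalization in the sketch below is our assumption.

    def exact_match(predictions, references):
        """Percentage of predicted frames identical to the gold frames."""
        normalize = lambda frame: " ".join(frame.split())
        hits = sum(normalize(p) == normalize(g)
                   for p, g in zip(predictions, references))
        return 100.0 * hits / len(references)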
                  SPIS
                  1       2       5       10
CopyGen-Base      27.93   39.12   46.23   52.51
CopyGen-Large     35.51   44.40   51.32   56.09
Inventory-Base    38.93   48.98   57.51   63.19
Inventory-Large   51.34   57.63   63.06   68.76

Table 3: Coarse-grained results on TOPv2-DA. Each model's EMs are averaged across 8 domains. Both Inventory-Base and Inventory-Large outperform CopyGen in 1, 2, 5, and 10 SPIS settings.

Figure 2: ∆ EM between the base and large variants of Inventory and CopyGen on TOPv2-DA. Notably, Inventory makes the best use of large representations, with ∆ = 12.41 EM at 1 SPIS.

Figure 3: EM versus % compositionality and # ontology labels, two important characteristics of task-oriented domains, at 1 SPIS. Horizontal lines show average EM, while vertical lines show average characteristics. Domains include alarm (alm), event (eve), messaging (msg), music (mus), navigation (nav), reminder (rem), timer (tim), and weather (wth).

6   Results and Discussion

6.1   TOPv2-DA Experiments

Table 3 presents the EM of CopyGen and Inventory on TOPv2-DA averaged across 8 domains. We also present more fine-grained results in Table 4, breaking down EM by domain. From these tables, we draw the following conclusions:

Inventory consistently outperforms CopyGen in 1, 2, 5, and 10 SPIS settings. On average, Inventory shows improvements across the board, improving upon CopyGen by at least +10 EM on each SPIS subset. Compared to CopyGen, Inventory is especially strong at 1 SPIS, demonstrating gains of +11 and +15 EM across the base and large variants, respectively. Furthermore, we see Inventory-Base outperforms CopyGen-Large, indicating our model's performance can be attributed to more than just the pre-trained weights and, as a result, carries more utility in compute-constrained environments.

However, provided that these constraints are not a concern, Inventory makes better use of larger representations. Figure 2 illustrates this by plotting the ∆ EM between the base and large variants of both models. The delta is especially pronounced at 1 SPIS, where Inventory-Base → Inventory-Large yields +12 EM but CopyGen-Base → CopyGen-Large only yields +7 EM. Unlike CopyGen, which requires fine-tuning extra parameters in a target domain, Inventory seamlessly integrates stronger representations without modification to the underlying architecture. This is an advantage: we expect our model to iteratively improve in quality with the advent of new pre-trained transformers.

Inventory also yields strong results when inspecting each domain separately. TOPv2 domains typically have a wide range of characteristics, such as their compositionality or ontology size, so one factor we can investigate is how our model performs on a per-domain basis. Specifically, is our model generalizing across the board or overfitting to particular settings? Using the large model's per-domain, 1 SPIS results in Table 4, we analyze EM versus % compositionality (fraction of nested frames) and # ontology labels (count of intents and slots).
                      Base Model (SPIS)                                 Large Model (SPIS)
             1            2            5            10           1            2            5            10

Domain: Alarm
CopyGen      20.41±1.16   38.90±4.92   45.50±4.00   52.01±2.68   36.91±3.10   43.70±4.73   45.73±1.42   53.89±0.58
Inventory    62.13±0.42   65.26±0.94   71.81±1.36   75.27±2.16   67.25±0.86   72.11±0.96   71.82±2.83   78.15±2.04

Domain: Event
CopyGen      31.85±0.22   38.85±0.84   38.31±0.71   41.78±2.00   32.37±0.72   34.59±0.62   38.48±0.21   43.93±0.41
Inventory    46.57±1.64   54.31±0.53   58.87±5.03   68.42±2.06   64.77±1.41   55.84±4.20   67.70±0.38   71.21±1.08

Domain: Messaging
CopyGen      38.12±0.61   49.79±2.26   52.79±2.56   58.90±2.29   46.57±4.70   58.42±2.80   56.54±7.94   63.10±1.90
Inventory    46.54±4.27   57.43±2.48   63.72±5.82   70.14±3.98   60.36±3.38   66.68±3.45   74.69±2.17   78.04±2.46

Domain: Music
CopyGen      25.58±0.19   33.28±2.15   48.75±2.07   55.16±3.34   23.84±17.42  36.84±3.78   56.17±0.55   59.18±0.35
Inventory    23.00±0.65   39.65±4.48   53.59±0.45   52.18±2.49   38.68±1.14   52.75±1.71   58.23±1.54   59.73±1.73

Domain: Navigation
CopyGen      19.96±0.84   30.11±3.88   43.38±1.24   45.26±4.59   24.31±0.97   36.28±2.29   48.71±1.75   56.14±0.53
Inventory    21.16±6.59   29.08±0.47   42.59±3.48   53.97±0.56   28.74±2.11   47.47±2.42   49.98±2.13   64.08±0.88

Domain: Reminder
CopyGen      23.66±3.18   23.30±2.24   36.37±2.64   41.66±1.46   31.74±2.53   31.82±2.34   41.57±1.73   42.62±2.09
Inventory    28.58±2.28   38.21±2.81   48.88±1.18   52.04±5.76   40.72±2.22   41.95±2.25   53.57±0.73   58.24±2.75

Domain: Timer
CopyGen      16.62±1.50   40.80±21.47  54.79±2.55   63.26±0.85   32.64±0.33   59.94±1.73   59.80±3.10   66.27±2.54
Inventory    28.92±1.95   53.58±3.44   55.54±3.90   66.82±0.15   48.45±4.44   61.70±3.04   63.74±1.09   68.44±2.09

Domain: Weather
CopyGen      47.24±11.10  57.97±2.35   49.94±11.30  62.07±2.17   53.08±1.31   53.60±0.43   63.56±3.41   63.58±1.60
Inventory    54.53±1.94   54.31±2.87   65.09±1.71   66.66±3.40   61.77±1.71   62.52±1.85   64.73±2.20   72.14±1.76

Table 4: Fine-grained results on TOPv2-DA. For each domain, base model EM is shown in the left half and large model EM in the right half. Each cell reports mean ± standard deviation across three runs. Even on a per-domain basis, both Inventory-Base and Inventory-Large outperform CopyGen in most 1, 2, 5, and 10 SPIS settings.
Model            Utterance / Frame

                 I need you to send a video message now
Index            [IN:SEND_MESSAGE ]
  + Type, Span   [IN:SEND_MESSAGE + [SL:TYPE_CONTENT video ] ]

                 Did I get any messages Tuesday on Twitter
Index            [IN:GET_MESSAGE - [SL:RECIPIENT i ] [SL:ORDINAL Tuesday ] - [SL:TAG_MESSAGE Twitter ] ]
  + Type, Span   [IN:GET_MESSAGE + [SL:DATE_TIME Tuesday ] + [SL:RESOURCE Twitter ] ]

                 Message Lacey and let her know I will be at the Boxer Rescue Fundraiser Saturday around 8
Index            [IN:SEND_MESSAGE [SL:RECIPIENT Lacey ] [SL:CONTENT_EXACT I will be at the Boxer Rescue Fundraiser - ] [SL:GROUP Saturday around 8 ] ]
  + Type, Span   [IN:SEND_MESSAGE [SL:RECIPIENT Lacey ] [SL:CONTENT_EXACT I will be at the Boxer Rescue Fundraiser Saturday around 8 ] ]

Table 5: Comparing index-only and index + type + span parsers. In each row, we show the utterance and, for each model, its predicted frame; here, the index + type + span frames are always correct. For visualization of edit distance, we use + and - to indicate additions and deletions, respectively.

Figure 3 plots these relationships. A key trend we notice is Inventory improves EM in general, though better performance is skewed towards domains with around 20% compositionality and 20-30 ontology labels. This can be partially explained by the fact that domains with these characteristics are more empirically dominant in TOPv2, as shown by the proximity of the dots to the vertical bars. Domains like reminder and navigation are more challenging given the size of their ontology space, but Inventory still outperforms CopyGen by a reasonable margin.

6.2   Inventory Ablation

Moving beyond benchmark performance, we now turn towards better understanding the driving factors behind our model's performance. From Section 3.1, recall each inventory component consists of an index, type, and span. The index is merely a unique identifier, while the type and span represent intrinsic properties of a label. Therefore, the goal of our ablation is to quantify the impact of adding types and spans to inventories. Because conducting ablation experiments on each domain is cost-prohibitive, we use the messaging domain as a case study given its samples strike a balance between compositionality and ontology size.

We experiment with three Inventory-Large models, where each model iteratively adds an element to its inventory components: (1) index only; (2) index and type; (3) index, type, and span. The results are shown in Table 6. Here, we see that while an index-only model performs poorly, adding types and spans improves performance across all subsets. At 1 SPIS, in particular, an index-only model improves by roughly +10 and +20 EM when types and spans are added, respectively. These results suggest that these intrinsic properties provide a useful inductive bias in the absence of copious training data.

                 SPIS
Component    1       2       5       10
Index        36.78   44.13   60.63   61.90
  + Type     46.54   49.67   65.21   69.98
  + Span     60.36   66.68   74.69   78.04

Table 6: Inventory ablation experiment results. We benchmark the performance of three Inventory-Large models on the messaging domain, each adding an intrinsic property to their inventories. Our full model, where each component consists of an index, type, and span, outperforms baselines by a wide margin.

In Table 5, we contrast the predictions of the index-only (1) and index + type + span (3) models more closely, specifically looking at 1 SPIS
Model        Utterance / Frame

Domain: Alarm
             Delete my 6pm alarm
Inventory    [IN:DELETE_ALARM [SL:DATE_TIME 6pm ] ]
Oracle       [IN:DELETE_ALARM + [SL:ALARM_NAME [IN:GET_TIME [SL:DATE_TIME 6pm ] + ] ] ]

Domain: Event
             Fun activities in Letchworth next summer
Inventory    [IN:GET_EVENT [SL:CATEGORY_EVENT - fun activities ] [SL:LOCATION Letchworth ] [SL:DATE_TIME next summer ] ]
Oracle       [IN:GET_EVENT [SL:CATEGORY_EVENT activities ] [SL:LOCATION Letchworth ] [SL:DATE_TIME next summer ] ]

Domain: Messaging
             Message Candy to send me details for her baby shower
Inventory    [IN:SEND_MESSAGE - [SL:SENDER Candy ] [SL:CONTENT_EXACT details for her baby shower ] ]
Oracle       [IN:SEND_MESSAGE + [SL:RECIPIENT Candy ] [SL:CONTENT_EXACT + send me details for her baby shower ] ]

Domain: Navigation
             What is the distance between Myanmar and Thailand
Inventory    [IN:GET_DISTANCE - [SL:UNIT_DISTANCE Myanmar ] - [SL:UNIT_DISTANCE Thailand ] ]
Oracle       [IN:GET_DISTANCE + [SL:SOURCE Myanmar ] + [SL:DESTINATION Thailand ] ]

Domain: Reminder
             Remind me that I have lunch plans with Derek in two days at 1pm
Inventory    [IN:CREATE_REMINDER [SL:PERSON_REMINDED me ] [SL:TODO I have lunch plans ] [SL:ATTENDEE_EVENT Derek ] [SL:DATE_TIME in two days ] [SL:DATE_TIME at 1pm ] ]
Oracle       [IN:CREATE_REMINDER [SL:PERSON_REMINDED me ] [SL:TODO + [IN:GET_TODO [SL:TODO lunch plans ] [SL:ATTENDEE Derek ] + ] ] [SL:DATE_TIME + in two days at 1pm ] ]

Domain: Timer
             Stop the timer
Inventory    - [IN:DELETE_TIMER [SL:METHOD_TIMER timer ] ]
Oracle       + [IN:PAUSE_TIMER [SL:METHOD_TIMER timer ] ]

Domain: Weather
             What is the pollen count for today in Florida
Inventory    - [IN:GET_WEATHER [SL:WEATHER_ATTRIBUTE pollen ] [SL:DATE_TIME for today ] [SL:LOCATION Florida ] ]
Oracle       + [IN:UNSUPPORTED_WEATHER [SL:WEATHER_ATTRIBUTE pollen + count ] [SL:DATE_TIME for today ] [SL:LOCATION Florida ] ]

Table 7: Error analysis of domain-specific inventory parsers. In each row, we show the utterance and compare our inventory model's predicted frame to an oracle model's gold frame. For visualization of edit distance, we use + and - to indicate additions and deletions, respectively.
cases where the frame goes from being incorrect to correct. We see a couple of cases where knowing about a label's intrinsic properties might help make the correct assessment during frame generation. The second example shows a scenario where our model labels "Tuesday" as SL:DATE_TIME rather than SL:ORDINAL. This distinction is more or less obvious when contrasting the phrases "date time" and "ordinal", where the latter typically maps to numbers. In the third example, a trickier scenario, our model correctly labels the entire subordinate clause as an exact content slot. While partitioning this clause and assigning slots to its constituents may yield a plausible frame, in this instance, there is not much correspondence between SL:GROUP and "Saturday around 8".

7   Error Analysis

Thus far, we have demonstrated the efficacy of inventory parsers, but we have not yet conducted a thorough investigation of their errors. Though models may not achieve perfect EM in low-resource settings, they should ideally fail gracefully, making mistakes which roughly align with intuition. In this section, we assess this by combing through our model's cross-domain errors. Using Inventory-Large models fine-tuned in each domain's 1 SPIS setting, we first manually inspect 100 randomly sampled errors to build an understanding of the error distribution. Then, for each domain, we select one representative error, and present the predicted and gold frames in Table 7.

In most cases, the edit distance between the predicted and gold frames is quite low, indicating the frames our models produce are fundamentally good and do not require substantial modification. We do not see evidence of erratic behavior caused by autoregressive modeling, such as syntactically invalid frames or extraneous subword tokens in the output sequence. Instead, most errors are relatively benign; we can potentially resolve them with rule-based transformations or data augmentation, though these are outside the scope of our work. Below, we comment on specific observations:

Frame slots are largely correct and respect linguistic properties. One factor we investigate is whether our model copies over utterance spans correctly, which correspond to arguments in an API call. These spans typically lie on well-defined constituent boundaries (e.g., prepositional phrases), so we inspect the degree to which this is respected. Encouragingly, the vast majority of spans our model copies over are correct, and the cases which are incorrect consist of adding or dropping modifiers. For example, in the event example, our model adds the adjective "fun", and in the weather example, our model drops the noun "count". These cases are relatively insignificant; they are typically a result of annotation inconsistency and do not carry much weight in practice. However, a more serious error we see is failing to copy over larger spans. For example, in the reminder example, SL:DATE_TIME corresponds to both "in two days" and "at 1pm", but our model only copies over the latter.

Predicting compositional structures is challenging in low-resource settings. Our model also struggles with compositionality in low-resource settings. In both the alarm and reminder examples, our model does not correctly create nested structures, which reflect how slots ought to be handled during execution. Specifically, in the alarm example, because "6pm" is both a name and a date/time, the gold frame suggests resolving the alarm in question before deleting it. Similarly, in the reminder example, we must first retrieve the "lunch plans" todo before including it as a component in the remainder of the frame. Both of these cases are tricky as they target prescriptive rather than descriptive behavior. Parsers often learn this type of compositionality in a data-driven fashion, but it remains an open question how to encourage this behavior given minimal supervision.

Ontology labels referring to "concepts" are also difficult. Another trend we notice is our model predicts concept-based ontology labels with low precision. These labels require understanding a deeper concept which is not immediately clear from the surface description. A prominent example of this is the ontology label IN:UNSUPPORTED_WEATHER, used to tag unsupported weather intents. To use this label, a parser must understand the distinction between in-domain and out-of-domain intents, which is difficult to ascertain from inventories alone. Other examples of this phenomenon manifest in the messaging and navigation domains with the slot pairs (SL:SENDER, SL:RECIPIENT) and (SL:SOURCE, SL:DESTINATION), respectively. While these slots are easier to comprehend given their intrinsic properties, a parser must leverage contextual signals and jointly reason over their spans to predict them.
8   Related Work

Prior work improving the generalization of task-oriented semantic parsers can be categorized into two groups: (1) contextual model architectures and (2) fine-tuning and optimization. We compare and contrast our work along these two axes below.

Contextual Model Architectures. Bapna et al. (2017); Lee and Jha (2018); Shah et al. (2019) propose BiLSTMs which process both utterance and slot description embeddings, and optionally, entire examples, to generalize to unseen domains. Similar to our work, slot descriptions help contextualize what their respective labels align to. These descriptions can either be manually curated or automatically sourced. Our work has three key differences: (1) Inventories are more generalizable, specifying a format which encompasses multiple intrinsic properties of ontology labels, namely their types and spans. In contrast, prior work largely focuses on spans, and even then, only for slot labels. (2) Our model is interpretable: the decoder explicitly aligns inventory components and utterance spans during generation, which can aid debugging. In contrast, slot description embeddings are used in an opaque manner; the mechanism through which BiLSTMs use them to tag slots is largely hidden. (3) We leverage a seq2seq framework which integrates inventories without modification to the underlying encoder and decoder. In contrast, prior work typically builds task-specific architectures consisting of a range of trainable components, which can complicate training.

Fine-tuning and Optimization. Recently, low-resource semantic parsing has seen a methodological shift with the advent of pre-trained transformers. Instead of developing new architectures, as discussed above, one thrust of research tackles domain adaptation via robust optimization. These methods are typically divided between source and target domain fine-tuning. Chen et al. (2020) use Reptile (Nichol et al., 2018), a meta-learning algorithm which explicitly optimizes for generalization during source fine-tuning. Similarly, Ghoshal et al. (2020) develop LORAS, a low-rank adaptive label smoothing algorithm which navigates structured output spaces, therefore improving target fine-tuning. Our work is largely orthogonal; we focus on redefining the inputs and outputs of a transformer-based parser, but do not subscribe to specific fine-tuning or optimization practices. Our experiments use MLE and Adam for simplicity, though future work can consider improving our source and target fine-tuning steps with better algorithms. However, one important caveat is that both Reptile and LORAS rely on strong representations (i.e., BART-Large) for maximum efficiency, and typically show marginal returns with weaker representations (i.e., BART-Base). In contrast, even when using standard practices, both the base and large variants of our model perform well, indicating our approach is more broadly applicable.

9   Conclusion

In this work, we present a seq2seq-based, task-oriented semantic parser based on inventories, tabular structures which capture the intrinsic properties of an ontology space, such as label types (e.g., "slot") and spans (e.g., "time zone"). Our approach is both simple and flexible: we leverage out-of-the-box, pre-trained transformers with no modification to the underlying architecture. We chiefly perform evaluations on TOPv2-DA, a benchmark consisting of 32 low-resource experiments across 8 domains. Experiments show our inventory parser outperforms classical copy-generate parsers by a wide margin, and ablations illustrate the importance of types and spans. Finally, we conclude with an error analysis, providing insight on the types of errors practitioners can expect when using our model in low-resource settings.

References

Armen Aghajanyan, Jean Maillard, Akshat Shrivastava, Keith Diedrick, Michael Haeger, Haoran Li, Yashar Mehdad, Veselin Stoyanov, Anuj Kumar, Mike Lewis, and Sonal Gupta. 2020. Conversational Semantic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2017. Towards Zero-Shot Frame Semantic Parsing for Domain Scaling. In Proceedings of INTERSPEECH.

Xilun Chen, Asish Ghoshal, Yashar Mehdad, Luke Zettlemoyer, and Sonal Gupta. 2020. Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Arash Einolghozati, Panupong Pasupat, Sonal Gupta, Rushin Shah, Mrinal Mohit, Mike Lewis, and Luke Zettlemoyer. 2018. Improving Semantic Parsing for Task-Oriented Dialog. In Proceedings of the Conversational AI Workshop.

Asish Ghoshal, Xilun Chen, Sonal Gupta, Luke Zettlemoyer, and Yashar Mehdad. 2020. Learning Better Structured Representations using Low-Rank Adaptive Label Smoothing. In Proceedings of the International Conference on Learning Representations (ICLR).

Sonal Gupta, Rushin Shah, Mrinal Mohit, Anuj Kumar, and Mike Lewis. 2018. Semantic Parsing for Task Oriented Dialog using Hierarchical Representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jonathan Herzig and Jonathan Berant. 2019. Don't Paraphrase, Detect! Rapid and Effective Data Collection for Semantic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Robin Jia and Percy Liang. 2016. Data Recombination for Neural Semantic Parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference for Learning Representations (ICLR).

Sungjin Lee and Rahul Jha. 2018. Zero-Shot Adaptive Transfer for Conversational Language Understanding. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, and Yashar Mehdad. 2020. MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark. arXiv preprint arXiv:2008.09335.

Alex Nichol, Joshua Achiam, and John Schulman. 2018. On First-Order Meta-Learning Algorithms. arXiv preprint arXiv:1803.02999.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. arXiv preprint arXiv:1904.01038.

Panupong Pasupat, Sonal Gupta, Karishma Mandyam, Rushin Shah, Michael Lewis, and Luke Zettlemoyer. 2019. Span-based Hierarchical Semantic Parsing for Task-Oriented Dialog. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the Point: Summarization with Pointer-Generator Networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Darsh Shah, Raghav Gupta, Amir Fayazi, and Dilek Hakkani-Tur. 2019. Robust Zero-Shot Cross-Domain Slot Filling with Example Values. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS).

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a Semantic Parser Overnight. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).