Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding

Samson Tan (Salesforce AI Research, National University of Singapore), Shafiq Joty (Salesforce AI Research, Nanyang Technological University), Lav R. Varshney (University of Illinois at Urbana-Champaign, Salesforce AI Research), Min-Yen Kan (National University of Singapore)
{samson.tan,sjoty}@salesforce.com   kanmy@comp.nus.edu.sg   varshney@illinois.edu
arXiv:2004.14870v2 [cs.CL] 11 Oct 2020

Abstract

Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training, and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.¹

[Figure 1: Base-Inflection Encoding reduces inflected words to their base forms, then reinjects the grammatical information into the sentence as inflection symbols.]

1   Introduction

Large-scale neural models have proven successful at a wide range of natural language processing (NLP) tasks but are susceptible to amplifying discrimination against minority linguistic communities (Hovy and Spruit, 2016; Tan et al., 2020) due to selection bias in the training data and model overamplification (Shah et al., 2019).

Most datasets implicitly assume a distribution of error-free Standard English speakers, but this does not accurately reflect the majority of the global English-speaking population, who are either second-language (L2) or non-standard dialect speakers (Crystal, 2003; Eberhard et al., 2019). These World Englishes differ at the lexical, morphological, and syntactic levels (Kachru et al., 2009); sensitivity to these variations predisposes English NLP systems to discriminate against speakers of World Englishes by either misunderstanding or misinterpreting them (Hern, 2017; Tatman, 2017). Left unchecked, these biases could inadvertently propagate to future models via metrics built around pretrained models, such as BERTScore (Zhang et al., 2020).

In particular, Tan et al. (2020) show that current question answering and machine translation systems are overly sensitive to non-standard inflections—a common feature of dialects such as Colloquial Singapore English (CSE) and African American Vernacular English (AAVE).² Since people naturally correct for or ignore non-standard inflection use (Foster and Wigglesworth, 2016), we should expect NLP systems to be equally robust.

Existing work on adversarial robustness for NLP primarily focuses on adversarial training methods (Belinkov and Bisk, 2018; Ribeiro et al., 2018; Tan et al., 2020) or on classifying and correcting adversarial examples (Zhou et al., 2019a). However, this effectively increases the size of the training dataset by including adversarial examples, or requires training a new model to identify and correct perturbations, thereby significantly increasing the overall computational cost of creating robust models.

These approaches also only operate on either raw text or the model, ignoring tokenization—an operation that transforms raw text into a form that the neural network can learn from. We introduce a

¹ Code will be available at github.com/salesforce/bite.
² Examples in Appendix A.
new representation for word tokens that separates the base from the inflection. This improves both model robustness and vocabulary efficiency by explicitly inducing linguistic structure in the input to the NLP system (Erdmann et al., 2019; Henderson, 2020).

Many extant NLP systems use a combination of a whitespace and punctuation tokenizer followed by a data-driven subword tokenizer such as byte pair encoding (BPE; Sennrich et al., 2016). However, a purely data-driven approach may fail to find the optimal encoding, both in terms of vocabulary efficiency and cross-dialectal generalization. This could make the neural model more vulnerable to inflectional perturbations. Hence, we:

• Propose Base-InflecTion Encoding (BITE), which uses morphological information to help the data-driven tokenizer use its vocabulary efficiently and generate robust symbol³ sequences. In contrast to morphological segmentors such as Linguistica (Goldsmith, 2000) and Morfessor (Creutz and Lagus, 2002), we reduce inflected forms to their base forms before reinjecting the inflection information into the encoded sequence as special symbols. This approach gracefully handles the canonicalization of words with nonconcatenative morphology while generally allowing the original sentence to be reconstructed.

• Demonstrate BITE's effectiveness at making neural NLP systems robust to non-standard inflection use while preserving performance on Standard English examples. Crucially, simply fine-tuning the pretrained model for the downstream task after adding BITE is sufficient. Unlike adversarial training, BITE does not enlarge the dataset and is more computationally efficient.

• Show that BITE helps BERT (Devlin et al., 2019) generalize to dialects unseen during training and also helps Transformer-big (Ott et al., 2018) converge faster for the WMT'14 En-De task.

• Propose metrics like symbol complexity to operationalize and evaluate the vocabulary efficiency of an encoding scheme. Our metrics are generic and can be used to evaluate any tokenizer.

³ Following Sennrich et al. (2016), we use symbol instead of token to avoid confusion with the unencoded word token.

2   Related Work

Subword tokenization. Before neural models can learn, raw text must first be encoded into symbols with the help of a fixed-size vocabulary. Early models represented each word as a single symbol in the vocabulary (Bengio et al., 2001; Collobert et al., 2011), and uncommon words were represented by an unknown symbol. However, such a representation is unable to adequately deal with words absent from the training vocabulary. Therefore, subword representations like WordPiece (Schuster and Nakajima, 2012) and BPE (Sennrich et al., 2016) were proposed to encode out-of-vocabulary (OOV) words by segmenting them into subwords and encoding each subword as a separate symbol. This way, less information is lost in the encoding process since OOV words are approximated as a combination of subwords in the vocabulary. Wang et al. (2019) reduce vocabulary sizes by operating on bytes instead of characters (as in standard BPE).

To make subword regularization more tractable, Kudo (2018) proposed an alternative method of building a subword vocabulary by reducing an initially oversized vocabulary down to the required size with the aid of a unigram language model, as opposed to incrementally building a vocabulary as in WordPiece and BPE variants. However, machine translation systems operating on subwords still have trouble translating rare words from highly inflected categories (Koehn and Knowles, 2017).

Sadat and Habash (2006), Koehn and Hoang (2007), and Kann and Schütze (2016) propose to improve machine translation and morphological reinflection by encoding morphological features separately, while Sylak-Glassman et al. (2015) propose a schema for inflectional features. Avraham and Goldberg (2017) explore the effect of learning word embeddings from base forms and morphological tags for Hebrew, while Chaudhary et al. (2018) show that representing words as base forms, phonemes, and morphological tags improves cross-lingual transfer for low-resource languages.

Adversarial robustness in NLP. To harden NLP systems against adversarial examples, existing work largely uses adversarial training (Goodfellow et al., 2015; Jia and Liang, 2017; Ebrahimi et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018; Iyyer et al., 2018; Cheng et al., 2019). However, this generally involves retraining the model with the adversarial data, which is computationally expensive and time-consuming. Tan et al. (2020) showed that simply fine-tuning a trained model for a single epoch on appropriately generated adversarial training data is sufficient to harden the model against inflectional adversaries. Instead of
adversarial training, Piktus et al. (2019) train word embeddings to be robust to misspellings, while Zhou et al. (2019b) propose using a BERT-based model to detect adversaries and recover clean examples. Jia et al. (2019) and Huang et al. (2019) use Interval Bound Propagation to train provably robust pre-Transformer models, while Shi et al. (2020) propose an efficient algorithm for training certifiably robust Transformer architectures.

Summary. Popular subword tokenizers operate on surface forms in a purely data-driven manner. Existing adversarial robustness methods for large-scale Transformers are computationally expensive, while provably robust methods have only been shown to work for pre-Transformer architectures and small-scale Transformers.

Our work uses linguistic information (inflectional morphology) in conjunction with data-driven subword encoding schemes to make large-scale NLP models robust to non-standard inflections and generalize better to L2 and World Englishes, while preserving performance for Standard English. We also show that our method helps existing subword tokenizers use their vocabulary more efficiently.

3   Linguistically-Grounded Tokenization

Data-driven subword tokenizers like BPE improve a model's ability to approximate the semantics of unknown words by splitting them into subwords.

Although the fully data-driven nature of such methods makes them language-agnostic, this forces them to rely only on the statistics of the surface forms when transforming words into subwords since they do not exploit any language-specific morphological regularities. To illustrate, the past tense forms of go, take, and keep are went, took, and kept, respectively, which have little to no overlap with their base forms⁴ or with each other even though they share the same tense. These six surface forms would likely have no subwords in common in the vocabulary. Consequently, the neural model would have the burden of learning both the relation between base forms and inflected forms and the relation between inflections for the same tense. Additionally, since vocabularies are fixed before model training, such an encoding does not optimally use a limited vocabulary.

Even when inflections do not orthographically alter the base form and there is a significant overlap between the base and inflected forms, e.g., with the -ed and -d suffixes, the suffix may be encoded as a separate subword, and base forms and suffixes may not be consistently represented. To illustrate, encoding danced as [dance, d] and dancing as [danc, ing] results in two different "base forms" for the same word, dance. This again burdens the model with learning that the two "base forms" mean the same thing and makes inefficient use of a limited vocabulary.

When encoded in conjunction with another inflected form like entered, which should be encoded as [enter, ed], this encoding scheme also produces two different subwords, -ed vs. -d, for the same type of inflection. As in the first example, the burden of learning that the two suffixes correspond to the same tense is transferred to the learning model.

A possible solution is to instead encode danced as [danc, ed] and dancing as [danc, ing], but there is no guarantee that a data-driven encoding scheme will learn this pattern without some language-specific linguistic supervision. In addition, this unnecessarily splits up the base form into two subwords, danc and e; the latter contains no extra semantic or grammatical information yet increases the encoded sequence length. Although individually minor, encoding many base words in this manner increases the computational cost for any encoder or decoder network.

Finally, although it is theoretically possible to force a data-driven tokenizer to segment inflected forms into morphologically logical subwords by limiting the vocabulary size, many inflected forms are represented as individual symbols at common vocabulary sizes (30–40k). We found that the BERTbase WordPiece tokenizer and BPE⁵ encoded each of the above examples as single symbols.

⁴ Base (no quotes) is synonymous with lemma in this paper.
⁵ Trained on Wikipedia+BookCorpus (1M) with a vocabulary size of 30k symbols.
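A quick way to check this single-symbol observation is to inspect a pretrained WordPiece vocabulary directly. The following snippet is ours (not part of the original study) and assumes the cased BERTbase vocabulary shipped with the HuggingFace transformers library:

# Inspect how a pretrained WordPiece vocabulary segments the example words.
# Assumes transformers is installed; the vocabulary choice ("bert-base-cased") is ours.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
for word in ["went", "took", "kept", "danced", "dancing", "entered"]:
    print(word, tokenizer.tokenize(word))
# The paper reports that each of these is encoded as a single symbol at a
# 30k-symbol vocabulary, e.g. ['went'] rather than ['wen', '##t'].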
3.1   Base-Inflection Encoding

To address these issues, we propose the Base-InflecTion Encoding framework (or BITE), which encodes the base form and inflection of content words separately. Similar to how existing subword encoding schemes improve the model's ability to approximate the semantics of out-of-vocabulary words with in-vocabulary subwords, BITE helps the model better handle out-of-distribution inflection usage by keeping a content word's base form consistent even when its inflected form drastically changes. This distributional deviation could manifest as adversarial examples, such as those generated by MORPHEUS (Tan et al., 2020), or sentences produced by L2 or World Englishes speakers. By keeping the base forms consistent, BITE provides adversarial robustness to the model.

BITE (Fig. 1). Given an input sentence S = [w_1, ..., w_N] where w_i is the i-th word, BITE generates a sequence of symbols S' = [w'_1, ..., w'_N] such that w'_i = [BASE(w_i), INFLECT(w_i)], where BASE(w_i) is the base form of the word and INFLECT(w_i) is the inflection (grammatical category) of the word (Algorithm 1). If w_i is not inflected, INFLECT(w_i) is NULL and excluded from the sequence of symbols to reduce the neural network's computational cost. In our implementation, we use Penn Treebank tags to represent inflections.

Algorithm 1: Base-InflecTion Encoding (BITE)
  Require: Input sentence S = [w_1, ..., w_N]
  Ensure: Encoded sequence S'
  S' ← [∅]
  for i = 1, ..., N do
      if POS(w_i) ∈ {NOUN, VERB, ADJ} then
          base ← GetLemma(w_i, POS(w_i))
          inflection ← GetInflection(w_i)
          S' ← S' + [base, inflection]
      else
          S' ← S' + [w_i]
      end if
  end for
  return S'
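To make the encoding concrete, the following minimal Python sketch (illustrative, not the released implementation) uses NLTK's POS tagger and the lemminflect package described under Implementation details below; the tag sets and the tag-to-category mapping are simplifications of our own.

# Minimal sketch of BITE encoding/decoding. Assumes nltk (with the averaged
# perceptron tagger data downloaded via nltk.download) and lemminflect.
import nltk
from lemminflect import getLemma, getInflection

INFLECTED_TAGS = {"NNS", "VBD", "VBG", "VBN", "VBP", "VBZ", "JJR", "JJS"}  # content-word tags carrying inflection
BASE_TAGS = {"NN", "VB", "JJ"}                                             # uninflected content-word tags

def _upos(tag):
    # Coarse category expected by lemminflect's lemmatizer.
    return "NOUN" if tag.startswith("NN") else "VERB" if tag.startswith("VB") else "ADJ"

def bite_encode(words):
    """Replace each content word with its base form, followed by its Penn Treebank tag if inflected."""
    out = []
    for word, tag in nltk.pos_tag(words):
        if tag in INFLECTED_TAGS or tag in BASE_TAGS:
            lemmas = getLemma(word, upos=_upos(tag))
            out.append(lemmas[0] if lemmas else word)
            if tag in INFLECTED_TAGS:      # NULL inflections are simply dropped
                out.append(tag)            # the Penn Treebank tag is the inflection symbol
        else:
            out.append(word)
    return out

def bite_decode(symbols):
    """Reinflect each base form using the inflection symbol that follows it, if any."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i + 1] in INFLECTED_TAGS:
            forms = getInflection(symbols[i], tag=symbols[i + 1])
            out.append(forms[0] if forms else symbols[i])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

print(bite_encode(["She", "took", "the", "books"]))
# expected, e.g.: ['She', 'take', 'VBD', 'the', 'book', 'NNS']
print(bite_decode(["She", "take", "VBD", "the", "book", "NNS"]))
# expected, e.g.: ['She', 'took', 'the', 'books']

Decoding simply reinflects each base form using the symbol that follows it, which is why the original sentence can usually be reconstructed from the encoded sequence.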
By lemmatizing each inflected word to obtain the base form instead of segmenting it as in most data-driven encoding schemes, BITE ensures this base form is consistent for all inflected forms of a word, unlike a subword produced by segmentation, which can only contain characters present in the original word. For example, BASE(took), BASE(taking), and BASE(taken) all correspond to the same base form, take, even though it is orthographically significantly different from took.

Similarly, encoding all inflections of the same grammatical category (e.g., verb-past-tense) in a canonical form should help the model to learn each inflection's grammatical role more quickly. This is because the model does not need to first learn that the same grammatical category can manifest in orthographically different forms.

Crucially, the original sentence can usually be reconstructed from the base forms and the grammatical information preserved by the inflection symbols, except in cases of overabundance (Thornton, 2019).

Implementation details. We use the BertPreTokenizer from the tokenizers⁶ library for whitespace and punctuation splitting. We use the NLTK (Bird et al., 2009) implementation of the averaged perceptron tagger (Collins, 2002) with greedy decoding to generate POS tags, which serve to improve lemmatization accuracy and as inflection symbols. For lemmatization and reinflection, we use lemminflect⁷, which uses a dictionary look-up together with rules for lemmatizing and inflecting words. A benefit of this approach is that the neural network can now generate orthographically appropriate inflected forms by generating the base form and the corresponding inflection symbol.

⁶ github.com/huggingface/tokenizers
⁷ github.com/bjascob/LemmInflect

3.2   Compatibility with Data-Driven Methods

Although BITE has the numerous advantages outlined above, it suffers from the same weakness as regular word-level tokenization schemes when used alone: a limited ability to handle out-of-vocabulary words. Hence, we designed BITE to be a general framework that seamlessly incorporates existing data-driven schemes to take advantage of their proven ability to handle OOV words.

To achieve this, a whitespace/punctuation-based pretokenizer is first used to transform the input into a sequence of words and punctuation characters, as is common in machine translation. Next, BITE is applied, and the resulting sequence is converted into a sequence of integers by a data-driven encoding scheme (Fig. 6 in Appendix B). In our experiments, we use BITE in this manner and refer to the combined tokenizer as "BITE+D", where D refers to the data-driven encoding scheme.
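As a rough illustration of this BITE+D wiring (our sketch, with D instantiated as BERT's WordPiece via HuggingFace; the released pipeline may differ), reusing bite_encode and INFLECTED_TAGS from the sketch in §3.1:

# Sketch of a BITE+D pipeline: pre-tokenize, apply BITE, then a data-driven encoder D.
from tokenizers.pre_tokenizers import BertPreTokenizer
from transformers import BertTokenizer

pretok = BertPreTokenizer()
wordpiece = BertTokenizer.from_pretrained("bert-base-cased")
# The inflection tags must be single symbols in D's vocabulary; when reusing a
# pretrained vocabulary, one option is to register them as added tokens.
wordpiece.add_tokens(sorted(INFLECTED_TAGS))

def bite_plus_d(text):
    words = [w for w, _ in pretok.pre_tokenize_str(text)]   # whitespace/punctuation split
    symbols = bite_encode(words)                            # BITE layer (base forms + inflection symbols)
    subwords = [piece for s in symbols for piece in wordpiece.tokenize(s)]
    return wordpiece.convert_tokens_to_ids(subwords)        # integer ids for the model

Registering the inflection tags as dedicated symbols keeps them from being split into subwords, mirroring the special-symbol treatment described above.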
4   Model-Based Experiments

We first demonstrate the effectiveness of BITE using the pretrained cased BERTbase (Devlin et al., 2019) before training a full Transformer (Vaswani et al., 2017) from scratch. We do not replace WordPiece and BPE but instead incorporate them into the BITE framework as described in §3.2. The advantages and disadvantages of this approach will be discussed in the next section. We do not do any hyperparameter tuning but use the original models' hyperparameters in all experiments (detailed in Appendix B).
                          SQuAD 2 Ans. (F1)      SQuAD 2 All (F1)       MNLI (Acc.)           MNLI-MM (Acc.)
  Encoding                Clean    MORPHEUS      Clean    MORPHEUS      Clean    MORPHEUS     Clean    MORPHEUS
  WordPiece (WP)          74.58    61.37         72.75    59.32         83.44    58.70        83.59    59.75
  BITE + WP               74.50    71.33         72.71    69.23         83.01    76.11        83.50    76.64
  WP + Adv. FT.           79.07    72.21         74.45    68.23         83.86    83.87        83.86    75.77
  BITE + WP (+1 epoch)    75.46    72.56         73.69    70.66         82.21    81.05        83.36    81.04

Table 1: BERTbase results on the clean and adversarial MultiNLI and SQuAD 2.0 examples. We compare BITE+WordPiece to both WordPiece alone and with one epoch of adversarial fine-tuning. For fair comparison with adversarial fine-tuning, we trained the BITE+WordPiece model for an extra epoch (bottom) on clean data.

4.1   Adversarial Robustness (Classification)

We evaluate BITE's ability to improve model robustness for question answering and natural language understanding using SQuAD 2.0 (Rajpurkar et al., 2018) and MultiNLI (Williams et al., 2018), respectively. We use MORPHEUS (Tan et al., 2020), an adversarial attack targeting inflectional morphology, to test the overall system's robustness to non-standard inflections. They previously demonstrated MORPHEUS's ability to generate plausible and semantically equivalent adversarial examples resembling L2 English sentences. We attack each BERTbase model separately and report F1 scores on the answerable questions and the full SQuAD 2.0 dataset, following Tan et al. (2020). In addition, for MNLI, we report scores for both the in-domain (MNLI) and out-of-domain dev. sets (MNLI-MM).

BITE+WordPiece vs. only WordPiece. First, we demonstrate the effectiveness of BITE at making the model robust to inflectional adversaries. After fine-tuning two separate BERTbase models on SQuAD 2.0 and MultiNLI, we generate adversarial examples for them using MORPHEUS. From Table 1, we observe that the BITE+WordPiece model not only achieves similar performance (±0.5) on clean data, but is significantly more robust to inflectional adversaries (10-point difference for SQuAD 2.0, 17-point difference for MultiNLI).

BITE vs. adversarial fine-tuning. Next, we compare BITE to adversarial fine-tuning (Tan et al., 2020), an economical variation of adversarial training (Goodfellow et al., 2015) for making models robust to inflectional variation. In adversarial fine-tuning, an adversarial training set is generated by randomly sampling inflectional adversaries k times from the adversarial distribution found by MORPHEUS and adding them to the original training set. Rather than retraining the model on this adversarial training set, the previously trained model is simply trained for one extra epoch. We follow the above methodology and adversarially fine-tune the WordPiece-only BERTbase for one epoch with k set to 4. To ensure a fair comparison, we also train the BITE+WordPiece BERTbase on the original training set for an extra epoch.

From Table 1, we observe that BITE is often more effective than adversarial fine-tuning at making the model more robust against inflectional adversaries, and in some cases (SQuAD 2.0 All and MNLI-MM) even without needing the additional epoch of training. However, the adversarially fine-tuned model consistently achieves better performance on clean data. This is likely because even though adversarial fine-tuning requires only a single epoch of extra training, the process of generating the training set increases its size by a factor of k and hence the number of updates. In contrast, BITE requires no extra training and is more economical.

Adversarial fine-tuning is also less effective at inducing model robustness when the adversarial example is from an out-of-domain distribution (8-point difference between MNLI and MNLI-MM). This makes it less useful for practical scenarios, where this is often the case. In contrast, BITE performs equally well on both in- and out-of-domain data, demonstrating its applicability to practical scenarios where the training and testing domains may not match. This is the result of preserving the base forms, which we investigate further in §5.2.
4.2   Machine Translation

Next, we evaluate BITE's impact on machine translation using the Transformer-big architecture (Ott et al., 2018) and the WMT'14 English–German (En–De) task. We apply BITE+BPE to the English examples and compare it to the BPE-only baseline. More details about our experimental setup can be found in Appendix B.3.

To obtain the final models, we perform early stopping based on the validation perplexity and average the last ten checkpoints. We observe that the BITE+BPE model converges 28% faster (Fig. 7) than the baseline (20k vs. 28k updates), in addition to outperforming it by 0.48 BLEU on the standard data and 3.06 BLEU on the MORPHEUS adversarial examples (Table 2). This suggests that explicit encoding of morphological information helps models learn better and more robust representations faster.

  Condition    Encoding      BLEU     METEOR
  Clean        BPE only      29.13    47.80
               BITE + BPE    29.61    48.31
  MORPHEUS     BPE only      14.71    39.54
               BITE + BPE    17.77    41.58

Table 2: Results on newstest2014 for Transformer-big trained on WMT'16 English–German (En–De).

4.3   Dialectal Variation

Apart from second languages, dialects are another common source of non-standard inflections. However, there is a dearth of task-specific datasets in English dialects like AAVE and CSE. Therefore, in this section's experiments, we use the model's pseudo perplexity (pPPL) (Wang and Cho, 2019) on monodialectal corpora as a proxy for its performance on downstream tasks in the corresponding dialect. The pPPL measures how certain the pretrained model is about its prediction and reflects its generalization ability on the dialectal datasets. To ensure fair comparisons across different subword segmentations, we normalize the pseudo log-likelihoods by the number of word tokens fed into the WordPiece component of each tokenization pipeline (Mielke, 2019). This avoids unfairly penalizing BITE for inevitably generating longer sequences. Finally, we scale the pseudo log-likelihoods by the masking probability (0.15) so that the final pPPLs are within a reasonable range.
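For reference, the following is a rough sketch of such a pseudo-perplexity computation (ours, using HuggingFace transformers). The masking, normalization, and scaling details are one plausible reading of the description above and may differ from the released code.

# Rough sketch of a word-normalized stochastic pseudo-perplexity.
# Assumptions: BertForMaskedLM from HuggingFace transformers; whitespace word
# counts stand in for "word tokens fed into the WordPiece component"; the
# treatment of the 0.15 masking probability is our reading of the text above.
import math
import random
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-cased")
mlm = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

def pseudo_pppl(sentences, mask_prob=0.15):
    total_pll, num_words = 0.0, 0
    for sent in sentences:
        ids = tok(sent, return_tensors="pt")["input_ids"][0]
        positions = [i for i in range(1, len(ids) - 1) if random.random() < mask_prob]
        if not positions:
            continue
        masked = ids.clone()
        for i in positions:
            masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0]
        log_probs = torch.log_softmax(logits, dim=-1)
        total_pll += sum(log_probs[i, ids[i]].item() for i in positions)
        num_words += len(sent.split())
    # Normalize by word tokens and scale by the masking probability (assumed
    # here to mean dividing the masked-position log-likelihood by mask_prob).
    return math.exp(-(total_pll / mask_prob) / num_words)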
[Figure 2: Pseudo perplexity of BERTbase on CSE, AAVE, and Standard English corpora. BITEabl refers to the ablated version without grammatical information. Panels: (a) Colloquial Singapore English (forum threads); (b) African American Vernacular English (CORAAL); (c) Standard English (Wikipedia+BookCorpus). y-axis: pseudo perplexity; x-axis: # of MLM training examples (×10⁶); curves include WordPiece only and WordPiece + BITE.]

Corpora. For AAVE, we use the Corpus of Regional African American Language (CORAAL) (Kendall and Farrington, 2018), which comprises transcriptions of interviews with African Americans born between 1891 and 2005. For our evaluation, only the interviewee's speech was used. In addition, we strip all in-line glosses and annotations from the transcriptions before dropping all lines with fewer than three words. After preprocessing, this corpus consists of slightly under 50k lines of text (1,144,803 word tokens; 17,324 word types).

To obtain a CSE corpus, we scrape the Infotech Clinics section of the Hardware Zone Forums⁸, a forum frequented by Singaporeans and where CSE is commonly used. Similar preprocessing to the AAVE data yields a 2.2M-line corpus (45,803,898 word tokens; 253,326 word types).

⁸ forums.hardwarezone.com.sg

Setup. We take the same pretrained BERTbase model and fine-tune two separate variants (with and without BITE) on English Wikipedia and BookCorpus (Zhu et al., 2015) using the masked language modeling (MLM) loss without the next sentence prediction (NSP) loss. We fine-tune for one epoch on increasingly large subsets of the dataset, since this has been shown to be more effective than doing the same number of gradient updates on a fixed subset (Raffel et al., 2019). Preprocessing steps are described in Appendix B.1.

Next, we evaluate model pPPLs on the AAVE
and CSE corpora, which we consider to be from dialectal distributions that differ from the training data, which is considered to be Standard English. Since calculating the stochastic pPPL requires randomly masking a certain percentage of symbols for prediction, we also experiment with doing this for each sentence multiple times before averaging them. However, we find no significant difference between doing the calculation once or five times; the random effects likely canceled out due to the large sizes of our corpora.

Results. From Fig. 2, we observe that the BITE+WordPiece model initially has a much higher pPPL on the dialectal corpora, before converging to 50–65% of the standard model's pPPL as the model adapts to the presence of the new inflection symbols (e.g., VBD, NNS, etc.). Crucially, the models are not trained on dialectal corpora, which demonstrates the effectiveness of BITE at helping models better generalize to unseen dialects. For Standard English, WordPiece+BITE performs slightly worse than WordPiece, reflecting the results on QA and NLI in Table 1. However, it is important to note that the WordPiece vocabulary used was not optimized for BITE; results from §4.2 indicate that training the data-driven tokenizer from scratch with BITE might improve performance.

CSE vs. AAVE. Astute readers might notice that there is a large difference in pPPL between the two dialectal corpora, even for the same tokenizer combination. One possible explanation is that CSE differs significantly from Standard English in morphology and syntax due to its Austronesian and Sinitic influences (Tongue, 1974). In addition, loan words and discourse particles not found in Standard English, like lah, lor, and hor, are commonplace in CSE (Leimgruber, 2009). AAVE, however, generally shares the same syntax as Standard English due to its largely English origins (Poplack, 2000) and is more similar linguistically. These differences are likely responsible for the significant increase in pPPL for CSE compared to AAVE.

Another possible explanation is that the BookCorpus may contain examples of AAVE, since the BookCorpus' source, Smashwords, also publishes African American fiction. We believe the reason for the difference is a mixture of these two factors.

4.4   Ablation Study

To tease apart the effects of BITE's two components (lemmatization and inflection symbols) on task performance, we ablate the extra grammatical information from the encoding by replacing all inflection symbols with a dummy symbol (BITEabl). As expected, BITEabl is significantly more robust to adversarial inflections (Table 3), and the slight performance drop is likely due to the POS tagger being adversarially affected. However, different tasks likely require different levels of attention to inflections, and BITE allows the network to learn this for each task. For example, NLI performance on clean data is only slightly affected by the absence of morphosyntactic information, while MT and QA performance is more significantly affected.

                      Clean               MORPHEUS
  Dataset             BITEabl   BITE      BITEabl   BITE
  SQuAD 2 (F1)
    Ans. Qns.         68.85     74.50     70.68     71.33
    All Qns.          72.90     72.71     69.29     69.23
  MNLI (Acc.)
    Matched           82.28     83.01     80.17     76.11
    Mismatched        83.18     83.50     81.21     76.64
  WMT'14 (BLEU)       28.14     29.61     20.91     17.77

Table 3: Effect of reinjecting grammatical information via inflection symbols. BITEabl refers to the ablation with the dummy symbol instead of inflection symbols.

In a similar ablation for the pPPL experiments, we find that both the canonicalizing effect of the base form and knowledge of each word's grammatical role contribute to the lower pPPL on dialectal data (Table 4 in the Appendix). We discuss this in greater detail in Appendix B.2 and also report the pseudo log-likelihoods and per-symbol pPPLs in the spirit of transparency and reproducibility.

5   Model-Independent Analyses

Finally, we analyze WordPiece, BPE, and unigram LM subword tokenizers that are trained with and without BITE. Implementation details can be found in Appendix B.4. Through our experiments, we explore how BITE improves adversarial robustness and helps the data-driven tokenizer use its vocabulary more efficiently. We use 1M examples from Wikipedia+BookCorpus for training.

5.1   Vocabulary Efficiency

We may operationalize the question of whether BITE improves vocabulary efficiency in numerous ways. We discuss two vocabulary-level measures here and a sequence-level measure in Appendix C.
[Figure 3: Comparison of coverage between BITE and a trivial baseline (word counts). y-axis: Coverage (%); x-axis: Vocabulary Size (symbols, ×10⁴); legend: Baseline, BITE.]

Vocabulary coverage. One measure of vocabulary efficiency is the coverage of a representative corpus by a vocabulary's symbols. We measure coverage by computing the total number of tokens (words and punctuation) in the corpus that are represented in the vocabulary, divided by the total number of tokens in the corpus. We use the 1M subset of Wikipedia+BookCorpus as our representative corpus. Since BITE does not require a vocabulary size to be fixed before training, we set the N most frequent types (base forms and inflections) to be our vocabulary. We use the N most frequent types in the unencoded text as our baseline vocabulary.

From Fig. 3, we observe that the BITE vocabulary achieves a higher coverage of the corpus than the baseline, hence demonstrating the efficacy of BITE at improving vocabulary efficiency. Additionally, we note that this advantage is most significant (5–7%) when the vocabulary contains fewer than 10k symbols. This implies that inflected word forms comprise a large portion of frequently occurring types, which comports with intuition.

Symbol complexity. Another measure of vocabulary efficiency is the total number of symbols needed to encode a representative set of word types. We term this the symbol complexity. Formally, given N, the total number of word types in the evaluation corpus; S_i, the sequence of symbols obtained from encoding the i-th type; and u_i, the number of unknown symbols in S_i, we define:

    \mathrm{SymbComp}(S_1, \ldots, S_N) = \sum_{i=1}^{N} f(S_i),            (1)

    f(S_i) =
    \begin{cases}
      |S_i| + u_i & \text{if } |S_i| - u_i > 0 \\
      0           & \text{otherwise.}
    \end{cases}                                                             (2)

While not strictly necessary when comparing vocabularies on the same corpus, normalizing Eq. (1) by the number of word types in the corpus may be helpful for cross-corpus comparisons. For simplicity, we define f(S_i) = 0 when there are only unknown symbols in the encoded sequence and the penalty of each extra unknown symbol to be double that of a symbol in the vocabulary.⁹ A general form of Eq. (2) is included in Appendix B.4.

⁹ |S_i| contributes the extra count.
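As a minimal sketch of how these two vocabulary-level measures can be computed (ours; the tokenizer interface, unknown-symbol handling, and vocabulary construction are illustrative assumptions):

# Sketch of the coverage and symbol-complexity measures from this subsection.
from collections import Counter

def top_n_vocab(corpus_tokens, n):
    """The N most frequent types, used as the vocabulary."""
    return {t for t, _ in Counter(corpus_tokens).most_common(n)}

def coverage(corpus_tokens, vocab):
    """Fraction of corpus tokens (words and punctuation) represented in the vocabulary."""
    return sum(1 for t in corpus_tokens if t in vocab) / len(corpus_tokens)

def symbol_complexity(word_types, encode, unk_symbol):
    """Eqs. (1)-(2): encode(word) returns the symbol sequence S_i for that type."""
    total = 0
    for w in word_types:
        symbols = encode(w)
        u = sum(1 for s in symbols if s == unk_symbol)
        if len(symbols) - u > 0:            # at least one known symbol
            total += len(symbols) + u       # each unknown symbol is double-counted
        # sequences consisting only of unknown symbols contribute 0
    return total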
[Figure 4: Symbol complexities of tokenizer vocabularies as computed in Eqs. (1) and (2). Lower is better. y-axis: Symbol Complexity (×10⁵); x-axis: Vocabulary Size (symbols, ×10⁴); legend: BPE, BPE + BITE, WordPiece, WordPiece + BITE, Unigram LM, Unigram LM + BITE.]

To measure the symbol complexities of our vocabularies, we use WordNet's single-word lemmas (Miller, 1995) as our "corpus" (N = 83118). From Fig. 4, we see that training data-driven tokenizers with BITE produces vocabularies with lower symbol complexities. Additionally, we observe that tokenizer combinations incorporating WordPiece or unigram LM generally outperform the BPE ones. We believe this to be the result of using a language model to inform vocabulary creation. It is logical that a symbol that maximizes a language model's likelihood on the training data is also semantically "denser"; hence prioritizing such symbols produces efficient vocabularies. We leave the in-depth investigation of this relationship to future work.

5.2   Adversarial Robustness

BITE's ability to make models more robust to inflectional variation can be directly attributed to its preservation of consistent, inflection-independent base forms. We demonstrate this by measuring the similarity between the encoded clean and adversarial sentences with the Ratcliff/Obershelp algorithm (Ratcliff and Metzener, 1988). We use the MultiNLI in-domain development set and the MORPHEUS adversaries generated in §4.1.

We find that clean and adversarial sequences encoded by the BITE+D tokenizers were more similar (1–2.5%) than those encoded without BITE.
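A minimal sketch of this measurement (ours), using Python's difflib, whose SequenceMatcher implements Ratcliff/Obershelp-style gestalt matching; the example sentences are illustrative and reuse bite_encode from §3.1:

# Similarity between clean and adversarial encoded sequences.
from difflib import SequenceMatcher

def encoded_similarity(clean_symbols, adversarial_symbols):
    # ratio() = 2*M/T, where M is the number of matched symbols and T the total length.
    return SequenceMatcher(None, clean_symbols, adversarial_symbols).ratio()

clean = bite_encode("The actors performed admirably".split())
adv = bite_encode("The actors performs admirably".split())
print(encoded_similarity(clean, adv))  # BITE keeps the base form 'perform' in both sequences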
[Figure 5: Mean percentage of symbols that are the same in the clean and adversarial encoded sequences. y-axis: % Similarity; x-axis: Vocabulary Size (symbols, ×10⁴); legend: BPE only, BPE + BITE, WordPiece only, WordPiece + BITE, Unigram LM only, Unigram LM + BITE.]

6   Limitations

Our BITE implementation relies on an external POS tagger to assign inflection tags to each word. This tagger requires language-specific training data, which can be a challenge for low-resource languages. However, this could be an advantage, since the overall system can be improved by training the tagger on dialect-specific datasets, or readily extended to other languages given a suitable tagger. Another drawback of BITE is that it increases the length of the encoded sequence, which may lead to extremely long sequences if used on morphologically rich languages. However, this is not an issue for English Transformer models since the increase in length will always be
Acknowledgments                                          Mathias Creutz and Krista Lagus. 2002. Unsupervised
We are grateful to Michael Yoshitaka Erlewine             discovery of morphemes. In Proceedings of the
                                                          ACL-02 Workshop on Morphological and Phonolog-
from the NUS Dept. of English Language and Liter-         ical Learning, pages 21–30. Association for Compu-
ature and our anonymous reviewers for their invalu-       tational Linguistics.
able feedback. We also thank Xuan-Phi Nguyen
                                                         David Crystal. 2003. English as a Global Language.
for his help with reproducing the Transformer-big
                                                           Cambridge University Press.
baseline. Samson is supported by Salesforce and
Singapore’s Economic Development Board under             Michael Denkowski and Alon Lavie. 2014. Meteor uni-
its Industrial Postgraduate Programme.                     versal: Language specific translation evaluation for
References

Oded Avraham and Yoav Goldberg. 2017. The interplay of semantics and morphology in word embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 422–426, Valencia, Spain. Association for Computational Linguistics.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, Vancouver, BC, Canada.

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2001. A neural probabilistic language model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 932–938. MIT Press.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media.

Aditi Chaudhary, Chunting Zhou, Lori Levin, Graham Neubig, David R. Mortensen, and Jaime Carbonell. 2018. Adapting word embeddings to new languages with morphological and phonological subword representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3285–3295, Brussels, Belgium. Association for Computational Linguistics.

Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust neural machine translation with doubly adversarial inputs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4324–4333, Florence, Italy. Association for Computational Linguistics.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1–8.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(76):2493–2537.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig, editors. 2019. Ethnologue: Languages of the World, 22nd edition. SIL International.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.

Alexander Erdmann, Salam Khalifa, Mai Oudah, Nizar Habash, and Houda Bouamor. 2019. A little linguistics goes a long way: Unsupervised segmentation with limited language specific guidance. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 113–124, Florence, Italy. Association for Computational Linguistics.

Pauline Foster and Gillian Wigglesworth. 2016. Capturing accuracy in second language performance: The case for a weighted clause ratio. Annual Review of Applied Linguistics, 36:98–116.

John Goldsmith. 2000. Linguistica: An automatic morphological analyzer. In Proceedings of the 36th Meeting of the Chicago Linguistic Society.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, San Diego, California.

James Henderson. 2020. The unstoppable rise of computational linguistics in deep learning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Seattle, Washington. Association for Computational Linguistics.
Alex Hern. 2017. Facebook translates ‘good morning’ into ‘attack them’, leading to arrest. The Guardian.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany. Association for Computational Linguistics.

Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. 2019. Achieving verified robustness to symbol substitutions via interval bound propagation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4083–4093, Hong Kong, China. Association for Computational Linguistics.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885, New Orleans, Louisiana. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4129–4142, Hong Kong, China. Association for Computational Linguistics.

Braj B. Kachru, Yamuna Kachru, and Cecil Nelson, editors. 2009. The Handbook of World Englishes. Wiley-Blackwell.

Katharina Kann and Hinrich Schütze. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 555–560, Berlin, Germany. Association for Computational Linguistics.

Tyler Kendall and Charlie Farrington. 2018. The Corpus of Regional African American Language.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876, Prague, Czech Republic. Association for Computational Linguistics.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv e-prints, arXiv:1808.06226.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Jacob RE Leimgruber. 2009. Modelling variation in Singapore English. Ph.D. thesis, Oxford University.

Sabrina J. Mielke. 2019. Can you compare perplexity across different segmentations?

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38:39–41.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 1–9, Brussels, Belgium. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Aleksandra Piktus, Necati Bora Edizel, Piotr Bojanowski, Edouard Grave, Rui Ferreira, and Fabrizio Silvestri. 2019. Misspelling oblivious word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3226–3234, Minneapolis, Minnesota. Association for Computational Linguistics.
Shana Poplack. 2000. The English History of African American English. Blackwell, Malden, MA.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, arXiv:1910.10683.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

John W. Ratcliff and David E. Metzener. 1988. Pattern-matching: The gestalt approach. Dr. Dobb’s Journal, 13(7):46.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865, Melbourne, Australia. Association for Computational Linguistics.

Fatiha Sadat and Nizar Habash. 2006. Combination of Arabic preprocessing schemes for statistical machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 1–8, Sydney, Australia. Association for Computational Linguistics.

Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699–2712, Online. Association for Computational Linguistics.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.

Deven Shah, H. Andrew Schwartz, and Dirk Hovy. 2019. Predictive biases in natural language processing models: A conceptual framework and overview. arXiv e-prints, arXiv:1912.11078.

Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie Huang, and Cho-Jui Hsieh. 2020. Robustness verification for transformers. In 8th International Conference on Learning Representations, Addis Ababa, Ethiopia.

John Sylak-Glassman, Christo Kirov, David Yarowsky, and Roger Que. 2015. A language-independent feature schema for inflectional morphology. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 674–680, Beijing, China. Association for Computational Linguistics.

Samson Tan, Shafiq Joty, Min-Yen Kan, and Richard Socher. 2020. It’s morphin’ time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2920–2935, Online. Association for Computational Linguistics.

Rachael Tatman. 2017. Gender and dialect bias in YouTube’s automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 53–59, Valencia, Spain. Association for Computational Linguistics.

Anna M. Thornton. 2019. Overabundance in morphology. In Oxford Research Encyclopedia of Linguistics.

Ray K. Tongue. 1974. The English of Singapore and Malaysia. Eastern Universities Press.

Lav R. Varshney, Nitish Shirish Keskar, and Richard Socher. 2019. Pretrained AI models: Performativity, mobility, and change. arXiv e-prints, arXiv:1909.03290.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv e-prints, arXiv:1902.04094.

Changhan Wang, Kyunghyun Cho, and Jiatao Gu. 2019. Neural machine translation with byte-level subwords. arXiv e-prints, arXiv:1909.03341.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
  Chaumond, Clement Delangue, Anthony Moi, Pier-
  ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow-
  icz, and Jamie Brew. 2019. Huggingface’s trans-
  formers: State-of-the-art natural language process-
  ing. arXiv e-prints, arXiv:1910.03771.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei
  Wang. 2019a. Learning to discriminate perturba-
  tions for blocking adversarial attacks in text classi-
  fication. In Proceedings of the 2019 Conference on
  Empirical Methods in Natural Language Processing
  and the 9th International Joint Conference on Natu-
  ral Language Processing (EMNLP-IJCNLP), pages
  4904–4913, Hong Kong, China. Association for
  Computational Linguistics.

Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei
  Wang. 2019b. Learning to discriminate perturba-
  tions for blocking adversarial attacks in text classi-
  fication. In Proceedings of the 2019 Conference on
  Empirical Methods in Natural Language Processing
  and the 9th International Joint Conference on Natu-
  ral Language Processing (EMNLP-IJCNLP), pages
  4906–4915, Hong Kong, China. Association for
  Computational Linguistics.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhut-
  dinov, Raquel Urtasun, Antonio Torralba, and Sanja
  Fidler. 2015. Aligning books and movies: Towards
  story-like visual explanations by watching movies
  and reading books. In The IEEE International Con-
  ference on Computer Vision (ICCV).
Jacob Ziv and Abraham Lempel. 1978. Compres-
   sion of individual sequences via variable-rate cod-
   ing. IEEE Transactions on Information Theory, IT-
   24(5):530–536.
A   Examples of Inflectional Variation in English Dialects

African American Vernacular English (Kendall and Farrington, 2018)

     • I dreamed about we was over my uh, father mother house, and then we was moving.

     • I be over with my friends.

     • And this boy name RD-NAME-3, he was tryna be tricky, pretend like he don’t do nothing all the time.

Colloquial Singapore English (Singlish) (Source: forums.hardwarezone.com.sg)

     • Anyone face the problem after fresh installed the Win 10 Pro, under NetWork File sharing after you enable this function (Auto discovery), the computer still failed to detect our Users connected to the same NetWork?

     • I have try it already, but no solutions appear.

     • How do time machine works??

B   Implementation/Experiment Details

All models are trained on 8 16GB Tesla V100s.

Figure 6: How BITE fits into the tokenization pipeline.

B.1   Classification Experiments

For our BERT experiments, we build BITE on top of the BertTokenizer class in Wolf et al. (2019) and use their BERT implementation and fine-tuning scripts.^11 BERT-base has 110M parameters. We do not perform a hyperparameter search and instead use the example hyperparameters for the respective scripts.

^11 github.com/huggingface/transformers/.../examples

Datasets and metrics. MultiNLI (Williams et al., 2018) is a natural language inference dataset of 392,702 training examples, 10k in-domain and 10k out-of-domain dev. examples, and 10k in-domain and 10k out-of-domain test examples spanning 10 domains. Each example comprises a premise, hypothesis, and a label indicating whether the premise entails, contradicts, or is irrelevant to the hypothesis. Models are evaluated using Accuracy = (# correct predictions) / (# predictions).

SQuAD 2.0 (Rajpurkar et al., 2018) is an extractive question answering dataset comprising more than 100k answerable questions and 50k unanswerable questions (130,319 training examples, 11,873 development examples, and 8,862 test examples). Each example is composed of a question, a passage, and an answer. Answerable questions are questions that can be answered by a span in the passage and unanswerable questions are questions that cannot be answered by a span in the passage. Models are evaluated using the F1 score.

Wikipedia+BookCorpus is a combination of English Wikipedia and BookCorpus. We use Lample and Conneau (2019)’s script to download and preprocess the Wikipedia dump before removing blank lines, overly short lines (less than three words or four characters), and lines with doc tags. We also remove blank and overly short lines from BookCorpus before concatenating and shuffling both datasets.
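A minimal sketch of these line filters, assuming one plain-text file per corpus; the function names are illustrative, and the assumption that doc tags look like <doc ...>/</doc> markers may not match the exact output of the actual preprocessing script:

    import re

    # Assumed format of the "doc tags"; the tags emitted by the real
    # preprocessing script may differ.
    DOC_TAG = re.compile(r"^</?doc\b", re.IGNORECASE)

    def keep_line(line: str) -> bool:
        """Return True if a line survives the filters described above."""
        stripped = line.strip()
        if not stripped:                      # blank line
            return False
        if len(stripped) < 4:                 # fewer than four characters
            return False
        if len(stripped.split()) < 3:         # fewer than three words
            return False
        if DOC_TAG.match(stripped):           # doc-tag line
            return False
        return True

    def clean_corpus(in_path: str, out_path: str) -> None:
        """Write the filtered version of one plain-text corpus file."""
        with open(in_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                if keep_line(line):
                    fout.write(line)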
B.2   Discussion for Perplexity Experiments

Effect of lemmatization and inflection symbols. We conduct two ablations to investigate the effects of lemmatization and inflection symbols on the models’ pseudo-perplexities: the first simply lemmatizes the input before encoding it with WordPiece (WordPiece+LEMM), and the second replaces every inflection symbol generated by BITE with a dummy symbol (WordPiece+BITEabl). The latter is the same ablation used in Table 3, and from Table 4 we see that this condition consistently achieved the lowest pPPL on all three corpora. However, we believe that the highly predictable dummy symbols likely account for these significant drops in pseudo-perplexity.

To test this hypothesis, we use the WordPiece+LEMM ablation, in which the dummy symbols are removed entirely: if the dummy symbols were not truly responsible for the large drops in pPPL, we should observe similar results for both WordPiece+LEMM and WordPiece+BITEabl. From Table 4 (pPPL per word token before WP), we see that the decrease in pPPL from WordPiece to WordPiece+LEMM is far less drastic than the decrease from WordPiece to WordPiece+BITEabl, lending evidence for rejecting the null hypothesis.

  Dataset                                WordPiece (WP)   WP+LEMM        WP+BITE            WP+BITEabl
                                                          (Lemmatize)    (+Infl. Symbols)   (+Dummy Symbol)

  Colloquial Singapore English
    Total word tokens before WP          45,803,898       45,803,898     51,982,873         51,982,873
    Pseudo negative log-likelihood       30,910,290       30,558,864     31,110,740         30,292,923
    pPPL (per word token before WP)      92.58            85.43          52.67              48.66
    pPPL (per symbol after WP)           49.10            46.39          32.02              30.20

  African American Vernacular English
    Total word tokens before WP          1,144,803        1,144,803      1,320,730          1,320,730
    Pseudo negative log-likelihood       452,269          444,021        453,031            434,621
    pPPL (per word token before WP)      13.92            13.27          9.84               8.96
    pPPL (per symbol after WP)           12.90            12.41          9.18               8.43

  Standard English
    Total word tokens before WP          252,153          252,153        290,391            290,391
    Pseudo negative log-likelihood       77,339           78,074         90,148             75,467
    pPPL (per word token before WP)      7.72             7.87           7.92               5.65
    pPPL (per symbol after WP)           6.34             6.36           6.07               4.86

Table 4: Effect of lemmatization, inflection symbols, and the dummy symbol on pseudo-perplexity (pPPL). We also show the effect of normalizing by the word-token vs. subword-symbol count. Lower is better. Bolded values indicate the lowest row-wise pPPLs, excluding WP+BITEabl due to the confounding effect of the highly predictable dummy symbols.
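Concretely, the two pPPL rows in Table 4 differ only in the count used to average the summed pseudo negative log-likelihood before exponentiating: the number of word tokens fed into WordPiece, or the number of subword symbols WordPiece produces. A minimal sketch of this normalization, assuming the summed pseudo negative log-likelihood is in nats and using illustrative numbers rather than the values in Table 4:

    import math

    def pseudo_perplexity(total_pnll: float, num_units: int) -> float:
        """Exponentiated average pseudo negative log-likelihood (assumed to be in nats)."""
        return math.exp(total_pnll / num_units)

    # Illustrative values only (not taken from Table 4).
    pnll = 1_250_000.0
    pppl_per_word = pseudo_perplexity(pnll, num_units=300_000)    # normalize by word tokens before WordPiece
    pppl_per_symbol = pseudo_perplexity(pnll, num_units=360_000)  # normalize by subword symbols after WordPiece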

Poorer performance on Standard English. We observe that lemmatizing all content words and/or reinjecting the grammatical information appears to have the opposite effect on Standard English data compared to the dialectal data. Intuitively, such an encoding should result in even more significant reductions in perplexity on Standard English, since the POS tagger and lemmatizer were trained on Standard English data. A possible explanation for these results is that the WordPiece tokenizer and BERT model are overfitted to Standard English, since they were both (pre-)trained on Standard English data.

Normalizing log-likelihoods. In an earlier version of this paper, we computed pseudo-perplexity by normalizing the pseudo log-likelihoods by the number of masked subword symbols (the default). A reviewer pointed out that per-subword-symbol perplexities are not directly comparable across different subword segmentations/vocabularies, but per-word perplexities are (Mielke, 2019; Salazar et al., 2020). However, using the same denominator would unfairly penalize models using BITE, since it inevitably increases the symbol sequence length, which affects the predicted log-likelihoods. In addition, with the exception of the inflection/dummy symbols that replaced some unused tokens, the vocabularies of all the WordPiece tokenizers used in our pseudo-perplexity experiments are exactly the same, since we do not retrain them. Therefore, we attempt to balance these two factors by normalizing by the number of word tokens fed into the WordPiece component of each tokenization pipeline in Fig. 2. We also report the per-subword pPPL and the raw pseudo negative log-likelihood in Table 4.

B.3   Machine Translation Experiments

Figure 7: Validation perplexity over the course of training for Transformer-big (curves: BPE only vs. BITE + BPE; y-axis: validation perplexity; x-axis: number of updates, ×10^4).

For our Transformer-big experiments, we use the fairseq (Ott et al., 2019) implementation and the hyperparameters from Ott et al. (2018):

   • Parameters: 210,000,000
   • BPE operations: 32,000