Overcoming Poor Word Embeddings with Word Definitions
Christopher Malon
NEC Laboratories America
Princeton, NJ 08540
malon@nec-labs.com

Abstract

Modern natural language understanding models depend on pretrained subword embeddings, but applications may need to reason about words that were never or rarely seen during pretraining. We show that examples that depend critically on a rarer word are more challenging for natural language inference models. Then we explore how a model could learn to use definitions, provided in natural text, to overcome this handicap. Our model's understanding of a definition is usually weaker than a well-modeled word embedding, but it recovers most of the performance gap from using a completely untrained word.

1 Introduction

The reliance of natural language understanding models on the information in pre-trained word embeddings limits these models from being applied reliably to rare words or technical vocabulary. To overcome this vulnerability, a model must be able to compensate for a poorly modeled word embedding with background knowledge to complete the required task.

For example, a natural language inference (NLI) model based on pre-2020 word embeddings may not be able to deduce from "Jack has COVID" that "Jack is sick." By providing the definition "COVID is a respiratory disease," we want to assist this classification.

We describe a general procedure for enhancing a classification model, such as natural language inference (NLI) or sentiment classification, to perform the same task on sequences that include poorly modeled words, using definitions of those words. From the training set T of the original model, we construct an augmented training set T′ for a model that may accept the same token sequence optionally concatenated with a word definition. In the case of NLI, where there are two token sequences, the definition is concatenated to the premise sequence. Because T′ has the same form as T, a model accepting the augmented information may be trained in the same way as the original model.

Because there are not enough truly untrained words like "COVID" in natural examples, we probe performance by scrambling real words so that their word embeddings become useless, and supplying definitions. Our method recovers most of the performance lost by scrambling. Moreover, the proposed technique avoids the biases of more ad hoc solutions, such as adding definitions to examples without special training.

2 Related Work

We focus on NLI because it depends more deeply on word meaning than sentiment or topic classification tasks. Chen et al. (2018) pioneered the addition of background information to an NLI model's classification on a per-example basis, augmenting a sequence of token embeddings with features encoding WordNet relations between pairs of words, to achieve a 0.6% improvement on the SNLI task (Bowman et al., 2015). Besides this explicit reasoning approach, implicit reasoning over background knowledge can be achieved by updating the base model itself with background information. Lauscher et al. (2020) follow this approach to add information from ConceptNet (Speer et al., 2018) and the Open Mind Common Sense corpus (Singh et al., 2002) through a fine-tuned adapter added to a pretrained language model, achieving better performance on subsets of NLI examples that are known to require world knowledge. Talmor et al. (2020) explore the interplay between explicitly added knowledge and implicitly stored knowledge on artificially constructed NLI problems that require counting or relations from a taxonomy.
In the above works, explicit background information comes from a taxonomy or knowledge base. Only a few studies have worked with definition text directly, and not in the context of NLI. Tissier et al. (2017) used definitions to create embeddings with better performance on word similarity tasks, compared to word2vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2017), while maintaining performance on text classification. Their work pushes together the embeddings of words that co-occur in each other's definitions. Recently, Kaneko and Bollegala (2021) used definitions to remove biases from pretrained word embeddings while maintaining coreference resolution accuracy. In contrast, our work reasons with natural language definitions without forming a new embedding, allowing attention between a definition and the rest of an example.

Alternatively, Schick and Schütze (2020) improved classification involving rare words by collecting and attending to all of the contexts in which they occur in BookCorpus (Zhu et al., 2015) combined with the Westbury Wikipedia Corpus (http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html). Like the methods above that use definitions, this method constructs a substitute or supplementary embedding for a rare word.

3 Methods

3.1 Critical words

The enhanced training set T′ will be built by providing definitions for words in existing examples, while obfuscating the existing embeddings of those words. If a random word of the original text is obfuscated, the classification still may be determined or strongly biased by the remaining words. To ensure the definitions matter, we select words carefully.

To explain which words of a text are important for classification, Kim et al. (2020) introduced the idea of input marginalization. Given a sequence of tokens x, let x_{-i} represent the sequence without the i-th token x_i. They marginalize the probability of predicting a class y_c over possible replacement words x̃_i in the vocabulary V as

    p(y_c \mid x_{-i}) = \sum_{\tilde{x}_i \in V} p(y_c \mid \tilde{x}_i, x_{-i}) \, p(\tilde{x}_i \mid x_{-i})    (1)

and then compare p(y_c | x_{-i}) to p(y_c | x) to quantify the importance of x_i. The probabilities p(x̃_i | x_{-i}) are computed by a language model.

We simplify by looking only at the classification and not the probability. Like Kim et al. (2020), we truncate the computation of p(y_c | x̃_i, x_{-i}) to words such that p(x̃_i | x_{-i}) exceeds a threshold, here 0.05. Ultimately, we mark a word x_i as a critical word if there exists a replacement x̃_i such that

    \arg\max_y p(y \mid \tilde{x}_i, x_{-i}) \neq \arg\max_y p(y \mid x)    (2)

and

    p(\tilde{x}_i \mid x_{-i}) > 0.05.    (3)

Additionally, we require that the word not appear more than once in the example, because the meaning of repeated words usually impacts the classification less than the fact that they all match. Table 1 shows an example.

    Premise     A young man sits, looking out of a *train* [side → Neutral, small → Neutral] window.
    Hypothesis  The man is in his room.
    Label       Contradiction

    Table 1: An SNLI example, with the critical word marked (italics in the original) and replacements shown in brackets.

A technicality remains because our classification models use subwords as tokens, whereas we consider replacements of whole words returned by pattern.en. We remove all subwords of x_i when forming x_{-i}, but we consider only replacements x̃_i that are a single subword long.
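As a concrete illustration, the critical-word search can be sketched as below. This is a rough sketch rather than the exact implementation: it assumes a HuggingFace fill-mask pipeline for the replacement probabilities p(x̃_i | x_{-i}), masks at the whitespace-word level rather than removing all subwords of x_i, and takes the fine-tuned NLI classifier as a caller-supplied nli_label function (a hypothetical name, not part of the paper's code).

```python
# Sketch of truncated input marginalization (Kim et al., 2020) used to find
# critical words, as described in Section 3.1.  Assumptions: a fill-mask
# pipeline supplies p(x̃_i | x_{-i}); `nli_label(premise, hypothesis)` is a
# user-supplied function returning the argmax label of the fine-tuned NLI model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def find_critical_replacements(premise, hypothesis, nli_label,
                               threshold=0.05, top_k=50):
    """Return (word, replacement, new_label) triples that flip the prediction."""
    original_label = nli_label(premise, hypothesis)
    words = premise.split()
    flips = []
    for i, word in enumerate(words):
        if words.count(word) > 1:          # skip words repeated in the example
            continue
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        for cand in fill_mask(masked, top_k=top_k):
            if cand["score"] < threshold:   # truncate the marginalization, Eq. (3)
                continue
            replacement = cand["token_str"].strip()
            if replacement.lower() == word.lower():
                continue
            new_premise = " ".join(words[:i] + [replacement] + words[i + 1:])
            new_label = nli_label(new_premise, hypothesis)
            if new_label != original_label:  # classification flips, Eq. (2)
                flips.append((word, replacement, new_label))
    return flips
```

A word for which `find_critical_replacements` returns at least one flip would be treated as a critical word; fill-mask candidates are single subwords by construction, matching the restriction above.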
3.2 Definitions

We use definitions from Simple English Wiktionary when available, or English Wiktionary otherwise (we use the 2018-02-01 dumps). Tissier et al. (2017) downloaded definitions from four commercial online dictionaries, but these are no longer freely available online as of January 2021.

To define a word, we first find its part of speech in the original context and lemmatize the word using the pattern.en library (Smedt and Daelemans, 2012). Then we look for a section labeled "English" in the retrieved Wiktionary article, and within it for a subsection for the part of speech we identified. We extract the first numbered definition in this subsection. In practice, we find that this method usually gives us short, simple definitions that match the usage in the original text.

When defining a word, we always write its definition as "word means: definition." This common format ensures that the definitions and the word being defined can be recognized easily by the classifier.
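The lookup step can be sketched roughly as follows, assuming the Wiktionary article's wikitext has already been retrieved from the dump. The heading patterns and markup stripping below are heavily simplified compared to real Wiktionary markup, and the function names are illustrative rather than the paper's actual code.

```python
# Rough sketch of extracting the first numbered English-section definition for
# a given part of speech, and writing it in the fixed "word means: ..." format.
# A production parser would need to handle nested templates and heading levels
# more carefully than these regular expressions do.
import re
from typing import Optional

def first_definition(wikitext: str, pos: str) -> Optional[str]:
    """Return the first numbered sense under the English section for `pos`."""
    english = re.search(r"==\s*English\s*==(.*?)(?:\n==[^=]|$)", wikitext, re.S)
    if not english:
        return None
    section = re.search(rf"===+\s*{pos}\s*===+(.*?)(?:\n===|$)", english.group(1), re.S)
    if not section:
        return None
    for line in section.group(1).splitlines():
        # numbered senses start with "#", while "##", "#:", "#*" are sub-items
        if line.startswith("#") and not line.startswith(("##", "#:", "#*")):
            text = line.lstrip("# ").strip()
            text = re.sub(r"\[\[([^|\]]*\|)?([^\]]*)\]\]", r"\2", text)  # [[a|b]] -> b
            return re.sub(r"\{\{[^}]*\}\}", "", text).strip()            # drop templates
    return None

def format_definition(word: str, definition: str) -> str:
    """Write the definition in the fixed 'word means: definition.' format."""
    return f"{word} means: {definition.rstrip('.')}."
```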
3.3 Enhancing a model

Consider an example (x, y_c) ∈ T. If the example has a critical word x_i ∈ x that appears only once in the example, and x̃_i is the most likely replacement word that changes the classification, we let x′ denote the sequence where x_i is replaced by x̃_i, and let y′_c = argmax_y p(y | x′). If definitions h_i and h′_i for x_i and x̃_i are found by the method described above, we add (x, h_i, y_c) and (x′, h′_i, y′_c) to the enhanced training set T′.

In some training protocols, we scramble x_i and x̃_i in the examples and definitions added to T′, replacing them with random strings of between four and twelve letters. This prevents the model from relying on the original word embeddings. Table 2 shows an NLI example and the corresponding examples generated for the enhanced training set.

    Original             A blond man is drinking from a public fountain. / The man is drinking water. / Entailment
    Scrambled word       a blond man is drinking from a public yfcqudqqg. yfcqudqqg means: a natural source of water; a spring. / the man is drinking water. / Entailment
    Scrambled alternate  a blond man is drinking from a public lxuehdeig. lxuehdeig means: lxuehdeig is a transparent solid and is usually clear. windows and eyeglasses are made from it, as well as drinking glasses. / the man is drinking water. / Neutral

    Table 2: Adding background information to examples from SNLI.
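A minimal sketch of how one augmented example might be assembled under these rules is given below: the word is scrambled into a random string of four to twelve lowercase letters everywhere it appears (including inside its definition), and the formatted definition is appended to the premise. The function and argument names are illustrative assumptions, not the paper's released code.

```python
# Sketch of building one entry of the enhanced training set T' (Section 3.3).
import random
import string

def scramble() -> str:
    """Random lowercase string of between four and twelve letters."""
    return "".join(random.choices(string.ascii_lowercase,
                                  k=random.randint(4, 12)))

def augment_example(premise, hypothesis, label, word, definition,
                    do_scramble=True):
    """Return one (premise + definition, hypothesis, label) training triple."""
    if do_scramble:
        token = scramble()
        # the critical word appears only once in the example by construction,
        # so a plain string replacement is enough for this sketch
        premise = premise.replace(word, token)
        definition = definition.replace(word, token)
        word = token
    premise_with_def = f"{premise} {word} means: {definition}"
    return premise_with_def, hypothesis, label
```

Calling `augment_example` once with the original critical word and once with its label-flipping replacement would yield the two rows shown in Table 2.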
4 Experiments

4.1 Setup

We consider the SNLI task (Bowman et al., 2015). We fine-tune an XLNet (base, cased) model (Yang et al., 2019), because it achieves near state-of-the-art performance on SNLI and outperforms RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019) on later rounds of adversarial annotation for ANLI (Nie et al., 2020). For the language model probabilities p(x̃_i | x_{-i}), pretrained BERT (base, uncased) is used rather than XLNet because the XLNet probabilities have been observed to be very noisy on short sequences (https://github.com/huggingface/transformers/issues/4343).

One test set, SNLI_crit^full, is constructed in the same way as the augmented training set, but our main test set, SNLI_crit^true, is additionally constrained to use only examples of the form (x, h_i, y_c) where y_c is the original label, because the labels of the examples (x′, h′_i, y′_c) might be incorrect. All of our derived datasets are available for download at https://figshare.com/s/edd5dc26b78817098b72.

In each experiment, training is run for three epochs distributed across 4 GPUs, with a batch size of 10 on each, a learning rate of 5 × 10^-5, 120 warmup steps, a single gradient accumulation step, and a maximum sequence length of 384.
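For readers who want to reproduce the setup, the stated hyperparameters could be expressed with HuggingFace transformers roughly as in the sketch below. The output path and dataset preparation are placeholders, and this is an assumed reconstruction rather than the authors' training script; only the numeric values mirror the text above.

```python
# Sketch of the fine-tuning configuration described in Section 4.1.
from transformers import (
    Trainer,
    TrainingArguments,
    XLNetForSequenceClassification,
    XLNetTokenizer,
)

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased",
                                                       num_labels=3)

args = TrainingArguments(
    output_dir="xlnet-snli-defs",       # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=10,     # 4 GPUs x 10 = effective batch of 40
    learning_rate=5e-5,
    warmup_steps=120,
    gradient_accumulation_steps=1,
)

# train_dataset: tokenized (premise [+ definition], hypothesis) pairs,
# truncated to a maximum sequence length of 384.
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```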
4.2 Results

Table 3 compares the accuracy of various training protocols.

    Protocol                                              SNLI_crit^true
    Original                                              85.1%
    No scrambling, no defs                                84.6%
    No scrambling, defs                                   85.2%
    Scrambling, no defs                                   36.9%
    Scrambling, defs                                      81.2%
    Scrambling, subs                                      84.7%
    Train on normal SNLI, test on scrambled, no defs      54.1%
    Train on normal SNLI, test on scrambled, defs         63.8%
    Train on unscrambled defs, test on scrambled defs     51.4%

    Table 3: Accuracy of enhancement protocols.

Our task cannot be solved well without reading definitions. When words are scrambled but no definitions are provided, an SNLI model without special training achieves 54.1% on SNLI_crit^true. If trained on T′ with scrambled words but no definitions, performance drops to 36.9%, reflecting that T′ is constructed to prevent a model from exploiting the contextual bias.

With definitions and scrambled words, performance is slightly below that of using the original words. Our method using definitions applied to the scrambled words yields 81.2%, compared to 84.6% if words are left unscrambled but no definitions are provided. Most of the accuracy lost by obfuscating the words is recovered, but evidently there is slightly more information accessible in the original word embeddings.

If alternatives to the critical words are not included, the classifier learns biases that do not depend on the definition. We explore restricting the training set to verified examples T′_true ⊂ T′ in the same way that SNLI_crit^true is restricted, still scrambling the critical or replaced words in the training and test sets. Using this subset, a model that is not given the definitions can be trained to achieve 69.9% on SNLI_crit^true, showing a heavy contextual bias. A model trained on this subset that uses the definitions achieves marginally higher performance (82.3%) than the one trained on all of T′. On the other hand, testing on SNLI_crit^full yields only 72.3% compared to 80.3% using the full T′, showing that this classifier is less sensitive to the definition.

Noisy labels from replacements do not hurt accuracy much. The only difference between the "original" training protocol and "no scrambling, no defs" is that the original trains on T and does not include examples with replaced words and unverified labels. Training including the replacements reduces accuracy by 0.5% on SNLI_crit^true, which includes only verified labels. For comparison, training and testing on all of SNLI with the original protocol achieves 90.4%, so a much larger effect on accuracy must be due to harder examples in SNLI_crit^true.

Definitions are not well utilized without special training. The original SNLI model, if provided definitions of scrambled words at test time as part of the premise, achieves only 63.8%, compared to 81.2% for our specially trained model.

If the defined words are not scrambled, the classifier uses the original embedding and ignores the definitions. Trained with definitions but no scrambling, the model achieves 85.2% accuracy, but it is unable to use the definitions when words are scrambled: it achieves 51.4%. We have not discovered a way to combine the benefit of the definitions with the knowledge in the original word embedding. To force the model to use both techniques, we prepare a version of the training set which is half scrambled and half unscrambled. This model achieves 83.5% on the unscrambled test set, worse than with no definitions.

Definitions are not simply being memorized. We selected the subset SNLI_crit^new of SNLI_crit^true, consisting of the 44 examples in which the defined word was not defined in any training example. The model trained with scrambling and definitions achieves 68.2% on this set, well above the 45.5% of the original SNLI model reading the scrambled words and definitions without special training. Remembering a definition from training is thus an advantage (SNLI_crit^true accuracy was 81.2%), but not the whole capability.

Definition reasoning is harder than simple substitutions. When definitions are given as one-word substitutions, in the form "scrambled means: original" instead of "scrambled means: definition", the model achieves 84.7% on SNLI_crit^true, compared to 81.2% using the definition text. Of course this is not a possibility for rare words that are not synonyms of a well-trained word, but it suggests that the kind of multi-hop reasoning in which words just have to be matched in sequence is easier than understanding a text definition.

4.3 A hard subset of SNLI

By construction of the SentencePiece dictionary (Kudo and Richardson, 2018), only the most frequent words in the training data of the XLNet language model are represented as single tokens. Other words are tokenized into multiple subwords. Sometimes the subwords reflect a morphological change to a well-modeled word, such as a change in tense or plurality. The language model probably understands these changes well, and the subwords give important hints. The lemma form of a word strips many morphological features, so when the lemma form of a word has multiple subwords, the basic concept may be less frequently encountered in training. We hypothesize that such words are less well understood by the language model.
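The selection criterion behind this subset can be checked directly with the XLNet tokenizer, as in the following sketch; the lemmatization step is left abstract here (the paper uses pattern.en), and the example word in the comment is illustrative.

```python
# Sketch of the multi-subword test used to build SNLI_multi^true (Section 4.3):
# a word qualifies when the XLNet SentencePiece tokenizer splits its lemma
# into more than one piece.
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

def lemma_spans_multiple_subwords(lemma: str) -> bool:
    """True if the lemma is not represented as a single SentencePiece token."""
    pieces = tokenizer.tokenize(lemma)
    return len(pieces) > 1

# e.g. a frequent lemma such as "fountain" is typically one piece, while a
# rarer lemma splits into several and would place its example in the subset.
```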
To test this hypothesis, we construct a subset SNLI_multi^true of the test set, consisting of examples where a critical word exists whose lemma form spans multiple subwords. This set consists of 332 test examples. The critical word used may be different from the one chosen for SNLI_crit^true. This subset is indeed harder: the XLNet model trained on all of SNLI attains only 77.7% on this subset using no definitions, compared to 90.4% on the original test set.

In Table 4 we apply various models constructed in the previous subsection to this hard test set. Ideally, a model leveraging definitions could compensate for these weaker word embeddings, but the method here does not do so.

    Protocol                                        SNLI_multi^true
    Normal SNLI on unscrambled                      77.7%
    Defs & unscrambled on defs & unscrambled        77.1%
    Defs & some scrambling on defs & unscrambled    73.8%
    Defs & scrambled on defs & scrambled            69.9%
    Defs & scrambled on defs & unscrambled          62.7%

    Table 4: Accuracy on the hard SNLI subset.
5 Conclusion

This work shows how a model's training may be enhanced to support reasoning with definitions given in natural text, to handle cases where word embeddings are not useful. Our method forces the definitions to be considered and avoids the application of biases independent of the definition. Using the approach, entailment examples like "Jack has COVID / Jack is sick," which are misclassified by an XLNet trained on normal SNLI, are correctly recognized as entailment when the definition "COVID is a respiratory disease" is added. Methods that can leverage definitions without losing the advantage of partially useful word embeddings are still needed. In an application, it will also be necessary to select the words that would benefit from definitions, and to make a model that can accept multiple definitions.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. arXiv preprint, 1607.04606.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2018. Neural natural language inference models enhanced with external knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2406–2417, Melbourne, Australia. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint, 1810.04805.

Masahiro Kaneko and Danushka Bollegala. 2021. Dictionary-based debiasing of pre-trained word embeddings. arXiv preprint, 2101.09525.

Siwon Kim, Jihun Yi, Eunji Kim, and Sungroh Yoon. 2020. Interpretation of NLP models through input marginalization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3154–3167, Online. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Anne Lauscher, Olga Majewska, Leonardo F. R. Ribeiro, Iryna Gurevych, Nikolai Rozanov, and Goran Glavaš. 2020. Common sense or world knowledge? Investigating adapter-based knowledge injection into pretrained transformers. In Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 43–49, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint, 1907.11692.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint, 1301.3781.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.
Timo Schick and Hinrich Schütze. 2020. BERTRAM: Improved word embeddings have big impact on contextualized model performance. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3996–4007, Online. Association for Computational Linguistics.

Push Singh, Thomas Lin, Erik T. Mueller, Grace Lim, Travell Perkins, and Wan Li Zhu. 2002. Open Mind Common Sense: Knowledge acquisition from the general public. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 1223–1237, Berlin, Heidelberg. Springer Berlin Heidelberg.
Tom De Smedt and Walter Daelemans. 2012. Pattern for Python. Journal of Machine Learning Research, 13(66):2063–2067.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2018. ConceptNet 5.5: An open multilingual graph of general knowledge. arXiv preprint, 1612.03975.

Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. 2020. Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), December 6–12, 2020, virtual.

Julien Tissier, Christophe Gravier, and Amaury Habrard. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 254–263, Copenhagen, Denmark. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32, pages 5753–5763. Curran Associates, Inc.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 19–27.