The Role of Syntax in Vector Space Models of Compositional Semantics
Karl Moritz Hermann and Phil Blunsom
Department of Computer Science, University of Oxford
Oxford, OX1 3QD, UK
{karl.moritz.hermann,phil.blunsom}@cs.ox.ac.uk

Abstract

Modelling the compositional process by which the meaning of an utterance arises from the meaning of its parts is a fundamental task of Natural Language Processing. In this paper we draw upon recent advances in the learning of vector space representations of sentential semantics and the transparent interface between syntax and semantics provided by Combinatory Categorial Grammar to introduce Combinatory Categorial Autoencoders. This model leverages the CCG combinatory operators to guide a non-linear transformation of meaning within a sentence. We use this model to learn high dimensional embeddings for sentences and evaluate them in a range of tasks, demonstrating that the incorporation of syntax allows a concise model to learn representations that are both effective and general.

1 Introduction

Since Frege stated his 'Principle of Semantic Compositionality' in 1892, researchers have pondered both how the meaning of a complex expression is determined by the meanings of its parts, and how those parts are combined (Frege, 1892; Pelletier, 1994). Over a hundred years on, the choice of representational unit for this process of compositional semantics, and how these units combine, remain open questions.

Frege's principle may be debatable from a linguistic and philosophical standpoint, but it has provided a basis for a range of formal approaches to semantics which attempt to capture meaning in logical models. The Montague grammar (Montague, 1970) is a prime example of this, building a model of composition based on lambda-calculus and formal logic. More recent work in this field includes the Combinatory Categorial Grammar (CCG), which also places increased emphasis on syntactic coverage (Szabolcsi, 1989).

Recently those searching for the right representation for compositional semantics have drawn inspiration from the success of distributional models of lexical semantics. This approach represents single words as distributional vectors, implying that a word's meaning is a function of the environment it appears in, be that its syntactic role or co-occurrences with other words (Pereira et al., 1993; Schütze, 1998). While distributional semantics is easily applied to single words, sparsity implies that attempts to directly extract distributional representations for larger expressions are doomed to fail. Only in the past few years has it been attempted to extend these representations to semantic composition. Most approaches here use the idea of vector-matrix composition to learn larger representations from single-word encodings (Baroni and Zamparelli, 2010; Grefenstette and Sadrzadeh, 2011; Socher et al., 2012b). While these models have proved very promising for compositional semantics, they make minimal use of linguistic information beyond the word level.

In this paper we bridge the gap between recent advances in machine learning and more traditional approaches within computational linguistics. We achieve this goal by employing the CCG formalism to consider compositional structures at any point in a parse tree. CCG is attractive both for its transparent interface between syntax and semantics, and a small but powerful set of combinatory operators with which we can parametrise our non-linear transformations of compositional meaning.

We present a novel class of recursive models, the Combinatory Categorial Autoencoders (CCAE), which marry a semantic process provided by a recursive autoencoder with the syntactic representations of the CCG formalism. Through this model we seek to answer two questions: Can recursive vector space models be reconciled with a more formal notion of compositionality; and is there a role for syntax in guiding semantics in these types of models? CCAEs make use of CCG combinators and types by conditioning each composition function on its equivalent step in a CCG proof. In terms of learning complexity and […]

[Figure 1: CCG derivation for Tina likes tigers with forward (>) and backward (<) application. Tina and tigers carry category N (type-changed to NP), likes carries (S[dcl]\NP)/NP; forward application combines likes and tigers into S[dcl]\NP, and backward application combines this with Tina into S[dcl].]
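To make the structures that drive the models concrete, the sketch below (our own illustration, not code from the paper; all names are hypothetical) represents the derivation in Figure 1 as a nested tree whose internal nodes record the combinator and resulting category. Later sections condition the composition functions on exactly this information.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        """A node in a CCG derivation: a leaf carries a word, an internal node a combinator."""
        category: str                      # CCG category, e.g. "(S[dcl]\\NP)/NP"
        word: Optional[str] = None         # set for leaves only
        combinator: Optional[str] = None   # ">" forward or "<" backward application
        left: Optional["Node"] = None
        right: Optional["Node"] = None

    # Figure 1, "Tina likes tigers" (the unary lex step N -> NP is folded into the leaves here):
    tina   = Node("NP", word="Tina")
    likes  = Node("(S[dcl]\\NP)/NP", word="likes")
    tigers = Node("NP", word="tigers")

    vp = Node("S[dcl]\\NP", combinator=">", left=likes, right=tigers)  # forward application
    s  = Node("S[dcl]",     combinator="<", left=tina,  right=vp)      # backward application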
[…] representations have then successfully been applied to a number of tasks including word sense disambiguation (Schütze, 1998) and selectional preference (Pereira et al., 1993; Lin, 1999).

While it is theoretically possible to apply the same mechanism to larger expressions, sparsity prevents learning meaningful distributional representations for expressions much larger than single words.[2]

[2] The experimental setup in (Baroni and Zamparelli, 2010) is one of the few examples where distributional representations are used for word pairs.

Vector space models of compositional semantics aim to fill this gap by providing a methodology for deriving the representation of an expression from those of its parts. While distributional representations frequently serve to encode single words in such approaches, this is not a strict requirement.

There are a number of ideas on how to define composition in such vector spaces. A general framework for semantic vector composition was proposed in Mitchell and Lapata (2008), with Mitchell and Lapata (2010) and more recently Blacoe and Lapata (2012) providing good overviews of this topic. Notable approaches to this issue include Baroni and Zamparelli (2010), who compose nouns and adjectives by representing them as vectors and matrices, respectively, with the compositional representation achieved by multiplication. Grefenstette and Sadrzadeh (2011) use a similar approach with matrices for relational words and vectors for arguments. These two approaches are combined in Grefenstette et al. (2013), producing a tensor-based semantic framework with tensor contraction as composition operation.

Another set of models that have very successfully been applied in this area are recursive autoencoders (Socher et al., 2011a; Socher et al., 2011b), which are discussed in the next section.

2.3 Recursive Autoencoders

Autoencoders are a useful tool to compress information. One can think of an autoencoder as a funnel through which information has to pass (see Figure 2). By forcing the autoencoder to reconstruct an input given only the reduced amount of information available inside the funnel, it serves as a compression tool, representing high-dimensional objects in a lower-dimensional space.

[Figure 2: A simple three-layer autoencoder. The input represented by the vector at the bottom is being encoded in a smaller vector (middle), from which it is then reconstructed (top) into the same dimensionality as the original input vector.]

Typically a given autoencoder, that is the pair of functions for encoding and reconstructing data, is used on multiple inputs. By optimizing the two functions to minimize the difference between all inputs and their respective reconstructions, this autoencoder will effectively discover some hidden structure within the data that can be exploited to represent it more efficiently.

As a simple example, assume input vectors x_i ∈ R^n, i ∈ (0..N), weight matrices W^enc ∈ R^(m×n), W^rec ∈ R^(n×m) and biases b^enc ∈ R^m, b^rec ∈ R^n. The encoding matrix and bias are used to create an encoding e_i from x_i:

    e_i = f^enc(x_i) = W^enc x_i + b^enc    (1)

Subsequently e_i ∈ R^m is used to reconstruct x_i as x'_i using the reconstruction matrix and bias:

    x'_i = f^rec(e_i) = W^rec e_i + b^rec    (2)

θ = (W^enc, W^rec, b^enc, b^rec) can then be learned by minimizing the error function describing the difference between x' and x:

    E = (1/2) Σ_i^N ||x'_i − x_i||^2    (3)

Now, if m < n, this will intuitively lead to e_i encoding a latent structure contained in x_i and shared across all x_j, j ∈ (0..N), with θ encoding and decoding to and from that hidden structure.

It is possible to apply multiple autoencoders on top of each other, creating a deep autoencoder (Bengio et al., 2007; Hinton and Salakhutdinov, 2006). For such a multi-layered model to learn anything beyond what a single layer could learn, a non-linear transformation g needs to be applied at each layer. Usually, a variant of the sigmoid (σ) or hyperbolic tangent (tanh) function is used for g (LeCun et al., 1998):

    f^enc(x_i) = g(W^enc x_i + b^enc)    (4)
    f^rec(e_i) = g(W^rec e_i + b^rec)
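To ground equations (1)–(4), here is a minimal NumPy sketch (our own, not the authors' implementation; all names are hypothetical) of a single non-linear autoencoder layer and its reconstruction error.

    import numpy as np

    def init_autoencoder(n, m, rng):
        """Randomly initialise theta = (W_enc, W_rec, b_enc, b_rec) for inputs of size n, codes of size m."""
        return {
            "W_enc": rng.normal(scale=0.01, size=(m, n)),
            "W_rec": rng.normal(scale=0.01, size=(n, m)),
            "b_enc": np.zeros(m),
            "b_rec": np.zeros(n),
        }

    def encode(theta, x):
        # Equation (4): e = g(W_enc x + b_enc), with g = tanh
        return np.tanh(theta["W_enc"] @ x + theta["b_enc"])

    def reconstruct(theta, e):
        # Equation (4): x' = g(W_rec e + b_rec)
        return np.tanh(theta["W_rec"] @ e + theta["b_rec"])

    def reconstruction_error(theta, xs):
        # Equation (3): E = 1/2 * sum_i ||x'_i - x_i||^2
        return 0.5 * sum(np.sum((reconstruct(theta, encode(theta, x)) - x) ** 2) for x in xs)

    rng = np.random.default_rng(0)
    theta = init_autoencoder(n=8, m=4, rng=rng)   # m < n forces compression
    xs = [rng.normal(size=8) for _ in range(10)]
    print(reconstruction_error(theta, xs))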
Furthermore, autoencoders can easily be used as a composition function by concatenating two input vectors, such that:

    e = f(x_1, x_2) = g(W (x_1‖x_2) + b)    (5)
    (x'_1‖x'_2) = g(W' e + b')

Extending this idea, recursive autoencoders (RAE) allow the modelling of data of variable size. By setting n = 2m, it is possible to recursively combine a structure into an autoencoder tree. See Figure 3 for an example, where x_1, x_2, x_3 are recursively encoded into y_2.

[Figure 3: RAE with three inputs. Vectors with filled (blue) circles represent input and hidden units; blanks (white) denote reconstruction layers.]

The recursive application of autoencoders was first introduced in Pollack (1990), whose recursive auto-associative memories learn vector representations over pre-specified recursive data structures. More recently this idea was extended and applied to dynamic structures (Socher et al., 2011b). These types of models have become increasingly prominent since developments within the field of Deep Learning have made the training of such hierarchical structures more effective and tractable (LeCun et al., 1998; Hinton et al., 2006).

Intuitively the top layer of an RAE will encode aspects of the information stored in all of the input vectors. Previously, RAE have successfully been applied to a number of tasks including sentiment analysis, paraphrase detection, relation extraction and 3D object identification (Blacoe and Lapata, 2012; Socher et al., 2011b; Socher et al., 2012a).

3 Model

The models in this paper combine the power of recursive, vector-based models with the linguistic intuition of the CCG formalism. Their purpose is to learn semantically meaningful vector representations for sentences and phrases of variable size, while the purpose of this paper is to investigate the use of syntax and linguistic formalisms in such vector-based compositional models.

We assume a CCG parse to be given. Let C denote the set of combinatory rules, and T the set of categories used, respectively. We use the parse tree to structure an RAE, so that each combinatory step is represented by an autoencoder function. We refer to these models as Combinatory Categorial Autoencoders (CCAE). In total this paper describes four models making increasing use of the CCG formalism (see Table 1).

    Model     CCG Elements
    CCAE-A    parse
    CCAE-B    parse + rules
    CCAE-C    parse + rules + types
    CCAE-D    parse + rules + child types

Table 1: Aspects of the CCG formalism used by the different models explored in this paper.

As an internal baseline we use model CCAE-A, which is an RAE structured along a CCG parse tree. CCAE-A uses a single weight matrix each for the encoding and reconstruction step (see Table 2). This model is similar to Socher et al. (2011b), except that we use a fixed structure in place of the greedy tree building approach. As CCAE-A uses only minimal syntactic guidance, this should allow us to better ascertain to what degree the use of syntax helps our semantic models.

Our second model (CCAE-B) uses the composition function in equation (6), with c ∈ C:

    f^enc(x, y, c) = g(W^c_enc (x‖y) + b^c_enc)    (6)
    f^rec(e, c) = g(W^c_rec e + b^c_rec)

This means that for every combinatory rule we define an equivalent autoencoder composition function by parametrizing both the weight matrix and bias on the combinatory rule (e.g. Figure 4).

[Figure 4: Forward application as CCG combinator and autoencoder rule respectively: α : X/Y and β : Y combine to αβ : X by >, encoded as g(W^>_enc (α‖β) + b^>_enc).]
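A minimal sketch of the parametrisation in equation (6), reusing the toy derivation s from the earlier sketch: each combinatory rule owns its own encoding matrix and bias, and the fixed CCG parse (rather than a greedily built tree) determines the order of composition. This is our own illustration, not the released implementation, and all names are hypothetical.

    import numpy as np

    EMB = 4  # embedding width; toy value

    rng = np.random.default_rng(1)
    word_vec = {w: rng.normal(size=EMB) for w in ["Tina", "likes", "tigers"]}

    # One weight matrix and bias per combinatory rule (forward/backward application here).
    W = {c: rng.normal(scale=0.1, size=(EMB, 2 * EMB)) for c in [">", "<"]}
    b = {c: np.zeros(EMB) for c in [">", "<"]}

    def compose(node):
        """Recursively encode a derivation node; equation (6): e = g(W^c (x‖y) + b^c)."""
        if node.word is not None:                       # leaf: look up the word vector
            return word_vec[node.word]
        x, y = compose(node.left), compose(node.right)  # encode children first
        c = node.combinator                             # parameters chosen by the CCG rule
        return np.tanh(W[c] @ np.concatenate([x, y]) + b[c])

    sentence_vec = compose(s)   # s is the derivation for "Tina likes tigers" built above
    print(sentence_vec)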
In this model, as in the following ones, we assume a reconstruction step symmetric with the composition step. For the remainder of this paper we will focus on the composition step and drop the use of enc and rec in variable names where it isn't explicitly required. Figure 5 shows model CCAE-B applied to our previous example sentence.

[Figure 5: CCAE-B applied to Tina likes tigers. Next to each vector are the CCG category (top) and the word or function representing it (bottom). lex describes the unary type-changing operation. > and < are forward and backward application.]

While CCAE-B uses only the combinatory rules, we want to make fuller use of the linguistic information available in CCG. For this purpose, we build another model CCAE-C, which parametrizes on both the combinatory rule c ∈ C and the CCG category t ∈ T at every step (see Table 2). This model provides an additional degree of insight, as the categories T are semantically and syntactically more expressive than the CCG combinatory rules by themselves. Summing over weights parametrised on c and t respectively adds an additional degree of freedom and also allows for some model smoothing.

An alternative approach is encoded in model CCAE-D. Here we consider the categories not of the element represented, but of the elements it is generated from, together with the combinatory rule applied to them. The intuition is that in the first step we transform two expressions based on their syntax. Subsequently we combine these two conditioned on their joint combinatory rule.

    Model     Encoding Function
    CCAE-A    f(x, y)              = g(W (x‖y) + b)
    CCAE-B    f(x, y, c)           = g(W^c (x‖y) + b^c)
    CCAE-C    f(x, y, c, t)        = g(Σ_{p∈{c,t}} (W^p (x‖y) + b^p))
    CCAE-D    f(x, y, c, t_x, t_y) = g(W^c (W^{t_x} x + W^{t_y} y) + b^c)

Table 2: Encoding functions of the four CCAE models discussed in this paper.

4 Learning

In this section we briefly discuss unsupervised learning for our models. Subsequently we describe how these models can be extended to allow for semi-supervised training and evaluation.

Let θ = (W, B, L) be our model parameters and λ a vector with regularization parameters for all model parameters. W represents the set of all weight matrices, B the set of all biases and L the set of all word vectors. Let N be the set of training data consisting of tree-nodes n with inputs x_n, y_n and reconstruction r_n. The error given n is:

    E(n|θ) = (1/2) ||r_n − (x_n‖y_n)||^2    (7)

The gradient of the regularised objective function then becomes:

    ∂J/∂θ = (1/N) Σ_n^N ∂E(n|θ)/∂θ + λθ    (8)

We learn the gradient using backpropagation through structure (Goller and Küchler, 1996), and minimize the objective function using L-BFGS. For more details about the partial derivatives used for backpropagation, see the documentation accompanying our model implementation.[3]

[3] http://www.karlmoritz.com/
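The training signal of equations (7) and (8) can be written down compactly; the sketch below (ours, with hypothetical names) computes the per-node reconstruction error and the regularised objective whose gradient equation (8) describes. In a full implementation the partial derivatives would come from backpropagation through structure (or automatic differentiation) and be passed, together with the objective, to an L-BFGS routine such as scipy.optimize.minimize with method="L-BFGS-B".

    import numpy as np

    def node_error(r, x, y):
        # Equation (7): E(n | theta) = 1/2 * || r_n - (x_n ‖ y_n) ||^2
        return 0.5 * np.sum((r - np.concatenate([x, y])) ** 2)

    def objective(node_errors, theta_flat, lam):
        # J = (1/N) * sum_n E(n | theta) + (lam/2) * ||theta||^2;
        # its gradient is (1/N) * sum_n dE/dtheta + lam * theta, i.e. equation (8).
        return np.mean(node_errors) + 0.5 * lam * np.sum(theta_flat ** 2)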
4.1 Supervised Learning

The unsupervised method described so far learns a vector representation for each sentence. Such a representation can be useful for some tasks such as paraphrase detection, but is not sufficient for other tasks such as sentiment classification, which we are considering in this paper.

In order to extract sentiment from our models, we extend them by adding a supervised classifier on top, using the learned representations v as input for a binary classification model:

    pred(l=1 | v, θ) = sigmoid(W^label v + b^label)    (9)

Given our corpus of CCG parses with label pairs (N, l), the new objective function becomes:

    J = (1/N) Σ_{(N,l)} E(N, l, θ) + (λ/2) ||θ||^2    (10)

Assuming each node n ∈ N contains children x_n, y_n, encoding e_n and reconstruction r_n, so that n = {x, y, e, r}, this breaks down into:

    E(N, l, θ) = Σ_{n∈N} α E_rec(n, θ) + (1−α) E_lbl(e_n, l, θ)    (11)

    E_rec(n, θ) = (1/2) ||[x_n‖y_n] − r_n||^2    (12)
    E_lbl(e, l, θ) = (1/2) ||l − e||^2    (13)

This method of introducing a supervised aspect to the autoencoder largely follows the model described in Socher et al. (2011b).
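A sketch of the semi-supervised extension in equations (9)–(13) follows (our own reading, with hypothetical names): the node encoding is passed through the sigmoid classifier of equation (9), and the reconstruction and label errors are mixed with the weight α; equation (13) writes the label error compactly as ||l − e||², which we read here as the distance between the target and the prediction made from the encoding.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_label(v, W_label, b_label):
        # Equation (9): pred(l = 1 | v, theta) = sigmoid(W_label v + b_label)
        return sigmoid(W_label @ v + b_label)

    def node_cost(x, y, r, e, label, W_label, b_label, alpha):
        # Equations (11)-(13): mix reconstruction and label error for a single tree node.
        e_rec = 0.5 * np.sum((np.concatenate([x, y]) - r) ** 2)                   # equation (12)
        e_lbl = 0.5 * np.sum((label - predict_label(e, W_label, b_label)) ** 2)   # equation (13)
        return alpha * e_rec + (1.0 - alpha) * e_lbl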
5 Experiments

We describe a number of standard evaluations to determine the comparative performance of our model. The first task of sentiment analysis allows us to compare our CCG-conditioned RAE with similar, existing models. In a second experiment, we apply our model to a compound similarity evaluation, which allows us to evaluate our models against a larger class of vector-based models (Blacoe and Lapata, 2012). We conclude with some qualitative analysis to get a better idea of whether the combination of CCG and RAE can learn semantically expressive embeddings.

In our experiments we use the hyperbolic tangent as nonlinearity g. Unless stated otherwise we use word-vectors of size 50, initialized using the embeddings provided by Turian et al. (2010) based on the model of Collobert and Weston (2008).[4] We use the C&C parser (Clark and Curran, 2007) to generate CCG parse trees for the data used in our experiments. For models CCAE-C and CCAE-D we use the 25 most frequent CCG categories (as extracted from the British National Corpus) with an additional general weight matrix in order to catch all remaining types.

[4] http://www.metaoptimize.com/projects/wordreprs/
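The catch-all weight matrix mentioned above amounts to a lookup with a default; a minimal sketch under our own (hypothetical) names:

    def category_weights(W_by_category, W_general, category):
        """CCAE-C/D style lookup: the 25 most frequent CCG categories own a matrix;
        every remaining category falls back to the shared general matrix."""
        return W_by_category.get(category, W_general)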
5.1 Sentiment Analysis

We evaluate our model on the MPQA opinion corpus (Wiebe et al., 2005), which annotates expressions for sentiment.[5] The corpus consists of 10,624 instances with approximately 70 percent describing a negative sentiment. We apply the same pre-processing as (Nakagawa et al., 2010) and (Socher et al., 2011b) by using an additional sentiment lexicon (Wilson et al., 2005) during the model training for this experiment.

As a second corpus we make use of the sentence polarity (SP) dataset v1.0 (Pang and Lee, 2005).[6] This dataset consists of 10,662 sentences extracted from movie reviews which are manually labelled with positive or negative sentiment and equally distributed across sentiment.

[5] http://mpqa.cs.pitt.edu/
[6] http://www.cs.cornell.edu/people/pabo/movie-review-data/

Experiment 1: Semi-Supervised Training. In the first experiment, we use the semi-supervised training strategy described previously and initialize our models with the embeddings provided by Turian et al. (2010). The results of this evaluation are in Table 3. While we achieve the best performance on the MPQA corpus, the results on the SP corpus are less convincing. Perhaps surprisingly, the simplest model CCAE-A outperforms the other models on this dataset.

    Method                              MPQA   SP
    Voting with two lexica              81.7   63.1
    MV-RNN (Socher et al., 2012b)       -      79.0
    RAE (rand) (Socher et al., 2011b)   85.7   76.8
    TCRF (Nakagawa et al., 2010)        86.1   77.3
    RAE (init) (Socher et al., 2011b)   86.4   77.7
    NB (Wang and Manning, 2012)         86.7   79.4
    CCAE-A                              86.3   77.8
    CCAE-B                              87.1   77.1
    CCAE-C                              87.1   77.3
    CCAE-D                              87.2   76.7

Table 3: Accuracy of sentiment classification on the sentiment polarity (SP) and MPQA datasets. For NB we only display the best result among a larger group of models analysed in that paper.

When considering the two datasets, sparsity seems a likely explanation for this difference in results: in the MPQA experiment most instances are very short with an average length of 3 words, while the average sentence length in the SP corpus is 21 words. The MPQA task is further simplified through the use of an additional sentiment lexicon. Considering dictionary size, the SP corpus has a dictionary of 22k words, more than three times the size of the MPQA dictionary.

This issue of sparsity is exacerbated in the more complex CCAE models, where the training points are spread across different CCG types and rules. While the initialization of the word vectors with previously learned embeddings (as was previously shown by Socher et al. (2011b)) helps the models, all other model variables such as composition weights and biases are still initialised randomly and thus highly dependent on the amount of training data available.

Experiment 2: Pretraining. Due to our analysis of the results of the initial experiment, we ran a second series of experiments on the SP corpus. We follow (Scheible and Schütze, 2013) for this second series of experiments, which are carried out on a random 90/10 training-testing split, with some data reserved for development.

Instead of initialising the model with external word embeddings, we first train it on a large amount of data with the aim of overcoming the sparsity issues encountered in the previous experiment. Learning is thus divided into two steps: the first, unsupervised training phase uses the British National Corpus together with the SP corpus. In this phase only the reconstruction signal is used to learn word embeddings and transformation matrices. Subsequently, in the second phase, only the SP corpus is used, this time with both the reconstruction and the label error.

By learning word embeddings and composition matrices on more data, the model is likely to generalise better. Particularly for the more complex models, where the composition functions are conditioned on various CCG parameters, this should help to overcome issues of sparsity.

If we consider the results of the pre-trained experiments in Table 4, this seems to be the case. In fact, the trend of the previous results has been reversed, with the more complex models now performing best, whereas in the previous experiments the simpler models performed better. Using the Turian embeddings instead of random initialisation did not improve results in this setup.

    Model     Regular   Pretraining
    CCAE-A    77.8      79.5
    CCAE-B    76.9      79.8
    CCAE-C    77.1      81.0
    CCAE-D    76.9      79.7

Table 4: Effect of pretraining on model performance on the SP dataset. Results are reported on a random subsection of the SP corpus; thus numbers for the regular training method differ slightly from those in Table 3.
5.2 Compound Similarity

In a second experiment we use the dataset from Mitchell and Lapata (2010), which contains similarity judgements for adjective-noun, noun-noun and verb-object pairs.[7] All compound pairs have been ranked for semantic similarity by a number of human annotators. The task is thus to rank these pairs of word pairs by their semantic similarity. For instance, the two compounds vast amount and large quantity are given a high similarity score by the human judges, while northern region and early age are assigned no similarity at all.

We train our models as fully unsupervised autoencoders on the British National Corpus for this task. We assume fixed parse trees for all of the compounds (Figure 6), and use these to compute compound level vectors for all word pairs. We subsequently use the cosine distance between each compound pair as our similarity measure. We use Spearman's rank correlation coefficient (ρ) for evaluation; hence there is no need to rescale our scores (-1.0 – 1.0) to the original scale (1.0 – 7.0).

[7] http://homepages.inf.ed.ac.uk/mlap/resources/index.html
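This evaluation reduces to composing each two-word compound, scoring compound pairs by cosine similarity, and correlating those scores with the human rankings. A sketch (ours, with hypothetical names and toy data) using scipy.stats.spearmanr:

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def compound_similarity_rho(compound_vec, pairs, human_scores):
        """compound_vec(c) returns the composed vector for a two-word compound c;
        pairs is a list of (compound, compound); human_scores are the gold ratings."""
        model_scores = [cosine(compound_vec(a), compound_vec(b)) for a, b in pairs]
        rho, _ = spearmanr(model_scores, human_scores)  # rank correlation, so no rescaling is needed
        return rho

    # toy usage with random stand-in vectors and illustrative ratings
    rng = np.random.default_rng(2)
    vecs = {c: rng.normal(size=4) for c in ["vast amount", "large quantity", "northern region", "early age"]}
    pairs = [("vast amount", "large quantity"), ("northern region", "early age")]
    print(compound_similarity_rho(lambda c: vecs[c], pairs, [6.3, 1.2]))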
Blacoe and Lapata (2012) have an extensive comparison of the performance of various vector-based models on this data set, to which we compare our model in Table 5. The CCAE models outperform the RAE models provided by Blacoe and Lapata (2012), and score towards the upper end of the range of other models considered in that paper.

[Figure 6: Assumed CCG parse structure for the compound similarity evaluation. Verb-object pairs (VB NN): (S\NP)/NP applied to N (raised to NP) gives S\NP; noun-noun (NN NN) and adjective-noun (JJ NN) pairs: N/N applied to N gives N; all by forward application (>).]

    Method                       Adj-N         N-N           V-Obj
    Human                        0.52          0.49          0.55
    (Blacoe and Lapata, 2012)
      /+                         0.21 - 0.48   0.22 - 0.50   0.18 - 0.35
      RAE                        0.19 - 0.31   0.24 - 0.30   0.09 - 0.28
    CCAE-B                       0.38          0.44          0.34
    CCAE-C                       0.38          0.41          0.23
    CCAE-D                       0.41          0.44          0.29

Table 5: Correlation coefficients of model predictions for the compound similarity task. Numbers show Spearman's rank correlation coefficient (ρ). Higher numbers indicate better correlation.

5.3 Qualitative Analysis

To get better insight into our models we also perform a small qualitative analysis. Using one of the models trained on the MPQA corpus, we generate word-level representations of all phrases in this corpus and subsequently identify the most related expressions by using the cosine distance measure. We perform this experiment on all expressions of length 5, considering all expressions with a word length between 3 and 7 as potential matches.

As can be seen in Table 6, this works with varying success. Linking expressions such as conveying the message of peace and safeguard(ing) peace and security suggests that the model does learn some form of semantics.

On the other hand, the connection between expressed their satisfaction and support and expressed their admiration and surprise suggests that the pure word level content still has an impact on the model analysis. Likewise, the expressions is a story of success and is a staunch supporter have some lexical but little semantic overlap. Further reducing this link between the lexical and the semantic representation is an issue that should be addressed in future work in this area.

    Expression                                  Most Similar
    convey the message of peace                 safeguard peace and security
    keep alight the flame of                    keep up the hope
    has a reason to repent                      has no right
    a significant and successful strike         a much better position
    it is reassuring to believe                 it is a positive development
    expressed their satisfaction and support    expressed their admiration and surprise
    is a story of success                       is a staunch supporter
    are lining up to condemn                    are going to voice their concerns
    more sanctions should be imposed            charges being leveled
    could fray the bilateral goodwill           could cause serious damage

Table 6: Phrases from the MPQA corpus and their semantically closest match according to CCAE-D.
6 Discussion

Overall, our models compare favourably with the state of the art. On the MPQA corpus model CCAE-D achieves the best published results we are aware of, whereas on the SP corpus we achieve competitive results. With an additional, unsupervised training step we achieved results beyond the current state of the art on this task, too.

Semantics. The qualitative analysis and the experiment on compounds demonstrate that the CCAE models are capable of learning semantics. An advantage of our approach (and of autoencoders generally) is their ability to learn in an unsupervised setting. The pre-training step for the sentiment task was essentially the same training step as used in the compound similarity task. While other models such as the MV-RNN (Socher et al., 2012b) achieve good results on a particular task, they do not allow unsupervised training. This prevents the possibility of pretraining, which we showed to have a big impact on results, and further prevents the training of general models: the CCAE models can be used for multiple tasks without the need to re-train the main model.

Complexity. Previously in this paper we argued that our models combined the strengths of other approaches. By using a grammar formalism we increase the expressive power of the model while the complexity remains low. For the complexity analysis see Table 7. We strike a balance between the greedy approaches (e.g. Socher et al. (2011b)), where learning is quadratic in the length of each sentence, and existing syntax-driven approaches such as the MV-RNN of Socher et al. (2012b), where the size of the model, that is the number of variables that needs to be learned, is quadratic in the size of the word-embeddings.

    Model     Size      Learning
    MV-RNN    O(nw^2)   O(l)
    RAE       O(nw)     O(l^2)
    CCAE-*    O(nw)     O(l)

Table 7: Comparison of models. n is dictionary size, w embedding width, l is sentence length. We can assume l ≪ n, w. Additional factors such as CCG rules and types are treated as small constants for the purposes of this analysis.

Sparsity. Parametrizing on CCG types and rules increases the size of the model compared to a greedy RAE (Socher et al., 2011b). The effect of this was highlighted by the sentiment analysis task, with the more complex models performing worse in comparison with the simpler ones. We were able to overcome this issue by using additional training data. Beyond this, it would also be interesting to investigate the relationships between different types and to derive functions to incorporate this into the learning procedure. For instance, model learning could be adjusted to enforce some mirroring effects between the weight matrices of forward and backward application, or to support similarities between those of forward application and composition.

CCG-Vector Interface. Exactly how the information contained in a CCG derivation is best applied to a vector space model of compositionality is another issue for future research. Our investigation of this matter by exploring different model setups has proved somewhat inconclusive. While CCAE-D incorporated the deepest conditioning on the CCG structure, it did not decisively outperform the simpler CCAE-B, which just conditioned on the combinatory operators. Issues of sparsity, as shown in our experiments on pretraining, have a significant influence, which requires further study.

7 Conclusion

In this paper we have brought a more formal notion of semantic compositionality to vector space models based on recursive autoencoders. This was achieved through the use of the CCG formalism to provide a conditioning structure for the matrix-vector products that define the RAE.

We have explored a number of models, each of which conditions the compositional operations on different aspects of the CCG derivation. Our experimental findings indicate a clear advantage for a deeper integration of syntax over models that use only the bracketing structure of the parse tree.

The most effective way to condition the compositional operators on the syntax remains unclear. Once the issue of sparsity had been addressed, the complex models outperformed the simpler ones. Among the complex models, however, we could not establish significant or consistent differences to convincingly argue for a particular approach.

While the connections between formal linguistics and vector space approaches to NLP may not be immediately obvious, we believe that there is a case for the continued investigation of ways to best combine these two schools of thought. This paper represents one step towards the reconciliation of traditional formal approaches to compositional semantics with modern machine learning.

Acknowledgements

We thank the anonymous reviewers for their feedback and Richard Socher for providing additional insight into his models. Karl Moritz would further like to thank Sebastian Riedel for hosting him at UCL while this paper was written. This work has been supported by the EPSRC.
References

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP, pages 1183–1193.

Valerio Basile, Johan Bos, Kilian Evang, and Noortje Venhuizen. 2012. Developing a large semantically annotated corpus. In Proceedings of LREC, pages 3196–3200, Istanbul, Turkey.

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pages 153–160.

William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP-CoNLL, pages 546–556.

Stephen Clark and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. CL, 33(4):493–552, December.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML.

James Curran, Stephen Clark, and Johan Bos. 2007. Linguistically motivated large-scale NLP with C&C and Boxer. In Proceedings of ACL Demo and Poster Sessions, pages 33–36.

J. R. Firth. 1957. A synopsis of linguistic theory 1930-55. 1952-59:1–32.

Gottfried Frege. 1892. Über Sinn und Bedeutung. In Mark Textor, editor, Funktion - Begriff - Bedeutung, volume 4 of Sammlung Philosophie. Vandenhoeck & Ruprecht, Göttingen.

Christoph Goller and Andreas Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of ICNN-96, pages 347–352. IEEE.

Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of EMNLP, pages 1394–1404.

Edward Grefenstette, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. Multi-step regression learning for compositional distributional semantics.

G. E. Hinton and R. R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Geoffrey E. Hinton, Simon Osindero, Max Welling, and Yee Whye Teh. 2006. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science, 30(4):725–731.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. CL, 33(3):355–396, September.

Ray Jackendoff. 1972. Semantic Interpretation in Generative Grammar. MIT Press, Cambridge, MA.

Marcus Kracht. 2008. Compositionality in Montague Grammar. In Wolfram Hinzen, Edouard Machery, and Markus Werning, editors, Handbook of Compositionality, pages 47–63. Oxford University Press.

Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Muller. 1998. Efficient backprop. In G. Orr and K. Muller, editors, Neural Networks: Tricks of the Trade. Springer.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL, pages 317–324.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL, pages 236–244.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

R. Montague. 1970. Universal grammar. Theoria, 36(3):373–398.

Tetsuji Nakagawa, Kentaro Inui, and Sadao Kurohashi. 2010. Dependency tree-based sentiment classification using CRFs with hidden variables. In Proceedings of NAACL-HLT, pages 786–794.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL, pages 115–124.

Francis Jeffry Pelletier. 1994. The principle of semantic compositionality. Topoi, 13:11–24.

Fernando Pereira, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words. In Proceedings of ACL, pages 183–190.

Jordan B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46:77–105.

Christian Scheible and Hinrich Schütze. 2013. Cutting recursive autoencoder trees. In Proceedings of the International Conference on Learning Representations.

Hinrich Schütze. 1998. Automatic word sense discrimination. CL, 24(1):97–123, March.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011a. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems 24.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of EMNLP, pages 151–161.

Richard Socher, Brody Huval, Bharath Bhat, Christopher D. Manning, and Andrew Y. Ng. 2012a. Convolutional-recursive deep learning for 3D object classification. In Advances in Neural Information Processing Systems 25.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012b. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP-CoNLL, pages 1201–1211.

Mark Steedman and Jason Baldridge. 2011. Combinatory Categorial Grammar, pages 181–224. Wiley-Blackwell.

Anna Szabolcsi. 1989. Bound variables in syntax: Are there any? In Renate Bartsch, Johan van Benthem, and Peter van Emde Boas, editors, Semantics and Contextual Expression, pages 295–318. Foris, Dordrecht.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384–394.

Sida Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of ACL, pages 90–94.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of EMNLP-HLT, pages 347–354.