Documents Representation via Generalized Coupled Tensor Chain with the Rotation Group Constraint


         Igor Vorona, Anh-Huy Phan, Alexander Panchenko and Andrzej Cichocki
                Skolkovo Institute of Science and Technology, Moscow, Russia
                  {igor.vorona, a.phan, a.panchenko, a.cichocki}@skoltech.ru

                      Abstract

    Continuous representations of linguistic structures are an important part of modern natural
    language processing systems. Despite their diversity, most of the existing log-multilinear
    embedding models are organized around vector operations. However, these operations cannot
    precisely represent the compositionality of natural language due to a lack of order-preserving
    properties. In this work, we focus on one of the promising alternatives, based on embedding
    documents and words in the rotation group through a generalization of the coupled tensor chain
    decomposition to the exponential family of probability distributions. In this model, documents
    and words are represented as matrices, and n-gram representations are combined from word
    representations by matrix multiplication. The proposed model is optimized via noise-contrastive
    estimation. We show empirically that capturing word order and higher-order word interactions
    allows our model to achieve the best results in several document classification benchmarks.

1   Introduction

The current progress in natural language processing systems is largely based on the success of
representation learning for linguistic structures such as word, sentence and document embeddings.
The most promising and successful methods learn representations via two types of models: shallow
log-multilinear models and deep neural networks. Despite the efficiency and interpretability of
log-multilinear models, they cannot use higher-order linguistic features such as dependencies
between subsequences of words. To avoid these disadvantages, nonlinear predictors are usually
employed, such as recurrent, recursive and convolutional neural networks and, more recently,
Transformers. Nevertheless, these methods achieve high performance at the cost of some loss of
interpretability and of increased computation time.

However, there exist other ways to utilize higher-order interactions and still preserve the
efficiency of linear models. In this research, we focus on more data-oriented, i.e., more
linguistically grounded, and better interpretable methods which can still achieve high results in
practical tasks. In particular, we investigate the matrix-space model of language, in which the
semantic space consists of square matrices of real values. The key idea behind this method is a
realization of Frege's principle of compositionality through the order-preserving property of
matrix operations. As shown by Rudolph and Giesbrecht (2010), this type of model can internally
combine various properties of statistical and symbolic models of natural language and is therefore
more flexible than vector space models.

In spite of that fact, such models are usually hard to optimize on real data. To this end,
Yessenalina and Cardie (2011) drew attention to the need for nontrivial initialization and
proposed to learn the weights with a bag-of-words model. Asaadi and Rudolph (2017) used a complex
multi-stage initialization based on unigram and bigram scoring. Both approaches address the
sentiment analysis task. Recently, Mai et al. (2019) considered the problem of self-supervised
continuous representation of words via matrix-space models. They optimized a modified word2vec
objective function (Mikolov et al., 2013) and proposed a novel initialization that adds small
isotropic Gaussian noise to the identity matrix.

In this paper, we use a similarity function between matrices similar to Mai et al. (2019), but
instead of a neural-network style of learning, we implement the model as a coupled tensor chain
and impose the rotation group constraint. We focus on the document representation problem. Given a
document collection, we try to find unsupervised document and word representations suitable for
downstream linear classification tasks.

To this end, we represent words, n-grams and documents as matrices and train a self-supervised
model. Our intuition is based on the fact that modeling only the interaction between words and
documents is insufficient for modeling relations between complex phrases and documents.

Contributions. The main contributions of this work can be summarized as follows:

  • To the best of our knowledge, this is the first representation learning method based on the
    Riemannian geometry of matrix groups.

  • We show that our approach to modeling compositionality and word order allows us to increase
    the quality of document embeddings on downstream tasks. Moreover, it is also more
    computationally efficient in comparison with neural network models.

  • Our model achieves state-of-the-art performance on the task of representation learning for
    multiclass classification, both on short and long document datasets.

Our implementation of the proposed model is available online¹.

¹ https://github.com/harrycrow/RDM

2   Related work

Euclidean embedding models (Mikolov et al., 2013; Pennington et al., 2014), based on implicit
word-context co-occurrence matrix factorization (Levy and Goldberg, 2014), are an important
framework for current NLP tasks. These models achieve relatively high performance in various NLP
tasks such as text classification (Kim, 2014), named entity recognition (Lample et al., 2016), and
machine translation (Cho et al., 2014).

Riemannian embedding models have shown promising results by expanding embedding methods beyond
Euclidean geometry. There are several models with negative sectional curvature, such as the
Poincaré (Dhingra et al., 2018; Nickel and Kiela, 2017) and Lorentz models (Nickel and Kiela,
2018). Furthermore, Meng et al. (2019) proposed a text embedding model based on spherical
geometry.

Tensor decomposition models have been applied to many tasks in NLP. In particular, Van de Cruys
et al. (2013) proposed the Tucker model for decomposing a subject-verb-object co-occurrence tensor
for the computation of compositionality. The most similar to our task is the word embedding
problem. In this direction, Sharan and Valiant (2017) explored the Canonical Polyadic
Decomposition (CPD) of a word-triplet tensor. Bailey and Aeron (2017) used a symmetric CPD of a
pointwise mutual information tensor. Frandsen and Ge (2019) extended the RAND-WALK model (Arora
et al., 2016) to word triplets. The main drawback of the existing approaches is that they cannot
properly preserve the word order information of long n-grams. For example, in the case of the CPD,
we need to use separate parameters for each word based on its position in the text. This
restriction does not allow us to efficiently use the linguistic meaning of the tensor modes. The
symmetric CPD completely loses word order information, and the Tucker model suffers from an
exponentially increasing number of parameters in the case of long n-grams. Our approach eliminates
these disadvantages.

3   Problem and model description

In this section, we describe our model for the document representation task. We begin with a short
introduction to multilinear algebra, then present the proposed document modeling framework from
the viewpoint of coupled tensor decomposition, provide a detailed description of our model, and
indicate the benefits and drawbacks related to the rotation group constraints.

3.1   Basic multilinear algebra

A tensor is a higher-order generalization of vectors and matrices to multiple indices. The order
of a tensor is the number of dimensions, also known as modes or ways. An N-th order tensor is
represented as X ∈ R^{I_1×I_2×···×I_N}, and its elements are denoted as x_{i_1...i_N}. We can
always represent a tensor X as a sum of rank-1 tensors, each of which is defined as an outer
product of N vectors, i.e., a^{(1)} ∘ ··· ∘ a^{(N)} with a^{(n)} ∈ R^{I_n} for n = 1, ..., N. The
minimal number of rank-1 tensors in this sum defines the tensor rank.

In this research we focus on a particular type of tensor decomposition called the tensor chain
(TC) (Perez-Garcia et al., 2007; Khoromskij, 2011; Zhao et al., 2019). It represents a tensor via
the following sum of rank-1 tensors

    X = Σ_{r_1,...,r_N=1}^{R_1,...,R_N}  a^{(1)}_{r_1 r_2} ∘ ··· ∘ a^{(N)}_{r_N r_1},        (1)

where R_1, ..., R_N are called the ranks of the TC model. The element-wise form of this
decomposition is given by

    x_{i_1 i_2 ... i_N} = tr(A^{(1)}_{i_1} A^{(2)}_{i_2} ··· A^{(N)}_{i_N}),                 (2)

where A^{(n)}_{i_n} ∈ R^{R_n×R_{n+1}} is the i_n-th frontal slice of the core tensor
A^{(n)} ∈ R^{R_n×I_n×R_{n+1}} for n = 1, ..., N, with R_{N+1} = R_1, and a^{(n)}_{r_n r_{n+1}} are
the tubes of A^{(n)}. The basic notation used throughout the paper is summarized in Table 1.

    Notation     Description
    a            Scalar
    a            Vector or tuple
    A            Matrix
    A            Higher-order tensor
    ∘            Outer (tensor) product
    ⊗            Kronecker product
    tr(·)        Trace of a matrix
    vec(·)       Vectorization of a tensor
    [·]_+        Euclidean projection onto the nonnegative orthant, i.e., max{0, ·}
    QR(·)        QR decomposition

    Table 1: Basic notation.
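To make the tensor chain format concrete, the following minimal NumPy sketch (a toy illustration
with arbitrary small sizes and ranks, not taken from the released implementation) builds a
third-order tensor from TC cores and checks that the sum-of-rank-1 form in Eq. (1) coincides with
the trace form in Eq. (2):

    import numpy as np

    # Toy tensor-chain (tensor-ring) example with arbitrary small mode sizes and ranks.
    rng = np.random.default_rng(0)
    I = (4, 5, 6)                       # mode sizes I_1, I_2, I_3
    R = (2, 3, 4)                       # TC ranks R_1, R_2, R_3 (and R_4 = R_1)
    cores = [rng.standard_normal((R[n], I[n], R[(n + 1) % 3])) for n in range(3)]

    # Element-wise form (Eq. 2): x_{i1 i2 i3} = tr(A1[:, i1, :] A2[:, i2, :] A3[:, i3, :])
    X_trace = np.zeros(I)
    for i1 in range(I[0]):
        for i2 in range(I[1]):
            for i3 in range(I[2]):
                X_trace[i1, i2, i3] = np.trace(
                    cores[0][:, i1, :] @ cores[1][:, i2, :] @ cores[2][:, i3, :]
                )

    # Sum-of-rank-1 form (Eq. 1): outer products of the tubes a^(n)_{r_n r_{n+1}}
    X_sum = np.zeros(I)
    for r1 in range(R[0]):
        for r2 in range(R[1]):
            for r3 in range(R[2]):
                X_sum += np.einsum(
                    "i,j,k->ijk",
                    cores[0][r1, :, r2], cores[1][r2, :, r3], cores[2][r3, :, r1],
                )

    assert np.allclose(X_trace, X_sum)  # both forms define the same tensor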

3.2   Document modelling setting

In our research, we use the fact that the same text can be represented in different ways via
different sets of n-grams with fixed lengths {W^n}_{n=1}^N, where W^n = W × ··· × W (n times) and
W is the word set. The main hypothesis is that the occurrence statistics of each of these n-gram
sets contain some new information about the text which cannot be extracted from any other n-gram
set. If we combine information from all of these sets, we can achieve better quality for our
document embedding model.

Due to the fact that the occurrence of a sequence of words depends on the occurrence of each word
from this sequence, it is reasonable to treat the distribution of each fixed-length n-gram set
separately. Otherwise, because of the dependence between the length of a word sequence and its
frequency, short n-grams can downweight the importance of long n-grams, and the effect of
higher-order interactions can become small. We also notice that treating p(w) as a distribution
over unordered sets is a quite restrictive assumption on the structure of the model, given the
importance of word order in language semantics. For all these reasons, we work with each n-gram
distribution as a separate distribution of a single random variable, rather than defining a joint
distribution over all n-gram sets and assuming that the particular distribution of each n-gram set
can be obtained from this joint distribution through marginalization.

Following this intuition, for each n-gram set, n, we assign an appropriate joint distribution,
p(w, d), where w ∈ W^n and d ∈ D. We represent the co-occurrence of each n-gram,
w = (w_1, ..., w_n), and each document, d ∈ D, as an (n + 1)th-order tensor
X^{(wd)} ∈ R^{|W|×···×|W|×|D|}, where

    x^{(wd)}_{ij} = 1 if i = w and j = d, and 0 otherwise.                                   (3)

Then we represent the probability p(w, d) as the mean of these tensors:

    X̄^{(n)} = E_{(w,d)∼p(w,d)} X^{(wd)}.

Note that the co-occurrences of n-grams and documents define bipartite graphs between them, and
X̄^{(n)} can be interpreted as the adjacency tensors of these graphs.

3.3   Proposed model

Following the compositional matrix-space modelling approach, we represent each word, w ∈ W, as a
matrix U_w ∈ R^{R×R}, an n-gram, w ∈ W^n, as U_w = Π_{k=1}^n U_{w_k}, and each document, d ∈ D, as
a matrix V_d ∈ R^{R×R}. To measure the dependence between an n-gram, w, and a document, d, we use
the Frobenius inner product, defined as ⟨U_w, V_d⟩_F = tr(U_w^T V_d) = tr(U_w V_d^T) =
vec(U_w)^T vec(V_d). We assume that embeddings organized according to this operation can be
suitable for linear classifiers.
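As a minimal illustration of this composition rule (a toy NumPy sketch, not the released code; the
vocabulary is made up and random rotation matrices stand in for learned parameters), an n-gram
matrix is built by multiplying word matrices in order and is then scored against a document matrix
with the Frobenius inner product tr(U_w V_d^T):

    import numpy as np

    rng = np.random.default_rng(1)
    R = 4                                   # matrix size; embedding dimension d = R**2

    def random_rotation(R, rng):
        """Sample a rotation matrix from SO(R) via QR of a Gaussian matrix."""
        Q, _ = np.linalg.qr(rng.standard_normal((R, R)))
        if np.linalg.det(Q) < 0:            # flip one column to land in SO(R)
            Q[:, 0] = -Q[:, 0]
        return Q

    # Hypothetical toy vocabulary of word matrices and one document matrix.
    vocab = {w: random_rotation(R, rng) for w in ["deep", "learning", "works"]}
    V_doc = random_rotation(R, rng)

    def ngram_matrix(words):
        """n-gram representation: ordered product of word matrices."""
        U = np.eye(R)
        for w in words:
            U = U @ vocab[w]
        return U

    U_abc = ngram_matrix(["deep", "learning", "works"])
    U_cba = ngram_matrix(["works", "learning", "deep"])
    assert not np.allclose(U_abc, U_cba)    # matrix products preserve word order

    # Similarity between an n-gram and a document: Frobenius inner product.
    score = np.trace(U_abc @ V_doc.T)       # equals vec(U_abc) . vec(V_doc)
    print(f"similarity(n-gram, doc) = {score:.3f}")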

Figure 1: Illustration of the representation of a document collection in a multi-view way as a
collection of bipartite graphs. Each of these graphs represents the dependence (number of
co-occurrences) between word strings of length n and documents. The adjacency matrices
X̄^{(n)} ∈ R^{|W|^n×|D|} of these graphs can be appropriately tensorized to adjacency tensors
X̄^{(n)} ∈ R^{|W|×···×|W|×|D|}, which can be linked through a modified multinomial link function
with latent tensors Z^{(n)} ∈ R^{|W|×···×|W|×|D|}. The latent tensors can be decomposed via the
coupled tensor chain model. In our model, all core tensors U (V^T) which represent words
(documents) are additionally restricted to have the same parameters {U_w}_{w=1}^{|W|}
({V_d^T}_{d=1}^{|D|}).

The resulting model is a coupled tensor decomposition, generalized to the exponential family of
probability distributions (Collins et al., 2001; Yilmaz et al., 2011), of the set of tensors
{X̄^{(n)}}_{n=1}^N by the corresponding set of tensor chain models with the restricted set of
parameters

    Θ = {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|},                                               (4)

in the following optimization problem:

    min_Θ  Σ_{n=1}^N KL[X̄^{(n)} || X̂^{(n)}; Θ],                                             (5)

where each Kullback-Leibler divergence can be expressed as

    KL[X̄^{(n)} || X̂^{(n)}; Θ] = Σ_{w∈W^n} Σ_{d∈D} x̄^{(n)}_{wd} log(x̄^{(n)}_{wd} / x̂^{(n)}_{wd})
                              = Σ_{w_1=1}^{|W|} ··· Σ_{w_n=1}^{|W|} Σ_{d=1}^{|D|}
                                x̄^{(n)}_{w_1...w_n d} log(x̄^{(n)}_{w_1...w_n d} / x̂^{(n)}_{w_1...w_n d}),

and x̂^{(n)}_{wd} = p(w, d).

Our model represents p(w, d) by using the following mean function:

    p(w, d) ∝ p(w) p(d) exp(vec(X^{(wd)})^T vec(Z^{(n)})),

where vec(X^{(wd)})^T vec(Z^{(n)}) = z^{(n)}_{wd} is one of the natural parameters, which are
organized in the latent tensors Z^{(n)} ∈ R^{|W|×···×|W|×|D|}. These latent tensors contain the
pointwise mutual information between n-grams and documents, and we assume that each of these
tensors has low tensor chain rank, i.e., z^{(n)}_{wd} = tr(U_w V_d^T).

3.4   Intuition from geometric interpretation

If we avoid generative assumptions (Saunshi et al., 2019), our task can be interpreted as
maximizing the similarity between a document d and n-grams from its document distribution p(w|d),
with simultaneous minimization of the similarity between this document and n-grams from the common
n-gram distribution p(w). As shown in previous works (Kumar and Tsvetkov, 2019; Meng et al.,
2019), enforcing spherical geometry constraints is a promising choice for tasks focusing on
directional similarity. To do so, it is reasonable to constrain our model to the orthogonal group.
In this case the Frobenius inner product becomes a proper similarity measure, and the sequential
matrix product always preserves a fixed norm and the group structure (i.e., the invertibility of
matrix multiplication). Due to the group structure, our model has the interesting property that
each word in an n-gram is uniquely determined by its left and right aggregated context matrices
and the overall n-gram matrix.

However, the orthogonal group is a disjoint union of two connected components: the set of
rotations

    SO(R) = {A | A^T A = A A^T = I, det(A) = +1}

and the set of reflections {A | A^T A = A A^T = I, det(A) = −1}. We impose constraints on our
model parameters enforcing rotation matrices, since the product of any number of rotations is
always a rotation, i.e., the rotation set forms a matrix group, while only the product of an even
number of reflections becomes a rotation.
3.5   Noise-contrastive estimation

In practice we do not need to construct the set of tensors {X̄^{(n)}}_{n=1}^N explicitly. Instead,
since each X̄^{(n)} represents a higher-order frequency table, we can optimize the sum of MLE
tasks:

    E_{w,d∼X̄^{(1)}} log(x̂^{(1)}_{wd}) + ··· + E_{w,d∼X̄^{(N)}} log(x̂^{(N)}_{wd}).

Usually, we have a huge amount of data, which makes the problem of computing the partition
function for each x̂^{(n)}_{wd} intractable on many current computing architectures. We can avoid
this problem by using noise-contrastive estimation (Gutmann and Hyvärinen, 2012) for the
conditional model (Ma and Collins, 2018). Similar to Chen et al. (2017), we construct negative
samples from our batch by connecting non-linked n-grams and documents. Finally, for the parameter
set

    Θ = {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|} ∪ {κ^{(n)}}_{n=1}^N,                           (6)

we formulate the optimization problem in the following way:

    min_Θ  Σ_{n=1}^N L^{(n)}(Θ) + (λ/2) Σ_{n=1}^N (κ^{(n)})^2                                (7)
    s.t.   U_w ∈ SO(R), w = 1, ..., |W|,
           V_d ∈ SO(R), d = 1, ..., |D|,
           κ^{(n)} ≥ 0,  n = 1, ..., N,

where each risk function is equal to

    L^{(n)}(Θ) = E_{{(w_i,d_i)}_{i=1}^I ∼ Π_{i=1}^I x̄^{(n)}_{w_i d_i}}
                 [ − log ( exp(κ^{(n)} tr(U_{w_i} V_{d_i}^T))
                           / Σ_{j=1}^I exp(κ^{(n)} tr(U_{w_j} V_{d_i}^T)) ) ].               (8)

We add the concentration parameters κ^{(n)} to our loss function to overcome the problem of fixed
scale. This makes our model more flexible in representing sharp distributions. Due to the fact
that each n-gram distribution has its own scale, it is reasonable to have a different κ^{(n)} for
different n-gram distributions.
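A minimal sketch of the in-batch loss of Eq. (8) (assumed shapes and names, not the released
code): for a batch of I linked (n-gram, document) pairs, every other n-gram in the batch acts as a
negative for a given document, and the concentration parameter κ^{(n)} scales the logits.

    import numpy as np

    def nce_batch_loss(U_batch, V_batch, kappa):
        """Batch version of Eq. (8).

        U_batch: (I, R, R) n-gram matrices of the positive pairs in the batch.
        V_batch: (I, R, R) document matrices of the same pairs.
        kappa:   concentration parameter kappa^(n) for this n-gram length.
        Returns the mean negative log-probability of the positive pair, with the
        other n-grams in the batch acting as negatives.
        """
        # logits[i, j] = kappa * tr(U_j V_i^T): similarity of n-gram j to document i
        logits = kappa * np.einsum("jab,iab->ij", U_batch, V_batch)
        logits -= logits.max(axis=1, keepdims=True)           # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))                    # positives on the diagonal

    # Toy usage: random rotations stand in for learned parameters.
    rng = np.random.default_rng(3)
    def random_rotation(R, rng):
        Q, _ = np.linalg.qr(rng.standard_normal((R, R)))
        if np.linalg.det(Q) < 0:
            Q[:, 0] = -Q[:, 0]
        return Q

    I, R = 8, 4
    U = np.stack([random_rotation(R, rng) for _ in range(I)])
    V = np.stack([random_rotation(R, rng) for _ in range(I)])
    print(f"L^(n) on a random batch: {nce_batch_loss(U, V, kappa=1.0):.3f}")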
4   Optimization setup

4.1   N-gram construction

We construct n-grams from the text corpora by sequentially moving a sliding window of length n
(from 1 to N) inside each document.
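A minimal sketch of this sliding-window construction (tokenization itself is out of scope here and
the example tokens are made up):

    def ngrams_up_to(tokens, N):
        """Slide a window of length n = 1..N over one tokenized document."""
        return {
            n: [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            for n in range(1, N + 1)
        }

    # ngrams_up_to(["the", "cat", "sat"], 2)
    # -> {1: [('the',), ('cat',), ('sat',)], 2: [('the', 'cat'), ('cat', 'sat')]}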

4.2   Parameters initialization

Initialization from the uniform distribution on the Stiefel manifold (Saxe et al., 2014) is one of
the promising ways to initialize deep neural networks. To initialize parameters only from the
rotation component of the Stiefel manifold, we can swap two columns of a parameter matrix if its
determinant is −1. However, this initialization can be suboptimal, because the resulting rotation
matrices can be far away from each other, and due to the non-trivial structure of the loss
function on this manifold we can get stuck in a local minimum. To overcome this problem, we can
fix a particular point on the manifold for all matrices and perform a small movement from this
point in an arbitrary direction. We use the following strategy for each parameter
A ∈ {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|}:

    A_0 ∼ N(0, 1/R^2)
    [Q, R] = QR(I + (1/2)(A_0 − A_0^T))                                                      (9)
    A = Q.

We initialize all concentration parameters using the equation κ^{(n)} = u/R, where u ∼ U(0.9, 1.1)
and n = 1, ..., N.

4.3   Riemannian optimization

We solve our problem on the product manifold of the rotation group and the nonnegative orthant by
simultaneously optimizing the model parameters {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|} and
{κ^{(n)}}_{n=1}^N with Riemannian Adagrad (Bécigneul and Ganea, 2019) and projected Adagrad
(Duchi et al., 2011), respectively.

Let M be a real smooth manifold and L : M → R a smooth real-valued function over the parameters
θ ∈ M. Riemannian gradient descent (Gabay, 1982; Absil et al., 2008) is based on two sequential
steps. First, we compute the Riemannian gradient by orthogonally projecting the Euclidean gradient
onto the tangent space T_{θ_t}M at the point at which the Euclidean gradient is computed, using
the projector proj_{θ_t}:

    ∇_R L(θ_t) = proj_{θ_t}(∇_E L(θ_t)).                                                    (10)

Then we move on the manifold along a specific curve, called a retraction, R_{θ_t} : T_{θ_t}M → M:

    θ_{t+1} = R_{θ_t}(−α ∇_R L(θ_t)).                                                       (11)

For optimization on the rotation group, the orthogonal projector of a matrix G ∈ R^{R×R} onto the
tangent space at a point A ∈ SO(R) is given by

    proj_A(G) = (1/2)(G − A G^T A).                                                         (12)
To move on the manifold in this direction we use the QR-based retraction:

    [Q, R] = QR(A_t − α ∇_R L(A_t))
    A_{t+1} = Q.                                                                             (13)

We choose the QR-based retraction because it allows Riemannian Adagrad to achieve the fastest
convergence in our experiments, in comparison with the Cayley retraction (first-order), the polar
retraction (second-order), and the geodesic (matrix exponential).

Algorithm 1 Optimization algorithm for RDM
    Input: Learning rates α and β, number of iterations T, maximum n-gram length N.
    Output: Embedding parameters {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|} and {κ^{(n)}}_{n=1}^N.
    for t = 1 to T do                               ▷ after computing gradients:
        for all A_t ∈ {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|} do
            ∇_E L(A_t) = Σ_{n=1}^N ∇_E L^{(n)}(A_t)
            ∇_R L(A_t) ← (12) with ∇_E L(A_t)
            α_t = α / sqrt(Σ_{i=1}^t ||∇_R L(A_i)||_F^2)
            A_{t+1} ← (13) with α_t
        end for
        for n = 1 to N do
            grad(κ_t^{(n)}) = ∇_E L^{(n)}(κ_t^{(n)}) + λ κ_t^{(n)}
            κ_{t+1/2}^{(n)} = κ_t^{(n)} − β grad(κ_t^{(n)}) / sqrt(Σ_{i=1}^t grad^2(κ_i^{(n)}))
            κ_{t+1}^{(n)} = [κ_{t+1/2}^{(n)}]_+
        end for
    end for
4.4   Computational efficiency

The computational complexity of our model depends on the complexity R^3 of multiplying matrices of
size R × R and the complexity (4/3)R^3 of the QR decomposition (Layton and Sussman, 2020;
Trefethen and Bau, 1997). Since the number of elements in these matrices is d = R^2, we can
express the complexity of our model in terms of the embedding dimension. In this view, the time
complexity is O(kd^{1.5}), where k is the size of the context window. As shown in Table 2, our
model has computational benefits in comparison with Transformers due to the linear dependence of
its time complexity on the word sequence length. We note that in comparison with Bi-LSTM models
like ELMo, our model has lower complexity in the embedding dimension and can be computed in
parallel using the associativity property of matrix multiplication. Although our model has a
higher theoretical time complexity than the vector space models, the real gap between them is
relatively small at an ordinary embedding dimension (∼400).

    Method            Time          Space
    PV-DBOW           O(kd)         O(d)
    PV-DM (Concat)    O(kd)         O(kd)
    ELMo              O(kld^2)      O(ld^2)
    BERT              O(k^2 hld)    O(khld)
    RDM (Ours)        O(kd^{1.5})   O(d)

    Table 2: Comparison of the time and space complexity of several document embedding models,
    where k is the size of the context window, d the embedding dimension, l the number of layers,
    and h the number of heads. The time complexity of the other vector space models discussed here
    is equivalent to that of PV-DBOW.

5   Numerical experiments

5.1   Experimental setup

Downstream Linear Protocol. We estimate the quality of the pre-trained representations on
multiclass document categorization tasks on the 20 Newsgroups dataset and the ArXiv-based long
document dataset (He et al., 2019). We choose these datasets for our benchmarks because they
differ significantly in average document length. This implies that the statistics of long n-grams
differ between these datasets too, and in the case of the ArXiv dataset the n-gram statistics are
significantly better converged than in the case of 20 Newsgroups. This fact allows us to
hypothesize that matrix-space models should overfit less on the ArXiv dataset.

We fix the document embeddings and optimize multinomial logistic regression with the SAGA
optimizer and l2-norm regularization. Instead of a test set, we use nested 10-fold
cross-validation to estimate statistical significance using the Wilcoxon signed-rank test
(Japkowicz and Shah, 2011; Dror et al., 2018). For each fold we estimate the l2-regularization
hyperparameter on a 10-point logarithmic grid from 0.01 to 100 using an additional 10-fold
cross-validation with the macro-averaged F1 score. For text preprocessing, we use CountVectorizer
from the Scikit-learn package.
Additionally, we remove words which occur in the NLTK stopwords list or occur in only one
document. In the case of the ArXiv dataset we use half of the dataset and keep only documents in
the range from 1000 to 5000 words (smaller documents are removed and bigger documents are
truncated to the first 5000 words).

    Dataset          #cls    |W|       |D|      #w
    20 Newsgroups    20      75752     18846    180
    ArXiv            6       251108    16371    3829

    Table 3: Datasets considered in the paper, where #cls is the number of classes, #w the average
    number of words in the documents, |W| the vocabulary size, and |D| the number of documents.
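A condensed sketch of this linear evaluation protocol (scikit-learn based; the embedding matrix
X_emb and labels y below are random placeholders, and mapping the paper's regularization grid onto
scikit-learn's inverse-regularization parameter C is an assumption): multinomial logistic
regression with the SAGA solver, the l2 strength chosen by an inner 10-fold cross-validation with
macro-F1, and an outer 10-fold cross-validation for the reported scores.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

    # Placeholders: X_emb holds fixed document embeddings (n_docs, d), y the class labels.
    rng = np.random.default_rng(5)
    X_emb = rng.standard_normal((200, 16))
    y = rng.integers(0, 4, size=200)

    inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

    # Inner loop: pick the l2 strength (here C = 1/lambda) on a 10-point log grid by macro-F1.
    clf = GridSearchCV(
        LogisticRegression(solver="saga", penalty="l2", max_iter=5000),
        param_grid={"C": np.logspace(-2, 2, 10)},
        scoring="f1_macro",
        cv=inner,
    )

    # Outer loop: nested 10-fold cross-validation of the tuned linear classifier.
    scores = cross_val_score(clf, X_emb, y, cv=outer, scoring="f1_macro")
    print(f"nested-CV macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")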
                                                         ELMo models for our datasets, because in our ex-
                                                         periments it doesn’t give any positive effect on
Baselines. We compare our model with differ-             the final performance of these models. However,
ent vector space models: paragraph vectors (Le           this is not true for Sentence BERT. Fine-tuning
and Mikolov, 2014), weighted combinations of             slightly improve performance of this model on the
the word2vec skipgram vectors (Mikolov et al.,           20 newsgroup (50 epoch with maximum margin
2013) (average, TF-IDF and SIF (Arora et al.,            triplet loss).
2017)), Doc2vecC (Chen, 2017), sent2vec (Pagliar-           We should notice that this experimental design
dini et al., 2018) and recently proposed JoSe (Meng      gives some benefits to neural network models in
et al., 2019). The comparison with the last two of       comparison with log-multilinear models, but it is
these models seems to be more informative than           more consistent with the ordinary practical use case
with others because of some similarity of these          of the Transformers and RNN models. However,
models to our model (sent2vec can use n-gram in-         the next experiment will show that log-multilinear
formation and JoSe also based on the spherical type      models can still outperform pre-trained neural net-
of embedding geometry).                                  works.
   Due to the large number of possible values of hy-
perparameters for each model, we used the default        Ablation study. For ablation study, we use dif-
values proposed by the authors of these models or        ferent settings of our model. RDM means rotation
proposed in subsequent studies of these models like      document model, i.e. our model. RDM-R means
in the case of paragraph vectors (Lau and Baldwin,       our model without rotation group constraints. By
2016). We modify only the min count to 1 and             (1) we mean model which utilize only unigrams
window size to be equal to the number of negative        and documents co-occurance information. By (3)
samples for paragraph2vec and word2vec models            and (5) we mean model which utilizes informa-
because it gives better results for these methods in     tion from (1, 2, 3)-grams and (1, 2, 3, 4, 5)-grams
our experiments. We choose n-grams number equal          respectively. For RDM, we use 1e-2 and 1e-3 as
to 1 for sent2vec, because other values doesn’t im-      learning rates of Radagrad and projected Adagrad
prove results. To preserve the fairness of compar-       respectively and we use λ = 15 for 20 newsgroups
ison we use fixed embedding dimension, number            and change λ to 5 for ArXiv dataset due to smaller
of negative samples, and number of epochs for all        number of epoch in this experiment. For RDM-R
models including ours. These values try to mimic         we use 1e-2 as learning rate for Adagrad.
the usual values of these hyperparameters in prac-
tice.                                                    5.2   Experimental results and comparison of
   We compare our model not only with vector mod-              performance
els but also with neural network models. We use          Comparison to baselines. As one may observe
5.5B ELMo (Peters et al., 2018) version which            in the Table (4), our models yield results compara-
is pre-trained on Wikipedia (1.9B) and all of the        ble or outperforming the baseline methods, includ-
monolingual news crawl data from WMT 2008-               ing the simpler log-multilinear models (e.g. Skip-
2012 (3.6B). ELMo embedding dimension is equal           gram) and more complex models featuring nonlin-
to 1024. Also we use 768-dimensional embedding           ear transformations, such as recurrences (ELMo)

    Models               20 Newsgroups                      ArXiv
                         Acc    Prec   Rec    F1            Acc    Prec   Rec    F1
    PV-DBOW              88.7   88.4   88.2   88.2          89.1   89.2   89.2   89.2
    PV-DM                77.2   76.8   76.5   76.5          42.2   42.3   42.0   41.9
    Skipgram+Average     90.3   90.2   90.0   90.0          92.4   92.3   92.3   92.3
    Skipgram+TF-IDF      90.4   90.2   90.1   90.1          92.6   92.5   92.4   92.4
    Skipgram+SIF         90.4   90.2   90.1   90.1          92.4   92.3   92.3   92.3
    Sent2vec             87.9   87.6   87.5   87.5          91.9   91.7   91.7   91.7
    Doc2vecC             90.0   89.8   89.7   89.7          93.2   93.1   93.1   93.1
    JoSe                 87.8   87.6   87.4   87.4          91.3   91.3   91.2   91.2
    ELMo                 79.2   78.8   78.8   78.7          91.6   91.5   91.4   91.4
    BERT                 74.3   73.6   73.6   73.5          92.3   92.2   92.1   92.1
    Sentence BERT        79.5   79.1   79.0   79.0          89.0   88.9   88.8   88.8
    RDM-R (1)            86.8   86.4   86.3   86.3          94.0*  93.9*  93.8*  93.8*
    RDM-R (3)            87.9   87.6   87.4   87.4          94.5*  94.4*  94.4*  94.4*
    RDM-R (5)            88.3   88.0   87.9   87.9          94.6*  94.4*  94.4*  94.4*
    RDM (1)              89.3   89.2   89.0   89.0          94.0*  93.9*  93.9*  93.9*
    RDM (3)              90.7   90.5   90.4   90.4          94.0*  93.9*  93.9*  93.9*
    RDM (5)              91.1*  90.9*  90.8*  90.8*         94.0*  93.9*  93.9*  93.9*

    Table 4: Text classification performance on the 20 Newsgroups (short documents) and the
    modified ArXiv (long documents) datasets. We fix the number of epochs to 50, the embedding
    dimension to 400, and the number of negative samples to 15 for all models on 20 Newsgroups. On
    the ArXiv dataset, we use the same hyperparameters except for the number of epochs, which is
    equal to 5. We use macro-averaging for Precision, Recall, and F1.

Comparison to baselines. As one may observe in Table 4, our models yield results comparable to or
outperforming the baseline methods, including the simpler log-multilinear models (e.g., Skipgram)
and more complex models featuring nonlinear transformations such as recurrences (ELMo) and
Transformer blocks (BERT, Sentence BERT). More specifically, on 20 Newsgroups, the RDM (5) model
yields the best results, significantly² outperforming all the listed baseline approaches. It is
interesting that, contrary to 20 Newsgroups, all RDM variants with any number of n-gram sets show
strong results and significantly outperform the other models on the ArXiv dataset. Weighted
combinations of the skipgram vectors and the Doc2vecC model achieve the results closest to ours.
This confirms that neural models like ELMo and BERT are not the best choice for all datasets, and
log-multilinear models can outperform them. We can see that the performance of the nonlinear
models increases if we use a large document dataset, and BERT can outperform some of the
log-multilinear approaches (PV-DBOW, JoSe and sent2vec), but its result is still not at the top
level.

Comparison between our models. Observing the results, we can see that our model increases the
classification performance when an n-gram set with a bigger length is added. This property holds
both for the model with rotation group constraints and for the model without such constraints.
Despite this fact, as we can see, the model without rotation constraints is less robust with
respect to the noisy statistics of the small document dataset and achieves performance below the
PV-DBOW model, while the rotation group model outperforms all other models. However, once we move
to a dataset with a larger average document length (ArXiv), RDM-R performs better than all other
models, including RDM. This confirms our hypothesis that the existence of good statistics of long
n-grams is of critical value for matrix-space models. Due to the strong associativity between sets
of n-grams, our model needs more parameters to approximate all the co-occurrences. RDM-R has a
higher number of degrees of freedom than RDM (R^2 vs. R(R−1)/2 per matrix). We think that this
allows RDM-R to outperform RDM in this experiment. This intuition is also confirmed by the fact
that RDM (1) and RDM-R (1) have the same performance level, and only when we increase the number
of n-gram sets from 1 to 3 can RDM-R achieve better performance. However, if we increase the
number of n-gram sets from 3 to 5, both models stay at the same level of performance. This is a
sign that we need to use a bigger embedding dimension if we want to achieve even better results.

² We mark a result with * if the comparison of our model with the baseline models has p < 0.05.
In summary, if we have short documents, it is better to use RDM. For long document datasets with a
restriction on the embedding dimension, we suggest relaxing the rotation group constraints. This
trick allows RDM to use more degrees of freedom to estimate the data precisely.

6   Conclusion

In this paper, we proposed a novel unsupervised representation learning method based on the
generalized tensor chain with rotation group constraints, which can utilize higher-order word
interactions while preserving most of the computational efficiency and interpretability of
vector-based models. Our model achieves state-of-the-art results in document classification
benchmarks on the 20 Newsgroups and modified ArXiv datasets. A further direction of research could
focus on adding tensor kernel functions to the model to eliminate problems with the dependence on
the embedding dimension. It could also be interesting to augment this type of model with different
loss functions based not only on n-gram-document interactions but also on word-word interactions
from a knowledge graph or document-document interactions from the citation graph of the documents.

Acknowledgments

The Zhores supercomputer (Zacharov et al., 2019) was used for computations. This research was
partially supported by the Ministry of Education and Science of the Russian Federation (grant
14.756.31.0001). The work of Alexander Panchenko was partially supported by the MTS-Skoltech
laboratory.

References

Pierre-Antoine Absil, Robert E. Mahony, and Rodolphe Sepulchre. 2008. Optimization Algorithms on
Matrix Manifolds. Princeton University Press.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable
model approach to PMI-based word embeddings. Transactions of the Association for Computational
Linguistics, 4:385–399.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence
embeddings. In 5th International Conference on Learning Representations, ICLR 2017, Toulon,
France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Shima Asaadi and Sebastian Rudolph. 2017. Gradual learning of matrix-space models of language for
sentiment analysis. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages
178–185, Vancouver, Canada. Association for Computational Linguistics.

Eric Bailey and Shuchin Aeron. 2017. Word embeddings via tensor factorization. CoRR,
abs/1704.02686.

Gary Bécigneul and Octavian-Eugen Ganea. 2019. Riemannian adaptive optimization methods. In 7th
International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,
2019. OpenReview.net.

Minmin Chen. 2017. Efficient vector representation for documents through corruption. In 5th
International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26,
2017, Conference Track Proceedings. OpenReview.net.

Ting Chen, Yizhou Sun, Yue Shi, and Liangjie Hong. 2017. On sampling strategies for neural
network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017,
pages 767–776. ACM.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for
statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational
Linguistics.

Michael Collins, Sanjoy Dasgupta, and Robert E. Schapire. 2001. A generalization of principal
components analysis to the exponential family. In Advances in Neural Information Processing
Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8,
2001, Vancouver, British Columbia, Canada], pages 617–624. MIT Press.

Tim Van de Cruys, Thierry Poibeau, and Anna Korhonen. 2013. A tensor-based factorization model of
semantic compositionality. In Proceedings of the 2013 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pages 1142–1151,
Atlanta, Georgia. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
Association for Computational Linguistics.

Bhuwan Dhingra, Christopher Shallue, Mohammad Norouzi, Andrew Dai, and George Dahl. 2018.
Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on Graph-Based Methods
for Natural Language Processing (TextGraphs-12), pages 59–69, New Orleans, Louisiana, USA.
Association for Computational Linguistics.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker's guide to testing
statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392,
Melbourne, Australia. Association for Computational Linguistics.

John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online
learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159.

Abraham Frandsen and Rong Ge. 2019. Understanding composition of word embeddings via tensor
decomposition. In 7th International Conference on Learning Representations, ICLR 2019, New
Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Daniel Gabay. 1982. Minimizing a differentiable function over a differential manifold. Journal of
Optimization Theory and Applications, 37:177–219.

Michael Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical
models, with applications to natural image statistics. J. Mach. Learn. Res., 13:307–361.

Jun He, Liqun Wang, Liu Liu, Jiao Feng, and Hao Wu. 2019. Long document classification from local
word glimpses via recurrent attention learning. IEEE Access, 7:40707–40718.

Nathalie Japkowicz and Mohak Shah, editors. 2011. Evaluating Learning Algorithms: A Classification
Perspective. Cambridge University Press.

Boris N. Khoromskij. 2011. O(d log n)-quantics approximation of n-d tensors in high-dimensional
numerical modeling. Constructive Approximation, 34:257–280.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the
2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751,
Doha, Qatar. Association for Computational Linguistics.

Sachin Kumar and Yulia Tsvetkov. 2019. Von Mises-Fisher loss for training sequence to sequence
models with continuous outputs. In 7th International Conference on Learning Representations, ICLR
2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016.
Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.

Jey Han Lau and Timothy Baldwin. 2016. An empirical evaluation of doc2vec with practical insights
into document embedding generation. In Proceedings of the 1st Workshop on Representation Learning
for NLP, pages 78–86, Berlin, Germany. Association for Computational Linguistics.

William Layton and Myron Sussman. 2020. Numerical Linear Algebra. WSPC.

Quoc V. Le and Tomás Mikolov. 2014. Distributed representations of sentences and documents. In
Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China,
21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1188–1196. JMLR.org.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In
Advances in Neural Information Processing Systems, volume 27, pages 2177–2185. Curran Associates,
Inc.

Zhuang Ma and Michael Collins. 2018. Noise contrastive estimation and negative sampling for
conditional models: Consistency and statistical efficiency. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4,
2018, pages 3698–3707. Association for Computational Linguistics.

Florian Mai, Lukas Galke, and Ansgar Scherp. 2019. CBOW is not all you need: Combining CBOW with
the compositional matrix space model. In 7th International Conference on Learning Representations,
ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Yu Meng, Jiaxin Huang, Guangyuan Wang, Chao Zhang, Honglei Zhuang, Lance M. Kaplan, and Jiawei
Han. 2019. Spherical text embedding. In Advances in Neural Information Processing Systems 32:
Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14,
2019, Vancouver, BC, Canada, pages 8206–8215.

Tomás Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed
representations of words and phrases and their compositionality. In Advances in Neural Information
Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013.
Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages
3111–3119.

                                                       1683
Lake Tahoe, Nevada, United States, pages 3111–        Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora,
  3119.                                                   Mikhail Khodak, and Hrishikesh Khandeparkar.
                                                          2019. A theoretical analysis of contrastive unsuper-
Maximilian Nickel and Douwe Kiela. 2018. Learning         vised representation learning. In Proceedings of the
 continuous hierarchies in the lorentz model of hy-       36th International Conference on Machine Learn-
 perbolic geometry. In Proceedings of the 35th In-        ing, ICML 2019, 9-15 June 2019, Long Beach, Cali-
 ternational Conference on Machine Learning, ICML         fornia, USA, volume 97 of Proceedings of Machine
 2018, Stockholmsmässan, Stockholm, Sweden, July         Learning Research, pages 5628–5637. PMLR.
 10-15, 2018, volume 80 of Proceedings of Machine
 Learning Research, pages 3776–3785. PMLR.              Andrew M. Saxe, James L. McClelland, and Surya
                                                          Ganguli. 2014. Exact solutions to the nonlinear dy-
Maximillian Nickel and Douwe Kiela. 2017. Poincaré       namics of learning in deep linear neural networks.
 embeddings for learning hierarchical representa-         In 2nd International Conference on Learning Rep-
 tions. In I. Guyon, U. V. Luxburg, S. Bengio,            resentations, ICLR 2014, Banff, AB, Canada, April
 H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar-      14-16, 2014, Conference Track Proceedings.
 nett, editors, Advances in Neural Information Pro-
 cessing Systems 30, pages 6338–6347. Curran Asso-      Vatsal Sharan and Gregory Valiant. 2017. Orthogonal-
 ciates, Inc.                                             ized ALS: A theoretically principled tensor decom-
                                                          position algorithm for practical use. In Proceedings
Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi.      of the 34th International Conference on Machine
 2018. Unsupervised learning of sentence embed-           Learning, volume 70 of Proceedings of Machine
 dings using compositional n-gram features. In Pro-       Learning Research, pages 3095–3104, International
 ceedings of the 2018 Conference of the North Amer-       Convention Centre, Sydney, Australia. PMLR.
 ican Chapter of the Association for Computational
 Linguistics: Human Language Technologies, Vol-         Lloyd N. Trefethen and David Bau III. 1997. Numeri-
 ume 1 (Long Papers), pages 528–540, New Orleans,         cal Linear Algebra, first edition. SIAM: Society for
 Louisiana. Association for Computational Linguis-        Industrial and Applied Mathematics.
 tics.
                                                        Ainur Yessenalina and Claire Cardie. 2011. Compo-
Jeffrey Pennington, Richard Socher, and Christopher       sitional matrix-space models for sentiment analysis.
   Manning. 2014. GloVe: Global vectors for word          In Proceedings of the 2011 Conference on Empiri-
   representation. In Proceedings of the 2014 Confer-     cal Methods in Natural Language Processing, pages
   ence on Empirical Methods in Natural Language          172–182, Edinburgh, Scotland, UK. Association for
   Processing (EMNLP), pages 1532–1543, Doha,             Computational Linguistics.
   Qatar. Association for Computational Linguistics.
                                                        Yusuf Kenan Yilmaz, Ali Taylan Cemgil, and Umut
David Perez-Garcia, Frank Verstraete, Michael M.          Simsekli. 2011. Generalised coupled tensor factori-
  Wolf, and J. Ignacio Cirac. 2007. Matrix prod-          sation. In Advances in Neural Information Process-
  uct state representations. Quantum Info. Comput.,       ing Systems 24: 25th Annual Conference on Neural
  7(5):401–430.                                           Information Processing Systems 2011. Proceedings
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt           of a meeting held 12-14 December 2011, Granada,
 Gardner, Christopher Clark, Kenton Lee, and Luke         Spain, pages 2151–2159.
 Zettlemoyer. 2018. Deep contextualized word rep-
                                                        Igor Zacharov, Rinat Arslanov, Maksim Gunin, Daniil
 resentations. In Proceedings of the 2018 Confer-
                                                           Stefonishin, Andrey Bykov, Sergey Pavlov, Oleg
 ence of the North American Chapter of the Associ-
                                                           Panarin, Anton Maliutin, Sergey Rykovanov, and
 ation for Computational Linguistics: Human Lan-
                                                           Maxim Fedorov. 2019. Zhores — petaflops super-
 guage Technologies, Volume 1 (Long Papers), pages
                                                           computer for data-driven modeling, machine learn-
 2227–2237, New Orleans, Louisiana. Association
                                                           ing and artificial intelligence installed in skolkovo
 for Computational Linguistics.
                                                           institute of science and technology. Open Engineer-
Nils Reimers and Iryna Gurevych. 2019. Sentence-           ing, 9(1):512–520.
  BERT: Sentence embeddings using Siamese BERT-
  networks. In Proceedings of the 2019 Conference on    Qibin Zhao, Masashi Sugiyama, Longhao Yuan, and
  Empirical Methods in Natural Language Processing        Andrzej Cichocki. 2019. Learning efficient ten-
  and the 9th International Joint Conference on Natu-     sor representations with ring-structured networks.
  ral Language Processing (EMNLP-IJCNLP), pages           In IEEE International Conference on Acoustics,
  3982–3992, Hong Kong, China. Association for            Speech and Signal Processing, ICASSP 2019,
  Computational Linguistics.                              Brighton, United Kingdom, May 12-17, 2019, pages
                                                          8608–8612. IEEE.
Sebastian Rudolph and Eugenie Giesbrecht. 2010.
  Compositional matrix-space models of language. In
  Proceedings of the 48th Annual Meeting of the As-
  sociation for Computational Linguistics, pages 907–
  916, Uppsala, Sweden. Association for Computa-
  tional Linguistics.
