Documents Representation via Generalized Coupled Tensor Chain with the Rotation Group Constraint


         Igor Vorona, Anh-Huy Phan, Alexander Panchenko and Andrzej Cichocki
                Skolkovo Institute of Science and Technology, Moscow, Russia
                  {igor.vorona, a.phan, a.panchenko, a.cichocki}@skoltech.ru

                      Abstract

    Continuous representations of linguistic structures are an important part of modern natural
    language processing systems. Despite their diversity, most of the existing log-multilinear
    embedding models are organized around vector operations. However, these operations cannot
    precisely represent the compositionality of natural language due to a lack of order-preserving
    properties. In this work, we focus on one of the promising alternatives, based on embedding
    documents and words in the rotation group through a generalization of the coupled tensor chain
    decomposition to the exponential family of probability distributions. In this model, documents
    and words are represented as matrices, and n-gram representations are combined from word
    representations by matrix multiplication. The proposed model is optimized via noise-contrastive
    estimation. We show empirically that capturing word order and higher-order word interactions
    allows our model to achieve the best results in several document classification benchmarks.

1   Introduction

The current progress in natural language processing systems is largely based on the success of
representation learning for linguistic structures such as word, sentence and document embeddings.
The most promising and successful methods learn representations via two types of models: shallow
log-multilinear models and deep neural networks. Despite the efficiency and interpretability of
log-multilinear models, they cannot use higher-order linguistic features such as dependencies
between subsequences of words. To avoid these disadvantages, nonlinear predictors are usually
employed, such as recurrent, recursive and convolutional neural networks and, more recently,
Transformers. Nevertheless, these methods achieve high performance at the cost of some loss of
interpretability and of increased computation time.

However, there exist other ways to utilize higher-order interactions and still preserve the
efficiency of linear models. In this research, we focus on more data-oriented, i.e., more
linguistically grounded, and better interpretable methods which can still achieve high results in
practical tasks. In particular, we investigate the matrix-space model of language, in which the
semantic space consists of square matrices of real values. The key idea behind this method is a
realization of Frege's principle of compositionality through the order-preserving property of
matrix operations. As shown by Rudolph and Giesbrecht (2010), this type of model can internally
combine various properties of statistical and symbolic models of natural language and is therefore
more flexible than vector space models.

In spite of that fact, such models are usually hard to optimize on real data. To this end,
Yessenalina and Cardie (2011) drew attention to the need for nontrivial initialization and
proposed to learn the weights with a bag-of-words model. Asaadi and Rudolph (2017) used a complex
multi-stage initialization based on unigram and bigram scoring. Both approaches address the
sentiment analysis task. Recently, Mai et al. (2019) considered the problem of self-supervised
continuous representation of words via matrix-space models. They optimized a modified word2vec
objective function (Mikolov et al., 2013) and proposed a novel initialization that adds small
isotropic Gaussian noise to the identity matrix.

In this paper, we use a similarity function between matrices similar to Mai et al. (2019), but
instead of a neural-network style of learning, we implement the model as a coupled tensor chain
and impose the rotation group constraint. We focus on the document representation problem. Given a
document collection, we try to find unsupervised document and word representations suitable for
downstream linear classification tasks.

To this end, we represent words, n-grams and documents as matrices and train a self-supervised
model. Our intuition is based on the fact that modeling only the interaction between words and
documents is insufficient for modeling relations between complex phrases and documents.

Contributions. The main contributions of this work can be summarized as follows:

  • To the best of our knowledge, this is the first representation learning method based on the
    Riemannian geometry of matrix groups.

  • We show that our approach to modeling compositionality and word order allows us to increase
    the quality of document embeddings on downstream tasks. Moreover, it is also more
    computationally efficient in comparison with neural network models.

  • Our model achieves state-of-the-art performance on the task of representation learning for
    multiclass classification, both on short and long document datasets.

Our implementation of the proposed model is available online¹.

¹ https://github.com/harrycrow/RDM

2   Related work

Euclidean embedding models (Mikolov et al., 2013; Pennington et al., 2014), based on implicit
word-context co-occurrence matrix factorization (Levy and Goldberg, 2014), are an important
framework for current NLP tasks. These models achieve relatively high performance in various NLP
tasks such as text classification (Kim, 2014), named entity recognition (Lample et al., 2016), and
machine translation (Cho et al., 2014).

Riemannian embedding models have shown promising results by expanding embedding methods beyond
Euclidean geometry. There are several models with negative sectional curvature, such as the
Poincaré (Dhingra et al., 2018; Nickel and Kiela, 2017) and Lorentz models (Nickel and Kiela,
2018). Furthermore, Meng et al. (2019) proposed a text embedding model based on spherical
geometry.

Tensor decomposition models have been applied to many tasks in NLP. In particular, Van de Cruys
et al. (2013) proposed the Tucker model for decomposing a subject-verb-object co-occurrence tensor
for the computation of compositionality. The most similar to our task is the word embedding
problem. In this direction, Sharan and Valiant (2017) explored the Canonical Polyadic
Decomposition (CPD) of a word-triplet tensor. Bailey and Aeron (2017) used a symmetric CPD of a
pointwise mutual information tensor. Frandsen and Ge (2019) extended the RAND-WALK model (Arora
et al., 2016) to word triplets. The main drawback of the existing approaches is that they cannot
properly preserve the word order information of long n-grams. For example, in the case of the CPD,
we need to use separate parameters for each word based on its position in the text. This
restriction does not allow us to efficiently use the linguistic meaning of the tensor modes. The
symmetric CPD completely loses word order information, and the Tucker model suffers from an
exponentially increasing number of parameters in the case of long n-grams. Our approach eliminates
these disadvantages.

3   Problem and model description

In this section, we describe our model for the document representation task. We begin with a short
introduction to multilinear algebra, then present the proposed document modeling framework from
the viewpoint of coupled tensor decomposition, provide a detailed description of our model, and
indicate the benefits and drawbacks related to the rotation group constraints.

3.1   Basic multilinear algebra

A tensor is a higher-order generalization of vectors and matrices to multiple indices. The order
of a tensor is the number of dimensions, also known as modes or ways. An N-th order tensor is
represented as X ∈ R^{I_1×I_2×···×I_N}, and its elements are denoted as x_{i_1...i_N}. We can
always represent a tensor X as a sum of rank-1 tensors, each of which is defined as an outer
product of N vectors, i.e., a^{(1)} ∘ ··· ∘ a^{(N)} with a^{(n)} ∈ R^{I_n} for n = 1, ..., N. The
minimal number of rank-1 tensors in this sum defines the tensor rank.

In this research we focus on a particular type of tensor decomposition called the tensor chain
(TC) (Perez-Garcia et al., 2007; Khoromskij, 2011; Zhao et al., 2019). It represents a tensor via
the following sum of rank-1 tensors

    X = Σ_{r_1,...,r_N=1}^{R_1,...,R_N}  a^{(1)}_{r_1 r_2} ∘ ··· ∘ a^{(N)}_{r_N r_1},        (1)

where R_1, ..., R_N are called the ranks of the TC model. The element-wise form of this
decomposition is given by

    x_{i_1 i_2 ... i_N} = tr(A^{(1)}_{i_1} A^{(2)}_{i_2} ··· A^{(N)}_{i_N}),                 (2)

where A^{(n)}_{i_n} ∈ R^{R_n×R_{n+1}} is the i_n-th frontal slice of the core tensor
A^{(n)} ∈ R^{R_n×I_n×R_{n+1}} for n = 1, ..., N, with R_{N+1} = R_1, and a^{(n)}_{r_n r_{n+1}} are
the tubes of A^{(n)}. The basic notation used throughout the paper is summarized in Table 1.

    Notation     Description
    a            Scalar
    a            Vector or tuple
    A            Matrix
    A            Higher-order tensor
    ∘            Outer (tensor) product
    ⊗            Kronecker product
    tr(·)        Trace of a matrix
    vec(·)       Vectorization of a tensor
    [·]_+        Euclidean projection onto the nonnegative orthant, i.e., max{0, ·}
    QR(·)        QR decomposition

    Table 1: Basic notation.
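To make the tensor chain format concrete, the following minimal NumPy sketch (a toy illustration
with arbitrary small sizes and ranks, not taken from the released implementation) builds a
third-order tensor from TC cores and checks that the sum-of-rank-1 form in Eq. (1) coincides with
the trace form in Eq. (2):

    import numpy as np

    # Toy tensor-chain (tensor-ring) example with arbitrary small mode sizes and ranks.
    rng = np.random.default_rng(0)
    I = (4, 5, 6)                       # mode sizes I_1, I_2, I_3
    R = (2, 3, 4)                       # TC ranks R_1, R_2, R_3 (and R_4 = R_1)
    cores = [rng.standard_normal((R[n], I[n], R[(n + 1) % 3])) for n in range(3)]

    # Element-wise form (Eq. 2): x_{i1 i2 i3} = tr(A1[:, i1, :] A2[:, i2, :] A3[:, i3, :])
    X_trace = np.zeros(I)
    for i1 in range(I[0]):
        for i2 in range(I[1]):
            for i3 in range(I[2]):
                X_trace[i1, i2, i3] = np.trace(
                    cores[0][:, i1, :] @ cores[1][:, i2, :] @ cores[2][:, i3, :]
                )

    # Sum-of-rank-1 form (Eq. 1): outer products of the tubes a^(n)_{r_n r_{n+1}}
    X_sum = np.zeros(I)
    for r1 in range(R[0]):
        for r2 in range(R[1]):
            for r3 in range(R[2]):
                X_sum += np.einsum(
                    "i,j,k->ijk",
                    cores[0][r1, :, r2], cores[1][r2, :, r3], cores[2][r3, :, r1],
                )

    assert np.allclose(X_trace, X_sum)  # both forms define the same tensor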

3.2   Document modelling setting

In our research, we use the fact that the same text can be represented in different ways via
different sets of n-grams with fixed lengths {W^n}_{n=1}^N, where W^n = W × ··· × W (n times) and
W is the word set. The main hypothesis is that the occurrence statistics of each of these n-gram
sets contain some new information about the text which cannot be extracted from any other n-gram
set. If we combine information from all of these sets, we can achieve better quality for our
document embedding model.

Due to the fact that the occurrence of a sequence of words depends on the occurrence of each word
from this sequence, it is reasonable to treat the distribution of each fixed-length n-gram set
separately. Otherwise, because of the dependence between the length of a word sequence and its
frequency, short n-grams can downweight the importance of long n-grams, and the effect of
higher-order interactions can become small. We also notice that treating p(w) as a distribution
over unordered sets is a quite restrictive assumption on the structure of the model, given the
importance of word order in language semantics. For all these reasons, we work with each n-gram
distribution as a separate distribution of a single random variable, rather than defining a joint
distribution over all n-gram sets and assuming that the particular distribution of each n-gram set
can be obtained from this joint distribution through marginalization.

Following this intuition, for each n-gram set, n, we assign an appropriate joint distribution,
p(w, d), where w ∈ W^n and d ∈ D. We represent the co-occurrence of each n-gram,
w = (w_1, ..., w_n), and each document, d ∈ D, as an (n + 1)th-order tensor
X^{(wd)} ∈ R^{|W|×···×|W|×|D|}, where

    x^{(wd)}_{ij} = 1 if i = w and j = d, and 0 otherwise.                                   (3)

Then we represent the probability p(w, d) as the mean of these tensors:

    X̄^{(n)} = E_{(w,d)∼p(w,d)} X^{(wd)}.

Note that the co-occurrences of n-grams and documents define bipartite graphs between them, and
X̄^{(n)} can be interpreted as the adjacency tensors of these graphs.

3.3   Proposed model

Following the compositional matrix-space modelling approach, we represent each word, w ∈ W, as a
matrix U_w ∈ R^{R×R}, an n-gram, w ∈ W^n, as U_w = Π_{k=1}^n U_{w_k}, and each document, d ∈ D, as
a matrix V_d ∈ R^{R×R}. To measure the dependence between an n-gram, w, and a document, d, we use
the Frobenius inner product, defined as ⟨U_w, V_d⟩_F = tr(U_w^T V_d) = tr(U_w V_d^T) =
vec(U_w)^T vec(V_d). We assume that embeddings organized according to this operation can be
suitable for linear classifiers.
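As a minimal illustration of this composition rule (a toy NumPy sketch, not the released code; the
vocabulary is made up and random rotation matrices stand in for learned parameters), an n-gram
matrix is built by multiplying word matrices in order and is then scored against a document matrix
with the Frobenius inner product tr(U_w V_d^T):

    import numpy as np

    rng = np.random.default_rng(1)
    R = 4                                   # matrix size; embedding dimension d = R**2

    def random_rotation(R, rng):
        """Sample a rotation matrix from SO(R) via QR of a Gaussian matrix."""
        Q, _ = np.linalg.qr(rng.standard_normal((R, R)))
        if np.linalg.det(Q) < 0:            # flip one column to land in SO(R)
            Q[:, 0] = -Q[:, 0]
        return Q

    # Hypothetical toy vocabulary of word matrices and one document matrix.
    vocab = {w: random_rotation(R, rng) for w in ["deep", "learning", "works"]}
    V_doc = random_rotation(R, rng)

    def ngram_matrix(words):
        """n-gram representation: ordered product of word matrices."""
        U = np.eye(R)
        for w in words:
            U = U @ vocab[w]
        return U

    U_abc = ngram_matrix(["deep", "learning", "works"])
    U_cba = ngram_matrix(["works", "learning", "deep"])
    assert not np.allclose(U_abc, U_cba)    # matrix products preserve word order

    # Similarity between an n-gram and a document: Frobenius inner product.
    score = np.trace(U_abc @ V_doc.T)       # equals vec(U_abc) . vec(V_doc)
    print(f"similarity(n-gram, doc) = {score:.3f}")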

Figure 1: Illustration of the representation of a document collection in a multi-view way as a
collection of bipartite graphs. Each of these graphs represents the dependence (number of
co-occurrences) between word strings of length n and documents. The adjacency matrices
X̄^{(n)} ∈ R^{|W|^n×|D|} of these graphs can be appropriately tensorized to adjacency tensors
X̄^{(n)} ∈ R^{|W|×···×|W|×|D|}, which can be linked through a modified multinomial link function
with latent tensors Z^{(n)} ∈ R^{|W|×···×|W|×|D|}. The latent tensors can be decomposed via the
coupled tensor chain model. In our model, all core tensors U (V^T) which represent words
(documents) are additionally restricted to have the same parameters {U_w}_{w=1}^{|W|}
({V_d^T}_{d=1}^{|D|}).

The resulting model is a coupled tensor decomposition, generalized to the exponential family of
probability distributions (Collins et al., 2001; Yilmaz et al., 2011), of the set of tensors
{X̄^{(n)}}_{n=1}^N by the corresponding set of tensor chain models with the restricted set of
parameters

    Θ = {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|},                                               (4)

in the following optimization problem:

    min_Θ  Σ_{n=1}^N KL[X̄^{(n)} || X̂^{(n)}; Θ],                                             (5)

where each Kullback-Leibler divergence can be expressed as

    KL[X̄^{(n)} || X̂^{(n)}; Θ] = Σ_{w∈W^n} Σ_{d∈D} x̄^{(n)}_{wd} log(x̄^{(n)}_{wd} / x̂^{(n)}_{wd})
                              = Σ_{w_1=1}^{|W|} ··· Σ_{w_n=1}^{|W|} Σ_{d=1}^{|D|}
                                x̄^{(n)}_{w_1...w_n d} log(x̄^{(n)}_{w_1...w_n d} / x̂^{(n)}_{w_1...w_n d}),

and x̂^{(n)}_{wd} = p(w, d).

Our model represents p(w, d) by using the following mean function:

    p(w, d) ∝ p(w) p(d) exp(vec(X^{(wd)})^T vec(Z^{(n)})),

where vec(X^{(wd)})^T vec(Z^{(n)}) = z^{(n)}_{wd} is one of the natural parameters, which are
organized in the latent tensors Z^{(n)} ∈ R^{|W|×···×|W|×|D|}. These latent tensors contain the
pointwise mutual information between n-grams and documents, and we assume that each of these
tensors has low tensor chain rank, i.e., z^{(n)}_{wd} = tr(U_w V_d^T).

3.4   Intuition from geometric interpretation

If we avoid generative assumptions (Saunshi et al., 2019), our task can be interpreted as
maximizing the similarity between a document d and n-grams from its document distribution p(w|d),
with simultaneous minimization of the similarity between this document and n-grams from the common
n-gram distribution p(w). As shown in previous works (Kumar and Tsvetkov, 2019; Meng et al.,
2019), enforcing spherical geometry constraints is a promising choice for tasks focusing on
directional similarity. To do so, it is reasonable to constrain our model to the orthogonal group.
In this case the Frobenius inner product becomes a proper similarity measure, and the sequential
matrix product always preserves a fixed norm and the group structure (i.e., the invertibility of
matrix multiplication). Due to the group structure, our model has the interesting property that
each word in an n-gram is uniquely determined by its left and right aggregated context matrices
and the overall n-gram matrix.

However, the orthogonal group is a disjoint union of two connected components: the set of
rotations

    SO(R) = {A | A^T A = A A^T = I, det(A) = +1}

and the set of reflections {A | A^T A = A A^T = I, det(A) = −1}. We impose constraints on our
model parameters enforcing rotation matrices, since the product of any number of rotations is
always a rotation, i.e., the rotation set forms a matrix group, while only the product of an even
number of reflections becomes a rotation.
3.5   Noise-contrastive estimation

In practice we do not need to construct the set of tensors {X̄^{(n)}}_{n=1}^N explicitly. Instead,
since each X̄^{(n)} represents a higher-order frequency table, we can optimize the sum of MLE
tasks:

    E_{w,d∼X̄^{(1)}} log(x̂^{(1)}_{wd}) + ··· + E_{w,d∼X̄^{(N)}} log(x̂^{(N)}_{wd}).

Usually, we have a huge amount of data, which makes the problem of computing the partition
function for each x̂^{(n)}_{wd} intractable on many current computing architectures. We can avoid
this problem by using noise-contrastive estimation (Gutmann and Hyvärinen, 2012) for the
conditional model (Ma and Collins, 2018). Similar to Chen et al. (2017), we construct negative
samples from our batch by connecting non-linked n-grams and documents. Finally, for the parameter
set

    Θ = {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|} ∪ {κ^{(n)}}_{n=1}^N,                           (6)

we formulate the optimization problem in the following way:

    min_Θ  Σ_{n=1}^N L^{(n)}(Θ) + (λ/2) Σ_{n=1}^N (κ^{(n)})^2                                (7)
    s.t.   U_w ∈ SO(R), w = 1, ..., |W|,
           V_d ∈ SO(R), d = 1, ..., |D|,
           κ^{(n)} ≥ 0,  n = 1, ..., N,

where each risk function is equal to

    L^{(n)}(Θ) = E_{{(w_i,d_i)}_{i=1}^I ∼ Π_{i=1}^I x̄^{(n)}_{w_i d_i}}
                 [ − log ( exp(κ^{(n)} tr(U_{w_i} V_{d_i}^T))
                           / Σ_{j=1}^I exp(κ^{(n)} tr(U_{w_j} V_{d_i}^T)) ) ].               (8)

We add the concentration parameters κ^{(n)} to our loss function to overcome the problem of fixed
scale. This makes our model more flexible in representing sharp distributions. Due to the fact
that each n-gram distribution has its own scale, it is reasonable to have a different κ^{(n)} for
different n-gram distributions.
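A minimal sketch of the in-batch loss of Eq. (8) (assumed shapes and names, not the released
code): for a batch of I linked (n-gram, document) pairs, every other n-gram in the batch acts as a
negative for a given document, and the concentration parameter κ^{(n)} scales the logits.

    import numpy as np

    def nce_batch_loss(U_batch, V_batch, kappa):
        """Batch version of Eq. (8).

        U_batch: (I, R, R) n-gram matrices of the positive pairs in the batch.
        V_batch: (I, R, R) document matrices of the same pairs.
        kappa:   concentration parameter kappa^(n) for this n-gram length.
        Returns the mean negative log-probability of the positive pair, with the
        other n-grams in the batch acting as negatives.
        """
        # logits[i, j] = kappa * tr(U_j V_i^T): similarity of n-gram j to document i
        logits = kappa * np.einsum("jab,iab->ij", U_batch, V_batch)
        logits -= logits.max(axis=1, keepdims=True)           # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))                    # positives on the diagonal

    # Toy usage: random rotations stand in for learned parameters.
    rng = np.random.default_rng(3)
    def random_rotation(R, rng):
        Q, _ = np.linalg.qr(rng.standard_normal((R, R)))
        if np.linalg.det(Q) < 0:
            Q[:, 0] = -Q[:, 0]
        return Q

    I, R = 8, 4
    U = np.stack([random_rotation(R, rng) for _ in range(I)])
    V = np.stack([random_rotation(R, rng) for _ in range(I)])
    print(f"L^(n) on a random batch: {nce_batch_loss(U, V, kappa=1.0):.3f}")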
4   Optimization setup

4.1   N-gram construction

We construct n-grams from the text corpora by sequentially moving a sliding window of length n
(from 1 to N) inside each document.
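A minimal sketch of this sliding-window construction (tokenization itself is out of scope here and
the example tokens are made up):

    def ngrams_up_to(tokens, N):
        """Slide a window of length n = 1..N over one tokenized document."""
        return {
            n: [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            for n in range(1, N + 1)
        }

    # ngrams_up_to(["the", "cat", "sat"], 2)
    # -> {1: [('the',), ('cat',), ('sat',)], 2: [('the', 'cat'), ('cat', 'sat')]}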

4.2   Parameters initialization

Initialization from the uniform distribution on the Stiefel manifold (Saxe et al., 2014) is one of
the promising ways to initialize deep neural networks. To initialize parameters only from the
rotation component of the Stiefel manifold, we can swap two columns of a parameter matrix if its
determinant is −1. However, this initialization can be suboptimal, because the resulting rotation
matrices can be far away from each other, and due to the non-trivial structure of the loss
function on this manifold we can get stuck in a local minimum. To overcome this problem, we can
fix a particular point on the manifold for all matrices and perform a small movement from this
point in an arbitrary direction. We use the following strategy for each parameter
A ∈ {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|}:

    A_0 ∼ N(0, 1/R^2)
    [Q, R] = QR(I + (1/2)(A_0 − A_0^T))                                                      (9)
    A = Q.

We initialize all concentration parameters using the equation κ^{(n)} = u/R, where u ∼ U(0.9, 1.1)
and n = 1, ..., N.

4.3   Riemannian optimization

We solve our problem on the product manifold of the rotation group and the nonnegative orthant by
simultaneously optimizing the model parameters {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|} and
{κ^{(n)}}_{n=1}^N with Riemannian Adagrad (Bécigneul and Ganea, 2019) and projected Adagrad
(Duchi et al., 2011), respectively.

Let M be a real smooth manifold and L : M → R a smooth real-valued function over the parameters
θ ∈ M. Riemannian gradient descent (Gabay, 1982; Absil et al., 2008) is based on two sequential
steps. First, we compute the Riemannian gradient by orthogonally projecting the Euclidean gradient
onto the tangent space T_{θ_t}M at the point at which the Euclidean gradient is computed, using
the projector proj_{θ_t}:

    ∇_R L(θ_t) = proj_{θ_t}(∇_E L(θ_t)).                                                    (10)

Then we move on the manifold along a specific curve, called a retraction, R_{θ_t} : T_{θ_t}M → M:

    θ_{t+1} = R_{θ_t}(−α ∇_R L(θ_t)).                                                       (11)

For optimization on the rotation group, the orthogonal projector of a matrix G ∈ R^{R×R} onto the
tangent space at a point A ∈ SO(R) is given by

    proj_A(G) = (1/2)(G − A G^T A).                                                         (12)
To move on the manifold in this direction we use the QR-based retraction:

    [Q, R] = QR(A_t − α ∇_R L(A_t))
    A_{t+1} = Q.                                                                             (13)

We choose the QR-based retraction because it allows Riemannian Adagrad to achieve the fastest
convergence in our experiments, in comparison with the Cayley retraction (first-order), the polar
retraction (second-order), and the geodesic (matrix exponential).

Algorithm 1 Optimization algorithm for RDM
    Input: Learning rates α and β, number of iterations T, maximum n-gram length N.
    Output: Embedding parameters {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|} and {κ^{(n)}}_{n=1}^N.
    for t = 1 to T do                               ▷ after computing gradients:
        for all A_t ∈ {U_w}_{w=1}^{|W|} ∪ {V_d}_{d=1}^{|D|} do
            ∇_E L(A_t) = Σ_{n=1}^N ∇_E L^{(n)}(A_t)
            ∇_R L(A_t) ← (12) with ∇_E L(A_t)
            α_t = α / sqrt(Σ_{i=1}^t ||∇_R L(A_i)||_F^2)
            A_{t+1} ← (13) with α_t
        end for
        for n = 1 to N do
            grad(κ_t^{(n)}) = ∇_E L^{(n)}(κ_t^{(n)}) + λ κ_t^{(n)}
            κ_{t+1/2}^{(n)} = κ_t^{(n)} − β grad(κ_t^{(n)}) / sqrt(Σ_{i=1}^t grad^2(κ_i^{(n)}))
            κ_{t+1}^{(n)} = [κ_{t+1/2}^{(n)}]_+
        end for
    end for
4.4   Computational efficiency

The computational complexity of our model depends on the complexity R^3 of multiplying matrices of
size R × R and the complexity (4/3)R^3 of the QR decomposition (Layton and Sussman, 2020;
Trefethen and Bau, 1997). Since the number of elements in these matrices is d = R^2, we can
express the complexity of our model in terms of the embedding dimension. In this view, the time
complexity is O(kd^{1.5}), where k is the size of the context window. As shown in Table 2, our
model has computational benefits in comparison with Transformers due to the linear dependence of
its time complexity on the word sequence length. We note that in comparison with Bi-LSTM models
like ELMo, our model has lower complexity in the embedding dimension and can be computed in
parallel using the associativity property of matrix multiplication. Although our model has a
higher theoretical time complexity than the vector space models, the real gap between them is
relatively small at an ordinary embedding dimension (∼400).

    Method            Time          Space
    PV-DBOW           O(kd)         O(d)
    PV-DM (Concat)    O(kd)         O(kd)
    ELMo              O(kld^2)      O(ld^2)
    BERT              O(k^2 hld)    O(khld)
    RDM (Ours)        O(kd^{1.5})   O(d)

    Table 2: Comparison of the time and space complexity of several document embedding models,
    where k is the size of the context window, d the embedding dimension, l the number of layers,
    and h the number of heads. The time complexity of the other vector space models discussed here
    is equivalent to that of PV-DBOW.

5   Numerical experiments

5.1   Experimental setup

Downstream Linear Protocol. We estimate the quality of the pre-trained representations on
multiclass document categorization tasks on the 20 Newsgroups dataset and the ArXiv-based long
document dataset (He et al., 2019). We choose these datasets for our benchmarks because they
differ significantly in average document length. This implies that the statistics of long n-grams
differ between these datasets too, and in the case of the ArXiv dataset the n-gram statistics are
significantly better converged than in the case of 20 Newsgroups. This fact allows us to
hypothesize that matrix-space models should overfit less on the ArXiv dataset.

We fix the document embeddings and optimize multinomial logistic regression with the SAGA
optimizer and l2-norm regularization. Instead of a test set, we use nested 10-fold
cross-validation to estimate statistical significance using the Wilcoxon signed-rank test
(Japkowicz and Shah, 2011; Dror et al., 2018). For each fold we estimate the l2-regularization
hyperparameter on a 10-point logarithmic grid from 0.01 to 100 using an additional 10-fold
cross-validation with the macro-averaged F1 score. For text preprocessing, we use CountVectorizer
from the Scikit-learn package.
Additionally, we remove words which occur in the NLTK stopwords list or occur in only one
document. In the case of the ArXiv dataset we use half of the dataset and keep only documents in
the range from 1000 to 5000 words (smaller documents are removed and bigger documents are
truncated to the first 5000 words).

    Dataset          #cls    |W|       |D|      #w
    20 Newsgroups    20      75752     18846    180
    ArXiv            6       251108    16371    3829

    Table 3: Datasets considered in the paper, where #cls is the number of classes, #w the average
    number of words in the documents, |W| the vocabulary size, and |D| the number of documents.
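A condensed sketch of this linear evaluation protocol (scikit-learn based; the embedding matrix
X_emb and labels y below are random placeholders, and mapping the paper's regularization grid onto
scikit-learn's inverse-regularization parameter C is an assumption): multinomial logistic
regression with the SAGA solver, the l2 strength chosen by an inner 10-fold cross-validation with
macro-F1, and an outer 10-fold cross-validation for the reported scores.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

    # Placeholders: X_emb holds fixed document embeddings (n_docs, d), y the class labels.
    rng = np.random.default_rng(5)
    X_emb = rng.standard_normal((200, 16))
    y = rng.integers(0, 4, size=200)

    inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

    # Inner loop: pick the l2 strength (here C = 1/lambda) on a 10-point log grid by macro-F1.
    clf = GridSearchCV(
        LogisticRegression(solver="saga", penalty="l2", max_iter=5000),
        param_grid={"C": np.logspace(-2, 2, 10)},
        scoring="f1_macro",
        cv=inner,
    )

    # Outer loop: nested 10-fold cross-validation of the tuned linear classifier.
    scores = cross_val_score(clf, X_emb, y, cv=outer, scoring="f1_macro")
    print(f"nested-CV macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")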
                                                         ELMo models for our datasets, because in our ex-
                                                         periments it doesn’t give any positive effect on
Baselines. We compare our model with differ-             the final performance of these models. However,
ent vector space models: paragraph vectors (Le           this is not true for Sentence BERT. Fine-tuning
and Mikolov, 2014), weighted combinations of             slightly improve performance of this model on the
the word2vec skipgram vectors (Mikolov et al.,           20 newsgroup (50 epoch with maximum margin
2013) (average, TF-IDF and SIF (Arora et al.,            triplet loss).
2017)), Doc2vecC (Chen, 2017), sent2vec (Pagliar-           We should notice that this experimental design
dini et al., 2018) and recently proposed JoSe (Meng      gives some benefits to neural network models in
et al., 2019). The comparison with the last two of       comparison with log-multilinear models, but it is
these models seems to be more informative than           more consistent with the ordinary practical use case
with others because of some similarity of these          of the Transformers and RNN models. However,
models to our model (sent2vec can use n-gram in-         the next experiment will show that log-multilinear
formation and JoSe also based on the spherical type      models can still outperform pre-trained neural net-
of embedding geometry).                                  works.
   Due to the large number of possible values of hy-
perparameters for each model, we used the default        Ablation study. For ablation study, we use dif-
values proposed by the authors of these models or        ferent settings of our model. RDM means rotation
proposed in subsequent studies of these models like      document model, i.e. our model. RDM-R means
in the case of paragraph vectors (Lau and Baldwin,       our model without rotation group constraints. By
2016). We modify only the min count to 1 and             (1) we mean model which utilize only unigrams
window size to be equal to the number of negative        and documents co-occurance information. By (3)
samples for paragraph2vec and word2vec models            and (5) we mean model which utilizes informa-
because it gives better results for these methods in     tion from (1, 2, 3)-grams and (1, 2, 3, 4, 5)-grams
our experiments. We choose n-grams number equal          respectively. For RDM, we use 1e-2 and 1e-3 as
to 1 for sent2vec, because other values doesn’t im-      learning rates of Radagrad and projected Adagrad
prove results. To preserve the fairness of compar-       respectively and we use λ = 15 for 20 newsgroups
ison we use fixed embedding dimension, number            and change λ to 5 for ArXiv dataset due to smaller
of negative samples, and number of epochs for all        number of epoch in this experiment. For RDM-R
models including ours. These values try to mimic         we use 1e-2 as learning rate for Adagrad.
the usual values of these hyperparameters in prac-
tice.                                                    5.2   Experimental results and comparison of
   We compare our model not only with vector mod-              performance
els but also with neural network models. We use          Comparison to baselines. As one may observe
5.5B ELMo (Peters et al., 2018) version which            in the Table (4), our models yield results compara-
is pre-trained on Wikipedia (1.9B) and all of the        ble or outperforming the baseline methods, includ-
monolingual news crawl data from WMT 2008-               ing the simpler log-multilinear models (e.g. Skip-
2012 (3.6B). ELMo embedding dimension is equal           gram) and more complex models featuring nonlin-
to 1024. Also we use 768-dimensional embedding           ear transformations, such as recurrences (ELMo)

    Models               20 Newsgroups                      ArXiv
                         Acc    Prec   Rec    F1            Acc    Prec   Rec    F1
    PV-DBOW              88.7   88.4   88.2   88.2          89.1   89.2   89.2   89.2
    PV-DM                77.2   76.8   76.5   76.5          42.2   42.3   42.0   41.9
    Skipgram+Average     90.3   90.2   90.0   90.0          92.4   92.3   92.3   92.3
    Skipgram+TF-IDF      90.4   90.2   90.1   90.1          92.6   92.5   92.4   92.4
    Skipgram+SIF         90.4   90.2   90.1   90.1          92.4   92.3   92.3   92.3
    Sent2vec             87.9   87.6   87.5   87.5          91.9   91.7   91.7   91.7
    Doc2vecC             90.0   89.8   89.7   89.7          93.2   93.1   93.1   93.1
    JoSe                 87.8   87.6   87.4   87.4          91.3   91.3   91.2   91.2
    ELMo                 79.2   78.8   78.8   78.7          91.6   91.5   91.4   91.4
    BERT                 74.3   73.6   73.6   73.5          92.3   92.2   92.1   92.1
    Sentence BERT        79.5   79.1   79.0   79.0          89.0   88.9   88.8   88.8
    RDM-R (1)            86.8   86.4   86.3   86.3          94.0*  93.9*  93.8*  93.8*
    RDM-R (3)            87.9   87.6   87.4   87.4          94.5*  94.4*  94.4*  94.4*
    RDM-R (5)            88.3   88.0   87.9   87.9          94.6*  94.4*  94.4*  94.4*
    RDM (1)              89.3   89.2   89.0   89.0          94.0*  93.9*  93.9*  93.9*
    RDM (3)              90.7   90.5   90.4   90.4          94.0*  93.9*  93.9*  93.9*
    RDM (5)              91.1*  90.9*  90.8*  90.8*         94.0*  93.9*  93.9*  93.9*

    Table 4: Text classification performance on the 20 Newsgroups (short documents) and the
    modified ArXiv (long documents) datasets. We fix the number of epochs to 50, the embedding
    dimension to 400, and the number of negative samples to 15 for all models on 20 Newsgroups. On
    the ArXiv dataset, we use the same hyperparameters except for the number of epochs, which is
    equal to 5. We use macro-averaging for Precision, Recall, and F1.

Comparison to baselines. As one may observe in Table 4, our models yield results comparable to or
outperforming the baseline methods, including the simpler log-multilinear models (e.g., Skipgram)
and more complex models featuring nonlinear transformations such as recurrences (ELMo) and
Transformer blocks (BERT, Sentence BERT). More specifically, on 20 Newsgroups, the RDM (5) model
yields the best results, significantly² outperforming all the listed baseline approaches. It is
interesting that, contrary to 20 Newsgroups, all RDM variants with any number of n-gram sets show
strong results and significantly outperform the other models on the ArXiv dataset. Weighted
combinations of the skipgram vectors and the Doc2vecC model achieve the results closest to ours.
This confirms that neural models like ELMo and BERT are not the best choice for all datasets, and
log-multilinear models can outperform them. We can see that the performance of the nonlinear
models increases if we use a large document dataset, and BERT can outperform some of the
log-multilinear approaches (PV-DBOW, JoSe and sent2vec), but its result is still not at the top
level.

Comparison between our models. Observing the results, we can see that our model increases the
classification performance when an n-gram set with a bigger length is added. This property holds
both for the model with rotation group constraints and for the model without such constraints.
Despite this fact, as we can see, the model without rotation constraints is less robust with
respect to the noisy statistics of the small document dataset and achieves performance below the
PV-DBOW model, while the rotation group model outperforms all other models. However, once we move
to a dataset with a larger average document length (ArXiv), RDM-R performs better than all other
models, including RDM. This confirms our hypothesis that the existence of good statistics of long
n-grams is of critical value for matrix-space models. Due to the strong associativity between sets
of n-grams, our model needs more parameters to approximate all the co-occurrences. RDM-R has a
higher number of degrees of freedom than RDM (R^2 vs. R(R−1)/2 per matrix). We think that this
allows RDM-R to outperform RDM in this experiment. This intuition is also confirmed by the fact
that RDM (1) and RDM-R (1) have the same performance level, and only when we increase the number
of n-gram sets from 1 to 3 can RDM-R achieve better performance. However, if we increase the
number of n-gram sets from 3 to 5, both models stay at the same level of performance. This is a
sign that we need to use a bigger embedding dimension if we want to achieve even better results.

² We mark a result with * if the comparison of our model with the baseline models has p < 0.05.
In summary, if we have short documents, it is better to use RDM. For long document datasets with a
restriction on the embedding dimension, we suggest relaxing the rotation group constraints. This
trick allows RDM to use more degrees of freedom to estimate the data precisely.

6   Conclusion

In this paper, we proposed a novel unsupervised representation learning method based on the
generalized tensor chain with rotation group constraints, which can utilize higher-order word
interactions while preserving most of the computational efficiency and interpretability of
vector-based models. Our model achieves state-of-the-art results in document classification
benchmarks on the 20 Newsgroups and modified ArXiv datasets. A further direction of research could
focus on adding tensor kernel functions to the model to eliminate problems with the dependence on
the embedding dimension. It could also be interesting to augment this type of model with different
loss functions based not only on n-gram-document interactions but also on word-word interactions
from a knowledge graph or document-document interactions from the citation graph of the documents.

Acknowledgments

The Zhores supercomputer (Zacharov et al., 2019) was used for computations. This research was
partially supported by the Ministry of Education and Science of the Russian Federation (grant
14.756.31.0001). The work of Alexander Panchenko was partially supported by the MTS-Skoltech
laboratory.

References

Pierre-Antoine Absil, Robert E. Mahony, and Rodolphe Sepulchre. 2008. Optimization Algorithms on
Matrix Manifolds. Princeton University Press.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable
model approach to PMI-based word embeddings. Transactions of the Association for Computational
Linguistics, 4:385–399.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence
embeddings. In 5th International Conference on Learning Representations, ICLR 2017, Toulon,
France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Shima Asaadi and Sebastian Rudolph. 2017. Gradual learning of matrix-space models of language for
sentiment analysis. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages
178–185, Vancouver, Canada. Association for Computational Linguistics.

Eric Bailey and Shuchin Aeron. 2017. Word embeddings via tensor factorization. CoRR,
abs/1704.02686.

Gary Bécigneul and Octavian-Eugen Ganea. 2019. Riemannian adaptive optimization methods. In 7th
International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,
2019. OpenReview.net.

Minmin Chen. 2017. Efficient vector representation for documents through corruption. In 5th
International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26,
2017, Conference Track Proceedings. OpenReview.net.

Ting Chen, Yizhou Sun, Yue Shi, and Liangjie Hong. 2017. On sampling strategies for neural
network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017,
pages 767–776. ACM.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for
statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational
Linguistics.

Michael Collins, Sanjoy Dasgupta, and Robert E. Schapire. 2001. A generalization of principal
components analysis to the exponential family. In Advances in Neural Information Processing
Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8,
2001, Vancouver, British Columbia, Canada], pages 617–624. MIT Press.

Tim Van de Cruys, Thierry Poibeau, and Anna Korhonen. 2013. A tensor-based factorization model of
semantic compositionality. In Proceedings of the 2013 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pages 1142–1151,
Atlanta, Georgia. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
Association for Computational Linguistics.

Bhuwan Dhingra, Christopher Shallue, Mohammad Norouzi, Andrew Dai, and George Dahl. 2018.
Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on Graph-Based Methods
for Natural Language Processing (TextGraphs-12), pages 59–69, New Orleans, Louisiana, USA.
Association for Computational Linguistics.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker's guide to testing
statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392,
Melbourne, Australia. Association for Computational Linguistics.

John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online
learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159.

Abraham Frandsen and Rong Ge. 2019. Understanding composition of word embeddings via tensor
decomposition. In 7th International Conference on Learning Representations, ICLR 2019, New
Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Daniel Gabay. 1982. Minimizing a differentiable function over a differential manifold. Journal of
Optimization Theory and Applications, 37:177–219.

Michael Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical
models, with applications to natural image statistics. J. Mach. Learn. Res., 13:307–361.

Jun He, Liqun Wang, Liu Liu, Jiao Feng, and Hao Wu. 2019. Long document classification from local
word glimpses via recurrent attention learning. IEEE Access, 7:40707–40718.

Nathalie Japkowicz and Mohak Shah, editors. 2011. Evaluating Learning Algorithms: A Classification
Perspective. Cambridge University Press.

Boris N. Khoromskij. 2011. O(d log n)-quantics approximation of n-d tensors in high-dimensional
numerical modeling. Constructive Approximation, 34:257–280.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the
2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751,
Doha, Qatar. Association for Computational Linguistics.

Sachin Kumar and Yulia Tsvetkov. 2019. Von Mises-Fisher loss for training sequence to sequence
models with continuous outputs. In 7th International Conference on Learning Representations, ICLR
2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016.
Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.

Jey Han Lau and Timothy Baldwin. 2016. An empirical evaluation of doc2vec with practical insights
into document embedding generation. In Proceedings of the 1st Workshop on Representation Learning
for NLP, pages 78–86, Berlin, Germany. Association for Computational Linguistics.

William Layton and Myron Sussman. 2020. Numerical Linear Algebra. WSPC.

Quoc V. Le and Tomás Mikolov. 2014. Distributed representations of sentences and documents. In
Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China,
21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1188–1196. JMLR.org.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In
Advances in Neural Information Processing Systems, volume 27, pages 2177–2185. Curran Associates,
Inc.

Zhuang Ma and Michael Collins. 2018. Noise contrastive estimation and negative sampling for
conditional models: Consistency and statistical efficiency. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4,
2018, pages 3698–3707. Association for Computational Linguistics.

Florian Mai, Lukas Galke, and Ansgar Scherp. 2019. CBOW is not all you need: Combining CBOW with
the compositional matrix space model. In 7th International Conference on Learning Representations,
ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Yu Meng, Jiaxin Huang, Guangyuan Wang, Chao Zhang, Honglei Zhuang, Lance M. Kaplan, and Jiawei
Han. 2019. Spherical text embedding. In Advances in Neural Information Processing Systems 32:
Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14,
2019, Vancouver, BC, Canada, pages 8206–8215.

Tomás Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed
representations of words and phrases and their compositionality. In Advances in Neural Information
Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013.
Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages
3111–3119.

                                                       1683
Lake Tahoe, Nevada, United States, pages 3111–        Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora,
  3119.                                                   Mikhail Khodak, and Hrishikesh Khandeparkar.
                                                          2019. A theoretical analysis of contrastive unsuper-
Maximilian Nickel and Douwe Kiela. 2018. Learning         vised representation learning. In Proceedings of the
 continuous hierarchies in the lorentz model of hy-       36th International Conference on Machine Learn-
 perbolic geometry. In Proceedings of the 35th In-        ing, ICML 2019, 9-15 June 2019, Long Beach, Cali-
 ternational Conference on Machine Learning, ICML         fornia, USA, volume 97 of Proceedings of Machine
 2018, Stockholmsmässan, Stockholm, Sweden, July         Learning Research, pages 5628–5637. PMLR.
 10-15, 2018, volume 80 of Proceedings of Machine
 Learning Research, pages 3776–3785. PMLR.              Andrew M. Saxe, James L. McClelland, and Surya
                                                          Ganguli. 2014. Exact solutions to the nonlinear dy-
Maximillian Nickel and Douwe Kiela. 2017. Poincaré       namics of learning in deep linear neural networks.
 embeddings for learning hierarchical representa-         In 2nd International Conference on Learning Rep-
 tions. In I. Guyon, U. V. Luxburg, S. Bengio,            resentations, ICLR 2014, Banff, AB, Canada, April
 H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar-      14-16, 2014, Conference Track Proceedings.
 nett, editors, Advances in Neural Information Pro-
 cessing Systems 30, pages 6338–6347. Curran Asso-      Vatsal Sharan and Gregory Valiant. 2017. Orthogonal-
 ciates, Inc.                                             ized ALS: A theoretically principled tensor decom-
                                                          position algorithm for practical use. In Proceedings
Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi.      of the 34th International Conference on Machine
 2018. Unsupervised learning of sentence embed-           Learning, volume 70 of Proceedings of Machine
 dings using compositional n-gram features. In Pro-       Learning Research, pages 3095–3104, International
 ceedings of the 2018 Conference of the North Amer-       Convention Centre, Sydney, Australia. PMLR.
 ican Chapter of the Association for Computational
 Linguistics: Human Language Technologies, Vol-         Lloyd N. Trefethen and David Bau III. 1997. Numeri-
 ume 1 (Long Papers), pages 528–540, New Orleans,         cal Linear Algebra, first edition. SIAM: Society for
 Louisiana. Association for Computational Linguis-        Industrial and Applied Mathematics.
 tics.
                                                        Ainur Yessenalina and Claire Cardie. 2011. Compo-
Jeffrey Pennington, Richard Socher, and Christopher       sitional matrix-space models for sentiment analysis.
   Manning. 2014. GloVe: Global vectors for word          In Proceedings of the 2011 Conference on Empiri-
   representation. In Proceedings of the 2014 Confer-     cal Methods in Natural Language Processing, pages
   ence on Empirical Methods in Natural Language          172–182, Edinburgh, Scotland, UK. Association for
   Processing (EMNLP), pages 1532–1543, Doha,             Computational Linguistics.
   Qatar. Association for Computational Linguistics.
                                                        Yusuf Kenan Yilmaz, Ali Taylan Cemgil, and Umut
David Perez-Garcia, Frank Verstraete, Michael M.          Simsekli. 2011. Generalised coupled tensor factori-
  Wolf, and J. Ignacio Cirac. 2007. Matrix prod-          sation. In Advances in Neural Information Process-
  uct state representations. Quantum Info. Comput.,       ing Systems 24: 25th Annual Conference on Neural
  7(5):401–430.                                           Information Processing Systems 2011. Proceedings
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt           of a meeting held 12-14 December 2011, Granada,
 Gardner, Christopher Clark, Kenton Lee, and Luke         Spain, pages 2151–2159.
 Zettlemoyer. 2018. Deep contextualized word rep-
                                                        Igor Zacharov, Rinat Arslanov, Maksim Gunin, Daniil
 resentations. In Proceedings of the 2018 Confer-
                                                           Stefonishin, Andrey Bykov, Sergey Pavlov, Oleg
 ence of the North American Chapter of the Associ-
                                                           Panarin, Anton Maliutin, Sergey Rykovanov, and
 ation for Computational Linguistics: Human Lan-
                                                           Maxim Fedorov. 2019. Zhores — petaflops super-
 guage Technologies, Volume 1 (Long Papers), pages
                                                           computer for data-driven modeling, machine learn-
 2227–2237, New Orleans, Louisiana. Association
                                                           ing and artificial intelligence installed in skolkovo
 for Computational Linguistics.
                                                           institute of science and technology. Open Engineer-
Nils Reimers and Iryna Gurevych. 2019. Sentence-           ing, 9(1):512–520.
  BERT: Sentence embeddings using Siamese BERT-
  networks. In Proceedings of the 2019 Conference on    Qibin Zhao, Masashi Sugiyama, Longhao Yuan, and
  Empirical Methods in Natural Language Processing        Andrzej Cichocki. 2019. Learning efficient ten-
  and the 9th International Joint Conference on Natu-     sor representations with ring-structured networks.
  ral Language Processing (EMNLP-IJCNLP), pages           In IEEE International Conference on Acoustics,
  3982–3992, Hong Kong, China. Association for            Speech and Signal Processing, ICASSP 2019,
  Computational Linguistics.                              Brighton, United Kingdom, May 12-17, 2019, pages
                                                          8608–8612. IEEE.
Sebastian Rudolph and Eugenie Giesbrecht. 2010.
  Compositional matrix-space models of language. In
  Proceedings of the 48th Annual Meeting of the As-
  sociation for Computational Linguistics, pages 907–
  916, Uppsala, Sweden. Association for Computa-
  tional Linguistics.
