Matching Article Pairs with Graphical Decomposition and Convolutions
Bang Liu†, Di Niu†, Haojie Wei‡, Jinghong Lin‡, Yancheng He‡, Kunfeng Lai‡, Yu Xu‡
† University of Alberta, Edmonton, AB, Canada
  {bang3, dniu}@ualberta.ca
‡ Platform and Content Group, Tencent, Shenzhen, China
  {fayewei, daphnelin, collinhe, calvinlai, henrysxu}@tencent.com

Abstract

Identifying the relationship between two articles, e.g., whether two articles published from different sources describe the same breaking news, is critical to many document understanding tasks. Existing approaches for modeling and matching sentence pairs do not perform well in matching longer documents, which embody more complex interactions between the enclosed entities than a sentence does. To model article pairs, we propose the Concept Interaction Graph to represent an article as a graph of concepts. We then match a pair of articles by comparing the sentences that enclose the same concept vertex through a series of encoding techniques, and aggregate the matching signals through a graph convolutional network. To facilitate the evaluation of long article matching, we have created two datasets, each consisting of about 30K pairs of breaking news articles covering diverse topics in the open domain. Extensive evaluations of the proposed methods on the two datasets demonstrate significant improvements over a wide range of state-of-the-art methods for natural language matching.

1   Introduction

Identifying the relationship between a pair of articles is an essential natural language understanding task, which is critical to news systems and search engines. For example, a news system needs to cluster various articles on the Internet reporting the same breaking news (probably with different wording and narratives), remove redundancy and form storylines (Shahaf et al., 2013; Liu et al., 2017; Zhou et al., 2015; Vossen et al., 2015; Bruggermann et al., 2016). The rich semantic and logic structures in longer documents make it a different and more challenging task to match a pair of articles than to match a pair of sentences or a query-document pair in information retrieval.

Traditional term-based matching approaches estimate the semantic distance between a pair of text objects via unsupervised metrics, e.g., via TF-IDF vectors, BM25 (Robertson et al., 2009), LDA (Blei et al., 2003) and so forth. These methods have achieved success in query-document matching, information retrieval and search. In recent years, a wide variety of deep neural network models have also been proposed for text matching (Hu et al., 2014; Qiu and Huang, 2015; Wan et al., 2016; Pang et al., 2016), which can capture the semantic dependencies (especially sequential dependencies) in natural language through layers of recurrent or convolutional neural networks. However, existing deep models are mainly designed for matching sentence pairs, e.g., for paraphrase identification or answer selection in question answering, omitting the complex interactions among keywords, entities or sentences that are present in a longer article. Therefore, article pair matching remains under-explored in spite of its importance.

In this paper, we apply the divide-and-conquer philosophy to matching a pair of articles and bring deep text understanding from the currently dominating sequential modeling of language elements to a new level of graphical document representation, which is more suitable for longer articles. Specifically, we have made the following contributions:

First, we propose the so-called Concept Interaction Graph (CIG) to represent a document as a weighted graph of concepts, where each concept vertex is either a keyword or a set of tightly connected keywords. The sentences in the article associated with each concept serve as the features for local comparison to the same concept appearing in another article. Furthermore, two concept vertices in an article are also connected by a weighted edge which indicates their interaction strength.
The CIG not only captures the essential semantic units in a document but also offers a way to perform anchored comparison between two articles along the common concepts found.

Second, we propose a divide-and-conquer framework to match a pair of articles based on the constructed CIGs and graph convolutional networks (GCNs). The idea is that for each concept vertex that appears in both articles, we first obtain the local matching vectors through a range of text pair encoding schemes, including both neural encoding and term-based encoding. We then aggregate the local matching vectors into the final matching result through graph convolutional layers (Kipf and Welling, 2016; Defferrard et al., 2016). In contrast to RNN-based sequential modeling, our model factorizes the matching process into local matching sub-problems on a graph, each focusing on a different concept, and by using GCN layers, generates matching results based on a holistic view of the entire graph.

Although there exist many datasets for sentence matching, the semantic matching between longer articles is a largely unexplored area. To the best of our knowledge, to date, there does not exist a labeled public dataset for long document matching. To facilitate evaluation and further research on document and especially news article matching, we have created two labeled datasets¹, one annotating whether two news articles found on the Internet (from different media sources) report the same breaking news event, while the other annotating whether they belong to the same news story (yet not necessarily reporting the same breaking news event). These articles were collected from major Internet news providers in China, including Tencent, Sina, WeChat, Sohu, etc., covering diverse topics, and were labeled by professional editors. Note that, similar to most other natural language matching models, all the approaches proposed in this paper can easily work on other languages as well.

¹ Our code and datasets are available at: https://github.com/BangLiu/ArticlePairMatching

Through extensive experiments, we show that our proposed algorithms have achieved significant improvements on matching news article pairs, as compared to a wide range of state-of-the-art methods, including both term-based and deep text matching algorithms. With the same encoding or term-based feature representation of a pair of articles, our approach based on graphical decomposition and convolutions can improve the classification accuracy by 17.31% and 23.09% on the two datasets, respectively.

Figure 1: An example to show a piece of text and its Concept Interaction Graph representation. The text contains six sentences: [1] Rick asks Morty to travel with him in the universe. [2] Morty doesn't want to go as Rick always brings him dangerous experiences. [3] However, the destination of this journey is the Candy Planet, which is a fascinating place that attracts Morty. [4] The planet is full of delicious candies. [5] Summer wishes to travel with Rick. [6] However, Rick doesn't like to travel with Summer. The resulting graph contains three concept vertices, (Rick, Morty) with sentences [1, 2], (Morty, Candy Planet) with sentences [3, 4], and (Rick, Summer) with sentences [5, 6], connected by weighted edges.

2   Concept Interaction Graph

In this section, we present our Concept Interaction Graph (CIG) to represent a document as an undirected weighted graph, which decomposes a document into subsets of sentences, each subset focusing on a different concept. Given a document D, a CIG is a graph GD, where each vertex in GD is called a concept, which is a keyword or a set of highly correlated keywords in document D. Each sentence in D will be attached to the single concept vertex that it is most related to, which most frequently is the concept the sentence mentions. Hence, vertices will have their own sentence sets, which are disjoint. The weight of the edge between a pair of concepts denotes how much the two concepts are related to each other and can be determined in various ways.

As an example, Fig. 1 illustrates how we convert a document into a Concept Interaction Graph. We can extract the keywords Rick, Morty, Summer, and Candy Planet from the document using standard keyword extraction algorithms, e.g., TextRank (Mihalcea and Tarau, 2004). These keywords are further clustered into three concepts, where each concept is a subset of highly correlated keywords. After grouping keywords into concepts, we attach each sentence in the document to its most related concept vertex. For example, in Fig. 1, sentences 1 and 2 are mainly talking about the relationship between Rick and Morty, and are thus attached to the concept (Rick, Morty). Other sentences are attached to vertices in a similar way. The attachment of sentences to concepts naturally dissects the original document into multiple disjoint sentence subsets. As a result, we have represented the original document with a graph of key concepts, each with a sentence subset, as well as the interaction topology among them.
Figure 2: An overview of our approach for constructing the Concept Interaction Graph (CIG) from a pair of documents and classifying it by Graph Convolutional Networks. (a) Representation: construct the KeyGraph by word co-occurrence, detect concepts by community detection, assign sentences by similarities, and get edge weights by vertex similarities. (b) Encoding: a Siamese encoder and a term-based feature extractor turn the sentence pair on each vertex into vertex features. (c) Transformation: GCN layers transform the vertex features. (d) Aggregation: Siamese, term-based and global matching features are concatenated and classified into the final result.

Fig. 2 (a) illustrates the construction of CIGs for a pair of documents aligned by the discovered concepts. Here we first describe the detailed steps to construct a CIG for a single document:

KeyGraph Construction. Given a document D, we first extract the named entities and keywords by TextRank (Mihalcea and Tarau, 2004). After that, we construct a keyword co-occurrence graph, called KeyGraph, based on the set of found keywords. Each keyword is a vertex in the KeyGraph. We connect two keywords by an edge if they co-occur in the same sentence.

We could further improve our model by performing co-reference resolution and synonym analysis to merge keywords with the same meaning. However, we do not apply these operations due to their time complexity.

Concept Detection (Optional). The structure of the KeyGraph reveals the connections between keywords. If a subset of keywords is highly correlated, it will form a densely connected subgraph in the KeyGraph, which we call a concept. Concepts can be extracted by applying community detection algorithms on the constructed KeyGraph. Community detection splits a KeyGraph Gkey into a set of communities C = {C1, C2, ..., C|C|}, where each community Ci contains the keywords for a certain concept. By using overlapping community detection, each keyword may appear in multiple concepts. As the number of concepts in different documents varies a lot, we utilize the betweenness-centrality-based algorithm (Sayyadi and Raschid, 2013) to detect keyword communities in the KeyGraph. Note that this step is optional, i.e., we can also use each keyword directly as a concept. The benefit brought by concept detection is that it reduces the number of vertices in a graph and speeds up matching, as will be shown in Sec. 4.

Sentence Attachment. After the concepts are discovered, the next step is to group sentences by concepts. We calculate the cosine similarity between each sentence and each concept, where sentences and concepts are represented by TF-IDF vectors. We assign each sentence to the concept that is most similar to it. Sentences that do not match any concept in the document are attached to a dummy vertex that does not contain any keywords.

Edge Construction. To construct edges that reveal the correlations between different concepts, for each vertex, we represent its sentence set as a concatenation of the sentences attached to it, and calculate the edge weight between any two vertices as the TF-IDF similarity between their sentence sets. Although edge weights may be decided in other ways, our experience shows that constructing edges by TF-IDF similarity generates a CIG that is more densely connected.

When performing article pair matching, the above steps are applied to a pair of documents DA and DB, as shown in Fig. 2 (a). The only additional step is that we align the CIGs of the two articles by the concept vertices, and for each common concept vertex, merge the sentence sets from DA and DB for local comparison. The sketch below illustrates the single-document steps.
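This is a minimal sketch of the pipeline, assuming networkx and scikit-learn; the helper name build_cig is ours, networkx's Girvan-Newman routine stands in for the betweenness-based detector of Sayyadi and Raschid (2013), and the dummy vertex for unmatched sentences is omitted.

```python
import itertools
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_cig(sentences, keywords_per_sentence):
    """Construct a CIG: KeyGraph -> concepts -> sentence attachment -> edges.

    `keywords_per_sentence` is assumed to come from an upstream extractor
    such as TextRank; here it is simply a list of keyword sets, one per sentence.
    """
    # 1. KeyGraph: keywords are vertices; co-occurrence in a sentence adds an edge.
    keygraph = nx.Graph()
    for kws in keywords_per_sentence:
        keygraph.add_nodes_from(kws)
        keygraph.add_edges_from(itertools.combinations(kws, 2))

    # 2. Concept detection (optional): community detection on the KeyGraph.
    if keygraph.number_of_edges() > 0:
        communities = next(nx.algorithms.community.girvan_newman(keygraph))
    else:
        communities = [set(keygraph.nodes)]
    concepts = [" ".join(sorted(c)) for c in communities]

    # 3. Sentence attachment: assign each sentence to its most similar
    #    concept under TF-IDF cosine similarity.
    vec = TfidfVectorizer().fit(sentences + concepts)
    sims = cosine_similarity(vec.transform(sentences), vec.transform(concepts))
    sentence_sets = {i: [] for i in range(len(concepts))}
    for s_idx, row in enumerate(sims):
        sentence_sets[row.argmax()].append(sentences[s_idx])

    # 4. Edge construction: TF-IDF similarity between concatenated sentence sets.
    docs = [" ".join(sentence_sets[i]) for i in range(len(concepts))]
    w = cosine_similarity(vec.transform(docs))
    edges = {(i, j): w[i, j] for i in range(len(concepts))
             for j in range(i + 1, len(concepts)) if w[i, j] > 0}
    return concepts, sentence_sets, edges
```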
3   Article Pair Matching through Graph Convolutions

Given the merged CIG GAB of two documents DA and DB described in Sec. 2, we match a pair of articles in a "divide-and-conquer" manner by matching the sentence sets from DA and DB associated with each concept and aggregating local matching results into a final result through multiple graph convolutional layers. Our approach overcomes the limitation of previous text matching algorithms by extending text representation from a sequential (or grid) point of view to a graphical view, and can therefore better capture the rich semantic interactions in longer text.

Fig. 2 illustrates the overall architecture of our proposed method, which consists of four steps: a) representing a pair of documents by a single merged CIG, b) learning multi-viewed matching features for each concept vertex, c) structurally transforming local matching features by graph convolutional layers, and d) aggregating local matching features to get the final result. Steps (b)-(d) can be trained end-to-end.

Encoding Local Matching Vectors. Given the merged CIG GAB, our first step is to learn an appropriate matching vector of a fixed length for each individual concept v ∈ GAB to express the semantic similarity between SA(v) and SB(v), the sentence sets of concept v from documents DA and DB, respectively. This way, the matching of two documents is converted to matching the pair of sentence sets on each vertex of GAB. Specifically, we generate local matching vectors based on both neural networks and term-based techniques.

Siamese Encoder: we apply a Siamese neural network encoder (Neculoiu et al., 2016) onto each vertex v ∈ GAB to convert the word embeddings (Mikolov et al., 2013) of {SA(v), SB(v)} into a fixed-sized hidden feature vector mAB(v), which we call the match vector.

We use a Siamese structure to take SA(v) and SB(v) (which are two sequences of word embeddings) as inputs, and encode them into two context vectors through context layers that share the same weights, as shown in Fig. 2 (b). The context layer usually contains one or multiple bi-directional LSTM (BiLSTM) or CNN layers with max pooling, aiming to capture the contextual information in SA(v) and SB(v).

Let cA(v) and cB(v) denote the context vectors obtained for SA(v) and SB(v), respectively. Then, the matching vector mAB(v) for vertex v is given by the subsequent aggregation layer, which concatenates the element-wise absolute difference and the element-wise multiplication of the two context vectors, i.e.,

    mAB(v) = (|cA(v) − cB(v)|, cA(v) ◦ cB(v)),   (1)

where ◦ denotes the Hadamard product.
Term-based Similarities: we also generate another matching vector for each v by directly calculating term-based similarities between SA(v) and SB(v), based on 5 metrics: the TF-IDF cosine similarity, TF cosine similarity, BM25 cosine similarity, Jaccard similarity of 1-grams, and the Ochiai similarity measure. These similarity scores are concatenated into another matching vector m′AB(v) for v, as shown in Fig. 2 (b).
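The five scores can be sketched as follows with scikit-learn; the BM25 variant shown (cosine similarity between BM25-weighted term vectors) is one reading of "BM25 cosine similarity" and may differ from the released implementation.

```python
import math
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def term_based_matching_vector(sa, sb, corpus):
    """Five term-based similarities between the sentence sets of a vertex.
    `sa`/`sb` are the concatenated sentences of a concept from documents A and B;
    `corpus` is the document collection used to fit the vocabularies."""
    tfidf = TfidfVectorizer().fit(corpus)
    tf = CountVectorizer().fit(corpus)
    tfidf_cos = cosine_similarity(tfidf.transform([sa]), tfidf.transform([sb]))[0, 0]
    tf_cos = cosine_similarity(tf.transform([sa]), tf.transform([sb]))[0, 0]

    # BM25-weighted term vectors (k1 = 1.2; length normalization left out
    # of this sketch), compared by cosine similarity.
    k1 = 1.2
    bm25 = tf.transform([sa, sb]).astype(float)
    bm25.data = bm25.data * (k1 + 1) / (bm25.data + k1)
    bm25_cos = cosine_similarity(bm25[0], bm25[1])[0, 0]

    ta, tb = set(sa.split()), set(sb.split())   # 1-gram sets
    jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    ochiai = len(ta & tb) / math.sqrt(len(ta) * len(tb)) if ta and tb else 0.0
    return [tfidf_cos, tf_cos, bm25_cos, jaccard, ochiai]
```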
Matching Aggregation via GCN. The local matching vectors must be aggregated into a final matching score for the pair of articles. We propose to utilize the ability of Graph Convolutional Network (GCN) filters (Kipf and Welling, 2016) to capture the patterns exhibited in the CIG GAB at multiple scales. In general, the input to the GCN is a graph G = (V, E) with N vertices vi ∈ V, and edges eij = (vi, vj) ∈ E with weights wij. The input also contains a vertex feature matrix denoted by X = {xi} (i = 1, ..., N), where xi is the feature vector of vertex vi. For a pair of documents DA and DB, we input their CIG GAB (with N vertices) with a (concatenated) matching vector on each vertex into the GCN, such that the feature vector of vertex vi is given by

    xi = (mAB(vi), m′AB(vi)).

Now let us briefly describe the GCN layers (Kipf and Welling, 2016) used in Fig. 2 (c). Denote the weighted adjacency matrix of the graph as A ∈ R^(N×N), where Aij = wij (in a CIG, it is the TF-IDF similarity between vertices i and j). Let D be a diagonal matrix such that Dii = Σj Aij. The input layer to the GCN is H^(0) = X, which contains the original vertex features. Let H^(l) ∈ R^(N×Ml) denote the matrix of hidden representations of the vertices in the l-th layer. Then each GCN layer applies the following graph convolutional filter to the previous hidden representations:

    H^(l+1) = σ(D̃^(−1/2) Ã D̃^(−1/2) H^(l) W^(l)),   (2)

where σ(·) denotes a nonlinear activation, Ã = A + IN, IN is the identity matrix, and D̃ is a diagonal matrix such that D̃ii = Σj Ãij. They are the adjacency matrix and the degree matrix of the graph G with self-loops added, respectively.
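A minimal PyTorch sketch of the propagation rule in Eq. (2), operating on a dense adjacency matrix; the vertex features are built by concatenating the Siamese match vector and the five term-based similarities, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: H' = σ(D̃^{-1/2} Ã D̃^{-1/2} H W), Eq. (2)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim, bias=False)   # W^(l)

    def forward(self, h, adj):                            # h: (N, in_dim), adj: (N, N)
        a_tilde = adj + torch.eye(adj.size(0))            # Ã = A + I_N (self-loops)
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)         # diagonal of D̃^{-1/2}
        norm = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.w(norm @ h))               # σ = ReLU here

# Vertex features: concatenation of the Siamese match vector m_AB(v)
# and the five term-based similarities m'_AB(v).
m_siam = torch.randn(7, 64)        # 7 vertices, 64-d Siamese match vectors
m_term = torch.randn(7, 5)         # 5 term-based similarity scores per vertex
x = torch.cat([m_siam, m_term], dim=1)
adj = torch.rand(7, 7).triu(1)
adj = adj + adj.T                  # symmetric TF-IDF edge weights, zero diagonal
h1 = GCNLayer(69, 16)(x, adj)      # hidden representations of size 16
```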
4   Evaluation

Figure 3: A timeline of tagged breaking news events (e.g., 2016-07-19, 2016-07-26, 2016-10-07, 2016-10-08, 2016-11-02) in the story 2016 U.S. presidential election.
Table 1: Description of evaluation datasets.

Dataset   Pos Samples   Neg Samples   Train    Dev    Test
CNSE      12865         16198         17438    5813   5812
CNSS      16887         16616         20102    6701   6700

As a typical example, Fig. 3 shows the events contained in the story 2016 U.S. presidential election, where each tag shows a breaking news event possibly reported by multiple articles with different narratives (articles not shown here). We group highly coherent events together. For example, there are multiple events about election television debates. One of our objectives is to identify whether two news articles report the same event, e.g., a yes when they are both reporting Trump and Hillary's first television debate, though with different wording, or a no when one article is reporting Trump and Hillary's second television debate while the other is talking about Donald Trump being elected president.

Datasets. To the best of our knowledge, there is no publicly available dataset for long document matching tasks. We created two datasets: the Chinese News Same Event dataset (CNSE) and the Chinese News Same Story dataset (CNSS), which are labeled by professional editors. They contain long Chinese news articles collected from major Internet news providers in China, covering diverse topics in the open domain. The CNSE dataset contains 29,063 pairs of news articles with labels representing whether a pair of news articles is reporting the same breaking news event. Similarly, the CNSS dataset contains 33,503 pairs of articles with labels representing whether two documents fall into the same news story. The average number of words over all documents in the datasets is 734 and the maximum is 21,791.

In our datasets, we only labeled the major event (or story) that a news article is reporting, since in the real world, each breaking news article on the Internet is intended to report some specific breaking news that has just happened to attract clicks and views. Our objective is to determine whether two news articles intend to report the same breaking news.

Note that the negative samples in the two datasets are not randomly generated: we select document pairs that contain similar keywords, and exclude samples with TF-IDF similarity below a certain threshold. The datasets have been made publicly available for research purposes.

Table 1 shows a detailed breakdown of the two datasets. For both datasets, we use 60% of all the samples as the training set, 20% as the development (validation) set, and the remaining 20% as the test set. We carefully ensure that different splits do not contain any overlaps to avoid data leakage. The metrics used for performance evaluation are the accuracy and F1 scores of binary classification results. For each evaluated method, we perform training for 10 epochs and then choose the epoch with the best validation performance to be evaluated on the test set.

Baselines. We test the following baselines:

• Matching by representation-focused or interaction-focused deep neural network models: DSSM (Huang et al., 2013), C-DSSM (Shen et al., 2014), DUET (Mitra et al., 2017), MatchPyramid (Pang et al., 2016), ARC-I (Hu et al., 2014), ARC-II (Hu et al., 2014). We use the implementations from MatchZoo (Fan et al., 2017) for the evaluation of these models.

• Matching by term-based similarities: BM25 (Robertson et al., 2009), LDA (Blei et al., 2003) and SimNet (which extracts the five text-pair similarities mentioned in Sec. 3 and classifies with a multi-layer feedforward neural network).

• Matching by a large-scale pre-trained language model: BERT (Devlin et al., 2018).

Note that we focus on the capability of long text matching. Therefore, we do not use any short text information, such as titles, in our approach or in any baselines. In fact, the "relationship" between two documents is not limited to "whether the same event or not". Our algorithm is able to identify a general relationship between documents, e.g., whether two episodes are from the same season of a TV series. The definition of the relationship (e.g., same event/story, same chapter of a book) is solely defined and supervised by the labeled training data. For these tasks, the availability of other information such as titles cannot be assumed.

As shown in Table 2, we evaluate different variants of our own model to show the effect of different sub-modules. In model names, "CIG" means that we directly use keywords as concepts without community detection, whereas "CIGcd" means that each concept vertex in the CIG contains a set of keywords grouped via community detection.
Table 2: Accuracy and F1-score results of different algorithms on the CNSE and CNSS datasets.

Baselines                 CNSE Acc  CNSE F1  CNSS Acc  CNSS F1
I. ARC-I                  53.84     48.68    50.10     66.58
II. ARC-II                54.37     36.77    52.00     53.83
III. DUET                 55.63     51.94    52.33     60.67
IV. DSSM                  58.08     64.68    61.09     70.58
V. C-DSSM                 60.17     48.57    52.96     56.75
VI. MatchPyramid          66.36     54.01    62.52     64.56
VII. BM25                 69.63     66.60    67.77     70.40
VIII. LDA                 63.81     62.44    62.98     69.11
IX. SimNet                71.05     69.26    70.78     74.50
X. BERT fine-tuning       81.30     79.20    86.64     87.08

Our models                        CNSE Acc  CNSE F1  CNSS Acc  CNSS F1
XI. CIG-Siam                      74.47     73.03    75.32     78.58
XII. CIG-Siam-GCN                 74.58     73.69    78.91     80.72
XIII. CIGcd-Siam-GCN              73.25     73.10    76.23     76.94
XIV. CIG-Sim                      72.58     71.91    75.16     77.27
XV. CIG-Sim-GCN                   83.35     80.96    87.12     87.57
XVI. CIGcd-Sim-GCN                81.33     78.88    86.67     87.00
XVII. CIG-Sim&Siam-GCN            84.64     82.75    89.77     90.07
XVIII. CIG-Sim&Siam-GCN-Simg      84.21     82.46    90.03     90.29
XIX. CIG-Sim&Siam-GCN-BERTg       84.68     82.60    89.56     89.97
XX. CIG-Sim&Siam-GCN-Simg&BERTg   84.61     82.59    89.47     89.71

To generate the matching vector on each vertex, "Siam" indicates the use of the Siamese encoder, while "Sim" indicates the use of the term-based similarity encoder, as shown in Fig. 2. "GCN" means that we convolve the local matching vectors on vertices through GCN layers. Finally, "BERTg" or "Simg" indicates the use of additional global features given by BERT or by the five term-based similarity metrics mentioned in Sec. 3, appended to the graphically merged matching vector mAB for final classification.

Implementation Details. We use Stanford CoreNLP (Manning et al., 2014) for word segmentation (on Chinese text) and named entity recognition. For Concept Interaction Graph construction with community detection, we set the minimum community size (number of keywords contained in a concept vertex) to 2, and the maximum size to 6.

Our neural network model consists of a word embedding layer, a Siamese encoding layer, graph transformation layers, and a classification layer. For embedding, we load the pre-trained word vectors and fix them during training. The embeddings of out-of-vocabulary words are set to zero vectors. For the Siamese encoding network, we use a 1-D convolution with 32 filters, followed by a ReLU layer and a max pooling layer. For graph transformation, we utilize 2 layers of GCN (Kipf and Welling, 2016) for experiments on the CNSS dataset, and 3 layers of GCN for experiments on the CNSE dataset. When the vertex encoder is the five-dimensional term-based features, we set the output size of the GCN layers to 16. When the vertex encoder is the Siamese network encoder, we set the output size of the GCN layers to 128, except for the last layer, whose output size is always set to 16. The classification module consists of a linear layer with output size 16, a ReLU layer, a second linear layer, and finally a sigmoid layer. Note that this classification module is also used for the baseline method SimNet.

As we mentioned in Sec. 1, our code and datasets have been open sourced. We implement our model using PyTorch 1.0 (Paszke et al., 2017). The experiments without BERT are carried out on a MacBook Pro with a 2 GHz Intel Core i7 processor and 8 GB of memory. We use L2 weight decay on all the trainable variables, with parameter λ = 3 × 10⁻⁷. The dropout rate between every two layers is 0.1. We apply gradient clipping with a maximum gradient norm of 5.0. We use the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.8, β2 = 0.999, ε = 10⁻⁸. We use a learning-rate warm-up scheme with an inverse exponential increase from 0.0 to 0.001 in the first 1000 steps, and then maintain a constant learning rate for the remainder of training. For all the experiments, we set the maximum number of training epochs to 10. The sketch below mirrors this optimization setup.
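In the following PyTorch snippet, model is a stand-in for the full matching network, and since the exact shape of the "inverse exponential increase" is not spelled out, the warm-up ramp shown is only one plausible reading.

```python
import math
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(69, 1)      # stand-in for the full matching network

# Adam with the stated hyper-parameters; weight_decay provides the L2 term.
optimizer = Adam(model.parameters(), lr=0.001,
                 betas=(0.8, 0.999), eps=1e-8, weight_decay=3e-7)

# Warm up from 0.0 toward 0.001 over the first 1000 steps, then hold roughly
# constant; the inverse-exponential ramp here is an assumption.
scheduler = LambdaLR(optimizer,
                     lr_lambda=lambda step: 1.0 - math.exp(-5.0 * min(step, 1000) / 1000))

for x, y in []:                     # placeholder for the real data loader
    prob = torch.sigmoid(model(x))
    loss = torch.nn.functional.binary_cross_entropy(prob, y)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip at 5.0
    optimizer.step()
    scheduler.step()
```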
4.1   Results and Analysis

Table 2 summarizes the performance of all the compared methods on both datasets. Our model achieves the best performance on both datasets and significantly outperforms all other methods. This can be attributed to two reasons. First, as the input article pairs are re-organized into Concept Interaction Graphs, the two documents are aligned along the corresponding semantic units for easier concept-wise comparison. Second, our model encodes local comparisons around different semantic units into local matching vectors, and aggregates them via graph convolutions, taking semantic topologies into consideration. It thereby matches documents via divide-and-conquer, which is suitable for handling long text.

Impact of Graphical Decomposition. Comparing method XI with methods I-VI in Table 2, they all use the same word vectors and use
neural networks for text encoding. The key difference is that our method XI compares a pair of articles over a CIG in a per-vertex decomposed fashion. We can see that the performance of method XI is significantly better than that of methods I-VI. Similarly, comparing our method XIV with methods VII-IX, they all use the same term-based similarities. However, our method achieves significantly better performance by using graphical decomposition. Therefore, we conclude that graphical decomposition can greatly improve long text matching performance.

Note that the deep text matching models I-VI perform badly because they were invented mainly for sequence matching and can hardly capture meaningful semantic interactions in article pairs. When the text is long, it is hard to get an appropriate context vector representation for matching. For interaction-focused neural network models, most of the interactions between words in two long articles will be meaningless.

Impact of Graph Convolutions. Compare methods XII and XI, and compare methods XV and XIV. We can see that incorporating GCN layers significantly improves the performance on both datasets. Each GCN layer updates the hidden vector of each vertex by integrating the vectors from its neighboring vertices. Thus, the GCN layers learn to graphically aggregate local matching features into a final result.

Impact of Community Detection. By comparing methods XIII and XII, and comparing methods XVI and XV, we observe that using community detection, such that each concept is a set of correlated keywords instead of a single keyword, leads to slightly worse performance. This is reasonable, as using each keyword directly as a concept vertex provides more anchor points for article comparison. However, community detection groups highly coherent keywords together and reduces the average size of CIGs from 30 to 13 vertices. This helps to reduce the total training and testing time of our models by as much as 55%. Therefore, one may choose whether to apply community detection to trade accuracy off for speedups.

Impact of Multi-viewed Matching. Comparing methods XVII and XV, we can see that the concatenation of different graphical matching vectors (both term-based and Siamese encoded features) can further improve performance. This demonstrates the advantage of combining multi-viewed matching vectors.

Impact of Added Global Features. Comparing methods XVIII, XIX and XX with method XVII, we can see that adding more global features, such as global similarities (Simg) and/or global BERT encodings (BERTg) of the article pair, can hardly improve performance any further. This shows that graphical decomposition and convolutions are the main factors that contribute to the performance improvement. Since they already learn to aggregate local comparisons into a global semantic relationship, additionally engineered global features cannot help.

Model Size and Parameter Sensitivity. Our biggest model without BERT is XVIII, which contains only ∼34K parameters. In comparison, BERT contains 110M-340M parameters. Nevertheless, our model significantly outperforms BERT.

We tested the sensitivity of different parameters in our model. We found that 2 to 3 GCN layers give the best performance. Introducing more GCN layers does not improve the performance, while the performance is much worse with zero or only one GCN layer. Furthermore, GCN hidden representations of a size between 16 and 128 yield good performance. Further increasing this size does not show obvious improvement.

For the optional community detection step in CIG construction, we need to choose the minimum size and the maximum size of communities. We found that the final performance remains similar if we vary the minimum size from 2 to 3 and the maximum size from 6 to 10. This indicates that our model is robust and insensitive to these parameters.

Time Complexity. In real-world industry applications, the keywords of news articles are usually extracted in advance by highly efficient off-the-shelf tools and a pre-defined vocabulary. For CIG construction, let Ns be the number of sentences in the two documents, Nw be the number of unique words in the documents, and Nk be the number of unique keywords in a document. Building the keyword graph requires O(NsNk + Nw²) complexity (Sayyadi and Raschid, 2013), and betweenness-based community detection requires O(Nk³). The complexity of sentence assignment and edge weight calculation is O(NsNk + Nk²). For graph classification, our model size is small, so it can process document pairs efficiently.
5   Related Work

Graphical Document Representation. A majority of existing works can be generalized into four categories: word graph, text graph, concept graph, and hybrid graph. Word graphs use the words in a document as vertices, and construct edges based on syntactic analysis (Leskovec et al., 2004), co-occurrences (Zhang et al., 2018; Rousseau and Vazirgiannis, 2013; Nikolentzos et al., 2017) or preceding relations (Schenker et al., 2003). Text graphs use sentences, paragraphs or documents as vertices, and establish edges by word co-occurrence, location (Mihalcea and Tarau, 2004), text similarities (Putra and Tokunaga, 2017), or hyperlinks between documents (Page et al., 1999). Concept graphs link terms in a document to real-world concepts based on knowledge bases such as DBpedia (Auer et al., 2007), and construct edges based on syntactic/semantic rules. Hybrid graphs (Rink et al., 2010; Baker and Ellsworth, 2017) consist of different types of vertices and edges.

Text Matching. Traditional methods represent a text document as a vector of bag-of-words (BOW), term frequency-inverse document frequency (TF-IDF), LDA (Blei et al., 2003) and so forth, and calculate the distance between vectors. However, they cannot capture semantic distance well and usually cannot achieve good performance.

In recent years, different neural network architectures have been proposed for text pair matching tasks. Representation-focused models usually transform text pairs into context representation vectors through a Siamese neural network, followed by a fully connected network or a score function which gives the matching result based on the context vectors (Qiu and Huang, 2015; Wan et al., 2016; Liu et al., 2018; Mueller and Thyagarajan, 2016; Severyn and Moschitti, 2015). Interaction-focused models extract the features of all pair-wise interactions between words in text pairs, and aggregate the interaction features by deep networks to give a matching result (Hu et al., 2014; Pang et al., 2016). However, the intrinsic structural properties of long text documents are not fully utilized by these neural models. Therefore, they cannot achieve good performance for long text pair matching.

There are also research works which utilize knowledge (Wu et al., 2018), hierarchical properties (Jiang et al., 2019) or graph structures (Nikolentzos et al., 2017; Paul et al., 2016) for long text matching. In contrast, our method represents documents by a novel graph representation and combines the representation with GCNs.

Finally, pre-training models such as BERT (Devlin et al., 2018) can also be utilized for text matching. However, such a model is of high complexity and can hardly satisfy the speed requirements of real-world applications.

Graph Convolutional Networks. We also contribute to the use of GCNs for identifying the relationship between a pair of graphs, whereas previously, different GCN architectures have mainly been used for completing missing attributes/links (Kipf and Welling, 2016; Defferrard et al., 2016) or for node clustering or classification (Hamilton et al., 2017), but all within the context of a single graph, e.g., a knowledge graph, citation network or social network. In this work, the proposed Concept Interaction Graph takes a simple approach to represent a document by a weighted undirected graph, which essentially helps to decompose a document into subsets of sentences, each subset focusing on a different sub-topic or concept.

6   Conclusion

We propose the Concept Interaction Graph to organize documents into graphs of concepts, and introduce a divide-and-conquer approach to matching a pair of articles based on graphical decomposition and convolutional aggregation. We created two new datasets for long document matching with the help of professional editors, consisting of about 60K pairs of news articles, on which we have performed extensive evaluations. In the experiments, our proposed approaches significantly outperformed an extensive range of state-of-the-art schemes, including both term-based and deep-model-based text matching algorithms. The results suggest that the proposed graphical decomposition and the structural transformation by GCN layers are critical to the performance improvement in matching article pairs.
References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. The Semantic Web, pages 722–735.

Collin Baker and Michael Ellsworth. 2017. Graph methods for multilingual FrameNets. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing, pages 45–50.

Cosmin Adrian Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1412–1422. Association for Computational Linguistics.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Daniel Bruggermann, Yannik Hermey, Carsten Orth, Darius Schneider, Stefan Selzer, and Gerasimos Spanakis. 2016. Storyline detection and tracking using dynamic latent Dirichlet allocation. In Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016), pages 9–19.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Yixing Fan, Liang Pang, JianPeng Hou, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2017. MatchZoo: A toolkit for deep text matching. arXiv preprint arXiv:1707.07270.

William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pages 2042–2050.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM.

Jyun-Yu Jiang, Mingyang Zhang, Cheng Li, Mike Bendersky, Nadav Golbandi, and Marc Najork. 2019. Semantic text matching for long-form documents.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39(4):885–916.

Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500. Association for Computational Linguistics.

Jure Leskovec, Marko Grobelnik, and Natasa Milic-Frayling. 2004. Learning sub-structures of document semantic graphs for document summarization.

Bang Liu, Di Niu, Kunfeng Lai, Linglong Kong, and Yu Xu. 2017. Growing story forest online from massive breaking news. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 777–785. ACM.

Bang Liu, Ting Zhang, Fred X. Han, Di Niu, Kunfeng Lai, and Yu Xu. 2018. Matching natural language sentences with hierarchical sentence factorization. In Proceedings of the 2018 World Wide Web Conference, pages 1237–1246. International World Wide Web Conferences Steering Committee.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, pages 1291–1299. International World Wide Web Conferences Steering Committee.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence.

Paul Neculoiu, Maarten Versteegh, Mihai Rotaru, and Textkernel BV Amsterdam. 2016. Learning text similarity with Siamese recurrent networks. ACL 2016, page 148.

François Rousseau and Michalis Vazirgiannis. 2013. Graph-of-word and TW-IDF: new approach to ad hoc IR. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 59–68. ACM.

Hassan Sayyadi and Louiqa Raschid. 2013. A graph analytical approach for topic detection. ACM Transactions on Internet Technology (TOIT), 13(2):4.

Adam Schenker, Mark Last, Horst Bunke, and Abraham Kandel. 2003. Clustering of web documents using a graph model. Series in Machine Perception and Artificial Intelligence.
Giannis Nikolentzos, Polykarpos Meladianos, François        CHINE PERCEPTION AND ARTIFICIAL INTEL-
  Rousseau, Yannis Stavrakas, and Michalis Vazir-            LIGENCE, 55:3–18.
  giannis. 2017. Shortest-path graph kernels for doc-
  ument similarity. In Proceedings of the 2017 Con-        Aliaksei Severyn and Alessandro Moschitti. 2015.
  ference on Empirical Methods in Natural Language           Learning to rank short text pairs with convolutional
  Processing, pages 1890–1900.                               deep neural networks. In Proceedings of the 38th in-
                                                             ternational ACM SIGIR conference on research and
Lawrence Page, Sergey Brin, Rajeev Motwani, and              development in information retrieval, pages 373–
  Terry Winograd. 1999. The pagerank citation rank-          382. ACM.
  ing: Bringing order to the web. Technical report,
  Stanford InfoLab.                                        Dafna Shahaf, Jaewon Yang, Caroline Suen, Jeff Ja-
                                                             cobs, Heidi Wang, and Jure Leskovec. 2013. Infor-
Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu,                 mation cartography: creating zoomable, large-scale
  Shengxian Wan, and Xueqi Cheng. 2016. Text                 maps of information. In Proceedings of the 19th
  matching as image recognition. In AAAI, pages              ACM SIGKDD international conference on Knowl-
  2793–2799.                                                 edge discovery and data mining, pages 1097–1105.
                                                             ACM.
Adam Paszke, Sam Gross, Soumith Chintala, and Gre-
  gory Chanan. 2017. Pytorch: Tensors and dynamic          Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng,
  neural networks in python with strong gpu acceler-         and Grégoire Mesnil. 2014. Learning semantic rep-
  ation. PyTorch: Tensors and dynamic neural net-            resentations using convolutional neural networks for
  works in Python with strong GPU acceleration.              web search. In Proceedings of the 23rd Interna-
                                                             tional Conference on World Wide Web, pages 373–
Christian Paul, Achim Rettinger, Aditya Mogadala,            374. ACM.
  Craig A Knoblock, and Pedro Szekely. 2016. Effi-
  cient graph-based document similarity. In European       Piek Vossen, Tommaso Caselli, and Yiota Kont-
  Semantic Web Conference, pages 334–349. Springer.           zopoulou. 2015. Storylines for structuring massive
                                                              streams of news. In Proceedings of the First Work-
Marten Postma, Filip Ilievski, and Piek Vossen. 2018.         shop on Computing News Storylines, pages 40–49.
 Semeval-2018 task 5: Counting events and par-
 ticipants in the long tail. In Proceedings of The         Shengxian Wan, Yanyan Lan, Jiafeng Guo, Jun Xu,
 12th International Workshop on Semantic Evalua-             Liang Pang, and Xueqi Cheng. 2016. A deep ar-
 tion, pages 70–80.                                          chitecture for semantic matching with multiple po-
                                                             sitional sentence representations. In AAAI, vol-
Jan Wira Gotama Putra and Takenobu Tokunaga. 2017.           ume 16, pages 2835–2841.
   Evaluating text coherence based on semantic sim-
   ilarity graph. In Proceedings of TextGraphs-11:         Yu Wu, Wei Wu, Can Xu, and Zhoujun Li. 2018.
   the Workshop on Graph-based Methods for Natural           Knowledge enhanced hybrid neural network for text
   Language Processing, pages 76–85.                         matching. In Thirty-Second AAAI Conference on
                                                             Artificial Intelligence.
Xipeng Qiu and Xuanjing Huang. 2015. Convolutional
  neural tensor network architecture for community-        Ting Zhang, Bang Liu, Di Niu, Kunfeng Lai, and
  based question answering. In IJCAI, pages 1305–            Yu Xu. 2018. Multiresolution graph attention net-
  1311.                                                      works for relevance matching. In Proceedings of
                                                             the 27th ACM International Conference on Informa-
Bryan Rink, Cosmin Adrian Bejan, and Sanda M                 tion and Knowledge Management, pages 933–942.
  Harabagiu. 2010. Learning textual graph patterns           ACM.
  to detect causal event relations. In FLAIRS Confer-
  ence.                                                    Deyu Zhou, Haiyang Xu, and Yulan He. 2015. An un-
                                                             supervised bayesian modelling approach for story-
Stephen Robertson, Hugo Zaragoza, et al. 2009. The           line detection on news articles. In Proceedings of
   probabilistic relevance framework: Bm25 and be-           the 2015 Conference on Empirical Methods in Nat-
   yond. Foundations and Trends R in Information Re-         ural Language Processing, pages 1943–1948.
   trieval, 3(4):333–389.