
S&I Reader: Multi-granularity Gated Multi-hop
Skimming and Intensive Reading Model for
Machine Reading Comprehension
Yong Wang1, Chong Lei2, and Duoqian Miao3
1 School of Artificial Intelligence, Liangjiang, Chongqing University of Technology, Chongqing 401135, China
2 College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China
3 Department of Computer Science and Technology, Tongji University, Shanghai 201804, China

Corresponding author: Yong Wang (e-mail: ywang@cqut.edu.cn)
The research is supported by the National Natural Science Foundation of China under Grant 61976158 and Grant 61673301.

ABSTRACT Machine reading comprehension is a challenging task that aims to determine the answer span for a given context and question. Recently developed pre-training language models have achieved a series of successes in natural language understanding tasks thanks to their powerful contextual representation ability. However, these pre-training language models generally lack a downstream processing structure for specific tasks, which limits further performance improvement. To address this problem and deepen the model's understanding of the question and context, this paper proposes S&I Reader. On top of the pre-training model, skimming, intensive reading, and gated mechanism modules are added to simulate how humans read text and filter information. Based on the idea of granular computing, a multi-granularity module that computes context granularity and sequence granularity is added to the model to simulate how humans understand text from words to sentences and from parts to the whole. Compared with previous machine reading comprehension models, our model structure is novel. The skimming module and multi-granularity module proposed in this paper address the problems that previous models ignore the key information of the text and cannot understand the text at multiple granularities. Experiments show that the proposed model is effective on both Chinese and English datasets: it understands the question and context better, gives more accurate answers, and improves performance over the baseline model.

INDEX TERMS Gated mechanism, granular computing, intensive reading, machine reading
comprehension, pre-training model, skimming.

I. INTRODUCTION
Important books must be read over and over again; every time you read them, you will find it beneficial to open the book.
— Jules Renard (1864-1910)

Machine reading comprehension (MRC) is a basic and challenging task in natural language processing. It requires the machine to give the answer while understanding the given question and context. Common machine reading comprehension tasks are divided into cloze tests, multiple choice, span extraction, and free answering according to the answer form [1]. With the development of natural language processing and deep learning theories, more and more models for these machine reading comprehension tasks have achieved good results.

In recent years, popular pre-training language models with ultra-large-scale parameters, trained on large-scale corpora [2], have achieved state-of-the-art (SOTA) results in various natural language processing tasks, including machine reading comprehension. These pre-training language models, such as the bidirectional Transformer [3] structure of BERT [4] and ALBERT [5], and the generalized autoregressive pre-training model XLNet [6], are used as the model encoder to extract contextual language features of related texts and are fine-tuned in conjunction with downstream processing structures for specific tasks.


With the great success of pre-training language models, people have focused more attention on the encoder side of the model. People can directly benefit from multiple powerful encoders with similar structures, which has led to a bottleneck in the development of downstream processing structures tailored to specific tasks. However, it is time-consuming and resource-consuming to encode the general knowledge contained in large-scale corpora into language models with ultra-large-scale parameters. Moreover, due to the slow development of language representation encoding technology, the performance of the pre-training language model is limited. These all highlight the importance of developing downstream processing structures for specific tasks. Therefore, this paper focuses on the downstream processing structure of the pre-training model.

Many studies have shown that many models pay attention to the unimportant parts of the text and ignore the important parts [7]. At the same time, this paper also found that previous models still exhibit over-stability. That is, they are susceptible to interference from sentences in the context that share many words with the question, which indicates that the model matches only literally rather than semantically. This paper selects an example from the Chinese dataset DuReader 2.0 as an illustration, as shown in Table 1.

Table 1. An over-stability MRC example.
Context:
第 32 届夏季奥林匹克运动会原定 2020 年 7 月在日本东京举行,但 2020 年受到全球新冠疫情的影响,经与国际奥委会协商,东京奥运会推迟大约一年后举行,暂定 2021 年 7 月 23 日正式开幕,8 月 8 日闭幕。
Question:
第 32 届夏季奥林匹克运动会开幕时间
GroundTruth: 2021 年 7 月 23 日
Over-stability Answer: 2020 年 7 月

Table 1 (English translation). An over-stability MRC example.
Context:
The 32nd Summer Olympic Games was originally scheduled to be held in Tokyo, Japan in July 2020. However, in 2020, affected by the global COVID-19 epidemic, after consultation with the International Olympic Committee, the Tokyo Olympic Games was postponed by about one year, and it was tentatively scheduled to officially open on July 23, 2021 and close on August 8, 2021.
Question:
Opening time of the 32nd Summer Olympic Games
GroundTruth: July 23, 2021
Over-stability Answer: July 2020

Granular computing (GrC) has been applied in many fields since Zadeh [8], [9] proposed it. Granular computing is an effective solution to structured problems. It simulates the realization of human-centered operations in the presence of multi-faceted data and contains a large number of techniques to minimize uncertainty [10]. A recognized feature of artificial intelligence is that people can observe and analyze the same problem at extremely different granularities. Not only can people solve problems in different granular worlds, but they can also quickly jump from one granular world to another. This ability to deal with worlds of different granularities is a powerful manifestation of human problem solving [11]. The granular computing model divides the research object into several layers with different granularities, and each layer is related to the others to form a unified whole [12]. Different granularities indicate different angles and ranges of information. The idea of granular computing helps the model solve a problem from multiple levels and angles, and helps the model understand the relationship between a part of the text and the whole.

When humans perform reading comprehension tasks, they usually combine reading behaviors such as skimming and intensive reading: first grasp the key information in the question and context by skimming, then further grasp the main idea of the text and filter the important information that matches the question by intensive reading, read repeatedly, moving from the global theme to the partial information of the text, and finally determine the answer to the question. Therefore, this paper proposes S&I Reader, which aims to help the model determine the valid information in the question and context, and to understand the text in terms of word granularity, context granularity, and sequence granularity. Our model alleviates, to some extent, the problem of incomplete answer semantics caused by insufficient learning and the over-stability caused by purely literal matching. This paper uses RoBERTa [13] as the encoder of the model and as the baseline model for comparison, to give full play to the advantages of its contextual language feature encoding.

The model contains the following four parts:
1. Skimming Reading Module, which is used to determine the keywords in the question and context and their corresponding related parts, helping the model pay attention to the important content of the text.
2. Intensive Reading Module, which is used to compare the similarity and relevance of each word in the question and context to generate a question-aware context representation. The model repeatedly cycles through the Skimming Reading Module and Intensive Reading Module via the Multi-hop mechanism to simulate the human process of skimming and intensive reading multiple times.
3. Gated Mechanism, which is used to determine the parts of the context that need to be memorized, forgotten, or updated. This module simulates the human behavior of filtering and memorizing important information.
4. Multi-granularity Module, which is used to help the model understand the relationship between the whole and the parts of the context at multiple levels, in terms of word granularity, context granularity, and sequence granularity.


The S&I Reader proposed in this paper is evaluated on DuReader 2.0 [14] and SQuAD v1.1 [15]. Experiments prove that the model is effective on both Chinese and English datasets, and that its performance is further improved on the basis of the pre-training model. At the same time, ablation experiments and experimental analysis also prove that the model is more effective at grasping the key information of the text and at solving the over-stability problem.

II. RELATED WORK
In recent years, various machine reading comprehension datasets have been released one after another, such as SQuAD, MS MARCO [16], RACE [17], CNN & Daily Mail [18], and DuReader. They have aroused people's interest in machine reading comprehension research, made it possible to build deep network models for specific tasks [19], and provide a test platform for evaluating MRC models extensively.
In the early stage, the focus of classic reading comprehension models was to study the multiple ways of interaction between the question and the context; the corresponding attention mechanisms increase the model's understanding of the relationship between the question words and the context words before outputting the predicted answer. These include GA Reader [20], which uses a one-way attention mechanism; Match-LSTM [21], which uses an LSTM structure with an attention mechanism for information alignment; Bi-DAF [22], which uses a bidirectional attention flow mechanism; R-Net [23], which uses a self-matching attention mechanism to obtain more comprehensive contextual information; J-Net [24], which uses an attention pooling mechanism to filter key information; and QANet [25], which uses depthwise separable convolution [26] and a multi-head self-attention mechanism [3] to simulate local and global information interaction. These end-to-end neural network models have achieved remarkable success.
In recent years, with the accumulation of large-scale corpora and the development of pre-training language models based on the multi-head self-attention mechanism, such as BERT, RoBERTa, and ALBERT, powerful performance has been demonstrated in a number of natural language processing fields. Among them, BERT uses two unsupervised prediction tasks for pre-training: a Masked Language Model that predicts the words in the masked part, and a Next Sentence Prediction (NSP) task that captures the relationship between two sentences. RoBERTa replaced BERT's static mask mechanism with a dynamic mask mechanism and eliminated the NSP task; at the same time, it used a larger corpus and larger batches of data for pre-training to achieve better results. ALBERT removes the restriction that the word vector dimension and hidden layer dimension must be the same, and it shares parameters across layers on the Encoder side, avoiding a large increase in parameters as the model depth grows. At the same time, ALBERT cancels the NSP task and adds a Sentence Order Prediction (SOP) task to predict the order between sentences. ALBERT can still maintain good performance while greatly reducing the number of parameters.
The novelty of the model proposed in this paper is that we propose a bidirectional attention over attention mechanism, which overcomes the inability of previous models to focus on key information when establishing the relationship between context and question. The multi-granularity module proposed in this paper helps the model understand the text from the perspective of the whole and the parts, and overcomes the limitation that previous models understand the text only at word granularity. At the same time, our model builds a downstream processing structure for span extraction reading comprehension tasks on top of the pre-training model and further improves accuracy on the basis of the pre-training model.
The main contributions of this paper include the following three aspects.
1. This paper reviews the development of machine reading comprehension and analyzes the current necessity of developing downstream data processing structures based on pre-training language models.
2. This paper proposes the S&I Reader model, Bidirectional Attention over Attention, and the Multi-granularity Module. The model can effectively focus on and filter useful information, and understand the text at multiple levels and from multiple angles. It addresses the problems of over-focusing on unimportant parts while ignoring important parts, over-stability, and insufficient learning in previous models.
3. The experiments and analysis in this paper show that the model can further improve performance on the basis of the pre-training model.

III. S&I READER
The model is used for span extraction reading comprehension tasks. The task is defined as follows: given a question $Q = \{q_1, q_2, \ldots, q_m\}$ containing $m$ words and a context $C = \{c_1, c_2, \ldots, c_n\}$ containing $n$ words, the model extracts a continuous word subsequence $A = \{c_{s+1}, c_{s+2}, \ldots, c_{s+k}\}$ from $C$ as the predicted answer and outputs it.
The model architecture is shown in Figure 1. The model is composed of an Encoder and a downstream structure. The downstream structure is mainly composed of the following four parts: the Skimming Reading Module, Intensive Reading Module, Gated Mechanism, and Multi-granularity Module. Among them, the Skimming Reading Module and Intensive Reading Module are included in the Multi-hop Mechanism.
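To make the overall data flow concrete, the following is a minimal PyTorch-style sketch of how the encoder and the four downstream modules described below could be composed. The class names, signatures, and slicing conventions are illustrative assumptions for this sketch, not the authors' released code.

```python
import torch.nn as nn

class SIReaderHead(nn.Module):
    """Illustrative composition of the S&I Reader downstream structure."""
    def __init__(self, encoder, skim, intensive, gate, multi_granularity, hops=3):
        super().__init__()
        self.encoder = encoder                    # pre-trained RoBERTa (or BERT/ALBERT)
        self.skim = skim                          # Skimming Reading Module
        self.intensive = intensive                # Intensive Reading Module
        self.gate = gate                          # Gated Mechanism
        self.multi_granularity = multi_granularity
        self.hops = hops                          # number of multi-hop iterations

    def forward(self, input_ids, attention_mask, q_len, c_len):
        h = self.encoder(input_ids, attention_mask)          # contextual features
        q = h[:, 1:1 + q_len]                                 # question tokens after [CLS]
        c = h[:, 2 + q_len:2 + q_len + c_len]                 # context tokens after [SEP]
        for _ in range(self.hops):                            # multi-hop skim + intensive reading
            c_key, q = self.skim(c, q)
            c = self.intensive(c_key, q)
        o = self.gate(c, q)                                   # memorise / forget information
        o = self.multi_granularity(o, h)                      # word / context / sequence granularity
        return o                                              # fed to the span prediction layer
```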


FIGURE 1. S&I Reader architecture.

1) ENCODER
The encoder includes Embedding and Interaction, which are used to represent the embedding of the text sequence and to capture contextual language association features.

• Embedding
Firstly, the question and context are sent to the Encoder. The words in the text are converted into sub-tokens by WordPiece [27]. For each sub-token, its input embedding is the sum of the token embedding, segment embedding, and position embedding.
Then the embeddings of all sub-tokens in the question and context are spliced together to form one sequence. At the same time, the lengths of the question and context are set to $m$ and $n$ respectively: if a text is too long it is truncated, and if it is too short it is filled with [PAD]. The starting position of the sequence is the special classification indicator [CLS], and the indicator [SEP] is added to the end of the question and of the context respectively. Let the output embedding of the sequence be expressed as $E = \{e_1, e_2, e_3, \ldots, e_l\}$; it is sent to the interaction layer to establish contextual language association features.

• Interaction
The output embedding is sent to a multi-layer bidirectional Transformer Encoder. Each layer includes a multi-head self-attention mechanism and a fully connected layer, and both parts include dropout, a residual connection, and Layer Normalization. As shown in formulas (1)-(5), $x^t = \{x_1^t, x_2^t, x_3^t, \ldots, x_l^t\}$ represents the features of the $t$-th layer, and the features $x^{t+1}$ of the $(t+1)$-th layer are calculated as follows.

$\alpha_{i,j}^{t+1,k} = \mathrm{softmax}_j\!\left(\dfrac{(W_Q^{t+1,k} x_i^t)^{\top}(W_K^{t+1,k} x_j^t)}{\sqrt{d_k}}\right)$  (1)

$\bar{h}_i^{t+1} = \sum_{k=1}^{K} W_k^{t+1}\left(\sum_{j=1}^{l} \alpha_{i,j}^{t+1,k} \cdot \left(W_V^{t+1,k} x_j^t\right)\right)$  (2)

$h^{t+1} = \mathrm{LayerNorm}\left(x^t + \mathrm{Dropout}\left(\bar{h}^{t+1}\right)\right)$  (3)

$\bar{x}^{t+1} = W_2^{t+1} \cdot f\left(W_1^{t+1} h^{t+1} + b_1^{t+1}\right) + b_2^{t+1}$  (4)

$x^{t+1} = \mathrm{LayerNorm}\left(h^{t+1} + \mathrm{Dropout}\left(\bar{x}^{t+1}\right)\right)$  (5)

where $k$ is the index of the attention head; $W_Q^{t+1,k}$, $W_K^{t+1,k}$, $W_V^{t+1,k}$, and $W_k^{t+1}$ are trainable matrices for the $k$-th attention head; $W_1^{t+1}$, $W_2^{t+1}$, $b_1^{t+1}$, and $b_2^{t+1}$ are the trainable matrices and biases of the $(t+1)$-th layer; and $f(\cdot)$ is the feed-forward activation function (GELU in BERT/RoBERTa).
We use $H = \{h_1, h_2, h_3, \ldots, h_l\}$ to represent the output of the last layer of the Encoder; $H$ is sent to the Skimming Reading Module.
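As a concrete illustration of formulas (1)-(5), the following is a compact PyTorch sketch of one interaction layer. The hidden sizes, number of heads, and the GELU activation follow common BERT/RoBERTa settings and are assumptions here rather than values taken from the paper.

```python
import torch.nn as nn

class InteractionLayer(nn.Module):
    """One Transformer encoder layer in the spirit of formulas (1)-(5):
    multi-head self-attention and a feed-forward block, each wrapped with
    dropout, a residual connection, and LayerNorm. Sizes are illustrative."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                        # x: (seq_len, batch, d_model)
        h_bar, _ = self.attn(x, x, x)            # formulas (1)-(2)
        h = self.norm1(x + self.drop(h_bar))     # formula (3)
        x_bar = self.ff(h)                       # formula (4)
        return self.norm2(h + self.drop(x_bar))  # formula (5)
```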


2) SKIMMING READING MODULE
After the output of the last layer of the Encoder is obtained, it is split according to the positions of the question and context in the sequence to obtain $Q_H = \{h_2, h_3, h_4, \ldots, h_{m+1}\}$ and $C_H = \{h_{m+3}, h_{m+4}, \ldots, h_{m+n+2}\}$. Inspired by attention over attention [28], this paper proposes bidirectional attention over attention. As shown in Figure 2, it aims to help the model determine the keywords in the question and their associated key parts in the context, as well as the keywords in the context and their associated key parts in the question, so that the model perceives the key content in the text. The procedure is described in Algorithm 1.

FIGURE 2. Skimming Reading Module (Bidirectional Attention over Attention).

Algorithm 1. Calculate keywords and associated key parts in context and question.
Input: Question representation $Q_H$ and paragraph representation $C_H$ encoded by the encoder
Output: Key information-aware context representation $C'$ and key information-aware question representation $Q'$
Step 1: Calculate the word-level similarity matrix $S$ between the question and the context by the trilinear function
Step 2: Apply softmax to each row and each column of the similarity matrix to obtain $S_1$ and $S_2$ respectively
Step 3: Average $S_1$ in the context direction to get the question keyword weights $w_Q$; average $S_2$ in the question direction to get the context keyword weights $w_C$
Step 4: Calculate the weights of the key parts associated with the question and context keywords to get $a_C$ and $a_Q$ respectively
Step 5: Highlight the key parts of the context and question to obtain the key information-aware context representation $C'$ and question representation $Q'$

As described in Algorithm 1, the similarity of each pair of words between the question and the context is first calculated by the trilinear function, as shown in formula (6), to obtain the similarity matrix $S \in \mathbb{R}^{n \times m}$, where $S_{ij}$ represents the similarity between the $i$-th word in the context and the $j$-th word in the question. Softmax is applied to each row of $S$ to get $S_1$, as shown in formula (7), to determine which word in the question is closest to each word in the context. Softmax is applied to each column of $S$ to get $S_2$, as shown in formula (8), to determine which word in the context is most relevant to each word in the question.

$S_{ij} = W_S\,[c_i\,;\,q_j\,;\,c_i \odot q_j]$  (6)
$S_1 = \mathrm{softmax}_{\rightarrow}(S)$  (7)
$S_2 = \mathrm{softmax}_{\downarrow}(S)$  (8)

where $W_S$ is a trainable matrix, and $c_i$ and $q_j$ are the representations of the $i$-th context word and $j$-th question word in $C_H$ and $Q_H$.
$w_Q$ is obtained by averaging $S_1$ in the context direction, as shown in formula (9), to highlight the keywords in the question. The attention over the key parts of the context associated with the question keywords is then calculated to get $a_C$, as shown in formula (10). As shown in formula (11), the key parts of the context are highlighted and added to the context vector representation to obtain the key information-aware context representation $C'$.

$w_Q = \mathrm{avg}_{\downarrow}(S_1)$  (9)
$a_C = S_2 \cdot w_Q$  (10)
$C' = C_H + a_C \odot C_H$  (11)

where $\odot$ represents element-wise multiplication.
$S_2$ is averaged in the direction of the question words to get $w_C$, as shown in formula (12), to highlight the context keywords. The attention over the key parts of the question associated with the context keywords is then calculated to get $a_Q$, as shown in formula (13). As shown in formula (14), the key parts of the question are highlighted and added to the question representation to obtain the key information-aware question representation $Q'$.

$w_C = \mathrm{avg}_{\rightarrow}(S_2)$  (12)
$a_Q = S_1^{\top} \cdot w_C$  (13)
$Q' = Q_H + a_Q \odot Q_H$  (14)
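The following is a minimal PyTorch sketch of the bidirectional attention over attention computation in formulas (6)-(14) as reconstructed above; the tensor and parameter names mirror the notation used here and are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BidirectionalAoA(nn.Module):
    """Sketch of the Skimming Reading Module: trilinear word-pair similarity,
    row/column softmax, and keyword-weighted highlighting of the context and
    question representations (formulas (6)-(14))."""
    def __init__(self, d_model=768):
        super().__init__()
        self.w_s = nn.Linear(3 * d_model, 1, bias=False)   # trainable W_S of formula (6)

    def forward(self, c, q):                 # c: (batch, n, d), q: (batch, m, d)
        n, m = c.size(1), q.size(1)
        c_exp = c.unsqueeze(2).expand(-1, -1, m, -1)        # (batch, n, m, d)
        q_exp = q.unsqueeze(1).expand(-1, n, -1, -1)        # (batch, n, m, d)
        s = self.w_s(torch.cat([c_exp, q_exp, c_exp * q_exp], dim=-1)).squeeze(-1)  # (6)

        s1 = torch.softmax(s, dim=2)          # (7) row-wise, over question words
        s2 = torch.softmax(s, dim=1)          # (8) column-wise, over context words

        w_q = s1.mean(dim=1)                  # (9) question keyword weights, (batch, m)
        a_c = torch.bmm(s2, w_q.unsqueeze(-1))              # (10) context key-part weights
        c_key = c + a_c * c                   # (11) key-information-aware context

        w_c = s2.mean(dim=2)                  # (12) context keyword weights, (batch, n)
        a_q = torch.bmm(s1.transpose(1, 2), w_c.unsqueeze(-1))  # (13) question key-part weights
        q_key = q + a_q * q                   # (14) key-information-aware question
        return c_key, q_key
```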


3) INTENSIVE READING MODULE
In this module, the $C'$ and $Q'$ obtained above are sent to a bidirectional attention flow layer [22] to establish a complete relationship between the question and context and to obtain the query-aware context representation.
Firstly, $\bar{S}_1$ and $\bar{S}_2$ are obtained by formulas (6)-(8). Context-to-query attention $A$ is obtained by formula (15) and query-to-context attention $B$ is obtained by formula (16). After that, $C'$, $A$, and $B$ are spliced together by formula (17) and sent to a linear layer to obtain the query-aware context representation $G$.

$A = \bar{S}_1 \cdot Q'$  (15)
$B = \bar{S}_1 \cdot \bar{S}_2^{\top} \cdot C'$  (16)
$G = W_3\,[C'\,;\,C' \odot A\,;\,C' \odot B] + b_3$  (17)

where $W_3$ and $b_3$ are a trainable matrix and bias.

4) MULTI-HOP MECHANISM
In the model, the multi-hop mechanism is used to simulate the human behavior of deepening the understanding of a text by reading it many times. As the model passes through multiple skimming reading modules and intensive reading modules, it constantly adjusts its judgment of the key information and obtains a more comprehensive context representation.
The above-mentioned query-aware context representation $G$ and key information-aware question representation $Q'$ are re-sent into the Skimming Reading Module and Intensive Reading Module multiple times.
Experiments show that the multi-hop mechanism can improve the performance of the model. This paper further tests the effect of the number of hops on model performance in the subsequent ablation experiments.

5) GATED MECHANISM
Inspired by LSTM [29], GRU [30], and Ruminating Reader [31], a Gated Mechanism is added to the model. It is used to simulate the human behavior of filtering and memorizing important content after repeated reading while ignoring unimportant content. That is, the above-mentioned query-aware context representation $G$ and key information-aware question representation $Q'$ are sent to the gated mechanism so that the model can determine the parts that need to be memorized or forgotten, and $Q'$ is used to generate an update vector that updates the model's memory.
The model generates the update vector $U$ by formula (18). At the same time, $G$ and $U$ are combined in a linear layer with a sigmoid by formula (19); when a part of $G$ is more related to the content of the question, the memory weight $g$ approaches 1 and more relevant information is retained. The model uses the update vector $U$ to update $G$ and obtain the output vector $O$ by formula (20), $O \in \mathbb{R}^{n \times d}$.

$U = \tanh(W_U \cdot Q' + b_U)$  (18)
$g = \sigma(W_g\,[G\,;\,U] + b_g)$  (19)
$O = g \circ G + (1 - g) \circ U$  (20)

where $W_U$, $b_U$, $W_g$, and $b_g$ are trainable matrices and biases.

6) MULTI-GRANULARITY MODULE
If a whole is divided into three parts and interpreted as three granules, we immediately obtain a three-way granular computing model [32]. This paper proposes a multi-granularity module so that the model can understand text in terms of word granularity, context granularity, and sequence granularity.
The aforementioned modules use the similarity between the context and question to find key information and establish the association relationship. Therefore, the output vector $O$ obtained before can be regarded as the word granularity, which reflects the model's processing of the local information of the text. At the same time, in order to simulate the human behavior of understanding the main idea of a text from the whole, this module calculates the context granularity, which represents the global meaning of the context, and the sequence granularity, which represents the global meaning of the sequence composed of the context and question. This lets the model understand the text at the word, context, and sequence granularities, as well as the relationship between the whole text and the local text.
The Multi-granularity Module and the aforementioned modules are parallel processing structures in the model architecture. This module removes the [PAD] filling part of the context representation and takes the average value to obtain the context granularity vector $v_C$. The [CLS] identifier in the sequence has the ability to characterize the global sequence, so the corresponding vector in $H$ is taken out as the sequence granularity vector $v_S$. The above three granularities are added and sent to a linear layer to obtain the output vector $O_G$ of the parallel structure by formulas (21)-(22). It should be pointed out that a question granularity was also introduced in earlier versions of this work; in subsequent experiments, it was found that removing the question granularity helps to improve the performance of the model. Therefore, only the word granularity, context granularity, and sequence granularity are retained in the model.

$M = O + v_C + v_S$  (21)
$O_G = W_4 \cdot M + b_4$  (22)

where $W_4$ and $b_4$ are a trainable matrix and bias.
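The following is a minimal PyTorch sketch of the Gated Mechanism and Multi-granularity Module in formulas (18)-(22) as reconstructed above. How the question-side update vector is aligned with the context length, and the omission of [PAD] masking when averaging, are simplifying assumptions not spelled out in the paper.

```python
import torch
import torch.nn as nn

class GatedMultiGranularity(nn.Module):
    """Sketch of formulas (18)-(22): a gated update of the query-aware context
    followed by the addition of context- and sequence-granularity vectors."""
    def __init__(self, d_model=768):
        super().__init__()
        self.w_u = nn.Linear(d_model, d_model)        # W_U, b_U of formula (18)
        self.w_g = nn.Linear(2 * d_model, d_model)    # W_g, b_g of formula (19)
        self.w_4 = nn.Linear(d_model, d_model)        # W_4, b_4 of formula (22)

    def forward(self, g, q_key, h, cls_index=0):
        # g: query-aware context (batch, n, d); q_key: question repr. (batch, m, d)
        # h: full encoder output (batch, l, d), used for the [CLS] vector
        u = torch.tanh(self.w_u(q_key.mean(dim=1, keepdim=True)))        # (18), pooled over m
        gate = torch.sigmoid(self.w_g(torch.cat([g, u.expand_as(g)], dim=-1)))  # (19)
        o = gate * g + (1.0 - gate) * u                                   # (20) memorise / forget

        v_context = o.mean(dim=1, keepdim=True)        # context granularity ([PAD] masking omitted)
        v_sequence = h[:, cls_index].unsqueeze(1)      # [CLS] vector as sequence granularity
        return self.w_4(o + v_context + v_sequence)    # (21)-(22)
```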


7) PREDICTION LAYER
The requirement of the span extraction reading comprehension task is to extract a continuous subsequence of the context as the predicted answer. Therefore, the output vector $O_G$ obtained above is sent to a linear layer, and softmax is used to obtain the probability of each word being the starting or ending position of the predicted answer. The continuous subsequence with the highest probability is extracted as the model's predicted answer, as shown in formulas (23)-(25).

$P_s = \mathrm{softmax}(W_s \cdot O_G + b_s)$  (23)
$P_e = \mathrm{softmax}(W_e \cdot O_G + b_e)$  (24)
$\bar{P} = P_s + P_e$  (25)

where $W_s$, $b_s$, $W_e$, and $b_e$ are trainable matrices and biases.
The model selects the index pair from $\bar{P}$ with the maximum value, subject to start ≤ end, as the starting and ending positions of the predicted answer.
The loss function of the model during training is shown in formula (26).

$L = -\dfrac{1}{N}\sum_{i}\left[\log\left(P_s^{\,y_i^s}\right) + \log\left(P_e^{\,y_i^e}\right)\right]$  (26)

where $y_i^s$ and $y_i^e$ represent the starting and ending positions of the ground truth of the $i$-th sample, and $N$ is the total number of samples.

IV. EXPERIMENTS AND ANALYSIS
1) SETUP
This paper uses the Chinese pre-training model RoBERTa as the Encoder of the model and uses it as the baseline model. At the same time, the pre-training models BERT and ALBERT with the same hyper-parameters, and ALBERT-Large with larger hyper-parameters, were used for comparative experiments. The model is implemented for DuReader 2.0 with TensorFlow 1.12.0 [33] and for SQuAD v1.1 with PyTorch 1.0.1 [34], and experiments are performed on an NVIDIA GeForce GTX 1080Ti. The hyper-parameters used in the model are shown in Table 2.

Table 2. Our Model Hyper-Parameters
Hyper-Parameter                  Value
batch size                       4
epoch                            3
max query length (DuReader)      16
max query length (SQuAD)         24
max sequence length              512
learning rate                    3 × 10^-5
doc stride                       384
warmup rate                      0.1
multi-hop                        3

2) DATASET
• DuReader 2.0
DuReader 2.0 [14] is a Chinese machine reading comprehension dataset. The questions and contexts of the dataset come from Baidu Search and Baidu Zhidao, and the answers are manually annotated. Previous machine reading comprehension datasets, such as SQuAD [15], pay more attention to descriptions of facts and true-or-false questions. On this basis, DuReader 2.0 adds opinion questions to better fit people's questioning habits. The DuReader 2.0 training set contains 15K samples and the dev set contains 1.4K samples. Since the test set of the dataset is not public, this paper evaluates the performance of the model and other related models on the dev set.

• SQuAD v1.1
SQuAD v1.1 is an English question answering dataset whose samples are composed of (context, question, answer) triples. The contexts are derived from 536 Wikipedia articles. Annotators asked questions and provided answers in the context, resulting in more than 100,000 question-answer pairs. The difference from CNN & Daily Mail is that the answer is no longer a single word or entity but a continuous word sequence in the context, which increases the difficulty of the task. SQuAD v1.1 includes 87.5K training samples, 10.1K dev samples, and 10.1K test samples. This paper evaluates model performance on the dev set.

3) METRICS
The model uses F1 and EM as evaluation indicators. EM measures whether the predicted answer matches the ground truth exactly. F1 measures the word-level matching between the predicted answer and the ground truth and is calculated from Precision and Recall. Precision represents the proportion of the correctly predicted continuous span in the predicted answer; Recall represents the proportion of the correctly predicted continuous span in the ground truth. F1 is defined as follows.

$F_1 = \dfrac{2 \times P \times R}{P + R}$

where $P$ is Precision and $R$ is Recall.
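The following is a minimal sketch of the EM and F1 computation consistent with the definitions above; the whitespace tokenization is a simplification for illustration and does not reproduce the official evaluation scripts of either dataset.

```python
from collections import Counter

def exact_match(prediction, ground_truth):
    """EM: 1 if the predicted answer string equals the ground truth exactly."""
    return float(prediction.strip() == ground_truth.strip())

def f1_score(prediction, ground_truth):
    """Word-level F1 from precision and recall, as defined above."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: F1 rewards partial overlap even when EM is 0.
print(exact_match("July 23, 2021", "July 23, 2021"))   # 1.0
print(f1_score("on July 23, 2021", "July 23, 2021"))   # ~0.86
```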


4) PERFORMANCE
In Table 3, this paper compares the evaluation results of multiple models on the DuReader 2.0 and SQuAD v1.1 dev sets. Based on the baseline model, our model further improves F1 (+0.94; +0.526) and EM (+0.918; +0.464). Table 4 shows a comparison of model parameters. The improvement in EM is obvious, indicating that the model deepens its understanding of the text and predicts more accurate answers.

Table 3. Results of the models on the DuReader 2.0 and SQuAD v1.1 dev sets
                DuReader (dev)        SQuAD (dev)
Model           F1        EM          F1        EM
ALBERT          82.615    70.430      86.719    78.742
ALBERT-Large    83.664    71.983      88.179    80.643
BERT            83.856    72.912      87.882    80.047
RoBERTa         83.911    72.618      88.565    80.766
Our Model       84.851    73.536      89.091    81.230

Table 4. Parameter comparison of the models
Model          Params (M)
BERT           110
RoBERTa        110
S&I Reader     119

In the experiment on DuReader 2.0, the model was trained for 10,890 steps; a checkpoint was saved every 2,000 steps and the performance of the model was recorded. The changes in F1 and EM of our model and the baseline model with the number of training steps are shown in Figure 3 and Figure 4. In the early stage of training, due to the increase in model parameters, the performance of our model is slightly lower than the baseline model. After full training, the performance of our model is consistently better than the baseline model.

FIGURE 3. F1 comparison line chart.

FIGURE 4. EM comparison line chart.

The model can further understand the text semantics. As shown in Table 5, we give a sample from the DuReader 2.0 dev set. In this sample, the baseline model misunderstands the question and context and cannot accurately locate and predict the answer. On the basis of the baseline model, our model adds the multi-hop mechanism, which deepens the model's understanding of the text, so it correctly predicts the answer span.

Table 5. A comparative MRC example
Context:
对于 ATM 机每日取款限额中的每日的意思就是 00:00 到当天的 23:59,这个时间就是当日,过了 0 点就是次日了。银行的 ATM 最高取款只能取 2 万元,ATM 转账只能转款 5 万元,取现超过 2 万要到银行柜台办理,超过五万需要提前一天预约,转账超过 5 万的话需要到银行柜台办理。
Question:
atm 机取款限额
GroundTruth: 2 万元
RoBERTa: 00:00 到当天的 23:59
Our Model: 2 万元

Table 5 (English translation). A comparative MRC example
Context:
For the ATM machine's daily withdrawal limit, "daily" means from 00:00 to 23:59 of the day. This time is the same day, and after 0:00 it is the next day. The bank's ATM withdrawals can only withdraw 20,000 yuan, and ATM transfers can only transfer 50,000 yuan. Cash withdrawals of more than 20,000 yuan must be processed at the bank counter. If the cash exceeds 50,000 yuan, an appointment must be made one day in advance. If the transfer exceeds 50,000 yuan, it must be processed at the bank counter.
Question:
ATM withdrawal limit
GroundTruth: 20,000 yuan
RoBERTa: 00:00 to 23:59 of the day
Our Model: 20,000 yuan

This paper finds that the model can solve the over-stability problem of previous models to a certain extent. As shown in Table 6, a sample from the DuReader 2.0 dev set is selected as an example. We find that the baseline model matches only literally, that is, the bold part marked in Table 6. It can be seen that the baseline model only literally matches the part of the context that is similar to the question text and obtains the wrong answer. Our model matches according to the semantics of the question and context and finds the correct answer.


Table 6. An over-stability example from the dev set
Context:
24 日,英国脱欧公投,截止到北京时间下午 1:01 分,据英国广播公司 BBC 报道,英国脱欧公投结果揭晓,英国将正式脱欧。脱欧票数 1683 万 5512 票,留欧 1569 万 2093 票。根据目前结果显示,脱欧派在公投中占据上风,这意味着英国将会退出欧盟。投票结果将在 14:00 左右正式公布。英国广播公司(BBC)当地时间 6 月 24 日曾作出官方预测,英国脱欧公投的最终结果基本已经确定:英国将离开欧盟。这使得其他包括北爱尔兰和苏格兰在内的地区纷纷考虑独立可能。另据报道,爱尔兰新芬党表示,他们将会就北爱尔兰独立、爱尔兰统一举行新的投票。苏格兰首席大臣接受采访时则表示:"我们已经看到苏格兰加入欧盟的那一天了。"这预示苏格兰可能脱离英国独立。|只是宣布,公投结果是脱欧,但还没正式脱欧,英国高等法院对于脱欧的裁定让议会进行无休止的讨论,可能上诉最高院,没意外,脱欧应该要明年 3 月|英国公投 382 个投票区计票最终结果显示,51.9%的民众选择支持脱离欧盟。英国"脱欧"重创英国金融市场,今天英镑暴跌超,创 30 年来新低。另外,欧美股市前景暗淡,亚洲股市也应声下跌。此外,有媒体猜测持"留欧"立场的英国首相卡梅伦可能因此辞职。
Question:
英国脱欧时间
GroundTruth: 明年 3 月
RoBERTa: 北京时间下午 1:01 分
Our Model: 明年 3 月

Table 6 (English translation). An over-stability example from the dev set
Context:
On the 24th, the Brexit referendum ended at 1:01 pm Beijing time. According to the BBC report, the results of the Brexit referendum were announced and the UK will officially leave the European Union. The number of votes for Brexit was 16,835,512 and 15,692,093 for staying in Europe. According to the current results, the Brexit camp has the upper hand in the referendum, which means that Britain will withdraw from the European Union. The voting results will be officially announced around 14:00. The BBC made an official prediction on June 24, local time, that the final result of the Brexit referendum has basically been determined: Britain will leave the European Union. This makes other regions, including Northern Ireland and Scotland, consider the possibility of independence. According to another report, the Irish Sinn Fein Party stated that they will hold new votes on the independence of Northern Ireland and the unification of Ireland. The Chief Minister of Scotland said in an interview: "We have already seen the day when Scotland joins the European Union." This indicates that Scotland may become independent from Britain. | It was only announced that the result of the referendum was Brexit, but it has not formally left the European Union. The British High Court's ruling on Brexit has allowed the parliament to have endless discussions and may appeal to the Supreme Court. Barring surprises, Brexit should be in March next year. | The final count of the 382 voting districts in the UK referendum showed that 51.9% of the people chose to support leaving the European Union. Britain's "Brexit" has hit the British financial market severely. The pound fell sharply today, hitting a 30-year low. The prospects for European and American stock markets were bleak, and Asian stock markets also fell. In addition, some media speculated that British Prime Minister Cameron, who holds a "stay in Europe" position, might resign as a result.
Question:
Brexit time
GroundTruth: March next year
RoBERTa: 1:01 pm Beijing time
Our Model: March next year

5) ABLATION
In order to analyze the influence of the Skimming Reading Module, Intensive Reading Module, Gated Mechanism, Multi-granularity Module, and the number of multi-hop iterations on the performance of the model, ablation experiments on DuReader 2.0 were carried out in this paper. Table 7 shows the performance of the model in the different ablation experiments.

Table 7. Ablation results
                                 DuReader (dev)
Ablation                         F1        EM
1. w/o Skimming Module           84.368    72.406
2. w/o Intensive Module          84.418    72.430
3. w/o Gated Mechanism           83.735    72.054
4. w/o Multi-granularity         84.542    73.465
5. multi-hop 1                   84.329    72.618
6. multi-hop 2                   84.198    72.689
7. multi-hop 3 (final model)     84.851    73.536
8. multi-hop 4                   84.097    73.042
9. multi-hop 5                   84.151    72.406

According to the multiple groups of ablation experiments in Table 7, compared with Experiment 7, Experiment 1 shows that bidirectional attention over attention helps the model pay attention to key content and can improve model performance to a certain extent. Experiment 2 shows that further establishing a more complete relationship between the question and the context helps improve performance. Experiment 3 shows that the Gated Mechanism helps the model filter out unimportant information and can significantly improve the performance of the model. Experiment 4 shows that processing text at multiple granularities helps the model understand textual information at multiple levels and further improves performance.
At the same time, Experiments 5-9 show that appropriately increasing the number of multi-hop iterations can help the model understand the text semantics more deeply, alleviate the problems of insufficient learning and over-stability, and improve the accuracy of the model's predicted answers.


However, increasing the number of multi-hop iterations also increases the model's parameters and computation, which affects the performance and efficiency of the model. The experimental results show that the model achieves the best performance with three hops.

6) SKIMMING READING MODULE VERIFICATION
In order to further verify and explain the effectiveness of the Skimming Reading Module, this paper selects a sample from the DuReader 2.0 dev set, as shown in Table 8. When the sample enters the Skimming Reading Module, the model judges the sample's question keywords and their associated context key parts, as well as the context keywords and their associated question key parts, and we draw the corresponding heat maps, as shown in Figure 5 and Figure 6.

Table 8. A dev set example
Context:
据悉,每年通过注册安全工程师过的人还是比较多,加上需求此证书的企业减少,许多单位的注册安全工程师基本属于挂名,市场情况不是很好,注册安全工程师挂靠价格不是很高,去年为 1 万-2 万之间,其收入要视地区和公司盈利状况。
Question:
安全工程师挂靠价格
GroundTruth:
1 万-2 万

Table 8 (English translation). A dev set example
Context:
It is reported that there are still a lot of registered safety engineers every year. In addition, there are fewer companies that require this certificate. The registered safety engineers of many units are basically nominal, the market situation is not very good, and the price of registered safety engineers is not very high: between 10,000 and 20,000 yuan last year, with the income depending on the region and the company's profitability.
Question:
The price of a registered safety engineer.
GroundTruth:
Between 10,000 and 20,000 yuan

FIGURE 5. Question keywords and associated context key parts.

FIGURE 6. Context keywords and associated question key parts.


The horizontal and vertical axes of Figure 5 and Figure 6 represent the question and context text, respectively. From Figure 5 and Figure 6, we can see that the model accurately identifies the keyword of the question, "price". It can also be seen from Figure 5 that the Skimming Reading Module identifies the parts of the context that are semantically highly related to the question keyword "price", such as "income", "market", "region", and "company's profitability". The sample and heat maps therefore further verify that the Skimming Reading Module has the ability to semantically identify keywords and their corresponding key parts.

V. CONCLUSION
This paper proposes a reading comprehension model, S&I Reader, that combines multiple skimming and intensive reading modules, a gated mechanism, and a multi-granularity module. The model focuses on developing the downstream processing structure of the pre-training model to address the performance bottleneck of current pre-training models. The downstream structure of the model simulates human behaviors in solving reading comprehension tasks, such as skimming, intensive reading, filtering useful information, and understanding text at multiple granularities. Experiments prove that the model is effective on both Chinese and English datasets, that its performance can be further improved on the basis of the pre-training model, that it can mitigate the errors caused by insufficient learning and over-stability to a certain extent, and that it deepens the model's understanding of the text. At the same time, experiments and analyses have also verified the effectiveness of the proposed Skimming Reading Module in selecting key information from the text. We note that S&I Reader adds a large number of parameters and a multi-layer structure on top of a pre-training model that already has a large number of parameters. Therefore, simplifying the structure and parameters of the model while maintaining its performance, so that it can better meet the resource and time requirements of industrial applications, is important future work. At the same time, in the process of testing the model, we found that for a small number of samples, because the model cannot capture the internal meaning of the entities in the text and the relationships between entities, it gives irrelevant answers. Therefore, how to introduce knowledge graphs into the model to obtain prior knowledge, and how to better understand the context and question to avoid absurd answers, are also our next research directions.

REFERENCES
[1] S. Liu, X. Zhang, S. Zhang, H. Wang, and W. Zhang, "Neural machine reading comprehension: Methods and trends," Applied Sciences, vol. 9, no. 18, p. 3698, 2019.
[2] Z. Zhang et al., "Semantics-aware BERT for language understanding," arXiv preprint arXiv:1909.02209, 2019.
[3] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[5] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019.
[6] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," in Advances in Neural Information Processing Systems, 2019, pp. 5753-5763.
[7] R. Jia and P. Liang, "Adversarial examples for evaluating reading comprehension systems," arXiv preprint arXiv:1707.07328, 2017.
[8] L. A. Zadeh, "Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic," Fuzzy Sets and Systems, vol. 90, no. 2, pp. 111-127, 1997.
[9] L. A. Zadeh, "Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of information/intelligent systems," Soft Computing, vol. 2, no. 1, pp. 23-25, 1998.
[10] Y. Zhang, D. Miao, W. Pedrycz, T. Zhao, J. Xu, and Y. Yu, "Granular structure-based incremental updating for multi-label classification," Knowledge-Based Systems, vol. 189, p. 105066, 2020.
[11] B. Zhang and L. Zhang, "Problem solving theory and application," Tsinghua University Press, Beijing, 1990.
[12] C. Yunxian, L. Renjie, Z. Shuliang, and G. Fenghua, "Measuring multi-spatiotemporal scale tourist destination popularity based on text granular computing," PLoS ONE, vol. 15, no. 4, p. e0228175, 2020.
[13] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[14] W. He et al., "DuReader: A Chinese machine reading comprehension dataset from real-world applications," arXiv preprint arXiv:1711.05073, 2017.
[15] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," arXiv preprint arXiv:1606.05250, 2016.
[16] T. Nguyen et al., "MS MARCO: A human-generated machine reading comprehension dataset," 2016.
[17] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, "RACE: Large-scale reading comprehension dataset from examinations," arXiv preprint arXiv:1704.04683, 2017.
[18] K. M. Hermann et al., "Teaching machines to read and comprehend," in Advances in Neural Information Processing Systems, 2015, pp. 1693-1701.
[19] Z. Wu and H. Xu, "Improving the robustness of machine reading comprehension model with hierarchical knowledge and auxiliary unanswerability prediction," Knowledge-Based Systems, p. 106075, 2020.
[20] B. Dhingra, H. Liu, Z. Yang, W. W. Cohen, and R. Salakhutdinov, "Gated-attention readers for text comprehension," arXiv preprint arXiv:1606.01549, 2016.
[21] S. Wang and J. Jiang, "Machine comprehension using Match-LSTM and answer pointer," arXiv preprint arXiv:1608.07905, 2016.
[22] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, "Bidirectional attention flow for machine comprehension," arXiv preprint arXiv:1611.01603, 2016.
[23] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, "Gated self-matching networks for reading comprehension and question answering," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 189-198.
[24] J. Zhang, X. Zhu, Q. Chen, L. Dai, S. Wei, and H. Jiang, "Exploring question understanding and adaptation in neural-network-based question answering," arXiv preprint arXiv:1703.04617, 2017.


[25] A. W. Yu et al., "QANet: Combining local convolution with global self-attention for reading comprehension," arXiv preprint arXiv:1804.09541, 2018.
[26] L. Kaiser, A. N. Gomez, and F. Chollet, "Depthwise separable convolutions for neural machine translation," arXiv preprint arXiv:1706.03059, 2017.
[27] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[28] Y. Cui, Z. Chen, S. Wei, S. Wang, T. Liu, and G. Hu, "Attention-over-attention neural networks for reading comprehension," arXiv preprint arXiv:1607.04423, 2016.
[29] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[30] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[31] Y. Gong and S. R. Bowman, "Ruminating Reader: Reasoning with gated multi-hop attention," arXiv preprint arXiv:1704.07415, 2017.
[32] Y. Yao, "Three-way granular computing, rough sets, and formal concept analysis," International Journal of Approximate Reasoning, vol. 116, pp. 106-125, 2020.
[33] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[34] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026-8037.

Yong Wang received the Ph.D. degree from East China Normal University in 2007. He is currently an Associate Professor with the School of Artificial Intelligence, Liangjiang, Chongqing University of Technology. His research interests include deep learning, natural language processing, multimedia, and big data technology.

Chong Lei received the bachelor's degree from the Taiyuan University of Technology in 2017. He is currently pursuing the master's degree with the College of Computer Science and Engineering, Chongqing University of Technology. His research interests include deep learning, natural language processing, and machine reading comprehension.

Duoqian Miao received the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, in 1997. He is currently a Professor with the Department of Computer Science and Technology, Tongji University. His research interests include artificial intelligence, machine learning, big data analysis, and granular computing.
