Probing Prior Knowledge Needed in Challenging Chinese Machine Reading Comprehension

Kai Sun1∗  Dian Yu2  Dong Yu2  Claire Cardie1
1 Cornell University, Ithaca, NY
2 Tencent AI Lab, Bellevue, WA
ks985@cornell.edu, {yudian, dyu}@tencent.com, cardie@cs.cornell.edu
arXiv:1904.09679v2 [cs.CL] 30 Apr 2019

Abstract

With an ultimate goal of narrowing the gap between human and machine readers in text comprehension, we present the first collection of Challenging Chinese machine reading Comprehension datasets (C3) collected from language and professional certification exams, which contains 13,924 documents and their associated 23,990 multiple-choice questions. Most of the questions in C3 cannot be answered merely by surface-form matching against the given text.

As a pilot study, we closely analyze the prior knowledge (i.e., linguistic, domain-specific, and general world knowledge) needed in these real-world reading comprehension tasks. We further explore how to leverage linguistic knowledge, including a lexicon of idioms and proverbs, graphs of general world knowledge (e.g., ConceptNet), and domain-specific knowledge such as textbooks, to aid machine readers, through fine-tuning a pre-trained language model. Experimental results demonstrate that linguistic and general world knowledge may help improve the performance of the baseline reader in both general and domain-specific tasks. C3 will be available at http://dataset.org/c3/.

1 Introduction

Machine reading comprehension (MRC) tasks, which aim to teach machine readers to read and understand a reference material (e.g., a document) and evaluate the comprehension ability of machines by letting them answer questions relevant to the given content (Poon et al., 2010; Richardson et al., 2013), have attracted substantial attention from both academia and industry.

An increasing number of studies focus on developing MRC datasets that contain a significant number of questions that require prior knowledge in addition to the given context (Richardson et al., 2013; Mostafazadeh et al., 2016; Lai et al., 2017; Ostermann et al., 2018; Khashabi et al., 2018; Talmor et al., 2018; Sun et al., 2019a). Therefore, they can serve as good test-beds for evaluating progress towards the goals of teaching machine readers to use different kinds of prior knowledge for better text comprehension and narrowing the performance gap between human and machine readers in real-world settings such as language or subject examinations. However, progress on these kinds of tasks is mostly limited to English (Storks et al., 2019) because of the unavailability of large-scale datasets in other languages.

To study the prior knowledge needed to better comprehend written and oral texts in Chinese, we propose the first collection of Challenging Chinese multiple-choice machine reading Comprehension datasets (C3) that contain both general and domain-specific tasks. For the general-domain task: given a reference document that can be either written or oral (i.e., a dialogue), select the correct answer option from all options associated with a question. In addition, we present a challenging task that has not been explored in the literature: given a counseling oral text (a third-person narrative or a dialogue) mostly about life concerns and an additional domain-specific reference corpus, select the correct answer option(s) from the options associated with a question. Compared to relevant datasets (Ostermann et al., 2018; Sun et al., 2019a), besides the oral text, we also need additional domain-specific knowledge for answering questions. However, it is relatively difficult to link the content in less formal oral language to the corresponding well-written facts, explanations, or definitions in the domain-specific reference corpus. For all the mentioned tasks, we collect questions from language and professional certification exams designed by experts (Section 3.1).

∗ Part of this work was conducted when K. S. was an intern at the Tencent AI Lab, Bellevue, WA.
We observe that three kinds of prior knowledge are required for in-depth understanding of the written and oral texts to answer most of the questions in both general and domain-specific reading comprehension tasks: linguistic knowledge, domain knowledge, and general world knowledge, the last of which is further broken down into eight types such as arithmetic, connotative, and cause-effect (Section 3.2). Around 86.8% of general-domain questions and ALL the domain-specific questions require knowledge beyond the given context. We further investigate the utilization of linguistic knowledge, including a lexicon of common Chinese idioms and proverbs (Section 4.2), general world knowledge in the form of graphs (Speer et al., 2017) (Section 4.3), and domain-specific knowledge such as textbooks, to improve the comprehension ability of machine readers, via fine-tuning a pre-trained language model (Devlin et al., 2019) (Section 4.1). Experimental results show that general-domain lexicons and a general world knowledge graph generally improve the baseline performance on both general and domain-specific tasks. Experiments also demonstrate that for domain-specific questions, typical methods such as enriching the given text with sentences retrieved from additional domain-specific corpora actually hurt the performance of the baseline, which indicates the great challenge in finding external knowledge relevant to informal oral texts (Section 5.2). We hope our observations and proposed challenging datasets may inspire further research on knowledge acquisition and utilization for Chinese or cross-language reading comprehension.

2 Related Work

2.1 Machine Reading Comprehension and Question Answering

We discuss tasks in which texts are written in English. Much of the early work focuses on constructing large-scale extractive MRC datasets: answers are spans from the reference document (Hermann et al., 2015; Hill et al., 2016; Bajgar et al., 2016; Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017). As a question and its answer are usually in the same sentence, deep neural models (Devlin et al., 2019; Radford et al., 2019) have outperformed human performance on many such tasks. To increase the difficulty of MRC tasks, researchers have explored ways including adding unanswerable (Trischler et al., 2017; Rajpurkar et al., 2018) or conversational (Reddy et al., 2018; Choi et al., 2018) questions that might require reasoning (Zhang et al., 2018a), and designing free-form answers (Kočiskỳ et al., 2018) or (question, answer) pairs that cover the content of multiple sentences or documents (Welbl et al., 2018; Yang et al., 2018). Still, questions usually provide sufficient information to find answers in the given context.

There are also a variety of non-extractive machine reading comprehension (Richardson et al., 2013; Mostafazadeh et al., 2016; Lai et al., 2017; Khashabi et al., 2018; Talmor et al., 2018; Sun et al., 2019a) and question answering tasks (Clark et al., 2016, 2018; Mihaylov et al., 2018), mostly in multiple-choice form. For question answering tasks, there is no reference document provided for each question. Instead, a reference corpus is provided, which contains a collection of domain-specific textbooks and/or related encyclopedia articles. For these tasks, besides the given reference documents or corpora, knowledge from other resources may be necessary to solve a significant percentage of questions.

Besides these standard tasks, we are aware of a trend of formalizing tasks such as relation extraction (Levy et al., 2017), word prediction (Chu et al., 2017), and judgment prediction (Long et al., 2018) as extractive or non-extractive machine reading comprehension problems, which are beyond the scope of this paper.

We compare our proposed tasks with similar datasets in Table 1. C3-2A and C3-2B can be regarded as new challenging tasks that have never been studied before.

2.2 Chinese Machine Reading Comprehension and Non-Extractive Question Answering

We have seen the construction of span-style Chinese machine reading comprehension datasets (Cui et al., 2016, 2018b,a; Shao et al., 2018), using Chinese news reports, books, and Wikipedia articles as source documents, similar to their English counterparts CNN/Daily Mail (Hermann et al., 2015), CBT/BT (Hill et al., 2016; Bajgar et al., 2016), and SQuAD (Rajpurkar et al., 2016), in which all answers are extractive spans from the provided reference documents.

Previous work also focuses on non-extractive question answering tasks (Cheng et al., 2016; Guo
  Task   | Reference Document | Reference Corpus | Domain        | Chinese | English
  C3-1A  | mixed-genre        | ×                | general       | N/A     | RACE (Lai et al., 2017)
  C3-1B  | dialogue           | ×                | general       | N/A     | DREAM (Sun et al., 2019a)
  C3-2A  | narrative          | ✓                | psychological | N/A     | N/A
  C3-2B  | dialogue           | ✓                | psychological | N/A     | N/A

Table 1: Comparison of C3 and existing representative large-scale multiple-choice reading comprehension tasks.

et al., 2017a,b; Zhang and Zhao, 2018; Zhang et al., 2018b; Hao et al., 2019), in which questions are usually collected from examinations. Another kind of non-extractive question answering task (He et al., 2017) is based on search engines (similar to English MS MARCO (Nguyen et al., 2016)): researchers collect questions from query logs and ask crowdsourcers to generate answers.

Compared to the tasks mentioned above, we focus on Chinese machine reading comprehension tasks that require prior knowledge to facilitate the understanding of the given text.

3 Data

In this section, we describe how we construct C3 (Section 3.1), and we analyze the data (Section 3.2) and the knowledge needed (Section 3.3).

3.1 Collection Methodology and Task Definitions

We collect general-domain problems from Hanyu Shuiping Kaoshi (HSK) and Minzu Hanyu Kaoshi (MHK), which are designed to evaluate the Chinese listening and reading comprehension ability of second-language learners such as international students, overseas Chinese, and ethnic minorities. We collect domain-specific problems from Psychological Counseling Examinations (a national qualification test that certifies level two and level three psychological counselors in China), which focus on assessing the acquisition and retention of subject knowledge. We include problems from both real and practice exams, and all of them are freely accessible online for public usage.

Each general-domain problem consists of a reference document and a series of questions. Each question is associated with several answer options, EXACTLY ONE of which is correct. The goal is to select the correct option. According to the reference document type, we divide the collected general-domain problems into two sub-tasks: C3-1A and C3-1B. In C3-1B, a dialogue serves as the reference document; the rest of the problems belong to C3-1A. We show a sample problem for each type in Table 2 and Table 3, respectively.

Each domain-specific problem comprises a reference document mostly about life concerns (e.g., social, work, family, school, and emotional or physical health), which contains one or multiple sub-documents. Every sub-document is followed by a series of questions designed mainly for that sub-document. Each question is associated with several answer options, AT LEAST ONE of which is correct. The goal is to select all correct answer options. An answerer is allowed to read the complete reference document and utilize the relevant knowledge in an additional reference corpus, such as a psychological counseling textbook (Section 5.2), to reach the correct answer options. Similarly, domain-specific problems are divided into sub-tasks C3-2A and C3-2B. For each problem in C3-2B, the reference document is made up of one or multiple dialogues (in chronological order); C3-2A contains the rest of the problems, in which reference documents are third-person narratives. Due to limited space, we show a sample problem for each type in Appendix A (Table 12 and Table 13).

We remove duplicate problems and randomly split the data (13,924 documents and 23,990 questions in total) at the problem level, with 60% training, 20% development, and 20% test.

3.2 Data Analysis

We summarize the overall statistics of C3 in Table 4. We observe that some differences, which may be relevant to the difficulty level of questions, exist between general-domain (i.e., C3-1A and C3-1B) and domain-specific tasks (i.e., C3-2A and C3-2B). For example, the percentage of non-extractive correct answer options in domain-specific tasks (C3-2A: 91.4%; C3-2B: 95.2%) is much higher than that in general-domain Chinese (C3-1A: 81.9%; C3-1B: 78.9%) and English language exams (RACE (Lai et al., 2017): 87.0%; DREAM (Sun et al., 2019a): 83.7%).
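The non-extractive percentage above follows from a simple verbatim-containment check: a correct option counts as non-extractive if its surface form never appears in the reference document. A minimal Python sketch, where the (document, options, correct-indices) layout is illustrative rather than the released C3 format:

```python
def non_extractive_rate(problems):
    """Fraction of correct options absent verbatim from their reference document.

    problems: iterable of (document, options, correct_indices) tuples.
    """
    total = non_extractive = 0
    for document, options, correct_indices in problems:
        for i in correct_indices:
            total += 1
            # Character-level containment; no word segmentation is
            # needed for Chinese here.
            if options[i] not in document:
                non_extractive += 1
    return non_extractive / total if total else 0.0

sample = [
    ("教室里坐满了学生。", ["听课的人多", "教室很暗"], [0]),  # non-extractive
    ("他第一次上课,怕了。", ["怕了", "很平静"], [0]),          # extractive span
]
print(non_extractive_rate(sample))  # → 0.5
```

Note that this surface-form criterion is deliberately strict: an option that paraphrases the document still counts as non-extractive, which is exactly what makes such questions hard for span-matching readers.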
1928年,经徐志摩介绍,时任中国公学校                           In 1928, recommended by Hsu Chih-Mo (1897-1931), Hu Shih (1891-
 长的胡适聘用了沈从文做讲师,主讲大学                             1962), who was the president of the previous National University of
 一年级的现代文学选修课。                                   China, employed Shen Ts’ung-wen (1902-1988) as a lecturer of the
                                                university who was in charge of teaching the optional course of modern
                                                literature.
 当时,沈从文已经在文坛上崭露头角,在                             At that time, Shen already made himself conspicuous in the literary
 社会上也小有名气,因此还未到上课时                              world and was a little famous in society. For this sake, even before the
 间,教室里就坐满了学生。上课时间到                              beginning of class, the classroom was crowded with students. Upon the
 了,沈从文走进教室,看见下面黑压压                              arrival of class, Shen went into the classroom. Seeing a dense crowd
 一片,心里陡然一惊,脑子里变得一片空                             of students sitting beneath the platform, Shen was suddenly startled and
 白,连准备了无数遍的第一句话都堵在嗓                             his mind went blank. He was even unable to utter the first sentence he
 子里说不出来了。                                       had rehearsed repeatedly.
 他呆呆地站在那里,面色尴尬至极,双手                             He stood there motionlessly, extremely embarrassed. He wrung his
 拧来拧去无处可放。上课前他自以为成竹                        hands without knowing where to put them. Before class, he believed
 在胸,所以就没带教案和教材。整整10分                        that he had had a ready plan to meet the situation so he did not bring
 钟,教室里鸦雀无声,所有的学生都好奇                             his teaching plan and textbook. For up to 10 minutes, the classroom
 地等着这位新来的老师开口。沈从文深吸                             was in perfect silence. All the students were curiously waiting for the
 了一口气,慢慢平静了下来,原先准备好                             new teacher to open his mouth. Breathing deeply, he gradually calmed
 的东西也重新在脑子里聚拢,然后他开始                             down. Thereupon, the materials he had previously prepared gathered in
 讲课了。不过由于他依然很紧张,原本预                             his mind for the second time. Then he began his lecture. Nevertheless,
 计一小时的授课内容,竟然用了不到15 分                           since he was still nervous, it took him less than 15 minutes to finish the
 钟就讲完了。                                         teaching contents he had planned to complete in an hour.
 接下来怎么办?他再次陷入了窘境。无奈                             What should he do next? He was again caught in embarrassment. He
 之下,他只好拿起粉笔在黑板上写道:我                             had no choice but to pick up a piece of chalk before writing several
 第一次上课,见你们人多,怕了。                                words on the blackboard: This is the first time I have given a lecture. In
                                                the presence of a crowd of people, I feel terrified.
 顿时,教室里爆发出了一阵善意的笑声,                             Immediately, a peal of friendly laughter filled the classroom. Presently,
 随即一阵鼓励的掌声响起。得知这件事之                             a round of encouraging applause was given to him. Hearing this
 后,胡适对沈从文大加赞赏,认为他非常                             episode, Hu heaped praise upon Shen, thinking that he was very suc-
 成功。                                            cessful.
 有了这次经历,在以后的课堂上,沈从文                             Because of this experience, Shen always reminded himself of not being
 都会告诫自己不要紧张,渐渐地,他开始                             nervous in his class for years afterwards. Gradually, he began to give
 在课堂上变得从容起来。                                    his lecture at leisure in class.
 Q1   第2段中,“黑压压一片”指的是:                          Q1   In paragraph 2, “a dense crowd” refers to
 A.   教室很暗                                      A.   the light in the classroom was dim.
 B.   听课的人多⋆                                    B.   the number of students attending his lecture was large. ⋆
 C.   房间里很吵                                     C.   the room was noisy.
 D.   学生们发言很积极                                  D.   the students were active in voicing their opinions.
 Q2   沈从文没拿教材,是因为他觉得:                           Q2   Shen did not bring the textbook because he felt that
 A.   讲课内容不多                                    A.   the teaching contents were not many.
 B.   自己准备得很充分⋆                                 B.   his preparation was sufficient. ⋆
 C.   这样可以减轻压力                                  C.   his mental pressure could be reduced in this way.
 D.   教材会限制自己的发挥                                D.   the textbook was likely to restrict his ability to give a lecture.
 Q3   看见沈从文写的那句话,学生们:                           Q3   Seeing the sentence written by Shen, the students
 A.   急忙安慰他                                     A.   hurriedly consoled him.
 B.   在心里埋怨他                                    B.   blamed him in mind.
 C.   受到了极大的鼓舞                                  C.   were greatly encouraged.
 D.   表示理解并鼓励了他⋆                                D.   expressed their understanding and encouraged him.⋆
 Q4   上文主要谈的是:                                  Q4   The passage above is mainly about
 A.   中国教育制度的发展                                 A.   the development of the Chinese educational system.
 B.   紧张时应如何调整自己                                B.   how to make self-adjustment if one is nervous.
 C.   沈从文第一次讲课时的情景⋆                             C.   the situation where Shen gave his lecture for the first time.⋆
 D.   沈从文如何从作家转变为教师的                            D.   how Shen turned into a teacher from a writer.

    Table 2: A sample C3 -1A problem (left) and its English translation (right) (⋆: the correct answer option).

Moreover, the average document/question length of C3-2A and C3-2B is much longer than that of C3-1A and C3-1B. These differences are probably due to the fact that domain-specific exam designers (experts) assume that most of the participants are high-proficiency native readers who have obtained at least a bachelor's degree in psychology, education, or sociology, while C3-1A and C3-1B are designed for less proficient second-language learners.

Chinese idioms and proverbs, which are widely used in both written and oral language, play an essential role in Chinese learning and understanding because of their conciseness in form and expressiveness in meaning (Lewis et al., 1998; Yang and Xie, 2013). We notice that a significant percentage of reference documents in C3 (especially C3-2A in Table 4) contain at least one idiom or proverb. As the meaning of such an expression may not be predictable from the meanings of its constituent parts, we require culture-specific background knowledge (Wong et al., 2010). For example, to answer Q2 in Table 2, we need to know that the bolded idiom "成竹在胸" means "has a ready plan to meet the situation" instead of its literal meaning "chest-have-fully developed-bamboo", derived from a story about a painter who has a complete image of the bamboo in mind before drawing it. Therefore, the frequent use of idioms and proverbs in C3 may impede comprehension for human readers as well as pose challenges for machine readers. We introduce details about how we attempt to teach machine readers idioms and proverbs in Section 4.2.

 F: How is it going? Have you bought your ticket?
 M: There are so many people at the railway station. I have waited in line all day long. However, when my turn comes, they say that there is no ticket left unless the Spring Festival is over.
 F: It doesn't matter. It is all the same for you to come back after the Spring Festival is over.
 M: But according to our company's regulation, I must go to the office on the 6th day of the first lunar month. I'm afraid I have no time to go back after the Spring Festival, so could you and my dad come to Shanghai for the coming Spring Festival?
 F: I am too old to endure the travel.
 M: It is not difficult at all. After I help you buy the tickets, you can come here directly.

 Q1   What is the relationship between the speakers?
 A.   father and daughter
 B.   mother and son ⋆
 C.   classmates
 D.   colleagues
 Q2   What difficulty has the male met?
 A.   his company does not have a vacation.
 B.   things are expensive during the Spring Festival.
 C.   he has not bought his ticket. ⋆
 D.   he cannot find the railway station.
 Q3   What suggestion does the male put forth?
 A.   he invites the female to come to Shanghai. ⋆
 B.   he is going to wait in line the next day.
 C.   he wants to go to the company as soon as possible.
 D.   he is going to go home after the Spring Festival is over.

Table 3: English translation of a sample problem from C3-1B (⋆: the correct answer option). We show the original problem in Chinese in Appendix A (Table 8).

3.3 Categories of Knowledge

Because there is no prior work discussing the knowledge required in Chinese machine reading comprehension, we carefully analyze a subset of questions randomly sampled from the development and test sets of C3 (Table 4) and arrive at the following three kinds of prior knowledge.

L INGUISTIC: To answer a given question (e.g., Q2 in Table 2 and Q3 in Table 3), we require lexical/grammatical knowledge including but not limited to idioms, proverbs, negation, antonymy, synonymy, and sentence structures.

D OMAIN -S PECIFIC: This kind of world knowledge consists of, but is not limited to, facts about domain-specific concepts, their definitions and properties, and relations among these concepts (Grishman et al., 1983; Hansen, 1994).

G ENERAL W ORLD: This refers to general knowledge about how the world works, sometimes called commonsense knowledge. We focus on the sort of world knowledge that an encyclopedia would assume readers know without being told (Lenat et al., 1985; Schubert, 2002) instead of factual knowledge such as properties of famous entities. We further break down general world knowledge into eight types, some of which (marked with †) are similar to the categories for recognizing textual entailment summarized by LoBue and Yates (2011).

  • Arithmetic†: This includes numerical computation and analysis (e.g., comparisons).

  • Connotation: This includes knowledge about implicit and implied sentiments (Feng et al., 2013; Van Hee et al., 2018).

  • Cause-effect†: The occurrence of an event A causes the occurrence of an event B. See Q2 in Table 2 for an example.

  • Implication: This category indicates implicit inference from the content explicitly described in the text, which cannot be reached by paraphrasing sentences using linguistic knowledge. For example, Q1 and Q4 in Table 2 belong to this category.

  • Part-whole: We require knowledge that object A is a part of object B. Relations such as
Metric                                                 C3 -1A            C3 -1B            C3 -2A          C3 -2B
  Min./Avg./Max. # of options per question               2 / 3.7 / 4       3 / 3.8 / 4       4/4/4           4/4/4
  Min./Avg./Max. # of correct options per question       1/1/1             1/1/1             1 / 1.9 / 4     1 / 1.8 / 4
  Min./Avg./Max. # of questions per reference document   1 / 1.9 / 6       1 / 1.2 / 6       2 / 10.0 / 20   1 / 6.4 / 22
  Avg./Max. option length (in characters)                6.5 / 45          4.4 / 31          5.6 / 39        6.5 / 36
  Avg./Max. question length (in characters)              13.5 / 57         10.9 / 34         22.5 / 97       26.0 / 91
  Avg./Max. reference document length (in characters)    180.2 / 1,274     76.3 / 1,540      395.9 / 995     440.1 / 1,651
  Character vocabulary size                               4,120             2,922             2,093           2,075
  Non-extractive correct options (%)                      81.9              78.9              91.4            95.2
  (Sub-)documents that contain proverbs or idioms (%)     25.4              7.8               70.9            33.6
  # of (sub-)documents / # of questions
   Training                                              3,138 / 6,013     4,885 / 5,856     143 / 1,414     225 / 1,216
   Development                                           1,046 / 1,991     1,628 / 1,825     45 / 469        41 / 406
   Testing                                               1,045 / 2,002     1,627 / 1,890     49 / 492        52 / 416
   All                                                   5,229 / 10,006    8,140 / 9,571     237 / 2,375     318 / 2,038

Table 4: The overall statistics of C3 . For C3 -2A and C3 -2B, the reference documents refer to the sub-documents,
with the exception that we regard an option that does not appear in the full reference document as non-extractive.
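The "non-extractive correct option" statistic above counts correct options that do not appear verbatim in the reference document. A minimal sketch of how such a statistic can be computed (the data format and example items are hypothetical, not the authors' released code):

```python
def non_extractive_rate(examples):
    """Percentage of correct options that do not appear verbatim
    (as a character substring) in their reference document.

    `examples` is a list of (document, options, answer_indices) triples,
    a hypothetical format for illustration.
    """
    total, non_extractive = 0, 0
    for document, options, answer_indices in examples:
        for idx in answer_indices:
            total += 1
            if options[idx] not in document:  # character-level substring match
                non_extractive += 1
    return 100.0 * non_extractive / total

examples = [
    ("今天天气很好，我们去公园散步。", ["天气很好", "天气不好"], [0]),  # extractive
    ("他笑了笑，没有说话。", ["他很高兴", "他很生气"], [0]),            # non-extractive
]
print(non_extractive_rate(examples))  # → 50.0
```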

                    Metric                     C3 -1A    C3 -1B    C3 -1   C3 -2A   C3 -2B      C3 -2
                    Matching                   12.0      14.3      13.2    0.0      0.0         0.0
                    Prior knowledge            88.0      85.7      86.8    100.0    100.0       100.0
                      ⋄ Linguistic             49.0      30.7      39.8    33.3     6.7         20.0
                      ⋄ Domain-specific        0.7       1.0       0.8     91.7     98.3        95.0
                      ⋄ General world          50.7      64.0      57.3    71.7     80.0        75.8
                           Arithmetic          3.0       4.7       3.8     0.0      0.0         0.0
                           Connotation         1.3       5.3       3.3     0.0      8.3         4.2
                           Cause-effect        14.0      6.7       10.3    6.7      1.7         4.2
                           Implication         17.7      20.3      19.0    6.7      10.0        8.3
                           Part-whole          5.0       5.0       5.0     0.0      0.0         0.0
                           Precondition        2.7       4.3       3.5     0.0      0.0         0.0
                           Scenario            9.6       24.3      17.0    61.7     68.3        65.0
                           Other               3.3       0.3       1.8     0.0      0.0         0.0
                    Single sentence            50.7      22.7      36.7    1.7      3.3         2.5
                    Multiple sentences         47.0      77.0      62.0    83.3     81.7        82.5
                    Independent                2.3       0.3       1.3     15.0     15.0        15.0

                    # of annotated questions   300       300       600     60       60          120

 Table 5: Distribution (%) of types of required knowledge based on a subset of test and development sets of C3 .

member-of, stuff-of, and component-of between two objects also fall into this category (Winston et al., 1987; Miller, 1998).

• Scenario: It includes knowledge about human behaviors or activities, which may involve corresponding time and location information. We also consider knowledge about the profession, education, personality, and mental or physical health of the involved participants, as well as the relations among the participants, indicated by the behaviors or activities described in texts. For example, we put Q3 in Table 2 in this category, as “friendly laughter” may express “understanding”.

• Precondition†: If event A had not happened, event B would not have happened (Ikuta et al., 2014; O’Gorman et al., 2016).

• Other: Knowledge that belongs to none of the above categories.

As shown in Table 5, compared to narrative-based (C3-2A) or dialogue-based (C3-1B and C3-2B) tasks, we tend to require more linguistic knowledge and less general world knowledge to answer questions designed for the well-written texts in C3-1A. In C3-2, not surprisingly, 95.0% of questions require domain-specific knowledge, and we notice that a higher percentage (75.8%) of questions require general world knowledge, especially scenario-based knowledge (65.0%), compared to that in C3-1. Besides, we require multiple sentences to answer most of the domain-specific questions, which also reflects the difficulty of these kinds of tasks.

4 Approaches

4.1 Reading Comprehension Model

We follow the framework of discriminatively fine-tuning pre-trained language models on machine reading comprehension tasks (Radford et al., 2018). We use the Chinese BERT-Base model (denoted as BERTCN) released by Devlin et al. (2019) as the pre-trained language model.

Given a reference document d, a question q, and an answer option oi, we construct the input sequence by concatenating a [CLS] token, tokens in d, a [SEP] token, tokens in q, a [SEP] token, tokens in oi, and a [SEP] token, where [CLS] and [SEP] are the classifier token and sentence separator token in BERT, respectively. We add an A embedding to every token before the first [SEP] token (inclusive) and a B embedding to every other token, where A and B are the pre-trained segment embeddings in BERT. We denote the final hidden state of the first token in the input sequence as Si ∈ R1×H. For C3-1A and C3-1B, we introduce a classification layer W1 ∈ R1×H and obtain the unnormalized log probability Pi ∈ R of oi being correct by Pi = Si W1⊤. For C3-2A and C3-2B, we introduce a classification layer W2 ∈ R2×H and obtain the probabilities Pi ∈ R2 of answer option oi being correct and incorrect by Pi = softmax(Si W2⊤). We refer readers to Devlin et al. (2019) for more details.

4.2 Proverb and Idiom Knowledge

As mentioned in Section 3.2, Chinese proverbs and idioms are usually difficult to understand without enough background knowledge. Teachers generally believe that these expressions can be learned effectively via a proverb or idiom dictionary (Lewis et al., 1998). Thus, we consider infusing the linguistic knowledge in a lexicon of proverbs and idioms into the baseline reader.

We propose to introduce an additional fine-tuning stage: instead of directly fine-tuning BERTCN on C3, we first fine-tune BERTCN on multiple-choice proverb and idiom problems that are automatically generated based on proverb and idiom dictionaries, and then fine-tune the resulting model on C3. Specifically, we generate two types of problems: (1) given an explanation of a proverb/idiom, choose the corresponding proverb/idiom; (2) given a proverb/idiom, select the corresponding explanation. To generate distractors (wrong answer options), we first sort all entries (i.e., proverbs and idioms) in alphabetical order and assume two entries are more likely to be closer in meaning if they share more characters. For type (1) problems, we treat entries close to the correct entry as distractors. For type (2) problems, distractors are explanations of entries close to the given entry. See examples of generated problems in Table 9 in Appendix A. When fine-tuning on the generated problems, we regard the given explanation (for type (1) problems) or the given entry (for type (2) problems) as the reference document and leave the question context empty.

4.3 General World Knowledge Graph

A graph of general world knowledge such as ConceptNet (Speer et al., 2017) is useful to help us understand the meanings behind the words and therefore may bridge knowledge gaps between human and machine readers. For instance, relational triples under the relation categories CAUSES and PARTOF in ConceptNet may be helpful for solving questions in C3 that fall into the cause-effect and part-whole subcategories of the general world knowledge defined in Section 3.3.

We propose to introduce an additional fine-tuning stage to incorporate general world knowledge. We first fine-tune BERTCN on multiple-choice problems that are automatically generated based on ConceptNet. We then fine-tune the resulting model on C3. Let (a, r, b) denote a relational triple in ConceptNet: a and b are Chinese words or phrases; r represents the relation type (e.g., CAUSES) between a and b. For each relation type r, we introduce two special tokens [r→] and [r←] to represent r and its reverse relation type, respectively. We convert each (a, r, b) into two problems: (1) given a and [r→], choose b; (2) given b and [r←], choose a. Distractors are formed by randomly picked Chinese words or phrases in ConceptNet. See examples of generated problems in Table 10 in Appendix A. During the fine-tuning stage on the generated problems, we regard the given word or phrase as the reference document and the given relation type token as the question.
5 Experiment

5.1 Experimental Settings

We set the learning rate to 2 × 10−5, the batch size to 24, and the maximal sequence length to 512. We truncate the longest sequence among d, q, and oi (Section 4.1) when the input sequence length exceeds 512. The embeddings of relation type tokens are initialized randomly (Section 4.3). For C3-2A and C3-2B, we regard each sub-document as d. When we introduce one additional fine-tuning stage before fine-tuning on the target C3 task, we first fine-tune on the additional task for one epoch. For all experiments, we fine-tune on the target C3 task(s) for eight epochs. We run every experiment five times with different random seeds and report the best development set performance and its corresponding test set performance.

5.2 Baseline Results and Discussion

We report the baseline performance in Table 6 and discuss the following aspects of our observations.

Domain-specific knowledge: For domain-specific questions in C3-2, we explore three ways to introduce domain-specific knowledge.

First, we follow previous work (Sun et al., 2019b) for English question answering tasks. Given a reference document d, a question q, and an answer option oi, we use Lucene (McCandless et al., 2010) to retrieve the top 50 sentences from two psychological counseling textbooks, using the concatenation of q and oi as the query. In comparison, we append the retrieved sentences to the end of d to form the new input sequence.

In another attempt, we collect 4,544 multiple-choice question answering problems[1] on core knowledge of psychological counseling from Psychological Counseling Examinations (the same source as the problems in C3-2A and C3-2B). Each problem is composed of a question and four answer options, at least one of which is correct. See an example in Table 11 in Appendix A. We first fine-tune BERTCN on the core knowledge problems and then fine-tune the resulting model on C3-2A/C3-2B. In the first stage, we leave the reference document context empty, as no context is provided.

In the third method, we run pre-training steps on the two psychological counseling textbooks starting from the BERTCN checkpoint and use the resulting model as the new pre-trained model.

[1] We will release them along with C3.

However, none of the above methods outperforms the BERTCN baseline. It remains a challenge for future investigation to leverage expert knowledge to improve the performance on C3-2.

Gap between machine and human: We show human performance on the same subset of C3-1 used for the analysis of the required knowledge in Section 3.3. We do not report the human performance on C3-2 due to the wide variance in human expertise with psychological counseling. We see a significant gap between the automated approach and human performance on C3-1, especially on questions that require prior knowledge or multiple sentences, compared to questions that can be answered by surface matching or that only involve content from a single sentence (Section 3.3).

5.3 Impact of Linguistic Knowledge and General World Knowledge

Linguistic Knowledge: We generate 86,720 problems based on 43,360 proverbs/idioms and their explanations. By introducing an additional fine-tuning stage on the generated proverb and idiom problems, we see consistent gains in accuracy over all C3 tasks compared to the BERTCN baseline, with an absolute improvement of 2% in average accuracy (Table 6).

We also compare with an alternative approach to imparting proverb and idiom knowledge. For each reference document d, we use the same lexicon as used in proverb and idiom problem generation to look up the proverbs and idioms in d for their explanations. Let a1...n and b1...n denote the proverbs/idioms in d and their corresponding explanations, respectively. We replace d with the concatenation of d and a1 : b1 a2 : b2 . . . an : bn when constructing the input sequence. However, this approach does not yield promising results.

General World Knowledge: We generate 737,534 problems based on ConceptNet. By introducing an additional fine-tuning stage on the generated problems, we see gains in accuracy over most C3 tasks compared to the BERTCN baseline (Table 6). We notice that C3-1 benefits more than C3-2 from the general world knowledge graph introduced by the proposed approach.

5.4 Other Attempt: Cross Genre Training

We also fine-tune BERTCN on sub-tasks from similar domains but in different genres simultaneously, instead of fine-tuning it on each of the four C3 sub-tasks separately. We observe that BERTCN
   Method                                    C3-1A           C3-1B          C3-2A          C3-2B          Average
                                             Dev   Test      Dev   Test     Dev   Test     Dev   Test     Dev   Test
   Random                                     27.8     27.8   26.4   26.6   6.7     6.7   6.7     6.7   16.9   17.0
   BERTCN                                     63.0     62.6   62.3   62.1   36.7   26.2   34.7   31.3   49.2   45.6
   Domain-specific Knowledge:
   BERTCN + IR from Textbooks                   –        –     –      –     35.2   26.0   32.5   30.0    –      –
   BERTCN + Core Knowledge Problems             –        –     –      –     34.8   27.0   35.5   29.6    –      –
   BERTCN + Textbook Pre-Training               –        –     –      –     36.2   29.7   33.7   28.1    –      –
   Linguistic Knowledge:
   BERTCN + Proverb and Idiom Look-Up         62.6     63.2   62.8   62.2   37.1   27.6   32.3   30.0   48.7   45.8
   BERTCN + Proverb and Idiom Problems        63.6     63.5   63.9   64.0   38.8   30.9   38.4   32.2   51.2   47.7
   General World Knowledge:
   BERTCN + Graph-Structured Knowledge        63.8     65.0   63.9   64.5   37.7   28.9   34.7   32.0   50.0   47.6
   Cross Genre:
   BERTCN                                     64.8     65.1   65.4   64.3   37.1   28.0   36.7   33.7   51.0   47.8
   Human Performance⋆                         96.0     93.3   98.0   98.7    –      –      –      –      –      –

Table 6: Performance of baseline and models absorbing linguistic knowledge and general world knowledge in
accuracy (%) on the C3 dataset (⋆: performance based on a subset of test and development sets of C3 ).
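The truncation rule from Section 5.1 (when the input sequence exceeds 512 tokens, truncate the longest of d, q, and oi) can be sketched as below; the token-by-token removal loop is one possible reading of that rule, not the authors' released code:

```python
def truncate_to_fit(d_tokens, q_tokens, o_tokens, max_len=512, n_special=4):
    """Iteratively drop the trailing token of the longest segment until
    [CLS] d [SEP] q [SEP] o [SEP] fits within max_len tokens.
    n_special = 4 accounts for the [CLS] token and the three [SEP] tokens.
    """
    d, q, o = list(d_tokens), list(q_tokens), list(o_tokens)
    while len(d) + len(q) + len(o) + n_special > max_len:
        longest = max((d, q, o), key=len)
        longest.pop()  # remove the last token of the currently longest segment
    return d, q, o

d, q, o = truncate_to_fit(["w"] * 600, ["q"] * 20, ["o"] * 5)
print(len(d), len(q), len(o))  # → 483 20 5
```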

                        BERTCN            Human
                        C3-1A | C3-1B     C3-1A | C3-1B
   Matching             100.0 | 74.1      100.0 | 100.0
   Prior knowledge      57.6 | 60.2       95.7 | 97.6
   Single sentence      65.8 | 79.4       97.0 | 97.0
   Multiple sentences   55.6 | 57.8       94.0 | 98.0

Table 7: Performance comparison in accuracy (%) based on the coarse-grained question categories.

trained on the combination of C3-1A and C3-1B consistently outperforms the same model trained solely on C3-1A or C3-1B. We have a similar observation on C3-2.

6 Conclusion

We present the first collection of Challenging Chinese multiple-choice machine reading Comprehension datasets (C3) collected from real-world exams, requiring linguistic, general, or domain-specific knowledge to answer questions based on the given oral or written text. We study the prior knowledge needed in these challenging reading comprehension tasks and further explore how to utilize linguistic, general world, and domain-specific knowledge to improve the comprehension ability of machine readers through fine-tuning BERT. Experimental results show that linguistic and general world knowledge may help the baseline reader perform better in both general and domain-specific reading comprehension tasks.

References

Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. 2016. Embracing data abundance: Booktest dataset for reading comprehension. CoRR, cs.CL/1610.00956v1.

Gong Cheng, Weixi Zhu, Ziwei Wang, Jianghui Chen, and Yuzhong Qu. 2016. Taking up the gaokao challenge: An information retrieval approach. In Proceedings of the IJCAI, pages 2479–2485, New York City, NY.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the EMNLP, pages 2174–2184, Brussels, Belgium.

Zewei Chu, Hai Wang, Kevin Gimpel, and David McAllester. 2017. Broad context language modeling as reading comprehension. In Proceedings of the EACL, pages 52–57, Valencia, Spain.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, cs.CL/1803.05457v1.

Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter D Turney, and Daniel Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In Proceedings of the AAAI, pages 2580–2586, Phoenix, AZ.

Yiming Cui, Ting Liu, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. 2018a. Dataset for the first evaluation on Chinese machine reading comprehension. In Proceedings of the LREC, pages 2721–2725, Miyazaki, Japan.

Yiming Cui, Ting Liu, Zhipeng Chen, Shijin Wang, and Guoping Hu. 2016. Consensus attention-based neural networks for Chinese reading comprehension. In Proceedings of the COLING, pages 1777–1786, Osaka, Japan.

Yiming Cui, Ting Liu, Li Xiao, Zhipeng Chen, Wentao Ma, Wanxiang Che, Shijin Wang, and Guoping Hu. 2018b. A span-extraction dataset for Chinese machine reading comprehension. CoRR, cs.CL/1810.07366v1.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN.

Song Feng, Jun Seok Kang, Polina Kuznetsova, and Yejin Choi. 2013. Connotation lexicon: A dash of sentiment beneath the surface meaning. In Proceedings of the ACL, pages 1774–1784, Sofia, Bulgaria.

Ralph Grishman, Lynette Hirschman, and Carol Friedman. 1983. Isolating domain dependencies in natural language interfaces. In Proceedings of the ANLP, pages 46–53.

Shangmin Guo, Kang Liu, Shizhu He, Cao Liu, Jun Zhao, and Zhuoyu Wei. 2017a. IJCNLP-2017 Task 5: Multi-choice question answering in examinations. In Proceedings of the IJCNLP 2017, Shared Tasks, pages 34–40, Taipei, Taiwan.

Shangmin Guo, Xiangrong Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2017b. Which is the effective way for gaokao: Information retrieval or neural networks? In Proceedings of the EACL, pages 111–120, Valencia, Spain.

Steffen Leo Hansen. 1994. Reasoning with a domain model. In Proceedings of the NODALIDA, pages 111–121.

Yu Hao, Xien Liu, Ji Wu, and Ping Lv. 2019. Exploiting sentence embedding for medical question answering. In Proceedings of the AAAI, Honolulu, HI.

Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, et al. 2017. DuReader: A Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the MRQA, pages 37–46, Melbourne, Australia.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the NIPS, pages 1693–1701, Montreal, Canada.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks principle: Reading children's books with explicit memory representations. In Proceedings of the ICLR, Caribe Hilton, Puerto Rico.

Rei Ikuta, Will Styler, Mariah Hamang, Tim O'Gorman, and Martha Palmer. 2014. Challenges of adding causation to richer event descriptions. In Proceedings of the Second Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 12–20, Baltimore, MD.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, cs.CL/1705.03551v2.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the NAACL-HLT, pages 252–262, New Orleans, LA.

Tomáš Kočiskỳ, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association of Computational Linguistics, 6:317–328.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the EMNLP, pages 785–794, Copenhagen, Denmark.

Douglas B Lenat, Mayank Prakash, and Mary Shepherd. 1985. Cyc: Using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks. AI Magazine, 6(4):65–65.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the CoNLL, pages 333–342, Vancouver, Canada.

R Lewis, RWP Luk, and ABY Ng. 1998. Computer-assisted learning of Chinese idioms. Journal of Computer Assisted Learning, 14(1):2–18.

Peter LoBue and Alexander Yates. 2011. Types of common-sense knowledge needed for recognizing textual entailment. In Proceedings of the ACL, pages 329–334, Portland, OR.

Shangbang Long, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. 2018. Automatic judgment prediction via legal reading comprehension. CoRR, cs.AI/1809.06537v1.

Michael McCandless, Erik Hatcher, and Otis Gospodnetic. 2010. Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Manning Publications Co., Greenwich, CT.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the EMNLP, pages 2381–2391, Brussels, Belgium.

George Miller. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. In Proceedings of the NAACL-HLT, pages 839–849, San Diego, CA.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. CoRR, cs.CL/1611.09268v2.

Tim O'Gorman, Kristin Wright-Bettner, and Martha Palmer. 2016. Richer event description: Integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the CNS, pages 47–56, Austin, TX.

Simon Ostermann, Michael Roth, Ashutosh Modi, Stefan Thater, and Manfred Pinkal. 2018. SemEval-2018 Task 11: Machine comprehension using commonsense knowledge. In Proceedings of the SemEval, pages 747–757, New Orleans, LA.

Hoifung Poon, Janara Christensen, Pedro Domingos, Oren Etzioni, Raphael Hoffmann, Chloe Kiddon, Thomas Lin, Xiao Ling, Alan Ritter, Stefan Schoenmackers, et al. 2010. Machine reading at the University of Washington. In Proceedings of the NAACL-HLT FAM-LbR, pages 87–95, Los Angeles, CA.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Preprint.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Preprint.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the ACL, pages 784–789, Melbourne, Australia.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the EMNLP, pages 2383–2392, Austin, TX.

Siva Reddy, Danqi Chen, and Christopher D Manning. 2018. CoQA: A conversational question answering challenge. CoRR, cs.CL/1808.07042v1.

Matthew Richardson, Christopher JC Burges, and Erin Renshaw. 2013. MCTest: A challenge
dataset for the open-domain machine comprehension of text. In Proceedings of the EMNLP, pages 193–203, Seattle, WA.

Lenhart Schubert. 2002. Can we derive general world knowledge from texts? In Proceedings of the HLT, pages 94–97, San Diego, CA.

Chih Chieh Shao, Trois Liu, Yuting Lai, Yiying Tseng, and Sam Tsai. 2018. DRCD: A Chinese machine reading comprehension dataset. CoRR, cs.CL/1806.00920v2.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI, pages 4444–4451, San Francisco, CA.

Shane Storks, Qiaozi Gao, and Joyce Y Chai. 2019. Commonsense reasoning for natural language understanding: A survey of benchmarks, resources, and approaches. CoRR, cs.CL/1904.01172v1.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019a. DREAM: A challenge dataset and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019b. Improving machine reading comprehension with general reading strategies. In Proceedings of the NAACL-HLT, Minneapolis, MN.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the NAACL-HLT, Minneapolis, MN.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the RepL4NLP, pages 191–200, Vancouver, Canada.

Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. We usually don't like going to the dentist: Using common sense to detect irony on Twitter. Computational Linguistics, 44(4):793–

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association of Computational Linguistics, 6:287–302.

Morton E Winston, Roger Chaffin, and Douglas Herrmann. 1987. A taxonomy of part-whole relations. Cognitive Science, 11(4):417–444.

Lung-Hsiang Wong, Chee-Kuen Chin, Chee-Lay Tan, and May Liu. 2010. Students' personal and social meaning making in a Chinese idiom mobile learning environment. Journal of Educational Technology & Society, 13(4):15–26.

Chunsheng Yang and Ying Xie. 2013. Learning Chinese idioms through iPads. Language Learning & Technology, 17(2):12–23.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the EMNLP, pages 2369–2380, Brussels, Belgium.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018a. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, cs.CL/1810.12885v1.

Xiao Zhang, Ji Wu, Zhiyang He, Xien Liu, and Ying Su. 2018b. Medical exam question answering with large-scale reading comprehension. In Proceedings of the AAAI, pages 5706–5713, New Orleans, LA.

Zhuosheng Zhang and Hai Zhao. 2018. One-shot learning for question-answering in gaokao history challenge. In Proceedings of the COLING, pages 449–461, Santa Fe, NM.
  832.
A     Appendices

女 : 怎么样?买到票了吗?
男 : 火车站好多人啊,我排了整整一天的队,等排到我了,他们说没票了,要等过了年才有。
女 : 没关系,过了年回来也是一样的。
男 : 公司初六就上班了,我怕过了年来不及。要不今年您和我爸来上海过年吧?
女 : 我这老胳膊老腿的不想折腾了。
男 : 一点儿不折腾,等我帮你们买好票,你们直接过来就行。
Q1   说话人是什么关系?
A.   父女
B.   母子⋆
C.   同学
D.   同事
Q2   男的遇到了什么困难?
A.   公司不放假
B.   过年东西贵
C.   没买到车票⋆
D.   找不到车站
Q3   男的提出了什么建议?
A.   让女的来上海⋆
B.   明天再去排队
C.   早点儿去公司
D.   过了年再回家

Table 8: A sample problem from C3-1B (⋆: the correct answer option).

Sample Problem 1:
努力赚钱
Q    [MotivatedByGoal→]
A.   增广见识
B.   打开电视机
C.   刚搬家
D.   美好的生活⋆
Sample Problem 2:
美好的生活
Q    [MotivatedByGoal←]
A.   想偷懒的时候
B.   努力赚钱⋆
C.   抢妹妹的食物
D.   龟起来

Table 10: Examples of generated problems based on general relational knowledge (⋆: the correct answer option).
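The generation idea behind Table 10 can be sketched as follows: given ConceptNet-style triples, the cue concept and a directed relation determine the correct option, and distractors are drawn from concepts in unrelated triples. The triples and helper below are hypothetical stand-ins for illustration only; the actual C3 generation pipeline described in the paper body may differ in its filtering and sampling details.

```python
import random

# Hypothetical ConceptNet-style triples (head, relation, tail); these are
# illustrative stand-ins, not the paper's actual data.
TRIPLES = [
    ("努力赚钱", "MotivatedByGoal", "美好的生活"),  # work hard to earn money -> a good life
    ("增广见识", "MotivatedByGoal", "旅行"),        # broaden one's horizons -> travel
    ("打开电视机", "MotivatedByGoal", "看新闻"),    # turn on the TV -> watch the news
    ("刚搬家", "MotivatedByGoal", "换了工作"),      # just moved house -> changed jobs
    ("早睡早起", "MotivatedByGoal", "保持健康"),    # early to bed and rise -> stay healthy
]

def make_problem(cue, relation, direction, triples, n_options=4, seed=0):
    """Build one multiple-choice item: the correct option is the concept
    linked to `cue` by `relation` (following the arrow direction); the
    distractors are concepts drawn from other triples."""
    rng = random.Random(seed)
    if direction == "→":   # cue is the head; the answer is a tail concept
        answers = [t for h, r, t in triples if h == cue and r == relation]
        pool = [t for h, r, t in triples if h != cue]
    else:                  # "←": cue is the tail; the answer is a head concept
        answers = [h for h, r, t in triples if t == cue and r == relation]
        pool = [h for h, r, t in triples if t != cue]
    answer = answers[0]
    distractors = rng.sample([c for c in pool if c != answer], n_options - 1)
    options = distractors + [answer]
    rng.shuffle(options)
    return {"cue": cue, "question": f"[{relation}{direction}]",
            "options": options, "answer": answer}

problem = make_problem("努力赚钱", "MotivatedByGoal", "→", TRIPLES)
```

Reversing the arrow (as in Sample Problem 2) swaps the roles of head and tail, so the same triple yields two problems with the cue and answer exchanged.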

Sample Problem 1:
古谚语,意思是为将者善战,其士卒亦必勇敢无前。亦比喻凡事为首者倡导于前,则其众必起而效之。
A.   一人向隅,满坐不乐
B.   一人善射,百夫决拾⋆
C.   一人得道,鸡犬升天
D.   一人传虚,万人传实
Sample Problem 2:
龙跳虎伏
A.   犹言龙腾虎卧。比喻笔势。⋆
B.   比喻高举远逝。
C.   比喻文笔、书法纵逸雄劲。
D.   比喻超逸雄奇。

Table 9: Examples of generated Chinese proverb/idiom problems (⋆: the correct answer option).

对求助者形成初步印象的工作程序包括( )。
A.   对求助者心理健康水平进行衡量⋆
B.   对求助者心理问题的原因作解释
C.   对求助者的问题进行量化的评估⋆
D.   对某些含混的临床表现作出鉴别⋆

Table 11: An example of problems on core knowledge of psychological counseling (⋆: the correct answer option).
General information: a female, at the age of 24, unmarried, a cashier.
   The help seeker’s self-narration: In the past two years, I have always felt everything is dirty, especially money. For this
   sake, I wash my hands so frequently that the skin on them has peeled off. Even so, I do not feel at ease. I know that this
   is not good for me but I cannot help doing so.
   Case Introduction: Ten years ago, the help seeker went to hospital to pay a visit to her classmate. After she went home,
   she ate an apple without washing her hands and was scolded by her parents after being noticed by them. They warned
   her that she would fall ill if she eats things without washing her hands. For this sake, she was worried and suffered from
   insomnia for two days. Since then, whenever she has come home from school, she has remembered to wash her hands
   earnestly. Little by little, this episode has gone by. Two years ago, she became a cashier dealing with money every day.
   She always believes that money is dirty for it is covered with numerous bacteria. So she washes her hands repeatedly
   after work. In spite of being clearly aware that her hands are quite clean, she is still unable to control herself
   and is always afraid that they “might not have been washed clean”. Much time has been spent in her perplexity. She is so
   vexed that she has been unable to go to work recently. A month ago, she began to worry that she was likely to suffer
   from a mental disorder, which has made her dispirited and sleepless. As a result, she often suffers from a headache and
   frequently sees doctors. However, she is unwilling to take the medicine prescribed by doctors because of many side
   effects listed in the instructions. Her parents think that she does not have a mental illness and accompany her to seek
   psychological counseling.
   According to the parents of the help seeker, she is the only child in her family and her parents are strict with her. During
   her childhood, she was not permitted to go out to play by herself. Since her parents were busy, she was sent to her
   grandmother’s home to be taken care of. Her grandmother tightly protected her and did not allow her to play with
   other little friends. Every day, she must have meals and go to bed on time. Later, she was at school. She was very
   meticulous in her study and her academic results were excellent. Yet her relationships with her classmates were ordinary.
   She listened to her parents and was careful about what she did. She was a little cowardly. After failing to pass the postgraduate
   entrance examination after her graduation from college, she once suffered from depression and insomnia. Fortunately,
   everything has gone well with her since she was employed. As a cashier, she has to deal with money every day and
   always thinks that money is dirty. Consequently, she has washed her hands more and more frequently. Recently, she
   has been so depressed that she is unable to work.
   Q1 The main symptoms of the help seeker are ( )
   A. fear B. depression⋆ C. obsession ⋆ D. compulsive behavior ⋆
   Q2 Which of the following symptoms does not happen to the help seeker ( )
   A. palpitation B. insomnia           C. headache D. nausea⋆
   Q3 The main behavioral symptoms of the help seeker are ( )
   A. repeated behavior⋆ B. repeated counseling C. repeated examination⋆ D. repeated weeping
   Q4 Which of the following behavioral symptoms does not happen to the help seeker ( )
   A. worry B. depression C. frequent headaches D. nervousness and fear⋆
   Q5 The course of disease on the part of the help seeker is ( )
   A. one month B. two years⋆ C. three months D. ten years
   Q6 The psychological characteristics of the help seeker include ( )
   A. strict family education B. cowardliness and overcaution ⋆
   C. failure in passing the postgraduate entrance examination D. headaches and insomnia
   Q7 The reasons for judging whether the help seeker has a normal mentality are (      )
   A. whether she has self-consciousness ⋆ B. see doctors of her own accord⋆
   C. severity of symptom D. impairment of social function
   Q8 The causes of the help seeker’s psychological problem exclude ( )
   A. personality factors B. cognitive factors C. examination stress⋆ D. stress imposed by her parents⋆
   Q9 The causes of the formation of the help seeker’s personality may be ( )
   A. parental control⋆ B. tight protection given by her grandmother⋆
   C. work environment D. failure in passing the postgraduate entrance examination
   Q10 The help seeker’s characteristics of personality include ( )
   A. excessive demands on herself⋆ B. excessive pursuit of perfection⋆
   C. excessive sentimentality D. excessively self-righteous
   Q11 The crux of the psychological problem on the part of the help seeker lies in ( )
   A. self-inferiority and parental criticism B. fear and being afraid of falling ill⋆
   C. anxiety and excessive demands on herself D. depression and the loss of work
   Q12 The life events affecting the help seeker include ( )
   A. separation from her parents during her childhood⋆
   B. being unable to play by herself when she was a child
   C. insomnia caused by her failure in passing the postgraduate entrance examination
   D. being scolded for eating things without washing her hands⋆

Table 12: English translation of a sample problem (Part 1) from C3 -2A (⋆: the correct answer option). We show
the original problem in Chinese in Table 14.
Q13 The grounds for the help seeker’s mental disease exclude ( )
   A. free of depressive symptom B. free of hallucination ⋆
   C. free of delusional disorder⋆ D. free of thinking disorder
   Q14 A definitive diagnosis for the help seeker requires a further understanding of the following
   materials ( )
   A. her body conditions      B. characteristics of her inner world⋆
   C. her economic conditions D. her inter-personal communication⋆
   Q15 While offering the counseling service to the help seeker, what a counselor should pay attention to are ( )
   A. job change B. behavior change⋆ C. mood change⋆ D. cognitive change ⋆

Table 12: English translation of a sample problem (Part 2) from C3 -2A (⋆: the correct answer option). We show
the original problem in Chinese in Table 14.
General information: a help seeker, male, at the age of 24, graduation from a university, waiting for employment.
   Case introduction: Two years have passed since the help seeker graduated from a university. Coerced by his parents,
   he once went to a job fair. However, shortly after he arrived, he left without saying a word.
   He confesses that he is not good at expressing himself. Seeing that other people are skilled at promoting themselves,
   he feels that he is inferior to others in this regard. Because of the lack of work experience, he is afraid that he cannot be
   competent for a job. Therefore, he always has no self-confidence, only to stay at home.
   The information of the help seeker observed and understood by a psychological counselor includes introverted person-
   ality, poor independence, no interest in making friends and low mood. The following is a section of talk between the
   psychological counselor and the help seeker:
   Psychological Counselor: What aspect would you like to receive my help?
   Help Seeker: I am afraid to go to a job fair.
   Psychological Counselor: Have you been there?
   Help Seeker: Yes, I have. But I only wandered around before going away.
   Psychological Counselor: You only wandered about without saying anything. Can you be called a candidate?
   Help Seeker: I’m afraid to tell my intention in case I might be declined. What’s more, I’m worried that I am not
   competent at a job.
   Psychological Counselor: Just now, you’ve said that you are eager to land a job but now you tell me that you have
   participated in almost no job fair. There seems to be a contradiction. Can you explain it?
   Help Seeker: Almost two years have passed since I graduated from university. Yet I have not landed a job. I am really
   anxious. But I really have no idea how I can prepare for a job fair.
   Psychological Counselor: As a university graduate, you’d better consider how to attend a job fair.
   Help Seeker: (keeps silent for a moment) I have to go to a job fair and tell employers that I need a job and I must
   discuss with them about work nature, conditions, wage and so on.
   Psychological Counselor: For what reasons have you not done these?
   Help Seeker: I’m afraid that they may not accept me if I go there.
   Psychological Counselor: You mean that as long as you go there, employers will be scrambling to employ you. In
   other words, if you go to the State Council to apply for a job, you will make it; if you go to the municipal government
   to seek a job, you will still realize your intention and if you go to an enterprise to apply for a post, you will succeed all
   the same.
   Help Seeker: Ah. . . it seems that I don’t think so (shaking his head).
   Psychological Counselor: Since time is limited, so much for this counseling. Please go home to think it over. Next
   time, we can continue to discuss this issue.
   Q1 The major causes that trigger the help seeker’s psychological problem include ( )
   A. personality factors⋆ B. inter-personal stress C. cognitive factors⋆ D. economic pressures
   Q2 At the beginning of the counseling, the approaches to asking questions adopted by the psychological counselor are
   ()
   A. to question intensely B. to ask open-ended questions⋆
   C. to ask indirectly about something⋆ D. to ask closed questions
   Q3 The psychological counselor says, “You only wandered about without saying anything. Can you be called a
   candidate? ” He indicates his ( ) attitude to the help seeker.
   A. reprobation B. query⋆ C. enlightenment D. encouragement
   Q4 By saying “There seems to be a contradiction. Can you explain it? ” the psychological counselor adopts the
   following tactic ( )
   A. guidance B. confrontation⋆ C. encouragement D. explanation
   Q5 The major types of the silence phenomenon do not include ( )
   A. suspicion type B. intellectual type⋆ C. vacant type D. resistant type
   Q6 “I’m afraid that they may not accept me if I go there.” This sentence reflects the help seeker’s ( )
   A. pessimistic attitude B. overgeneralization C. absolute requirement D. extreme awfulness⋆
   Q7 The psychological counselor says, “You mean that as long as you go there, ...” What tactic is employed in this
   paragraph?
   A. Aristotle’s sophistry B. “mid-wife” argumentation ⋆
   C. technique of rational-emotion imagination D. rational analysis report
   The help seeker is the same person in the passage above.
   Here is the section of the second talk between the psychological counselor and the help seeker:
   Psychological Counselor: From the perspective of psychology, what triggers your emotional reaction is not some
   events that have happened externally but your opinions of these events. It is necessary to change your emotion instead
   of external events. That is to say, it is necessary to change your opinions and comment on these events. In fact, people
   have their opinions of things. Some of their opinions are reasonable while others are not reasonable. Different opinions
   may lead to different emotional results. If you have realized that your present emotional state has been caused by some
   unreasonable ideas in your mind, perhaps you are likely to control your emotion.

Table 13: Translation of a sample problem (Part 1) from C3 -2B (⋆: the correct answer option). We show the
original problem in Chinese in Table 15.