TEXSMART: A TEXT UNDERSTANDING SYSTEM FOR FINE-GRAINED NER AND ENHANCED SEMANTIC ANALYSIS
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
T EX S MART: A T EXT U NDERSTANDING S YSTEM FOR
F INE -G RAINED NER AND E NHANCED S EMANTIC A NALYSIS
T ECHNICAL R EPORT
Haisong Zhang, Lemao Liu, Haiyun Jiang, Yangming Li, Enbo Zhao,
Kun Xu, Linfeng Song, Suncong Zheng, Botong Zhou, Jianchen Zhu, Xiao Feng,
arXiv:2012.15639v1 [cs.CL] 31 Dec 2020
Tao Chen, Tao Yang, Dong Yu, Feng Zhang, Zhanhui Kang, Shuming Shi∗
Tencent AI
{texsmart, hansonzhang, redmondliu, shumingshi}@tencent.com
January 1, 2021
A BSTRACT
This technique report introduces TexSmart, a text understanding system that supports fine-grained
named entity recognition (NER) and enhanced semantic analysis functionalities. Compared to most
previous publicly available text understanding systems and tools, TexSmart holds some unique
features. First, the NER function of TexSmart supports over 1,000 entity types, while most other
public tools typically support several to (at most) dozens of entity types. Second, TexSmart introduces
new semantic analysis functions like semantic expansion and deep semantic representation, that are
absent in most previous systems. Third, a spectrum of algorithms (from very fast algorithms to those
that are relatively slow but more accurate) are implemented for one function in TexSmart, to fulfill
the requirements of different academic and industrial applications. The adoption of unsupervised or
weakly-supervised algorithms is especially emphasized, with the goal of easily updating our models
to include fresh data with less human annotation efforts.
The main contents of this report include major functions of TexSmart, algorithms for achieving
these functions, how to use the TexSmart toolkit and Web APIs, and evaluation results of some key
algorithms.
Keywords NLP toolkit · NLP API · Text understanding · fine-grained NER · Semantic analysis
1 Introduction
The long-term goal of natural language processing (NLP) is to help computers understand natural language as well
as we do, which is one of the most fundamental and representative challenges for artificial intelligence. Natural
language understanding includes a broad variety of tasks covering lexical analysis, syntactic analysis and semantic
analysis. In this report we introduce TexSmart2 , a new text understanding system that provides enhanced named entity
recognition (NER) and semantic analysis functionalities besides standard NLP modules. Compared to most previous
publicly-available text understanding systems [1, 2, 3, 4, 5, 6], TexSmart holds the following key characteristics:
• Fine-grained named entity recognition (NER)
• Enhanced semantic analysis
• A spectrum of algorithms implemented for one function, to fulfill the requirements of different academic and
industrial applications
∗
Project lead and chief architect
2
The English home-page of TexSmart is https://texsmart.qq.com/enT ECHNICAL R EPORT - JANUARY 1, 2021
(a) (b) Semantic Expansion
No. Entity Type ID No. Entity Type ID Semantics
{"related":["Batman","Superman","Wonder Woman”,
1 Marvel person 1 Captain Marvel work.movie "Green Lantern”,"the flash”,"Aquaman","Spider-Man",
"Green Arrow”,"Supergirl","Captain America"]}
{“related":["Toronto","Montreal","Vancouver","Ottawa",
2 Los Angeles location 2 Los Angeles loc.city
"Calgary","London","Paris","Chicago","Edmonton","Boston"]}
3 22 months ago time 3 22 months ago time.generic {“value":[2019,2]}
Fine-grained NER Deep Semantic Expression
Figure 1: Comparison between the NER results of a traditional text understanding system (in (a)) and the fine-grained
NER and semantic analysis results provided by TexSmart (in (b)). The input sentence is “Captain Marvel was premiered
in Los Angeles 22 months ago.” TexSmart successfully assigns fine-grained type “work.movie” to “Captain Marvel”,
while most traditional toolkits fail to do that. In addition, TexSmart gives more rich semantic analysis results for the
input sentence.
First, the fine-grained NER function of TexSmart supports over 1,000 entity types while most previous text understanding
systems typically support several to (at most) dozens of coarse entity types (among which the most popular types are
people, locations, and organizations). Large-scale fine-grained entity types are expected to provide richer semantic
information for downstream NLP applications. Figure 1 shows a comparison between the NER results of a previous
system and the fine-grained NER results of TexSmart. It is shown that TexSmart recognizes more entity types (e.g.,
work.movie) and finer-grained ones (e.g., loc.city vs. the general location type). Examples of entity types (and their
important sub-types) which TexSmart is able to recognize include people, locations, organizations, products, brands,
creative work, time, numerical values, living creatures, food, drugs, diseases, academic disciplines, languages, celestial
bodies, organs, events, activities, colors, etc.
Second, TexSmart provides two advanced semantic analysis functionalities: semantic expansion, and deep semantic
representation for a few entity types. These two functions are not available in most previous public text understanding
systems. Semantic expansion suggests a list of related entities for an entity in the input sentence (as shown in Figure 1).
It provides more information about the semantic meaning of an entity. Semantic expansion could also benefit upper-layer
applications like web search (e.g., for query suggestion) and recommendation systems. For time and quantity entities,
in addition to recognizing them from a sentence, TexSmart also tries to parse them into deep representations (as shown
in Figure 1). This kind of deep representations is essential for some NLP applications. For example, when a chatbot is
processing query “please book an air ticket to London at 4 pm the day after tomorrow”, it needs to know the exact time
represented by “4 pm the day after tomorrow”.
Third, a spectrum of algorithms is implemented for one task (e.g., part-of-speech tagging and NER) in TexSmart, to
fulfill the requirements of different academic and industrial applications. On one side of the spectrum are the algorithms
that are very fast but not necessarily the best in accuracy. On the opposite side are those that are relatively slow
yet delivering state-of-the-art performance in terms of accuracy. Different application scenarios may have different
requirements for efficiency and accuracy. Unfortunately, it is often very difficult or even impossible for a single
algorithm to achieve the best in both speed and accuracy at the same time. With multiple algorithms implemented for
one task, we have more chances to better fulfill the requirements of more applications.
One design principle of TexSmart is to put a lot of efforts into designing and implementing unsupervised or weakly-
supervised algorithms for a task, based on large-scale structured, semi-structured, or unstructured data. The goal is to
update our models easier to include fresh data with less human annotation efforts.
The rest part of this report is organized as follows: Major tasks and algorithms supported by TexSmart are described in
Section 2. Then we demonstrate in Section 3 how to use the TexSmart toolkit and Web APIs. Finally, evaluation results
of some algorithms on several key tasks are reported.
2 System Modules
Compared to most other public text understanding systems, TexSmart supports three unique modules, i.e., fine-grained
NER, semantic expansion and deep semantic representation. Besides, traditional tasks supported by both TexSmart
and many other systems include word segmentation, part-of-speech (POS) tagging, coarse-grained NER, constituency
parsing, semantic role labeling, text classification and text matching. Below we first introduce the unique modules,
followed by the traditional tasks.
2T ECHNICAL R EPORT - JANUARY 1, 2021
2.1 Key Modules
Since the implementation of fine-grained NER depends on semantic expansion, we first present semantic expansion,
then fine-grained NER, and finally deep semantic representation.
2.1.1 Semantic Expansion
Given an entity within a sentence, the semantic expansion module suggests a list of entities related to the given entity.
For example in Figure 1, the suggestion results for “Captain Marvel” include “Spider-Man”, “Captain America”,
and other related movies. Semantic expansion attaches additional information to an entity mention, which could be
leveraged by upper-layer applications for better understanding the entity and the source sentence. Possible applications
of the expansion results include web search (e.g., for query suggestion) and recommendation systems.
Offline Training Online Testing
(apple, fruit) C1:({apple, banana, “Apple juice”
fruits such as
(banana, fruit) peach, …}, fruit)
apple and banana Extraction Clustering Retrieval
(apple, company) C2:({apple, google C1, C2
Apple, Google, Microsoft
and other companies (google, company) microsoft,…}, Disambiguation
…… (microsoft, company) company)
…… …… C1
{apple, banana, peach, …}
Figure 2: Key steps for semantic expansion: extraction, clustering, retrieval and disambiguation. The first two steps are
conducted offline and the last two are performed online.
Semantic expansion task was firstly introduced in [7], and it was addressed by a neural method. However, this method
is not as efficient as one expected for some industrial applications. Therefore, we propose a light-weight alternative
approach in TexSmart for this task.
This approach includes two offline steps and two online ones, as illustrated in Figure 2. During the offline procedure,
Hearst patterns are first applied to a large-scale text corpus to obtain a is-a map (or called a hyponym-to-hypernym map)
[8, 9]. Then a clustering algorithm is employed to build a collection of term clusters from all the hyponyms, allowing
a hyponym to belong to multiple clusters. Each term cluster is labeled by one or more hypernyms (or called type
names). Term similarity scores used in the clustering algorithm are calculated by a combination of word embedding,
distributional similarity, and pattern-based methods [10, 11, 12].
During the online testing time, clusters containing the target entity mention are first retrieved by referring to the cluster
collection. Generally, there may be multiple (ambiguous) clusters containing the target entity mention and thus it is
necessary to pick the best cluster through disambiguation. Once the best cluster is chosen, its members (or instances)
can be returned as the expansion results.
Now the core challenge is how to calculate the score of a cluster given an entity mention. We choose to compute the
score as the average similarity score between a term in the cluster and a term in the context of the entity mention.
Formally, suppose e is a mention in a sentence, context C = {c1 , c2 , · · · , cm } is a window of e within the sentence, and
L = {e1 , e2 , · · · , en } is a term cluster containing the entity mention (i.e., e ∈ L). The cluster score is then calculated
below:
1 X
sim(C, L; e) = cos(vx , wy ) (1)
(m − 1) × (n − 1)
x∈C\{e},y∈L\{e}
where C \ {e} means excluding a subset {e} from a set C, vx denotes the input word embedding of x, wy denotes the
output word embedding of y from a well-trained word embedding model, and cos is the cosine similarity function.
It is worth mentioning that although it takes considerable time for clustering hyponyms during offline time because
there are millions of hyponyms, this procedure can be conducted offline for one single time. The online steps can be
performed very efficiently.
2.1.2 Fine-Grained NER
Generally, it is challenging to build a fine-grained NER system. Ref [13] creates a fine-grained NER dataset for Chinese,
but the number of its types is less than 20. A knowledge base (such as Freebase [14]) is utilized in [15] as distant
supervision to obtain a training dataset for fine-grained NER. However, this dataset only includes about one hundred
types whereas TexSmart supports up to one thousand types. Moreover, the fine-grained NER module in TexSmart does
3T ECHNICAL R EPORT - JANUARY 1, 2021
loc.generic
loc.geo
loc.geo.district loc.natural_geo
loc.geo.populated_place land_form.mountain land_form.peak land_form.plateau
loc.admin_division loc.country_region
loc.county_level loc.city loc.state_in_country loc.town
Figure 3: A sub-tree of the TexSmart ontology, with “loc.generic” as the root
not rely on any knowledge bases and thus can be readily extended to other languages for which there is no knowledge
base available.
Ontology To establish fine-grained NER in TexSmart, we need to define an ontology of entity types. The TexSmart
ontology was built in a semi-automatic way, based on the term clusters in Figure 2. Please note that each term cluster
is labeled by one or more hypernyms as type names of the cluster. We first conduct a simple statistics over the term
clusters to get a list of popular type names (i.e., those having a lot of corresponding term clusters). Then we manually
create one or more formal types from one popular type name and add the type name to the name list of the formal types.
For example, formal type “work.movie” is manually built from type name “movie”, and the word “movie” is added to
the name list of “work.movie”. As another example, formal types “language.human_lang” and “language.programming”
are manually built from type name “language”, and the word “language” is added to the name lists of both the two
formal types. Each formal type is also assigned with a sample instance list in addition to a name list. Instances can
be chosen manually from the clusters corresponding to the names of the formal type. To reduce manual efforts, the
sample instance list for every type is often quite short. The supertype/subtype relation between the formal types are
also specified manually. As a result, we obtain a type hierarchy containing about 1,000 formal types, each assigned
with a standard id (e.g., work.movie), a list of names (e.g., “movie” and “film”), and a short list of example instances
(e.g., “Star Wars”). The TexSmart ontology is available on the download page3 . Figure 3 shows a sub-tree (with type id
“loc.generic” as the root) sampled from the entire ontology.
Unsupervised method The unsupervised fine-grained NER method works in two steps. First, run the semantic
expansion algorithm (referring to the previous subsection) to get the best cluster for the entity mention. Second, derive
an entity type from the cluster.
For the best cluster obtained in the first step, it contains a list of terms as instances and is also labeled with a list of
hypernyms (or type names). The final entity type id for the cluster is determined by a type scoring algorithm. The
candidate types are those in the TexSmart ontology whose name lists contain at least one hypernym of the cluster.
Please note that each entity type in the TexSmart ontology has been assigned with a name list and a sample instance
list. Therefore the score of a candidate entity type can be calculated according to the information of the entity type and
cluster.
This unsupervised method has a major drawback: It cannot recognize unknown entity mentions (i.e., entity mentions
that are not in any of our term clusters).
Hybrid method In order to address the above issue, we propose a hybrid method for fine-grained NER. Its key idea is
to combine the results of the unsupervised method and those of a coarse-grained NER model. We train a coarse-grained
NER model in a supervised manner using an off-the-shelf training dataset (for example, Ontonotes dataset [16]). Given
the supervised and unsupervised results, the combination policy is as follows: If the fine-grained type is compatible
with the coarse type, the fine-grained type is returned; otherwise the coarse type is chosen. Two types are compatible if
one is the subtype of the other.
For example, assume that the entity mention “apple” in the sentence “...apple juice...” is determined as “food.fruit”
by the unsupervised method and “food.generic” by the supervised model. The hybrid approach returns “food.fruit”
according to the above policy. However, if the unsupervised method returns “org.company”, the hybrid approach will
return “food.generic” because the two types returned by the supervised method and the unsupervised method are not
compatible.
3
https://ai.tencent.com/ailab/nlp/texsmart/en/download.html
4T ECHNICAL R EPORT - JANUARY 1, 2021
Parse
Algorithm: DNN
!
NNP NNP VBD VBN IN NNP NNP CD
NNS RB .
NNP VBD VBN IN NNP CD NNS RB
.
Figure 4: An example of English word segmentation and part-of-speech tagging.
Algorithm: Fine-grained
!
DNN
!
NN NN NN VC NN NN NN CC JJ NN NN
No Entity Type [ID, Name, Flag] Semantics
LC DEG CD M JJ NN PU
{"related":["Batman","Superman","Wonder
Captain Woman","Green Lantern","the
1 work.movie, film, 1
Marvel flash","Aquaman","Spider-Man","Green
NN VC NN NN CC NN NN LC DEG
Arrow","Supergirl","Captain America"]}
CD M JJ NN PU
{"related":
["Toronto","Montreal","Vancouver","Ottawa","
2 Los Angeles loc.city, city, 1
Figure 5: An example of Chinese word segmentation and part-of-speech tagging. Calgary","London","Paris","Chicago","Edmont
on","Boston"]}
14 months
!
3 time.generic, time, 0 {"value":[2019,10]}
ago
2.1.3 Deep Semantic Representation
For a time or quantity entity within a sentence, TexSmart can analyze its potential structured representation, so as
[ID, , Flag]
to further derive its precise semantic meaning. For example in Figure 1, the deep semantic representation given by
TexSmart for “22 months ago” is a structured string with a precise date
{"related":[" "," in JSON
"," format:
"," {"value": [2019, 2]}. Deep
semantic representation1 is importantkafield.technology,
2
for applications ,
like task-oriented
"," "," chatbots,
"," where
"," the precise meanings of some
entities are required. So far, most public text understanding "," tools do","
not"]}provide such a feature. As a result, applications
using these tools have to implement deep semantic representation {"related":["
by","themselves.
"," ","
kafield.discipline, ,
2 "," "," "," "," ","
Some NLP toolkits make use of regular 1 expressions or supervised sequence tagging methods to recognize time and
"," "]}
quantity entities. However, it is difficult for those methods to derive structured or deep semantic information of entities.
quantity entities, are{"related":["
To overcome this problem, time and kafield.technology, parsed in TexSmart "," "," by Context
"," Free Grammar (CFG), which
3 "," "," "," ","5g","
is more expressive than regular expressions.
2 Its key idea is similar","ai"]}
to that in [17] and can be described as follows: First,
CFG grammar rules are manually written according to possible natural language expressions of a specific entity type.
4
Second, the Earley algorithm quantity.generic,to parse
[18] is employed ,0 a{"value":[1]}
piece of text to obtain semantic trees of entities. Finally, deep
semantic representations of entities are derived from the semantic trees.
2.2 Other Modules
2.2.1 Word Segmentation
Automatic word segmentation is a task for segmenting a piece of text into a word sequence. The results of word
segmentation are widely used in many downstream NLP tasks. In order to support different application scenarios,
TexSmart provides word segmentation results of two granularity levels: word level (or basic level), and phrase level.
Figure 4 shows an example of English word segmentation results. As shown in the figure, some phrases (especially
noun phrases) are contained in the phrase-level segmentation results. An example of Chinese word segmentation results
is shown in Figure 5.
An unsupervised algorithm is implemented in TexSmart for both English and Chinese word segmentation. We choose
an unsupervised method over supervised ones due to two reasons. First, it is at least 10 times faster. Second, its accuracy
is good enough for most applications.
5T ECHNICAL R EPORT - JANUARY 1, 2021
TexSmart 2020/12/24 下午2:31
Figure 6: An example of the constituency parsing task.
2.2.2 Part-of-Speech Tagging
Part-of-Speech (POS) denotes the syntactic role of each word in a sentence, also known as word classes or syntactic
categories. Popular part-of-speech tags include noun, verb, adjective, etc. The task of POS Tagging is about assigning a
part-of-speech tag to each word in the input text. POS tagging information is important in many information extraction
ARGM-TMP
tasks. For some text understanding tasks like 30
NER, the performance of some algorithms can be improved if POS tagging
ARG0
information is included.1 Part-of-speech tags also
ARGM-LOC
help us to construct good sentences.
ARGM-DIS
In Figure 4 and 5, in addition to word segmentation
ARG1
results, we also present the POS tag of each word in the sentence
under different word segmentation granularities. The Chinese and English part-of-speech tag sets supported by TexSmart
are CTB4 [19] and PTB5 [20], respectively.ARGM-TMP 30
ARG0
2
We implement three models ARGM-LOC
for part-of-speech tagging: Log-linear based, conditional random field (CRF) based and
deep neural network (DNN) based modelsARGM-ADV
[21,
ARG1
22, 23, 24]. We denote them as: log_linear, crf and dnn, respectively. In
the web version of the demo, users can select different algorithms in the drop-down box, i.e., “词性标注算法” in the
Chinese page (Figure 5) or “Algorithm” in the English page (Figure 4). When the HTTP API is called, one of the three
algorithms can be chosen freely.
https://texsmart.qq.com/zh ⻚码:2/3
2.2.3 Coarse-grained NER
Coarse-grained NER aims to identify entity mentions from a sentence and assigns them with coarse-grained entity
types, such as Person, Location, Organization, etc, rather than the fine-grained types as described in Section 2.1.1 for
fine-grained NER. For example, given the following sentence:
“Captain Marvel was premiered in Los Angeles 22 months ago.”,
the mention “Los Angeles” is identified by the coarse-grained NER algorithm with the coarse type “loc.generic”.
The difference between fine-grained and coarse-grained NERs is that the former involves more entity types with a finer
granularity. Since coarse-grained NER includes much fewer entity types,6 it has its own advantages: (1) with a faster
speed, (2) having a higher precision, which makes it suitable for some special applications.
We implement coarse-grained NER using supervised learning methods, including conditional random field (CRF) [22]
based and deep neural network (DNN) based models [23, 24, 25].
4
https://ai.tencent.com/ailab/nlp/texsmart/table_html/table2.html
5
https://ai.tencent.com/ailab/nlp/texsmart/table_html/table3.html
6
The type set considered in TexSmart and its descriptions can be found in https://ai.tencent.com/ailab/nlp/texsmart/
table_html/table4.html.
6T ECHNICAL R EPORT - JANUARY 1, 2021
ARGM-TMP 30
ARG0
1 ARGM-LOC
ARGM-DIS
ARG1
ARGM-TMP 30
ARG0
2 ARGM-LOC
ARGM-ADV
ARG1
Figure 7: An example to illustrate the task of semantic role labeling.
2.2.4 Constituency Parsing
Constituency parsing is used to parse an input word sequence into the corresponding phrase structure tree, which is
based on the phrase structure grammar proposed by Chomsky. The phrase structure includes noun phrases (NP), verb
phrases (VP), etc. For example, in Figure 6, “南昌 (Nanchang)” or “王先生 (Mr Wang)” constitutes a noun phrase, and
“吃 (eat)” or “煲仔饭 (claypot rice)” constitutes a verb phrase.
We present an example to illustrate constituency parsing. The leaf node in the phrase structure tree, such as “流浪
地球 (The Wandering Earth)” in Figure 6, denotes a word of the input sentence. The parent node (e.g., “NN”) of a
leaf node is the part-of-speech tag of the word corresponding to the leaf node. Further, the parent node (e.g., “VP”)
of the part-of-speech tag (e.g, “NN”) node refers to the part-of-speech tag of the phrase structure that constitutes the
leaf nodes, e.g., “看 (watch)” and “流浪地球 (The Wandering Earth)” in Figure 6. The phrase structure labels used in
TexSmart can be found in the home-page of TexSmart.7
We implement the constituency parsing model based on the work [26]. Ref [26] builds the parser by combining a
sentence encoder with a chart decoder based on the self-attention mechanism. Different from work [26] , we use
pre-trained BERT model as the text encoder to extract features to support the subsequent decoder-based parsing. Our
model achieves excellent performance and has low search complexity.
2.2.5 Semantic Role Labeling
In natural language processing, semantic role labeling (also called shallow semantic parsing) tries to assign role labels
to words or phrases in a sentence. The labels indicate the semantic role of words in the sentence, including the basic
elements of a event: trigger word, subject, object and other adjuncts, such as time, location, method, etc. of the event.
The task of semantic role labeling is closely related to event semantics, where the elements that make up an event are
called semantic roles. In semantic role labeling, we need to identify all the events and their basic elements in a sentence.
We present an example to illustrate semantic role labeling. In Figure 7, two events with trigger words of “看 (watch)”
and “吃 (eat)” are detected by TexSmart. Further, all the elements of the two events are also detected. For example,
given the trigger word “看 (watch)”, the subject (ARG0), object (ARG1), adjunct time (ARGM-TMP) and adjunct
location (ARGM-LOC) are “南昌的王先生 (Mr. Wang in Nanchang)”, “流浪地球 (The Wandering Earth)”, “上个
月30号(On the 30th of last month)” and “在自己家里 (in his own home)”, respectively.
TexSmart takes a sequence labeling model with BERT as the text encoder for semantic role labeling similar to [27]. The
results of semantic role labeling support many downstream tasks, such as deeper semantic analysis (AMR Parsing, CCG
Parsing, etc.), intention recognition in a task-oriented dialogue system, entity scoring in knowledge-based question
7
https://ai.tencent.com/ailab/nlp/texsmart/table_html/table6.html (CTB, Chinese) and https:
//ai.tencent.com/ailab/nlp/texsmart/table_html/table7.html (PTB, English), respectively.
7T ECHNICAL R EPORT - JANUARY 1, 2021
answering, etc. TexSmart supports semantic role labeling on both Chinese and English texts. The Chinese label set
used can be found in the home-page of TexSmart.8
2.2.6 Text Classification
Text Classification aims to assign a semantic label for an input text among a predefined label set. Text Classification is a
classical task in NLP and it has been widely used in many applications, such as spam filtering, sentiment analysis and
question classification. For example, given the input text:
“TensorFlow (released as open source on November 2015) is a library developed by Google to accelerate deep learning
research.”,
TexSmart classifies it into a category named “tc.technology” with a confidence score of 0.198. The predefined label set
in TexSmart is available on the web page.9
2.2.7 Text Matching
Text matching is the task of calculating the semantic similarity between two pieces of text. It has wide applications in
NLP, information retrieval, Web search, and recommendation systems. For example, it is a core task in information
retrieval to estimate the similarity between an input query and a document. As another example, question answering is
usually modeled as the matching between a question and its answer candidate.
We implement two text matching algorithms in TexSmart: Linkage and ESIM [28]. Linkage is an unsupervised
algorithm designed by ourselves that incorporates synonymy information and word embedding knowledge to compute
semantic similarity. ESIM is the abbreviation of “Enhanced LSTM for Natural Language Inference”. Different from the
previous models with complicated network architectures, ESIM carefully designs the sequential model with both local
and global inference based on chain LSTMs and outperforms the counterparts.
For training the ESIM model for Chinese, we automatically created a large-scale labeled dataset of Chinese sentences,
to ensure a better generalization capacity than the models trained from small datasets. Since the English training data is
not ready yet, the English version of ESIM is not available so far. We would like to leave it as future work.
3 System Usage
Two ways are available to use TexSmart: Calling the HTTP API directly, or downloading one version of the offline SDK.
Note that for the same input text, the results from the HTTP API and the SDK may be slightly different, because the
HTTP API employs a larger knowledge base and supports more text understanding tasks and algorithms. The detailed
comparison between the SDK and the HTTP API is available in
https://ai.tencent.com/ailab/nlp/texsmart/en/instructions.html.
The latest and history versions of the SDK can be downloaded from
https://ai.tencent.com/ailab/nlp/texsmart/en/download.html,
and detailed information about how to call the TexSmart APIs is introduced in
https://ai.tencent.com/ailab/nlp/texsmart/en/api.html.
3.1 Offline Toolkit (SDK)
So far the SDK supports Linux, Windows, and Windows Subsystem for Linux (WSL). Mac OS support will be added in
v0.3.0. Programming languages supported include C, C++, Python (both version 2 and version 3) and Java (version ≥
1.6.0).
Example codes for using the SDK with different programming languages are in the ./examples sub-folder. For example,
the Python codes in ./examples/python/en_nlu_example1.py show how to use the TexSmart SDK to process an English
sentence. The C++ codes in ./examples/c_cpp/src/nlu_cpp_example1.cc show how to use the SDK to analyze both an
English sentence and a Chinese sentence.
8
https://ai.tencent.com/ailab/nlp/texsmart/table_html/table8.html, and the English are presented in https:
//ai.tencent.com/ailab/nlp/texsmart/table_html/table9.html.
9
https://ai.tencent.com/ailab/nlp/texsmart/table_html/tc_label_set.html.
8The input post-data needs to be in JSON format, and the output is also in JSON format. It is recommended to use Postman for testing.
Here is a simple example request (as HTTP-POST body):
{"str":"He stayed in San Francisco."}
T ECHNICAL R EPORT - JANUARY 1, 2021
Example codes for calling the API: Python Code | Java Code | C++ Code | C# Code
The results returned after calling the API is also in JSON format as follows:
{
"header":{"time_cost_ms":1.18,"time_cost":0.00118,
"core_time_cost_ms":1.139,"ret_code":"succ"},
"norm_str":"He stayed in San Francisco.",
"word_list":[
{"str":"He","hit":[0,2,0,1],"tag":"PRP"},
{"str":"stayed","hit":[3,6,1,1],"tag":"VBD"},
{"str":"in","hit":[10,2,2,1],"tag":"IN"},
{"str":"San","hit":[13,3,3,1],"tag":"NNP"},
{"str":"Francisco","hit":[17,9,4,1],"tag":"NNP"},
{"str":".","hit":[26,1,5,1],"tag":"NFP"}
],
"phrase_list":[
{"str":"He","hit":[0,2,0,1],"tag":"PRP"},
{"str":"stayed","hit":[3,6,1,1],"tag":"VBD"},
{"str":"in","hit":[10,2,2,1],"tag":"IN"},
{"str":"San Francisco","hit":[13,13,3,2],"tag":"NNP"},
{"str":".","hit":[26,1,5,1],"tag":"NFP"}
],
"entity_list":[
{"str":"San Francisco","hit":[13,13,3,2],"tag":"loc.city","tag_i18n":"city",
"meaning":{"related":["Los Angeles", "San Diego", "San Jose", "Santa Clara", "Palo Alto",
"Santa Cruz", "Sacramento", "San Mateo", "Santa Barbara", "Oakland"]}}
],
"syntactic_parsing_str":"",
"srl_str":""
}
Note that, the field “header” gives some auxiliary information (time cost, error codes, etc.) which explains this API call; the field “norm_str” gives
theFigure
result of8: The
text JSON results
normalization; the fieldof the text
“word_list” understanding
contains the results of API for inputword
basic-granularity sentence “He stayed
segmentation in San Francisco.”
and part-of-speech tagging; the
field “phrase_list” is the word segmentation and part-of-speech tagging of. compound granularity; the field “entity_list” gives all the recognized
entities and their types; and the fields “syntactic_parsing_str” and “srl_str” represent constituent syntax tree and semantic role labeling results
respectively. In this example, the field “syntactic_parsing_str” and “srl_str” have both empty strings, because “syntactic_parsing” and “srl” are not
activated by default.
3.2 TexSmart HTTP API
Instructions on Input and Output Format
The An
HTTP API of
introduction ofeach
TexSmart
field of thecontains two
input JSON parts:
object the text understanding API and the text matching API.
is as follows:
3.2.1Field
Name
Data
Field Introduction
Text Understanding
Type API
str string The input text to be analyzed
The text understanding API can be accessed via HTTP-POST and the URL is available on the web page10 .
option information, mainly used to assign functions to be called
the inputJSON
Bothoptions post-data
Object and the
and algorithms response
to be are in JSON
used for functions. can beIt is recommended to use Postman11 to test the API.
format.
More details
found in the section of "More Ways to Call".
Below is a sample input:
The JSON object can be defined by users and the TexSmart
{"str": "He stayed
echo_data
JSON inservice
San Francisco."}
will return the same object in the way of echo.
Object users can use this field to record the identity information of the
The results are shown current request. 8. Note that, in this example, both fields “syntactic_parsing_str” and “srl_str” have
in Figure
empty strings, because syntactic parsing and semantic role labeling are not enabled by default. We summarize the fields
The introduction of each field of output JSON object is as follows:
used and their descriptions in Table 1. More details about the output JSON contents is available on the web page.12
Data
Field Name Field Introduction
Type
Table 1: Major fields of the output JSON object returned by the text understanding API
Auxiliary information returned after calling and
Field execution
Description
time_cost_ms field: the total time for processing
header Containing
request calculated some auxiliary(ms).
with millisecond information about this API call, including time cost, error
time_cost
codes, etc field: the total time for processing request
calculated with second (s).
norm_str Normalized
ret_code textcode. "succ" denotes success;
field: return
word_list others are false
Results of codes which include
word-level word the following
segmentation and POS tagging
JSON cases:
header
phrase_list Results of phrase-level word segmentation and POS tagging
entity_list Results of NER, semantic expansion, and deep semantic representation
syntactic_parsing_str Results of constituency syntax parsing
srl_str Results of semantic role labeling
cat_list Results of text classification
Advanced options Some options can be added to enable/disable some modules, and to specify the algorithms used in
some modules. Some key options are summarized in Table 2. Figure 9 shows an example input JSON with options.
10
https://texsmart.qq.com/api
11
https://www.postman.com/downloads/
12
https://ai.tencent.com/ailab/nlp/texsmart/en/api.html.
9T ECHNICAL R EPORT - JANUARY 1, 2021
Table 2: Major fields of the options JSON object in the text understanding API. Field x.y means field y in the value of
field x.
Field Values Description
input_spec.lang {auto, chs, en} Specify the language of the input text. The default
value is “auto”, which means the language is detected
automatically.
word_seg.enable {true, false} Enable or disable the word segmentation module.
pos_tagging.enable {true, false} Enable or disable the POS tagging module.
pos_tagging.alg {log_linear, crf, dnn} The algorithm used in the POS tagging module.
ner.enable {true, false} Enable or disable the NER module.
ner.alg {coarse.crf, coarse.dnn, The algorithm used in the NER module.
coarse.lua, fine.std,
fine.high_acc}
syntactic_parsing.enable {true, false} Enable or disable the syntactic parsing module.
srl.enable {true, false} Enable or disable the semantic role labeling module. 2020/12/31 下午3:52
{
"str":"he stayed in San Francisco.",
"options":
{
"input_spec":{"lang":"auto"},
"word_seg":{"enable":true},
"pos_tagging":{"enable":true,"alg":"log_linear"},
"ner":{"enable":true,"alg":"fine.std"},
"syntactic_parsing":{"enable":false},
"srl":{"enable":false},
"text_cat":{"enable":false}
},
"echo_data":{"request_id":12345}
}
Batch Call
Figure
TexSmart also supports API for batch calling: 9: An
through example input
a JSON-format input, itJSON containing
can analyze options
multiple (Chinese/English) sentences. Here is an input
example with its JSON format:
En | En |
{
"str":[
TexSmart
" 30 HTTP API forTexSmart Text Matching HTTP API ", for Text Matching
"2020 2 7 ",
"John Smith stayed in SanTexSmart Francisco API last month."
includes two types: i.e. Text Understanding API and Text Matching API, and this page introduces text matching API.
TexSmart API includes two types: i.e. Text Understanding API and Text Matching API, and this page introduces text matching API.
] The API
The API supports access via HTTP-POST andsupports
its url isaccess via HTTP-POST and its url is https://texsmart.qq.com/api/match_text.
https://texsmart.qq.com/api/match_text.
} The input post-data
The input post-data needs to be in JSON format, and theneeds
outputtoisbe in in
also JSONJSONformat, and
format. the
It is output is alsoto
recommended in use
JSON format.for
Postman It is recommended to use Postman for
testing.
Note thattesting.
the output format of batch call is slightly different from that of the ordinary call, and the result of all sentences is a JSON array which is
Figure
used as the value 10: Anfield.
of "res_list" example to illustrate
Here is a simple example request (asHere
batch
is a simple
HTTP-POST
call with multiple languages in text understanding API.
example request (as HTTP-POST body):
body):
{ {
Code Example
"text_pair_list": [ "text_pair_list": [
{"str1":"hello","str2":"how{"str1":"hello","str2":"how
are you?"}, are you?"},
{"str1":"you and me","str2":"me {"str1":"you
and you"}, and me","str2":"me and you"},
API Call with Python {"str1":"you
{"str1":"you and i","str2":"he and i"}, and i","str2":"he and i"},
], ],
Code Example 1 wotjhttp.client
"options":{"alg":"esim"}, "options":{"alg":"esim"},
"echo_data":{"id":123} "echo_data":{"id":123}
# -*- coding:
} utf8 -*- }
import json
import Example Example
codes for calling the API: Python
http.client codes
Code for Code
| Java calling| C++
the API:
CodePython Code | Java Code | C++ Code | C# Code
| C# Code
Figure 11: An example input of text matching API.
The results returned after calling the The
API isresults returned
also in after calling
JSON format the API is also in JSON format as follows:
as follows:
obj = {"str": "he stayed in San Francisco."}
req_str{ = json.dumps(obj) {
"header":{"time_cost_ms":1.185,"time_cost":0.001185,"core_time_cost_ms":1.111,"ret_code":"succ
"header":{"time_cost_ms":1.185,"time_cost":0.001185,"core_time_cost_ms":1.111,"ret_code":"succ"},
"res_list":[ "res_list":[
conn = http.client.HTTPSConnection("texsmart.qq.com")
{"score":0.723794},
conn.request("POST", "/api", req_str) {"score":0.723794},
response {"score":0.838658},
= conn.getresponse() {"score":0.838658},
{"score":0.853867}
print(response.status, response.reason) {"score":0.853867}
res_str =] response.read().decode('utf-8') ]
}
print(res_str) }
#print(json.loads(res_str))
Note
Note that, the field “header” gives some that, theinformation
auxiliary field “header” gives
(time cost,some
errorauxiliary information
codes, etc.) (time cost,
which explains thiserror codes,
API call; theetc.)
field which explains this API call; the field
“res_list”
Code Example 2 returns Figure 12: The
the matching
with requests module scores“res_list”
output returns
of
for all input
installed the the matching
text
pairs. matching scoresAPIfor all
forinput
thepairs.
input text in Figure 11.
Instructions
# -*- coding: utf8 on-*-Input and Output Instructions
Format on Input and Output Format
import json
import An An introduction
introduction of each field of the input
requests JSON objectofiseach field of the input JSON object is as follows:
as follows:
10
obj = {"str":
Field "heData Field
stayed in San Francisco."}Data
Field Introduction Field Introduction
req_str Name
= json.dumps(obj).encode()
Type Name Type
str string The input text to str
url = "https://texsmart.qq.com/api" be analyzedstring The input text to be analyzedT ECHNICAL R EPORT - JANUARY 1, 2021
Batch mode Multiple sentences (of possibly different languages) can be processed in one API call. An example is
shown in Figure 10, where one English sentence and two Chinese sentences are processed in one API call. Note that the
output format of the batch mode is slightly different from that of the ordinary mode. In the batch mode, the results for
the input sentences form a JSON array, as the value of the “res_list” field.
3.2.2 Text Matching API
The text matching API is used to calculate the similarity between a pair of sentences. Similar to the text understanding
API, the text matching API also supports access via HTTP-POST and the URL is available on the web page.13
Figure 11 shows an example input JSON. The corresponding response returned by the API is shown in Figure 12.
4 System Evaluation
4.1 Settings
Semantic Expansion The performance of semantic expansion are evaluated based on human annotation. We first
select at random 5,000 pairs (called SE pairs) from our test set of NER (to make sure
that the entities selected are correct). Then our semantic expansion algorithm is applied to the SE pairs to generate a
related-entity list for each pair. Top nine expansion results of each SE pair are then judged by human annotators in
terms of quality and relatedness, with each result annotated by two annotators. For each result, a label of 2, 1, or 0 is
assigned by each annotator. The three labels mean “highly related”, “slightly related”, and “not related” respectively. In
calculating evaluation scores, the three labels are normalized to scores 100, 50, and 0 respectively. The score of each
result is the average of the scores given by the two annotators.
Fine-grained NER Ref [15] provides a test set for fine-grained NER evaluation. However, this dataset only contains
about 400 sentences. In addition, it misses some important entities during human annotation, which is a common issue
in building a dataset for evaluating fine-grained NER [29]. Therefore, we create a larger fine-grained NER dataset,
based on the Ontonotes 5.0 dataset. We ask human annotators to label fine-grained types for each coarse-labeled entity.
Since human annotators do not need to identify mentions from scratch, it would mitigate the missing entities issue to
some extent. Furthermore, because it is too costly for human annotators to annotate types from the entire ontology, we
instead take a sub-ontology for human annotation which combines all types from [15] and [30], including 140 types in
total.
To set the hybrid method for fine-grained NER, we select LUA [25] as the coarse-grained NER model, which is trained
on Ontonotes 5.0 training dataset [16]. To compare fine-grained NER against coarse-grained NER, we report a variant
of F1 measure for evaluation which only differs from standard F1 in matching count accumulation: if an output type is
a fine-grained type and it exactly matches a gold fine-grained type, the matching count accumulates 1; if an output is a
coarse grained type and it is compatible with a gold fine-grained type, the matching count accumulates 0.5.
POS Tagging We evaluate three POS tagging algorithms: log-linear, CRF, and DNN. They are all trained on PTB for
English and CTB 9.0 for Chinese. We use their corresponding test sets to evaluate all the models.
Coarse-grained NER To ensure better generalization to industrial applications, we combine several public training
sets together for English NER. They are CoNLL2003 [31], BTC [32], GMB [33], SEC_FILING [34], WikiGold [35, 36],
and WNUT17 [37]. Since the label set for all these datasets are slightly different, we only maintain three common labels
(Person, Location and Organization) for training and testing. For Chinese, we create a NER dataset including about
80 thousand sentences labeled with 12 entity types, by following a similar guideline to that of the Ontonotes dataset.
We randomly split it into a training set and a test set with ratio of 3:1. We evaluate two algorithms for coarse-grained
NER: CRF and DNN. For DNN, we implement the RoBERTa-CRF and Flair models. As we found RoBERTa-CRF
performs better on the Chinese dataset while Flair is better on the English dataset, we report results of RoBERTa-CRF
for Chinese and Flair for English in our experiments.
Constituency Parsing We conduct parsing experiments on both English and Chinese datasets. For English task, we
use WSJ sections in Penn Treebank (PTB) [38], and we follow the standard splits: the training data ranges from section
2 to section 21; the development data is section 24; and the test data is section 23. For Chinese task, we use the Penn
Chinese Treebank (CTB) of the version 5.1 [19]. The training data includes the articles 001-270 and articles 440-1151;
the development data is the articles 301- 325; and the test data is the articles 271-300.
13
https://texsmart.qq.com/api/match_text.
11T ECHNICAL R EPORT - JANUARY 1, 2021
Table 3: Semantic expansion and fine-grained NER evaluation results. Semantic expansion is evaluated by the
averaged score from human annotators. Fine-grained NER is evaluated by a variant of F1 score which can measure both
coarse-grained and fine-grained models against a fine-grained ground-truth dataset.
Semantic Expansion (human score) Fine-grained NER (F1)
ZH EN LUA Hybrid
Quality (human score or F1) 79.5 80.5 45.9 53.8
Table 4: Evaluation results for some POS Tagging and coarse-grained NER algorithms in TexSmart on both English
(EN) and Chinese (ZH) datasets. The English and Chinese NER datasets are labeled with 3 and 12 entity types
respectively.
POS Tagging Coarse-grained NER
Log-linear CRF DNN CRF DNN
EN ZH EN ZH EN ZH EN ZH EN ZH
F1 96.76 93.94 96.50 93.73 97.04 98.08 73.24 67.26 83.12 75.23
Sents/sec 3.9K 1.3K 149 1.1K 107
SRL Semantic role labeling experiments are conducted on both English and Chinese datasets. We use the CoNLL
2012 datasets [39] and follow the standard splits for the training, development and test sets. The network parameters of
our model are initialized using RoBERTa. The batch size is set to 32 and the learning rate is 5×10−5 .
Text Matching Two text matching algorithms are evaluated: ESIM and Linkage. The datasets used in evaluating
English text matching are MRPC14 and QUORA15 . For Chinese text matching, four datasets are involved: LCQMC [40],
AFQMC [13], BQ_CORPUS [41], and PAWS-zh [42]. We evaluate the quality and speed for both ESIM and Linkage
algorithms in terms of F1 score and sentences per second, respectively. Since we have not trained the English version of
ESIM yet, the corresponding evaluation results are not reported.
Table 5: Evaluation results for constituency parsing and SRL
Constituency Parsing SRL
EN ZH EN ZH
F1 95.42 92.25 86.7 82.1
Sents/sec 16.60 16.00 10.2 11.5
4.2 Evaluation Results
Table 3 shows the evaluation results of semantic expansion and fine-grained NER. For semantic expansion, it is shown
that TexSmart achieves an accuracy of about 80.0 on both English and Chinese datasets. It is a pretty good performance.
For fine-grained NER, it is observed that the hybrid approach performs much better than the supervised model (LUA).
The evaluation results for POS Tagging and coarse-grained NER are listed in Table 4. The speed values in this table are
measured in sentences per second. Please note that the speed results for Log-linear and CRF are obtained using one
single thread, while the speed results for DNN are on 6 threads.
It is clear from the POS tagging results that the three algorithms form a spectrum. On one side of the spectrum is the
log-linear algorithm, which is very fast but less accurate than the DNN algorithm. On the opposite side is the DNN
algorithm, which achieves the best accuracy but are much slower than the other two algorithms. The CRF algorithm is
in the middle of the spectrum.
Also from Table 4, we can see that the two coarse-grained NER algorithms form another spectrum. The CRF algorithm
is on the high-speed side, while the DNN algorithm is on the high-accuracy side. Note that for DNN methods in this
table, we employ a data augmentation method to improve their generalization abilities and a knowledge distillation
method to speed up its inference [43].
Evaluation results for constituency parsing and semantic role labeling are summarized in Table 5. For constituency
parsing, the F1 scores on the English and Chinese test sets are 95.42 and 92.25, respectively. The decoding speed
14
https://www.microsoft.com/en-us/download/details.aspx?id=52398.
15
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs.
12T ECHNICAL R EPORT - JANUARY 1, 2021
Table 6: Text matching evaluation results. ESIM is a supervised algorithm and it is trained on an in-house labeled
dataset only for Chinese. Linkage is an unsupervised algorithm and it is trained for both English and Chinese.
English Chinese
Algorithms Sents/Sec
MRPC QUORA LCQMC AFQMC BQ_CORPUS PAWS-zh
ESIM 861 - - 82.63 51.30 71.05 61.55
Linkage 1973 82.18 74.94 79.26 48.66 71.23 62.30
depends on the input sentence length. It can process 16.6 and 16.0 sentences per second on our test sets. For SRL, the
F1 scores on the English and Chinese test sets are 86.7 and 82.1 respectively and it processes about 10 sentences per
second. The speed may be too slow for some applications. As future work, we plan to design more efficient syntactic
parsing and SRL algorithms.
Table 6 shows the performance of two algorithms for text matching. We can see from this table that, in terms of speed,
both algorithms are fairly efficient. Please note that the speed is measured in sentences per second using one single CPU.
In terms of accuracy, their performance comparison depends on the dataset being used. ESIM performs apparently
better on the first two datasets, while slightly worse on the last one. Applications may need to test on their datasets
before making decision between the two algorithms.
5 Conclusion
In this technical report we have presented TexSmart, a text understanding system that supports fine-grained NER,
enhanced semantic analysis, as well as some common text understanding functionalities. We have introduced the main
functions of TexSmart and key algorithms for implementing the functions. Some instructions about how to use the
TexSmart offline SDK and online APIs have been described. We have also reported some evaluation results on major
modules of TexSmart.
References
[1] Edward Loper and Steven Bird. Nltk: the natural language toolkit. In Proceedings of the ACL-02 Workshop on
Effective tools and methodologies for teaching natural language processing and computational linguistics, 2002.
[2] https://opennlp.apache.org.
[3] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky.
The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association
for computational linguistics: system demonstrations, pages 55–60, 2014.
[4] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters,
Michael Schmitz, and Luke Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. In
Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia, July 2018.
Association for Computational Linguistics.
[5] Wanxiang Che, Zhenghua Li, and Ting Liu. Ltp: A chinese language technology platform. In Coling 2010:
Demonstrations, pages 13–16, 2010.
[6] Xipeng Qiu, Qi Zhang, and Xuan-Jing Huang. Fudannlp: A toolkit for chinese natural language processing. In
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations,
pages 49–54, 2013.
[7] Jialong Han, Aixin Sun, Haisong Zhang, Chenliang Li, and Shuming Shi. Case: Context-aware semantic expansion.
In AAAI, pages 7871–7878, 2020.
[8] Marti A Hearst. Automatic acquisition of hyponyms from large text corpora. In Coling 1992 volume 2: The 15th
international conference on computational linguistics, 1992.
[9] Fan Zhang, Shuming Shi, Jing Liu, Shuqi Sun, and Chin-Yew Lin. Nonlinear evidence fusion and propagation
for hyponymy relation mining. In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies, pages 1159–1168, 2011.
[10] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words
and phrases and their compositionality. Advances in neural information processing systems, 26:3111–3119, 2013.
13You can also read