ASAP: A Chinese Review Dataset Towards Aspect Category Sentiment Analysis and Rating Prediction


Jiahao Bu, Lei Ren, Shuang Zheng, Yang Yang, Jingang Wang, Fuzheng Zhang, Wei Wu
NLP Center, Meituan
{bujiahao, renlei04, zhengshuang04, yangyang113, wangjingang02, zhangfuzheng, wuwei30}@meituan.com

Abstract

Sentiment analysis has attracted increasing attention in e-commerce. The sentiment polarities underlying user reviews are of great value for business intelligence. Aspect category sentiment analysis (ACSA) and review rating prediction (RP) are two essential tasks for detecting fine-to-coarse sentiment polarities. ACSA and RP are highly correlated and usually employed jointly in real-world e-commerce scenarios. However, most public datasets are constructed for ACSA and RP separately, which may limit the further exploitation of both tasks. To address the problem and advance related research, we present ASAP, a large-scale Chinese restaurant review dataset including 46,730 genuine reviews from a leading online-to-offline (O2O) e-commerce platform in China. Besides a 5-star scale rating, each review is manually annotated according to its sentiment polarities towards 18 pre-defined aspect categories. We hope the release of the dataset can shed some light on the field of sentiment analysis. Moreover, we propose an intuitive yet effective joint model for ACSA and RP. Experimental results demonstrate that the joint model outperforms state-of-the-art baselines on both tasks.

arXiv:2103.06605v1 [cs.CL] 11 Mar 2021

1 Introduction

With the rapid development of e-commerce, massive user reviews available on e-commerce platforms are becoming valuable resources for both customers and merchants. Aspect-based sentiment analysis (ABSA) on user reviews is a fundamental and challenging task which attracts interest from both academia and industry (Hu and Liu, 2004; Ganu et al., 2009; Jo and Oh, 2011; Kiritchenko et al., 2014). According to whether the aspect terms are explicitly mentioned in texts, ABSA can be further classified into aspect term sentiment analysis (ATSA) and aspect category sentiment analysis (ACSA); we focus on the latter, which is more widely used in industry. Specifically, given the review "Although the fish is delicious, the waiter is horrible!", the ACSA task aims to infer that the sentiment polarity over the aspect category food is positive while the opinion over the aspect category service is negative.

The user interfaces of e-commerce platforms are more intelligent than ever before with the help of ACSA techniques. For example, Figure 1 presents the detail page of a coffee shop on a popular e-commerce platform in China. The upper aspect-based sentiment text-boxes display the aspect categories (e.g., food, sanitation) mentioned frequently in user reviews and the aggregated sentiment polarities on these aspect categories (the orange ones represent positive and the blue ones negative). Customers can effectively filter the corresponding reviews by clicking the aspect-based sentiment text-boxes they care about (e.g., the orange filled text-box "卫生条件好" (good sanitation)). Our user survey based on 7,824 valid questionnaires demonstrates that 80.08% of customers agree that the aspect-based sentiment text-boxes are helpful to their decision-making on restaurant choices. Besides, merchants can keep track of their cuisines and service quality with the help of the aspect-based sentiment text-boxes. Most e-commerce platforms, such as Taobao¹, Dianping², and Koubei³, deploy similar user interfaces to improve user experience.

¹ https://www.taobao.com/
² https://www.dianping.com/
³ https://www.koubei.com/

Users also publish their overall 5-star scale ratings together with reviews. Figure 1 displays a sample 5-star rating for the coffee shop. In comparison to fine-grained aspect sentiment, the overall review rating is usually a coarse-grained synthesis of the opinions on multiple aspects. Rating prediction (RP), which aims to predict the "seeing stars" of reviews (Jin et al., 2016; Li et al., 2018; Wu et al., 2019a), also has wide applications. For example, to keep the aspect-based sentiment text-boxes accurate, unreliable reviews should be removed before ACSA algorithms are performed. Given a piece of user review, we can predict a rating for it based on the overall sentiment polarity underlying the text. We assume the predicted rating of the review should be consistent with its ground-truth rating as long as the review is reliable. If the predicted rating and the user rating of a review disagree with each other explicitly, the reliability of the review is doubtful. Figure 2 demonstrates an example review of low reliability. In summary, RP can help merchants detect unreliable reviews.

Figure 1: The user interface of a coffee shop on a popular e-commerce App. The top aspect-based sentiment text-boxes display aspect categories and sentiment polarities. The orange text-boxes are positive, while the blue ones are negative. The reviews mentioning the clicked aspect category (e.g., good sanitation) are shown below together with their ratings. The text spans mentioning the aspect categories are also highlighted.

Figure 2: A content-rating disagreement case. The review holds a 2-star rating while all the mentioned aspects are super positive.

Therefore, both ACSA and RP are of great importance for business intelligence in e-commerce, and they are highly correlated and complementary. ACSA focuses on predicting a review's underlying sentiment polarities on different aspect categories, while RP focuses on predicting the user's overall feelings from the review content. We reckon these two tasks are highly correlated and that better performance could be achieved by considering them jointly.

As far as we know, current public datasets are constructed for ACSA and RP separately, which limits further joint exploration of ACSA and RP. To address the problem and advance the related research, this paper presents a large-scale Chinese restaurant review dataset for Aspect category Sentiment Analysis and rating Prediction, denoted as ASAP for short. All the reviews in ASAP are collected from the aforementioned e-commerce platform. There are 46,730 restaurant reviews attached with 5-star scale ratings. Each review is manually annotated according to its sentiment polarities towards 18 fine-grained aspect categories. To the best of our knowledge, ASAP is the largest Chinese review dataset for both ACSA and RP tasks.

We implement several state-of-the-art (SOTA) baselines for ACSA and RP and evaluate their performance on ASAP. To make a fair comparison, we also perform ACSA experiments on the widely used SemEval-2014 restaurant review dataset (Pontiki et al., 2014). Since BERT (Devlin et al., 2018) has achieved great success in several natural language understanding tasks including sentiment analysis (Xu et al., 2019; Sun et al., 2019; Jiang et al., 2019), we propose a joint model that exploits the fine-to-coarse semantic capability of BERT. Our joint model outperforms the competing baselines on both tasks.

Our main contributions can be summarized as follows. (1) We present a large-scale Chinese review dataset towards aspect category sentiment analysis and rating prediction, named ASAP, including as many as 46,730 real-world restaurant reviews annotated over 18 pre-defined aspect categories. Our dataset has been uploaded in the softconf system and will be released at https://github.com/XXXX⁴. (2) We explore the performance of widely used models for ACSA and RP on ASAP. (3) We propose a joint learning model for ACSA and RP tasks. Our model achieves the best results on both the ASAP and SemEval RESTAURANT datasets.

⁴ Anonymization for double-blind review

2 Related Work and Datasets

Aspect Category Sentiment Analysis. ACSA (Zhou et al., 2015; Movahedi et al., 2019; Ruder et al., 2016; Hu et al., 2018) aims
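To make the expected input and output of the ACSA task concrete, here is a minimal Python sketch using the "fish/waiter" review from the introduction. The function is a hard-coded stand-in for illustration only, not the paper's model; real ACSA systems learn this mapping from annotated data.

```python
def acsa_predict(review: str) -> dict:
    """Toy stand-in for an ACSA model: map a review to per-aspect polarities.

    Hard-coded for the running example from the introduction; a real system
    would be a trained classifier over the pre-defined aspect categories.
    """
    if "fish is delicious" in review and "waiter is horrible" in review:
        return {"food": "positive", "service": "negative"}
    return {}

review = "Although the fish is delicious, the waiter is horrible!"
print(acsa_predict(review))  # {'food': 'positive', 'service': 'negative'}
```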
to predict sentiment polarities on all aspect categories mentioned in the text. The series of SemEval datasets, consisting of user reviews from e-commerce websites, have been widely used and have pushed forward related research (Wang et al., 2016; Ma et al., 2017; Xu et al., 2019; Sun et al., 2019; Jiang et al., 2019). The SemEval-2014 task-4 dataset (SE-ABSA14) (Pontiki et al., 2014) is composed of laptop and restaurant reviews. The restaurant subset includes 5 aspect categories (i.e., Food, Service, Price, Ambience and Anecdotes/Miscellaneous) and 4 polarity labels (i.e., Positive, Negative, Conflict and Neutral). The laptop subset is not suitable for ACSA. The SemEval-2015 task-12 dataset (SE-ABSA15) (Pontiki et al., 2015) builds upon SE-ABSA14 and defines its aspect category as a combination of an entity type E and an attribute type A (e.g., Food#Style Options). The SemEval-2016 task-5 dataset (SE-ABSA16) (Pontiki et al., 2016) extends SE-ABSA15 to new domains and new languages other than English. MAMS (Jiang et al., 2019) tailors SE-ABSA14 to make it more challenging: each sentence contains at least two aspects with different sentiment polarities.

Compared with the prosperity of English resources, high-quality Chinese datasets are not rich enough. "ChnSentiCorp" (Tan and Zhang, 2008), "IT168TEST" (Zagibalov and Carroll, 2008), "Weibo"⁵, and "CTB" (Li et al., 2014) are 4 popular Chinese datasets for general sentiment analysis. However, aspect category information is not annotated in these datasets. Zhao et al. (2014) present two Chinese ABSA datasets for consumer electronics (mobile phones and cameras). Nevertheless, the two datasets only contain 400 documents (∼4,000 sentences), in which each sentence mentions at most one aspect category. Peng et al. (2017) summarize the available Chinese ABSA datasets. Most of them are constructed through rule-based or machine-learning-based approaches, which inevitably introduce additional noise into the datasets. Our ASAP surpasses the above Chinese datasets in both quantity and quality.

⁵ http://tcci.ccf.org.cn/conference/2014/pages/page04_dg.html

Rating Prediction. Rating prediction (RP) aims to predict the "seeing stars" of reviews, which represent the overall ratings of reviews. In comparison to fine-grained aspect sentiment, the overall review rating is usually a coarse-grained synthesis of the opinions on multiple aspects. Ganu et al. (2009), Li et al. (2011) and Chen et al. (2018) formulate this task as a text classification or regression problem. Considering the importance of opinions on multiple aspects in reviews, recent years have seen numerous works (Jin et al., 2016; Cheng et al., 2018; Li et al., 2018; Wu et al., 2019a) utilizing aspect information to improve rating prediction performance. This trend also inspires the motivation of ASAP.

Most RP datasets are crawled from real-world review websites and created specifically for RP. The Amazon Product Review English dataset (McAuley and Leskovec, 2013), containing product reviews and metadata from Amazon, has been widely used for RP (Cheng et al., 2018; McAuley and Leskovec, 2013). Another popular English dataset comes from the Yelp Dataset Challenge 2017⁶, which includes reviews of local businesses in 12 metropolitan areas across 4 countries. Openrice⁷ is a Chinese RP dataset composed of 168,142 reviews. Neither the English nor the Chinese datasets annotate fine-grained aspect category sentiment polarities.

We hope the release of ASAP can bridge the gap between ACSA and RP.

⁶ http://www.yelp.com/dataset_challenge/
⁷ https://www.openrice.com

3 Dataset Collection and Analysis

3.1 Data Construction & Curation

We collect reviews from one of the most popular O2O e-commerce platforms in China, which allows users to publish coarse-grained star ratings and write fine-grained reviews for the restaurants (or places of interest) they have visited. In the reviews, users comment on multiple aspects either explicitly or implicitly, including ambience, price, food, service, and so on.

First, we randomly retrieve a large volume of user reviews from popular restaurants holding more than 50 user reviews. Then, 4 pre-processing steps are performed to ensure the ethics, quality, and reliability of the reviews. (1) User information (e.g., user-ids, usernames, avatars, and post-times) is removed due to privacy considerations. (2) Short reviews with fewer than 50 Chinese characters, as well as lengthy reviews with more than 1,000 Chinese characters, are filtered out. (3) If the ratio of non-Chinese characters within a review is over 70%, the review is discarded. (4) To detect low-quality reviews (e.g., advertising texts), we build
a BERT-based classifier that achieves an accuracy of 97% on a held-out test set. The reviews detected as low-quality by the classifier are discarded too.

3.2 Aspect Categories

Since the reviews already hold users' star ratings, this section mainly introduces our annotation details for ACSA. In the SE-ABSA14 restaurant dataset (denoted as RESTAURANT for simplicity), there are 5 coarse-grained aspect categories: food, service, price, ambience and miscellaneous. After an in-depth analysis of the collected reviews, we find the aspect categories mentioned by users are rather diverse and fine-grained. Take the text "...The restaurant holds a high-end decoration but is quite noisy since a wedding ceremony was being held in the main hall... (...环境看起来很高大上的样子,但是因为主厅在举办婚礼非常混乱,感觉特别吵...)" in Table 2 for example: the reviewer actually expresses opposite sentiment polarities on two fine-grained aspect categories related to ambience. The restaurant's decoration is very high-end (Positive), while it is very noisy due to an ongoing ceremony (Negative). Therefore, we summarize the frequently mentioned aspects and refine the 5 coarse-grained categories into 18 fine-grained categories. We replace miscellaneous with location since we find users usually review a restaurant's location (e.g., whether the restaurant is easy to reach by public transportation). We denote each aspect category in the form "Coarse-grained Category#Fine-grained Category", such as "Food#Taste" and "Ambience#Decoration". The full list of aspect categories and their definitions is given in Table 1.

3.3 Annotation Guidelines & Process

Bearing in mind the 18 pre-defined aspects, assessors are asked to annotate sentiment polarities towards the mentioned aspect categories of each review. Given a review, when an aspect category is mentioned within the review either explicitly or implicitly, the sentiment polarity over the aspect category is labeled as 1 (Positive), 0 (Neutral) or −1 (Negative), as shown in Table 2.

We hire 20 vendor assessors, 2 project managers, and 1 expert reviewer to perform the annotations. Each assessor needs to attend a training session to ensure an intact understanding of the annotation guidelines. Three rounds of annotation are conducted sequentially. First, we randomly split the whole dataset into 10 groups, and every group is assigned to 2 assessors to annotate independently. Second, each group is split into 2 subsets according to the annotation results, denoted as Sub-Agree and Sub-Disagree. Sub-Agree comprises the examples on which the two assessors agree, and Sub-Disagree comprises the examples on which they disagree. Sub-Agree is reviewed by assessors from other groups, and the controversial examples found during this review are considered difficult cases. Sub-Disagree is reviewed by the 2 project managers independently, who then discuss the examples to reach an agreed annotation. The examples that cannot be resolved after discussion are also considered difficult cases. Third, for each group, the difficult examples from the two subsets are delivered to the expert reviewer, who makes the final decision. More details on difficult cases and the annotation guidelines are given in Appendix A.2.

Finally, the ASAP corpus consists of 46,730 real-world user reviews, which we randomly split into a training set (36,850), a validation set (4,940) and a test set (4,940). Table 2 presents an example review from ASAP and the corresponding annotations on the 18 aspect categories.

3.4 Dataset Analysis

Figure 3 presents the distribution of the 18 aspect categories in ASAP. Because ASAP concentrates on the restaurant domain, 94.7% of reviews mention Food#Taste, as expected. Users also pay great attention to fine-grained aspect categories such as Service#Hospitality, Price#Level and Ambience#Decoration. The distribution demonstrates the advantages of ASAP, as users' fine-grained preferences can reflect the pros and cons of restaurants more precisely.

The statistics of ASAP are presented in Table 3. We also include a tailored SE-ABSA14 RESTAURANT dataset for reference. Please note that we remove the reviews holding aspect categories with the sentiment polarity "conflict" from the original RESTAURANT dataset.

Compared with RESTAURANT, ASAP excels in the quantity of training instances, which supports the exploration of recent data-intensive deep neural models. ASAP is a review-level dataset, while RESTAURANT is a sentence-level dataset. The average length of reviews in ASAP is much longer, so the reviews tend to contain richer aspect information. In ASAP, the reviews contain 5.8 aspect categories on average, which is 4.7 times that of RESTAURANT. Both review-level ACSA and RP are more challenging than their sentence-level counterparts. Take the review in Table 2 for example: the review contains several sentiment polarities towards multiple aspect categories. In addition to aspect category sentiment annotations, ASAP also includes overall user ratings for reviews. With the help of ASAP, ACSA and RP can be further optimized either separately or jointly.
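The four pre-processing rules of Section 3.1 can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the 50/1,000-character bounds and the 70% threshold come from the paper, while the helper names and the rough CJK-range check are ours; rule (1) (stripping user information) and rule (4) (the BERT-based low-quality classifier) are omitted.

```python
def is_chinese(ch: str) -> bool:
    """Rough check for a CJK Unified Ideograph (illustration only)."""
    return "\u4e00" <= ch <= "\u9fff"

def keep_review(text: str) -> bool:
    """Apply the length and character-ratio filters (rules 2 and 3)."""
    n_chinese = sum(is_chinese(ch) for ch in text)
    if n_chinese < 50 or n_chinese > 1000:   # rule 2: Chinese-character bounds
        return False
    non_chinese_ratio = 1 - n_chinese / max(len(text), 1)
    return non_chinese_ratio <= 0.70         # rule 3: non-Chinese ratio cap

print(keep_review("短评"))      # False: far fewer than 50 Chinese characters
print(keep_review("菜" * 200))  # True: 200 Chinese characters, ratio 0
```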
Table 1: The full list of 18 aspect categories and definitions.

Aspect category                      | Definition
Food#Taste (口味)                    | Food taste
Food#Appearance (外观)               | Food appearance
Food#Portion (分量)                  | Food portion
Food#Recommend (推荐程度)            | Whether the food is worth being recommended
Price#Level (价格水平)               | Price level
Price#Cost effective (性价比)        | Whether the restaurant is cost-effective
Price#Discount (折扣力度)            | Discount strength
Location#Downtown (位于商圈附近)     | Whether the restaurant is located near downtown
Location#Transportation (交通方便)   | Convenient public transportation to the restaurant
Location#Easy to find (是否容易寻找) | Whether the restaurant is easy to find
Service#Queue (排队时间)             | Whether the queue time is acceptable
Service#Hospitality (服务人员态度)   | Waiters'/waitresses' attitude/hospitality
Service#Parking (停车方便)           | Parking convenience
Service#Timely (点菜/上菜速度)       | Order/serving time
Ambience#Decoration (装修)           | Decoration level
Ambience#Noise (嘈杂情况)            | Whether the restaurant is noisy
Ambience#Space (就餐空间)            | Dining space and seat size
Ambience#Sanitary (卫生情况)         | Sanitary condition

Figure 3: The distribution of 18 fine-grained aspect categories in ASAP.

4 Methodology

4.1 Problem Formulation

We use D to denote the corpus of user reviews in the training data. Given a review R consisting of a sequence of words {w1, w2, ..., wZ}, where Z denotes the length of R, ACSA aims to predict the sentiment polarity yi ∈ {Positive, Neutral, Negative} of R with respect to each mentioned aspect category ai, i ∈ {1, 2, ..., N}. N is the number of pre-defined aspect categories (i.e., 18 in this paper). Suppose there are K mentioned aspect categories in R. We define a mask vector [p1, p2, ..., pN] to indicate the occurrence of aspect categories: when the aspect category ai is mentioned in R, pi = 1; otherwise, pi = 0. So we have Σ_{i=1}^{N} pi = K. RP aims to predict a 5-star rating score g, which represents the overall rating of the given review R.

4.2 Joint Model

Given a user review, ACSA focuses on predicting its underlying sentiment polarities on different aspect categories, while RP focuses on predicting the user's overall feelings from the review content. We reckon these two tasks are highly correlated and that better performance could be achieved by considering them jointly.

The advent of BERT has established the success of the "pre-training and then fine-tuning" paradigm
Table 2: A review example in ASAP, with overall star rating and aspect category sentiment polarity annotations (1 = positive, 0 = neutral, −1 = negative, "-" = not mentioned).

Review (overall rating: 3-Star):
With convenient traffic, the restaurant holds a high-end decoration, but quite noisy because a wedding ceremony was being held in the main hall. Impressed by its delicate decoration and grand appearance though, we had to wait for a while at the weekend time. However, considering its high price level, the taste is unexpected. We ordered the Kung Pao Prawn, the taste was acceptable and the serving size is enough, but the shrimp is not fresh. In terms of service, you could not expect too much due to the massive customers there. By the way, the free-served fruit cup was nice. Generally speaking, it was a typical wedding banquet restaurant rather than a comfortable place to date with friends.

(Original Chinese: 交通还挺方便的,环境看起来很高大上的样子,但是因为主厅在举办婚礼非常混乱,特别吵感觉,但是装修的还不错,感觉很精致的装修,门面很气派,周末去的时候还需要等位。味道的话我觉得还可以但是跟价格比起来就很一般了,性价比挺低的,为了去吃宫保虾球的,但是我觉得也就那样吧虾不是特别新鲜,不过虾球很大,味道还行。服务的话由于人很多所以也顾不过来上菜的速度不快,但是有送水果杯还挺好吃的。总之就是典型的婚宴餐厅不是适合普通朋友吃饭的地方了。)

Aspect category annotations:

  Location#Transportation (交通方便): 1        Price#Discount (折扣力度): -
  Location#Downtown (位于商圈附近): -          Ambience#Decoration (装修): 1
  Location#Easy to find (是否容易寻找): -      Ambience#Noise (嘈杂情况): −1
  Service#Queue (排队时间): -                  Ambience#Space (就餐空间): 1
  Service#Hospitality (服务人员态度): -        Ambience#Sanitary (卫生情况): -
  Service#Parking (停车方便): -                Food#Portion (分量): 1
  Service#Timely (点菜/上菜速度): −1           Food#Taste (口味): 1
  Price#Level (价格水平): 0                    Food#Appearance (外观): -
  Price#Cost effective (性价比): −1            Food#Recommend (推荐程度): -
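To make the annotation scheme concrete, here is a sketch of how a Table 2-style example could be held in memory, together with the mask vector introduced in Section 4.1. The field names and the five-category subset are illustrative assumptions, not the dataset's released schema.

```python
# Hypothetical in-memory form of one ASAP example (like Table 2); the field
# names and this 5-category subset are illustrative, not the released schema.
example = {
    "rating": 3,  # overall 1-5 star rating
    "aspects": {  # 1 = positive, 0 = neutral, -1 = negative; absent = not mentioned
        "Location#Transportation": 1,
        "Ambience#Noise": -1,
        "Price#Level": 0,
        "Food#Taste": 1,
    },
}

# Mask vector p from Section 4.1: p_i = 1 iff aspect category a_i is mentioned.
ASPECTS = ["Location#Transportation", "Ambience#Noise", "Price#Level",
           "Food#Taste", "Service#Queue"]  # N = 5 of the 18 categories
p = [1 if a in example["aspects"] else 0 for a in ASPECTS]

assert p == [1, 1, 1, 1, 0]
assert sum(p) == len(example["aspects"])  # sum_i p_i = K
```

The gate p_i is what the ACSA loss in Section 4.2 uses to exclude unmentioned categories.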
for NLP tasks. BERT-based models have achieved impressive results in ACSA (Xu et al., 2019; Sun et al., 2019; Jiang et al., 2019). Review rating prediction can be deemed a single-sentence classification (regression) task, which could also be addressed with BERT. Therefore, we propose a joint learning model to address ACSA and RP in a multi-task learning manner. Our joint model employs the fine-to-coarse semantic representation capability of the BERT encoder. Figure 4 illustrates the framework of our joint model.

ACSA As shown in Figure 4, the token embeddings of the input review are generated through a shared BERT encoder. Briefly, let H ∈ R^{d×Z} be the matrix consisting of the token embedding vectors {h_1, ..., h_Z} that BERT produces, where d is the size of the hidden layers and Z is the length of the given review. Since different aspect category information is dispersed across the content of R, we add an attention-pooling layer (Wang et al., 2016) to aggregate the related token embeddings dynamically for every aspect category. The attention-pooling layer helps the model focus on the tokens most related to the target aspect categories.

    M_i^a = tanh(W_i^a · H)                    (1)

    α_i = softmax(ω_i^T · M_i^a)               (2)

    r_i = tanh(W_i^p · H · α_i^T)              (3)

where W_i^a ∈ R^{d×d}, M_i^a ∈ R^{d×Z}, ω_i ∈ R^d, α_i ∈ R^Z, W_i^p ∈ R^{d×d}, and r_i ∈ R^d. α_i is a vector of attention weights over all tokens, which selectively attends to the regions related to the aspect category, and r_i is the attentive representation of the review with respect to the i-th aspect category a_i, i ∈ {1, 2, ..., N}. Then we have

    ŷ_i = softmax(W_i^q · r_i + b_i^q)         (4)
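As a concrete illustration, the attention-pooling of Eqs. (1)-(3) for a single aspect category can be sketched in plain Python as follows. The dimensions are toy values, and the trailing W_i^p projection and tanh of Eq. (3) are omitted, so this is a simplified sketch rather than the paper's exact layer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(H, W_a, w_att):
    """Simplified sketch of Eqs. (1)-(3) for one aspect category.
    H: Z token embeddings (each a list of d floats; rows = tokens here).
    W_a: d x d projection (W_i^a); w_att: d-dim scoring vector (omega_i).
    The trailing W_i^p projection and tanh of Eq. (3) are omitted."""
    d = len(H[0])
    # Eq. (1): one column of M_i^a per token, m_t = tanh(W_a · h_t)
    M = [[math.tanh(sum(W_a[r][c] * h[c] for c in range(d))) for r in range(d)]
         for h in H]
    # Eq. (2): alpha_i = softmax(omega_i^T · M_i^a), one weight per token
    alpha = softmax([sum(w * m_r for w, m_r in zip(w_att, m)) for m in M])
    # Eq. (3), simplified: r_i = H · alpha_i^T (attention-weighted token sum)
    r = [sum(a * h[c] for a, h in zip(alpha, H)) for c in range(d)]
    return alpha, r

# Toy check: Z = 3 tokens, d = 2, identity projection.
alpha, r = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                          [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
assert abs(sum(alpha) - 1.0) < 1e-9  # attention weights form a distribution
```

The pooled vector r always lies in the convex hull of the token embeddings, which is what lets each aspect category "pick out" its own supporting tokens.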
Table 3: The statistics and label/rating distribution of ASAP and RESTAURANT. Review lengths are counted in Chinese characters and English words respectively. Sentences are segmented by periods in ASAP, while RESTAURANT is a sentence-level dataset.

  Dataset      Split   Reviews   Avg. sentences   Avg. aspects   Avg.     Positive   Negative   Neutral   1-star   2-star   3-star   4-star   5-star
                                 per review       per review     length
  ASAP         Train    36,850        8.6              5.8       319.7    133,721     27,425    52,225    1,219    1,258    5,241    13,362   15,770
  ASAP         Dev       4,940        8.7              5.9       319.9     18,176      3,733     7,192      151      166      784     1,734    2,105
  ASAP         Test      4,940        8.3              5.7       317.1     17,523      3,813     7,026      165      173      717     1,867    2,018
  RESTAURANT   Train     2,855        1                1.2        15.2      2,150        822       498        -        -        -        -        -
  RESTAURANT   Test        749        1                1.3        15.6        645        215        94        -        -        -        -        -
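Statistics like those in Table 3 can be recomputed from raw examples. The sketch below assumes a hypothetical record layout (a `text` string and an `aspects` list, not the released format) and follows the caption's conventions: periods delimit sentences and length is counted in characters.

```python
# Sketch of deriving Table 3-style statistics; the record layout is a
# hypothetical assumption, not the dataset's released format.
reviews = [
    {"text": "交通还挺方便的。装修不错。",
     "aspects": ["Location#Transportation", "Ambience#Decoration"]},
    {"text": "味道还行。", "aspects": ["Food#Taste"]},
]

def split_stats(reviews):
    n = len(reviews)
    # ASAP segments sentences with periods and counts length in characters.
    avg_sentences = sum(len([s for s in r["text"].split("。") if s])
                        for r in reviews) / n
    avg_aspects = sum(len(r["aspects"]) for r in reviews) / n
    avg_length = sum(len(r["text"]) for r in reviews) / n
    return avg_sentences, avg_aspects, avg_length

avg_s, avg_a, avg_len = split_stats(reviews)
assert (avg_s, avg_a) == (1.5, 1.5)
```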

Figure 4: The framework of the proposed joint learning model. The right part of the dotted vertical line is used to predict multiple aspect category sentiment polarities, while the left part is used to predict the review rating. (Figure omitted; it depicts the input tokens [CLS], w_1, ..., w_Z fed to a shared BERT encoder that produces h_[cls], h_1, ..., h_Z; per-aspect attention weights α_1, ..., α_N yield representations r_1, ..., r_N feeding N softmax layers for y_1, ..., y_N, while h_[cls] feeds a dense layer for g.)

where W_i^q ∈ R^{C×d} and b_i^q ∈ R^C are the trainable parameters of the softmax layer, and C is the number of labels (i.e., 3 in our task). Hence, the ACSA loss for a given review R is defined as follows:

    loss_ACSA = -(1/K) · Σ_{i=1}^{N} Σ_{c=1}^{C} p_i · y_{i,c} · log ŷ_{i,c}        (5)

If the aspect category a_i is not mentioned in R, y_i is set to a random value. p_i serves as a gate function, which filters out the random y_i and ensures that only the mentioned aspect categories participate in the calculation of the loss function.

Rating Prediction Since the objective of RP is to predict the review rating based on the review content, we adopt the [CLS] embedding h_[cls] ∈ R^d that BERT produces as the representation of the input review, where d is the size of the hidden layers in the BERT encoder.

    ĝ = β^T · tanh(W^r · h_[cls] + b^r)        (6)

where W^r ∈ R^{d×d}, b^r ∈ R^d, and β ∈ R^d are trainable parameters. Hence the RP loss for a given review R is defined as follows:

    loss_RP = |g − ĝ|                           (7)

The final loss of our joint model is:

    loss = loss_ACSA + loss_RP                  (8)

5   Experiments

We perform an extensive set of experiments to evaluate the performance of our joint model on ASAP and RESTAURANT. Ablation studies are also conducted to probe the interactive influence between ACSA and RP.

5.1   ACSA

Baseline Models We implement several ACSA baselines for comparison. According to the structures of their encoders, these models are classified into non-BERT-based and BERT-based models. Non-BERT-based models include TextCNN (Kim, 2014), BiLSTM+Attn (Zhou et al., 2016), ATAE-LSTM (Wang et al., 2016) and CapsNet (Sabour et al., 2017). BERT-based models include vanilla BERT (Devlin et al., 2018), QA-BERT (Sun et al., 2019) and CapsNet-BERT (Jiang et al., 2019). Due to the page limit, the implementation details of all experimental models are presented in Appendix A.3.

Evaluation Metrics Following the settings of RESTAURANT (Pontiki et al., 2014), we adopt Macro-F1 and Accuracy (Acc) as evaluation metrics.

Experimental Results & Analysis We report the performance of the aforementioned models on ASAP and RESTAURANT in Table 4. Generally, BERT-based models outperform non-BERT-based models on both datasets. The two variants of our joint model perform better than vanilla-BERT, QA-BERT, and CapsNet-BERT, which demonstrates the advantages of our joint learning model. Given a user review, vanilla-BERT, QA-BERT, and CapsNet-BERT treat the pre-defined aspect categories independently, while our joint model combines them
Table 4: The experimental results of ACSA models on ASAP and RESTAURANT. Best scores are boldfaced.

  Category                 Model                                      ASAP                 RESTAURANT
                                                                Macro-F1     Acc.      Macro-F1     Acc.
  Non-BERT-based models    TextCNN (Kim, 2014)                   60.41%    71.10%       70.56%    82.29%
                           BiLSTM+Attn (Zhou et al., 2016)       70.53%    77.78%       70.85%    81.97%
                           ATAE-LSTM (Wang et al., 2016)         76.60%    81.94%       70.15%    82.12%
                           CapsNet (Sabour et al., 2017)         75.54%    81.66%       71.84%    82.63%
  BERT-based models        Vanilla-BERT (Devlin et al., 2018)    79.18%    84.09%       79.22%    87.63%
                           QA-BERT (Sun et al., 2019)            79.44%    83.92%       80.89%    88.89%
                           CapsNet-BERT (Jiang et al., 2019)     78.92%    83.74%       80.94%    89.00%
                           Joint Model (w/o RP)                  80.75%    85.15%       82.01%    89.62%
                           Joint Model                           80.78%    85.19%          -         -
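The Accuracy and Macro-F1 columns in Table 4 can be computed from per-aspect predictions with a short sketch like the following; the gold/predicted labels below are toy values for illustration, not the actual model outputs.

```python
def accuracy(golds, preds):
    """Fraction of (mentioned) aspect-category predictions that are correct."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def macro_f1(golds, preds, labels=("Positive", "Neutral", "Negative")):
    """Unweighted mean of per-class F1 over the three polarity labels."""
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(golds, preds))
        fp = sum(g != lab and p == lab for g, p in zip(golds, preds))
        fn = sum(g == lab and p != lab for g, p in zip(golds, preds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels: three of four predictions correct.
golds = ["Positive", "Negative", "Neutral", "Positive"]
preds = ["Positive", "Negative", "Positive", "Positive"]
assert accuracy(golds, preds) == 0.75
```

Macro averaging weights the rarer Neutral class as heavily as Positive and Negative, which is why Macro-F1 runs well below Accuracy in Table 4.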
together with a multi-task learning framework. On one hand, the encoder-sharing setting enables knowledge transfer among different aspect categories. On the other hand, our joint model is more efficient than the other competitors, especially when the number of aspect categories is large. The ablation of RP (i.e., Joint Model (w/o RP)) still outperforms all other baselines; the introduction of RP to ACSA brings only marginal improvement. This is reasonable considering that the essential objective of RP is to estimate the overall sentiment polarity rather than fine-grained sentiment polarities. We also visualize the attention weights produced by our joint model on the example of Table 2 in Appendix A.4 for reference.

5.2   Rating Prediction

We compare several RP models on ASAP, including TextCNN (Kim, 2014), BiLSTM+Attn (Zhou et al., 2016) and ARP (Wu et al., 2019b). ARP is the state-of-the-art aspect-aware rating prediction model in the literature. The data pre-processing and implementation details are identical to those of the ACSA experiments.

Evaluation Metrics We adopt Mean Absolute Error (MAE) and Accuracy (by mapping the predicted rating score to the nearest category) as evaluation metrics.

Experimental Results & Analysis The experimental results of the comparative RP models are presented in Table 5.

Table 5: Experimental results of RP models on ASAP. Best scores are boldfaced.

  Model                               MAE      Acc.
  TextCNN (Kim, 2014)                .5814    52.99%
  BiLSTM+Attn (Zhou et al., 2016)    .5737    54.38%
  ARP (Wu et al., 2019b)             .5620    54.76%
  Joint Model (w/o ACSA)             .4421    60.08%
  Joint Model                        .4266    61.26%

Our joint model, which combines ACSA and RP, outperforms the other models considerably. On one hand, the performance improvement is expected since our joint model is built upon BERT. On the other hand, the ablation of ACSA (i.e., Joint Model (w/o ACSA)) degrades RP performance on both metrics. We can conclude that the fine-grained aspect category sentiment prediction of a review indeed helps the model predict its overall rating more accurately.

This section conducted preliminary experiments to evaluate classical ACSA and RP models on our proposed ASAP dataset. We believe there still exists much room for improvement on both tasks, which we leave for future work.

6   Conclusion

This paper presents ASAP, a large-scale Chinese restaurant review dataset for aspect category sentiment analysis (ACSA) and rating prediction (RP). ASAP consists of 46,730 restaurant user reviews with star ratings from a leading online-to-offline (O2O) e-commerce platform in China. Each review is manually annotated with its sentiment polarities on 18 fine-grained aspect categories, such as Food#Taste, Ambience#Noise and Service#Hospitality. Both ACSA and RP are of great value for business intelligence in e-commerce, yet high-quality Chinese datasets for these tasks are surprisingly scarce. We hope the release of ASAP can push forward related research and applications. Although ASAP can serve as an evaluation benchmark for ACSA and RP separately, its unique advantage is that each review holds both fine-grained aspect-based sentiment polarity annotations and an overall user rating. Besides evaluating ACSA and RP models on ASAP separately, we also propose a joint model that addresses ACSA and RP together and outperforms other state-of-the-art baselines considerably.
References

Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference, pages 1583–1592.

Zhiyong Cheng, Ying Ding, Lei Zhu, and Mohan Kankanhalli. 2018. Aspect-aware latent factor model: Rating prediction with ratings and reviews. In Proceedings of the 2018 World Wide Web Conference, pages 639–648.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Gayatree Ganu, Noemie Elhadad, and Amélie Marian. 2009. Beyond the stars: Improving rating predictions using review text content. In WebDB, volume 9, pages 1–6. Citeseer.

Mengting Hu, Shiwan Zhao, Li Zhang, Keke Cai, Zhong Su, Renhong Cheng, and Xiaowei Shen. 2018. CAN: Constrained attention networks for multi-aspect sentiment analysis. CoRR, abs/1812.10735.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.

Qingnan Jiang, Lei Chen, Ruifeng Xu, Xiang Ao, and Min Yang. 2019. A challenge dataset and effective models for aspect-based sentiment analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6281–6286.

Zhipeng Jin, Qiudan Li, Daniel D. Zeng, YongCheng Zhan, Ruoran Liu, Lei Wang, and Hongyuan Ma. 2016. Jointly modeling review content and aspect ratings for review rating prediction. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 893–896.

Yohan Jo and Alice H. Oh. 2011. Aspect and sentiment unification model for online review analysis. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 815–824. ACM.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Svetlana Kiritchenko, Xiaodan Zhu, Colin Cherry, and Saif Mohammad. 2014. NRC-Canada-2014: Detecting aspects and sentiment in customer reviews. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 437–442.

Changliang Li, Bo Xu, Gaowei Wu, Saike He, Guanhua Tian, and Hongwei Hao. 2014. Recursive deep learning for sentiment analysis over social data. In 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), volume 2, pages 180–185. IEEE.

Fangtao Li, Nathan Nan Liu, Hongwei Jin, Kai Zhao, Qiang Yang, and Xiaoyan Zhu. 2011. Incorporating reviewer and product information for review rating prediction. In Twenty-Second International Joint Conference on Artificial Intelligence.

Junjie Li, Haitong Yang, and Chengqing Zong. 2018. Document-level multi-aspect sentiment classification by jointly modeling users, aspects, and overall ratings. In Proceedings of the 27th International Conference on Computational Linguistics, pages 925–936, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017. Interactive attention networks for aspect-level sentiment classification. In International Joint Conference on Artificial Intelligence, pages 4068–4074.

Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165–172.

Sajad Movahedi, Erfan Ghadery, Heshaam Faili, and Azadeh Shakery. 2019. Aspect category detection via topic-attention network.

Haiyun Peng, Erik Cambria, and Amir Hussain. 2017. A review of sentiment analysis research in Chinese language. Cognitive Computation, 9(4):423–435.

Maria Pontiki, Dimitrios Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In International Workshop on Semantic Evaluation (SemEval 2015).

Maria Pontiki, Dimitrios Galanis, John Pavlopoulos, Haris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In International Workshop on Semantic Evaluation (SemEval 2014).

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, AL-Smadi Mohammad, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, et al. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30.

Sebastian Ruder, Parsa Ghaffari, and John G. Breslin. 2016. A hierarchical model of reviews for aspect-based sentiment analysis.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866.

Chi Sun, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. arXiv preprint arXiv:1903.09588.

Songbo Tan and Jin Zhang. 2008. An empirical study of sentiment analysis for Chinese documents. Expert Systems with Applications, 34(4):2622–2629.

Yequan Wang, Minlie Huang, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Empirical Methods in Natural Language Processing, pages 606–615.

Chuhan Wu, Fangzhao Wu, Junxin Liu, Yongfeng Huang, and Xing Xie. 2019a. ARP: Aspect-aware neural review rating prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2169–2172.

Chuhan Wu, Fangzhao Wu, Junxin Liu, Yongfeng Huang, and Xing Xie. 2019b. ARP: Aspect-aware neural review rating prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2169–2172.

Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. arXiv preprint arXiv:1904.02232.

Taras Zagibalov and John A. Carroll. 2008. Unsupervised classification of sentiment and objectivity in Chinese text. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume I.

Yanyan Zhao, Bing Qin, and Ting Liu. 2014. Creating a fine-grained corpus for Chinese sentiment analysis. IEEE Intelligent Systems, 30(1):36–43.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212.

Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2015. Representation learning for aspect category detection in online reviews. In Twenty-Ninth AAAI Conference on Artificial Intelligence.