A Multimodal Sentiment Dataset for Video Recommendation


Hongxuan Tang, Hao Liu, Xinyan Xiao, Hua Wu
Baidu Inc., Beijing, China
{tanghongxuan, liuhao24, xiaoxinyan, wu_hua}@Baidu.com

arXiv:2109.08333v1 [cs.CL] 17 Sep 2021

Abstract

Recently, multimodal sentiment analysis has seen remarkable advances, and many datasets have been proposed for its development. In general, current multimodal sentiment analysis datasets follow the traditional sentiment/emotion system, such as positive, negative and so on. However, when applied to the scenario of video recommendation, the traditional sentiment/emotion system is hard to leverage to represent the different contents of videos from the perspectives of visual senses and language understanding. Based on this, we propose a multimodal sentiment analysis dataset, named baiDu Video Sentiment dataset (DuVideoSenti), and introduce a new sentiment system designed to describe the sentimental style of a video in the recommendation scenario. Specifically, DuVideoSenti consists of 5,630 videos displayed on Baidu¹; each video is manually annotated with a sentimental-style label that describes the user's real feeling about the video. Furthermore, we propose UNIMO (Li et al., 2020) as our baseline for DuVideoSenti. Experimental results show that DuVideoSenti brings new challenges to multimodal sentiment analysis and could be used as a new benchmark for evaluating approaches designed for video understanding and multimodal fusion. We also expect that DuVideoSenti could further advance the development of multimodal sentiment analysis and its application to video recommendation.

¹ One of the most popular applications in China, which features both information retrieval and news recommendation.

1   Introduction

Sentiment analysis is an important research area in Natural Language Processing (NLP) with wide applications, such as opinion mining, dialogue generation and recommendation. Previous work (Liu and Zhang, 2012; Tian et al., 2020) mainly focused on text sentiment analysis and achieved promising results. Recently, with the development of short video applications, multimodal sentiment analysis has received more attention (Tsai et al., 2019; Li et al., 2020; Yu et al., 2021), and many datasets (Li et al., 2017; Poria et al., 2018; Yu et al., 2020) have been proposed to advance its development. However, current multimodal sentiment analysis datasets usually follow the traditional sentiment system (positive, neutral and negative) or emotion system (happy, sad, surprise and so on), which is far from satisfactory, especially for the video recommendation scenario.

In the scenario of industrial video recommendation, the content-based recommendation method (Pazzani and Billsus, 2007) is widely used because of its advantages in improving the explainability of recommendation and conducting effective online interventions. Specifically, the videos are first represented by certain tags, which are automatically predicted by neural network models (Wu et al., 2015, 2016; Rehman and Belhaouari, 2021). Then, a user's profile is constructed by gathering the tags from his/her watched videos. Finally, the system recommends candidate videos based on the relevance between each candidate video's tags and the user's profile. In general, the types of tags are further divided into topic-level and entity-level. As shown in Figure 1, videos about someone's trip to The Palace Museum may relate to topic-level tags such as “Tourism” and entity-level tags such as “The Palace Museum”. In real applications, however, the topic-level and entity-level tags are not sufficient to summarize the content of a video comprehensively. The reason is that videos which share the same topics or contain similar entities usually give different visual and sentimental perceptions to a user. For example, the first video brings us a feeling of great momentum, while the second video brings us a feeling of
Figure 1: Two videos displayed on Baidu that record their authors' trips to The Palace Museum. Although
the two videos share the same topic-level and entity-level tags, such as “Tourism” and “The Palace Museum”, they
give different visual and sentimental perceptions to the audience. Specifically, the first video gives a feeling of
great momentum, while the other one feels generic.
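
To make the caption's point concrete, the following is a minimal sketch of tag-based matching. It assumes a Jaccard overlap between a candidate video's tags and a user profile built as the union of the tags of watched videos; the paper does not specify the relevance function, so `build_profile`, `jaccard`, and every tag value below are hypothetical. The `style:` tags are placeholders for illustration, not labels from the dataset.

```python
from typing import Iterable, Set


def build_profile(watched: Iterable[Set[str]]) -> Set[str]:
    """Gather the tags of all watched videos into one user profile."""
    profile: Set[str] = set()
    for tags in watched:
        profile |= tags
    return profile


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two tag sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical user history and two candidate videos with identical
# topic-level and entity-level tags, as in Figure 1.
profile = build_profile([{"Tourism", "Beijing"}, {"The Palace Museum"}])
video_a = {"Tourism", "The Palace Museum"}
video_b = {"Tourism", "The Palace Museum"}

# With topic and entity tags only, the two videos are indistinguishable.
score_a = jaccard(video_a, profile)
score_b = jaccard(video_b, profile)

# Adding a (placeholder) style tag to the videos and the profile
# separates the two candidates.
styled_profile = profile | {"style:A"}
styled_a = jaccard(video_a | {"style:A"}, styled_profile)
styled_b = jaccard(video_b | {"style:B"}, styled_profile)
```

Under these assumptions `score_a` equals `score_b`, while `styled_a` exceeds `styled_b`, which is the kind of distinction a sentimental-style tag is meant to provide.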

being generic, although they share the same topics.

Based on this, we propose a multimodal sentiment analysis dataset, named baiDu Video Sentiment dataset (DuVideoSenti), and construct a new sentimental-style system, which consists of eight sentimental-style tags. These sentimental-style tags are designed to describe a user's real feeling after he/she has watched the video.

In detail, we collect 5,630 videos displayed on Baidu, and each video is then manually annotated with a sentimental-style label selected from the above sentimental-style system. In addition, we propose a baseline for our DuVideoSenti corpus, which is based on UNIMO. We expect that the DuVideoSenti corpus could further advance the development of multimodal sentiment analysis and its application to video recommendation.

2     Related Work

In this section, we briefly introduce previous work on multimodal datasets and multimodal sentiment analysis.

2.1    Multimodal Datasets

To promote the development of multimodal sentiment analysis and emotion detection, a variety of multimodal datasets have been proposed, including IEMOCAP (Busso et al., 2008), YouTube (Morency et al., 2011), MOUD (Pérez-Rosas et al., 2013), ICT-MMMO (Wöllmer et al., 2013), MOSI (Zadeh et al., 2016), CMU-MOSEI (Zadeh et al., 2018b), CHEAVD (Li et al., 2017), Meld (Poria et al., 2018), CH-SIMS (Yu et al., 2020) and so on. As mentioned above, current multimodal datasets follow the traditional sentiment/emotion system, which does not fit the scenario of video recommendation. In this paper, our DuVideoSenti dataset defines a new sentimental-style system which is customized for video recommendation.

2.2    Multimodal Sentiment Analysis

In multimodal sentiment analysis, intra-modal representation and inter-modal fusion are two essential and challenging subtasks. For intra-modal representation, previous work pays attention to the temporal and spatial characteristics of different modalities. The Convolutional Neural Network (CNN), Long Short-term Memory (LSTM) and Deep Neural Network (DNN) are representative approaches for extracting multimodal features (Cambria et al., 2017; Zadeh et al., 2017, 2018a). For inter-modal fusion, numerous methods have been proposed recently, for example concatenation (Cambria et al., 2017), Tensor Fusion Network (TFN) (Zadeh et al., 2017), Low-rank Multimodal Fusion (LMF) (Liu et al., 2018), Memory Fusion Network (MFN) (Zadeh et al., 2018a) and Dynamic Fusion Graph (DFG) (Zadeh et al., 2018b). We follow the recent trend of pretraining and select UNIMO (Li et al., 2020) to build our baselines.

3     Data Construction

In this section, we introduce the construction details of our DuVideoSenti dataset. Specifically, DuVideoSenti contains 5,630 video examples, all of which are selected from Baidu. To ensure data quality, all sentiment labels are annotated by experts. Finally, we release DuVideoSenti according to the data structure shown in Table 1. Table 1 shows a real example from our DuVideoSenti dataset, which contains multiple regions such as url, title, video features and the corresponding sentiment label. For the video
Region     Example
url        http://quanmin.baidu.com/sv?source=share-h5&pd=qm_share_search&
title      那小孩不让我吃口香糖. . . (Translated: That kid won't let me chew gum. . .)
label      呆萌可爱 (Cute)
...        ...

Table 1: Data structure of the released dataset. (Note that we do not provide the original frame images
in our dataset, in order to protect copyright. Instead, we provide the visual features extracted from each frame
image and the corresponding url for watching the video online.)
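
A record with the regions listed in Table 1 could be represented as follows. This is only a sketch: the field names mirror the table, but the `VideoRecord` class, the `is_valid` helper, the in-memory layout of the visual features (frames × regions × dimensions), the toy feature dimension of 8, and the example URL are assumptions for illustration, not the released file format. The four frames and 100 regions per frame follow the numbers stated in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The eleven sentimental-style labels listed in the paper.
LABELS = [
    "文艺清新", "时尚炫酷", "舒适温馨", "客观理性", "家长里短", "土里土气",
    "呆萌可爱", "奇葩低俗", "正能量", "负能量", "其他",
]


@dataclass
class VideoRecord:
    url: str
    title: str
    label: str
    # Assumed layout: features[frame][region] is one feature vector.
    features: List[List[List[float]]]

    def shape(self) -> Tuple[int, int, int]:
        """Return (frames, regions per frame, feature dimension)."""
        frames = len(self.features)
        regions = len(self.features[0]) if frames else 0
        dim = len(self.features[0][0]) if regions else 0
        return frames, regions, dim


def is_valid(record: VideoRecord) -> bool:
    """Check that the label is known and the features are rectangular."""
    frames, regions, dim = record.shape()
    if frames == 0 or regions == 0 or dim == 0:
        return False
    rectangular = all(
        len(frame) == regions and all(len(vec) == dim for vec in frame)
        for frame in record.features
    )
    return record.label in LABELS and rectangular


example = VideoRecord(
    url="http://example.com/video",  # hypothetical placeholder URL
    title="那小孩不让我吃口香糖",
    label="呆萌可爱",
    # 4 frames x 100 regions; the dimension of 8 is a toy value.
    features=[[[0.0] * 8 for _ in range(100)] for _ in range(4)],
)
```

Here `example.shape()` gives the frame, region, and dimension counts, and `is_valid(example)` checks the record before it is passed to a model.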

feature extraction, we first sample four frames from the given video at equal time intervals. Second, we use Faster R-CNN (Ren et al., 2015) to detect the salient image regions and extract the visual features (pooled ROI features) for each region, following Chen et al. (2020) and Li et al. (2020). Specifically, the number of image-region features is set to 100.

In addition, we randomly split DuVideoSenti into a training set and a test set, of sizes 4,500 and 1,130 respectively. The sentimental-style labels are as follows: “文艺清新 (Hipsterism)”, “时尚炫酷 (Fashion)”, “舒适温馨 (Warm and Sweet)”, “客观理性 (Objective and Rationality)”, “家长里短 (Daily)”, “土里土气 (Old-fashion)”, “呆萌可爱 (Cute)”, “奇葩低俗 (Vulgar)”, “正能量 (Positive Energy)”, “负能量 (Negative Energy)” and “其他 (Other)”. The class distribution and examples of each sentiment label in DuVideoSenti are shown in Table 2.

Label                             Example                                                              #
  文艺清新                              令箭荷花的开放过程
  Hipsterism                        Translated: The opening process of Nopalxochia
  时尚炫酷                              古驰中国新年特别系列
  Fashion                           Translated: Gucci Chinese New Year limited edition
  舒适温馨                              美好的一天,从有爱的简笔画开始
  Warm and Sweet                    Translated: A beautiful day, start with simple strokes of love
  客观理性                              吹气球太累?是你没找对方法!
  Objective and Rationality         Translated: Tired of blowing balloons? Try this method.
  家长里短                              今天太阳炎热夏天以到来
  Daily                             Translated: It’s very hot today, summer is coming
  土里土气                              广场舞,梨花飞情人泪32步
  Old-fashion                       Translated: The 32 steps of public square dancing
  呆萌可爱                              #创意简笔画#可爱小猫咪怎么画?
  Cute                              Translated: #Creative stick figure#How to draw a cute kitten?
  奇葩低俗                              撒网是我的本事,入网是你的荣幸
  Vulgar                            Translated: It is your honour to love me
  正能量                               山东齐鲁医院130名医护人员集体出征
  Positive Energy                   Translated: Shandong Qilu Hospital 130 medical staff set out
  负能量                               黑社会被打屁股
  Negative Energy                   Translated: The underworld spanked
  其他                                速记英语,真的很快记住
  Other                             Translated: English shorthand, really quick to remember
  Total                                                                                                5,630

Table 2: Distribution and examples for different sentiment classes in DuVideoSenti.

4     Experiments

4.1    Experiment Setting

In our experiments, we select UNIMO-large, which consists of 24 layers of Transformer blocks, as our baseline. The maximum sequence lengths of text tokens and image-region features are set to 512 and 100, respectively. The learning rate is set to 1e-5 and the batch size is 8. We set the number of epochs to 20. All experiments are conducted on one Tesla V100 GPU. We select accuracy to evaluate the performance of the baseline systems.

4.2   Experimental Results

Table 3 shows the baseline performance on our dataset. Specifically, the UNIMO-large model achieves an accuracy of 54.56% on the test set. This suggests that simply using a strong and widely used model fails to obtain promising results, which calls for further development of multimodal sentiment analysis, especially for videos.

We are also interested in the effect of the number of frames from which image-region features are extracted. We further test the performance of our baseline system when {1, 4, 20} frames are selected. The results are listed in Table 3, which shows that the overall performance decreases with fewer frames. This suggests that improving the visual representation can further improve classification performance.

                 1 frame     4 frames   20 frames
   文艺清新           50.91       52.88        45.19
   时尚炫酷            2.56       28.20        15.38
   舒适温馨           27.85       27.83        37.11
   客观理性           62.65       67.91        65.14
   家长里短           64.41       67.69        74.90
   土里土气            6.66        0.00         6.66
   呆萌可爱           60.95       39.04        68.57
   奇葩低俗           29.82       33.33        10.52
   正能量             5.88       41.17        47.05
   负能量             0.00        0.00         0.00
   其他              57.75       59.89        56.14
   All            52.65       54.56        56.46

Table 3: UNIMO baseline accuracy (%) with different numbers of key frames.

Finally, we evaluate the improvements brought by multimodal fusion. We compare the accuracy of systems which use visual, textual, and multimodal information respectively. The results are shown in Table 4, which shows that fusing the visual and textual information of the videos obtains the best overall performance. Furthermore, the visual-only model performs better than the textual-only one; we suspect that this is because our sentimental-style system is more closely related to users' visual feelings after watching the video.

                  Visual       Textual    Multi
   文艺清新           52.88         32.69    52.88
   时尚炫酷           17.94         17.94    28.20
   舒适温馨           18.55         27.83    27.83
   客观理性           66.39         56.84    67.91
   家长里短           61.79         65.54    67.69
   土里土气            6.66          0.00     0.00
   呆萌可爱           49.52         49.52    39.04
   奇葩低俗           35.08         28.07    33.33
   正能量            11.76         47.05    41.17
   负能量             0.00          0.00     0.00
   其他             56.68         41.71    59.89
   All            51.85         47.25    54.56

Table 4: Baseline accuracy (%) based on image and text.

5   Conclusion

In this paper, we propose a new multimodal sentiment analysis dataset named DuVideoSenti, which is designed for the scenario of video recommendation. Furthermore, we propose UNIMO as our baseline and test the accuracy of the baseline under a variety of settings. We expect that our DuVideoSenti dataset could further advance the development of multimodal sentiment analysis and promote its application to video recommendation.

References

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335–359.

Erik Cambria, Devamanyu Hazarika, Soujanya Poria, Amir Hussain, and RBV Subramanyam. 2017. Benchmarking multimodal sentiment analysis. In International Conference on Computational Linguistics and Intelligent Text Processing, pages 166–179.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer.

Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2020. Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409.

Ya Li, Jianhua Tao, Linlin Chao, Wei Bao, and Yazhu Liu. 2017. Cheavd: a chinese natural emotional audio–visual database. Journal of Ambient Intelligence and Humanized Computing, 8(6):913–924.

Bing Liu and Lei Zhang. 2012. A survey of opinion mining and sentiment analysis. In Mining text data, pages 415–463. Springer.

Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064.

Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th international conference on multimodal interfaces, pages 169–176.

Michael J Pazzani and Daniel Billsus. 2007. Content-based recommendation systems. In The adaptive web, pages 325–341. Springer.

Verónica Pérez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 973–982.

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2018. Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508.

Atiq Rehman and Samir Brahim Belhaouari. 2021. Deep learning for video classification: A review.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28:91–99.

Hao Tian, Can Gao, Xinyan Xiao, Hao Liu, Bolei He, Hua Wu, Haifeng Wang, and Feng Wu. 2020. Skep: Sentiment knowledge enhanced pre-training for sentiment analysis. arXiv preprint arXiv:2005.05635.

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2019, page 6558. NIH Public Access.

Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. Youtube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53.

Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. 2016. Multi-stream multi-class fusion of deep networks for video classification. In Proceedings of the 24th ACM international conference on Multimedia, pages 791–800.

Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, Xiangyang Xue, and Jun Wang. 2015. Fusing multi-stream deep networks for video classification. arXiv preprint arXiv:1509.06086.

Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. 2020. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3718–3727.

Wenmeng Yu, Hua Xu, Ziqi Yuan, and Jiele Wu. 2021. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. arXiv preprint arXiv:2102.04830.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.

Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint

AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018b. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246.