Crowdsourcing-based Copyright Infringement Detection in Live Video Streams
Daniel (Yue) Zhang, Qi Li, Herman Tong, Jose Badilla, Yang Zhang, Dong Wang
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN, USA
{yzhang40, qli8, ktong1, jbadilla, yzhang42, dwang5}@nd.edu

Abstract—With the increasing popularity of online video sharing platforms (such as YouTube and Twitch), the detection of content that infringes copyright has emerged as a new critical problem in online social media. In contrast to the traditional copyright detection problem that studies static content (e.g., music, films, digital documents), this paper focuses on a much more challenging problem: one in which the content of interest is from live videos. We found that state-of-the-art commercial copyright infringement detection systems, such as ContentID from YouTube, do not solve this problem well: large amounts of copyright-infringing videos bypass the detector while many legal videos are taken down by mistake. In addressing the copyright infringement detection problem for live videos, we identify several critical challenges: i) live streams are generated in real-time and the original copyrighted content from the owner may not be accessible; ii) streamers are getting more and more sophisticated in bypassing the copyright detection system (e.g., by modifying the title or tweaking the presentation of the video); iii) similar video descriptions and visual contents make it difficult to distinguish between legal streams and copyright-infringing ones. In this paper, we develop a crowdsourcing-based copyright infringement detection (CCID) scheme to address the above challenges by exploring a rich set of valuable clues from live chat messages. We evaluate CCID on two real-world live video datasets collected from YouTube. The results show our scheme is significantly more effective and efficient than ContentID in detecting copyright-infringing live videos on YouTube.

I. INTRODUCTION

It has been a recent phenomenon that online social media platforms (e.g., YouTube and Twitch) allow users to broadcast live videos to audiences worldwide. These video sharing platforms are fundamentally different from "static" content distribution platforms (e.g., Netflix and Hulu) because streams are generated and consumed in real time. The live videos (e.g., real-time game play, live music, and live TV shows) create great revenues for both live stream uploaders (often referred to as "streamers") and the video sharing platforms. For example, according to a recent survey, the live video market is estimated to grow from 30.29 billion US dollars in 2016 to more than 70 billion US dollars by 2021 [1]. With such incentives, YouTube attracted over 76,000 active streamers in March 2017 alone and has a projected growth of 330% new active streamers per month [2].

This prevalence of live stream platforms also opens the door to severe copyright infringement issues where users can stream copyrighted live events (e.g., TV shows, sport matches, pay-per-view programs) without the permission of content owners [3]. For example, a major anti-piracy agency claims 77 million people watched "Game of Thrones" season 7 episode 1 via unauthorized live videos, causing an estimated total of 45 million US dollars of revenue loss to HBO, the legal copyright owner of the series [4]. One of the main reasons for such serious copyright infringement is the grass-root nature of video sharing platforms: anyone can start a live video stream on the platform without going through a rigorous copyright screening process. This leaves room for "rogue accounts" to host illegal live streams.

Due to the increasing demand from copyright owners for blocking unauthorized video streams, the video sharing platforms have spent a significant amount of effort addressing the copyright infringement problem. One of the most representative copyright detection tools for live videos is ContentID, a proprietary system developed by YouTube to detect copyright-infringing video uploads. In ContentID, each uploaded video stream is compared against a database of files provided by content owners to check for copyright issues. ContentID also uses self-reports from content owners when they identify pirated videos [5]. Unfortunately, ContentID has received heated criticism from both video streamers and copyright owners due to its high false positives (i.e., falsely taking down legal streams) and false negatives (i.e., constantly missing copyright-infringing videos) (https://mashable.com/2017/08/16/youtube-live-streaming-copyrights/). In fact, our empirical study showed that ContentID failed to catch 26% of copyrighted videos after they had been broadcast for 30 minutes and shut down 22% of video streams that are not copyright-infringing.

Several alternative copyright protection techniques (e.g., digital fingerprinting [6] and forensic watermarking [7]) can help track down pirated content effectively. However, such solutions require the original copy of the copyrighted content in advance to extract unique video features or embedded identifiers (e.g., digital watermarks or serial numbers) for tracking. Therefore, they are often applied to static content (e.g., eBooks, music and films) and are not suitable for live videos that are generated in real-time [8]. Several tools have been developed to detect copyright-infringing content by examining the video content (referred to as "video copy detectors") [9]. However, they cannot be applied to our problem because many streamers are sophisticated enough to change the video presentation and make it look very different from the original one (see Figure 1). Therefore, a system that can effectively address the copyright detection problem of live video streams has yet to be developed.
Figure 1: Videos modified by sophisticated streamers that successfully bypassed the ContentID system from YouTube. (a) Video with Split Screen; (b) Video with Camouflage.

In this paper, we develop a novel Crowdsourcing-based Copyright Infringement Detection (CCID) scheme to capture copyright-infringing live videos. Our solution is motivated by the observation that the live chat messages from the online audience of a video can reveal important information about copyright infringement. For example, consider detecting copyrighted streams of an NBA match. If the audience of the live video are chatting about the current game status (e.g., "nice 3-pointer from Player A"), this video stream is likely to be copyright-infringing because the audience will only know these details of the game if the broadcast is real. Another interesting example is the "colluding behavior" of the audience and streamers. If a video stream is copyright-infringing, the audience sometimes colludes with the streamers by reminding them to change the title of the stream to bypass the platform's detection system. However, such colluding behavior actually serves as a "signal" that the stream has copyright issues. In this paper, CCID designs a novel detection system that explores the "clues" extracted from both the live chats of the audience and the meta-data of the videos (e.g., view counts, number of likes/dislikes). It develops a supervised learning scheme to effectively track down copyright infringement in live video streams.

To the best of our knowledge, the CCID scheme is the first crowdsourcing-based solution to address the copyright infringement issues for live videos in online social media. It is robust against sophisticated streamers who intentionally modify the description and presentation of the video, because CCID does not rely on the analysis of the actual content of the videos. Additionally, CCID performs the detection task on-the-fly without accessing the original copyrighted content. We evaluate the performance of CCID on two live stream video datasets collected from YouTube. The results show that our scheme is more accurate (achieving a 17% higher F1-Score) and more efficient (detecting 20% more copyright-infringing videos within 5 minutes after the videos start) than the ContentID tool from YouTube.

II. RELATED WORK

A. Copyright Protection

Due to the increasing popularity of online data sharing platforms, protecting copyrighted content has become a critical problem in recent years [7]. Various techniques have been proposed to protect copyrighted music, text documents and videos. For example, Podilchuk et al. developed a robust watermarking technique that can covertly embed owner information into a digital image without affecting the perceived visual quality of the original content [10]. Low et al. proposed a data hiding technique to protect copyrighted text documents by slightly shifting certain text lines and words from their original positions to create unique identifiers for the original content [11]. Waldfogel et al. developed a music copyright protection scheme based on the observation that the original music often has superior quality to unauthorized copies [12]. However, these techniques focus on static content and cannot be applied to live video streams where the content is generated in real-time.

B. Video Copy Detection

Video copy detection is one of the most commonly used techniques for detecting copyright infringement in video content. For example, Esmaeili et al. proposed a video copy detection system that compares the fingerprints (unique features extracted from the copyrighted content) of different videos to detect copyright issues [8]. Nie et al. developed a near-duplicate video detection framework by combining comprehensive image features using a tensor model [9]. Chou et al. proposed a spatial-temporal pattern based framework for efficient and effective detection of duplicate videos [13]. However, these methods all require access to the original copy of the video in advance, which is not practical for live video streams. More importantly, these content-based methods often fail when streamers are sophisticated enough to tweak the video presentation to bypass the detection system. In contrast, our scheme develops a crowdsourcing-based approach that relies on the chat messages from the audience and the video meta-data, which is independent of the video content.

C. Crowdsourcing in Online Social Media

Crowdsourcing-based techniques have been widely used in the analysis of online social media data. For example, Wang et al. developed a principled estimation framework that identifies credible information during disaster events by taking Twitter users as crowd sensors [14]. Schumaker et al. proposed a crowdsourcing-based model to predict the outcome of soccer games by analyzing the sentiments of the tweets related to the game [15]. Steiner et al. developed a generic crowdsourcing video annotation framework that invites users on YouTube to annotate the types of events and named entities of the videos they viewed [16]. Our work is different from the above schemes in the sense that it is the first crowdsourcing-based approach to address the copyright infringement detection problem of live videos on online social media.

III. PROBLEM STATEMENT

In this section, we present the copyright infringement detection problem of live video streams. In particular, we assume that a video hosting service has a set of live videos related to a piece of copyrighted content y, 1 ≤ y ≤ Y: V(y) = {V_1^y, V_2^y, ..., V_{N(y)}^y}, where N(y) denotes the total number of live videos related to y. A video V_i^y is associated with a tuple, i.e., V_i^y = (t_i^{start}, t_i^{end}, Meta_i^y, Chat_i^y, z_i^y), where t_i^{start} and t_i^{end} refer to the timestamps when the video starts and ends, Meta_i^y is the meta-data of the video (e.g., description, view count, likes/dislikes), Chat_i^y is the set of live chat messages for the video, and z_i^y is the ground truth label defined below:

• Copyright-Infringing (labeled as "True"): live videos that contain actual copyrighted content (e.g., a live broadcast of a football game; a live stream of the latest episode of "Game of Thrones").
• Non-Copyright-Infringing (labeled as "False"): videos that do not contain actual live copyrighted content.

An example of the above definitions is shown in Figure 2. We observe that all four pictures are related to a copyrighted NBA game and claim to broadcast the live events for free. However, only the last one (bottom-right) should actually be labeled as "True" (i.e., copyright-infringing) and the others should be labeled as "False". For example, the top-left video is a game-play video of an NBA 2K game. The top-right one is just a static image, and the bottom-left one is broadcasting an old recorded match.

Figure 2: Live Videos on YouTube

We make the following assumptions in our model.

Real-time Content: the content of the copyrighted material is assumed to be generated in real-time and the content of the video cannot be acquired in advance.

Sophisticated Streamers: we assume streamers are sophisticated and can manipulate the video descriptions and content to bypass the copyright infringement detection system.

Colluding Audience: we assume some of the audience can collude with streamers by reminding them of ways to cheat the copyright infringement detection system (e.g., change the title).

Given the above definitions and assumptions, the goal of copyright infringement detection is to classify each live video stream into one of the two categories (i.e., copyright-infringing or not) by leveraging the live chat messages and the meta-data of the videos. Formally, for every piece of copyrighted content y, 1 ≤ y ≤ Y, find:

    argmax_{z̃_i^y} Pr(z̃_i^y = z_i^y | Meta_i^y, Chat_i^y),   ∀ 1 ≤ i ≤ N(y)    (1)

where z̃_i^y denotes the estimated category label for V_i^y.

IV. APPROACH

In this section, we present the CCID framework to address the copyright infringement problem for live videos. It consists of four major components: i) a data collection component to obtain the live videos from online social media (e.g., YouTube); ii) a live chat feature extraction component to obtain features from the live chats to which the audience of the video contributes; iii) a metadata feature extraction component to extract features from the descriptive information of each video; iv) a supervised classification component to decide if the video stream is copyright-infringing or not. We discuss each component in detail below.

A. Obtaining Live Video Datasets

The data collection is challenging because: i) no existing live streaming video dataset is publicly available with chat content and ground truth labels; ii) many live streams about the same event are broadcast simultaneously (e.g., sport games, TV shows), which requires a scalable design for data collection; iii) the crawling system must know when the copyrighted content will be broadcast so it can start the data collection on time. In light of these challenges, we developed a distributed live stream crawling system using Selenium (https://www.seleniumhq.org/) and Docker (https://www.docker.com/). The system is deployed on 4 virtual machines hosted on Amazon Web Services. The crawling system collects the following items of a live video stream:

Video Metadata: The metadata of the video includes the video title, video description, streamer id, view count, and the number of likes and dislikes.

Live Screen Shots: The real-time screenshots of the live video are captured every 30 seconds.

Live Chat Messages: The real-time chat messages from the audience about the video are also collected.

Terminology Dictionary: We also crawl a dictionary related to a piece of copyrighted content y. Examples of the terms in the dictionary include the names of the main characters in a TV show, and the names of players and terminologies used in a sport event. Such a dictionary is used to analyze the relevance of the chat messages to the broadcast event (discussed next).
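The paper does not include the crawler implementation. The following is a minimal sketch of what a per-video collector could look like with Selenium in Python; the headless-Chrome setup, the CSS selectors for YouTube's chat, and the sampling loop are our own illustrative assumptions (YouTube's page structure changes over time), not the authors' code.

```python
# Minimal per-video collector sketch (assumptions: headless Chrome and
# illustrative CSS selectors; YouTube's live page markup changes over time).
import json
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def collect_live_video(video_url, duration_sec=1800, interval_sec=30):
    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    driver.get(video_url)
    records = []
    deadline = time.time() + duration_sec
    while time.time() < deadline:
        # Metadata snapshot (title, view count); selectors are illustrative only.
        title = driver.title
        views = [e.text for e in driver.find_elements(By.CSS_SELECTOR, ".view-count")]
        # Screenshot every sampling interval, as described in Section IV-A.
        driver.save_screenshot("shot_%d.png" % int(time.time()))
        # The live chat sits in an iframe: switch in, scrape messages, switch back.
        chats = []
        frames = driver.find_elements(By.CSS_SELECTOR, "iframe#chatframe")
        if frames:
            driver.switch_to.frame(frames[0])
            chats = [m.text for m in driver.find_elements(
                By.CSS_SELECTOR, "yt-live-chat-text-message-renderer")]
            driver.switch_to.default_content()
        records.append({"t": time.time(), "title": title, "views": views, "chat": chats})
        time.sleep(interval_sec)
    driver.quit()
    return records

if __name__ == "__main__":
    trace = collect_live_video("https://www.youtube.com/watch?v=VIDEO_ID")
    json.dump(trace, open("stream_trace.json", "w"))
```

Each such collector can be packaged in a Docker container and replicated across the virtual machines, matching the distributed design described above.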
B. Live Chat Feature Extraction with Truth Analysis

The goal of the live chat feature extraction component is to identify the key features from the audience's chat messages that are relevant to the copyright infringement of a live video. We observe that the chat messages often reveal important "clues" that help make inferences with respect to the copyright infringement of a video. For example, many viewers complain about the quality of the video (e.g., resolution, sound quality) for a live stream that is copyright-infringing. Alternatively, the viewers are disappointed (e.g., by posting cursing and negative comments in the chat messages) if the content of the video is actually fake. In the CCID scheme, we define these messages relevant to copyright infringement as a set of crowd votes.

DEFINITION 1. Crowd Votes: a crowd vote is a chat message that suggests whether a video is copyright-infringing or not. The vote reflects the viewer's observation about the "truthfulness" (copyright infringement) of the video. More specifically, we define four types of crowd votes:

• Colluding Vote: a live chat message from the audience intended to help the streamer bypass the copyright infringement detection system of the video platform. Examples include chat messages that contain keywords such as "change the title," "change the name," and "change description."
• Content Relevance Vote: a live chat message that contains keywords directly relevant to the event, for example, the names of players in an NBA game or the names of main characters in a TV show. The relevance vote of a message is derived based on the overlap between the message and the terms in the Terminology Dictionary described above.
• Video Quality Vote: a live chat message that contains keywords about the quality of the video (e.g., "lag," "resolution," "full screen," "no sound"). Normally, the more people care about the video quality, the more likely the video contains the real copyrighted content.
• Negativity Vote: a live chat message that contains a direct debunking of the content (e.g., "fake, dislike, down vote, quit, go to my stream instead") or a swear word that expresses anger towards the streamer (drawn from http://www.bannedwordlist.com/lists/swearWords.txt).

Table I shows a few examples of the different types of crowd votes from our collected video datasets.

Table I: Examples of Crowd Votes

  Colluding Vote:
    edgar dlc: change title please so nba wont copyright
    Johnson's Baby Oil: Change the name when it starts
  Content Relevance Vote:
    Joshua Colinet: As a lakers fan, I'm hoping the new look cavs flop so we can get a higher draft pick
    Moustache Man11: Can someone tell me who scored
  Video Quality Vote:
    KING BAMA: Looks good but can we please get some sound buddy?!!
    Malik Horton: would be alot more fun too watch if it wasn't laggy
  Negativity Vote:
    PHXmove: FAKE DO NOT BOTHER
    DaVaughn Sneed: They keep putting bullshit up I'm just trying to watch this game
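The paper does not spell out how a chat message is mapped to a vote type. A simple keyword-matching sketch in Python is shown below; the keyword lists, the function names, and the plain substring test are illustrative assumptions rather than the authors' exact rules.

```python
# Illustrative mapping from a chat message to crowd vote types via keyword
# matching; keyword lists and the substring test are assumptions, not the
# authors' exact rules.
COLLUDING_KEYWORDS = {"change the title", "change the name", "change description"}
QUALITY_KEYWORDS = {"lag", "laggy", "resolution", "full screen", "no sound"}
NEGATIVITY_KEYWORDS = {"fake", "dislike", "down vote", "quit", "go to my stream"}

def tag_votes(message, terminology, swears):
    """Return the set of vote types (possibly empty) that a message casts."""
    text = message.lower()
    votes = set()
    if any(kw in text for kw in COLLUDING_KEYWORDS):
        votes.add("colluding")
    if any(term.lower() in text for term in terminology):   # Terminology Dictionary
        votes.add("relevance")
    if any(kw in text for kw in QUALITY_KEYWORDS):
        votes.add("quality")
    if any(kw in text for kw in NEGATIVITY_KEYWORDS) or any(w in text.split() for w in swears):
        votes.add("negativity")
    return votes

# Example: tag_votes("Change the name when it starts", {"Lakers", "Cavs"}, set())
# returns {"colluding"}.
```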
To better quantify the contribution of a crowd vote to the likelihood of a video being copyright-infringing, we further define the weight of a crowd vote as follows.

DEFINITION 2. Weight of Crowd Vote: the weight of a crowd vote is defined as the probability that a video is copyright-infringing given the type of the crowd vote about the video. Formally, it is defined as:

    φ_{i,k} = Pr(V_i = T | SV_k)    (2)

where V_i = T denotes that the video V_i is copyright-infringing and SV_k denotes that the crowd vote is of type k. (Without loss of generality, we assume a set of K types of crowd votes in our model, i.e., SV = {SV_1, SV_2, ..., SV_K}; in this paper, we focus on the four types defined above, i.e., K = 4.) For ease of notation, we omit the superscript of copyrighted content (i.e., y) in all equations in this section.

In the CCID scheme, we develop a principled model to compute the weight of each crowd vote using a Maximum Likelihood Estimation (MLE) approach. We first define a few important notations. Let a_{i,k} = Pr(SV_k | V_i = T) denote the probability that a crowd vote of type SV_k appears in a copyright-infringing video. It can be derived from φ_{i,k} using Bayes' theorem: a_{i,k} = φ_{i,k} × Pr(SV_k) / π_T, where π_T is the probability of a randomly selected video being copyright-infringing (i.e., Pr(V_i = T)). Similarly, we define b_{i,k} = Pr(SV_k | V_i = F) = (1 − φ_{i,k}) × Pr(SV_k) / (1 − π_T), which represents the probability that SV_k appears in a video with no copyright infringement. We further define a helper function χ(c, k) which returns 1 if a chat message c is of type SV_k and 0 otherwise.

Given the above definitions, we derive the likelihood function of the observed data X (i.e., the videos {V_1, V_2, ..., V_N} and their corresponding comments {Chat_1, Chat_2, ..., Chat_N}) as:

    L(Θ | X) = ∏_{i=1}^{N} { [ ∏_{c ∈ Chat_i} ∏_{k=1}^{K} a_{i,k}^{χ(c,k)} × (1 − a_{i,k})^{1−χ(c,k)} ] × π_T × z_i
                           + [ ∏_{c ∈ Chat_i} ∏_{k=1}^{K} b_{i,k}^{χ(c,k)} × (1 − b_{i,k})^{1−χ(c,k)} ] × (1 − π_T) × (1 − z_i) }    (3)

where z_i is a binary variable indicating whether a video stream V_i is copyright-infringing (z_i = 1) or not (z_i = 0).

In the above equation, the estimation parameters are Θ = {π_T, φ_{i,1}, ..., φ_{i,K}}. They can be estimated by maximizing the likelihood of the observed data:

    argmax_{{π_T, φ_{i,1}, ..., φ_{i,K}}} L(Θ | X)    (4)
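As a quick numeric illustration of the Bayes step above (with assumed values, not numbers from the paper): if π_T = 0.25, Pr(SV_k) = 0.02 for the colluding type, and φ_{i,k} = 0.95, then a_{i,k} = 0.95 × 0.02 / 0.25 = 0.076 while b_{i,k} = 0.05 × 0.02 / 0.75 ≈ 0.0013. Under these assumed numbers, a colluding vote is roughly 57 times more likely to be observed in a copyright-infringing stream than in a legal one, which is exactly the asymmetry that the likelihood in Eq. (3) exploits.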

                                                                          4
Using Bayesian estimation [17], we can derive the closed-form solution to the above estimation problem as follows:

    π_T = (∑_{i=1}^{N} z_i) / N,    φ_{i,k} = (∑_{c ∈ Chat_i} z_i × χ(c, k)) / (∑_{c ∈ Chat_i} χ(c, k))    (5)

where the value of z_i will be learned from the training data.

Using the weights of the crowd votes from the above estimation, we can define the Overall Crowd Vote (Chat_ocv) feature for each video V_i. This feature represents the aggregated observations from the audience on whether the video stream is copyright-infringing. Formally, Chat_{ocv,i} is derived as the aggregated weights of all crowd votes about video V_i:

    Chat_{ocv,i} = ∑_{c ∈ Chat_i} ∑_{k=1}^{K} φ_{i,k} × χ(c, k)    (6)
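A compact sketch of how Eqs. (5) and (6) could be computed from labeled training data is shown below. Pooling the vote counts over all training videos is one reasonable reading of Eq. (5), and tag_votes refers to the illustrative tagger sketched earlier; both are our assumptions rather than the authors' implementation.

```python
# Sketch of Eq. (5) (crowd vote weights) and Eq. (6) (Overall Crowd Vote).
# Vote counts are pooled over all training videos here, which is one
# reasonable reading of Eq. (5); tag_votes is the illustrative tagger above.
from collections import defaultdict

VOTE_TYPES = ["colluding", "relevance", "quality", "negativity"]

def estimate_weights(train_videos, terminology, swears):
    # train_videos: list of (chat_messages, z) pairs, z = 1 if copyright-infringing
    total_hits = defaultdict(int)        # type-k votes observed in all videos
    infringing_hits = defaultdict(int)   # type-k votes observed in z = 1 videos
    n_infringing = 0
    for chats, z in train_videos:
        n_infringing += z
        for msg in chats:
            for k in tag_votes(msg, terminology, swears):
                total_hits[k] += 1
                infringing_hits[k] += z
    pi_t = n_infringing / len(train_videos)              # Eq. (5), pi_T
    phi = {k: infringing_hits[k] / total_hits[k]         # Eq. (5), phi for type k
           for k in VOTE_TYPES if total_hits[k] > 0}
    return pi_t, phi

def overall_crowd_vote(chats, phi, terminology, swears):
    # Eq. (6): aggregated weight of all crowd votes observed in one video's chat.
    return sum(phi.get(k, 0.0)
               for msg in chats
               for k in tag_votes(msg, terminology, swears))
```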
In addition to the Chat_ocv feature, we also investigate other live chat features that are potentially relevant to copyright infringement detection. We summarize these features below:

Chat Message Rate (Chat_rateM): the average number of chat messages per minute.

Chat User Rate (Chat_rateU): the average number of distinct chatting users per minute.

Early Chat Polarity (Chat_polarity): the average sentiment polarity of the chat messages posted during the starting stage of the event (i.e., 0-3 minutes). The polarity refers to how positive/negative the chat messages are. Normally, the audience start to curse and post negative comments on a video after they find the live stream to be fake, which usually happens at the beginning of the stream.

C. Metadata Feature Extraction

We found that the metadata of a video also provides valuable clues for copyright infringement detection. In our CCID scheme, we focus on the following metadata features.

View Counts (Meta_view): the number of viewers that are currently watching the live video stream. Intuitively, the more viewers, the more likely the video is broadcasting copyright-infringing content.

Title Subjectivity (Meta_subT): the subjectivity of the video's title. We derive a subjectivity score (a floating point within the range [-1.0, 1.0]) for each video using the subjectivity analysis of TextBlob [18]. Intuitively, a title with high subjectivity (e.g., "Super Bowl live stream for free!", "Best Quality Ever!") can potentially be spam, since a copyright-infringing video normally keeps an objective and low-profile title (e.g., "NFL Super Bowl LII") to minimize the chance of being caught by the platform's copyright infringement detection system.

Description Subjectivity (Meta_subD): the subjectivity of the video's description. It is chosen based on the same intuition as the title subjectivity.

Number of Likes/Dislikes (Meta_like, Meta_dislike): the total number of viewers who hit the "like"/"dislike" button. Intuitively, if the video contains copyrighted content, it may receive more "likes" from the audience. However, if the audience finds out that the video stream does not contain copyrighted content, they are more likely to hit the "dislike" button.
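For reference, a minimal sketch of deriving the subjectivity and early-polarity features with TextBlob is shown below; the exact preprocessing and any rescaling the authors apply are not specified in the paper, so treat this as an assumption.

```python
# Minimal sketch of the TextBlob-based features (preprocessing and any
# rescaling are assumptions; TextBlob's sentiment exposes polarity and
# subjectivity scores).
from textblob import TextBlob

def text_subjectivity(text):
    # Meta_subT / Meta_subD: subjectivity of a title or description.
    return TextBlob(text).sentiment.subjectivity

def early_chat_polarity(early_messages):
    # Chat_polarity: average polarity of chat messages from the first 3 minutes.
    if not early_messages:
        return 0.0
    return sum(TextBlob(m).sentiment.polarity for m in early_messages) / len(early_messages)

# Example: a hyped title like "Super Bowl live stream for free! Best Quality Ever!"
# typically scores higher on subjectivity than a plain "NFL Super Bowl LII".
```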
Note that we chose not to extract semantic features directly related to the content of the titles and descriptions of the videos (e.g., using text mining to extract topic or Bag-of-Words (BoW) features). This is because sophisticated streamers often manipulate the title and description to bypass the platform's copyright detection system. For example, in one of the copyright-infringing live streams of an NBA game, the streamer modified the name to "1000 iphones!!!!". On the other hand, many legal video streams (e.g., a live game play of an NBA 2K video game) actually have very suspicious titles such as "2018 NBA All-Star Game LIVE!!!" in order to attract the attention of the audience. BoW or topic based feature extraction techniques are often not robust against sophisticated streamers and can easily lead to a large number of false alarms [19].

D. Supervised Classification

After the live chat and metadata features are extracted from the collected data, CCID performs supervised binary classification using the extracted features to classify live videos as copyright-infringing or not. Rather than re-invent the wheel, we use a set of state-of-the-art supervised classification models in the CCID scheme. Examples include neural networks, boosting models, tree based classifiers and a support vector machine. These classifiers serve as plug-ins to our CCID scheme and the one with the best performance in the evaluation on training data is selected. We present the detailed performance evaluation of CCID when it is coupled with these classifiers in Section V.
V. EVALUATION ON REAL WORLD DATA

In this section, we evaluate the CCID scheme using two real-world datasets collected from YouTube. The results demonstrate that CCID significantly outperforms ContentID from YouTube, the only available copyright infringement detection tool for live videos at the time of writing [3].

A. Datasets

We summarize the two real-world datasets used for evaluation in Table II. The NBA dataset contains 130 live video streams related to NBA games from Dec. 2017 to Mar. 2018; 28.57% of the collected videos are found to be copyright-infringing. The Soccer dataset contains 226 live videos related to soccer matches in major soccer leagues worldwide from Sept. 2017 to Mar. 2018; 17.53% of these videos are copyright-infringing. We use the data crawler system (described in Section IV) to collect these live videos. The search terms we used to collect these videos are the team names of the match (e.g., "Houston Rockets Detroit Pistons"). We leverage the advanced search filters provided by YouTube to ensure all collected videos are live video streams. For each video, we collect the stream for a duration of 30 minutes and we start the crawling process at the scheduled time of each game.

Table II: Data Trace Statistics

  Data Trace                           NBA                       Soccer
  Collection Period                    Dec. 2017 - Mar. 2018     Sept. 2017 - Mar. 2018
  Number of Videos                     138                       226
  % of Copyright-infringing Videos     28.57%                    17.53%
  % of Videos with Chat Enabled        57.78%                    40.71%
  Number of Chat Users                 2,705                     4,834
  Number of Chat Messages              61,512                    94,357

To obtain the ground truth labels, we manually looked at the collected screenshots of a video stream to check if the video is copyright-infringing. This labeling step is carried out by three independent graders to eliminate possible bias. We sort the video streams in chronological order and use the first 70% of the data as the training set and the last 30% (latest) as the test set. In the training phase, we perform 10-fold cross validation to tune the parameters of the classifiers.
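A minimal sketch of this evaluation protocol (chronological 70/30 split followed by 10-fold cross-validation on the training portion) is shown below; it assumes videos are already sorted by start time, and the tuning grid is illustrative rather than the authors' settings.

```python
# Sketch of the evaluation protocol: chronological 70/30 split, then 10-fold
# cross-validation on the training portion to tune the classifier
# (the tuning grid below is illustrative, not the authors' settings).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

def evaluate(X_sorted, y_sorted):
    # X_sorted, y_sorted: feature matrix and labels sorted by video start time
    split = int(0.7 * len(X_sorted))
    X_train, y_train = X_sorted[:split], y_sorted[:split]
    X_test, y_test = X_sorted[split:], y_sorted[split:]
    grid = GridSearchCV(AdaBoostClassifier(),
                        param_grid={"n_estimators": [25, 50, 100]},
                        cv=10, scoring="f1")
    grid.fit(X_train, y_train)
    preds = grid.best_estimator_.predict(X_test)
    return f1_score(y_test, preds), grid.best_params_
```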
To build the terminology database for extracting the crowd vote feature from the live chat messages, we collected terms related to the live video events. For the NBA dataset, we crawled the names of players and teams from the ESPN website (http://www.espn.com/nba/players). For the Soccer dataset, we use an existing database to extract the names of players and teams of major soccer clubs (https://www.kaggle.com/artimous/complete-fifa-2017-player-dataset-global). We also built a set of terminologies and slogans related to these events (e.g., flop, 3-pointer, foul, dribble, header, hat-trick).

B. Classifier Selection and Baseline

We chose a few state-of-the-art supervised classifiers that can be coupled with the CCID scheme in our experiments. We summarize them below.

• AdaBoost, XGBoost, Random Forest (RF): AdaBoost [20], XGBoost [21], and Random Forest [22] are ensemble-based classification algorithms. RF applies bagging while XGBoost and AdaBoost use boosting techniques to combine a set of classifiers (we use 50 decision trees) to improve classification performance. XGBoost trains a classifier by minimizing the negative gradient and AdaBoost trains a classifier by adjusting the weights of the training samples.
• Linear Support Vector Machine (SVM): given labeled training data, the SVM algorithm outputs an optimal hyperplane to categorize new data samples [23].
• Multi-layer Perceptron (MLP): an artificial neural network based classification scheme that can distinguish data that is not linearly separable [24].

We compare the CCID system with the current copyright detection system (i.e., ContentID) developed by YouTube [5]. To evaluate YouTube's ContentID without direct access to its internal system (since it is a proprietary system), we estimate the effectiveness of ContentID as follows. In particular, we label a video as copyright-infringing (detected by ContentID) if i) it went offline abruptly during the broadcast (see Figure 3(a)), or ii) it was explicitly reported by the copyright owner and taken down (see Figure 3(b)). We observe that the latter case is rare (less than 5%) in the live streams we collected. This again demonstrates the importance and necessity of developing an automatic detection system like CCID to keep track of copyright-infringing content in live videos from online social media.

Figure 3: Copyright infringements identified by YouTube

C. Results: Detection Effectiveness

In the first set of experiments, we evaluate the detection effectiveness of CCID when it is coupled with different classifiers and identify the best performing classifier for CCID. We then compare CCID with ContentID used by YouTube. The detection effectiveness is evaluated using the classical metrics for binary classification: Accuracy, Precision, Recall and F1-Score. The results are reported in Table III.

We observe that AdaBoost achieves the best performance among all candidate classifiers. We also observe that adding the features extracted from live chat messages can significantly improve the detection performance of CCID. More specifically, CCID with AdaBoost achieved a 6.8% and 17.2% increase in F1-Score on the NBA and Soccer datasets, respectively, compared to YouTube's ContentID. In fact, we observe that ContentID has poor precision in both datasets due to high false positive rates (which will be further discussed in the next subsection). The high false positive rate leads to the unfair taking down of legal live videos and discourages streamers from uploading live video content. In contrast, the CCID scheme exploits the chat messages from the actual audience of the videos to identify potential evidence of copyright infringement, making it more robust to false alarms.

D. Results: Detection Time

We then evaluate the detection time of both CCID and ContentID. The detection time is defined as the amount of time the system takes to detect the copyright infringement of a live video after it starts. We focus on two aspects of the detection system when we study the detection time: i) True Positive Rate: it characterizes the ability of the system to correctly identify a copyright-infringing video. This metric is important for copyright owners who would like to detect all illegal video streams; ii) False Positive Rate: it characterizes the ability of the system to avoid misclassifying legal videos as copyright-infringing. This is particularly important to "streamers" who would like to keep their legal content from being falsely taken down.

Table III: Classification Accuracy for All Schemes

                                                       NBA                                        Soccer
  Algorithms           Features            Accuracy  Precision  Recall  F1-Score    Accuracy  Precision  Recall  F1-Score
  AdaBoost (CCID)      w/ chat features    0.8621    0.8182     0.8182  0.8182      0.9103    0.8125     0.8667  0.8387
                       w/o chat features   0.8276    0.7500     0.8182  0.7826      0.8750    0.7500     0.8000  0.7742
  XGBoost              w/ chat features    0.7971    0.7777     0.6364  0.7000      0.8571    0.7059     0.8000  0.7500
                       w/o chat features   0.7586    0.6667     0.7273  0.6957      0.8214    0.8571     0.4000  0.5455
  RF                   w/ chat features    0.7586    0.7000     0.6364  0.6667      0.8750    0.7857     0.7333  0.7586
                       w/o chat features   0.6897    0.5714     0.7273  0.6400      0.8036    0.7000     0.4667  0.5600
  SVM                  w/ chat features    0.6207    0.5000     0.4545  0.4762      0.8214    0.6667     0.6667  0.6667
                       w/o chat features   0.5862    0.4286     0.2728  0.3333      0.7679    0.6667     0.2667  0.3810
  MLP                  w/ chat features    0.6207    0.5000     0.4545  0.4762      0.7321    0.5000     0.5333  0.5161
                       w/o chat features   0.4137    0.3636     0.7273  0.4848      0.6786    0.2000     0.0667  0.1000
  YouTube (ContentID)                      0.7931    0.6923     0.8182  0.7500      0.8036    0.6111     0.7333  0.6667

In the experiment, we tune the time window of the data collection from 1 to 30 minutes and only use the chat messages within the specified time window for CCID. We also chose the best-performing classifier (i.e., AdaBoost) for CCID. The results are shown in Figure 4 and Figure 5. We observe that, for the true positive rate, our scheme quickly outperforms YouTube at a very early stage of the event and keeps a consistently high performance for the rest of the event. Such results suggest our CCID scheme can catch copyright-infringing videos not only more accurately but also much faster than ContentID from YouTube. For the false positive rate, we observe that the CCID scheme has a higher false positive rate at the very beginning of the event (due to the lack of sufficient chat messages). However, our scheme quickly catches up and starts to outperform YouTube (ContentID) when the time window is longer than 5 minutes. We also observe that YouTube starts to mistakenly take down more and more legal videos (as copyright-infringing ones) as time elapses. Such an increase can clearly discourage streamers with legal content from using the video sharing platform.

Figure 4: NBA Dataset. (a) True Positive Rate; (b) False Positive Rate.

Figure 5: Soccer Dataset. (a) True Positive Rate; (b) False Positive Rate.

E. Results: Feature Analysis

In addition to the holistic evaluation of the CCID system, we also investigate which features are the most critical ones in our selected classifier. Table IV shows the ranking of the features we used in the CCID scheme (with AdaBoost) based on the information gain ratio, a commonly used metric in analyzing feature importance for decision-tree based classifiers [25].
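For reference, a small sketch of how a gain ratio for one feature might be computed is shown below; discretizing the numeric feature at its median is our own illustrative choice, since the paper does not describe its exact discretization.

```python
# Illustrative gain-ratio computation for a single numeric feature,
# discretized at its median (the paper does not describe its discretization).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(feature_values, labels):
    x, y = np.asarray(feature_values), np.asarray(labels)
    split = x >= np.median(x)                   # binary partition of the videos
    parts = [y[split], y[~split]]
    info_gain = entropy(y) - sum(len(p) / len(y) * entropy(p) for p in parts if len(p))
    split_info = entropy(split)                 # entropy of the partition itself
    return info_gain / split_info if split_info > 0 else 0.0
```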
We found that Meta_view, Chat_ocv and Meta_subT are the three most important features for both datasets. The first two features (i.e., Meta_view, Chat_ocv) are intuitive: the more viewers a video stream attracts, the more likely it is broadcasting copyrighted content (otherwise the viewers would simply quit and switch to another video stream). Similarly, the crowd vote represents a strong signal from the audience to indicate whether a video is copyright-infringing (via the overall crowd votes). For Meta_subT, we attribute it to the fact that a title with high subjectivity can potentially be spam (a copyright-infringing video normally keeps a low profile to minimize the chance of being caught). We also observe that some intuitive features, such as the number of likes/dislikes, do not actually play a critical role in the classification process. This observation might be attributed to the fact that users may not even bother hitting the "like" or "dislike" button when they are watching videos about live events (e.g., sports).

Finally, we evaluate the parameter estimation of the live chat feature extraction module. As shown in Table IV, we observe that the overall crowd vote (Chat_ocv) feature plays an important role (ranked 2nd) in detecting copyright-infringing videos. It is interesting to further look into the decomposition of the crowd votes. The estimated weights of the crowd votes are shown in Table V. We observe that colluding, negativity, and quality votes are strong indicators of whether a video is copyright-infringing (i.e., the weights are either high or low). In contrast, the relevance vote seems to be a weak indicator. After a careful investigation, we find the main reason is that some live streams of the games are audio-only (thus non-copyright-infringing), but the audience still posts chat messages with soccer or NBA-related terms in those videos.

Table IV: Feature Importance for All Schemes

                          NBA                         Soccer
  Features         Ranking   Gain Ratio       Ranking   Gain Ratio
  Chat_ocv            2        0.1568            2        0.2011
  Chat_rateM          5        0.1004          6 (tie)    0.0726
  Chat_rateU          4        0.1129          6 (tie)    0.0726
  Chat_polarity       8        0.0718            4        0.1006
  Meta_view           1        0.2048            1        0.2179
  Meta_like           9        0.0588            8        0.0614
  Meta_dislike        7        0.0723            9        0.0447
  Meta_subD           6        0.0972            5        0.0894
  Meta_subT           3        0.1132            3        0.1388
  Meta_enabled       10        0.0117           10        0.0009

Table V: Parameter Estimation (φ_{i,k}) for Crowd Votes

  Datasets    Colluding    Negativity    Relevance    Quality
  NBA         0.968        0.382         0.589        0.920
  Soccer      0.890        0.257         0.497        0.872

VI. CONCLUSION AND FUTURE WORK

In this paper, we develop the first crowdsourcing-based solution (i.e., CCID) to address the copyright infringement detection problem for live video streams on online social media. The proposed scheme is robust against sophisticated streamers by leveraging the valuable clues in the unstructured and noisy live chat messages from the audience. Using two real-world live stream datasets, we have demonstrated that CCID can significantly outperform ContentID from YouTube by detecting more copyright-infringing videos and reducing the number of legal streams that are mistakenly taken down.

We also identify several limitations of CCID that lead to interesting directions for future work. First, CCID requires a terminology dictionary that depends on prior knowledge of the relevant terms used for an event. In our future work, we plan to explore entity extraction techniques [26] to directly learn the relevant entities (e.g., TV characters, team names, etc.) from the training data. Second, the current CCID scheme adopts a supervised learning method to identify copyright infringement of live videos. However, training data may not always be available for new events (e.g., a brand new TV show). In our future work, we will explore both i) an unsupervised model that requires no training data and ii) transfer learning techniques that can leverage models trained on existing events for the live videos of new events.

REFERENCES

[1] "Statistics of live stream market," https://www.researchandmarkets.com/research/8xpzlb/video streaming, accessed: 2018-04-07.
[2] "Statistics of twitch and youtube live," https://www.statista.com/statistics/761100/, accessed: 2018-04-07.
[3] "Video game streaming brings new level of copyright issues," https://www.law360.com/articles/920036/video-game-streaming-brings-new-level-of-copyright-issues, accessed: 2018-04-07.
[4] P. Tassi, "'Game of Thrones' sets piracy world record, but does HBO care?," Forbes, vol. 4, p. 15, 2014.
[5] D. King, "Latest content id tool for youtube," Google Blog, 2007.
[6] A. Barg, G. R. Blakley, and G. A. Kabatiansky, "Digital fingerprinting codes: Problem statements, constructions, identification of traitors," IEEE Transactions on Information Theory, vol. 49, no. 4, 2003.
[7] C.-Y. Lin, "Watermarking and digital signature techniques for multimedia authentication and copyright protection," Ph.D. dissertation, Columbia University, 2001.
[8] M. M. Esmaeili, M. Fatourechi, and R. K. Ward, "A robust and fast video copy detection system using content-based fingerprinting," IEEE Transactions on Information Forensics and Security, vol. 6, no. 1, 2011.
[9] X. Nie, Y. Yin, J. Sun, J. Liu, and C. Cui, "Comprehensive feature-based robust video fingerprinting using tensor model," IEEE Transactions on Multimedia, vol. 19, no. 4, pp. 785–796, 2017.
[10] C. I. Podilchuk and W. Zeng, "Image-adaptive watermarking using visual models," IEEE Journal on Selected Areas in Communications, vol. 16, no. 4, pp. 525–539, 1998.
[11] S. H. Low, N. F. Maxemchuk, and A. M. Lapone, "Document identification for copyright protection using centroid detection," IEEE Transactions on Communications, vol. 46, no. 3, pp. 372–383, 1998.
[12] J. Waldfogel, "Copyright protection, technological change, and the quality of new products: Evidence from recorded music since Napster," The Journal of Law and Economics, vol. 55, no. 4, pp. 715–740, 2012.
[13] C.-L. Chou, H.-T. Chen, and S.-Y. Lee, "Pattern-based near-duplicate video retrieval and localization on web-scale videos," IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 382–395, 2015.
[14] D. Wang, L. Kaplan, H. Le, and T. Abdelzaher, "On truth discovery in social sensing: A maximum likelihood estimation approach," in Proc. ACM/IEEE 11th International Conference on Information Processing in Sensor Networks (IPSN), Apr. 2012, pp. 233–244.
[15] R. P. Schumaker, A. T. Jarmoszko, and C. S. Labedz Jr, "Predicting wins and spread in the premier league using a sentiment analysis of twitter," Decision Support Systems, vol. 88, pp. 76–84, 2016.
[16] T. Steiner, R. Verborgh, R. Van de Walle, M. Hausenblas, and J. G. Vallés, "Crowdsourcing event detection in youtube video," in 10th International Semantic Web Conference (ISWC 2011); 1st Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web, 2011, pp. 58–67.
[17] D. D. Lewis, "Naive (Bayes) at forty: The independence assumption in information retrieval," in European Conference on Machine Learning. Springer, 1998, pp. 4–15.
[18] S. Loria, P. Keen, M. Honnibal, R. Yankovsky, D. Karesh, E. Dempsey et al., "Textblob: Simplified text processing," Secondary TextBlob: Simplified Text Processing, 2014.
[19] X. Wei and W. B. Croft, "LDA-based document models for ad-hoc retrieval," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2006, pp. 178–185.
[20] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[21] T. Chen and C. Guestrin, "Xgboost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 785–794.
[22] A. Liaw, M. Wiener et al., "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[23] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "Liblinear: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[24] D. W. Ruck, S. K. Rogers, M. Kabrisky, M. E. Oxley, and B. W. Suter, "The multilayer perceptron as an approximation to a Bayes optimal discriminant function," IEEE Transactions on Neural Networks, vol. 1, no. 4, pp. 296–298, 1990.
[25] B. Azhagusundari and A. S. Thanamani, "Feature selection based on information gain," International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, 2013.
[26] O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, "Unsupervised named-entity extraction from the web: An experimental study," Artificial Intelligence, vol. 165, no. 1, pp. 91–134, 2005.
