2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM)

ViTag: Automatic Video Tagging Using Segmentation and Conceptual Inference

Abhishek A. Patwardhan, Santanu Das, Sakshi Varshney, Maunendra Sankar Desarkar
Department of Computer Sc. & Engineering, IIT Hyderabad, Hyderabad, India
Email: {cs15mtech11015, cs15mtech11018, cs16resch01002, maunendra}@iith.ac.in

Debi Prosad Dogra
School of Electrical Sciences, IIT Bhubaneswar, Bhubaneswar, India
Email: dpdogra@iitbbs.ac.in

Abstract—The massive increase in multimedia data has created a need for effective organization strategies. Multimedia collections are organized based on attributes such as domain, index terms, content description, owners, etc. Typically, index terms are the prominent attribute for effective video retrieval systems. In this paper, we present a new approach to automatic video tagging, referred to as ViTag. Our analysis relies upon various image similarity metrics to automatically extract key-frames. For each key-frame, raw tags are generated by performing reverse image tagging. The final step analyzes the raw tags in order to discover hidden semantic information. On a dataset of 103 videos belonging to 13 domains derived from various YouTube categories, we are able to generate tags with 65.51% accuracy. We also rank the generated tags based upon the number of proper nouns present in them. The geometric mean of Reciprocal Rank estimated over the entire collection has been found to be 0.873.

Keywords—video content analysis, video tagging, video organization, video information retrieval

I. INTRODUCTION

Finding a match for a user-submitted query is challenging on large multimedia collections. To reduce the search effort, video hosting websites often allow users to attach a description to a video. However, descriptions or index terms can be ambiguous, irrelevant, insufficient, or even empty. This creates the need for an automatic video tagger. In this paper, we present an automatic video tagging tool, referred to as ViTag. It involves video segmentation that extracts distinct, representative frames from the input video through a hierarchical combination of various image similarity metrics. In the next step, raw tags obtained from the segmented video frames are investigated to estimate semantic similarity information. Finally, we annotate the input video by combining the raw tags with the inferred tags.

In accomplishing this, we make the following contributions: (i) a hierarchical combination of three image similarity metrics to design a video segmentation algorithm, (ii) a conceptual inference heuristic to automatically infer generic tags from raw tags, and (iii) a fully automatic, end-to-end, open-source tool that outputs tags based solely on analyzing the input video. The approach implemented within the ViTag framework is outlined in Figure 1.

Figure 1: Overview of the proposed ViTag architecture.

A. Related work

Automatic video tagging research is growing. Siersdorfer et al. [1] have devised a technique based on content redundancy of videos. However, their approach requires querying an external video collection to generate tags for the video in question; our approach instead exploits semantic similarity information [2]. Moxley et al. [3] perform a search using three attributes (frames, text, and concepts) to find matching videos out of a collection. Their approach needs automatic speech recognition and therefore seems difficult to apply to generic videos from challenging domains like animation, songs, and games. Toderici et al. [4] have trained a classifier that learns the association of the audio-visual features of a video with its tags. Machine learning based approaches are also promising, but they come with higher training and tuning overheads. Borth et al. [5] have extracted key-frames for video summarization using k-means clustering to group similar frames into clusters. Yao et al. [6] have tagged videos by mining user search behavior; their method requires dynamic information about user behavior. The probabilistic model-based method proposed in [7]

involves a two-step process, i.e., video analysis followed by querying a classification framework to generate tags.
The rest of the paper is organized as follows. Section II presents the overall methodology and implementation details. Section III presents the results. Finally, Section IV provides conclusions and future work.

II. PROPOSED VITAG FRAMEWORK

A. Video Feature Extraction

ViTag first extracts the key-frames and feeds them as inputs to the reverse image tagger that generates the raw tags. The process of selecting dissimilar frames from an input video is outlined in Algorithm 1. The threshold value can be set empirically.

Algorithm 1 Selection of dissimilar frames
Require: Video V
 1: Output frame sequence K = ∅
 2: prev ← first frame in V
 3: for all frames f ∈ V do
 4:     score ← compute_mean_square_error(f, prev)
 5:     if score > threshold then
 6:         K ← K ∪ {f}
 7:         prev ← f
 8:     end if
 9: end for
10: return K
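As a concrete illustration, a minimal Python sketch of Algorithm 1 using OpenCV for decoding; the grayscale conversion and the default threshold value are our assumptions (the paper only states that the threshold is set empirically):

```python
import cv2
import numpy as np

def mean_square_error(a, b):
    # Pixel-wise MSE between two equally sized grayscale frames.
    return np.mean((a.astype(float) - b.astype(float)) ** 2)

def select_dissimilar_frames(video_path, threshold=1000.0):
    """Algorithm 1: keep a frame whenever it differs enough from the
    previously kept frame. threshold=1000.0 is only a placeholder;
    the paper sets this value empirically."""
    cap = cv2.VideoCapture(video_path)
    kept = []
    ok, prev = cap.read()
    if not ok:
        return kept
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if mean_square_error(gray, prev) > threshold:  # score > threshold
            kept.append(frame)                         # K = K ∪ {f}
            prev = gray                                # prev ← f
    cap.release()
    return kept
```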
The algorithm consists of two stages. First, the input sequence of frames is partitioned into fixed-sized, non-overlapping windows. Within each window, we estimate the similarity of two successive frames using features such as Mean Square Error (MSE), SIFT, and the Structural Similarity Index (SSIM). This results in a similarity vector V for each window, where a value V_i denotes the similarity score between two adjacent frames (F_i, F_{i+1}) within the window. The input to the video segmentation process, as depicted in Figure 2, is the set of frames selected by Algorithm 1. In the second stage, we analyze the similarity vector for each window so as to select a single representative frame for that window.

Figure 2: Complete key-frame extraction module.
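A sketch of the per-window similarity computation described above. We use OpenCV's SIFT and, as a stand-in, scikit-image's SSIM (the paper computes SSIM via OpenCV); the ratio-test formulation of the SIFT score is our assumption about how raw matches become a similarity value:

```python
import cv2
from skimage.metrics import structural_similarity

def sift_similarity(img1, img2):
    """Fraction of SIFT matches passing Lowe's ratio test (an assumed
    way to turn SIFT matches into a [0, 1] similarity score)."""
    sift = cv2.SIFT_create()
    _, d1 = sift.detectAndCompute(img1, None)
    _, d2 = sift.detectAndCompute(img2, None)
    if d1 is None or d2 is None:
        return 0.0
    matches = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [p for p in matches
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) / max(len(matches), 1)

def similarity_vector(window):
    """V for one window: V[i] scores adjacent grayscale frames i and i+1."""
    return [structural_similarity(a, b) for a, b in zip(window, window[1:])]
```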
Intuitively, we wish to select the frame that contains maximum information. Note that a frame can be considered to contain maximum information in two cases: when it matches both of its neighbors with the highest matching score (it is representative of the window), or when its matching scores with the adjacent frames are low (it contains unique information). To capture both cases, the heuristic discussed in Algorithm 2 is used. The algorithm picks the frame contributing to the maximum score; however, if the minimum score turns out to be less than a threshold value, the heuristic assumes the existence of a frame containing unique information, and we instead select the frame contributing to the minimum score.

Algorithm 2 Selecting a representative frame for a given window
Require: Frames[1..N]: window
Require: Scores[1..N-1]: similarity scores for adjacent frames within the window
 1: maxVal, maxInd ← MAX(Scores[1..N-1])
 2: minVal, minInd ← MIN(Scores[1..N-1])
 3: if minVal > threshold then
 4:     if Scores[maxInd-1] < Scores[maxInd+1] then
 5:         select ← maxInd+1
 6:     else
 7:         select ← maxInd
 8:     end if
 9: else
10:     if Scores[minInd-1] < Scores[minInd+1] then
11:         select ← minInd+1
12:     else
13:         select ← minInd
14:     end if
15: end if
16: return Frames[select]
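A direct Python transcription of Algorithm 2 with 0-based indexing; the boundary guards for windows whose extremal pair sits at an edge are our addition, since the pseudocode leaves Scores[maxInd-1] and Scores[maxInd+1] undefined there:

```python
def pick_representative(frames, scores, threshold=0.5):
    """Algorithm 2: choose one representative frame for a window.
    frames: N frames; scores: N-1 adjacent-pair similarity scores.
    threshold=0.5 is a placeholder value."""
    max_ind = max(range(len(scores)), key=scores.__getitem__)
    min_ind = min(range(len(scores)), key=scores.__getitem__)

    def side(ind):
        # Pick the frame adjoining the larger neighbouring score.
        left = scores[ind - 1] if ind > 0 else float("-inf")
        right = scores[ind + 1] if ind + 1 < len(scores) else float("-inf")
        return ind + 1 if left < right else ind

    if scores[min_ind] > threshold:
        select = side(max_ind)  # all pairs similar: frame with best matches
    else:
        select = side(min_ind)  # a low score exists: frame with unique content
    return frames[select]
```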
B. Raw Tag Generation

The second phase of ViTag receives the key-frames and obtains raw tags by querying the reverse image tagger. The reverse image search engine provides a list of web pages and the key terms associated with the query image. Such a search technique is discussed in [8]. Typically, such algorithms employ techniques including maximally stable extremal regions (MSER) detection [9], object detection [10], vocabulary trees [11], etc. We initially encode the frame within a query, which is then fired at the search engine. The responses are parsed to extract the tags. The steps are shown in Fig. 3.
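Google Reverse Image Search [14] offers no official API, so we only sketch the shape of this phase; `reverse_image_tags` below is a hypothetical stub standing in for the query-and-parse step:

```python
def reverse_image_tags(frame):
    """Hypothetical stub: encode the frame into a query, fire it at the
    reverse image search engine, and parse the returned key-terms."""
    raise NotImplementedError("plug in a reverse image tagger here")

def raw_tags(keyframes):
    # Union of raw tags over all key-frames: the set T used in Sec. II-D.
    tags = set()
    for frame in keyframes:
        tags.update(reverse_image_tags(frame))
    return tags
```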

Figure 3: Reverse image tagging methodology used in our work.
                                                                                                                          
C. Conceptual Inference

After performing a reverse image search for the key-frames, we post-process the obtained tags in order to infer more generic tags. We achieve this by adding an extra module of conceptual inference that refers to an external knowledge source built on top of various concepts on the web. Such a representation is referred to as a concept graph. Formally, a concept graph [12] is a knowledge representation structure storing a weighted association of natural language words with (abstract) concept terms.
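ViTag queries a concept-graph engine for each raw tag; a sketch of such a lookup, with the endpoint and parameter names written from memory of the public Microsoft Concept Graph demo [12], [15] and therefore to be treated as assumptions:

```python
import requests

# Endpoint and parameters as the public demo service exposed them;
# verify availability and signature before relying on this.
CONCEPT_API = "https://concept.research.microsoft.com/api/Concept/ScoreByProb"

def concepts_for_tag(tag, top_k=5):
    """Return {concept: score} for one raw tag, with scores in [0, 1]."""
    resp = requests.get(CONCEPT_API,
                        params={"instance": tag, "topK": top_k},
                        timeout=10)
    resp.raise_for_status()
    return resp.json()  # e.g. {"fruit": 0.52, "company": 0.31, ...} for "apple"
```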
D. Semantic Similarity using Bipartite Graph

Let T be the set of (unique) raw tags obtained and C be the set of (unique) concept terms obtained by querying each raw tag from T against the concept graph engine. A directed bipartite graph G(T, C) with edges E: T → C represents a mapping of raw tags to various concept terms. We label each edge e(t, c) with a score w such that w: E → [0, 1]; the score on an edge represents how likely a concept c is associated with a tag t, and each score is obtained by querying the concept graph engine. We need to identify a set K ⊆ C such that each c ∈ K is associated with a large number of incoming edges from T. To find K, it is important to obtain the relative importance of each c ∈ C. Thus, we need to find a score vector (say V̂) of length equal to the cardinality of C. Once we obtain V̂, it is easy to select the top r entries for some r ∈ N by simply sorting V̂. To compute the value V̂_i for i ∈ C, we sum up the weights of the incoming edges for node i. Formally,

    \hat{V}_i = \sum_{e(u,i) \in E,\; u \in T} W_{ui}.    (1)

Furthermore, it is expected that many of the tags in T will be semantically similar to others. We model this situation within the bipartite graph G by inserting extra edges across pairs of nodes from the set T; due to this construction, G is no longer a bipartite graph. We refer to these newly inserted edges as semantic edges to distinguish them from the edges originally present in G. We insert semantic edges into G in the following two cases: (i) for each pair of tags (t1, t2) ∈ T, we find the semantic similarity score, add semantic edges E(t1, t2) and E(t2, t1), and label them with the semantic similarity score obtained; (ii) for each multi-word tag m ∈ T, we check for the presence of each individual word w within the set T, and if it exists, we add the edge E(m, w) labelled with a score equal to the reciprocal of the total number of words present in m. This allows us to capture the semantic similarity of a multi-word tag. After augmenting graph G with the semantic edges, we revise the score vector V̂ to reflect the changes made in G. To compute the revised value of V̂, we use (2):

    \hat{V}'_i = \hat{V}_i + \sum_{e(u,i) \in E,\; u \in T} \; \sum_{e(x,u) \in E,\; x \in T} S_{xu} \, W_{ui}.    (2)

That is, we revise each entry in V̂ with the product of two weights: (i) the weight W_ui connecting node u ∈ T to node i ∈ C, and (ii) the semantic edge weight S_xu with which node u connects to node x ∈ T. After revising V̂, we sort it in descending order and select the top r entries. The semantic similarity metric used in frameworks such as NLTK [13] fails to capture the semantic similarity among commonly occurring words like iPhone and gadget; we fix this issue by referring to the concept graph engine.
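A sketch of the scoring defined by (1) and (2) over plain dictionaries; the nested-dict encoding of the edge weights W and S is our assumption about how the graph built from the concept-graph responses might be stored:

```python
from collections import defaultdict

def top_concepts(W, S, r=5):
    """Rank concepts by the revised scores of equation (2).
    W[u][i]: weight of tag-to-concept edge e(u, i), u in T, i in C.
    S[x][u]: semantic-edge weight between tags x and u in T."""
    V = defaultdict(float)
    for u, edges in W.items():                 # eq. (1): V_i = sum of W_ui
        for i, w_ui in edges.items():
            V[i] += w_ui
    V_rev = dict(V)
    for u, edges in W.items():                 # eq. (2): add S_xu * W_ui
        boost = sum(row.get(u, 0.0) for row in S.values())  # sum over x
        for i, w_ui in edges.items():
            V_rev[i] += boost * w_ui
    return sorted(V_rev, key=V_rev.get, reverse=True)[:r]
```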
                         
E. Implementation Details

The video segmentation algorithm can be accelerated by parallelizing its computations. We achieve this by applying a classic loop transformation known as loop tiling. We query the host architecture for the total number of processing cores (denoted as p) available on the system, tile the iterations of a parallel loop by a factor of p, and run all the iterations within each tile in parallel. Python3 has been used to implement ViTag. For computing SIFT and SSIM scores, we have used the OpenCV library. Our implementation uses the Google Reverse Image Search engine [14] to obtain raw tags for the key-frames. The conceptual inference heuristic is based on the Microsoft Concept Graph utility [12], [15]. The implementation and datasets are available at [16], [17].
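A sketch of that tiling scheme with Python's standard multiprocessing module; the MSE stand-in for the full MSE/SIFT/SSIM stack and the dispatch plumbing are our assumptions:

```python
import os
from multiprocessing import Pool

def pair_score(pair):
    # Stand-in for the full MSE/SIFT/SSIM similarity of one frame pair.
    a, b = pair
    return float(((a.astype(float) - b.astype(float)) ** 2).mean())

def tiled_scores(frames):
    """Score adjacent frame pairs, processing one tile of p pairs at a
    time in parallel, where p is the number of processing cores."""
    pairs = list(zip(frames, frames[1:]))
    p = os.cpu_count() or 1
    scores = []
    with Pool(processes=p) as pool:
        for start in range(0, len(pairs), p):      # one tile per step
            scores.extend(pool.map(pair_score, pairs[start:start + p]))
    return scores
```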

Domain | No. of videos | Description | Examples
Tourism | 8 | Diverse tourist places, Seven Wonders | Statue of Liberty, Gateway of India
Products | 7 | Product reviews, advertisements | iPhone review, Nike shoes ad
Ceremony | 8 | Popular events, ceremonies, major disasters | Oscar Award functions, Japan tsunami
Famous persons | 10 | Documentaries on famous persons, artists in concerts | Indian leaders, Jennifer Lopez
Entertainment | 9 | Songs from popular movies, TV shows | Abraham Lincoln, Mr. Bean
Speech | 7 | Recent speeches by prominent personalities | Kofi Annan, Barack Obama
Animations | 8 | Popular animation movies, cartoon series | Tom and Jerry, Kung Fu Panda, Frozen
Wildlife | 7 | Videos/documentaries on animal species | Peacock, Butterfly, Kangaroo, Desert
Geography and Climate | 8 | Weather forecasting videos, videos covering maps | Continents of the world, weather forecast
Vehicles | 8 | Famous bikes, cars, and automobiles | Sports bike, Lamborghini car
Science and Education | 8 | Lecture series, videos on general awareness | FPGA, social media effects
Video Games | 8 | Popular computer and mobile games | Counter Strike, Dangerous Dave
Sports | 7 | Videos of popular tournaments | Tennis, Cricket, Football, Chess
Total | 103 | |

Table I: Details of the video dataset used in the evaluation of ViTag.

III. RESULTS AND EVALUATION

We have used a video collection created from YouTube.com for our experiments. While creating the collection, we studied the various video categories available on YouTube.com: it organizes videos into 31 different categories, many of which we merged, arriving at 13 distinct domains. For each domain, we selected videos based upon the collective opinion of each person in the group. For each popular content item, we selected a random video obtained by searching the website. We selected videos with lengths between 50 seconds and 4 minutes. A total of 103 videos have been collected, and each domain consists of approximately 8 videos. Table I describes the details. We have tagged each video using ViTag. Furthermore, by using the natural language processing package NLTK, we are able to reason whether a tag contains proper nouns or not; this information enables us to rank the tags. We have also used reciprocal rank as another metric for evaluation.
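A sketch of the proper-noun check with NLTK's part-of-speech tagger [13]; sorting tags by their NNP count is our reading of the ranking rule described above:

```python
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS-tagger model

def proper_noun_count(tag_text):
    """Number of tokens tagged as proper nouns (NNP/NNPS)."""
    tokens = nltk.word_tokenize(tag_text)
    return sum(1 for _, pos in nltk.pos_tag(tokens) if pos.startswith("NNP"))

def rank_tags(tags):
    # Tags containing more proper nouns are ranked higher.
    return sorted(tags, key=proper_noun_count, reverse=True)

print(rank_tags(["famous landmark", "Statue of Liberty", "Eiffel Tower night view"]))
```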
Figure 4: Tag precision recorded across various domains.

Fig. 4 shows the mean precision attained for each video domain. The mean has been computed by taking the geometric mean of the precision values attained for all videos belonging to a particular domain. For the Animations category, almost 77% of the generated tags are precise. For videos belonging to product reviews and advertisements, ViTag obtains a minimal precision of 57.36%. For videos belonging to domains like Tourism, Wildlife, Animations, and events/ceremonies, ViTag attains more than 70% precision. The summary of tag precision is presented in Table III.

It is also important to investigate how many videos lie in a particular precision interval. Fig. 5 shows the number of videos attaining each precision interval. ViTag attains full (100%) precision for 11.5% of the videos. For 70 out of 103 videos, it is able to generate more than 60% relevant tags. Fig. 6 summarizes the accuracy of ViTag. The plot is obtained by sorting the precision values of all the videos in descending order. The rectangle with the dashed line shows the ideal accuracy, i.e., a precision of one for all the videos in the collection. Around 55% of the ideal accuracy is achieved by ViTag.

We estimate the effectiveness of the conceptual inference heuristic using a second metric, binary relevance, for the inferred tags. For 4 out of 103 videos, the conceptual inference heuristic cannot infer any extra tag. For 43 out of the remaining 99 videos, the heuristic has inferred meaningful tags; it has inferred vague tags for the remaining 56 videos. Table II lists videos for which conceptual inference generates meaningful tags; the table also depicts cases where it fails to infer relevant tags.

 | Domain | Sample video description | Some raw tags | Auto-inferred tags
Positive Conceptual Inference | Famous Persons | Indian freedom fighters | Mahatma Gandhi, Bhagat Singh, freedom fighters | leader, person
 | Wildlife | Dancing peacock | peacock, peafowl | bird
 | Tourism | Eiffel Tower | france eiffel tower, eiffel tower night view | famous landmark, sight
Negative Conceptual Inference | Ceremony | Christian wedding | woman, event, female, facial expression | group, information
 | Geography | Weather forecast | map, planet, world, earth | object, material
 | Product | iPhone review | gadget, iPhone, Screenshot | item, factor

Table II: Effectiveness of the conceptual inference heuristic.

Figure 5: Number of videos vs. tag-precision interval.

Figure 6: Tag precision for the entire video collection in sorted order. The bounding rectangle depicts ideal accuracy.

 | Tag Precision | Reciprocal Rank
Geometric mean | 0.6467 | 0.873
Arithmetic mean | 0.6491 | 0.905
Median | 0.6389 | 1.0

Table III: Summary with Tag Precision and Reciprocal Rank as the evaluation metrics.
We have also evaluated ViTag based on the reciprocal rank metric. Fig. 7 shows the reciprocal rank for all the videos in the collection, sorted in decreasing order. As seen from the figure, ViTag covers 85% of the ideal-case scenario. Table III summarizes the statistical results with reciprocal rank as the evaluation metric.

In summary, our observations are as follows. For the entire collection comprising 103 videos, ViTag has generated 696 tags, out of which 456 tags are precise. It attains about 65.51% accuracy using precision as a metric. In 43.4% of the cases, the conceptual inference heuristic has inferred valuable generic tags. ViTag has obtained 87.3% accuracy as per the reciprocal rank metric. We believe the evaluation results show enough potential in our approach; however, we think there is scope for improvement, as discussed in the future work. A snapshot of the user interface of ViTag is presented in Fig. 8.
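The summary statistics above reduce to a few lines; a sketch with the reciprocal rank defined in the usual way (the per-video precision values below are illustrative placeholders, not the measured data):

```python
from statistics import geometric_mean, mean, median  # Python 3.8+

def reciprocal_rank(ranked_tags, relevant):
    """1/rank of the first relevant tag, 0.0 if none is relevant."""
    for rank, tag in enumerate(ranked_tags, start=1):
        if tag in relevant:
            return 1.0 / rank
    return 0.0

precisions = [0.9, 0.64, 0.7, 0.5, 0.62]  # placeholder per-video values
print(geometric_mean(precisions), mean(precisions), median(precisions))
print(reciprocal_rank(["bird", "peacock"], relevant={"peacock"}))  # 0.5
```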

Figure 7: Reciprocal rank for all videos of the dataset shown in sorted order. The bounding rectangle depicts the ideal case.

Figure 8: A snapshot of the user interface of ViTag.

IV. CONCLUSIONS AND FUTURE WORK

We propose an analytical, end-to-end, and fully automatic approach to the problem of automatic video tagging. Our approach exploits a combination of various image similarity metrics to select key-frames containing dissimilar information from the input video. We then use a reverse image tagging engine to generate raw tags for the input video, and we infer generic tags using a conceptual inference heuristic that leverages the semantic similarity among tags. We have evaluated our implementation on an open collection comprising 103 videos belonging to 13 domains derived from various YouTube categories. Our implementation has obtained 65.51% precision and 87% accuracy using reciprocal rank as a metric. Our approach is not video-domain specific, and it needs neither a pre-tagged video dataset nor training. This makes it practical and complementary to state-of-the-art approaches.

We would like to build a deep neural network driven reverse image tagger to improve the accuracy of tag generation. We would also like to explore various natural language processing techniques to detect and eliminate non-relevant tags. For the conceptual inference heuristic, we would like to introduce a scoring mechanism to reason about the profitability of adding extra generic tags. We would also like to explore parameter tuning, which may have a positive impact on the existing accuracy. In addition, we would like to make ViTag run on real-time multimedia video collections (such as www.YouTube.com). We think the current implementation stands as a good starting point for exploring the above aspects.

REFERENCES

[1] Stefan Siersdorfer, Jose San Pedro, and Mark Sanderson, "Automatic video tagging using content redundancy," in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2009, pp. 395–402.

[2] Jose San Pedro, Stefan Siersdorfer, and Mark Sanderson, "Content redundancy in YouTube and its application to video tagging," ACM Transactions on Information Systems, vol. 29, no. 3, pp. 13, 2011.

[3] Emily Moxley, Tao Mei, Xian-Sheng Hua, Wei-Ying Ma, and B. S. Manjunath, "Automatic video annotation through search and mining," in Proceedings of the IEEE International Conference on Multimedia and Expo. IEEE, 2008, pp. 685–688.

[4] George Toderici, Hrishikesh Aradhye, Marius Pasca, Luciano Sbaiz, and Jay Yagnik, "Finding meaning on YouTube: Tag recommendation and category discovery," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 3447–3454.

[5] Damian Borth, Adrian Ulges, Christian Schulze, and Thomas Breuel, "Keyframe extraction for video tagging & summarization," pp. 45–48, 2008.

[6] Ting Yao, Tao Mei, Chong-Wah Ngo, and Shipeng Li, "Annotation for free: Video tagging by mining user search behavior," in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 977–986.

[7] Jialie Shen, Meng Wang, and Tat-Seng Chua, "Accurate online video tagging via probabilistic hybrid modeling," Multimedia Systems, vol. 22, no. 1, pp. 99–113, 2016.

[8] Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff Donahue, and Sarah Tavel, "Visual search at Pinterest," 2015.

[9] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," in Proceedings of the British Machine Vision Conference. BMVA Press, 2002, pp. 36.1–36.10, doi:10.5244/C.16.36.

[10] Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman, and William T. Freeman, "Discovering objects and their location in images," in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2005, vol. 1, pp. 370–377.

[11] David Nister and Henrik Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2006, vol. 2, pp. 2161–2168.

[12] Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, "An inference approach to basic level of categorization," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015, pp. 653–662.

[13] Edward Loper and Steven Bird, "NLTK: The natural language toolkit," in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, pp. 63–70.

[14] Google Inc., Google Reverse Image Search.

[15] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu, "Probase: A probabilistic taxonomy for text understanding," in Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2012, pp. 481–492.

[16] ViTag: automatic video tagger.

[17] ViTag evaluation: video collection.
