Assisted video sequences indexing: shot detection and motion analysis based on interest points


                        Emmanuel Etiévent (corresponding author)
                                  Frank Lebourgeois
                                  Jean-Michel Jolion
                       Labo. Reconnaissance de Formes et Vision

                                 etievent@rfv.insa-lyon.fr
                                INSA de Lyon, Bât. 403
                                20, Avenue Albert Einstein
                                69621 Villeurbanne cedex

♦ Keywords
    Video indexing; motion analysis; interest points; shot cut detection.

♦ Abstract
    This work deals with content-based video indexing. It is part of a multidisciplinary
project on television archives. We focus on semi-automatic analysis of compressed video,
mainly as a means of assisting semantic indexing, i.e. we take into account the interaction
between automatic analysis and the operator. First, we have developed such an assistant
for shot cut detection, using adaptive thresholding.
     Then, we have considered the possible applications of motion analysis and moving
object detection: assisting moving object indexing, summarising videos, and allowing
image and motion queries. We propose an approach based on interest points, specifically
on a multiresolution contrast-based detector, and we test different types of interest point
detectors. This approach does not require a full spatiotemporal segmentation.
INTRODUCTION
        Video indexing consists in describing the content of audiovisual sequences from a
video database to allow their retrieval: this concerns television archives, digital libraries,
video servers, and digital broadcasting. As with text document indexing, the purpose is to
allow content-based retrieval instead of relying only on a bibliographic record. Video
content includes, for instance, characters, objects, dialogues, and specific events
occurring in a video. Video content can be described at two complementary levels:
•     The semantic description, which is the level the user understands: it allows concept-
based retrieval and usually requires human interpretation. However, some aspects can be
assisted by automatic analysis of the video (and sound track), such as finding the
sequence structure (shots and scenes), analysing camera and object motion, detecting
and recognising characters, and recognising speech.
•     The physical characterisation of images, objects, and also motion: extracting visual
features allows retrieval based on visual similarity by comparing the features. This
approach is only a complement to the previous one because it has no direct relation to
the user's conception, which is based on the semantic level. However, this method is
automatic, and it allows the use of an image or a sketch for queries by example: this is
useful for searching for a specific object and for exploiting visual features, such as shape
or texture, which are difficult to describe in semantic terms.
        The work we present in this paper is part of the Sésame1 project (Audiovisual
Sequences and Multimedia Exploration System). Target users are information officers for
the indexing part and, for instance, journalists for the retrieval part. The aim is to use
richer information than the bibliographic records used nowadays for operations like
retrieving, browsing, analysing or editing videos. Since we do not restrict the type of
videos (films, reports, news, TV programmes), we have to rely on the interpretation
capacity of the operator, which is very difficult to model. We work on MPEG-compressed
video, which is a requirement for realistic applications.
        The project involves several complementary fields: knowledge modelling to
organise video annotation [Prié 98], databases for storing and querying this complex
annotation structure and image features [Decleir 98], high-performance parallel

1   This work is partially supported by France Télécom (through CNET/CCETT), research contract No. 96 ME 17.

architectures [Mostéfaoui 97], and semi-automatic image analysis [Lebourgeois 98]. An
integration prototype will allow a real-world evaluation.
      We present here two different aspects: shot detection, and a prospective study of
motion analysis applied to indexing.

1. SHOT DETECTION
      A shot is a video segment defined by the continuity of camera shooting. Shots result
from video editing operations like cutting, assembling, or introducing transition effects.
      Shot detection is based on shot transition detection. Indeed, cuts introduce a
discontinuity which can be detected in an image feature or in a similarity measure
between two frames. We use a classical similarity measure based on the histogram
difference on each colour plane, with an adaptive threshold2 [Faudemay 97] to take into
account the local variability of the measure within a shot. The threshold computation is
based on the standard deviation in a window centred on each point, excluding the
considered point. Tests show that using a window of 5+5 frames does not disturb the
thresholding, so short shots are not missed (a half-second shot is possible, for instance,
in a film announcement).
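
      As an illustration, here is a minimal sketch (in Python) of the two ingredients just
described: the per-plane histogram difference and the windowed adaptive threshold. It is
not our exact implementation; the number of bins and the deviation factor k are assumed
values.

    import numpy as np

    def histogram_difference(frame_a, frame_b, bins=64):
        """Sum of absolute histogram differences, one histogram per colour plane."""
        d = 0.0
        for c in range(frame_a.shape[2]):
            ha, _ = np.histogram(frame_a[..., c], bins=bins, range=(0, 256))
            hb, _ = np.histogram(frame_b[..., c], bins=bins, range=(0, 256))
            d += np.abs(ha - hb).sum()
        return d / frame_a[..., 0].size    # normalise by the number of pixels

    def adaptive_threshold(measures, half_window=5, k=3.0):
        """Threshold for each frame, based on the deviation of the measure in a
        5+5 frame window centred on the frame and excluding the frame itself."""
        measures = np.asarray(measures, dtype=float)
        thresholds = np.empty(len(measures))
        for i in range(len(measures)):
            lo = max(0, i - half_window)
            hi = min(len(measures), i + half_window + 1)
            neighbours = np.concatenate([measures[lo:i], measures[i + 1:hi]])
            thresholds[i] = neighbours.mean() + k * neighbours.std()
        return thresholds

    # A cut is hypothesised wherever the measure exceeds its local threshold:
    # candidates = np.flatnonzero(measures > adaptive_threshold(measures))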

1.1. Semi-automatic operation
      As said before, a specific point of our approach is to involve the operator to validate
results. Only uncertain cases need validation (this is known as rejection in decision
theory). The uncertain cases are determined according to a tolerance, which the operator
can tune if the default one does not suit the analysed video:

            Figure 1: Thresholding with tunable rejection. A tunable tolerance band around
     the computed threshold splits the detection probability axis into three zones: certain
     event, uncertain cases (submitted to the operator), and no event.

2   An additional fixed and loose threshold would be useful anyway to avoid false alarms in shots where the
    variability is very low and an insignificant perturbation creates a small peak.

We consider several constraints:
• Computations and validation are performed asynchronously, to avoid waiting times.
• The default rejection tolerance is set quite high, to avoid missing difficult shot cuts.
• The operator should be able to tune the tolerance at any time, depending on the results.
      As a consequence, instead of binary answers, the computation step provides a
detection probability, which allows the determination of uncertain cases with a tunable
tolerance at validation time (see the sketch below). Generally speaking, for other types of
semi-automatic analysis, the analysis of the raw results should be independent of the raw
computations, so that the operator can play a role in the analysis.
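
      A minimal sketch of this three-way decision, assuming the detection probability, the
computed threshold and the tolerance (for instance the +/-20% used in section 1.2) are
the only inputs:

    def classify(probability, threshold, tolerance=0.2):
        """Three-way decision with a tunable rejection band around the
        threshold; cases inside the band are deferred to the operator."""
        if probability > threshold * (1.0 + tolerance):
            return "certain event"
        if probability < threshold * (1.0 - tolerance):
            return "no event"
        return "uncertain case"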

Figure 2: Examples of shot cut detection, with start frame and end frame for each shot (see frame
        numbers). A shot cut at frame 499, within motion, needs validation. (INA archives)

    1.2. Comparison of detection methods
         Many methods have been developed. The following table gathers results from
    several works for which numeric results have been published (note that they are
    evaluated on different sequences, and some are based on little data).

                                     Gradual transitions       Shot cuts
Author         Method                missed  false  nb      missed  false  nb    Video type
[Yeo 95]       image difference      (14%)   (57%)  7       7%      7%     41    1 MPEG report
               with a frame step
[Corridoni 95] image     (dissolves) (20%)   (20%)  4       3%      3%     181   films, ads
               ratio     (fades)     0       0      29
[Joly 94]      variation type of     (0%)    (17%)  18      (1%)    (2%)   306   films
               individual pixels
[Zabih 95]     edge matching         (0)     (27%)  11      2.5%    12%    118   short MPEG videos
[Shen 97]      edge matching and     8%      (4%)   98      4%      (4%)   187   clips, films, television
               motion compensation                                               (with motion)

[Xiong 96]                          grey level  colour
               pairwise             5/4         2/15    = number of missed /    short sequences with
               likelihood           10/48       XX      false cuts (for 3864    motion and
               global histogram     14/78       9/46    frames; optimised       perturbations
               local histogram      7/66        3/2     thresholds)
               net comparison       3/1         0/0
(our method)   histogram and        XX          1/2     for 2284 frames and     MPEG report and film
               adaptive threshold                       37 cuts                 announcement
           If we use rejection, with a tolerance of +/-20% of the threshold, 20 cases need
    validation, including 7 due to motion and 3 due to fades.
           [Yeo 95] and [Shen 97] work directly in the MPEG compressed domain and are
    very efficient. Gradual transitions are less studied than cuts (they are less common in
    videos) and need improvement, though the last method claims quite good results. One
    concern is motion, which modifies the images and hence causes variations in the shot
    transition detection measures. We will come back to this point in the next section.
           For a less biased comparison, note that the LIP6 lab of Paris 6 University, France,
    is now comparing algorithms on a common video base containing one hundred hours.

    2. MOTION AND MOVING OBJECT ANALYSIS FOR VIDEO INDEXING

    2.1. Assisting video description
           Semi-automatic motion analysis and moving object detection can simplify several
    tasks of video description.

    ♦ Object temporal presence
        Object tracking automates the detection of the interval where an object is present.
    This applies to objects selected manually, to objects detected by their motion, and to
    special cases like face detection (which works only for front views, so tracking the
    detected faces recovers the moments when characters turn their heads). A further step
    consists in comparing all the detected objects to check their recurrence along the video.

♦ Summarising videos
    Summarising shots gives condensed views of videos. Shots are described by a fixed
background and objects in motion (plus a sound track summary). In case of camera
motion, the background is described by several images or by a reconstructed view
(images are warped back to compensate for the motion [Taniguchi 97]). Objects are
characterised by their motion, and optionally by several sufficiently different views.

♦ Camera motion
    Camera motion carries meaning with regard to the film structure. It is derived from
global motion parameters [Xiong 97], which are also computed for object motion.

♦ Shot transitions
    Another shot transition detection method relies on detecting motion discontinuities
[Gelgon 97]: transition detection becomes more robust to large motions, and it avoids
preliminary computations with a separate transition detection algorithm.

2.2. Image queries
     When objects and background are separated, features extracted from them allow
similarity retrieval [Benayoun 98]. The operator can select the most significant elements
to index.

2.3. Motion queries
     The first step is to establish what can be useful for motion queries: using a track for
queries by example based on a video sample or a sketch? Or describing motion more
simply and more semantically, with words? That is:
• significant motion (as opposed to static shots; this is useful for navigating the video),
• motion features (horizontal or vertical motion, depth motion, speed, regularity),
• motion events, like a start or a change of direction,
• interaction between objects [Delis 98] [Courtney 97].
     This means defining classes, with the problem of determining the limits between them.

3. INTEREST POINTS, MOTION AND OBJECTS FOR VIDEO INDEXING

3.1. The tool: interest points
     Our lab has worked on interest points [Bres 99]; here is a brief overview. Interest
points are defined by two-dimensional signal variations in their neighbourhood, for
instance at corners, as opposed to the 1D variation of basic edges. They describe an
image by a small set of points; therefore they allow fast image comparison and compact
storage. That is why they are used for image matching, in robotics, and also for image
indexing [Schmid 97] (see 3.2.1, Computing motion).
     We use three detection algorithms (see the [Jolion 98] web site): the Plessey
detector [Harris 88], Susan [Smith 97], and the multiresolution contrast detector [Bres 99].
The first two are based on geometric models, which are well adapted to corner detection,
while the latter is not and is more appropriate for natural images. The Susan detector is
much faster than the others but is not very robust to JPEG compression effects [Bres 99],
which raises doubts for MPEG videos.
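
     For reference, here is a minimal sketch of the Plessey detector response, following
the standard formulation of [Harris 88]; the smoothing scale, the constant k = 0.04 and
the thresholding are conventional choices, not parameters of our system.

    import numpy as np
    from scipy.ndimage import gaussian_filter, maximum_filter, sobel

    def harris_response(image, sigma=1.5, k=0.04):
        """Plessey/Harris corner response R = det(M) - k * trace(M)^2,
        where M is the locally smoothed structure tensor."""
        ix = sobel(image.astype(float), axis=1)
        iy = sobel(image.astype(float), axis=0)
        ixx = gaussian_filter(ix * ix, sigma)
        iyy = gaussian_filter(iy * iy, sigma)
        ixy = gaussian_filter(ix * iy, sigma)
        return ixx * iyy - ixy ** 2 - k * (ixx + iyy) ** 2

    def interest_points(image, threshold_ratio=0.01):
        """Keep the local maxima of the response above a fraction of its peak."""
        r = harris_response(image)
        peaks = (r == maximum_filter(r, size=9)) & (r > threshold_ratio * r.max())
        return np.argwhere(peaks)    # (row, col) coordinates
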
     For videos, matching interest points from one image to the next within a shot gives
motion vectors, which are the basis for motion analysis. This method should be fast
compared to pixel-based methods (optical flow or spatiotemporal segmentation) or to
more complex matching (edges, curvature points).

3.2. Interest points, motion and objects
     Figure 3 shows the temporal superposition of interest points (the points of the first
frames appear darker), next to one of the original images.

                              Figure 3: Rotating dancer.

3.2.1. Computing motion

♦ Point cluster tracking
    In special cases with well-defined objects, interest points are grouped into clusters
corresponding to objects or parts of objects. A fast method consists in clustering the set
of points (with morphological methods, for instance, as in the sketch below) and following
the clusters. A consistency measure is then needed to detect difficult cases and apply a
more powerful method (for instance, motion consistency over a given duration).
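
    A minimal sketch of such a clustering, assuming the points are given as (row, col)
coordinates in an image of known shape; the dilation radius is an illustrative parameter.

    import numpy as np
    from scipy.ndimage import binary_dilation, label

    def cluster_points(points, shape, radius=8):
        """Morphological clustering: mark the interest points in a binary
        image, dilate so that nearby points merge, and take the connected
        components as point clusters (candidate objects or object parts)."""
        mask = np.zeros(shape, dtype=bool)
        mask[points[:, 0], points[:, 1]] = True
        blobs = binary_dilation(mask, iterations=radius)
        labelled, n_clusters = label(blobs)
        return labelled[points[:, 0], points[:, 1]], n_clusters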

♦ Point matching
    Many methods exist, for instance in robotics (edge or corner matching in artificial
images; stereovision [Cédras 93] [Serra 96]). For robustness, tracking should take into
account several frames.

                                                                                         6
      Comparing local measures associated with interest points that are robust to noise,
geometric transforms and masking, such as the differential invariants of [Schmid 97],
improves the matching. As a comparison with differential invariants, we are testing the
invariance of the multiresolution contrast (see the sketch below).
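
     A minimal sketch of this matching step, assuming each point carries a small invariant
descriptor vector and that candidate matches are restricted to a spatial search radius; the
radius and the distance threshold are illustrative values.

    import numpy as np

    def match_points(pts_a, desc_a, pts_b, desc_b, radius=20.0, max_dist=0.5):
        """Pair each point of frame A with the closest point of frame B in
        descriptor space, restricted to a spatial neighbourhood. The pairs
        (index_a, index_b) define sparse motion vectors."""
        matches = []
        for i, (p, d) in enumerate(zip(pts_a, desc_a)):
            near = np.linalg.norm(pts_b - p, axis=1) <= radius
            if not near.any():
                continue
            idx = np.flatnonzero(near)
            dist = np.linalg.norm(desc_b[idx] - d, axis=1)
            if dist.min() <= max_dist:
                matches.append((i, idx[dist.argmin()]))
        return matches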

3.2.2. Interest points for video indexing
      Interest points support all the tasks we described in section 2, "Motion and moving
object analysis for video indexing". Let us focus on some of them.

♦ Moving object tracking
   The purpose is to determine the time interval where an object is present. Object
motion is obtained by compensating for the global motion. Rigid object detection is based
on the similar motion of the object's points. It is more difficult with non-rigid objects, and,
given the variety of the considered videos, detection cannot be perfect. Therefore an
operator has to validate and correct the results.
         For an indexing system, we consider different modes: batch processing, or more
interactive operation. In either case, to avoid waiting times, it is far preferable for the
interaction steps and computation steps to work independently on a whole video segment
rather than object by object. Note that for an approximate object display, showing a
region containing the interest points is enough for human understanding.

        ❑ On-demand analysis
         In case the operator is interested in only a part of the objects (the most significant)
and does not want to run a full computation, we have the following steps:
• outlining manually all these objects, in one frame each,
• extracting and tracking the interest points included in these regions,
• asking the operator to validate and correct the ambiguous cases (with a display of the
object at the beginning and at the end of its trajectory, to check that it is the same).

        ❑ Batch analysis of a sequence
         Here, the computation step on the whole sequence includes extracting the interest
points, computing the motion, and grouping points according to motion similarity to detect
objects (a sketch of this grouping follows); the validation step is then as described above.
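
         A minimal sketch of the grouping step, under the assumption of a greedy criterion:
a point joins a group when its motion vector is close enough to the group's representative
vector (the actual criterion may also integrate motion consistency over several frames, as
discussed in 3.2.1).

    import numpy as np

    def group_by_motion(vectors, tol=2.0):
        """Greedy grouping of tracked points whose motion vectors are similar;
        after global motion compensation, each group is a candidate moving
        object. Returns one group label per point."""
        labels = []
        groups = []    # representative vector of each group
        for v in np.asarray(vectors, dtype=float):
            for g, rep in enumerate(groups):
                if np.linalg.norm(v - rep) <= tol:
                    labels.append(g)
                    break
            else:
                groups.append(v)
                labels.append(len(groups) - 1)
        return labels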

♦ Moving object characterisation
   Interest points and the associated invariants are a way of characterising objects,
for:
• classifying similar objects from a video, to assist the process of naming the objects,
• querying a video database by example. We need to store only several sufficiently
different views of an object from the whole sequence (or even none, if the object is
already indexed and has similar views already stored3).
      Characterising objects needs more accuracy than tracking. First, the interest point
thresholding can be adapted to the object, to get more points locally. Then, the operator
must now also correct the detected object shapes if they overlap other objects4.
      We emphasise the fact that the whole process does not need a full spatiotemporal
segmentation at the pixel level.

3   For that, a classification of the whole (and huge) database is not needed, since we can reach the other
    possible instances of an already indexed object through the semantic annotations database.
4   But if some parts have no interest points, it does not matter to add them, because they will not play any
    role in similarity queries.

3.3. Comparing interest point detectors
      From one image to the next, interest points change because of MPEG coding, object
distortions, and background variations when the object moves (which modify the local
invariants associated with points on the edge of the object). Matching requires reasonably
stable points (in the number and location of the points, and in the stability of the
invariants). As a first step, we compare the temporal and spatial stability of interest point
detectors with a simple matching algorithm and global or small motion (by studying the
variability of the motion vectors within one frame). A second step consists in comparing
the results of a real tracking algorithm.
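
      The first stability indicator can be computed directly from the per-frame point
counts; a minimal sketch, assuming counts already excludes the frames at shot cuts (as
in Figure 4):

    import numpy as np

    def rate_of_change(counts):
        """Percentage change of the number of interest points between
        consecutive frames (the statistic histogrammed in Figure 4)."""
        counts = np.asarray(counts, dtype=float)
        return 100.0 * np.abs(np.diff(counts)) / np.maximum(counts[:-1], 1.0)

    # Histogram over 0-100%, as plotted in Figure 4:
    # hist, edges = np.histogram(rate_of_change(counts), bins=20, range=(0, 100))
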
      The one-minute report in Figure 4 shows small variations in general (it uses the
multiresolution contrast detector and a fixed threshold; the frames associated with shot
cuts are removed). We plan to test quite long video sequences from the television
archives of the French National Audiovisual Institute (INA).

            Figure 4: Histogram of the rate of change (%) of the number of interest
                                points between two frames.

CONCLUSION
      We have developed a shot cut detection assistant, using adaptive thresholding and
taking into account the interaction with the operator. Concerning motion analysis, we
have considered the possible applications for video indexing: assisting moving object
indexing, summarising videos, and allowing image and motion queries. We have
proposed an approach based on interest points, specifically with a multiresolution
contrast-based detector, for analysing motion and for detecting and characterising
objects; this approach does not require a full spatiotemporal segmentation. Experimental
results will be presented at the conference and included in the final version of the paper.

REFERENCES

♦ Sésame
[Decleir 98] C. Decleir, M.S. Hacid, J. Kouloumdjian (1998) A Generic Model For Video
      Content Based Retrieval; Symposium on Applied Computing, ACM, 458-459.
[Mostéfaoui 97] A. Mostéfaoui, L. Brunie (1997) Exploiting data structures in a High
      Performance Video Server for TV Archives; Digital Media Information Base
      (DMIB'97), ACM-SIGMOD, Ed. World Scientific, 159-166.
[Prié 98] Y. Prié, A. Mille, J.M. Pinon (1998) AI-STRATA: A User-centered Model for
      Content-based Description and Retrieval of Audiovisual Sequences; First Int.
      Advanced Multimedia Content Processing Conf., 143-152.
[Lebourgeois 98] F. Lebourgeois, J.M. Jolion, P. Awart (1998) Toward a Video
      Description for Indexation; 14th IAPR Int. Conf. on Pattern Recognition, Brisbane,
      August 1998, vol. I, 912-915.
♦ Shot detection
[Corridoni 95] M. Corridoni, A. Del Bimbo (1995) Automatic Video Segmentation through
      Editing Analysis; Technical Report, Firenze University,
      http://www.nzdl.org/cgi-bin/gw?a=targetdoc&c=cstr&z=sw3E2P4hwhzy&d=6975.
[Faudemay 97] P. Faudemay, L. Chen, C. Montacié, M.J. Caraty, X. Tu (1997)
      Segmentation multi-canaux de vidéos en séquences; Coresa 97.
[Joly 94] P. Joly, P. Aigrain (1994) The Automatic Real-Time Analysis of Film Editing and
      Transition Effects and its Applications; Computers & Graphics, Vol. 18, No. 1, 1994,
      93-103.
[Shen 97] Bo Shen (1997) HDH Based Compressed Video Cut Detection; HPL-97-142
      971204 External, http://www.hpl.hp.com/techreports/97/HPL-97-142.html.
[Xiong 96] W. Xiong, J. Chung-Mong Lee, R.H. Ma (1996) Automatic Video Data
      Structuring through Shot Partitioning and Key Frame Computing; Technical report,
      http://www.nzdl.org/cgi-bin/gw?a=targetdoc&c=cstr&z=44Cx2P4hwhzy&d=22080.
[Yeo 95] B.L. Yeo, B. Liu (1995) Rapid scene analysis on compressed video; IEEE
      Transactions on Circuits and Systems for Video Technology, vol. 5, 533-544.
[Zabih 95] R. Zabih, J. Miller, K. Mai (1995) A Feature-Based Algorithm for Detecting and
      Classifying Scene Breaks; ACM Multimedia 1995,
      http://simon.cs.cornell.edu/Info/People/rdz/dissolve.html.
♦ Interest points
[Bres 99] S. Bres, J.M. Jolion (1999) Detection of Interest Points for Image Indexation;
      Visual'99, Amsterdam, June 2-4, http://rfv.insa-lyon.fr/~jolion/PS/visual99.ps.gz.
[Harris 88] C. Harris, M. Stephens (1988) A combined corner and edge detector; Proc. of
      the 4th Alvey Vision Conf., 147-151.
[Jolion 98] Interest points demo: http://rfv.insa-lyon.fr/~jolion/Cours/ptint.html.
[Schmid 96] C. Schmid (1996) Appariement d'images par invariants locaux de niveaux de
      gris; PhD thesis, INP Grenoble.

[Schmid 97] C. Schmid, R. Mohr (1997) Local Grayvalue Invariants for Image Retrieval;
      IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(5), 530-535.
[Smith 97] S.M. Smith, J.M. Brady (1997) SUSAN - A New Approach to Low Level Image
      Processing; Int. Journal of Computer Vision, 23(1), 45-78.
♦ Motion
[Benayoun 98] S. Benayoun, H. Bernard, P. Bertolino, P. Bouthemy, M. Gelgon, R. Mohr,
      C. Schmid, F. Spindler (1998) Structuration de vidéos pour des interfaces de
      consultation avancées; Coresa 98, 205.
[Cédras 93] C. Cédras, M. Shah (1993) Motion-Based Recognition: a Survey; technical
      report, http://www.nzdl.org/cgi-bin/Kniles?c=cstr&d=7153.
[Courtney 97] J.D. Courtney (1997) Automatic video indexing via object motion analysis;
      Pattern Recognition 1997.
[Delis 98] V. Delis, D. Papadias, N. Mamoulis (1998) Assessing Multimedia Similarity;
      ACM Multimedia 98, Session 7 C: Content-Based Retrieval Systems,
      http://www.acm.org/sigmm/MM98/electronic_proceedings/delis/index.html.
[Gelgon 97] M. Gelgon, P. Bouthemy, F. Ganansia (1997) A Unified Approach to Shot
      Change Detection and Camera Motion Characterization; Technical Report RR-3304,
      INRIA Rennes, http://www.inria.fr/RRRT/RR-3304.html.
[Serra 96] B. Serra (1996) Reconnaissance et localisation d'objets cartographiques 3D en
      vision aérienne dynamique; PhD thesis, Université de Nice, 150-185.
[Taniguchi 97] Y. Taniguchi, A. Akutsu, Y. Tonomura (1997) Panorama Excerpts:
      Extracting and Packing Panoramas for Video Browsing; ACM Multimedia 97,
      http://www1.acm.org:81/sigmm/MM97/papers/taniguchi/tani.html.
[Xiong 97] W. Xiong, J.C.M. Lee (1997) Efficient Scene Change Detection and Camera
      Motion Annotation for Video Classification; Technical Report HKUST-CS97-16,
      http://www.nzdl.org/cgi-bin/gw?a=targetdoc&c=cstr&z=2Ess2P4hwhzy&d=22748.
