Assisted video sequences indexing: shot detection and motion analysis based on interest points
Emmanuel Etiévent (corresponding author)
Frank Lebourgeois
Jean-Michel Jolion
Labo. Reconnaissance de Formes et Vision
etievent@rfv.insa-lyon.fr
Insa Lyon, INSA - bat 403
20, Avenue Albert Einstein
69621 Villeurbanne cedex
♦ Keywords
Video indexing ; motion analysis ; interest points ; shot cut detection.
♦ Abstract
This work deals with content-based video indexing. It is part of a multidisciplinary
project about television archives. We focus on semi-automatic compressed video analysis
mainly as a means of assisting semantic indexing, i.e. we take into account interaction
between automatic analysis and the operator. First, we have developed such an assistant
for shot cut detection, using adaptive thresholding.
Then, we have considered the possible applications of motion analysis and moving
object detection : assisting moving object indexing, summarising videos, and allowing
image and motion queries. We propose an approach based on interest points, specifically
with a multiresolution contrast-based detector, and we test different types of interest point
detectors. This approach does not require a full spatiotemporal segmentation.
INTRODUCTION
Video indexing consists in describing the content of audiovisual sequences from a
video database to allow their retrieval: this concerns television archives, digital libraries,
video servers, and digital broadcasting. As with text document indexing, the purpose is to
allow content-based retrieval instead of relying only on a bibliographic record. Video content
includes, for instance, characters, objects, dialogues, and specific events occurring in a video.
Video content can be described at two complementary levels :
• The semantic description, which is the level the user understands : it allows concept-
based retrieval and usually requires human interpretation. However, some aspects can be
assisted by automatic analysis of the video (and sound track), like finding the sequence
structure (shots and scenes), analysing camera and object motion, detecting and
recognising characters, and recognising speech.
• The physical characterisation of images, objects, and also motion : extracting visual
features allows retrieval based on visual similarity by comparing the features. This
approach is only a complement to the previous one because there is no direct relation
with the user's conception, which is based on the semantic level. However, this method is automatic,
and it allows the use of an image or a sketch for queries by example : this is useful for
searching for a specific object and for exploiting visual features, like shape or texture, which
are difficult to describe by semantic means.
The work we present in this paper is part of the Sésame¹ project (Audiovisual
Sequences and Multimedia Exploration System). Target users are information officers for
the indexing part and, for instance, journalists for the retrieval part. The aim is to use more
complete information than the bibliographic records used nowadays for operations like
retrieving, browsing, analysing or editing videos. Since we do not restrict the type of
videos (films, reports, news, TV programs), we have to rely on the interpretation capacity of
the operator, which is very difficult to model. We work on Mpeg compressed video, which is
compulsory for realistic applications.
The project involves several complementary fields : knowledge modelling to
organise video annotation [Prié 98], databases for storing and querying this complex
annotation structure and image features [Decleir 98], high performance parallel
architectures [Mostéfaoui 97], and semi-automatic image analysis [Lebourgeois 98]. An
integration prototype will allow an actual evaluation.
¹ This work is partially supported by France Télécom (through CNET/CCETT), research contract N° 96 ME 17.
We present here two different aspects : shot detection, and a prospective study
about motion analysis applied to indexing.
1. SHOT DETECTION
A shot is a video segment defined by the continuity of camera shooting. Shots result
from video editing operations like cutting, assembling or introducing transition effects.
Shot detection is based on shot transition detection. Indeed, cuts introduce a
discontinuity which can be detected through the discontinuity of an image feature or of a
similarity measure between two frames. We use a classical similarity measure
based on the histogram difference on each colour plane, together with an adaptive threshold²
[Faudemay 97] that takes into account the local variability of the measure within a shot. The
threshold computation is based on the standard deviation in a window centred on each
point, excluding the considered point. Tests show that a window of 5+5 frames does not disturb
the thresholding, so short shots are not missed (a half-second shot is possible, for
instance, in a film announcement).
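The measure and threshold described above can be sketched as follows. This is a minimal illustration, not the implementation used in the project: the number of histogram bins, the form of the threshold (local mean plus `k` times the local standard deviation) and the value of `k` are assumptions, since the text only states that the threshold is based on the standard deviation in a window that excludes the considered point. All function names are hypothetical.

```python
from statistics import mean, pstdev


def histogram(plane, bins=16, max_value=256):
    """Count the pixel values of one colour plane (a flat list of ints) into bins."""
    counts = [0] * bins
    for value in plane:
        counts[value * bins // max_value] += 1
    return counts


def frame_difference(frame_a, frame_b, bins=16):
    """Sum of absolute histogram differences over the colour planes.

    Each frame is a sequence of planes, for example (R, G, B).
    """
    total = 0
    for plane_a, plane_b in zip(frame_a, frame_b):
        hist_a, hist_b = histogram(plane_a, bins), histogram(plane_b, bins)
        total += sum(abs(a - b) for a, b in zip(hist_a, hist_b))
    return total


def adaptive_thresholds(diffs, half_window=5, k=3.0):
    """One threshold per point, computed from its neighbours only.

    The window covers half_window points on each side and excludes the
    point itself. The formula mean + k * std is an assumed form.
    """
    thresholds = []
    for i in range(len(diffs)):
        lo = max(0, i - half_window)
        hi = min(len(diffs), i + half_window + 1)
        neighbours = diffs[lo:i] + diffs[i + 1:hi]
        if not neighbours:
            thresholds.append(float("inf"))  # a single point cannot be judged
        else:
            thresholds.append(mean(neighbours) + k * pstdev(neighbours))
    return thresholds


def detect_cuts(diffs, half_window=5, k=3.0):
    """Indices where the difference measure exceeds its local threshold."""
    thresholds = adaptive_thresholds(diffs, half_window, k)
    return [i for i, (d, t) in enumerate(zip(diffs, thresholds)) if d > t]
```

Here `diffs` is a list holding one `frame_difference` value per pair of consecutive frames. Excluding the considered point matters: a peak would otherwise inflate its own threshold.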
1.1. Semi-automatic operation
As we said before, a specific point of our approach is to involve the operator to
validate results. Only uncertain cases need validation (that is known as rejection in
decision theory). The uncertain cases are determined according to a tolerance which is
tuned by the operator if the default one does not suit the analysed video :
[Figure 1 : Thresholding with tunable rejection. Labels in the figure : certain event,
tunable tolerance, computed threshold, uncertain cases, no event, detection probability.]
² An additional fixed and loose threshold would be useful anyway to avoid false alarms in shots where the
variability is very low and an insignificant perturbation creates a small peak.
We consider several constraints :
• Computations and validation are performed asynchronously to avoid waiting times.
• The default rejection tolerance is set quite high to avoid missing difficult shot cuts.
• The operator should be able to tune the tolerance depending on the results at any time.
As a consequence, instead of binary answers, the computation step provides a
detection probability, which allows uncertain cases to be determined with a tunable
tolerance at validation time. More generally, for other types of semi-automatic
analysis, the interpretation of the raw results should be kept independent from the raw
computations so that the operator can play a role in the analysis.
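The separation between computation and validation can be illustrated with a small sketch. It assumes that the computation step stores, for each frame, the raw measure and its computed threshold, and that the tolerance is a relative band around the threshold. The function names and the three labels are hypothetical.

```python
def classify(measure, threshold, tolerance=0.2):
    """Three-way decision with rejection.

    Values inside the band threshold * (1 +/- tolerance) are left for
    the operator to validate.
    """
    low = threshold * (1 - tolerance)
    high = threshold * (1 + tolerance)
    if measure > high:
        return "event"
    if measure < low:
        return "no event"
    return "uncertain"


def pending_validation(measures, thresholds, tolerance=0.2):
    """Indices of the cases that the operator has to validate."""
    return [
        i
        for i, (m, t) in enumerate(zip(measures, thresholds))
        if classify(m, t, tolerance) == "uncertain"
    ]
```

Because only raw values are stored, the operator can change `tolerance` at any time and obtain a new list of uncertain cases without running the analysis again.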
Figure 2 : Examples of shot cut detection, with start frame and end frame for each shot (see frame
number). A shot cut at frame 499, within motion, needs validation. (INA archives)
1.2. Comparison of detection methods
Many methods have been developed. The following table gathers results from
several works for which numeric results are published (this means they are evaluated on
different sequences, and some are based on few data).
                                                          Gradual transitions            Shot cuts
Author          Method                                    missed            false  nb    missed  false  nb    Video type
[Yeo 95]        image difference with a frame step        (14%)             (57%)  7     7%      7%     41    1 Mpeg report
[Corridoni 95]  image ratio                               (20%) (dissolves) (20%)  4     3%      3%     181   films, ads
                                                          0 (fades)         0      29
[Joly 94]       variation type of individual pixels       (0%)              (17%)  18    (1%)    (2%)   306   films
[Zabih 95]      edge matching                             (0)               (27%)  11    2.5%    12%    118   short Mpeg videos
[Shen 97]       edge matching and motion compensation     8%                (4%)   98    4%      (4%)   187   clips, films, television (with motion)

[Xiong 96] : number of missed / false cuts for 3864 frames, with optimised thresholds, on short
sequences with motion and perturbations.
Method              grey level   colour
pairwise            5/4          2/15
likelihood          10/48        XX
global histogram    14/78        9/46
local histogram     7/66         3/2
net comparison      3/1          0/0

Histogram and adaptive threshold (this work), on an Mpeg report and a film announcement :
gradual transitions XX ; shot cuts 1/2 (missed / false) for 2284 frames and 37 cuts.
If we use rejection, with a tolerance of +/-20% of the threshold, 20 cases need
validation, including 7 due to motion, 3 due to fades.
[Yeo 95] and [Shen 97] work directly in the Mpeg compressed domain and are very
efficient. Gradual transitions are less studied than cuts (they are less common in videos)
and need improvements, though the last method claims quite good results. One concern
is motion, which modifies the images and hence causes variations of shot transition
detection measures. We will come back to this point in the next section.
For a less biased comparison, note that the LIP6 lab of Paris 6 university, France, is
now comparing algorithms on a common video base containing one hundred hours.
2. MOTION AND MOVING OBJECTS ANALYSIS FOR VIDEO INDEXING
2.1. Assisting video description
Semi-automatic motion analysis and moving objects detection can simplify several
tasks of video description.
♦ Objects temporal presence
Object tracking automates the detection of the interval during which an object is present.
This applies to objects selected manually, to objects detected by their motion, and to
special cases like face detection (which works only for front views, so tracking the
detected faces recovers the moments when the characters turn their heads). A further step
consists in comparing all the detected objects to check their recurrence along the video.
♦ Summarising videos
Summarising shots gives condensed views of videos. Shots are defined by a fixed
background and objects in motion (plus a sound track summary). In case of camera
motion, the background is defined by several images or by a reconstructed view (images
are warped back to compensate for the motion [Taniguchi 97]). Objects are characterised
by their motion, and optionally by several sufficiently different views.
♦ Camera motion
Camera motion is meaningful with respect to the film structure. It is derived from
global motion parameters [Xiong 97], which are also computed for object motion.
♦ Shot transitions
Another shot transition detection method relies on detecting motion discontinuity
[Gelgon 97] : transition detection becomes more robust to large motion and it avoids
preliminary computations with another transition detection algorithm.
2.2. Image queries
When objects and background are separated, features extracted from them allow
similarity retrieval [Benayoun 98]. The operator can select the most significant elements
to index.
2.3. Motion queries
The first step is to establish what can be useful for motion queries : should a track be used for
queries by example, based on a video sample or a sketch ? Or should motion be described
more simply and more semantically, with words ? The latter would involve :
• significant motion (as opposed to static shots), which is useful for navigating the video,
• motion features (horizontal or vertical motion, depth motion, speed, regularity),
• motion events like a start or a change of direction,
• interaction between objects [Delis 98] [Courtney 97].
This means defining classes, with the problem of determining limits between them.
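As an illustration of the difficulty of setting limits between classes, here is a minimal sketch that turns a track into a few of the word-level descriptors listed above. The track format (one position per frame), the three labels and the speed limit `static_speed` are hypothetical choices, not values from this work.

```python
import math


def describe_track(track, static_speed=0.5):
    """Describe a track, given as a list of (x, y) positions, one per frame.

    Returns the mean speed in pixels per frame and a coarse label.
    The limit between "static" and moving is an arbitrary assumption.
    """
    steps = [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(track, track[1:])]
    if not steps:
        return {"speed": 0.0, "label": "static"}
    speed = sum(math.hypot(dx, dy) for dx, dy in steps) / len(steps)
    total_dx = track[-1][0] - track[0][0]
    total_dy = track[-1][1] - track[0][1]
    if speed < static_speed:
        label = "static"
    elif abs(total_dx) >= abs(total_dy):
        label = "horizontal"
    else:
        label = "vertical"
    return {"speed": speed, "label": label}
```

A diagonal track shows the problem directly: a small change in the end point can switch the label from "horizontal" to "vertical".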
3. INTEREST POINTS, MOTION AND OBJECTS FOR VIDEO INDEXING
3.1. The tool : interest points
Our lab has worked on interest points [Bres 99] ; we give a brief overview here.
Interest points are defined by two-dimensional signal variations in their neighbourhood,
for instance at corners, as opposed to 1D variation for basic edges. They describe an
image with a small number of points, and therefore allow fast image comparison and
compact storage. That is why they are used for image matching, in robotics, and also for
image indexing [Schmid 97] (see 3.2.1 Computing motion).
We use three detection algorithms (see the [Jolion 98] web site) : the Plessey detector
[Harris 88], Susan [Smith 97], and the multiresolution contrast detector [Bres 99]. The first two
are based on geometric models, which are well suited to corner detection, while the
last is not and is more appropriate for natural images. The Susan detector is much
faster than the others but is not very robust to Jpeg compression effects [Bres 99], which
raises doubts for Mpeg videos.
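To make the notion of two-dimensional signal variation concrete, here is a minimal sketch of the corner response used by the Plessey detector [Harris 88], R = det(M) - k trace(M)^2, where M sums the products of image gradients over a neighbourhood. The 3x3 unweighted window, the central-difference gradients and k = 0.04 are simplifying assumptions. This is not the multiresolution contrast detector of [Bres 99].

```python
def harris_response(image, k=0.04):
    """Corner response for a grey-level image given as a list of rows.

    Positive values indicate corners (variation in two directions),
    negative values indicate edges, and values near zero flat areas.
    Pixels too close to the border keep a response of 0.
    """
    height, width = len(image), len(image[0])
    ix = [[0.0] * width for _ in range(height)]
    iy = [[0.0] * width for _ in range(height)]
    # Central-difference gradients on interior pixels.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            ix[y][x] = (image[y][x + 1] - image[y][x - 1]) / 2.0
            iy[y][x] = (image[y + 1][x] - image[y - 1][x]) / 2.0
    response = [[0.0] * width for _ in range(height)]
    for y in range(2, height - 2):
        for x in range(2, width - 2):
            sxx = sxy = syy = 0.0
            # Sum gradient products over a 3x3 window.
            for v in range(y - 1, y + 2):
                for u in range(x - 1, x + 2):
                    sxx += ix[v][u] * ix[v][u]
                    sxy += ix[v][u] * iy[v][u]
                    syy += iy[v][u] * iy[v][u]
            response[y][x] = sxx * syy - sxy * sxy - k * (sxx + syy) ** 2
    return response
```

Interest points would then be taken as local maxima of the response above a threshold, a step left out of this sketch.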
For videos, matching interest points from one image to the next in a shot gives
motion vectors, which is the basis for motion analysis. This method should be fast
compared to pixel-based methods (optical flow or spatiotemporal segmentation) or more
complex matching (edges, curvature points).
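A minimal sketch of this idea follows, assuming that interest points are given as (x, y) coordinates and that a point moves by at most `max_distance` pixels between two frames. The greedy nearest-neighbour strategy and all names are illustrative assumptions. A practical matcher would also compare local measures attached to the points.

```python
import math


def match_points(points_a, points_b, max_distance=20.0):
    """Greedy nearest-neighbour matching between two sets of (x, y) points.

    Returns a sorted list of (index_in_a, index_in_b) pairs. Each point is
    used at most once, and pairs farther apart than max_distance are ignored.
    """
    candidates = []
    for i, (xa, ya) in enumerate(points_a):
        for j, (xb, yb) in enumerate(points_b):
            distance = math.hypot(xb - xa, yb - ya)
            if distance <= max_distance:
                candidates.append((distance, i, j))
    candidates.sort()  # closest pairs are accepted first
    used_a, used_b, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matches.append((i, j))
    return sorted(matches)


def motion_vectors(points_a, points_b, max_distance=20.0):
    """Displacement (dx, dy) of each matched point from frame a to frame b."""
    return [
        (points_b[j][0] - points_a[i][0], points_b[j][1] - points_a[i][1])
        for i, j in match_points(points_a, points_b, max_distance)
    ]
```

The cost grows with the product of the two point counts, which stays small because an image is described by few interest points.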
3.2. Interest points, motion and objects
Figure 3 shows the temporal superposition of interest points (the points of the first
frames appear darker), next to one of the original images.
[Figure 3 : Rotating dancer.]
3.2.1. Computing motion
♦ Point cluster tracking
In special cases with well-defined objects, interest points are grouped into
clusters corresponding to objects or parts of objects. A fast method consists in clustering
the set of points (with morphologic methods for instance) and following the clusters. A
consistency measure (for instance motion consistency over a given duration) is then needed
to identify the difficult cases, to which a more powerful method is applied.
♦ Point matching
Many methods exist, for instance in robotics (edge or corner matching in artificial
images ; stereovision [Cédras 93] [Serra 96]). For robustness, tracking should take into
account several frames.
Comparing local measures associated with interest points improves the matching, provided
that these measures are robust to noise, geometric transforms and masking, as differential
invariants are [Schmid 97]. We are testing the invariance of multiresolution contrast
for comparison with differential invariants.
3.2.2. Interest points for video indexing
Interest points can support all the applications discussed in section 2, "Motion and moving
objects analysis for video indexing". Let us focus on some of them.
♦ Moving objects tracking
The purpose is to determine the time interval where an object is present. Object
motion is obtained by compensating the global motion. Rigid objects detection is based
on the similar motion of the object points. It is more difficult with non-rigid objects, and
due to the variety of the considered videos, detection cannot be perfect. Therefore an
operator has to validate and correct the results.
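The detection principle above can be sketched as follows, under strong simplifying assumptions: the global motion is a pure translation estimated by the component-wise median of the motion vectors, and points are grouped by similarity of their residual motion only, ignoring their spatial proximity. The tolerance value and function names are hypothetical.

```python
import math
from statistics import median


def global_motion(vectors):
    """Estimate the global (camera) translation as the component-wise median."""
    return (median(v[0] for v in vectors), median(v[1] for v in vectors))


def residual_motion(vectors):
    """Motion of each point once the global motion has been compensated."""
    gx, gy = global_motion(vectors)
    return [(dx - gx, dy - gy) for dx, dy in vectors]


def moving_groups(vectors, tolerance=2.0):
    """Group points whose residual motion is significant and similar.

    Returns lists of point indices, one list per candidate rigid object.
    """
    residuals = residual_motion(vectors)
    groups = []  # pairs of (reference residual, member indices)
    for i, (rx, ry) in enumerate(residuals):
        if math.hypot(rx, ry) <= tolerance:
            continue  # the point follows the global motion: background
        for (ref_x, ref_y), members in groups:
            if math.hypot(rx - ref_x, ry - ref_y) <= tolerance:
                members.append(i)
                break
        else:
            groups.append(((rx, ry), [i]))
    return [members for _, members in groups]
```

The median is used here because it tolerates a minority of points belonging to moving objects. The groups remain candidates that the operator validates and corrects.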
For an indexing system, we consider different modes : batch processing, or more
interactive operation. In either case, to avoid waiting times, it is far preferable for the
interaction steps and computation steps to work independently on a whole video segment
rather than to work object by object. Notice that for an approximate object display,
showing a region containing the interest points is enough for human understanding.
❑ On demand analysis
In case the operator is interested in only some of the objects (the most significant)
and does not want to run a full computation, we have the following steps :
• outlining manually all these objects, in one frame each,
• extracting and tracking the interest points included in these regions,
• asking the operator to validate and correct the ambiguous cases (with the display of the
object at the beginning and at the end of its trajectory to see if it is the same).
❑ Batch analysis of a sequence
First, the computation step on the whole sequence includes extracting the interest
points, computing the motion, and grouping points according to motion similarity to detect
objects. The validation step is then the same as above.
♦ Moving objects characterisation
Interest points and the associated invariants are a way of characterising objects,
for :
• classifying similar objects from a video to assist the process of naming the objects,
• querying a video database by example. We need to store only several different views of
an object from the whole sequence (or even none if the object is already indexed and has
similar views already stored³).
Characterising objects needs more accuracy than tracking. First, interest point
thresholding can be adapted to the object to get more points locally. Then, the operator
must also correct the detected object shapes if they overlap other objects⁴.
We emphasise the fact that the whole process does not need a full spatiotemporal
segmentation at the pixel level.
3.3. Comparing interest points detectors
From one image to the next, interest points change because of Mpeg coding, object
distortions, and background variations when the object moves (which modify the local
invariants associated with points on the edge of the object). Matching requires reasonably
stable points (number and location of points, stability of invariants). As a first step, we compare the
temporal and spatial stability of interest point detectors with a simple matching algorithm
under global or small motion (by studying the variability of motion vectors in one frame). A
second step consists in comparing the results of a real tracking algorithm.
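Two simple stability measures of this kind can be sketched as follows : the relative change in the number of interest points between two consecutive frames, and the spread of the motion vectors within one frame. The exact definitions used in our tests are not given here, so both formulas are assumptions made for illustration.

```python
from statistics import pstdev


def count_change_rate(n_previous, n_current):
    """Relative change (%) in the number of interest points between two frames."""
    if n_previous == 0:
        return 0.0 if n_current == 0 else 100.0
    return 100.0 * abs(n_current - n_previous) / n_previous


def motion_spread(vectors):
    """Standard deviation of the x and y components of the motion vectors.

    Under a global translation, a stable detector with correct matches
    should give a spread close to (0, 0).
    """
    xs = [v[0] for v in vectors]
    ys = [v[1] for v in vectors]
    return (pstdev(xs), pstdev(ys))
```

Computing `count_change_rate` for every pair of consecutive frames of a shot gives the values from which a histogram of rates of change can be built.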
The one-minute report in Figure 4 shows small variations in general (it uses the multiresolution
contrast detector and a fixed threshold ; the frames associated with shot cuts are removed).
We plan to test quite long video sequences from television archives of the French Institute
of Audio-visual (INA).
[Figure 4 : Histogram of the rate of change (%) of the number of interest points between two
frames. Horizontal axis : rate of change, 0 to 100 ; vertical axis : number of images, 0 to 400.]
³ For that, a classification of the whole (and huge) database is not needed since we can reach the other
possible instances of the object already indexed using the semantic annotations database.
⁴ However, if some parts contain no interest points, adding them does not matter, because they will play
no role in similarity queries.
CONCLUSION
We have developed a shot cut detection assistant, using adaptive thresholding and
taking into account the interaction with the operator. Concerning motion analysis, we
have considered the possible applications for video indexing : assisting moving object
indexing, summarising videos, and allowing image and motion queries. We have
proposed an approach based on interest points, specifically with a multiresolution
contrast-based detector, for analysing motion and detecting and characterising objects ;
this approach does not require a full spatiotemporal segmentation. Experimental results
will be presented at the conference and included in the final version of the paper.
REFERENCES
♦ Sésame
[Decleir 98] C. Decleir, M.S. Hacid, J. Kouloumdjian (1998) A Generic Model For Video
Content Based Retrieval; Symposium on Applied Computing, ACM, 458-459.
[Mostéfaoui 97] A. Mostéfaoui , L. Brunie (1997) Exploiting data structures in a High
Performance Video Server for TV Archives; Digital Media Information Base
(DMIB’97), ACM-SIGMOD, Ed. World Scientist, 159-166.
[Prié 98] Y.Prié, A.Mille, J.M.Pinon (1998) AI-STRATA: A User-centered Model for
Content-based description and Retrieval of Audiovisual Sequences; First Int.
Advanced Multimedia Content Processing Conf., 143-152.
[Lebourgeois 98] F. Lebourgeois, J.M. Jolion, P. Awart (1998) Toward a Video
Description for Indexation ; 14th IAPR Int. Conf. on Pattern Recognition, Brisbane,
August 1998, vol. I, 912-915.
♦ Shot detection
[Corridoni 95] M. Corridoni, A. Del Bimbo (1995) Automatic Video Segmentation through
Editing Analysis ; Technical Report Firenze University,
http://www.nzdl.org/cgi-bin/gw?a=targetdoc&c=cstr&z=sw3E2P4hwhzy&d=6975.
[Faudemay 97] P. Faudemay, L. Chen, C. Montacié, M.J. Caraty, X. Tu (1997)
Segmentation multi-canaux de vidéos en séquences; Coresa 97.
[Joly 94] P. Joly, P. Aigrain (1994) The Automatic Real-Time Analysis of Film Editing and
Transition Effects and its Applications; Computers & Graphics, Vol. 18, No. 1, 1994,
93-103.
[Shen 97] Bo Shen (1997) HDH Based Compressed Video Cut Detection; HPL-97-142
971204 External, http://www.hpl.hp.com/techreports/97/HPL-97-142.html.
[Xiong 96] W. Xiong, J. Chung-Mong Lee, R.H. Ma, Automatic Video Data Structuring
through Shot Partitioning and Key Frame Computing ;Technical report,
http://www.nzdl.org/cgi-bin/gw?a=targetdoc&c=cstr&z=44Cx2P4hwhzy&d=22080.
[Yeo 95] B. L. Yeo and B. Liu (1995) Rapid scene analysis on compressed video ; IEEE
Transactions on circuits and systems for video technology, vol. 5, 533-544.
[Zabih 95] R. Zabih, J. Miller, K. Mai (1995) A Feature-Based Algorithm for Detecting and
Classifying Scene Breaks; ACM Multimedia 1995,
http://simon.cs.cornell.edu/Info/People/rdz/dissolve.html.
♦ Interest points
[Bres 99] S.Bres, J.M. Jolion (1999) Detection of Interest Points for Image Indexation ;
Visual’99 Amsterdam, june 2-4, http://rfv.insa-lyon.fr/~jolion/PS/visual99.ps.gz.
[Harris 88] C.Harris, M.Stephens (1988) A combined corner and edge detector; Proc. of
4th Alvey Vision Conf., 147-151.
[Jolion 98] Interest points demo: http://rfv.insa-lyon.fr/~jolion/Cours/ptint.html.
[Schmid 96] C.Schmid (1996) Appariement d'images par invariants locaux de niveaux de
gris; Thèse INP Grenoble.
[Schmid 97] C.Schmid, R.Mohr (1997) Local Grayvalue Invariants for Image Retrieval;
IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(5), 530-535.
[Smith 97] S.M.Smith, J.M.Brady (1997) SUSAN - A New Approach to Low Level Image
Processing; Int. Journal of Computer Vision, 23(1), 45-78.
♦ Motion
[Benayoun 98] S. Benayoun, H. Bernard, P. Bertolino, P. Bouthemy, M. Gelgon, R. Mohr,
C. Schmid, F. Spindler (1998) Structuration de vidéos pour des interfaces de
consultation avancées; Coresa 98, 205.
[Cédras 93] C. Cédras, M. Shah (1993) Motion-Based Recognition: a Survey; technical
report, http://www.nzdl.org/cgi-bin/Kniles?c=cstr&d=7153.
[Courtney 97] J.D. Courtney (1997) Automatic video indexing via object motion analysis;
Pattern Recognition 1997.
[Delis 98] V. Delis, D. Papadias, N. Mamoulis (1998) Assessing Multimedia Similarity;
ACM Multimedia 98, Session 7 C: Content-Based Retrieval Systems,
http://www.acm.org/sigmm/MM98/electronic_proceedings/delis/index.html.
[Gelgon 97] M. Gelgon, P. Bouthemy, G. Fabrice (1997) A Unified Approach to Shot
Change Detection and Camera Motion Characterization; Technical Report RR-3304
INRIA Rennes, http://www.inria.fr/RRRT/RR-3304.html.
[Serra 96] B. Serra (1996) Reconnaissance et localisation d’objets cartographiques 3D en
vision aérienne dynamique; Thèse université de Nice,150-185.
[Taniguchi 97] Y. Taniguchi, A. Akutsu, Y. Tonomura (1997) Panorama Excerpts:
Extracting and Packing Panoramas for Video Browsing; ACM Multimedia 97,
http://www1.acm.org:81/sigmm/MM97/papers/taniguchi/tani.html.
[Xiong 97] W. Xiong, J.C.M. Lee (1997) Efficient Scene Change Detection and Camera
Motion Annotation for Video Classification; Technical Report HKUST-CS97-16,
http://www.nzdl.org/cgi-bin/gw?a=targetdoc&c=cstr&z=2Ess2P4hwhzy&d=22748.