Fuzzy-based Motion Estimation for Video Stabilization using SIFT interest points

Battiato S.^a, Gallo G.^a, Puglisi G.^a and Scellato S.^b

^a University of Catania, Viale A. Doria, Catania, Italy
^b Scuola Superiore di Catania, Via San Nullo, Catania, Italy

                                                       ABSTRACT
In this paper we present a technique which infers interframe motion by tracking SIFT features through consecutive frames: feature points are detected and their stability is evaluated through a combination of geometric error measures and fuzzy logic modelling. Our algorithm does not depend on the point detector adopted prior to SIFT descriptor creation: performance has therefore been evaluated against a wide set of point detection algorithms, in order to investigate how stabilization quality can be increased with an appropriate detector.

                                                  1. INTRODUCTION
In the past decade video stabilization techniques have been in wide demand to remove the uncomfortable motion vibrations that are common in non-professional home videos taken with hand-held cameras. Although these devices allow everyone to produce personal footage, the resulting videos are often shaky and affected by undesirable jitter. Video stabilization is therefore often employed to increase video quality, since it permits obtaining stable footage even in non-optimal shooting conditions. The best stabilization techniques make use of mechanical tools which physically prevent camera shake, or exploit optical or electronic devices to influence how the camera sensor receives the input light.1 Digital video stabilization techniques, on the other hand, do not need any additional knowledge about the physical motion of the camera. These approaches are therefore inexpensive, as they can be easily implemented both in real-time and in post-processing systems.
    A wide number of works have investigated several techniques, each with different issues and weak points. A first group of techniques is based on block matching: they use different filters to refine the motion estimation obtained from local block vectors.2–4 These algorithms generally provide good results but are likely to be misled by videos containing large moving objects. This happens because they neither associate a descriptor with a block nor track blocks along consecutive frames.
    Feature-based algorithms extract features from video frames and estimate interframe motion from their locations. Some authors present techniques5–7 combining feature computation with other robust filters; these methods have gained wide consensus for their good performance.
    A video stabilization system based on SIFT features8 has recently been presented by the authors in Ref. 9. It uses a custom implementation of SIFT features to estimate interframe motion; Adaptive Motion Vector Integration is then adopted to recognize and remove intentional movements.
    Here we present an improved technique which infers interframe motion by tracking SIFT features through consecutive frames: feature points are detected and their stability is evaluated through a combination of geometric error measures and fuzzy logic modelling. Our algorithm does not depend on the point detector adopted prior to SIFT descriptor creation: performance has therefore been evaluated against a wide set of point detection algorithms, in order to investigate how stabilization quality can be increased with an appropriate detector.
    The paper is organized as follows: in Section 2 we present our motion estimation algorithm, which is independent of the detector adopted; a detailed discussion of point detection algorithms follows in Section 3. Experimental results are shown in Section 4 and conclusions are summarized in Section 5.
2. CAMERA MOTION ESTIMATION
Our algorithm assumes that a suitable keypoint detector can be used to extract significant features from each frame. A SIFT descriptor must be assigned to each keypoint for further elaboration: this is usually a 128-dimensional feature vector which offers robustness and invariance to several image transformations (for further details the reader is referred to the original work8). The computation of SIFT keypoints and their relative descriptors can be divided into the two main tasks of point detection and descriptor computation: it is evident that the detection step could be performed with several different techniques, since SIFT descriptors may be computed as soon as interest points have been detected in the processed image.
    Our approach tracks SIFT keypoints between frames and then uses a feature-based matching algorithm to estimate interframe motion. Each pair of matched features results in a Local Motion Vector, but not all local motion vectors give correct information about how the frame has moved relative to the previous one. Wrong matchings that may mislead the algorithm are discarded with Iterative Least-Squares Estimation, using a fuzzy logic model to interpret geometric measures.

2.1 Point matching
The first problem to address is keypoint matching. In Ref. 8 it is performed using the Euclidean distance between descriptor vectors and a distance ratio, namely the ratio of the closest-neighbour distance to that of the second-closest one, which can be checked against a threshold to discard false matchings. In fact correct matchings should have lower ratios, while wrong ones should have ratios closer to one. We have previously tested the correlation between the distance ratio and the distance of the detected points in consecutive frames, since keypoints are likely to appear in the same location in both images.
     Correlation between these two variables is then easily investigated: when two keypoints are matched, the Euclidean distance between the pixel position of the first keypoint and the pixel position of the second one is computed. We noticed that using a value of 0.6 as threshold performs well in discarding wrong matchings: only few keypoint pairs show such a low distance ratio, but they are more likely to be correct matchings than the many more that present higher distance ratios (Fig. 1). It is important to notice that a medium-size image (640 × 480 pixels) may reveal many thousands of keypoints, even though a good estimation algorithm performs well even with fewer than one hundred points: therefore filtering out a large portion of the point set is a good way to increase the performance of the algorithm without affecting its results.

Figure 1. Correlation between pixel distance (X axis) and distance ratio (Y axis) for a medium-size image: on the left, features with a distance ratio below 0.6; on the right, the remaining features.
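
In code, such a ratio-test matching step might look like the following minimal brute-force sketch, assuming descriptors are stored row-wise in NumPy arrays (the function name and array layout are ours, not from the paper):

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio_threshold=0.6):
    """Match SIFT descriptors between consecutive frames with the
    distance-ratio test: keep a match only when the closest neighbour
    is markedly closer than the second-closest one."""
    matches = []
    for i, d in enumerate(desc_a):
        # Euclidean distances from descriptor i to all descriptors in frame b
        dists = np.linalg.norm(desc_b - d, axis=1)
        nearest, second = np.argsort(dists)[:2]
        # Low ratios indicate distinctive, hence likely correct, matchings
        if dists[nearest] / dists[second] < ratio_threshold:
            matches.append((i, nearest))
    return matches
```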

    After this matching process a list of keypoint pairs is obtained, which represents the input of the subsequent feature-based motion estimation algorithm. A Local Motion Vector is associated with each pair of matched features: since the absolute positions $(x_k, y_k)$ and $(\hat{x}_k, \hat{y}_k)$ of the keypoint in both images are known, the local motion vector $v_k$ of feature $k$ can be easily derived as

$$v_k = (\hat{x}_k - x_k, \hat{y}_k - y_k) = (dx_k, dy_k)$$

and represents how the feature has supposedly moved from the previous frame to the current one.
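
In NumPy terms this is a one-line array difference (a trivial sketch; the array names are ours):

```python
def local_motion_vectors(pts_prev, pts_curr):
    """v_k = (xk_hat - x_k, yk_hat - y_k) for each matched keypoint pair;
    pts_prev and pts_curr are (K, 2) arrays of (x, y) pixel positions."""
    return pts_curr - pts_prev
```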
2.2 Inter-frame motion estimation
The set of local motion vectors retrieved during feature matching is used to estimate the motion that occurred between the current frame and the previous one, namely a Global Motion Vector.
   Of course local motion vectors must be fit to a frame motion model: even if the motion in the scene is typically three-
dimensional, global motion between frames can be approximately estimated with a two-dimensional linear affine model,
which represents the best trade-off between effectiveness and computational complexity.
    This model describes interframe motion using four different parameters, namely two translational movements, one rotation angle and a zoom factor, and it associates feature $(x_i, y_i)$ in frame $I_n$ with feature $(x_f, y_f)$ in frame $I_{n+1}$ through the transformation

$$
\begin{cases}
x_f = x_i \lambda \cos\theta - y_i \lambda \sin\theta + T_x \\
y_f = x_i \lambda \sin\theta + y_i \lambda \cos\theta + T_y
\end{cases}
\qquad (1)
$$

where $\lambda$ is the zoom parameter, $\theta$ the rotation angle, and $T_x$ and $T_y$ the X-axis and Y-axis shifts respectively. The four transformation parameters can be derived from four independent linear equations, so two pairs of features are enough for the system to have a solution. Unfortunately features are often heavily affected by noise, so a more robust method should be applied. The linear Least Squares Method on a set of redundant equations is a good choice to solve this problem: it results in a robust parameter estimation and is less prone to bad conditioning in the numerical algorithm. However, the whole set of local motion vectors does not contain only useful information for motion compensation, because it probably includes wrong matchings, or correct matchings that actually belong to moving objects.
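
The substitution $a = \lambda \cos\theta$, $b = \lambda \sin\theta$ makes Eq. (1) linear in its unknowns, so the over-determined system can be solved directly with linear least squares. A minimal sketch under that substitution (our own helper, not the paper's code):

```python
import numpy as np

def estimate_similarity(pts_prev, pts_curr):
    """Least-squares fit of the four parameters of Eq. (1).

    With a = lambda*cos(theta) and b = lambda*sin(theta) the model reads
    x_f = a*x_i - b*y_i + Tx and y_f = b*x_i + a*y_i + Ty, which is linear
    in (a, b, Tx, Ty). pts_prev, pts_curr: (K, 2) arrays of (x, y) points.
    """
    x, y = pts_prev[:, 0], pts_prev[:, 1]
    ones, zeros = np.ones_like(x), np.zeros_like(x)
    A = np.empty((2 * len(x), 4))
    A[0::2] = np.column_stack([x, -y, ones, zeros])   # rows for x_f
    A[1::2] = np.column_stack([y, x, zeros, ones])    # rows for y_f
    a, b, tx, ty = np.linalg.lstsq(A, pts_curr.reshape(-1), rcond=None)[0]
    return np.hypot(a, b), np.arctan2(b, a), tx, ty   # lambda, theta, Tx, Ty
```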
    The Least Squares Method does not perform well when there is a large portion of outliers among the features, as in this case. However, outliers can be identified and filtered out of the estimation process, resulting in better accuracy. Iterative least squares refinement10 can be employed to reject outliers and refine the solution. This method first determines the least squares solution with the whole set of features; it then computes the error statistics for the data set, removes any keypoint that presents a significant error, performs a better least squares estimation, and so on until some convergence criterion is met. This technique performs well, but it is clear that a considerable effort must be devoted to designing an adaptive technique able to compute the error statistics used to remove outliers from the point set.
    In a preliminary phase, all the points whose local motion vector has a large Euclidean norm are immediately discarded using a fixed threshold, since they are unlikely to be correct matchings. The remaining local vectors are then used to obtain a first motion estimation with the Least Squares Method.
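
This preliminary rejection can be written, for instance, as follows; the threshold value is purely illustrative, since the paper does not report its fixed threshold:

```python
def prefilter_matches(pts_prev, pts_curr, max_norm=40.0):
    """Discard matchings whose local motion vector has a large Euclidean
    norm, as they are unlikely to be correct (max_norm is illustrative)."""
    keep = np.linalg.norm(pts_curr - pts_prev, axis=1) < max_norm
    return pts_prev[keep], pts_curr[keep]
```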
    After this first estimation step, each input keypoint is validated against the computed parameters so that its error can be evaluated. Since each feature relates a keypoint in the first image to another one in the second, the first point is transformed using the parameters obtained from this first step, computing an expected second point that can be compared with the actually detected second point. Accordingly, two different local motion vectors can be computed: one from the matched point and one from the expected point.
   Two different error measures have been adopted to evaluate the quality of a matching (both are sketched in code after this list):

   • Euclidean distance between expected and real point: this measure performs well, since it rejects matchings that do not agree with the estimated translational components, but may turn out inaccurate for border points when a rotation occurs;
   • angle between the two local motion vectors: this measure performs well with rotational components.
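
Both measures can be computed in a vectorized fashion from a first parameter estimate. The following sketch reuses the estimate_similarity helper introduced above (all names are ours):

```python
def matching_errors(pts_prev, pts_curr, lam, theta, tx, ty):
    """Per-match error measures against a first motion estimate: the
    Euclidean distance between expected and detected point, and the angle
    between the matched and the expected local motion vectors."""
    c, s = lam * np.cos(theta), lam * np.sin(theta)
    expected = np.column_stack([
        c * pts_prev[:, 0] - s * pts_prev[:, 1] + tx,   # Eq. (1), x_f
        s * pts_prev[:, 0] + c * pts_prev[:, 1] + ty,   # Eq. (1), y_f
    ])
    dist_err = np.linalg.norm(pts_curr - expected, axis=1)
    v_matched = pts_curr - pts_prev     # measured local motion vectors
    v_expected = expected - pts_prev    # vectors implied by the estimate
    cos_ang = (v_matched * v_expected).sum(axis=1) / (
        np.linalg.norm(v_matched, axis=1)
        * np.linalg.norm(v_expected, axis=1) + 1e-12)
    angle_err = np.arccos(np.clip(cos_ang, -1.0, 1.0))
    return dist_err, angle_err
```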

2.3 Motion compensation
Obviously both error measures are fit to discard incorrect matchings, but each measure captures a particular problem in the matching algorithm, so these two quantities must be suitably combined into a unique quality index.
    This task is performed with a fuzzy logic model that evaluates the two error measures, transforming them into reliability values by membership functions, and then derives a final estimation of the quality of the matching by using a Sugeno model. Fuzzy logic has been successfully adopted for electronic video stabilization in Ref. 11.
Figure 2. Fuzzy membership functions: three different fuzzy sets are used

    Our fuzzy logic model takes as input the two aforementioned error measures and outputs a single quality index, a real value in the range [0, 1] which represents how good the matching between a pair of points is.
   Before a fuzzy logic model can be adopted, fuzzification of the inputs and defuzzification of the outputs must be properly defined. In order to obtain a classification that does not depend on the particular values of the error measures we adopted a simple strategy. Let $E = (e_1, e_2, \ldots, e_N)$ be the set of error values computed for each keypoint matching and let $M_e$ be the median error, that is, the median value of $E$. For each element in $E$ the error deviation $d_i$ is defined as $d_i = e_i / M_e$. The error deviation is less than one if an error is below the median error and greater than one in the opposite case.
    This formulation allows us to define a simpler fuzzy model, where absolute error values are not taken into account but rather their value with respect to the median. The median is also more robust and less influenced by extreme values, which may easily mislead the arithmetic mean.
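
In code, this normalization is a one-liner (a sketch; the function name is ours):

```python
def error_deviations(errors):
    """d_i = e_i / Me: normalize each error by the median error Me, so the
    classification does not depend on the absolute error scale."""
    errors = np.asarray(errors, dtype=float)
    return errors / np.median(errors)
```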
   When an error deviation $d_i$ is given as input to the membership functions, its value is mapped to three different classes of accuracy, namely high, medium and low, as shown in Fig. 2. Lower values of error deviation are mapped to the best class whereas higher values go into the worst class. By overlapping simple triangular and trapezoidal membership functions a good definition of the input fuzzy sets can be achieved.
    A zero-order Takagi-Sugeno-Kang model12 of fuzzy inference is then adopted to infer the quality index: four different output fuzzy sets are defined to describe the quality of the matching, namely excellent, good, medium and bad. These values are defined to discriminate particularly between good and excellent results, since it is likely that good matchings will exhibit good error measures, but we need to focus only on the very best points in order to improve our final result. Each of these four classes is mapped to a constant value, respectively 1.0, 0.75, 0.5 and 0.

                           Figure 3. Overall fuzzy model surface, mapping two inputs into a single output.

    A zero-order TSK model is very simple, as it is a compact and computationally efficient representation, and lends itself to the use of adaptive techniques for constructing fuzzy models. These adaptive techniques may even be used to customize the membership functions so that the fuzzy system best models the data. Moreover this kind of model is at the same time powerful enough to define quite complex behaviour, if used with properly defined if-then fuzzy rules. Our rules are defined from both inputs (the two error measures) according to the following formulation (a sketch of the inference follows the list):

   1. if both inputs are high then quality is excellent
   2. if one input is high and the other is medium then quality is good
   3. if both inputs are medium then quality is medium
   4. if at least one input is low then quality is bad
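
These rules can be evaluated with a zero-order Sugeno weighted average, as sketched below; the membership breakpoints are illustrative placeholders, since the exact shapes of Fig. 2 are not reproduced here:

```python
def memberships(d):
    """Map an error deviation d = e_i / Me to degrees of (high, medium, low)
    accuracy with overlapping triangular/trapezoidal sets (illustrative
    breakpoints; the actual shapes of Fig. 2 may differ)."""
    high = float(np.clip((1.0 - d) / 0.5, 0.0, 1.0))    # best well below the median
    medium = float(np.clip(1.0 - abs(d - 1.0) / 0.5, 0.0, 1.0))
    low = float(np.clip((d - 1.0) / 0.5, 0.0, 1.0))     # worst above the median
    return high, medium, low

def quality_index(d1, d2):
    """Zero-order TSK inference: weighted average of the constant
    consequents (1.0, 0.75, 0.5, 0.0) by the firing strength of the rules."""
    h1, m1, l1 = memberships(d1)
    h2, m2, l2 = memberships(d2)
    rules = [
        (h1 * h2, 1.00),                 # 1: both high            -> excellent
        (max(h1 * m2, m1 * h2), 0.75),   # 2: one high, one medium -> good
        (m1 * m2, 0.50),                 # 3: both medium          -> medium
        (max(l1, l2), 0.00),             # 4: at least one low     -> bad
    ]
    total = sum(s for s, _ in rules)
    return sum(s * c for s, c in rules) / total if total > 0 else 0.0
```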

    The final output of our fuzzy model is the quality index, a value in the range [0, 1]: by tuning the membership functions and the output classes it is possible to change how the error measures are mapped to the final quality index. Fig. 3 shows the smooth surface defined by the fuzzy model, while Fig. 4 depicts the outcome of the filtering process.

                  Figure 4. Original frame with all Local Motion Vectors detected (a) and after fuzzy filtering (b).

   Once a quality index has been computed for each pair of keypoints, the matchings are sorted and only the best 60% are inserted into a second input set for the Least Squares Method, whose result is taken as the final motion estimation to be used in the motion compensation step of the method.
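
This final selection can be sketched as follows, again reusing the helpers defined above:

```python
def refine_motion(pts_prev, pts_curr, quality):
    """Keep the best 60% of matchings by quality index and re-estimate the
    motion parameters with a second least squares pass."""
    order = np.argsort(quality)[::-1]              # best matchings first
    keep = order[: max(2, int(0.6 * len(order)))]  # at least two pairs
    return estimate_similarity(pts_prev[keep], pts_curr[keep])
```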

                                       3. INTEREST FEATURE DETECTORS
Our approach builds on efficient keypoint detectors and is entirely independent of the particular detection algorithm adopted, as long as a standard SIFT descriptor is provided for each detected point. Nonetheless, a good keypoint detector may improve the final performance, whereas a poor one may even jeopardize the final outcome of the video stabilization.
    In our algorithm we adopted the standard SIFT detector presented in Ref. 8 and other point detectors described in Ref. 13. All of these detectors show scale invariance and are partly invariant to other image transformations occurring when the point of view changes, so they are particularly suitable for the task of video stabilization. Basically, each approach first detects features and then computes a set of descriptors for them: while the first step differs among the methods, the descriptor computation is always performed in the same way on a suitable region around the detected point.
    The SIFT detector8 has been designed for extracting highly distinctive invariant features from images, which can be used to perform reliable matching of the same object or scene between different images. It is an efficient algorithm for object recognition based on local 3D extrema in the scale-space pyramid built with difference-of-Gaussian (DoG) filters. The input image is successively smoothed with a Gaussian kernel and sampled, and the DoG representation is obtained by subtracting two successive smoothed images. Thus, all the DoG levels are constructed by combined smoothing and sub-sampling. The local 3D extrema in the pyramid representation determine the localization and the scale of the interest points.
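
For illustration, a single octave of the DoG construction can be sketched with SciPy; the parameter values are conventional SIFT defaults, not necessarily those used by the detectors compared here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, n_scales=5, sigma0=1.6, k=2 ** 0.5):
    """One octave of the difference-of-Gaussian pyramid: successive Gaussian
    smoothings are subtracted pairwise; local 3D extrema in the resulting
    stack give the location and scale of candidate interest points.
    Sub-sampling between octaves is omitted for brevity."""
    blurred = [gaussian_filter(image.astype(np.float32), sigma0 * k ** i)
               for i in range(n_scales)]
    return [blurred[i + 1] - blurred[i] for i in range(n_scales - 1)]
```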
    On the other hand, the approaches presented in Ref. 13 combine the reliable Harris and Hessian detectors with Laplacian-based automatic scale selection. This mechanism selects the points in the multi-scale representation which are present at characteristic scales, and makes use of local extrema over scale of normalized derivatives to identify characteristic local structures.14 These detectors provide the regions used to compute descriptors, which show invariance to some image transformations: Harris-Laplace regions are invariant to rotation and scale changes and are likely to contain corner-like patterns. The famous Harris corner detector15 locates interest points using the locally averaged moment matrix obtained from image gradients: it then combines the eigenvalues of the moment matrix to compute a corner strength whose maximum values indicate the corner positions. The Hessian detector, on the other hand, chooses interest points based on the Hessian matrix, looking for points which are simultaneously local extrema of both the determinant and the trace of the Hessian matrix. Both detectors are modified for use in the scale-space, combining them with a Gaussian scale-space representation in order to create a scale-invariant detector. Hence all the derivatives are computed at a particular scale and thus are derivatives of an image smoothed by a circular Gaussian kernel.
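
As a reference, a single-scale Harris response can be sketched as below; the scale-adapted versions of Ref. 13 add the Gaussian scale-space machinery on top of this (the combination det − k·trace² is the standard way of merging the two eigenvalues into one strength value):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(image, sigma=1.0, k=0.04):
    """Harris corner strength det(M) - k*trace(M)^2, where M is the locally
    averaged moment matrix of the image gradients; local maxima of the
    response indicate corner positions."""
    img = image.astype(np.float32)
    ix, iy = sobel(img, axis=1), sobel(img, axis=0)
    ixx = gaussian_filter(ix * ix, sigma)   # locally averaged moment matrix
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    return det - k * trace ** 2
```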
    In all cases the SIFT descriptor is computed from local image gradients, sampled in an appropriate neighborhood of the interest point at the selected scale; in the affine approach this region may additionally be deformed according to the detected affine transformation. A descriptor that allows significant invariance to shape distortion and illumination changes is then created from a histogram containing local gradient values.

                                           4. EXPERIMENTAL RESULTS
The performance of our method has been evaluated on some standard video sequences with different shooting conditions.

                                     Figure 5. Sample frames from the test video sequences.

   1. a zooming sequence of different objects on a table, while illumination is gradually fading out and in;
   2. a close-up of a lit monitor while a computer mouse is swinging right in front of the camera;
   3. a sequence shot while the cameraman is sliding forward between office desks on a moving chair.

One frame from each of these sequences is shown in Fig. 5.
    Numerical evaluation of the quality of the video stabilization is carried out using the Peak Signal-to-Noise Ratio (PSNR) as error measure. The PSNR between frame $n$ and frame $n+1$ is defined as

$$MSE(n) = \frac{1}{NM} \sum_{y=1}^{M} \sum_{x=1}^{N} \left[ I_n(x, y) - I_{n+1}(x, y) \right]^2 \qquad (2)$$

$$PSNR(n) = 10 \log_{10} \frac{I_{MAX}^2}{MSE(n)} \qquad (3)$$

where $MSE(n)$ is the Mean Square Error between frames, $I_{MAX}$ is the maximum intensity value of a pixel, and $N$ and $M$ are the frame dimensions.
     The PSNR measures how similar an image is to another one; hence it is useful to evaluate how much a sequence is stabilized by the algorithm, by simply evaluating how similar consecutive frames of the processed sequence are. The Interframe Transformation Fidelity (ITF) is then used, as in Ref. 16, to objectively assess the stabilization brought by a video stabilization algorithm, since a stabilized sequence should have a higher ITF than the original sequence:

$$ITF = \frac{1}{N_{frame} - 1} \sum_{k=1}^{N_{frame} - 1} PSNR(k) \qquad (4)$$
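
In code, ITF is just the mean PSNR over consecutive frame pairs; a minimal sketch assuming 8-bit grayscale frames stored as NumPy arrays:

```python
import numpy as np

def itf(frames, i_max=255.0):
    """Interframe Transformation Fidelity (Eqs. 2-4): the mean PSNR over
    all pairs of consecutive frames of a sequence."""
    psnrs = []
    for a, b in zip(frames[:-1], frames[1:]):
        mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
        psnrs.append(10.0 * np.log10(i_max ** 2 / mse))
    return float(np.mean(psnrs))
```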

    As Tab. 1 shows, our algorithm achieves a strong improvement in the ITF regardless of the particular feature detector adopted. Nevertheless, some detectors perform dramatically better than others.

                    Table 1. ITF on original and stabilized sequences: comparison for different point detectors.
                            Sequence      Original     Lowe     Harris-Laplace       Hessian-Laplace
                               1           27.82       36.16        35.28                 32.94
                               2           27.48       32.52        32.06                 30.76
                               3           24.86       30.28        30.62                 30.44

  Lowe's detector obtains the best performance (an average gain, with respect to the original sequences, of 6.27 dB). Moreover it is less complex, in terms of computational time, than the other approaches (Tab. 2).

Table 2. Average computation time per frame for different point detectors (benchmarks obtained on an Intel Centrino Core 2 Duo T5500 @ 1.6 GHz).
                                                  Detector           Average time
                                                 Lowe SIFT              3.6 s
                                               Harris-Laplace           4.0 s
                                               Hessian-Laplace          3.8 s


                                                      5. CONCLUSIONS
In this paper we have proposed a novel approach for video stabilization based on the extraction of SIFT features through video frames. Feature points are detected and their stability is evaluated through a combination of geometric error measures and fuzzy logic modelling. Moreover, the performance of the algorithm has been evaluated against various feature detectors in order to find the best one for our application. Future work will be devoted to finding faster feature detectors in order to make a real-time implementation feasible.
REFERENCES
 [1] Canon Inc., “Canon FAQ: What is vari-angle prism?,” http://www.canon.com/bctv/faq/vari.html.
 [2] Auberger, S. and Miro, C., “Digital video stabilization architecture for low cost devices,” Proceedings of the 4th
     International Symposium on Image and Signal Processing and Analysis , 474 (2005).
 [3] Jang, S.-W., Pomplun, M., Kim, G.-Y., and Choi, H.-I., “Adaptive robust estimation of affine parameters from block
     motion vectors,” Image and Vision Computing , 1250–1263 (August 2005).
 [4] Vella, F., Castorina, A., Mancuso, M., and Messina, G., “Digital image stabilization by adaptive block motion vectors
     filtering,” IEEE Trans. on Consumer Electronics 48, 796–801 (August 2002).
 [5] Bosco, A., Bruna, A., Battiato, S., Di Bella, G., and Puglisi, G., “Digital video stabilization through curve warping
     techniques,” IEEE Transactions on Consumer Electronics 54, 220–224 (May 2008).
 [6] Censi, A., Fusiello, A., and Roberto, V., “Image stabilization by features tracking,” International Conference on
     Image Analysis and Processing (1999).
 [7] Fusiello, A., Trucco, E., Tommasini, T., and Roberto, V., “Improving feature tracking with robust statistics,” Pattern
     Analysis & Applications (2), 312–320 (1999).
 [8] Lowe, D., “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision Vol.
     60(2), 91–110 (2004).
 [9] Battiato, S., Gallo, G., Puglisi, G., and Scellato, S., “Sift features tracking for video stabilization,” in [ICIAP ’07:
     Proceedings of the 14th International Conference on Image Analysis and Processing], 825–830, IEEE Computer
     Society, Modena, Italy (2007).
[10] Björck, A., “Numerical methods for least squares problems,” SIAM (1996).
[11] Egusa, Y., Akahori, H., Morimura, A., and Wakami, N., “An electronic video camera image stabilizer operated on
     fuzzy theory,” Fuzzy Systems, 1992., IEEE International Conference on , 851–858 (8-12 Mar 1992).
[12] Sugeno, M., [Industrial Applications of Fuzzy Control ], Elsevier Science Inc., New York, NY, USA (1985).
[13] Mikolajczyk, K. and Schmid, C., “Scale & affine invariant interest point detectors,” Int. J. Comput. Vision 60(1),
     63–86 (2004).
[14] Lindeberg, T., “Feature detection with automatic scale selection,” International Journal of Computer Vision 30(2),
     77–116 (1998).
[15] Harris, C. G. and Stephens, M., “A combined corner and edge detector,” in [In Proceeding of 4th Alvey Vision
     Conference], 147–151 (1988).
[16] Mercenaro, L., Vernazza, G., and Regazzoni, C., “Image stabilization algorithms for video-surveillance application,”
     IEEE Proceedings International Conference of Image Processing Vol. 1, p. 349–352 (2001).