COMPUTER VISION AND IMAGE UNDERSTANDING
Vol. 70, No. 2, May, pp. 227–238, 1998
ARTICLE NO. IV970632

NOTE

A Fully Projective Formulation to Improve the Accuracy of Lowe's Pose-Estimation Algorithm*

Helder Araújo, Rodrigo L. Carceroni, and Christopher M. Brown
University of Rochester, Computer Science Department, Rochester, New York 14627

Received April 18, 1996; accepted April 2, 1997

Both the original version of David Lowe's influential and classic algorithm for tracking known objects and a reformulation of it implemented by Ishii et al. rely on (different) approximated imaging models. Removing their simplifying assumptions yields a fully projective solution with significantly improved accuracy and convergence, and arguably better computation-time properties. © 1998 Academic Press

* This material is based on work supported by the Luso-American Foundation, Calouste Gulbenkian Foundation, JNICT, CAPES process BEX 0591/95-5, NSF IIP Grant CDA-94-01142, NSF Grant IRI-9306454, and DARPA Grant DAAB07-97-C-J027.

1. INTRODUCTION AND HISTORY

The ability to track a set of points in a moving image plays a fundamental role in several computer vision applications with real-time constraints such as autonomous navigation, surveillance, grasping, manipulation, and augmented reality. Often some geometrical invariants of these points (such as their relative spatial positions, in the case of a rigid object) are known in advance. Algebraic solutions with perspective camera models have been proposed for several variations of this problem [1, 5, 6, 8, 10, 12, 17, 20, 22]. However, the resulting techniques usually work only with a limited number of points and are thus sensitive to additive noise and erroneous matching. Furthermore, they usually depend on numerical techniques for finding zeros of fourth-degree (or higher) polynomial equations.

Pioneering work by Lowe [13-15] and Gennery [7] addressed the problem in a projective framework. Lowe showed that the direct use of numerical optimization techniques is an effective way to overcome the lack of robustness that makes the traditional analytical techniques infeasible in practice.

DeMenthon and Davis [4, 18] and Horaud et al. [9] propose techniques that start with weak- or para-perspective solutions, respectively, and refine them iteratively to recover the full-perspective pose. Phong et al. [19] showed that it is possible to decouple completely the recovery of rotational pose parameters from their translational counterparts. However, unlike Lowe's, none of these methods is easily generalizable to deal with uncalibrated focal length or objects (scenes) with internal degrees of freedom.

Lowe's algorithm is attractive because of its elegant simplicity and its powerful generality. In this note, we first recall the original algorithm and another incarnation from the literature. Both algorithms contain certain simplifying assumptions that are easily eliminated. We present and comparatively evaluate the resulting fully projective solution. It preserves the appealing properties of Lowe's original conception while performing substantially better than either approximation. Section 7 relates our findings to previous speculations on and analyses of Lowe's algorithm. This note is an abbreviation of [2], which is less terse and contains more experimental results.

2. LOWE'S ALGORITHM

Lowe's original algorithm [13-15] addresses the issue of viewpoint and model parameter computation given a known 3-D object and the corresponding image. It assumes that the imaging process is a projective transformation. The method can thus be used to identify the pose (translation and orientation with respect to the camera coordinate system) of a local coordinate system affixed to an imaged rigid object. It can also be extended to discover the values of other parameters such as the camera focal length and shape parameters of nonrigid objects. The recovery process is based on the application of Newton's method.

Rather than solving directly for the parameter vector s in a nonlinear system, Newton's method computes a vector of corrections δ to be subtracted from the current estimate for s on each iteration. If s^(i) is the parameter vector for iteration i, then

    s^(i+1) = s^(i) − δ.    (1)

Given a vector of error measurements e between components of the model and the image, we want to solve for a correction
1077-3142/98 $25.00
Copyright © 1998 by Academic Press
All rights of reproduction in any form reserved.

vector δ that eliminates this error

    Jδ = e,  where J_ij = ∂e_i/∂x_j.    (2)

The equations used to describe the projection of a three-dimensional model point p into a two-dimensional image point [u, v] are

    [x, y, z]^T = R(p − t),  [u, v] = f [x/z, y/z],    (3)

where T denotes transpose, t is a 3-D translation vector (defined in the model coordinate frame) and R is a rotation matrix that transforms p in the original model coordinates into a point [x, y, z]^T in camera-centered coordinates. These are combined in the second equation above with the focal length f to perform perspective projection into an image point [u, v].

The problem is to solve for t, R, and possibly f, given a number of model points and their corresponding locations in an image. In order to apply Newton's method, we must be able to calculate the partial derivatives of u and v with respect to each of the unknown parameters. Lowe [14] proposes a reparameterization of the projection equations, to simplify the calculation by "express[ing] the translations in terms of the camera coordinate system rather than model coordinates":

    [x′, y′, z′]^T = Rp,
    [u, v] = [f x′/(z′ + d_z) + d_x,  f y′/(z′ + d_z) + d_y].    (4)

The variables R and f remain the same as in the previous transform, but vector t has been replaced by the parameters d_x, d_y, and d_z. The two transforms are equivalent when

    t = −R^(−1) [d_x (z′ + d_z)/f,  d_y (z′ + d_z)/f,  d_z]^T.    (5)

According to Lowe, "in the new parameterization, d_x and d_y simply specify the location of the object on the image plane and d_z specifies the distance of the object from the camera." To compute the partial derivatives of the error with respect to the rotation angles (φ_x, φ_y, and φ_z are the rotation angles about x, y, and z, respectively), it is necessary to calculate the partial derivatives of x, y, and z with respect to these angles. Table 1 gives these derivatives for all combinations of variables.

TABLE 1
The Partial Derivatives of x, y, and z with Respect to Counterclockwise Rotations φ (in Radians) about the Coordinate Axes

              x        y        z
    φ_x       0       −z′       y′
    φ_y       z′       0       −x′
    φ_z      −y′       x′       0

Newton's method is carried out by calculating the optimum correction rotations Δφ_x, Δφ_y, and Δφ_z to be made about the camera-centered axes. Given Lowe's parameterization, the partial derivatives of u and v with respect to each of the seven parameters of the imaging model (including the focal length f) are given in Table 2.

TABLE 2
The Partial Derivatives of u and v with Respect to Each of the Camera Viewpoint Parameters and the Focal Length, According to Lowe's Original Approximation

                     u                          v
    d_x              1                          0
    d_y              0                          1
    d_z          −f c² x′                   −f c² y′
    φ_x         −f c² x′ y′            −f c(z′ + c y′²)
    φ_y      f c(z′ + c x′²)               f c² x′ y′
    φ_z          −f c y′                     f c x′
    f              c x′                       c y′

    Note. c = 1/(z′ + d_z).

Lowe then notes that each iteration of the multidimensional Newton's method solves for a vector of corrections

    δ = [Δd_x, Δd_y, Δd_z, Δφ_x, Δφ_y, Δφ_z]^T.    (6)

Lowe's algorithm dictates that for each point in the model matched against some corresponding point in the image, we first project the model point into the image using the current parameter estimates and then measure the error in the resulting position with respect to the given image point. The u and v components of the error can be used independently to create separate linearized constraints. Making use of the u component of the error, e_u, we create an equation that expresses this error as the sum of the products of its partial derivatives times the unknown error-correcting values

    (∂u/∂d_x) Δd_x + (∂u/∂d_y) Δd_y + (∂u/∂d_z) Δd_z
        + (∂u/∂φ_x) Δφ_x + (∂u/∂φ_y) Δφ_y + (∂u/∂φ_z) Δφ_z = e_u.    (7)

The same point yields a similar equation for its v component. Thus, each point correspondence yields two equations. As Lowe says: "from three point correspondences we can derive six equations and produce a complete linear system which can be solved for all six camera-model corrections."

3. LOWE'S APPROXIMATION

Lowe's formulation assumes that d_x and d_y are constants to be determined by the iterative procedure, when in fact they are not constants at all: they depend on the location of the points being imaged.
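The Gauss-Newton iteration of Section 2 can be sketched concretely. The following Python fragment is a minimal sketch with hypothetical point data, not the authors' implementation: it builds the Jacobian of Eq. (2) from the Table 2 derivatives and solves for the correction vector δ of Eq. (6) in the least-squares sense.

```python
import numpy as np

def lowe_derivatives(xp, yp, zp, dz, f):
    """Rows of Table 2: d(u, v)/d(dx, dy, dz, phi_x, phi_y, phi_z)
    for one camera-frame point [x', y', z'] = R p, with c = 1/(z' + dz)."""
    c = 1.0 / (zp + dz)
    du = np.array([1.0, 0.0, -f*c*c*xp,
                   -f*c*c*xp*yp, f*c*(zp + c*xp*xp), -f*c*yp])
    dv = np.array([0.0, 1.0, -f*c*c*yp,
                   -f*c*(zp + c*yp*yp), f*c*c*xp*yp, f*c*xp])
    return du, dv

def newton_step(points, s, f, e):
    """One iteration of Eq. (1): solve J delta = e (Eq. (2)) for the
    corrections of Eq. (6) and subtract them from the estimate s.
    points: camera-frame coordinates [x', y', z'] of the model points.
    s = [dx, dy, dz, phi_x, phi_y, phi_z]; e: stacked (u, v) errors."""
    rows = []
    for xp, yp, zp in points:
        du, dv = lowe_derivatives(xp, yp, zp, s[2], f)
        rows.append(du)
        rows.append(dv)
    J = np.vstack(rows)                   # (2n, 6): two rows per match
    delta, *_ = np.linalg.lstsq(J, e, rcond=None)
    return s - delta
```

With n ≥ 3 correspondences the system has at least six equations, matching Lowe's remark that three point matches suffice to solve for all six corrections.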

Let the rows of the rotation matrix R be denoted by r_x, r_y, and r_z, such that

    R = [r_x; r_y; r_z].

Then, using the projective transformation formulated in Eq. (3), the new parameters d_x, d_y, d_z are given by

    d_z = −r_z · t,  and then
    [d_x, d_y] = −f [r_x · t / (r_z · p + d_z),  r_y · t / (r_z · p + d_z)].    (8)

Notice that d_z is dependent only on the object pose parameters, but d_x and d_y are also a function of each point's coordinates in the object coordinate frame. It is therefore in general impossible to find a single consistent value either for d_x or for d_y. In the general case both these parameters will depend on the position of each individual object feature. They are not constants: they are only the same for those points for which r_z · p has the same value. Therefore, we cannot use d_x and d_y as defined in Eq. (4). The assumption that is implicit in Lowe's algorithm as published is that the corrections needed for the translation are much larger than those due to the rotation of the object. However, if no restrictions are imposed, the coordinates of the points in the object coordinate frame (p) can assume high values. Even if they do not, the term r_z · p may change significantly (due to the object's own geometry) and affect the estimation process.

4. ISHII'S APPROXIMATION

Ishii's formulation [11] also contains simplifications. Image formation is again given by Eq. (3). Defining

    [x_t, y_t, z_t]^T = Rt,    (9)

the partial derivatives of u and v with respect to each of the seven parameters of the camera model are given in Table 3. The vector [x_t, y_t, z_t]^T represents the translation vector in the camera coordinate frame. In this approximation, the computation of the partial derivatives is performed using the coordinates of the points in the object coordinate frame, ignoring the effect of rotation.

TABLE 3
The Partial Derivatives of u and v with Respect to Each of the Camera Viewpoint Parameters and the Focal Length According to Ishii's Approximation

                     u                          v
    x_t            −f c                         0
    y_t              0                        −f c
    z_t            f a c²                     f b c²
    φ_x         −f a c² p_y           −f c(p_z + b c p_y)
    φ_y      f c(p_z + a c p_x)            f b c² p_x
    φ_z           −f c p_y                   f c p_x
    f               a c                        b c

    Note. Here [a, b, c] = [p_x − x_t, p_y − y_t, 1/(p_z − z_t)], where p = [p_x, p_y, p_z]^T.

5. OUR FULLY PROJECTIVE SOLUTION

Initially, define x′, y′, and z′ as in Lowe's formulation

    [x′, y′, z′]^T = Rp.

Model the image formation process by Eq. (3). Remove the approximations of Lowe and Ishii by defining

    [d_x′, d_y′, d_z′] = −[r_x · t, r_y · t, r_z · t].    (10)

In this case, the image coordinates of each point are given by

    [u, v] = f [(x′ + d_x′)/(z′ + d_z′),  (y′ + d_y′)/(z′ + d_z′)].    (11)

The partial derivatives of u and v with respect to each of the six pose parameters and the focal length are given in Table 4.

As in Lowe's formulation, the translation vector is computed using Eq. (5), with d_x′, d_y′, and d_z′ as defined in Eq. (10). This translation vector is defined in the object coordinate frame. The minimization process yields estimates of d_x′, d_y′, and d_z′, which are the result of the product of the rotation matrix by the translation vector.

A numerically equivalent but conceptually more elegant way of looking at this solution is through a redefinition of the image formation process, so that rotation and translation are explicitly decoupled, and the translation vector is defined in the camera

TABLE 4
The Partial Derivatives of u and v with Respect to Each of the Camera Viewpoint Parameters and the Focal Length According to Our Fully Projective Solution

                     u                          v
    d_x′            f c                         0
    d_y′             0                         f c
    d_z′          −f a c²                   −f b c²
    φ_x          −f a c² y′            −f c(z′ + b c y′)
    φ_y       f c(z′ + a c x′)             f b c² x′
    φ_z           −f c y′                    f c x′
    f               a c                        b c

    Note. Here [a, b, c] = [x′ + d_x′, y′ + d_y′, 1/(z′ + d_z′)].

coordinate frame. Redefine

    [x, y, z]^T = Rp + t,    (12)

    then  [d_x′, d_y′, d_z′]^T = t,    (13)

and Eqs. (10) and (11) can be collapsed into

    [u, v] = f [(x′ + t_x)/(z′ + t_z),  (y′ + t_y)/(z′ + t_z)].    (14)

In this case, the least-squares minimization procedure gives the estimates of the translation vector directly.

6. EXPERIMENTAL RESULTS

In order to compare the three algorithms described in the previous sections we report extensive experiments with synthetic data. Our goal is to estimate the relative accuracy and convergence speed of each algorithm for a number of useful situations. So, in the tests we control a few parameters explicitly and sample all the others uniformly, hoping to cover important cases while keeping the amount of data down to a manageable level. In Lowe's approximation, we use the depth of the center of the object in the camera frame as the multiplicative factor that yields the values of d_x and d_y. All the methods are tested with exactly the same poses and initial conditions [2].

Unless explicitly stated otherwise, all the experiments described here take the imaged object to be the eight corners of a cube, with edge lengths equal to 25 times the focal length of the camera (for a 20 mm lens, for instance, this corresponds to a half-meter-wide, long, and deep object). The parameters explicitly controlled, in general, are the depth of the object's center with respect to the camera frame (z_true), measured in focal lengths, and the magnitudes of the translation (t_diff) and the rotation (r_diff) needed to align the initial solution with the true pose. z_true is always measured in focal lengths, and t_diff and r_diff are measured as a relative error with respect to z_true and as an absolute error in π radians, respectively. A formal definition of these parameters and of the whole sampling methodology is given in [2].

Unless stated otherwise, three average values are chosen for each of those parameters (Table 5). For each average value v, the corresponding parameter is then sampled uniformly in the region [3v/4, 5v/4].

TABLE 5
General Average Sampling Values Used in Most Tests for the Controlled Parameters

    Param       Avg 1       Avg 2       Avg 3
    z_true       50          500        5,000
    t_diff       0.1         0.01       0.001
    r_diff       0.2         0.02       0.002

The other nine pose and initial solution parameters are in general sampled uniformly over their whole domain. The true object position is constrained to lie in the interior of the infinite pyramid whose origin is the optical center and whose faces are the semi-planes z = |x| and z = |y|, z ≥ 0.

For each test we compute two global image-space error measures, assuming known correspondence between image and model features. The first, called Norm of Distances Error (NDE), is the norm of the vector of distances between the positions of the features in the actual image and the positions of the same features in the reprojected image generated by the estimated pose. The second, called Maximum Distance Error (MDE), is the greatest absolute value of the vector of error distances. Both measures are always expressed using the focal length as length unit.

NDE and MDE do not necessarily indicate how close the estimated pose is to the true pose. We also record individual errors for six different pose parameters: the errors in the x, y, and z coordinates of the estimate for the actual object translation vector, measured as relative errors with respect to the object's center actual depth (z_true), and the absolute errors in the estimates for the roll, pitch, and yaw angles of the object frame with respect to the camera, measured in units of π radians. Although all these metrics were computed, this note usually shows only results with NDE and x-translation error: they are faithfully representative of both image-space error metrics and the three translation and three rotation error metrics.

For each of these eight different error measures, we compute the average, the standard deviation, the averages and standard deviations excluding the 1, 5, or 25% smallest and largest absolute values, and the median. Statistics that leave out the tails of the error distributions are included to be fair to a method (if any) that underperforms in a few exceptional situations but is better "in general": for instance, one that occasionally violently diverges but usually gives better results. In this note we usually present only the average error and its standard deviation and the results with the exclusion of the upper and lower 25% of the errors. For more error measures and more statistics see [2].

6.1. Convergence in the General Case

Initially, we tried to compare the speed of convergence and final accuracy of each method with arbitrary poses and initial conditions. The statistics for the NDE, based on 13,500 executions per method, are plotted in Fig. 1. They show that for most poses, Lowe's original approximation converges to a very high global error level, and Ishii's approximation only improves the initial solutions in its first iteration and diverges after that. Our fully projective solution, on the other hand, converges at a superexponential rate to an error level roughly equivalent to the relative rounding error of double precision, which is about 1.11 × 10^−16.

Even taking into account the worst data, our approximation still converges superexponentially to this maximum precision

FIG. 1. Convergence of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the number of iterations of
Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line). Tests were performed with a cube rotated by arbitrary angles with
respect to the camera frame.
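As a concrete sketch, the two image-space measures plotted in Fig. 1 and defined in Section 6 can be computed as follows (the function name and sample data are ours, for illustration only):

```python
import numpy as np

def nde_mde(actual, reprojected):
    """Image-space errors of Section 6, in focal-length units.
    actual, reprojected: (n, 2) arrays of corresponding image points.
    NDE is the norm of the vector of per-feature distances;
    MDE is the largest single per-feature distance."""
    dists = np.linalg.norm(np.asarray(actual) - np.asarray(reprojected), axis=1)
    return float(np.linalg.norm(dists)), float(dists.max())
```

For example, with actual points [[0, 0], [1, 0]] and reprojected points [[3, 4], [1, 0]], the per-feature distances are [5, 0], so NDE = 5 and MDE = 5.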

level—the bad cases only slow convergence a bit. But in this case,
Lowe’s original algorithm and (especially) Ishii’s approximation tend
to diverge, yielding some solutions worse than the initial conditions.
   The statistics for the errors in the individual pose parameters make
the superiority of the fully projective approach even clearer. Figure 2
exhibits the relative errors in the value of the x translation. Both
Lowe’s and Ishii’s algorithms diverge in most situations, while the
fully projective solution keeps its superexponential convergence. Due
to their simplifications, Lowe’s and Ishii’s methods in those cases are
not able to recover the true rotation of the object. They tend to make
corrections in the translation components to fit the erroneously
rotated models to the image in a least-squares sense, generating very
imprecise values for the parameters themselves. This problem is
especially acute with Ishii’s approximation, which tends to translate
the object as far away from the camera as possible, so that the
reprojected images of all points are collapsed into a single spot that
minimizes the mean of the squared distances with respect to the true
images. Similar results were obtained for the other five
parameter-space errors.
   To ensure that the results did not depend on symmetries in the
cubical imaged object, we repeated the same tests with an asymmetric
object whose eight points were all uniformly sampled in the space
[−1, 1]³ and then scaled for a maximum edge size of 25 focal lengths.
All the results were almost identical to those obtained with the cube.

6.2. Convergence with Rough Alignment

   For some relevant practical applications, our initial assumption
that all the attitudes of the object with respect to the camera occur
with equal probability is too general. For instance, in
vehicle-following applications it is reasonable to assume that poses in
which the object frame is roughly aligned with the camera frame occur
with much larger probability than poses in which the object frame is
rotated by large angles. We therefore performed some tests in which the
rotation component of the initial solutions was represented by a
quaternion whose axis was sampled uniformly on a unit semi-sphere with
z ≥ 0, but whose angle was constrained to the region [−π/5, π/5].
   The NDE statistics, plotted in Fig. 3, show that in this case the
accuracy of Ishii’s approximation is much improved (predictably, given
its semantics). Instead of diverging, it now converges exponentially
toward the rounding-error lower bound. So, even in this favorable
situation, Ishii’s approximation is still much less efficient than the
fully projective solution, which
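The constrained-rotation sampling described above (axis uniform on the z ≥ 0 semi-sphere, angle uniform in [−π/5, π/5]) can be sketched as follows. This is our reconstruction, not the paper's code, and the function name is ours:

```python
import math, random

def sample_rough_rotation(rng=random):
    # Axis: uniform direction on the unit sphere via normalized Gaussian
    # samples, reflected into the z >= 0 semi-sphere.
    while True:
        x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
        n = math.sqrt(x * x + y * y + z * z)
        if n > 1e-12:
            break
    ax, ay, az = x / n, y / n, abs(z) / n
    # Angle: uniform in [-pi/5, pi/5], as in the tests described above.
    angle = rng.uniform(-math.pi / 5, math.pi / 5)
    s = math.sin(angle / 2.0)
    return (math.cos(angle / 2.0), s * ax, s * ay, s * az)  # (w, x, y, z)

q = sample_rough_rotation()
assert abs(sum(c * c for c in q) - 1.0) < 1e-12         # unit quaternion
assert 2.0 * math.acos(min(1.0, q[0])) <= math.pi / 5 + 1e-9  # small angle
```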

FIG. 2. Convergence of the ratio between the error on the estimated x translation and the actual depth of the object’s center, with respect to the number of
iterations of Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line). Tests were performed with a cube rotated by arbitrary
angles with respect to the camera frame.
232                                                         ARAÚJO, CARCERONI, AND BROWN

FIG. 3. Convergence of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the number of iterations of
Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line). Tests were performed with a cube rotated by angles of at most π/5
radians with respect to the camera frame.

converges superexponentially (in about 5 iterations) for the NDE, as
shown, and also for all other error metrics tested.

6.3. Execution Times

   Lowe’s and Ishii’s simplifications do not result in a significant
inner-loop performance gain with respect to the fully projective
solution. We hand-optimized the three algorithms, with common
subexpression factorization, loop vectorization, and static
preallocation of all matrices. After that, the internal loop (in
Matlab) for Lowe’s method (which is the simplest of the three)
contained only four floating-point operations fewer than the internal
loop of the fully projective solution.
   We measured the execution times of 20 iterations of each method
(details in [2]). The statistics shown in Fig. 4 were gathered from a
set of 13,500 runs per method, performed with the same sampling
techniques employed in the convergence experiments.
   Average times for the fully projective solution were 2.99 to 4.21%
longer than those of Lowe’s original method, but the standard
deviations of the elapsed times for Lowe’s solution were between 6 and
130% larger than those of the fully projective one. Thus, the fully
projective approach may be more suitable for hard real-time
constraints, due to its smaller sensitivity to ill-conditioned
configurations. The problem is that Lowe’s original method is much more
likely to face singularity problems in the resolution of the system
described in Eq. (2), resulting in the execution of slower built-in
Matlab routines. The fully projective approach looks even better when
compared to Ishii’s solution. The explanation is that a careful
subexpression factorization can save us the work that Ishii’s
simplifications are designed to save, so we pay no time penalty for a
solution that is less sensitive to the proximity of singularities [2].

6.4. Sensitivity to Depth in Object Center Position

   We also performed some experiments to check the sensitivity of the
techniques to individual variations in each one of the three controlled
parameters. First, we varied the average value of z_true (object depth)
logarithmically between 25 and 51,200 focal lengths (corresponding,
respectively, to 50 cm and 1024 m, with a 20 mm lens). The statistics
for the NDE, plotted in Fig. 5 for each of the 12 values chosen for
z_true, show that our method is almost always much more accurate than
both Lowe’s and Ishii’s. The only exception occurs at a distance of
25 focal lengths.
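The 12 logarithmically spaced depth values described above are successive doublings from 25 to 51,200 focal lengths; with the 20 mm lens mentioned in the text, the metric range works out as follows (an illustrative sketch, not the paper's code):

```python
# 12 depth levels, doubling from 25 to 51,200 focal lengths (Section 6.4).
FOCAL_LENGTH_M = 0.020  # the 20 mm lens assumed in the text

depths_f = [25 * 2 ** k for k in range(12)]          # in focal lengths
depths_m = [d * FOCAL_LENGTH_M for d in depths_f]    # in meters

assert len(depths_f) == 12
assert depths_f[0] == 25 and depths_f[-1] == 51200
assert abs(depths_m[0] - 0.5) < 1e-9      # 25 focal lengths ~ 50 cm
assert abs(depths_m[-1] - 1024.0) < 1e-6  # 51,200 focal lengths ~ 1024 m
```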

 FIG. 4. Execution times (in seconds) for 20 iterations of each method, computed over all data and with elimination of the 1, 5, and 25% best and worst data.
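The "elimination of the 1, 5, and 25% best and worst data" in the caption is a symmetric trimming of the timing samples before the mean and standard deviation are computed. A minimal sketch under the simplest interpretation (drop ⌊np/100⌋ samples at each end); the exact procedure is in [2]:

```python
import statistics

def trimmed_stats(times, p):
    # Drop the p% smallest and p% largest samples, then compute statistics.
    k = int(len(times) * p / 100.0)
    core = sorted(times)[k:len(times) - k]
    return statistics.mean(core), statistics.pstdev(core)

times = [1.0, 1.1, 0.9, 1.05, 0.95, 5.0, 0.2, 1.02]  # toy timings, 2 outliers
mean_all, _ = trimmed_stats(times, 0)
mean_25, _ = trimmed_stats(times, 25)
assert mean_25 < mean_all  # trimming removes the pull of the 5.0 outlier
```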

FIG. 5. Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the actual depth of the object’s
center (in focal lengths), for Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line).

   The problem is that in this situation some individual object points
may get as close as 5 focal lengths from the zero-depth plane of the
camera frame, due to the errors in the initial conditions. In this
case, our method tends to behave like Ishii’s, shifting the object as
far away from the camera as it can (so as to collapse the image into a
single point), instead of aligning it. This can be confirmed by the
analysis of the errors for the x translation (Fig. 6). But even in this
extreme situation, our method, unlike Lowe’s and Ishii’s, still
converges in most cases. The results for the errors on the rotation
also support these observations.

6.5. Sensitivity to Translational Error in Initial Solution

   Using the same sampling methodology as the previous experiment, we
also studied the effect of changing the relative error in the
translational component of the initial pose estimates. Fifteen values
for the relative initial translational error t_diff, ranging from 0.025
to 0.5, were chosen.
   The statistics for the NDE, depicted in Fig. 7, show that our method
is once again much more accurate in general. However, when the average
magnitude of the translational error is greater than 30% of the actual
depth of the object’s center, our method has convergence problems for
the worst 1% of the data, and its overall reprojection accuracy drops
to a level close to that of Lowe’s original approximation.
   An analysis of the statistics for the x translation (Fig. 8—other
translation and pose-angle results are similar) shows that in these
cases no divergence toward infinite depth occurs, but merely a
premature convergence to false local minima. It is interesting to note
that the accuracy of Lowe’s method stays at the same high error levels
even with much better initial conditions, which indicates that Lowe’s
algorithm (as well as Ishii’s, which performs even worse) usually (and
not only in extreme cases) gets stuck in local minima.

6.6. Sensitivity to Rotational Error in Initial Solution

   Using the same sampling strategy once more, we selected 10 average
values for the absolute rotational error r_diff, ranging from π/10 to π
radians. The statistics for the NDE, exhibited in Fig. 9, show again
the superiority of our approach for relatively small errors. Similarly,
with errors larger than 3π/10 radians, our method starts having
convergence problems and its reprojection accuracy approaches that of
Lowe’s.
   The errors in x-translation recovery (Fig. 10) and in the pose
angles show that large errors in initial rotation, unlike those in
initial translation, make our method diverge towards infinite depth.
This causes its accuracy in terms of pose-parameter values to drop to
levels comparable to (in some cases even worse than) those of Ishii’s.
However, in this situation Lowe’s original
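The controlled translational disturbance of Section 6.5 can be sketched as follows. This assumes the disturbance is a random direction scaled to t_diff times the actual depth of the object's center; the exact perturbation model is detailed in [2], and the function names are ours:

```python
import math, random

def random_unit(rng=random):
    # Uniform direction on the unit sphere via normalized Gaussian samples.
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        n = math.sqrt(sum(c * c for c in v))
        if n > 1e-12:
            return [c / n for c in v]

def perturb_translation(t_true, t_diff, rng=random):
    # Disturb the true translation by t_diff times the object depth (z).
    depth = abs(t_true[2])
    d = random_unit(rng)
    return [ti + t_diff * depth * di for ti, di in zip(t_true, d)]

t0 = perturb_translation([1.0, -2.0, 50.0], 0.1)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(t0, [1.0, -2.0, 50.0])))
assert abs(err - 0.1 * 50.0) < 1e-9  # error magnitude is t_diff * depth
```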

FIG. 6. Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object’s center, with respect to the actual depth of the
object’s center (in focal lengths), for Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line).
FIG. 7. Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the ratio between the magnitude
of the translational disturbance in the initial solution and the actual depth of the object’s center, for Lowe’s (solid line), Ishii’s (dotted line), and our fully projective
solution (dash–dotted line).

FIG. 8. Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object’s center, with respect to the ratio between the
magnitude of the translational disturbance in the initial solution and the actual depth of the object’s center, for Lowe’s (solid line), Ishii’s (dotted line), and our
fully projective solution (dash–dotted line).

FIG. 9. Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the magnitude of the rotational
disturbance in the initial solution (in π radians), for Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line).

FIG. 10. Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object’s center, with respect to the magnitude of the
rotational disturbance in the initial solution (in π radians), for Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line).

FIG. 11. Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the standard deviation of the
noise added to the image (in focal lengths), for Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line).

method also diverges. A solution with a relative translational error of
10¹⁰, 10⁵, or even 10¹ is not much more useful in practice than another
solution with a relative translational error of 10²⁰. The problem in
this case is the intrinsically downhill nature of Newton’s method,
which is the core of all the techniques studied here. We believe that
the only way to overcome this limitation would be to use a method based
on an optimization technique with better global convergence properties,
such as trust-region optimization.

6.7. Sensitivity to Additive Noise

   In this experiment, Gaussian noise with zero mean and controlled
standard deviation was added to the coordinates of the features in the
image. Two thousand seven hundred executions of each method were
performed for each of the 15 values of the noise standard deviation
chosen in the range of 2⁻¹⁵ to 2⁻¹ focal lengths.
   The statistics for the NDE, plotted in Fig. 11, show that in this
case the accuracy of our solution is always limited by the noise level,
while the other two approaches get stuck on higher error levels even
when the noise level is very small. For an error level of about 10⁻³
focal lengths (which corresponds roughly to the quantization noise of a
sensing array of 1k × 1k pixels), there is still a considerably wide
accuracy gap (about one order of magnitude) between our technique and
Lowe’s, the second most accurate method.
   The analysis of the effect on the x-translation errors (Fig. 12)
shows that divergence toward infinite depth is a problem again for
relatively high noise levels (greater than 10⁻³ focal lengths in the
worst cases). However, the roll-angle errors, displayed in Fig. 13,
illustrate the fact that the degradation in the estimate for the
rotation provided by our method occurs smoothly. Our technique remains
significantly more precise, at least for rotation recovery, for noise
levels of up to 10⁻¹ focal lengths. This is quite impressive given that
the restrictions in the view angle constrain the images to a 2 × 2
window (in focal lengths) on the image plane, where the noise was
added.

6.8. Accuracy in Practice

   Finally, we also wanted to compare the three methods in a realistic
situation, in order to check whether the better accuracy properties of
our approach would make any difference in practice. The introduction of
noise in the experiments was a first step in this direction, but up to
this point we have not addressed the question of what would be
realistic initial conditions. One
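The noise model above can be sketched directly; the 15 tested standard deviations are successive doublings from 2⁻¹⁵ to 2⁻¹ focal lengths (the function name is ours, not the paper's):

```python
import random

# The 15 noise levels of Section 6.7, in focal-length units.
noise_levels = [2.0 ** -k for k in range(15, 0, -1)]  # 2^-15 ... 2^-1
assert len(noise_levels) == 15
assert noise_levels[0] == 2.0 ** -15 and noise_levels[-1] == 0.5

def add_noise(points, sigma, rng=random):
    # Add zero-mean Gaussian noise independently to each image coordinate.
    # points: list of [u, v] image coordinates in focal-length units.
    return [[u + rng.gauss(0.0, sigma), v + rng.gauss(0.0, sigma)]
            for u, v in points]

noisy = add_noise([[0.1, -0.2], [0.3, 0.4]], 2.0 ** -10)
assert len(noisy) == 2
```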

FIG. 12. Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object’s center, with respect to the standard deviation
of the noise added to the image (in focal lengths), for Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line).

FIG. 13. Sensitivity of the error on the estimated roll angle (measured in π radians), with respect to the standard deviation of the noise added to the image (in
focal lengths), for Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line).

possibility for applications such as tracking would be to create
reasonably precise initial estimates of the pose with a smoothing
filter. But this approach is very dependent on application-specific
parameters, such as the sampling rate of the camera, the bandwidth of
the image-processing system as a whole, and the positional depth, the
linear speed, and the angular speed of the tracked object.
   A more general approach, which we follow here, is to use a weaker
camera model to generate an initial solution for the problem
analytically and then use the projective iterative solution(s) to
refine this initial estimate. This approach was suggested by DeMenthon
and Davis [4], who introduced a way of describing the discrepancy
between a weak-perspective solution and the full-perspective pose with
a set of parameters that can then be refined numerically, yielding the
latter from the former. Let p_i be the description of the ith model
point in the model frame and [u_i, v_i] be the corresponding image,
1 ≤ i < n. Then, the weak-perspective solution proposed in that paper
amounts to solving the following set of equations (in a least-squares
sense) for the unknown three-dimensional vectors x and y:

      (p_i − p_0) · x = u_i − u_0,   1 ≤ i < n,
                                                               (15)
      (p_i − p_0) · y = v_i − v_0,   1 ≤ i < n.

   A normalization of these vectors yields the first two rows of the
rotation component of the transformation that describes the object
frame in the camera coordinate system. The third row can then be
obtained with a single cross-product operation. After that, the
recovery of the translation is straightforward.
   However, this simple weak-perspective approximation introduces
errors that increase proportionally not only to the inverse depth of
the object, but also to its “off-axis” angle (the angle of its center
with respect to the optical axis as viewed from the optical center). In
order to avoid this last problem, we first preprocessed the image to
simulate a rotation that puts the center of the object’s image at the
intersection of the optical axis with the image plane. Let the center
of the object’s image be described by [u, v]. Then, this
transformation, as suggested in [22], is given by

          ⎡  1/d₁            0        −u/d₁      ⎤
      R = ⎢ −uv/(d₁d₂)       d₁/d₂    −v/(d₁d₂)  ⎥ ,           (16)
          ⎣  u/d₂            v/d₂      1/d₂      ⎦

where

      d₁ = √(u² + 1),   d₂ = √(u² + v² + 1).

   After this preprocessing, we applied the technique described by
Eq. (15) in order to recover the “foveated” pose. Then, we premultiplied
the resulting transformation by the inverse of the matrix defined in
Eq. (16) in order to recover the original weak-perspective pose, which
was used as the initial solution for the iterative techniques being
compared.
   The only controlled parameter left was the actual depth of the
object’s center (z_true). We chose nine average values for it, growing
exponentially from 25 to 6400 focal lengths. The noise standard
deviation was set at 0.002 focal lengths (corresponding roughly to a
512 × 512 spatial quantization). The number of iterations of each
method per run was set at 2, allowing a real-time execution rate of
about 100 Hz. For each average value of z_true, 2500 independent runs
of each technique were performed.
   The statistics for the NDE, depicted in Fig. 14, show that our fully
projective solution was up to one order of magnitude more accurate than
the other two methods for most cases in which the distance was smaller
than 1000 focal lengths (about 20 m, with the typical focal length of
20 mm). For distances greater than that, the precision of the
weak-perspective initial solution alone was greater than the limitation
imposed by the noise, and so the three techniques performed equally
well.
   Analysis of the results for the x-translation error (Fig. 15) and
the other five parameter-space errors shows the interesting fact that
all the techniques exhibit parameter-space accuracy peaks in the range
of 50 to 400 focal lengths. The explanation is that when the object
gets too close, the quality of the initial weak-perspective solution
degrades quickly. But, on the other hand, when the object is too far
away, the noise
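The "foveating" rotation of Eq. (16) can be checked numerically: it is orthonormal and maps the viewing direction (u, v, 1) of the object's image center onto the optical axis. A sketch (the sign conventions were reconstructed here from the orthonormality constraints, since they are hard to read in the original):

```python
import math

def foveation_rotation(u, v):
    # Rotation of Eq. (16): takes the ray through image point (u, v) to
    # the optical axis. d1, d2 as defined in the text.
    d1 = math.sqrt(u * u + 1.0)
    d2 = math.sqrt(u * u + v * v + 1.0)
    return [[1.0 / d1,             0.0,        -u / d1],
            [-u * v / (d1 * d2),   d1 / d2,    -v / (d1 * d2)],
            [u / d2,               v / d2,     1.0 / d2]]

u, v = 0.37, -0.81
R = foveation_rotation(u, v)
# R (u, v, 1)^T lands on the positive z (optical) axis.
w = [sum(R[i][j] * [u, v, 1.0][j] for j in range(3)) for i in range(3)]
assert abs(w[0]) < 1e-12 and abs(w[1]) < 1e-12 and w[2] > 0.0
# Rows are orthonormal, so R is a rotation (up to determinant sign).
for i in range(3):
    for j in range(3):
        dot = sum(R[i][k] * R[j][k] for k in range(3))
        assert abs(dot - (1.0 if i == j else 0.0)) < 1e-12
```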

FIG. 14. Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the actual depth of the object’s
center (in focal lengths), for Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line). Tests were performed with initial solutions
generated by a weak-perspective approximation.

gradually overpowers the information about both the distance (via
observed size) and the orientation of the object, since all the feature
images tend to collapse into a single point. Of course, in practice,
the exact location of these peaks depends on the dimensions of the
actual object(s) whose pose is being recovered.
   In the case of our technique, the accuracy peak occurred clearly at
distances of 50 to 100 focal lengths (1 to 2 m with a 20 mm lens).
Similar results were obtained when the number of iterations for each
run was raised to 5. This suggests that our solution may be very well
suited for indoor applications in which it is possible to keep a safe
distance between the objects of interest and the camera.

                 7. DISCUSSION AND CONCLUSION

   This note formulates a fully projective treatment of a pose- or
parameter-recovery algorithm initially proposed by Lowe [13–15]. The
resulting formulation is compared with formulations by Lowe and Ishii
[11] that approximate the fully projective case. Many experiments based
on different scenarios are presented here, and more are available
in [2].
   Lowe’s approximation was discussed by McIvor [16]. He states that
assuming that d_x and d_y are constants amounts to an affine
approximation. This is true for the parameters d_x and d_y themselves,
but the affine approximation does not extend through the whole
formulation—in Eq. (4) the denominators use z_0 + d_z instead of just
d_z. If a constant value had been used for those denominators, then the
formulation would be purely affine. Without implementing other
formulations, McIvor speculates (correctly) that the use of full
perspective would improve the accuracy of the viewpoint, perhaps at the
expense of decreased numerical stability. But as we show in Section 6,
the fully projective formulation is actually more stable, except in
situations that break the other two formulations tested as well.
   Bray [3] uses Lowe’s algorithm without discussing the approximation.
Worrall et al. [21] compare their algorithm for perspective inversion
with Lowe’s algorithm. They claim that their technique outperforms both
Lowe’s original method and a reformulation of it using full perspective
projection in terms of speed of convergence in simulations performed
with a cube. This work sounds similar to ours, but [21] provides no
detail on the perspective-projection version of Lowe’s algorithm used
in the comparison. They also do not present any discussion or
comparison between the two different implementations of Lowe’s
algorithm that they mention. Finally, they only report concrete
experimental results for their own inversion method, which is based on
line (rather than point) correspondences. No comparative evaluation of
the two variants of Lowe’s algorithm was presented.

FIG. 15. Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object’s center, with respect to the actual depth of
the object’s center (in focal lengths), for Lowe’s (solid line), Ishii’s (dotted line), and our fully projective solution (dash–dotted line). Tests were performed with
initial solutions generated by a weak-perspective approximation.

FIG. 16. Summary. Left (subset of Fig. 1), convergence of the NDE, an image-space error metric (see Section 6), with respect to the number of iterations of
Lowe’s (solid line), Ishii’s (dotted line), and the fully projective solution (dash–dotted line); statistics exclude the best and worst 25% results. Right (subset of
Fig. 4), mean and standard deviation of execution times; statistics include all data.

   Our experiments indicate that a straightforward reformulation of the
imaging equations removes mathematical approximations that limit the
precision of Lowe’s and Ishii’s formulations. The fully projective
algorithm has better accuracy with a minimal increase in computational
cost per iteration (Fig. 16).
   The fully projective solution is very stable for a wide range of
actual object poses and initial conditions. In some particularly
extreme scenarios, our approach does suffer from numerical-stability
problems, but in these situations the accuracy of Lowe’s and Ishii’s
approximations is also unacceptable, with errors of one or more orders
of magnitude in the values of the pose parameters. We believe that this
type of problem is a consequence of Newton’s method and can only be
overcome with the use of more powerful numerical optimization
techniques, such as trust-region methods.
   In scenarios that may realistically arise in applications such as
indoor navigation, with the use of reasonable (weak-perspective)
initial solutions and taking into account the effect of additive
Gaussian noise in the imaging process, the fully projective formulation
outperforms both Lowe’s and Ishii’s approximations by up to an order of
magnitude in terms of accuracy, with practically the same computational
cost.

                           REFERENCES

 1. M. A. Abidi and T. Chandra, A new efficient and direct solution for
    pose estimation using quadrangular targets: Algorithm and evaluation,
    IEEE Trans. Pattern Anal. Machine Intell. 17(5), 1995, 534–538.
 2. H. Araujo, R. L. Carceroni, and C. M. Brown, A fully projective
    formulation for Lowe’s tracking algorithm, Technical Report 641,
    University of Rochester Computer Science Dept., Nov. 1996.
 3. A. J. Bray, Tracking objects using image disparities, Image Vision
    Comput. 8(1), 1990, 4–9.
 4. D. F. DeMenthon and L. S. Davis, Model-based object pose in 25 lines
    of code, Int. J. Comput. Vision 15, 1995, 123–141.
 5. M. Dhome, M. Richetin, J.-T. Lapresté, and G. Rives, Determination of
    the attitude of 3-D objects from a single perspective view, IEEE
    Trans. Pattern Anal. Machine Intell. 11(12), 1989, 1265–1278.
 7. D. B. Gennery, Visual tracking of known three-dimensional objects,
    Int. J. Comput. Vision 7(3), 1992, 243–270.
 8. R. M. Haralick and C. Lee, Analysis and solutions of the three point
    perspective pose estimation problem, in Proc. IEEE Conference on
    Computer Vision and Pattern Recognition, 1991, pp. 592–598.
 9. R. Horaud, S. Christy, F. Dornaika, and B. Lamiroy, Object pose: Links
    between paraperspective and perspective, in Proc. 5th IEEE
    International Conference on Computer Vision, 1995, pp. 426–433.
10. R. Horaud, B. Conio, O. Leboulleux, and B. Lacolle, An analytic
    solution for the perspective 4-point problem, Comput. Vision Graphics
    Image Process. 47, 1989, 33–44.
11. M. Ishii, S. Sakane, M. Kakikura, and Y. Mikami, A 3-D sensor system
    for teaching robot paths and environments, Int. J. Robotics Res. 6(2),
    1987, 45–59.
12. S. Linnainmaa, D. Harwood, and L. S. Davis, Pose determination of a
    three-dimensional object using triangle pairs, IEEE Trans. Pattern
    Anal. Machine Intell. 10(5), 1988, 634–647.
13. D. G. Lowe, Solving for the parameters of object models from image
    descriptions, in Proc. ARPA Image Understanding Workshop, 1980,
    pp. 121–127.
14. D. G. Lowe, Three-dimensional object recognition from single
    two-dimensional images, Artificial Intell. 31(3), 1987, 355–395.
15. D. G. Lowe, Fitting parameterized three-dimensional models to images,
    IEEE Trans. Pattern Anal. Machine Intell. 13(5), 1991, 441–450.
16. A. McIvor, An analysis of Lowe’s model-based vision system, in Proc.
    4th Alvey Vision Conference, University of Manchester, U.K., 1988,
    pp. 73–77.
17. N. Navab and O. Faugeras, Monocular pose determination from lines:
    Critical sets and maximum number of solutions, in Proc. IEEE
    Conference on Computer Vision and Pattern Recognition, 1993,
    pp. 254–260.
18. D. Oberkampf, D. F. DeMenthon, and L. S. Davis, Iterative pose
    estimation using coplanar feature points, Comput. Vision Image
    Understanding 63(3), 1996, 495–511.
19. T. Q. Phong, R. Horaud, and P. D. Tao, Object pose from 2-D to 3-D
    point and line correspondences, Int. J. Comput. Vision 15, 1995,
    225–243.
20. T. Shakunaga and H. Kaneko, Perspective angle transform: Principle of
    shape from angles, Int. J. Comput. Vision 3, 1989, 239–254.
21. A. D. Worrall, K. D. Baker, and G. D. Sullivan, Model based
    perspective inversion, Image Vision Comput. 7(1), 1989, 17–23.
22. Y. Wu, S. S. Iyengar, R. Jain, and S. Bose, A new generalized compu-
 6. M. A. Fischler and R. C. Bolles, Random sample consensus: A paradigm                tational framework for finding object orientation using perspective trihedral
    for model fitting with applications to image analysis and automated cartog-         angle constraint, IEEE Trans. Pattern Anal. Machine Intell. 16(10), 1994,
    raphy, Comm. ACM 24(6), 1981, 381–395.                                              961–975.
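The concluding remark on replacing plain Newton iteration with trust-region-style optimization can be made concrete with a small sketch. The code below is illustrative only and is not the paper's implementation: it refines a six-parameter pose (a Rodrigues rotation vector plus translation) by minimizing fully projective reprojection error with Levenberg–Marquardt damping, a simple trust-region-flavored safeguard on Gauss–Newton. The pinhole model with unit focal length, the function names, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def project(points, pose, f=1.0):
    """Full perspective projection of 3-D points under pose = (wx, wy, wz, tx, ty, tz)."""
    w, t = pose[:3], pose[3:]
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        R = np.eye(3)
    else:
        k = w / theta  # Rodrigues formula for the rotation matrix
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    cam = points @ R.T + t
    return f * cam[:, :2] / cam[:, 2:3]  # divide by depth: fully projective

def residuals(pose, points, obs):
    return (project(points, pose) - obs).ravel()

def refine_pose_lm(points, obs, pose0, iters=50, lam=1e-3):
    """Gauss-Newton with adaptive damping (Levenberg-Marquardt),
    a basic trust-region-style safeguard against divergent Newton steps."""
    pose = pose0.copy()
    r = residuals(pose, points, obs)
    for _ in range(iters):
        # Forward-difference numeric Jacobian (analytic would be faster).
        J = np.zeros((r.size, 6))
        eps = 1e-6
        for j in range(6):
            dp = np.zeros(6)
            dp[j] = eps
            J[:, j] = (residuals(pose + dp, points, obs) - r) / eps
        A = J.T @ J
        step = np.linalg.solve(A + lam * np.diag(np.diag(A)) + 1e-12 * np.eye(6),
                               -J.T @ r)
        r_new = residuals(pose + step, points, obs)
        if r_new @ r_new < r @ r:
            pose, r = pose + step, r_new
            lam = max(lam / 10, 1e-9)  # step accepted: relax damping
        else:
            lam *= 10                  # step rejected: tighten damping
    return pose

# Synthetic test: recover a known pose from noiseless projections.
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, (12, 3))
true = np.array([0.2, -0.1, 0.3, 0.1, -0.2, 5.0])
obs = project(pts, true)
guess = true + np.array([0.1, 0.1, -0.1, 0.05, 0.05, 0.3])  # perturbed initial pose
est = refine_pose_lm(pts, obs, guess)
```

With a reasonable initial guess, as in the weak-perspective initializations discussed above, the damped iteration converges where an undamped Newton step could overshoot.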