Robust 3D Street-View Reconstruction using Sky Motion Estimation


Taehee Lee
Computer Science Department, University of California, Los Angeles, CA 90095
taehee@cs.ucla.edu

Abstract

We introduce a robust 3D reconstruction system that combines a structure-from-motion (SfM) filter with bundle adjustment. The local bundle adjustment provides an initial depth for each feature newly introduced to the SfM filter, and the filter enables prediction of the camera motion while the reconstruction proceeds. In addition, we increase the robustness of the rotation estimation by estimating the motion of the sky from cylindrical panoramas of street views. The sky region is segmented by a robust estimation algorithm based on a translational motion model in the cylindrical panoramas. We show that the combination of the SfM filter and the bundle adjustment with sky motion estimation produces a more robust 3D reconstruction from street-view images than running each method separately.

Figure 1: Google Street-View dataset: (a) omnidirectional camera image, (b) projective camera images of the left, front, right and rear directions.

1. Introduction

The 3D reconstruction problem is to recover the depth of scene structures from the 2D information in one or more images. Especially when a camera moves around the scene taking photos or a video sequence, the problem is to estimate both the structure of the scene and the motion of the camera at the same time. Estimating the 3D scene structure and the camera motion is applicable in many areas: for example, unmanned vehicles or robots can utilize the visual information in real time to navigate through the scene or build a map of the place while interacting with the environment. Moreover, many recent internet-based map services provide not only two-dimensional maps but also terrain and building models in 3D interactive interfaces. Image-based modeling is also often used in the computer graphics and film industries.

In this paper, we focus on reconstructing the 3D structure of city streets for assisting vehicle navigation. In order to generate the 3D map, the camera trajectory and the structure are estimated using the image sequence taken by a survey vehicle with a camera mounted on it. The reconstruction algorithm is required to be accurate and robust so that large areas of streets in city blocks can be successfully reconstructed. We introduce 3D reconstruction algorithms that can be run separately, and improve their robustness by proposing a combined method of those algorithms and by estimating the camera's rotation on the ground using the motion of the sky.

The rest of the paper is structured as follows: we first introduce the street-view image dataset and review related work on the 3D reconstruction problem in the following subsections. Next we explain our method of enhancing the robustness of the 3D reconstruction in Section 2. Then we show and discuss the experimental results in Section 3. Finally, we conclude the paper and address future work in Section 4.

1.1. Street-View Image Sequence
The images of street views are collected from a survey vehicle that usually has one or more cameras mounted on the roof. When an omnidirectional camera is used, for example in the Google Street-View dataset, the field of view of the camera is nearly spherical, covering 360° in horizontal angles and more than a hemisphere in vertical angles. Similarly, multiple cameras can be arranged so that they cover a large field of view, as in [12].

In this paper, we use the Google Street-View dataset that is used to provide the Google Street-View service. The images of streets are taken with an omnidirectional camera as in Figure 1a, and then projected onto four rectangular image planes so that each image can be modeled as a projective camera as in Figure 1b. The images typically contain the sky, roads, side walls of buildings, and other objects such as trees, cars, and pedestrians.

In most cases, the depths of these objects in the street-level views vary over a large range; e.g., the sky is extremely far away compared to nearby buildings. Moreover, the translation of the camera between consecutive frames is very small relative to the farthest region in the image, i.e. the sky. From these observations, we need to put extra effort into analyzing the street-view images and estimating the structure of the street scene. Additionally, some part of the vehicle, for example the roof, is visible in all frames. In order to improve the robustness of the feature tracking and the 3D reconstruction algorithms, we can rule out this stationary region by cropping the lower part of the image, or by using a mask image.

1.2. Related Work

Many researchers have investigated the 3D reconstruction problem with different approaches. Reconstruction algorithms have been proposed and improved according to the requirements of different tasks. For example, real-time applications have the strict constraint of completing the computation within a given time budget. On the other hand, some algorithms can spend more time on modeling the 3D structure from a collection of images without the strict real-time constraint, aiming for more accurate and robust reconstructions. Based on their requirements and the capability of their devices, some systems include more sophisticated and expensive sensors beyond a camera, for example a global positioning system (GPS), an inertial measurement unit (IMU), or a light detection and ranging system (LIDAR).

The structure from motion (SfM) problem is to estimate the 3D structure of the scene and the motion of the camera as the camera moves in the scene. The observations in the 2D image planes are feature correspondences with measurement noise, e.g. Gaussian noise. At every time step as the estimation proceeds, the SfM filter as defined in [1] predicts and updates its state variables, which represent the structure of the scene and the motion of the camera. When realizing the SfM filter, attention must be paid to features newly introduced to the filter without known depths. In [1] the SfM filter manages such features as a group when they are introduced. The filter then sets the group's reference frame separately from the main reference frame, and brings the features into account only after they have converged to an accurate depth estimate within a covariance threshold.

In [3], a single monocular camera is used to capture images so that it can be worn by a person moving around the scene. Targeting real-time performance while tracking the camera motion and building a map of the scene, they use a small number of distinct image patches as features and develop an extended Kalman filter to maintain and update the estimates of the feature locations and the camera motion model. Since the reconstruction is kept in the state of a stochastic filter, the motion of the camera needs to be well modeled by the motion dynamics. The system also needs to detect unstable situations in order to prevent or recover from tracking failures.

On the other hand, there is another approach that does not use the Kalman filter but operates in similar scenarios of a person interacting with small environments. In [6], corner features are detected and matched in selected keyframe images from the real-time video sequence. The 3D locations of the features are estimated using local and global bundle adjustment in a multi-threaded system, while the camera motion is estimated in another thread running in real time. Since this approach does not require the camera motion to follow a dynamics model as in the filter approach [3], the user can move more freely while wearing the camera, as long as the system can relocalize the camera as described in [17] after losing track and looking again at a scene that can be matched to the stored keyframes.

Without the goal of running in real time, other approaches aim to reconstruct larger areas of the scene. The bundle adjustment algorithm refines the camera parameters and the feature parameters together using a non-linear optimization method [16]. Bundle adjustment is usually performed at the final refinement step of the 3D reconstruction as a global optimization problem, or it can run incrementally, adding observations from new images over time as in [6][15]. However, as the number of images increases, each bundle adjustment takes longer because of the increased problem size. In order to produce an initial estimate for the global refinement or for the SfM filter, we can run the bundle adjustment locally within a fixed time window, e.g. using the most recent Nlocal frames so that it can be performed within the time constraint. When running the local bundle adjustment, the first Nfixed frames are held fixed so that they act as a reference frame while the local optimization window slides over time, as sketched below.
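To make the windowing concrete, here is a minimal Python sketch of the frame bookkeeping; the function name local_ba_window and its default values are our own illustrative choices, standing in for the paper's Nlocal and Nfixed:

```python
def local_ba_window(poses, n_local=50, n_fixed=5):
    """Select frames for one local bundle adjustment step.

    `poses` is the list of camera poses estimated so far. The most
    recent `n_local` frames form the optimization window; the first
    `n_fixed` of them are held constant as the reference frame, and
    only the remaining ones are optimized.
    """
    window = poses[-n_local:]
    return window[:n_fixed], window[n_fixed:]  # (fixed, optimized)

# e.g.: fixed, free = local_ba_window(all_poses)
```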
In [15] the images are collected from the internet, so that the scene location is seen from different viewing angles and with different cameras. The images are matched with each other and sorted so that they produce good tracks of feature correspondences. The features from each image are then processed in order, and the 3D structure of the scene and the camera locations are estimated using incremental bundle adjustment. Since this approach processes the collection of images as a batch, the system takes more time than the real-time applications described earlier. The resulting 3D reconstruction lets the user browse through the virtual model of the scene and the photos registered to the locations where they were taken. Moreover, the large-scale 3D reconstruction can be refined using the global bundle adjustment as in [10], where the 3D map is divided into smaller submaps and their connections are used for the global optimization.

In the scope of building 3D models of large areas such as the streets of urban cities, a few systems have been developed [2][12]. They collect the images of the locations using a survey vehicle that has a set of sensors including multiple cameras, an omnidirectional camera, and/or a GPS. In [2], two cameras are used in a stereo reconstruction pipeline running a local windowed bundle adjustment, and the city model is generated as a textured model of the building facades and the roads. Beyond using the image sequence alone, in [12] a GPS is used in order to estimate the camera motion and to prevent the estimation from drifting over time. The depth map estimated from a stereo algorithm is then refined and used to generate a 3D model of textured triangle meshes.

Since 3D reconstruction systems rely entirely or partially on visual information from cameras, enhancing the robustness and accuracy of the estimation is an important challenge. We therefore focus on improving the robustness of the vision-based reconstruction algorithms in the rest of the paper.

2. Method Description

In this section, we propose our method of improving the robustness of the 3D reconstruction. First we explain how we combine the procedures of the SfM filter and the bundle adjustment. Then the method of estimating the sky motion in cylindrical projection is introduced, followed by the method of estimating the rotation of the camera using the motion of the sky. Finally, we describe how to select good-quality features for robust feature tracking and for maintaining a stable 3D reconstruction.

2.1. Combining SfM Filter and Bundle Adjustment

Figure 2: Flowchart of the combined procedures of the SfM filter and the bundle adjustment.

Obviously the 3D reconstruction using the SfM filter can be improved by estimating better initial depths of new features. Especially for street-view images, the distribution of feature depths in the field of view is in general not uniform, caused for example by the commonly visible straight-ahead roads and the side walls seen at deep perspective viewing angles. Therefore, we introduce a method of combining the SfM filter and the local bundle adjustment, taking the initial depth estimates of new features from the bundle adjustment and providing this information to the filter.

The procedure of running the SfM filter together with the bundle adjustment is shown in Figure 2. For each image captured from the video sequence, we first detect corner features [14] and track them using optical flow [7], or match the features in case features are revisited. At this step, the prediction from the SfM filter can be used to restrict the camera motion at the feature tracking stage. The feature correspondences are then passed to the local bundle adjustment step. The local bundle adjustment uses a certain number of previous frames as a fixed local reference frame, and obtains initial estimates of the camera parameters and the depths of the new features using linear methods as in [5]. The features are then passed to the SfM filter as observations, now with the initial depth estimates from the local bundle adjustment. The SfM filter updates its states exactly as it does when run separately. Optionally, the incremental or global bundle adjustment can be performed in order to minimize the reprojection error across the whole image sequence so far. As this procedure repeats for each image, the SfM filter takes the initial depth estimates from the local bundle adjustment and provides predictions of the camera motion to the feature tracking stage.
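The front end of this loop maps naturally onto standard OpenCV calls; the minimal sketch below illustrates it under that assumption. The local_ba and sfm_filter objects in the trailing comments are hypothetical stand-ins for the two estimators, which the paper does not specify at the API level:

```python
import cv2
import numpy as np

def track_features(prev_gray, gray, prev_pts=None):
    """Front end of the per-frame loop in Figure 2: detect Shi-Tomasi
    corners [14] and track them with pyramidal Lucas-Kanade flow [7].
    Both inputs are 8-bit grayscale frames."""
    if prev_pts is None or len(prev_pts) < 100:
        # Replenish corner features when too few tracks survive.
        prev_pts = cv2.goodFeaturesToTrack(
            prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=7)
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
    ok = status.ravel() == 1
    return prev_pts[ok], nxt[ok]  # matched (previous, current) points

# One iteration of the combined procedure; `local_ba` and `sfm_filter`
# are hypothetical stand-ins for the two estimators:
#   prev_pts, cur_pts = track_features(prev_gray, gray, prev_pts)
#   depths = local_ba.update(prev_pts, cur_pts)        # initial feature depths
#   sfm_filter.update(cur_pts, initial_depths=depths)  # measurement update
#   predicted_pose = sfm_filter.predict()              # restricts next tracking
```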
2.2. Sky Motion Estimation

When we define the sky as the region that is very far from the camera viewpoint, the motion of the sky region is characterized differently from that of closer objects in the scene. For example, the direction from a viewpoint on the ground to a point in the sky is fixed under the assumption that the sky is infinitely far away. Thus we model the points in the sky region as being infinitely far away with respect to the translation of the camera, and we further approximate the sky motion as a rigid translational model under the cylindrical projection. As in Figure 3a-b, the sky stays still when the camera translates without rotation. On the other hand, as in Figure 3c-d, the sky translates horizontally when the camera rotates about the vertical axis.

Figure 3: Sky motion in cylindrical projection panoramas. Camera moving forward (a),(b): when the camera translates without rotation, the sky does not move in the panoramas. Camera turning to the right (c),(d): when the camera rotates on the ground about the vertical axis, the sky translates horizontally.

First, cylindrical projection panoramas are generated from the quadruple perspective projection images. Figure 4 illustrates the cylindrical projection from a rectangular projection image, in which the resulting cylindrical panorama has a horizontal axis corresponding to the yaw angle and a vertical axis corresponding to the pitch angle.

Figure 4: The relationship between the cylindrical projection coordinates (x, y) and the perspective projection coordinates (x0, y0): (1) the top-down view shows the relationship of the x coordinates; the red arc corresponds to the cylindrical projection image plane, and the blue side of the rectangle is the perspective projection image plane. (2) The side view shows the relationship of the y coordinates similarly.

The cylindrical projection from the perspective projection image is performed as below:

    x0 = (w/2) (1 + tan(x − π/4))
    y0 = (y − h/2) / cos(x − π/4) + h/2
    Icyl(x, y) = Irect(x0, y0),                                (1)

where w and h are the width and the height of the perspective projection image. This projection is performed for each of the quadruple images, and the results are concatenated to form a cylindrical projection panorama; a code sketch of this warp follows below.
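A minimal sketch of this warp using OpenCV's remap is given below, assuming x in Eq. (1) is the yaw angle parameterized over each face's quarter-turn range [0, π/2) so that the face center sits at π/4; the function name and defaults are illustrative:

```python
import cv2
import numpy as np

def cylindrical_warp(face, out_width=None):
    """Warp one perspective face (90-degree horizontal FOV) into a
    quarter cylindrical panorama, following Eq. (1)."""
    h, w = face.shape[:2]
    out_w = out_width or w
    # Yaw angle for each panorama column, spanning [0, pi/2) per face,
    # measured relative to the face center at pi/4.
    theta = np.linspace(0.0, np.pi / 2, out_w, endpoint=False) - np.pi / 4
    # Eq. (1): map each panorama coordinate (x, y) back into the face.
    x0 = (w / 2.0) * (1.0 + np.tan(theta))                 # shape (out_w,)
    rows = np.arange(h, dtype=np.float64)                  # shape (h,)
    y0 = (rows[:, None] - h / 2.0) / np.cos(theta)[None, :] + h / 2.0
    map_x = np.tile(x0.astype(np.float32), (h, 1))
    map_y = y0.astype(np.float32)
    return cv2.remap(face, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT)

# Hypothetical usage: concatenate the four warped faces side by side.
# faces = [left, front, right, rear]   # perspective images, same size
# panorama = np.hstack([cylindrical_warp(f) for f in faces])
```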
Figure 5: An example result of the sky segmentation: (a) dense optical flow computed from the cylindrical projection panorama, (b) the result of the sky segmentation shown highlighted.

In order to estimate the motion of the sky region between two consecutive frames, we compute the dense optical flow between the frames, for example using the Lucas-Kanade algorithm on image pyramids [7], as shown in Figure 5a. When we compute the optical flow from the cylindrical projection panoramas, we also need to crop the top and bottom of the image to avoid the aperture problem around the curved boundaries. Then we perform a robust sky segmentation using the RANSAC approach [4] to find the maximum number of pixels that satisfy a rigid translational model: (1) randomly sample a pixel at every iteration and collect as inliers the other pixels whose optical flows are similar within a threshold error, and (2) choose the largest set of inliers as the sky region pixels. The sky motion is then estimated as the mean optical flow of the inlier pixels. In Figure 5b, the segmented sky region is highlighted in the cropped cylindrical projection panorama.
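A minimal sketch of this segmentation step is shown below, assuming flow is an (H, W, 2) array of per-pixel displacements from a dense flow routine; the iteration count and threshold are illustrative:

```python
import numpy as np

def segment_sky(flow, n_iters=200, thresh=1.0, rng=None):
    """RANSAC-style sky segmentation on a dense optical flow field,
    following the translational model of Section 2.2. Returns a
    boolean mask of the estimated sky region and the mean sky motion."""
    rng = rng or np.random.default_rng()
    vecs = flow.reshape(-1, 2)
    best_mask, best_count = None, -1
    for _ in range(n_iters):
        # (1) Sample one pixel's flow as a translation hypothesis and
        # collect pixels whose flow agrees within the threshold.
        candidate = vecs[rng.integers(len(vecs))]
        err = np.linalg.norm(vecs - candidate, axis=1)
        mask = err < thresh
        count = mask.sum()
        # (2) Keep the largest consensus set as the sky region.
        if count > best_count:
            best_mask, best_count = mask, count
    sky_motion = vecs[best_mask].mean(axis=0)  # mean flow of inliers
    return best_mask.reshape(flow.shape[:2]), sky_motion
```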
2.3. Rotation Estimation from Sky Motion

After segmenting the sky region based on the translational model in the cylindrical projection panoramas as in the previous section, the rotation parameter can be estimated from the motion of the sky. In particular, as we model the motion of the camera to be on the ground of the street, the sky motion in the cylindrical panoramas becomes a horizontal translation. Thus the rotation parameter about the vertical axis can be estimated from the horizontal translation of the sky. Given the cylindrical panorama with width W, the rotation about the y-axis is computed as below:

    θy = 2π dx / W,                                            (2)

where dx ∈ [−W/2, W/2] is the horizontal translation of the sky within the cylindrical panorama, as computed from the optical flow in the previous section.

The estimated rotation parameter θy can then be used in the SfM filter's prediction of the camera motion, specifically for the rotation parameter about the y-axis. The other two components of the rotation are predicted by the standard prediction of the SfM filter, and the further steps of predicting the camera motion proceed as the SfM filter works naturally.
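Eq. (2) amounts to a one-line conversion; the sketch below also indicates, in the comments, how the horizontal component of the sky motion from the segmentation sketch above would feed it (both names are ours):

```python
import numpy as np

def rotation_from_sky(dx, panorama_width):
    """Eq. (2): yaw rotation between two frames, in radians, from the
    horizontal sky translation dx within a panorama of width W."""
    return 2.0 * np.pi * dx / panorama_width

# e.g., with `sky_motion` from segment_sky() and a panorama image:
# theta_y = rotation_from_sky(sky_motion[0], panorama.shape[1])
```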
2.4. Quality Feature Selection

As shown in Figure 2, the feature correspondences resulting from the feature tracking procedure are passed to the further steps of the 3D reconstruction, both to the local bundle adjustment and to the SfM filter. In order to increase the robustness of the estimation algorithms in each step, effort must be put into every possible stage. For example, the features tracked between two frames must satisfy constraints based on assumptions about the scene or about the camera motion. When we assume that the scene is not moving, features on moving objects such as pedestrians or cars must be rejected as outliers. This outlier rejection is commonly performed by applying the epipolar geometry constraint between the two frames, or among the three most recent frames [8][5]. In addition to the epipolar geometry constraint, the predicted camera motion can be used to set the expected locations of the features in the image plane.

Especially with the forward motion of the camera, which is very common in image sequences taken from survey vehicles, the depths of the features in the image have a wide range of distribution. In this case, limiting the maximum depth of the features is a reasonable constraint in order to increase the numerical stability of the linear method solvers, for example the singular value decomposition. In [8][5], the importance of normalization is explained in the same sense. The restricted depth range also allows another outlier rejection mechanism, in which only features with depths less than a threshold are considered inliers. After the feature tracking step, this constraint can still be applied at the local bundle adjustment stage as well. Features with too large or negative depths in the reference image frame are rejected as outliers throughout the whole procedure.
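A small sketch of this depth gating, assuming triangulated feature positions expressed in the reference camera frame (the function name and the bound are illustrative):

```python
import numpy as np

def depth_inlier_mask(points_cam, max_depth=100.0):
    """Keep features whose depth in the reference camera frame is
    positive and below a maximum threshold (Section 2.4).

    `points_cam` is an (N, 3) array of triangulated feature positions;
    `max_depth` is a hypothetical bound in the scene's scale units.
    """
    z = points_cam[:, 2]
    return (z > 0.0) & (z < max_depth)
```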
3. Experimental Results

We implemented the methods described in the previous section and experimented with the Google Street-View dataset images. In order to perform the 3D reconstruction algorithms using the SfM filter and the bundle adjustment, the images from the front camera are used. The calibration matrix of the projective camera is known from the dataset, and a manually labeled stationary mask image was used in order to rule out the roof of the survey vehicle from the reconstruction process. The performance of the 3D reconstruction was compared between running the SfM filter and the local bundle adjustment separately or combined. In addition, the rotation parameter was estimated from the sky motion estimation in cylindrical panoramas, and it was tested in the SfM filter's prediction step to check whether it helps make the 3D reconstruction more robust.

The image sequence from the Google dataset was taken to have 350 frames, which involves a few rotations and makes a loop around a block of buildings. In Figure 7i, the camera trajectory from the GPS is overlayed on the satellite view of the street location.

3.1. Rotation Estimation from Sky

First we experimented with the rotation parameter estimation from the sky motion estimation. In Figure 6a, the rotation about the vertical axis θy was estimated between consecutive frames using the sky motion, and plotted together with the ground-truth rotations. As seen in the figure, the estimation from the sky is accurate enough to be used as an initial rotation estimate for the further 3D reconstruction steps. In Figure 6b, the histogram of the estimation error is displayed, whose mean is near zero with a standard deviation of 0.0088 radian.

Figure 6: The rotation between two frames (in radians) estimated over 350 frames using the sky motion estimation: (a) the thin red line is the ground-truth rotation about the y-axis, and the thick blue line is the rotation estimated from the sky motion. (b) The histogram of the estimation error θy − θtrue shows that the distribution is zero-mean with standard deviation 0.0088 radian.

As shown in the result, the sky motion gives a good initial estimate of the camera's rotation parameter. Especially when there are not enough features in the street structure, the sky motion plays a bigger role, estimating the rotation better than relying on a few tracked features. The estimated rotation is then tested for improving the robustness of the 3D reconstruction in the following experiments.

3.2. Comparisons of Trajectory Estimation

We tested the SfM filter and the bundle adjustment algorithms on the same image sequence by combining them in several different ways. In Figure 7, the results from these tests are displayed. In each plot, the thin red line is the ground-truth trajectory from the GPS information, and the thick blue line is the estimated trajectory from each method. The estimated trajectories were aligned to the GPS trajectory so that the differences between the methods can be compared correctly. In Figure 7i, the ground-truth GPS trajectory is overlayed on top of the satellite image of the location. For these tests, the features were tracked only frame by frame and no relocalization method was used, so the results show how well each algorithm performs in terms of accuracy and robustness without drifting away from the true trajectory.

To begin with, the SfM filter was tested alone, as shown in Figure 7a, without the initial depth estimation of new features or the rotation estimation from the sky motion. Because of the non-uniform distribution of the depths of the features in the scene, right after the vehicle starts to turn around the first corner the SfM filter begins to diverge from the true path.

Next, the bundle adjustment was tested by running the local bundle adjustment (LBA) alone (see Figure 7b), or with the incremental bundle adjustment (IncrBA) (see Figure 7c). The LBA used 50 frames for the local windowed optimization, and the IncrBA was performed periodically, every 50 frames. The results from the LBA and IncrBA show that the overall estimation is more accurate than running the SfM filter alone, but the rotation and scale estimates drift from the true values as the errors accumulate.

Then the SfM filter was tested with the rotation estimation from the sky motion estimation, shown in Figure 7d. Although the result shows that the filter adapted the rotation estimation well, the scale drifts over the frames. Next, the SfM filter was combined with the LBA (see Figure 7e) and the IncrBA (see Figure 7f). In these tests the initial depths of features are estimated from the bundle adjustment algorithms and passed to the SfM filter. Although the camera motion was estimated more accurately, in both examples the rotation estimation was not robust enough, so the trajectory is rotated away from the ground truth.

Finally, the rotation parameter was estimated initially from the sky motion estimation and used in the prediction step of the SfM filter. The estimated trajectory using the SfM filter with the LBA together with the sky motion estimation is shown in Figure 7g, and the result of running all the methods together is shown in Figure 7h. The results show that they outperform the other methods in terms of the robustness of the reconstruction, and that the trajectories are more accurate over the sequence. Among all experiments, the combination of the SfM filter and the bundle adjustments with sky motion estimation produces the most robust reconstruction from the street-view images, as in Figure 7h.
Figure 7: The results of running different algorithms for the 3D reconstruction: (a) only the SfM filter, (b) only the local bundle adjustment (LBA), (c) LBA with the incremental bundle adjustment (IncrBA), (d) the sky motion estimation with the SfM filter, (e) the SfM filter combined with LBA, (f) the SfM filter combined with LBA and IncrBA, (g) the sky motion estimation with the SfM filter, combined with LBA, (h) all methods together (the SfM filter with sky motion estimation, LBA and IncrBA), and (i) the ground-truth trajectory taken from the GPS information.

3.3. Reconstructed 3D Structure

In addition to the camera trajectory estimation, the 3D structure of the scenes from the street-view images is reconstructed as a 3D point cloud. In Figure 8, the top view and the side view of the reconstructed point cloud are displayed together with the camera trajectory. Figure 8a shows the reconstructed points along the camera trajectory, and Figure 8b shows them over the satellite view image of the street location in order to assist visual understanding. As shown in the overlayed image, features are frequently and accurately reconstructed on the walls of buildings. Moreover, the point cloud of the reconstructed street scene clearly shows the ground plane of the roads and the vertical walls of the buildings in the side view in Figure 8c. Some features with noisy measurements still have inaccurate estimates of their 3D locations.

Although no camera relocalization algorithm was used in these experiments, the reconstructed camera trajectory and the 3D point cloud show that the combination of the 3D reconstruction algorithms described in the previous section produces an estimate accurate enough for the trajectory to clearly close a loop. When a relocalization algorithm as in [17] or a location recognition algorithm as in [13][11] is further applied, the revisited features can be used again, so that the error accumulated over the past frames can be propagated appropriately throughout the whole image sequence, recovering from the drift.

In case a more sophisticated 3D model is required, a triangulated mesh structure and a textured model can be created from the reconstructed 3D point cloud and the images. Alternatively, instead of using point features, superpixels can be used to reconstruct planar structures as introduced in [9]. Other features such as edges and planes can be used as well to improve the quality of the reconstruction result.
Figure 8: The result of the 3D reconstruction from the street-view image sequence: (a) the top view of the 3D point cloud with the camera trajectory, (b) overlayed on the satellite image of the location (aligned manually), and (c) the side view.

4. Conclusion

We introduced a robust 3D reconstruction system that combines the structure-from-motion filter with the local bundle adjustment, which initializes the depths of new features. Experimental results show that the proposed method enhances the robustness of the 3D reconstruction from the street-view image sequence. In addition, the rotation parameter was estimated from the sky motion in cylindrical projection panoramas to estimate the camera trajectory and the 3D structure more robustly.

For future work, we aim to reconstruct larger areas of the streets in city blocks with an efficient relocalization algorithm in order to recover from accumulated errors over time. More investigation into fixing the scale drifting problem is also necessary.

Acknowledgment

We would like to thank Google Research for providing the Street-View dataset for this research.

References

[1] A. Chiuso, P. Favaro, H. Jin, and S. Soatto. Structure from motion causally integrated over time. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(4):523-535, 2002.
[2] N. Cornelis, K. Cornelis, and L. V. Gool. Fast compact city modeling for navigation pre-visualization. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition, volume 2, pages 1339-1344, 2006.
[3] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Trans. on Pattern Analysis and Machine Intelligence, 29(6):1052-1067, 2007.
[4] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381-395, 1981.
[5] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[6] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Proc. IEEE/ACM Intl. Symposium on Mixed and Augmented Reality, pages 225-234, 2007.
[7] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. Intl. Joint Conference on Artificial Intelligence, pages 674-679, 1981.
[8] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry. An Invitation to 3-D Vision: From Images to Geometric Models. Springer Verlag, 2003.
[9] B. Micusik and J. Kosecka. Multi-view superpixel stereo in man-made environments. Technical report, George Mason University, 2008.
[10] K. Ni, D. Steedly, and F. Dellaert. Out-of-core bundle adjustment for large-scale 3D reconstruction. In Proc. IEEE Intl. Conference on Computer Vision, pages 1-8, 2007.
[11] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition, volume 2, pages 2161-2168, 2006.
[12] M. Pollefeys, D. Nistér, J. M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S. J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, and H. Towles. Detailed real-time urban 3D reconstruction from video. Intl. Journal of Computer Vision, 78(2-3):143-167, 2008.
[13] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition, pages 1-7, 2007.
[14] J. Shi and C. Tomasi. Good features to track. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition, pages 593-600, 1994.
[15] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. In SIGGRAPH Conference Proceedings, pages 835-846. ACM Press, 2006.
[16] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment - a modern synthesis. In Vision Algorithms: Theory and Practice, LNCS, pages 298-375. Springer Verlag, 2000.
[17] B. Williams, G. Klein, and I. Reid. Real-time SLAM relocalisation. In Proc. IEEE Intl. Conference on Computer Vision, pages 1-8, 2007.