ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv

Page created by June Phillips
 
CONTINUE READING
ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv
ROW-SLAM: Under-Canopy Cornfield Semantic SLAM
                                                                      Jiacheng Yuan1 , Jungseok Hong2 , Junaed Sattar2 , and Volkan Isler2

                                            Abstract— We study a semantic SLAM problem faced by a
                                         robot tasked with autonomous weeding under the corn canopy.
                                         The goal is to detect corn stalks and localize them in a global
                                         coordinate frame. This is a challenging setup for existing
                                         algorithms because there is very little space between the camera
                                         and the plants, and the camera motion is primarily restricted
arXiv:2109.07134v1 [cs.RO] 15 Sep 2021

                                         to be along the row. To overcome these challenges, we present
                                         a multi-camera system where a side camera (facing the plants)
                                         is used for detection whereas front and back cameras are used
                                         for motion estimation. Next, we show how semantic features in
                                         the environment (corn stalks, ground, and crop planes) can be
                                         used to develop a robust semantic SLAM solution and present
                                         results from field trials performed throughout the growing
                                         season across various cornfields.

                                                                 I. I NTRODUCTION                                    Fig. 1: Our robot in a mid-season corn field. Dense canopy and
                                                                                                                     narrow row spacing makes it challenging for semantic SLAM.
                                            Cornfield weed control conventionally has relied heavily
                                         on herbicides which are undesirable due to environmental                       In this paper, we introduce a multi-view vision system
                                         and health-related concerns and can not be used in organic                  for the detection and 3D localization of corn stalks in mid-
                                         fields. The alternative, manual weeding, is labor-intensive                 season corn rows, called ROW SLAM. We present this
                                         and costly. Therefore, there has been significant interest in               system (Fig. 1, 3) as a prototype of the vision module for an
                                         robotic weeding [1]–[4]. However, most existing techniques                  autonomous weeding robot currently under development. In
                                         focus on early season weeding. During this time, since                      our approach, we model the corn stalks in the same row as
                                         the canopy is not closed, usually a reliable GPS signal is                  a plane that is perpendicular to the ground. A ground-view
                                         available. Further, the robots can obtain top-down views,                   camera (back camera) performs SLAM using ground plane
                                         which are convenient for detection and localization. In this                features to obtain 3D odometry. We implement a Structure-
                                         work, we focus on mid-season weeding where the robot must                   from-Motion (SfM) strategy that accommodates the multi-
                                         operate under the canopy.                                                   view inputs to estimate the 3D pose of the corn stalk plane.
                                            This setup introduces unique challenges for autonomous                   Combining this motion estimation module with the object
                                         weeding: (1) top-down views are no longer available, and                    detection and tracking modules, we build a map that has both
                                         therefore a “planar world” assumption does not hold (2) mid-                metric (location and orientation) and semantic information
                                         season corn and weed canopies are usually in close distance                 of the corn stalks in it. We present field results which
                                         and overlap with each other (3) there is frequent occlusion                 demonstrate the accuracy and robustness of our approach
                                         from corn leaves and weed even if we lower the height of the                toward weeding mid-season cornfields.
                                         cameras closer to the ground in between two rows. (Fig. 1)                     Our contributions can be summarized as follows:
                                            We observe that established SLAM algorithms [5], [6]                        • We present ROW SLAM: a multi-camera vision system

                                         struggle in the setup because: (1) dynamic features such as                      with non-overlapping views for semantic SLAM in
                                         shaking leaves and weeds due to wind or robot motion cause                       mid-season corn rows. Specifically, we use the system
                                         the SLAM module to fail (2) the corn row lacks regular                           to build maps with both 3D location and semantic
                                         structures such as cylinders or walls, making it challenging                     information of corn stalks.
                                         to observe the corn plane from the map directly.                               • We present a multi-view Structure-from-Motion (SfM)

                                            In Fig. 2, we present image samples from conventional and                     strategy for robust corn stalk plane estimation.
                                                                                                                        • We test our system in 8 different corn rows with heavy
                                         organic fields with heavy weed infestation as a comparison.
                                         We highlight the stronger interference from the weed stems                       weed infestation across the growing season. Our method
                                         and lower corn stalk visibility due to the existence of the                      outperforms all baseline approaches.
                                         weeds.
                                                                                                                                         II. R ELATED W ORK
                                           1 is with Department of Electrical and Computer Engineering, University      The topic of robotic weed control [7] has been extensively
                                         of Minnesota, Minneapolis, MN, 55455, USA yuanx320@umn.edu                  studied over the last decade from various viewpoints: weed
                                            2 are with the Department of Computer Science and Engineering,
                                         University of Minnesota, Minneapolis, MN, 55455, USA {jungseok,             detection, field coverage, and vehicle development. Most
                                         junaed, isler}@umn.edu                                                      recent weed control robots have been developed in the
ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv
Sec. V-D we compare the performance using different views
                                                                    to support this choice.
                                                                        Another aspect of obtaining semantic features is object
                                                                    detection. Some recent works [6], [19], [20] build offline
                                                                    approximation models for the objects of interest and detect
                                                                    them by model matching while building the map. However,
 (a) Front view (conventional)     (b) Side view (conventional)     such a method is unsuitable for corn stalk detection due to
                                                                    the lack of structured shapes in the non-uniform canopy. The
                                                                    dense occlusion from the leaves and the weeds makes it even
                                                                    more challenging to detect corn stalks by matching offline
                                                                    models.
                                                                        Recent advances in deep learning allow accurate real-time
                                                                    object detection [21] and detect a wide range of objects from
   (c) Front view (organic)          (d) Side view (organic)        a single model with a large amount of data. Such advances
Fig. 2: Sampled images from both the conventional field and the     have enabled adopting deep learning-based object detection
organic field. The conventional field has almost no weeds due to    in agricultural applications [22], and many studies [23]
herbicides, while the organic field shows heavy weed infestation.   apply the object detection to detect weed. However, these
                                                                    approaches are limited because they (1) require a large
form of drones and ground vehicles. While drone-based               amount of weed data to train the models which is challenging
approaches [8] can cover large areas, they only provide             to obtain, (2) can only be used for the field that existing weed
weed information to manage weeds via ground vehicles                information is previously known, and (3) may need a survey
(e.g., applying herbicide) [9]. Recently, Wu et al. [10]            of a target field to collect field-specific weed information
proposed a method using ground vehicles with downward-              and its visual data before the deployment in a new field. In
looking cameras to recognize weeds and remove them by               contrast, we propose to focus on detecting corns and consider
end effectors. However, such an approach could fail as the          the remaining as weeds. Our approach can be applied to a
crop canopies grow and block the top view.                          broader range of fields and growing stages since corns have
   Sivakumar et al. [11] use a front view camera to predict         fewer variants compared to weeds.
a vanishing point and corn lines for navigation under-canopy            To associate object detection across frames, object tracking
tasks. However, using the front view loses a lot of the details     algorithms [24] with visual sensors have been proposed
of the objects in the narrow corn row due to the viewing            to track a wide range of objects (e.g., pedestrians [25],
angle. In Sec. V-C, we compare against their method for corn        vehicles [26], sports players [27], crops [28], [29]). The
plane estimation. Zhang et al. [12] use side view for corn          tracking algorithms generally use an estimation model and
stalk counting, but we show that side view also experiences         data association method such as optical flow [30], Kalman
frequent occlusion. To get accurate semantic features (i.e.,        filter, and Hungarian algorithm to associate detection results
corn stalks, ground plane, corn plane) from the map, we             from the previous frame to the next frame. In our pipeline, we
propose to combine the input from multiple views.                   use Simple online and realtime tracking (SORT) [31] due to
   SfM [13] is one of the classical methods available to obtain     its speed and accuracy, as demonstrated on the MOT15 [32]
the 3D location of the features. In our application, since          dataset.
there is very little space and frequent occlusion between the           In this work, we show that robotic weed control in mid-
camera and the plants, the camera motion is also restricted.        season corn row poses unique challenges to existing SLAM
Classical SfM strategies perform poorly for estimating the          algorithms. With our system design and combining multiple
motion when observing only the plants. Hasheminasab et al.          existing algorithms, the results demonstrate that we can
[14] proposes to assist the SfM pipeline with GPS signals.          overcome the challenges.
Since GPS is unreliable under the canopy, we use SLAM
odometry from the back camera and the front view corridor                          III. P ROBLEM F ORMULATION
estimation to assist the SfM process using the side view.              We are given a cornfield planted in a row, and have a
   For localization in outdoor environments, SLAM with              vehicle that can traverse the field. A camera frame is rigidly
RGB input [15]–[17] has been widely used. However, the              attached to the vehicle and moves under the corn canopy
localization accuracy of SLAM is sensitive to dynamic               with three cameras (one in the front-facing along the row,
features. Recent works [5], [18] adopt neural networks to           one on the side facing the corn plane, and one on the back
detect moving objects and remove their features. The re-            facing the ground as shown in Fig. 3).
maining static features are then used to build the map. In our         Our goal is to detect the corn stalks and localize them
application scenario, most of the features above ground are         in the global coordinate frame. Therefore, we formulate this
susceptible to unconstrained motion due to the interference         problem as a semantic SLAM problem where we aim to build
from wind or robot motion, making it challenging to detect          a map with semantic features (corn stalks, corn plane, and the
dynamic features. So we select the ground view as input for         ground plane). We choose corn stalks as the target semantic
the SLAM module to ensure the static map assumption. In             feature instead of weeds because, unlike the corn stalks,
ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv
Side Camera                                ROW SLAM

                                              Tracking Module                                                           Corn IDs
                                                                               Corn IDs
                         Corn Detection                                                                           3D Pos. of Corn stalks
                                           Bbox Feature Extraction          2D Pos. of Corns
                                                & Matching
    Back Camera
                                                                                                             ca       cb      cc       cd    ce

                                                                     Multiview
                            SLAM               T                                        dp
                                                                       SfM

    Front Camera

                                              Plane fitting     ng
                           Downsampled                                                Corn
                              PCD                                          np
                                                   PCA          vl                    plane

                                Back Camera                                 Front Camera                                           Corn Plane

                                                    Side Camera                                                                    Ground Plane

Fig. 3: (Top) Proposed pipeline of ROW SLAM: Our pipeline takes images from {front, back, side} cameras and yields 1)
IDs for each corn (denoted with C{a,..,e} in the scene, and 2) 3D positions of the corn stalks (denoted by yellow cylinders).
(Bottom-Left) Proposed robot design with multi-view cameras. (Bottom-Right) A 3D reconstruction sample of the corn
field. Beige and green planes represent ground and corn planes, respectively. Each camera pose is displayed separately.

weeds in the field have various species and appearances.                                      TABLE I: Summary of Notations
They also tend to vary across fields and regions. As a result,             Notation     Description
corn detection can be more consistent and robust than weed                              Normal vector and distance to the origin for the corn
detection. Once corn stalks are identified, we can treat the               np , dp
                                                                                        plane (shown in Fig. 3)
remaining plants as weeds.                                                              Normal vector for the ground plane (shown in Fig. 3)
                                                                           ng
                                                                                        Intersection of the ground plane and the corn plane, also
                   IV. S YSTEM OVERVIEW                                    vl
                                                                                        the orientation of the corn row

  In this section, we introduce the necessary mathematical                 T
                                                                                        Relative transformation between sequential side camera
notations in Table I and describe our system’s hardware.                                poses
We discuss the details of our approach for 3D corn stalk                   Notes: bold uppercase letters for matrix, bold lowercase for column
                                                                                             vector, normal ones are scalar
detection and localization next.
                                                                        B. Method Overview
A. System Description
                                                                           As shown in Fig. 3, we use RGB images from the side
   Our hardware system consists of three cameras (Fig. 3                view camera for corn stalk detection. The detections only
Bottom-Left). The front and side cameras (Intel Realsense               localize the stalks in 2D. Thus, to find the 3D position of
D435) publish RGB images and depth images at 30 Hz. The                 the detected corn stalks, we also need to find the 3D pose of
back camera (Intel Realsense D435i) has a built-in IMU and              the corn plane. We address the corn plane estimation problem
is mounted at an angle facing the ground. The back camera               in two parts: (1) finding the plane normal direction, np , and
publishes RGB and depth images at 30 Hz, gyro at 400 Hz,                (2) finding the distance from the camera center to the plane,
and accelerometer at 250 Hz. We use ROS for robot control               dp , shown in Fig. 4.
and routing sensor data. For our experiments, we mount the                 We use the front and back camera to assist with the
system on a small 4-wheel rover from Rover Robotics.                    estimation of the corn plane np and dp . With the RGB-D
ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv
input from the front view camera, we estimate the ground
plane normal direction ng and the direction of the corn line
vl . The corn plane orientation is found by the cross product
of these two vectors.
                        np = ng × vl                        (1)
   Then, we use the corn plane normal np , combined with
side view object detection results and SLAM odometry T
as the input for the Multiview SfM module to estimate dp .
Before going through the details of this module, we first
                                                                  Fig. 4: Illustration for the multi-view SfM strategy. The relative
introduce the detection and tracking.                             camera pose (mint) is given by the SLAM module, and feature
C. Corn Stalk Detection                                           matching is masked by detected bounding boxes. np is the plane
                                                                  normal of the corn plane, and dp is the distance between the corn
   We implement our corn stalk detection model using Faster       plane and the left camera center.
R-CNN [33] with MobileNet V3 [34] as the backbone
and Fully Connected Network (FCN) as a head. We select            with the largest and smallest eigenvalues from PCA indicate
MobileNet to provide fast inference, thus preventing the deep     the directions of the corn line vl and the ground plane normal
detection model from being a bottleneck in our pipeline.          ng , respectively. The corn plane normal is then computed by
The backbone can be replaced with a larger network such           the cross product of these two vectors.
as Resnet-18, -34, -50 [35] to improve the accuracy of the           Plane Distance Across different frames, the motion of
model. We train our model with pre-trained weights using          the features on the corn stalks can be modeled by planar
the COCO dataset [36], and refine further using our corn          homography. However, instead of doing the full SfM purely
stalk dataset.                                                    on the side view input, we find it much more stable if we
D. Corn Stalk Tracking                                            use the ground features to estimate the camera trajectory.
                                                                  We thus use the images from the back camera, providing
   With the outputs from the detection model, we implement
                                                                  a predominantly ground-plane view to avoid the unstable
the corn stalk tracking pipeline using (1) optical flow with
                                                                  features from the corn leaves and weeds. With these images,
centroids [37], and (2) simple online and realtime tracking
                                                                  we use the off-the-shelf RTAB-Map SLAM [17] package
(SORT) [31].
                                                                  to perform visual SLAM and use the Sigma Point Kalman
   1) Optical Flow: We build the corn stalk tracking al-
                                                                  Filter [42] to fuse the SLAM odometry with IMU input. The
gorithm by applying optical flow algorithm and the corn
                                                                  resulting odometry runs at 200 Hz. After calibrating the back
detection results. The model detects bounding boxes for
                                                                  view and side view cameras, we can use the camera trajectory
each corn stalk every 200 frames, and the centroids of the
                                                                  as is. Combining this with the estimated corn plane normal,
bounding boxes are calculated. After we obtain the centroids
                                                                  we formulate plane distance estimation as the following least
from the output of the detection model at frame X, the
                                                                  square problem over re-projection error (shown in Fig. 3 as
iterative Lucas-Kanade method with pyramids [38] is used
                                                                  the Multiview SfM block):
to track each centroid using optical flow from frame {X +1}
to {X + 199}.                                                        Given two side view RGB frames I1 , I2 , the relative
   2) SORT: SORT performs four actions for each input             transformation T21 between their camera poses (where R
frame: detection, estimation, data association, and update        and c is the rotation and translation component) and the corn
tracking identities. The detections from Faster R-CNN are         plane normal np , we want to estimate the plane distance dp ,
propagated to the next frame using a linear constant velocity     as shown in Fig. 4.
model. The results are also utilized to update the target state      We first run Faster R-CNN on I1 , I2 to get bounding boxes
with the Kalman filter [39]. When a new detection is obtained     for detected corn stalks. Next, we compute SIFT features
from the next frame, the Hungarian algorithm [40] is used to      f1 , f2 for I1 , I2 only within the regions defined by the
associate detections from the previous frame to the current       bounding boxes, and then use cross-matching and ratio test to
frame using the intersection-over-union (IoU) metric. When        find good matches between f1 , f2 . For each matched pair, let
the IoU metric is below a predefined threshold, then new          (u1 , v1 ) and (u2 , v2 ) be the corresponding pixel coordinates,
identification of the detection is created.                       K and λ1 respectively be the camera intrinsic matrix and the
                                                                  depth for the first feature.
E. Corn Plane Estimation                                                                          T
                                                                     Letting x1 = u1 v1 1 , the 3D position of f1 can be
   Plane Normal To get the ground plane pose, we first use        expressed as: λ1 K−1 x1 .
the front view RGB-D input to compute the point cloud of             Since we assume the feature lies in the corn plane,
the scene, followed by color thresholding and RANSAC [41]         according to the plane equation
plane fitting. The points are usually denser when closer to                           λ1 nTp K−1 x1 + dp = 0                    (2)
the camera center, so we downsample the ground plane inlier
points before applying principal component analysis (PCA).          We can rewrite λ1 in terms of dp :
Since the corn row is narrow (≈ 70cm), the two eigenvectors                            λ1 = −dp /nTp K−1 x1                     (3)
ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv
and the other 4 are from organic fields. We collected data
                                                                     across different growing stages throughout the mid-season
                                                                     (between V5 stage and V12 stage, late June to early August
                                                                     in Minnesota, USA).

                                                                     B. Evaluation Metric
Fig. 5: Corridor projection of the corn row. Two corn lines (blue)      We perform offline evaluations using the dataset collected
are parallel and intersects at the vanishing point (red). Due to     in two different aspects, detection accuracy and localization.
occlusion, the corn lines are usually not clearly visible.
                                                                     For evaluating the corn detection module, we label 763
                                                                     images and separate them into 563 training images (235
   After applying the rotation R and translation c, the
                                                                     images from conventional field and 328 images from organic
projection of the 3D position of f1 in the second camera
                                                                     field) and 200 test images (100 images from each field).
frame can be expressed as:
                                                                     Training and test dataset images for each field are picked
                       λ1 KRK−1 x1 + Kc                       (4)    from different rows to reduce the correlation between the
   If we substitute Eq. 3 in Eq. 4 we can rewrite Eq. 4 as           datasets and provide an accurate measure of the general-
follows:                                                             ization capability of our network. For localization under
                              
             − dp l0 l1 l2 + s0 s1 s2
                                             
                                                       (5)           the canopy, it is difficult to get high-accuracy GPS signals
                                                                     reliably. So we manually measure the distance between
where,                                                               neighboring corn stalks and use the corn stalks as landmarks
                            T          −1
                                                  KRK−1 x1
           
            l0    l1   l2        =                            (6)    along a straight line to evaluate the localization accuracy. We
                                     nTp K−1 x1                      first line fit on the camera trajectory and project predicted
                           T
           s0    s1    s2        = Kc                         (7)    target positions to the fitted line to reduce their dimension
   The projected pixel position of f1 in the second camera           to one. To remove the duplicated predictions for a single
(u02 , v20 ) is:                                                     target, we compute its nearest neighbor from the world
                (                                                    measurements for each predicted corn stalk position. The
                 u02 = (−dp l0 + s0 )/(−dp l2 + s2 )                 distance error 1 is computed as the mean absolute error of
                                                       (8)
                 v20 = (−dp l1 + s1 )/(−dp l2 + s2 )                 the neighboring corn stalks distance between prediction and
   We can linearize the re-projection error (u2 − u02 , v2 − v20 )   real-world measurements. Finally, we measure the tracking
with respect to dp so it can be estimated by least squares and       variance through re-projection error 2 . After accumulating
RANSAC. However, notice that the least square robustness is          the 3D position measurements for each corn stalk, we com-
sensitive to the second term Kc in Eq. 4. So when applying           pute the mean position of the centroids and re-project them
this algorithm, we need to make sure the translation scale is        back to each frame. The re-projection error is computed by
above a threshold to ensure robust triangulation.                    the mean pixel error between the centroids and the centroids
                                                                     of the bounding boxes proposed by the tracking module
F. 3D Localization                                                   across all the frames.
   The 3D position of a corn stalk is obtained by the
                                                                     C. Baselines
projection of the bounding box centroid onto the corn stalk
plane. To show that it is more robust than directly using               SLAM Input As the SLAM pipeline is sensitive to
RGB-D input, we compare against RANSAC Plane Fitting                 moving objects in the scene, we compare the robustness
in Sec. V-D. We also use the tracking module to link the             and accuracy of SLAM while using different input views.
3D corn stalk positions with corn IDs. Such association              Specifically, we replace the SLAM input in Fig. 3 with
across multiple frames allows us to reject outliers and false        the front camera view or side camera view and re-run the
positives, making the localization more robust.                      Semantic SLAM pipeline.
                                                                        Corridor Prediction The corridor prediction method re-
                V. E XPERIMENTS AND R ESULTS                         places the corn plane prediction module in Fig. 4. It is also
   We present our data collection procedures, followed by            based on the planar approximation of the corn row. Since the
the evaluation metrics for our method. Then we introduce             corn stalk planes on both sides are parallel, their intersection
the baseline methods we compare against.                             line with the ground plane must also be parallel. Therefore
                                                                     by projective geometry, they intersect in the image plane at
A. Data Collection                                                   the vanishing point (Fig. 5). After obtaining the 3D pose
   The rover is controlled by a human operator to drive at           of the ground plane, the vanishing point can be used to get
around 0.3m/s. To provide additional control of imaging              vl . The corn stalk plane normal np can be computed by the
conditions, we used a wheeled arch platform to go over the           cross product of vl and ng . Then, dp can be obtained by the
rows and provide cover for the rover. In the future, we are          3D position of any point along the corn line since it is also
planning to combine the rover and the cover into a single            on the ground plane.
robotic platform. Our collected dataset includes 8 different            It is known that color thresholding yields unreliable re-
corn rows where 4 of them are from conventional fields,              sults [11] for vanishing line detection. Thus, we combine
ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv
Fig. 6: The reconstructed point cloud of a corn row and the estimated corn stalk position (red cylinders) from ROW SLAM.

    TABLE II: Faster RCNN Corn Detection Accuracy (%)               accuracy for both cases, and it is because irregular shapes of
                    Conventional field     Organic field            the corns tend to have relatively low IoU values for many
                                                                    detections while they are still detected as corns.
             AP            26.2                 47.8
             AP50          83.1                 89.1                   We compute the metric localization accuracy by compar-
             AP75          9.3                  46.6                ing predicted target positions against the manually measured
     TABLE III: Performance of Corn Stalk Localization              distance between corn stalks. We also evaluate the tracking
      (1 : Metric Error, 2 : Centroid Re-Projection Error)        variance through the re-projection error of the centroid. In
                                                                    Table III we summarize the metric error 1 and centroid re-
                           Conventional field      Organic field
                           1 (cm)     2         1 (cm)     2
                                                                    projection error 2 using the dataset from both conventional
                                                                    field and the more challenging organic field.
   our approach              1.8          5.6          3.5   10.4
   corridor prediction       8.5         19.3          9.7   27.8      Our method outperforms all the baseline methods, which
   front-view SLAM           2.9          8.2          6.2   13.3   demonstrates its accuracy and robustness. As a comparison,
   side-view SLAM            4.5         16.8          7.1   15.5   the corridor prediction method shows fragile estimations of
                                                                    dp . Due to the challenge of accurate estimation for #»
   RANSAC plane fitting      3.7          9.5          9.3   24.2
   optical flow tracking     3.3         11.7          4.6   17.9                                                           v l and
                                                                    projective geometry, the error of dp is significantly impacted
                                                                    by the angular error of the vanishing line in the front view.
the ResNet34 head with a 3-layer multi-layer perceptron to
                                                                    As for the view selection in SLAM, we notice that the
predict the vanishing point pixel location and two slopes for
                                                                    front-view SLAM achieves comparable accuracy with our
the corn line. We do not predict the full parameters for the
                                                                    method during a windless day in the conventional field.
two lines so that we can enforce the projective geometry.
                                                                    However, in general, the comparison between our ROW
   RANSAC Plane Fitting The RANSAC plane fitting
                                                                    SLAM and front/side view SLAM shows that the static
method replaces the multi-view SfM module in Fig. 4 for
                                                                    features are critical for accurate under-canopy localization
the estimation of dp . After obtaining the side plane normal
                                                                    in the corn rows. For the RANSAC plane fitting method,
np , one way to estimate dp is by direct observation of the 3D
                                                                    the performance drop indicates the challenge in directly
scene in the side view. Therefore to support our claim on the
                                                                    observing the corn stalk plane from a single view. Lastly,
low observability of the corn plane, we present plane fitting
                                                                    we observe greater centroid position drift in image plane
in the side view as another baseline method. The detected
                                                                    with optical flow tracking than SORT, which affects the
object bounding boxes are used to find candidate points in
                                                                    localization accuracy and variance, proving the importance
the corn plane. We only need the 3D position of one selected
                                                                    of robust visual tracking.
point with the plane normal to compute dp . Then RANSAC
is applied to make the estimation robust.
                                                                                         VI. C ONCLUSIONS
   Optical Flow Tracking To ensure high localization ac-
curacy, we use SORT as our tracking module which runs                  This paper presents our effort to address the under-canopy
detection at every frame and uses the Kalman Filter to make         corn stalk detection and localization problem within narrow
the tracking more robust. In this baseline, we replace the          corn rows by proposing ROW SLAM and a multi-view
tracking module with optical flow tracking. Unlike SORT,            camera system mounted on a ground vehicle. ROW SLAM
the optical flow tracking method will re-initialize target id       combines existing algorithms (i.e., object tracking, SLAM,
every k frame. In our case, we choose k = 200.                      SfM) and applies several geometric methods to accurately
                                                                    estimate the three planes that bound the row as well as in-
D. Results                                                          dividual plants. Our results demonstrate ROW SLAM yields
   Faster R-CNN is trained for 100 epochs, and Table II             an accurate map containing corn stalk positions and their
shows evaluation results for each field. It tracks objects at       IDs while existing SLAM algorithms fail. Future work will
≈ 11 FPS on a laptop (Intel i7-8850H, Quadro P3200, and             add an end-effector to our existing system to remove weed
32GB RAM). The organic field case has higher Average                physically. Furthermore, we are developing our algorithm to
Precision (AP) values since about 60% of the training dataset       minimize the drift from the SLAM module by using corn
is from organic fields. AP50 (AP at IoU=.50) has the highest        stalks as landmarks together with an offline map.
ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv
R EFERENCES                                       [22] A. Kamilaris and F. X. Prenafeta-Boldú, “Deep Learning in
                                                                                   Agriculture: A survey,” Computers and Electronics in Agriculture,
 [1] A. Michaels, S. Haug, and A. Albert, “Vision-based High-speed Ma-             vol. 147, pp. 70–90, 2018. [Online]. Available: https://www.
     nipulation for Robotic Ultra-precise Weed Control,” in 2015 IEEE/RSJ          sciencedirect.com/science/article/pii/S0168169917308803
     International Conference on Intelligent Robots and Systems (IROS).       [23] A. M. Hasan, F. Sohel, D. Diepeveen, H. Laga, and M. G. Jones,
     IEEE, 2015, pp. 5498–5505.                                                    “A Survey of Deep Learning Techniques for Weed Detection from
 [2] X. Wu, S. Aravecchia, P. Lottes, C. Stachniss, and C. Pradalier,              Images,” Computers and Electronics in Agriculture, vol. 184, p.
     “Robotic Weed Control Using Automated Weed and Crop Classifi-                 106067, 2021.
     cation,” Journal of Field Robotics, vol. 37, no. 2, pp. 322–340, 2020.   [24] G. Ciaparrone, F. L. Sánchez, S. Tabik, L. Troiano, R. Tagliaferri,
 [3] R. Raja, T. T. Nguyen, D. C. Slaughter, and S. A. Fennimore, “Real-           and F. Herrera, “Deep Learning in Video Multi-object Tracking: A
     time Weed-crop Classification and Localisation Technique for Robotic          Survey,” Neurocomputing, vol. 381, pp. 61–88, 2020.
     Weed Control in Lettuce,” Biosystems Engineering, vol. 192, pp. 257–     [25] A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua, “Com-
     274, 2020.                                                                    puter Vision and Deep Learning Techniques for Pedestrian Detection
 [4] R. Bogue, “Robots Poised to Transform Agriculture,” Industrial                and Tracking: A Survey,” Neurocomputing, vol. 300, pp. 17–33, 2018.
     Robot: the international journal of robotics research and application,   [26] A. Osep, W. Mehner, M. Mathias, and B. Leibe, “Combined Image and
     2021.                                                                         World-space Tracking in Traffic Scenes,” in 2017 IEEE International
 [5] F. Zhong, S. Wang, Z. Zhang, and Y. Wang, “Detect-SLAM: Making                Conference on Robotics and Automation (ICRA). IEEE, 2017, pp.
     Object Detection and SLAM Mutually Beneficial,” in 2018 IEEE                  1988–1995.
     Winter Conference on Applications of Computer Vision (WACV).             [27] L. Bridgeman, M. Volino, J.-Y. Guillemaut, and A. Hilton, “Multi-
     IEEE, 2018, pp. 1001–1010.                                                    person 3d Pose Estimation and Tracking in Sports,” in Proceedings of
 [6] K. Ok, K. Liu, K. Frey, J. P. How, and N. Roy, “Robust Object-                the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
     based SLAM for High-speed Autonomous Navigation,” in 2019 In-                 tion Workshops, 2019, pp. 0–0.
     ternational Conference on Robotics and Automation (ICRA). IEEE,          [28] X. Liu, S. W. Chen, S. Aditya, N. Sivakumar, S. Dcunha, C. Qu,
     2019, pp. 669–675.                                                            C. J. Taylor, J. Das, and V. Kumar, “Robust Fruit Counting: Combin-
 [7] P. Pandey, H. N. Dakshinamurthy, and S. Young, “A Literature Review           ing Deep Learning, Tracking, and Structure from Motion,” in 2018
     of Non-Herbicide, Robotic Weeding: A Decade of Progress,” 2020.               IEEE/RSJ International Conference on Intelligent Robots and Systems
 [8] P. Lottes, R. Khanna, J. Pfeifer, R. Siegwart, and C. Stachniss, “UAV-        (IROS). IEEE, 2018, pp. 1045–1052.
     based Crop and Weed Classification for Smart Farming,” in 2017 IEEE      [29] P. Roy and V. Isler, “Surveying Apple Orchards with a Monocular Vi-
     International Conference on Robotics and Automation (ICRA), 2017,             sion System,” in 2016 IEEE International Conference on Automation
     pp. 3024–3031.                                                                Science and Engineering (CASE), 2016, pp. 916–921.
 [9] F. Castaldi, F. Pelosi, S. Pascucci, and R. Casa, “Assessing the         [30] B. K. Horn and B. G. Schunck, “Determining Optical Flow,” Artificial
     Potential of Images from Unmanned Aerial Vehicles (UAV) to Support            intelligence, vol. 17, no. 1-3, pp. 185–203, 1981.
     Herbicide Patch Spraying in Maize,” Precision Agriculture, vol. 18,      [31] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple Online
     no. 1, pp. 76–94, 2017.                                                       and Realtime Tracking,” in 2016 IEEE international conference on
[10] X. Wu, S. Aravecchia, P. Lottes, C. Stachniss, and C. Pradalier,              image processing (ICIP). IEEE, 2016, pp. 3464–3468.
     “Robotic Weed Control Using Automated Weed and Crop Classifi-            [32] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “Motchal-
     cation,” Journal of Field Robotics, vol. 37, no. 2, pp. 322–340, 2020.        lenge 2015: Towards a Benchmark for Multi-target Tracking,” arXiv
[11] A. N. Sivakumar, S. Modi, M. V. Gasparino, C. Ellis, A. E. B.                 preprint arXiv:1504.01942, 2015.
     Velasquez, G. Chowdhary, and S. Gupta, “Learned Visual Nav-              [33] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-
     igation for Under-Canopy Agricultural Robots,” arXiv preprint                 time Object Detection with Region Proposal Networks,” Advances in
     arXiv:2107.02792, 2021.                                                       neural information processing systems, vol. 28, pp. 91–99, 2015.
[12] Z. Zhang, E. Kayacan, B. Thompson, and G. Chowdhary, “High               [34] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan,
     Precision Control and Deep Learning-based Corn Stand Counting                 W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., “Searching for Mo-
     Algorithms for Agricultural Robot,” Autonomous Robots, vol. 44,               bilenetv3,” in Proceedings of the IEEE/CVF International Conference
     no. 7, pp. 1289–1302, 2020.                                                   on Computer Vision, 2019, pp. 1314–1324.
[13] P. Moulon, P. Monasse, and R. Marlet, “Global Fusion of Relative         [35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning
     Motions for Robust, Accurate and Scalable Structure from Motion,”             for Image Recognition,” in Proceedings of the IEEE conference on
     in Proceedings of the IEEE International Conference on Computer               computer vision and pattern recognition, 2016, pp. 770–778.
     Vision, 2013, pp. 3248–3255.                                             [36] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
[14] S. M. Hasheminasab, T. Zhou, and A. Habib, “GNSS/INS-Assisted                 P. Dollár, and C. L. Zitnick, “Microsoft CoCo: Common objects in
     Structure from Motion Strategies for UAV-Based Imagery over Mech-             Context,” in European conference on computer vision. Springer, 2014,
     anized Agricultural Fields,” Remote Sensing, vol. 12, no. 3, p. 351,          pp. 740–755.
     2020.                                                                    [37] R. Revathi and M. Hemalatha, “Certain Approach of Object Tracking
[15] B. Czupryński and A. Strupczewski, “Real-time RGBD SLAM Sys-                 using Optical Flow Techniques,” International Journal of Computer
     tem,” in Photonics Applications in Astronomy, Communications, In-             Applications, vol. 53, no. 8, 2012.
     dustry, and High-Energy Physics Experiments 2015, vol. 9662. In-         [38] J.-Y. Bouguet et al., “Pyramidal Implementation of the Affine Lucas
     ternational Society for Optics and Photonics, 2015, p. 96622B.                Kanade Feature Tracker Description of the Algorithm,” Intel corpora-
[16] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: a                tion, vol. 5, no. 1-10, p. 4, 2001.
     Versatile and Accurate Monocular SLAM System,” IEEE transactions         [39] G. Welch, G. Bishop, et al., “An Introduction to the Kalman Filter,”
     on robotics, vol. 31, no. 5, pp. 1147–1163, 2015.                             1995.
[17] M. Labbé and F. Michaud, “RTAB-Map as an Open-source Lidar and          [40] H. W. Kuhn and B. Yaw, “The Hungarian Method for the Assignment
     Visual Simultaneous Localization and Mapping Library for Large-               Problem,” Naval Res. Logist. Quart, pp. 83–97, 1955.
     scale and Long-term Online Operation,” Journal of Field Robotics,        [41] M. A. Fischler and R. C. Bolles, “Random Sample Consensus: a
     vol. 36, no. 2, pp. 416–446, 2019.                                            Paradigm for Model Fitting with Applications to Image Analysis and
[18] Z. Wang, Q. Zhang, J. Li, S. Zhang, and J. Liu, “A Computationally            Automated Cartography,” Communications of the ACM, vol. 24, no. 6,
     Efficient Semantic SLAM Solution for Dynamic Scenes,” Remote                  pp. 381–395, 1981.
     Sensing, vol. 11, no. 11, p. 1363, 2019.                                 [42] J. L. Crassidis, “Sigma-point Kalman Filtering for Integrated GPS and
[19] C. Rubino, M. Crocco, and A. Del Bue, “3d Object Localisation from            Inertial Navigation,” IEEE Transactions on Aerospace and Electronic
     Multi-view Image Detections,” IEEE transactions on pattern analysis           Systems, vol. 42, no. 2, pp. 750–756, 2006.
     and machine intelligence, vol. 40, no. 6, pp. 1281–1294, 2017.
[20] L. Nicholson, M. Milford, and N. Sünderhauf, “QuadricSLAM: Dual
     Quadrics from Object Detections as Landmarks in Object-oriented
     SLAM,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp.
     1–8, 2018.
[21] Z. Zou, Z. Shi, Y. Guo, and J. Ye, “Object Detection in 20 Years: A
     Survey,” 2019.
ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv ROW-SLAM: Under-Canopy Cornfield Semantic SLAM - arXiv
You can also read