FlowMOT: 3D Multi-Object Tracking by Scene Flow Association

Guangyao Zhai1, Xin Kong1, Jinhao Cui1, Yong Liu1,2,* and Zhen Yang3

arXiv:2012.07541v3 [cs.CV] 5 Mar 2021

   Abstract— Most end-to-end Multi-Object Tracking (MOT) methods suffer from low accuracy and poor generalization ability. Although traditional filter-based methods can achieve better results, they are difficult to tune to optimal hyperparameters and often fail in varying scenarios. To alleviate these drawbacks, we propose a LiDAR-based 3D MOT framework named FlowMOT, which integrates point-wise motion information with the traditional matching algorithm, enhancing the robustness of the motion prediction. We first utilize a scene flow estimation network to obtain implicit motion information between two adjacent frames and calculate the predicted detection for each old tracklet in the previous frame. Then we use the Hungarian algorithm to generate optimal matching relations together with an ID propagation strategy to finish the tracking task. Experiments on the KITTI MOT dataset show that our approach outperforms recent end-to-end methods and achieves competitive performance with the state-of-the-art filter-based method. In addition, ours can work steadily in various-speed scenarios where the filter-based methods may fail.

Fig. 1: A glimpse of FlowMOT results in sequence 01. Red 1 and Green 2 stand for the same chosen car in two adjacent frames (Frame 309 and Frame 310). Blue arrows represent the point-wise scene flow belonging to the car.
                        I. INTRODUCTION

   The task of Multi-Object Tracking (MOT) has been a long-standing and challenging problem, which aims to locate objects in a video and assign a consistent ID to the same instance. Many vision applications, such as autonomous driving, robot collision prediction, and video face alignment, require MOT as a crucial component technology. Recently, MOT has been fueled by the development of research on 3D object detection.

   Research on 3D MOT has attracted widespread attention in the vision community. Weng et al. [1] presented a Kalman-Filter-based 3D MOT framework operating on LiDAR point clouds and proposed a series of evaluation metrics for this task. Although it achieves impressive performance in both accuracy and operation speed, a notable drawback is that such a method only focuses on the bounding boxes of the detection results and ignores the internal correlation of the point clouds. Besides, the Kalman Filter requires frequent hyperparameter tuning due to its hand-crafted motion model, which is sensitive to the attributes of the tracked objects, such as speed and category (e.g., the optimal settings for Car differ from those for Pedestrian). In a follow-up work based on the Kalman Filter, Chiu et al. [2] further improved the performance by implementing a hyperparameter-learning method based on collecting distribution information of the dataset. However, each dataset has its own characteristics (e.g., the condition of the sensors and the collection frequency vary across datasets), which theoretically limits the generalization of this method across datasets. As for current end-to-end approaches, they normally obtain restrained accuracy, and their explainability is not apparent enough.

   Inspired by [3], which uses the optical flow of 2D pixel-level segmentation results to generate tracklets, we realize that scene flow reflects point-wise 3D motion consistency and has the potential to tackle the above issues when integrated with traditional matching algorithms. We follow the track-by-detection paradigm and propose a hybrid motion prediction strategy based on the contribution of point-wise consistency. We use the estimated scene flow to calculate the object-level movement, endowing it with tracklet-prediction ability. Then, through a matching algorithm (such as the Hungarian algorithm [4]), we obtain the matching relationship between the old tracklets in the previous frame and the new detection results in the subsequent one to update the tracking outcomes. We use the proposed Motion Estimation and Motion Tracking modules, taking advantage of scene flow for tracklet prediction instead of a Kalman Filter, which saves the trouble of adjusting numerous hyperparameters and achieves an intrinsically variable motion model throughout the whole process. Experiments show that our framework maintains satisfactory performance, especially in challenging various-speed scenarios where traditional methods may fail due to improper tuning.

   We expect our framework to serve as a prototype that draws researchers' attention to exploring related motion-based methods.

   1 Guangyao Zhai, Xin Kong, Jinhao Cui and Yong Liu are with the Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou 310027, P. R. China.
   2 Yong Liu is with the State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, P. R. China (*Yong Liu is the corresponding author, Email: yongliu@iipc.zju.edu.cn).
   3 Zhen Yang is with the Noah's Ark Laboratory, Huawei Technologies Co. Ltd., Shanghai 200127, P. R. China.
Our contributions are as follows:
  • We propose a 3D MOT framework with scene flow estimation named FlowMOT, which, to the best of our knowledge, is the first to introduce learning-based scene flow estimation into 3D MOT.
  • By integrating a learning-based motion prediction module to update the object-level predictions in 3D space, we avoid the repeated hyperparameter tuning and the constant-motion-model issues inherent in traditional filter-based methods.
  • Experiments on the KITTI MOT dataset show that our approach achieves competitive performance against state-of-the-art methods and can work in challenging scenarios where the traditional methods usually fail.
                       II. RELATED WORK

   In recent years, various methods have been explored for MOT, but in terms of methodology, most MOT solutions so far are based on two paradigms, namely track-by-detection and joint detection and tracking.

Track-by-detection. Most tracking methods are developed on this paradigm, which can be divided into prediction, data association, and update. As the most critical part, data association has attracted plenty of researchers to this field. Among the traditional methods, SORT [5] used a Kalman Filter for object attribute prediction and then the Hungarian algorithm [4] for data association. Weng et al. [1] applied this paradigm to the tracking of 3D objects, also relying on the Hungarian algorithm, and proposed a new 3D MOT evaluation standard that is similar to the 2D MOT one but overcomes its drawbacks. This filter-based tracking framework can achieve accurate results with real-time performance, but its disadvantages are evident as well. To begin with, it needs the hyperparameters in the covariance matrices to be tuned to achieve promising results. Secondly, its motion model is too constant to generalize across different circumstances; in practical usage, application scenes are diversified, and it is not easy to define a single motion model that adapts to all conditions. CenterPoint [6] proposed a BEV-based detector and used information collected in advance (e.g., speed and orientation) to associate the object centers with a Hungarian-based closest-point matching algorithm. Recently, thanks to the robust feature extraction ability of deep learning, more and more learning-based methods for data association have been explored. [7] proposed the Deep Structured Model (DSM) method based on a hinge loss, which is entirely end-to-end. mmMOT [8] used off-the-shelf detectors and both images and point clouds for data association through multi-modal feature fusion and learning of the adjacency matrix between objects. These two methods both use linear programming [9] and belong to the offline methods in tracking research. [10] proposed to interact with appearance features through a Graph Neural Network, thus helping the association and tracking process. JRMOT [11] was designed as a hybrid framework combining traditional and deep learning methods. It also uses camera and LiDAR sensors, but it is an online method and improves the traditional data association part by fusing the appearance of the two modalities – IoU and appearance are selected through training to build a better cost matrix, and then a JPDA procedure is performed. However, the 3D appearance used in this method does not capture the object's representative features (the 2D appearance does, because it uses the ReID technique [12]). A proper exploration of 3D appearance extraction will undoubtedly help its performance in the future.

   Our method follows this paradigm. In addition, we use scene flow estimation to extract 3D motion-consistent information for prediction instead of a Kalman Filter, avoiding hyperparameter adjustments. [13] proposed to use sparse scene flow for moving object detection and tracking, but using image information to compute the geometry leads to relatively lower accuracy. In contrast, we benefit from the robustness and higher accuracy of learning-based 3D scene flow to achieve better results.

Joint detection and tracking. As the detection task and the tracking task are positively correlated, another paradigm different from track-by-detection has been created, allowing detection and tracking to refine each other or integrating the two tasks into a multi-task setting. [14] introduced a siamese network with current and past frame inputs that predicts the offsets of the bounding boxes to complete the tracking work, using the Viterbi algorithm to maximize a proposed score function and optimize the detection result. MOTSFusion [3] tackled Multi-Object Tracking and Segmentation [15] by using multi-modality data to generate tracklets and combining a reconstruction task to optimize and integrate the final tracks by detecting the continuity between them, thereby recovering missed detections. Tracktor [16] utilized an existing robust detector and adopted an ID propagation strategy to achieve tracking without complex algorithms. Inspired by this idea, CenterTrack [17] uses two frames of images and the heat map of the previous frame as network inputs to predict the bounding boxes and offsets of the next frame, achieving the integration of the two tasks.

                       III. METHODOLOGY

   We focus on processing the 3D point clouds generated by LiDAR in autonomous driving scenes. Figure 2 shows the pipeline of our proposed framework. It takes a pair of consecutive labelled point clouds as inputs, namely the old tracklets B^{t−1} and the new detection results D^t in the previous and current frame respectively. Finally, the framework returns the new tracklets B^t in the updated frame. Details are illustrated in the following parts.

A. Scene Preprocessing

   To promote efficiency and reduce superfluous calculation, we preprocess the point cloud scenes frame by frame by expanding the Field of View and fitting the ground plane.
Fig. 2: The FlowMOT architecture for the 3D MOT task. The whole framework consists of three components. As inputs, we use two adjacent point clouds with the ground disregarded, along with the detection results of frame t and the tracklets of frame t − 1. The Motion Estimation module extracts 3D motion information from the point clouds, expressed in the scene flow format. Next, the Motion Tracking module transfers the estimated scene flow into a series of offsets to help the data association procedure. Finally, we use ID propagation plus a birth/death check to accomplish MOT.
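To make the data flow of Figure 2 concrete before the individual modules are detailed, the following is a minimal sketch of one tracking step; `flow_net`, `predict_boxes`, `associate`, and `manager` are hypothetical placeholders for the components described in Sections III-B and III-C, not the authors' released interface.

```python
# Hypothetical high-level loop mirroring Figure 2; all names are illustrative
# placeholders, not the authors' released code.
def flowmot_step(tracks_prev, dets_curr, pc_prev, pc_curr,
                 flow_net, predict_boxes, associate, manager):
    """One tracking step from frame t-1 to frame t."""
    # Motion Estimation: point-wise scene flow between the two preprocessed clouds.
    flow = flow_net(pc_prev, pc_curr)                     # shape (N1, 3)

    # Motion Tracking: turn per-point flow into per-tracklet offsets and
    # predicted boxes for frame t.
    predicted = predict_boxes(tracks_prev, pc_prev, flow)

    # Data association between predicted tracklets and the new detections.
    matches, lost_tracks, new_dets = associate(predicted, dets_curr)

    # ID propagation plus birth/death bookkeeping yields the tracklets B^t.
    return manager.update(matches, lost_tracks, new_dets)
```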

Expanded Field of View. As the unified evaluation procedure only cares about results within the camera's Field of View (FoV), we first use the calibration relationship between the sensors to transfer the point cloud coordinates into the camera coordinate system and then reserve the points within the FoV. As predicting the birth and death of objects is a crucial part of the tracking task (details in Section III-C), we expand the FoV by a certain margin, which allows our framework to procure more information at the edge of the original FoV between the previous and current frames. This improves the inference performance of the scene flow estimation network and thus further benefits the birth/death check. Our experiments find that this step is necessary; see Figure 3.

Fig. 3: With the original (unexpanded) FoV, we notice that a car has left the scene in the current frame, so there are no points representing it. It is normal for an object to leave the scene, but its vanishing would mislead the corresponding points in the previous frame into a wrong scene flow estimation. To moderate this problem, we expand the FoV to let points in the current frame guide the Motion Estimation process.
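A minimal sketch of the FoV preprocessing described above, assuming a KITTI-style 3×4 LiDAR-to-image projection matrix; the expansion margin and helper name are illustrative choices, not values taken from the paper.

```python
import numpy as np

def filter_fov(points_velo, P, img_w, img_h, margin_px=100.0):
    """Keep LiDAR points whose projection falls inside an expanded camera FoV.

    points_velo: (N, 3) XYZ in the LiDAR frame.
    P:           (3, 4) projection matrix from LiDAR to image plane
                 (e.g. P2 @ R0_rect @ Tr_velo_to_cam for KITTI).
    margin_px:   how far beyond the image border points are still kept;
                 an assumed value, not the one used in the paper.
    """
    pts_h = np.hstack([points_velo, np.ones((points_velo.shape[0], 1))])  # (N, 4)
    proj = pts_h @ P.T                                                    # (N, 3)
    in_front = proj[:, 2] > 0.1          # discard points behind the camera
    u = proj[:, 0] / np.clip(proj[:, 2], 1e-6, None)
    v = proj[:, 1] / np.clip(proj[:, 2], 1e-6, None)
    keep = (in_front
            & (u > -margin_px) & (u < img_w + margin_px)
            & (v > -margin_px) & (v < img_h + margin_px))
    return points_velo[keep]
```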
Ground plane fitting. Points on the ground plane constitute the majority of the point cloud, and the motion of the ground is of no concern in our task. Therefore, removing them promotes the sampling probability of potential moving objects and makes the scene flow estimation more accurate (refer to Section III-B). Various algorithms can be used for this step, such as geometric clustering [18], fitting the plane normal vector by Random Sample Consensus (RANSAC) [19], and learning-based inference [20]. Here we use the method of [20]. Note that we only label the ground plane instead of directly removing it, as the front-end object detector may use the raw point clouds with the ground information.
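The paper labels ground points with the learning-based method of [20]; as a simpler stand-in, a RANSAC plane fit yields the same kind of ground mask. The sketch below uses Open3D's plane segmentation and is an assumed substitute, not the authors' implementation.

```python
import numpy as np
import open3d as o3d

def label_ground(points, dist_thresh=0.2, n_iter=200):
    """Return a boolean mask marking likely ground points via RANSAC plane fitting.

    The ground is labelled rather than removed, matching the paper's choice to
    keep the raw points available for the front-end detector.
    """
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points[:, :3])
    # segment_plane returns (a, b, c, d) of ax + by + cz + d = 0 and inlier indices.
    _, inliers = pcd.segment_plane(distance_threshold=dist_thresh,
                                   ransac_n=3,
                                   num_iterations=n_iter)
    mask = np.zeros(len(points), dtype=bool)
    mask[inliers] = True
    return mask
```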
                                                                                       the quantities of points in each frame. pi and qj represent
B. Motion Estimation

   In this module, we leverage a learning-based scene flow estimation framework for predicting tracklets. The key insight is that the 3D motion information of the entire scene is embedded in the scene flow stream regardless of object categories, and existing learning-based scene flow estimation methods have strong generalization ability and high robustness in various scenarios. This operation provides the predictions of the old tracklets from the previous frame and further guides the following data association.

Data presentation. Our work adopts FlowNet3D [21] as the scene flow estimation network, which can be replaced flexibly by other suitable methods. The inputs are the point clouds P = {p_i | i = 1, . . . , N1} and Q = {q_j | j = 1, . . . , N2} from the previous and current frame, where N1 and N2 are the numbers of points in each frame. p_i and q_j represent the geometric and additional information of every single point as follows:

      p_i = {v_i, f_i}  (i = 1, . . . , N1),
      q_j = {v_j, f_j}  (j = 1, . . . , N2)                    (1)

v_i, v_j ∈ R^3 stand for the point coordinates X, Y, Z, while f_i, f_j ∈ R^c stand for color, intensity, and other optional features.
FlowMOT: 3D Multi-Object Tracking by Scene Flow Association
The output is a set of scene flow vectors F ∈ R^{N1×3}, whose order and number are consistent with those of the points in the previous frame.

Estimation procedure. The aim of the scene flow estimation network is to predict where the points of the previous frame will appear in the next one. First, the network uses set conv layers to encode high-dimensional features of points randomly sampled from the two point clouds respectively. After this, it mixes the geometric feature similarities and spatial relationships of the points through the flow embedding layer and another set conv layer to generate embeddings containing the 3D motion information. Finally, by up-sampling and decoding the embeddings through its set upconv layers, we obtain the scene flow of the corresponding points, containing abundant motion information between the two scenes [21].
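A sketch of how the Motion Estimation step could be invoked, assuming the scene flow network is wrapped behind a simple callable; `flow_model` is a hypothetical wrapper, and the sampling size follows the 6000 points per frame reported in Section IV.

```python
import numpy as np

def estimate_scene_flow(pc_prev, pc_curr, flow_model, n_points=6000, seed=0):
    """Randomly sample both frames and run a scene flow network on the pair.

    pc_prev, pc_curr: (N, 3+) arrays, XYZ plus optional per-point features.
    flow_model:       callable taking (P, Q) and returning per-point flow for P.
    Returns (sampled_prev_points, flow) where flow has shape (n_points, 3).
    """
    rng = np.random.default_rng(seed)
    idx_p = rng.choice(len(pc_prev), size=min(n_points, len(pc_prev)), replace=False)
    idx_q = rng.choice(len(pc_curr), size=min(n_points, len(pc_curr)), replace=False)
    P, Q = pc_prev[idx_p], pc_curr[idx_q]
    flow = flow_model(P, Q)          # predicted displacement of P from t-1 to t
    return P, flow
```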
C. Motion Tracking

   This module contains the back-end tracking procedures. It explicitly expresses the movement tendencies of objects extracted from the point-wise motion consistency.

Instance prediction. The goal is to estimate the movement of every tracklet in the previous frame. According to the results given by the detector and the old tracklets, we can allocate a unique instance ID to each object, separating them from each other. After obtaining the scene flow of the whole scene between frames, we can predict the approximate translation and orientation of each object according to the instance IDs by calculating the linear increment prediction of each tracklet (∆x_n, ∆y_n, ∆z_n) and the angular increment prediction ∆θ_n, which are obtained from Equation 2 and a constant angular velocity model respectively.

      (∆x_n, ∆y_n, ∆z_n) = (1/N) Σ_{c=1}^{N} F_c^n                    (2)

F_c^n denotes the scene flow vector of the c-th point belonging to the corresponding tracklet and N is the number of such points. The final combined increment is called the offset, represented as:

      O = {o_n | n = 1, . . . , T},
      o_n = {(∆x_n, ∆y_n, ∆z_n), ∆θ_n}                    (3)

where T is the number of tracklets in the previous frame.
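The instance-level translation of Equation 2 is simply a mean of the flow vectors that belong to a tracklet. A minimal sketch, assuming the sampled frame t−1 points have already been assigned tracklet IDs (e.g., by a box containment test, a step the paper does not spell out); the angular increment ∆θ_n from the constant angular velocity model is handled separately.

```python
import numpy as np

def tracklet_offsets(flow, instance_ids, track_ids):
    """Average per-point scene flow into one translation offset per tracklet (Eq. 2).

    flow:         (N, 3) scene flow of the sampled frame t-1 points.
    instance_ids: (N,) tracklet ID assigned to each point, -1 for background.
    track_ids:    list of tracklet IDs alive at frame t-1.
    Returns {track_id: (dx, dy, dz)}; IDs with no assigned points get a zero offset.
    """
    offsets = {}
    for tid in track_ids:
        mask = instance_ids == tid
        offsets[tid] = flow[mask].mean(axis=0) if mask.any() else np.zeros(3)
    return offsets
```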
Data association. In every frame, the set of tracklets and the bounding box of each tracklet can be represented as

      B^{t−1} = {b_n^{t−1} | n = 1, . . . , T},
      b_n^{t−1} = (x_n^{t−1}, y_n^{t−1}, z_n^{t−1}, l_n^{t−1}, w_n^{t−1}, h_n^{t−1}, θ_n^{t−1})                    (4)

including the location of the object center in 3D space (x, y, z), the object 3D size (l, w, h), and the heading angle θ. The state is then predicted as B^{t−1}_pre by the offset:

      (X_pre^{t−1}, Y_pre^{t−1}, Z_pre^{t−1}) = (X, Y, Z) + (∆X, ∆Y, ∆Z)                    (5)
      Θ_pre^{t−1} = Θ + ∆Θ                    (6)

Then we calculate a similarity score matrix between B^{t−1}_pre and the new object detections D^t = {d_n^t | n = 1, . . . , M} in frame t. The matching matrix S ∈ R^{T×M} produces scores for all possible matches between the detections and tracklets, which can be solved in polynomial time by the Hungarian matching algorithm [4].
ID propagation. Most methods in the joint detection and tracking paradigm share a common procedure: they first obtain the link information between the detection results in the previous and current frame and then directly propagate tracking IDs according to this prior information without bells and whistles. We use this strategy without introducing complicated calculations in the new tracklet generation process, and propagate IDs according to the instance matching relationship produced by data association. That is, we fully trust and adopt the detection results in the next frame. If we fail to align an old tracklet with its corresponding detection, we check its age and decide whether to keep it alive.

Birth/Death check. This inspection determines whether a tracklet meets its birth or death. Two kinds of circumstances can cause the birth or death of a trajectory. The normal one is that tracked objects leave the scene or new objects enter the scene, whilst the abnormal but common one is caused by wrong detection results from the front-end detector we use – False Negative and False Positive phenomena. We handle these problems by setting a max_mis and a min_det, which respectively represent the maintenance period of a lost tracklet and the number of observations required for a tracklet candidate.

   1) Birth check: On the one side, we treat all unmatched detection results D^det as potential new objects entering the scene, including False Positive ones. We then deal with this in a uniform way – an unmatched detection result d_n^det will not create a new tracklet b_n^bir until it is continuously matched in the next min_det frames. Ideally, this strategy avoids generating wrong tracklets.

   2) Death check: On the other side, we treat all unmatched tracklets B^mis as potential objects leaving the scene, including False Negative ones. We deal with this in another uniform way – when a tracklet b_m^mis cannot be associated with a corresponding detection in the next frame, its age increases and we keep updating the attributes of its bounding box unless the age exceeds max_mis. Typically, this tactic keeps a true tracklet alive when it is still in the scene but cannot find a match due to a lack of positive detection results.
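A minimal sketch of the ID propagation and birth/death bookkeeping described above, using the max_mis / min_det counters (2 and 3 in Section IV-A); the class layout and the way candidate detections are keyed across frames are illustrative simplifications, not the authors' code.

```python
class TrackManager:
    """Illustrative ID propagation with birth/death checks (not the released code)."""

    def __init__(self, max_mis=2, min_det=3):
        self.max_mis, self.min_det = max_mis, min_det
        self.tracks = {}        # track_id -> {"box": ..., "miss": int}
        self.candidates = {}    # candidate key -> {"box": ..., "hits": int}
        self.next_id = 0

    def update(self, matches, unmatched_trk_ids, unmatched_dets):
        # Matched tracklets: adopt the new detection box, reset the miss counter.
        for tid, det_box in matches:
            self.tracks[tid] = {"box": det_box, "miss": 0}
        # Death check: age unmatched tracklets, drop them once max_mis is exceeded.
        for tid in unmatched_trk_ids:
            self.tracks[tid]["miss"] += 1
            if self.tracks[tid]["miss"] > self.max_mis:
                del self.tracks[tid]
        # Birth check: an unmatched detection becomes a tracklet only after
        # min_det consecutive observations. Here each detection carries a key
        # assumed to be stable across frames, a simplification of the paper's check.
        for key, det_box in unmatched_dets:
            cand = self.candidates.setdefault(key, {"box": det_box, "hits": 0})
            cand["hits"] += 1
            cand["box"] = det_box
            if cand["hits"] >= self.min_det:
                self.tracks[self.next_id] = {"box": det_box, "miss": 0}
                self.next_id += 1
                del self.candidates[key]
        return self.tracks
```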
Fig. 4: LiDAR frame-dropping circumstance in KITTI sequence 01. To provide a better viewing experience, we show Frame 176 to Frame 183 from the camera, which are consistent with the ones from the LiDAR. Notice that AB3DMOT dies at Frame 181 and respawns at Frame 183.
TABLE I: Performance of Car on the KITTI 10Hz val dataset using the 3D MOT evaluation metrics.

Method            Algorithm   Input Data   Matching criteria    sAMOTA↑   AMOTA↑   AMOTP↑   MOTA↑   MOTP↑   IDS↓   FRAG↓
mmMOT [8]         Learning    2D + 3D      IoU_thres = 0.25      70.61     33.08    72.45    74.07   78.16    10      55
mmMOT [8]         Learning    2D + 3D      IoU_thres = 0.7       63.91     24.91    67.32    51.91   80.71    24     141
FANTrack [22]     Learning    2D + 3D      IoU_thres = 0.25      82.97     40.03    75.01    74.30   75.24    35     202
FANTrack [22]     Learning    2D + 3D      IoU_thres = 0.7       62.72     24.71    66.06    49.19   79.01    38     406
AB3DMOT [1]       Filter      3D           IoU_thres = 0.25      93.28     45.43    77.41    86.24   78.43     0      15
AB3DMOT [1]       Filter      3D           IoU_thres = 0.7       69.81     27.26    67.00    57.06   82.43     0     157
FlowMOT (Ours)    Hybrid      3D           IoU_thres = 0.25      90.56     43.51    76.08    85.13   79.37     1      26
FlowMOT (Ours)    Hybrid      3D           IoU_thres = 0.7       73.29     29.51    67.06    62.67   82.25     1     249

TABLE II: Performance of Car on the KITTI 5Hz (high-speed) val dataset.

Method            Matching criteria    sAMOTA↑   AMOTA↑   AMOTP↑   MOTA↑
AB3DMOT [1]       IoU_thres = 0.25      82.98     36.83    69.77    74.55
AB3DMOT [1]       IoU_thres = 0.7       56.71     18.96    58.00    45.25
FlowMOT (Ours)    IoU_thres = 0.25      87.30     40.34    74.22    80.11
FlowMOT (Ours)    IoU_thres = 0.7       70.35     27.23    65.20    59.72

                       IV. EXPERIMENTS

A. Experimental protocols

Evaluation Metrics. We adopt the 3D MOT metrics first proposed by AB3DMOT [1], which are more tenable for evaluating the performance of 3D MOT systems than the ones in CLEAR [23], which focus on 2D performance. The primary metric used to rank tracking methods is also called MOTA, which incorporates three error types: False Positives (FP), False Negatives (FN) and ID Switches (IDS). The difference lies in how these factors are calculated, changed from 2D IoU to 3D IoU, and we match the 3D tracking results with the ground truth directly in 3D space. Meanwhile, MOTP and FRAG are also used as parts of the metrics.
   In addition, AB3DMOT illustrates that the confidence of the detection results can dramatically affect the performance of a specific MOT system (details are delivered in [1]). It proposes AMOTA and AMOTP (average MOTA and MOTP) to handle the problem that current MOT evaluation metrics ignore the confidence and only evaluate at a specific threshold:

      AMOTA = (1/L) Σ_{r ∈ {1/L, 2/L, . . . , 1}} MOTA_r                    (7)

where r is a specific recall value (confidence threshold) and L is the number of recall values. AMOTP has a similar form. To make the value of AMOTA range from 0% up to 100%, the range of MOTA_r is scaled by:

      sMOTA_r = max(0, MOTA_r · (1/r))                    (8)

The average version sAMOTA can then be calculated via Equation 7; it is the most significant one among all metrics.
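A small sketch of Equations 7 and 8 as written above, given MOTA values already computed at L evenly spaced recall thresholds; it mirrors the formulas only and is not the official evaluation tool.

```python
def s_mota(mota_r, r):
    """Scaled MOTA at recall r (Eq. 8): clamps at zero and rescales by 1/r."""
    return max(0.0, mota_r / r)

def amota(mota_at_recall):
    """Average MOTA over recall thresholds {1/L, 2/L, ..., 1} (Eq. 7).

    mota_at_recall: dict mapping recall value r -> MOTA_r.
    """
    return sum(mota_at_recall.values()) / len(mota_at_recall)

def s_amota(mota_at_recall):
    """sAMOTA: the same average taken over the scaled sMOTA_r values."""
    return sum(s_mota(m, r) for r, m in mota_at_recall.items()) / len(mota_at_recall)
```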
Dataset settings. We evaluate our framework on the KITTI MOT dataset [24], which provides high-quality LiDAR point clouds and MOT ground truth. Since KITTI does not provide an official train/val split, we follow [25] and use sequences 01, 06, 08, 10, 12, 13, 14, 15, 16, 18, 19 as the val dataset and the other sequences as the training dataset, though our 3D MOT system does not require training on this dataset. We adopt the evaluation tool provided by AB3DMOT, as KITTI only supports 2D MOT evaluation. For the matching principle, we follow the convention of the KITTI 3D object detection benchmark and use 3D IoU to determine a successful match. Specifically, we use a 3D IoU threshold IoU_thres of 0.25 and 0.7, representing a tolerant criterion and a strict one respectively, for the Car category and 0.25 for the Pedestrian category. To evaluate robustness, we generate a subset by downsampling the val dataset from 10Hz to 5Hz to simulate high-speed scenes as part of the variable-speed scenarios.
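The 5Hz split mentioned above can be produced by keeping every second frame of each 10Hz sequence (the even-numbered frames, as Section IV-B describes); a trivial sketch:

```python
def downsample_to_5hz(frame_ids):
    """Keep the even-numbered frames of a 10Hz KITTI sequence to simulate 5Hz.

    frame_ids: sorted list of integer frame indices of one sequence.
    """
    return [f for f in frame_ids if f % 2 == 0]

# Example: frames 0..9 of a sequence become [0, 2, 4, 6, 8].
print(downsample_to_5hz(list(range(10))))
```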
Implementation details. For the front-end detector, we adopt a 3D detection method purely based on the point cloud. In order to compare with other methods fairly, we use PointRCNN [26] as the detection module, and the off-the-shelf detection results are consistent with the ones used in [1].

   Recently, several 3D scene flow approaches [21], [27], [28] have been proposed and have shown terrific performance, and they can easily be integrated with our Motion Estimation module. In our experiments, we adopt FlowNet3D [21] and deploy it on an Nvidia Titan Xp GPU. Since the point clouds in the KITTI MOT dataset lack ground truth labels for the scene flow, we keep the training strategy of the network consistent with the one conveyed in its original paper – using FlyingThings3D [29] as the training dataset. Although this dataset is synthetic, FlowNet3D has strong generalization ability and can adequately adapt to the LiDAR-based point cloud data distribution in KITTI. Specifically, we randomly sample 6000 points for each frame, resulting in 12000 points (two adjacent frames) as the network input. Meanwhile, we notice that [30] proposes to train FlowNet3D in an unsupervised manner, making the training process no longer bound by the lack of ground truth, which shows more potential and may further improve the performance of our framework.

Fig. 5: Qualitative results of FlowMOT on our 5Hz split. The first column shows the predicted results produced by AB3DMOT [1], whilst the second and third columns are our results in the camera view and LiDAR view respectively. Note that AB3DMOT loses the target car represented by a red star in the third column, but we can track it successfully. (Best viewed with zoom-in.)

TABLE III: Performance of Pedestrian with IoU_thres = 0.25 on the KITTI 10Hz and 5Hz val dataset.

Method            Frame rate   sAMOTA↑   AMOTA↑   AMOTP↑   MOTA↑
AB3DMOT [1]       10Hz          75.85     31.04    55.53    70.90
AB3DMOT [1]       5Hz           62.53     20.91    44.50    57.90
FlowMOT (Ours)    10Hz          77.11     31.50    55.04    69.02
FlowMOT (Ours)    5Hz           74.58     29.73    53.23    66.97
   With respect to the design of the Motion Tracking module, we use the same parameters as AB3DMOT, aiming to compare performance without the effect of other factors. We use IoU_min = 0.01 as the matching principle in the data association part. In the birth and death memory module, max_mis and min_det are set to 2 and 3 respectively.
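For reference, the Motion Tracking hyperparameters quoted in this subsection can be gathered in one place; an illustrative configuration, not a file shipped with the paper:

```python
# Illustrative configuration mirroring the values reported in Section IV-A.
FLOWMOT_CONFIG = {
    "iou_min": 0.01,           # minimum 3D IoU accepted during data association
    "max_mis": 2,              # frames a lost tracklet is kept alive (death check)
    "min_det": 3,              # consecutive detections required to spawn a tracklet
    "points_per_frame": 6000,  # random samples fed to the scene flow network
}
```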
B. Experimental Results

Quantitative results. The quantitative experiment results are divided into two parts – the tracking results on the KITTI val dataset at the normal frame rate and the results at a reduced frame rate that imitates high-speed scenes. The most critical indicator is sAMOTA, as MOTA has some inherent drawbacks.

   1) Normal scenes: We compare against state-of-the-art methods that are both learning-based and filter-based, and summarize the results in Table I and Table III. Our system outperforms mmMOT (ICCV'19) and FANTrack (IV'19) by a large margin and meanwhile achieves a competitive result with AB3DMOT (IROS'20). For the Car category, we consider that the reason why we are slightly inferior to AB3DMOT at the tolerant IoU_thres is that KITTI lacks scene flow ground truth and FlowNet3D cannot be fine-tuned adequately. When IoU_thres becomes stricter, our ID propagation strategy starts to be effective, as the raw detection results are closer to the ground truth bounding boxes than the filter-updated tracklets. Thus, our framework can outperform AB3DMOT. For the Pedestrian category, Table III shows that ours can be superior to AB3DMOT. The reason is that the motion style of Pedestrian differs from Car, and thus the hyperparameter settings for Pedestrian in AB3DMOT should not be the same as the ones for Car. However, due to the inherent problem of its constant covariance matrix, AB3DMOT has to repeatedly adjust the matrix to maintain its performance across categories. In contrast, our method can conduct a one-off tracking procedure for all categories, as we obtain a varying motion model by estimating scene flow.

   2) High-speed scenes: In practice, various-speed motion of vehicles is common and the frame rates of sensors usually differ from each other, so it is important to evaluate robustness. We first observe that AB3DMOT shows a performance descent in sequence 01, as Frame 177 to Frame 180 are missing, giving sequence 01 a different frame rate compared with the other sequences, as shown in Figure 4. This implies that the single constant motion model used in AB3DMOT may fail more often in various-speed scenarios and cannot handle the frame-dropping problem, which is common in the practical daily usage of sensors. Inspired by this, we further imitate high-speed scenes, which belong to the various-speed scenarios, by only keeping the even-numbered frames in the dataset, equivalent to reducing the frame rate to 5Hz. The quantitative comparison with AB3DMOT proves our statement and is summarized in Table II and Table III.

Qualitative results. Figure 5 shows the performance of our method and AB3DMOT facing varying frame rate scenarios. AB3DMOT cannot handle a lower frame rate, representing high-speed or frame-dropping circumstances, because the parameters of the covariance matrix used in this filter-based method are only suitable for a specific frame rate, and it fails when the frame rate changes. Our method takes advantage of the robustness and generalization of the learning-based algorithm and can better adapt to this scenario than AB3DMOT.
                       V. DISCUSSION

   Scene flow has shown its advantages in predicting objects' motion and further improving the robustness of MOT, but there are a few limitations we are still facing. Firstly, due to the data distribution of LiDAR-based point clouds, the density of the point cloud changes with distance. This phenomenon makes FlowNet3D's estimation relatively inaccurate at long range, as FlowNet3D is mainly trained on dense depth images. Thus, distant tracklets are not well associated, causing tracking failures. One solution is variable-density sampling, which pays more attention to farther points and keeps fewer closer points. Secondly, at present, networks used for scene flow estimation are usually designed to be large and therefore tend to occupy much GPU memory. How to produce more lightweight networks is a future direction researchers can study.

                       VI. CONCLUSION

   This paper has presented a track-by-detection 3D MOT framework, which is the first to propose predicting and associating tracklets by learning-based scene flow estimation. We take advantage of both the traditional matching method and the deep learning approach to avoid the problems caused by the cumbersome manual hyperparameter adjustment and the constant motion model existing in filter-based methods. The experiments show that our method is more accurate than state-of-the-art end-to-end methods and obtains competitive results against filter-based methods. Moreover, our framework can handle challenging scenarios where the accurate filter-based method may break down. Despite some limitations, we hope our work can be a prototype that inspires researchers to explore related fields.
                       REFERENCES

 [1] X. Weng, J. Wang, D. Held, and K. Kitani, "3d multi-object tracking: A baseline and new evaluation metrics," 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
 [2] H.-k. Chiu, A. Prioletti, J. Li, and J. Bohg, "Probabilistic 3d multi-object tracking for autonomous driving," arXiv preprint arXiv:2001.05673, 2020.
 [3] J. Luiten, T. Fischer, and B. Leibe, "Track to reconstruct and reconstruct to track," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1803–1810, 2020.
 [4] H. W. Kuhn, "The hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
 [5] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 3464–3468.
 [6] T. Yin, X. Zhou, and P. Krähenbühl, "Center-based 3d object detection and tracking," arXiv preprint arXiv:2006.11275, 2020.
 [7] D. Frossard and R. Urtasun, "End-to-end learning of multi-sensor 3d tracking by detection," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 635–642.
 [8] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, "Robust multi-modality multi-object tracking," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2365–2374.
 [9] S. Schulter, P. Vernaza, W. Choi, and M. Chandraker, "Deep network flow for multi-object tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6951–6960.
[10] X. Weng, Y. Wang, Y. Man, and K. M. Kitani, "Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6499–6508.
[11] A. Shenoi, M. Patel, J. Gwak, P. Goebel, A. Sadeghian, H. Rezatofighi, R. Martin-Martin, and S. Savarese, "Jrmot: A real-time 3d multi-object tracker and a new large-scale dataset," 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
[12] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun, "Alignedreid: Surpassing human-level performance in person re-identification," arXiv preprint arXiv:1711.08184, 2017.
[13] P. Lenz, J. Ziegler, A. Geiger, and M. Roser, "Sparse scene flow segmentation for moving object detection in urban environments," in 2011 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2011, pp. 926–932.
[14] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Detect to track and track to detect," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3038–3046.
[15] P. Voigtlaender, M. Krause, A. Osep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe, "Mots: Multi-object tracking and segmentation," in Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[16] P. Bergmann, T. Meinhardt, and L. Leal-Taixe, "Tracking without bells and whistles," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 941–951.
[17] X. Zhou, V. Koltun, and P. Krähenbühl, "Tracking objects as points," ECCV, 2020.
[18] D. Zermas, I. Izzat, and N. Papanikolopoulos, "Fast segmentation of 3d point clouds: A paradigm on lidar data for autonomous vehicle applications," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 5067–5073.
[19] A. Behl, D. Paschalidou, S. Donné, and A. Geiger, "Pointflownet: Learning representations for rigid motion estimation from point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7962–7971.
[20] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham, "Randla-net: Efficient semantic segmentation of large-scale point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11108–11117.
[21] X. Liu, C. R. Qi, and L. J. Guibas, "Flownet3d: Learning scene flow in 3d point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 529–537.
[22] E. Baser, V. Balasubramanian, P. Bhattacharyya, and K. Czarnecki, "Fantrack: 3d multi-object tracking with feature association network," in 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019, pp. 1426–1433.
[23] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: the clear mot metrics," EURASIP Journal on Image and Video Processing, vol. 2008, pp. 1–10, 2008.
[24] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? the kitti vision benchmark suite," in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[25] S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, and K. Granström, "Mono-camera 3d multi-object tracking using deep learning detections and pmbm filtering," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 433–440.
[26] S. Shi, X. Wang, and H. Li, "Pointrcnn: 3d object proposal generation and detection from point cloud," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 770–779.
[27] X. Gu, Y. Wang, C. Wu, Y. J. Lee, and P. Wang, "Hplflownet: Hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3254–3263.
[28] X. Liu, M. Yan, and J. Bohg, "Meteornet: Deep learning on dynamic 3d point cloud sequences," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9246–9255.
[29] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4040–4048.
[30] H. Mittal, B. Okorn, and D. Held, "Just go with the flow: Self-supervised scene flow estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11177–11185.