Multi-Object Tracking and Segmentation with a Space-Time Memory Network
 Mehdi Miah, Guillaume-Alexandre Bilodeau and Nicolas Saunier
 Polytechnique Montréal
 {mehdi.miah, gabilodeau, nicolas.saunier}@polymtl.ca
arXiv:2110.11284v1 [cs.CV] 21 Oct 2021

Abstract

We propose a method for multi-object tracking and segmentation that does not require fine-tuning or per-benchmark hyper-parameter selection. The proposed tracker, MeNToS, particularly addresses the data association problem. Indeed, the recently introduced HOTA metric, which has a better alignment with the human visual assessment by evenly balancing the quality of detections and associations, has shown that improvements are still needed for data association. After creating tracklets using instance segmentation and optical flow, the proposed method relies on a space-time memory network developed for one-shot video object segmentation to improve the association of tracklets with temporal gaps. We evaluated our tracker on KITTIMOTS and MOTSChallenge and show the benefit of our data association strategy with the HOTA metric. The project page is www.mehdimiah.com/mentos+.

1. Introduction

Analyzing road traffic videos is an effective method to develop and evaluate urban policies and traffic management methods [45, 7]. For example, it enables the accurate detection of hazardous behaviors such as exceeding the speed limit, overtaking cyclists at high speed or committing traffic violations. Detecting such events is possible through video data analysis: given a video, the objective is to detect the road users, track them and compute key indicators, such as speed and trajectories. This paper focuses on the first two aspects, with an emphasis on the data association component of tracking, which aims to assign unique identities to all the objects in a video. Keeping the same identity for an object after an occlusion is very important, especially for safety purposes. Indeed, occlusions happen when two objects, for example a car and a pedestrian, are close, leading to a situation of possible collision. Not attributing the same identity to a car before and after an occlusion makes the recognition of hazardous events and their analysis harder.

Until recently, multi-object tracking and segmentation (MOTS) [37] was generally evaluated with measures heavily biased against the association step [20, 35]. Luiten et al. [20] proved that some metrics, such as MOTA [31] and its variants, largely ignore the association quality. Hence, they introduced the HOTA metric, which not only balances the detection and association steps but also aligns better with human visual assessment. Therefore, evaluating MOTS with the HOTA metric should lead to methods that are more robust and that could in turn improve the quality of the analysis of traffic videos.

In addition to MOTS, some researchers consider another tracking task: one-shot video object segmentation (OSVOS) [4, 3, 5], where the segmentation masks of all objects of interest are provided at the first frame. Even if it is similar to MOTS, where the objective is to detect and track the objects belonging to a predefined set of classes (usually pedestrians, cyclists and cars), both tasks are evaluated differently and their standard solving strategies vary significantly. OSVOS is mainly based on propagation methods [25, 44, 23] and cannot rely on detections since the classes of the objects of interest are unknown, whereas MOTS is heavily detection-oriented since only the classes are known. Hence, MOTS is regarded as an association problem.

Our proposed tracker, named MeNToS (Memory Network-based Tracker of Segments), uses a propagation method originally developed for OSVOS to solve the association problem of MOTS. We solve the association problem hierarchically, as in [43, 46]. First, given some instance segmentation masks, we associate masks between consecutive frames. Such masks are spatially close and visually similar, so a method based on low-level information (namely colors) may be sufficient. That is why the short-term association is based on optical flow. After this first association, we obtain continuous tracklets. Second, we use a space-time memory (STM) network [24], originally developed to solve OSVOS, to associate the tracklets. A STM network can be viewed as a spatio-temporal attention mechanism [36, 2] able to find a correspondence between a target and a set of queries. As opposed to OSVOS, where memory networks [15, 32] are used to propagate masks to the next frames, here we use them to link tracklets that are temporally apart.

The main contributions of our work are as follows:

• We propose MeNToS, a method to solve MOTS based on a space-time memory network with a new similarity measure between tracklets;

• We evaluate our method on KITTIMOTS and MOTSChallenge to prove that it is competitive, especially on the association part. We demonstrate that our approach for long-term association using a STM network is better than the recent approach based on re-identification networks;

• We show the usefulness of the HOTA metric for MOTS to capture the improvement resulting from improved data association.

2. Related works

MOTS  Similarly to multi-object tracking, where the "tracking-by-detection" paradigm is popular [6], MOTS is mainly solved by creating tracklets from segmentation masks and then building long-term tracks by merging the tracklets [43, 46, 18]. Usually, methods use an instance segmentation method to generate binary masks; ReMOTS [43] used two advanced instance segmentation methods and self-supervision to refine masks. As for the association step, many methods require a re-identification (ReID) step. For example, Voigtlaender et al. [37] extended Mask R-CNN with an association head to return an embedding for each detection; Yang et al. [43] associated two tracklets if they were temporally close, without any temporal overlap, and had similar appearance features, based on all their observations and a hierarchical clustering; while Zhang et al. [46] used temporal attention to lower the weight of frames with occluded objects. More recently, Wei et al. [39] proposed to solve MOTS by improving the quality of the detected masks with massive instance augmentation during training [9] and mask refinement during inference [33], and then associated detections with an ensemble method.

OSVOS  Closely related to MOTS, OSVOS requires tracking objects whose segmentation masks are only provided at the first frame. OSVOS is mainly solved by propagation-based methods: a model learns the representation of the initial mask and tries to establish correspondences in the next frames. MAST [16] used a memory component to predict the future location of objects. STM [24] was proposed to solve OSVOS by storing some previous frames and masks in a memory that is later read by an attention mechanism to predict the new mask in a target image. Such a network was recently used [8] to solve video instance segmentation (VIS), a problem in which no prior knowledge is given about the objects to track. However, it is unclear how a STM network behaves when multiple instances of the same class appear in a video. We show in this work that a STM network performs well and can help solve a ReID problem by taking advantage of information at the pixel level and of the presence of other objects.

Bridging the gap between MOTS and OSVOS  Despite some clear similarities between the two tasks (prediction of the future position of an object at the pixel level), there are some differences, such as the evaluation (the sMOTSA and HOTA measures evaluate the quality of detection and association, whereas J&F evaluates the mask quality with a region similarity and a contour accuracy) and the specification at the first frame (predefined classes without any initial mask versus class-agnostic objects with an initial mask). Very recently, Wang et al. [38] addressed this issue by considering a common shared appearance model to solve several tracking problems, including MOTS and OSVOS. They compared the performance of several pre-trained vision models, such as an ImageNet [28] pretrained architecture, MoCo [10] or CRW [12], to obtain visual features. Then the authors used these representations to either propagate instances as in OSVOS or associate them as in MOTS. Moreover, for the association step, they computed a similarity score between tracklets from an attention perspective: their method reconstructs, at the patch level, all the features of the existing tracklets and current detections, and the similarity is the average cosine similarity between the forward and backward reconstructions.

3. Proposed method

As illustrated in Figure 1, our pipeline for tracking multiple objects is based on three main steps: the detection of all objects with classes in the COCO dataset [17], a short-term association of segmentation masks in consecutive frames and a greedy long-term association of tracklets using a memory network.

3.1. Detections

Our method follows the "tracking-by-detection" paradigm. First, we used the non-overlapping detections provided by the RobMOTS challenge [19], obtained from a Mask R-CNN X-152 [11] and the Box2Seg network [21] for all 80 categories of COCO. Objects with a detection score higher than $\theta_d$ and an area bigger than $\theta_a$ are kept.

3.2. Short-term association (STA)

During the short-term association, we associate temporally close segmentation masks between consecutive frames by computing the optical flow with RAFT [34]. Masks from the previous frame are warped, and a mask IoU (mIoU) is computed between these warped masks and the masks of the next frame. As some classes of COCO are visually similar (for example, car and truck), only associating objects of the same class may lead to missing some objects due to classification errors. That is why this step is a class-agnostic association, letting the model match a car with a truck if the optical flow corresponds.

The Hungarian algorithm [14] is then used to associate masks, where the cost matrix is computed from the negative mIoU. Matched detections with a mIoU above a threshold $\theta_s$ are connected to form a tracklet, and the remaining detections form new tracklets.

Finally, the class of a tracklet is defined as the most frequent one among its detections, and tracklets with only one detection are deleted since they often correspond to false positives.
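The short-term step can be sketched as follows in Python. This is a minimal illustration, not the authors' released code: `flow` stands for a dense optical flow such as a RAFT implementation would output, and `warp_mask` is a naive nearest-neighbour warp standing in for a proper remapping.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

THETA_S = 0.15  # mIoU threshold used at the STA step (Sec. 4.1)

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same image size."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def warp_mask(mask: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Forward-warp a boolean mask with a dense optical flow (H, W, 2).

    Nearest-neighbour splatting for brevity; a real pipeline would use a
    proper remapping (e.g. cv2.remap on the backward flow)."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    xt = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, h - 1)
    warped = np.zeros_like(mask)
    warped[yt, xt] = True
    return warped

def short_term_association(prev_masks, next_masks, flow):
    """Class-agnostic matching of masks between two consecutive frames.

    Returns (i, j) index pairs linking prev_masks to next_masks."""
    if not prev_masks or not next_masks:
        return []
    warped = [warp_mask(m, flow) for m in prev_masks]
    # Cost matrix: negative mIoU, so the Hungarian algorithm maximizes overlap.
    cost = np.array([[-mask_iou(w, n) for n in next_masks] for w in warped])
    rows, cols = linear_sum_assignment(cost)
    # Keep only assignments whose mIoU exceeds the threshold θs.
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] > THETA_S]
```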

Figure 1. Illustration of our MeNToS method. Given an instance segmentation, binary masks are matched in consecutive frames to create
tracklets. Very short tracklets are deleted. An appearance similarity, based on a memory network, is computed between two admissible
tracklets. Then, tracklets are gradually merged starting with the pair having the highest similarity while respecting the updated constraints.
Finally, low confidence tracks are deleted. Illustration inspired by [43].

3.3. Long-term association (LTA)

The greedy long-term association and the use of a STM network for the re-identification of tracklets are the novelties of our approach. Once tracklets have been created, it is necessary to link them in case of fragmentation caused, for example, by occlusion. In this long-term association, we use a memory network as a similarity measure between tracklets: some masks of a tracklet are propagated into the past and the future, and if a propagated mask sufficiently overlaps a mask of another tracklet, these two tracklets are linked together. Given that this procedure is applied at the pixel level on the whole image, this similarity is only computed on a selection of admissible tracklet pairs to reduce the computational cost. At this step, we point out that all tracklets have a length longer than or equal to two.

3.3.1 Measure of similarity between tracklets

Our similarity measure is based on the ability to match some parts of two different tracklets (say $T^A$ and $T^B$) and can be interpreted as a pixel-level visual-spatial alignment rather than a patch-level visual alignment [43, 46]. For that, we propagate some masks of tracklet $T^A$ to frames where the tracklet $T^B$ is present, and then compare the masks of $T^B$ with the propagated heatmaps of $T^A$, taken before binarization. The more they are spatially aligned, the higher the similarity. In detail, let us consider two tracklets $T^A = (M_1^A, M_2^A, \dots, M_N^A)$ and $T^B = (M_1^B, M_2^B, \dots, M_P^B)$ of lengths $N$ and $P$ respectively, such that $T^A$ appears first, and where $M_1^A$ denotes the first segmentation mask of the tracklet $T^A$. We use a pre-trained STM network [24] to store two binary masks per tracklet as references (with their corresponding frames): the masks closest to the temporal gap ($M_N^A$ for $T^A$ and $M_1^B$ for $T^B$) and a second mask a little farther ($M_{N-n-1}^A$ for $T^A$ and $M_n^B$ for $T^B$). The farther masks are used because the first and last object masks of a tracklet are often incomplete due, for example, to occlusions. Then, the reference frames of each tracklet are used as queries to the memory filled with the references of the other tracklet, producing heatmaps with continuous values between 0 and 1 ($H_N^A$, $H_{N-n-1}^A$, $H_1^B$, $H_n^B$). Finally, the average cosine similarity between these four heatmaps and the four masks ($M_N^A$, $M_{N-n-1}^A$, $M_1^B$, $M_n^B$) is the final similarity between the two tracklets, denoted $sim(T^A, T^B)$. Figure 2 illustrates a one-frame version of this similarity measure between tracklets.

Figure 2. Similarity used at the long-term association step. For simplicity, only one mask and frame are used as reference and as target in the space-time memory network. (In the illustrated example, the two cosine similarities are 0.81 and 0.87, for an average of 0.84.)
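A minimal sketch of the similarity of section 3.3.1 is given below. It assumes a hypothetical `stm(references, frame)` wrapper around a pre-trained STM network [24] that returns the soft, pre-binarization heatmap for a query frame given (frame, mask) references stored in memory; the exact reference offsets (the paper uses $M_{N-n-1}^A$ and $M_n^B$ with $n = 5$, falling back to $n = 2$) are approximated here by a single symmetric offset.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two flattened heatmaps/masks."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom > 0 else 0.0

def reference_pairs(track, n=5):
    """The mask facing the temporal gap plus one `n` masks farther,
    falling back to 2, then to the other end, for short tracklets.
    `track` is a list of (frame, mask) ordered so index -1 faces the gap."""
    for offset in (n, 2):
        if len(track) > offset:
            return [track[-1], track[-1 - offset]]
    return [track[-1], track[0]]

def tracklet_similarity(stm, track_a, track_b, n=5):
    """sim(T^A, T^B) of Sec. 3.3.1: store one tracklet's references in the
    STM memory, query it with the other tracklet's reference frames, and
    compare the resulting heatmaps against that tracklet's masks."""
    refs_a = reference_pairs(track_a, n)                   # T^A ends at the gap
    refs_b = reference_pairs(list(reversed(track_b)), n)   # T^B starts there
    sims = []
    for memory, queries in ((refs_a, refs_b), (refs_b, refs_a)):
        for frame, mask in queries:
            heat = stm(memory, frame)        # propagate into the query frame
            sims.append(cosine(heat, mask))  # heatmap vs. reference mask
    return float(np.mean(sims))              # average of the 4 cosine values
```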

3.3.2 Selection of pairs of tracklets

Instead of estimating a similarity measure between all pairs of tracklets, a selection is made to reduce the computational cost [43]. The selection is based on the following heuristic: two tracklets may belong to the same object if they belong to the same object class, are temporally close, spatially close and have a small temporal overlap.

In detail, let us denote $f(M)$ the frame where the mask $M$ is present, $\bar{M}$ its center, and $fps$, $H$ and $W$ respectively the number of frames per second, the height and the width of the video. The temporal ($C_t(T^A, T^B)$), spatial ($C_s(T^A, T^B)$) and temporal overlap ($C_o(T^A, T^B)$) costs between $T^A$ and $T^B$ are defined respectively as:

$$C_t(T^A, T^B) = \frac{|f(M_N^A) - f(M_1^B)|}{fps}, \tag{1}$$

$$C_s(T^A, T^B) = \frac{2}{H + W} \times \left\| \bar{M}_N^A - \bar{M}_1^B \right\|_1, \tag{2}$$

$$C_o(T^A, T^B) = \left| \{ f(M)\ \forall M \in T^A \} \cap \{ f(M)\ \forall M \in T^B \} \right|. \tag{3}$$

A pair $(T^A, T^B)$ is admissible if the tracklets belong to the same object class, $C_t(T^A, T^B) \le \tau_t$, $C_s(T^A, T^B) \le \tau_s$ and $C_o(T^A, T^B) \le \tau_o$.
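A direct transcription of these constraints is sketched below, assuming tracklets are stored as simple dictionaries with sorted frame indices, binary masks and a class label (an illustrative data structure, not the paper's).

```python
import numpy as np

TAU_T, TAU_S, TAU_O = 1.5, 0.2, 1  # thresholds from Sec. 4.1

def center(mask):
    """Center (x, y) of a boolean mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def admissible(track_a, track_b, fps, height, width):
    """Check Eqs. (1)-(3): same class, temporally close, spatially close
    and with a small temporal overlap (T^A is assumed to appear first)."""
    if track_a["cls"] != track_b["cls"]:
        return False
    # Eq. (1): gap, in seconds, between the end of T^A and the start of T^B.
    c_t = abs(track_a["frames"][-1] - track_b["frames"][0]) / fps
    # Eq. (2): L1 distance between mask centers, normalized by image size.
    diff = center(track_a["masks"][-1]) - center(track_b["masks"][0])
    c_s = 2.0 / (height + width) * np.abs(diff).sum()
    # Eq. (3): number of frames where both tracklets coexist.
    c_o = len(set(track_a["frames"]) & set(track_b["frames"]))
    return c_t <= TAU_T and c_s <= TAU_S and c_o <= TAU_O
```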
3.3.3 Greedy association

Similarly to Singh et al. [29], we gradually merge the admissible pair with the highest cosine similarity $sim(T^A, T^B)$, as long as it is above a threshold $\theta_l$, while continuously updating the admissible pairs using Equation 3. A tracklet can therefore be repeatedly merged with other tracklets. Finally, tracks having their highest detection score lower than 90% are deleted.
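A sketch of this greedy loop follows; `similarity` and `admissible` stand for the functions of sections 3.3.1 and 3.3.2, and the quadratic re-enumeration of pairs is kept for clarity rather than speed (a real implementation would cache similarities).

```python
def greedy_long_term_association(tracklets, similarity, admissible,
                                 theta_l=0.30):
    """Greedy merging of Sec. 3.3.3: repeatedly fuse the admissible pair
    with the highest similarity until no pair exceeds θl."""
    def merge(a, b):
        # Concatenation keeps temporal order since `a` appears first.
        return {"frames": a["frames"] + b["frames"],
                "masks": a["masks"] + b["masks"],
                "cls": a["cls"]}

    while True:
        pairs = [(similarity(a, b), i, j)
                 for i, a in enumerate(tracklets)
                 for j, b in enumerate(tracklets)
                 if i != j
                 and a["frames"][0] <= b["frames"][0]  # T^A appears first
                 and admissible(a, b)]
        if not pairs:
            return tracklets
        best, i, j = max(pairs)
        if best < theta_l:
            return tracklets
        merged = merge(tracklets[i], tracklets[j])
        tracklets = [t for k, t in enumerate(tracklets) if k not in (i, j)]
        tracklets.append(merged)  # a merged tracklet can be merged again
```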
4. Experiments

4.1. Implementation details

At the detection step, $\theta_d$ is 0.5 and small masks whose area is less than $\theta_a = 128$ pixels are removed. For the LTA step, the selection is done with $(\tau_t, \tau_s, \tau_o) = (1.5, 0.2, 1)$. To measure the similarity, the second reference frame is picked using $n = 5$; if that frame is not available, $n = 2$ is used instead. As for the thresholds at the STA and LTA steps, we selected $\theta_s = 0.15$ and $\theta_l = 0.30$. These hyper-parameters were selected through cross-validation and remain fixed regardless of the dataset and object classes.

4.2. Datasets and performance evaluations

We evaluated our method on KITTIMOTS and MOTSChallenge [37], two common MOTS datasets. KITTIMOTS contains 21 training videos and 29 test videos with cars and pedestrians, and MOTSChallenge 4 training videos and 4 test videos with pedestrians only.

Recently, the HOTA metric was introduced to fairly balance the quality of detections and associations. It can be decomposed into the DetA and AssA metrics to measure the quality of these two components. The higher the HOTA, the more the tracker is aligned with the human visual assessment. Contrary to MOTSChallenge, which is primarily evaluated with the sMOTSA and MOTSA measures, KITTIMOTS has adopted this new metric. For a matter of standardization, when evaluating on the training set, we also report HOTA measures for MOTSChallenge.

4.3. Results

Results¹ in Tables 1 and 2 indicate that our method is competitive on the association performance. We ranked 3rd on this criterion on KITTIMOTS, and our number of identity switches is the lowest on MOTSChallenge. We point out that our method does not require any additional fine-tuning on the benchmarks. Note also that the methods in the tables do not all use the same detection inputs. The ablation studies in the next section give a better understanding of our contribution by fixing the detection inputs to assess only the data association component.

Method                    Cars                    Pedestrians
                          HOTA   DetA   AssA      HOTA   DetA   AssA
UW IPL ETRI AIRL [46]     79.6   79.7   80.0      64.4   66.7   63.8
ReID MOT                  78.3   79.3   77.9      65.6   67.6   65.0
ViP-DeepLab [26]          76.4   82.7   70.9      64.3   70.7   59.5
EagerMOT [13]             74.7   76.1   73.8      57.7   60.3   56.2
MOTSFusion [18]           73.6   75.4   72.4      54.0   60.8   49.5
ReMOTS [43]               71.6   78.3   66.0      58.8   68.0   52.4
PointTrack [42]           62.0   79.4   48.8      54.4   62.3   48.1
MeNToS (ours)             75.8   77.1   74.9      65.4   68.7   63.5

Table 1. Results on the test set of KITTIMOTS. Bold red and italic blue indicate respectively the first and second best methods. Contains unpublished results.

Method           sMOTSA   IDF1   MOTSA   FP     FN      IDs
ReMOTS [43]      70.4     75.0   84.4    819    3999    231
COSTA_st         69.5     70.3   83.3    806    4176    421
GMPHD [30]       69.4     66.4   83.3    935    3985    484
SORTS [1]        55.0     57.3   68.3    1076   8598    552
TrackRCNN [37]   40.6     42.4   55.2    1261   12641   567
MeNToS (ours)    64.8     73.2   76.9    654    6704    110

Table 2. Results on the test set of MOTSChallenge. Bold red and italic blue indicate respectively the first and second best methods. Contains unpublished results.

Some qualitative results are provided in Figure 3. We notice that our method can retrieve some objects even after some occlusions, like the green, yellow and purple cars in the first sequence and the blue pedestrian in the second sequence. The main limitations are contaminated segmentation masks and short tracklets, since they provide little information and their appearances are unstable. For example, pedestrians usually walk in groups, provoking more occlusions and eventually lower-quality masks, as in the third and fourth sequences. Also, a large spatial displacement of the centers of the masks may happen in the case of an occlusion, like in the fifth sequence. In that particular case, the spatial distance between the centers of the two masks (the yellow one in the first frame and the green one in the second one) is higher than the threshold used for the constraint of Equation 2.

Figure 3. Qualitative results on KITTIMOTS and MOTSChallenge. Each row corresponds to a subsequence of a video clip.

¹ Full results at http://www.cvlibs.net/datasets/kitti/eval_mots.php and https://motchallenge.net/results/MOTS/

5. Ablation studies

5.1. Contribution of each step

To examine the effects of each step of our method, we evaluate the performance by successively adding one component at a time.

Step       KT-car HOTA   KT-ped HOTA   MOTSChallenge HOTA   MOTSChallenge sMOTSA
STA        79.4          62.0          57.9                 62.8
+ filter   79.8 (+0.4)   63.3 (+1.3)   58.2 (+0.3)          64.0 (+1.2)
+ LTA      84.1 (+4.3)   66.2 (+2.9)   64.5 (+6.3)          64.9 (+0.9)
+ filter   84.1 (+0.0)   67.6 (+1.4)   65.4 (+0.9)          67.7 (+2.8)

Table 3. Ablation studies on the validation set of KITTIMOTS (KT) and the train set of MOTSChallenge. Each step of our approach leads to an improvement in terms of HOTA and sMOTSA.

Results of Table 3 indicate that all components improve the tracking performance (either in terms of HOTA or sMOTSA). More precisely, on KITTIMOTS and MOTSChallenge, the HOTA metric is highly sensitive to the quality of the association: starting from the tracking results obtained at the STA step, on average, two-thirds of the total improvement in HOTA is made at the LTA step. As for sMOTSA, unsurprisingly, it lowers the effect of the long-term association. Consequently, improving the LTA step leads to a boost in terms of HOTA; this improvement in the data association is totally missed by the previous sMOTSA measure.

5.2. Upper bound with oracle

Since the HOTA gives more importance to the data association part of MOTS compared to the sMOTSA, we also conducted some experiments to measure the limits of our method and estimate the steps which can lead to the biggest gain in HOTA. Using the ground truth annotations of KITTIMOTS, we build three methods that inject, step by step, some ground truth knowledge.

Oracle at the STA  The oracle at the STA step is obtained with perfect short-term assignments based on the detections obtained by the multiclass object detector. This means that repeated false positives are also linked together.

Precisely, an initial identity is provided for all masks at the first frame. Then, for each mask at time t corresponding to a ground truth object (mIoU over 50%), we look for the mask that corresponds to the same ground truth object at time t + 1.


If it is present, we attribute the same identity. Afterwards, for masks that do not correspond to a ground truth mask, we look for a mask at time t + 1 that sufficiently overlaps with the optical flow-warped mask at time t. If this mask does not correspond to a ground truth object, the same identity is attributed. Finally, a new identity is provided for non-associated masks at time t + 1. The next steps follow those described in section 3.

Oracle at the LTA  The oracle at the LTA step is obtained with perfect long-term assignments of the tracklets, obtained with our method, into tracks.

Precisely, after associating masks using the optical flow with RAFT and filtering short tracklets, we compute a ground truth identity for each tracklet as the most frequent one among its detections. Then, tracklets sharing the same ground truth identity are given the same identity. This oracle ignores the step of the selection of candidate pairs. Similarly to section 3, tracklets with low scores are removed.

Full oracle  The full oracle is obtained with perfect short and long-term assignments of the detections that match the ground truth, and a perfect deletion of false positives.

Precisely, this is the highest upper bound using the ground truth knowledge from the detection step. For all segmentation masks, if they correspond to a ground truth mask with a mIoU higher than 50%, they are kept and the ground truth identity is attributed; otherwise they are removed.
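A per-frame sketch of this full oracle could look as follows; `mask_iou` is the same hypothetical helper as in the STA sketch of section 3.2.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks (same helper as in the STA sketch)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def full_oracle_frame(det_masks, gt_masks, gt_ids, iou_threshold=0.5):
    """One frame of the full oracle (Sec. 5.2): keep a detected mask only
    if it matches a ground-truth mask (mIoU > 50%) and inherit that
    object's identity; other detections are dropped as false positives."""
    kept = []
    for mask in det_masks:
        ious = [mask_iou(mask, gt) for gt in gt_masks]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] > iou_threshold:
            kept.append((gt_ids[best], mask))
    return kept
```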
Results  We evaluated the HOTA of these three oracle upper bounds on the validation set of KITTIMOTS and compared them with our method. Results from Table 4 indicate that there is still room for improvement given the same detections: potential gains of 2.7 and 5.1 points are still possible on the car and pedestrian classes of KITTIMOTS. Moreover, the first step of short-term association with RAFT is nearly perfect: the possible gain there does not exceed 0.5 point. On the contrary, the step of long-term association is the one where a substantial gain is available: 67% and 53% of the total gain for cars and pedestrians respectively is located at this step. We hypothesize that the difference between these classes comes from the fact that there are more distractors for pedestrians than for cars. Indeed, the model fully using the ground truth annotations is able to eliminate all distractors.

Method          KT-car        KT-ped
MeNToS          84.1          67.6
Oracle at STA   84.5 (+0.4)   68.1 (+0.5)
Oracle at LTA   85.9 (+1.8)   70.3 (+2.7)
Full oracle     86.8 (+2.7)   72.7 (+5.1)

Table 4. Oracle methods on the validation set of KITTIMOTS (KT). Given the same detections, the largest future gain is at the long-term association step.

5.3. Comparison with other strategies of long-term association

In the proposed method, we use a STM network to compute a similarity measure between tracklets. Other methods traditionally used re-identification features [43] or color histograms [41]. Moreover, they also tend to use more reference frames, whereas ours only takes two frames into account. In order to measure the contribution of the STM network combined with a cosine similarity, we replace them by one of the following methods:

1. RGB 2x2: for two tracklets, we extracted the color histograms of the same two reference frames, computed at the pixel level. The similarity measure is then the average of four Bhattacharyya coefficients;

2. RGB NxP: for two tracklets, we extracted the color histograms of all reference frames, computed on the mask. The similarity measure is then the average Bhattacharyya coefficient over all possible combinations of histograms between the two tracklets;

3. ReID 2x2: for two tracklets, we computed the ReID features of the same two reference frames. The similarity measure is then the average of four cosine similarities;

4. ReID NxP: for two tracklets, we computed the ReID features of all reference frames. The similarity measure is then the average cosine similarity over all possible combinations of ReID features between the two tracklets.

The ReID features are computed with the OsNet-AIN [47] network. This is motivated by some works [22, 40, 27] which indicate that ReID features are suitable for associating far apart detections and that color histograms are efficient especially for small objects.
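As an illustration of the histogram-based baselines, the sketch below computes L1-normalized joint RGB histograms and averages Bhattacharyya coefficients over reference combinations; the bin count and the full-frame versus mask support are assumptions, since the paper does not specify them.

```python
import numpy as np

def color_histogram(image, mask=None, bins=8):
    """Joint RGB histogram of shape (bins**3,), either over the whole
    frame (RGB 2x2) or over the mask pixels (RGB NxP), L1-normalized so
    that Bhattacharyya coefficients lie in [0, 1]."""
    pixels = image.reshape(-1, 3) if mask is None else image[mask]
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / hist.sum() if hist.sum() > 0 else hist

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms."""
    return float(np.sum(np.sqrt(p * q)))

def rgb_similarity(hists_a, hists_b):
    """Average Bhattacharyya coefficient over all reference combinations,
    as in the RGB 2x2 / RGB NxP baselines of Sec. 5.3."""
    return float(np.mean([bhattacharyya(p, q)
                          for p in hists_a for q in hists_b]))
```

The ReID variants follow the same pattern, with OsNet-AIN embeddings replacing the histograms and the cosine similarity replacing the Bhattacharyya coefficient.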
Figure 4 illustrates the behavior of these four techniques for LTA, alongside the one based on the STM network, in terms of HOTA with regard to the similarity threshold $\theta_l$. First, it indicates that for the color histogram-based and ReID-based methods, using all frames of a tracklet as references leads to a higher HOTA. Second, the four additional techniques are sensitive to the parameter $\theta_l$: the interval where the HOTA is the highest is narrow. Third, on the contrary, our technique based on the STM network is less sensitive to the parameter $\theta_l$: the HOTA is high on the interval [0.1; 0.5].

However, our STM-based association struggles when different parts of the same object are present in the tracklets. For example, if one tracklet displays only the hood of a car and another only the car trunk, even if they belong to the same object, our STM-based approach could not associate such tracklets since no particular parts are in common. As for ReID features, they are trained to overlook such events, but in the case of heavy occlusion, the features would probably be misaligned.

To address this issue of misalignment, Wang et al. [38] used an alignment based on a linear transformation. As opposed to their approach, we use a STM network to propagate masks and perform the propagation at the pixel level and on the whole image, for better precision when matching tracklets. Working on the entire image is advantageous when a detection is missing. Moreover, this is the closest form of use of a STM network within the framework of OSVOS.

5.4. Number of frames of reference

From the results of Figure 4, it seems that taking more reference frames into account leads to better results. Is this trend still relevant for our technique based on a STM network? Since it is not technically possible with our GPU to load all frames in memory and compute the attention, we compared the following approaches:

1. Frame 1: only the closest frames ($f(M_N^A)$ and $f(M_1^B)$ with the notation used in section 3.3.1) are used as references, as illustrated in Figure 2;

2. Frames 1&2: the two closest frames ($f(M_N^A)$, $f(M_{N-1}^A)$ and $f(M_1^B)$, $f(M_2^B)$) are used as references;

3. Frames 1&5|2: this is the technique currently used in our approach. The closest frames are used, plus $f(M_{N-4}^A)$ (respectively $f(M_5^B)$) if it exists for $T^A$ (respectively $T^B$), otherwise $f(M_{N-1}^A)$ (respectively $f(M_2^B)$);

4. Frames 1&2&(5): the two closest frames are used, plus $f(M_{N-4}^A)$ (respectively $f(M_5^B)$) if it exists for $T^A$ (respectively $T^B$).
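These variants can be made concrete with a small helper; the indexing convention here (index -1 facing the temporal gap, as in the earlier similarity sketch) is an assumption for illustration, and tracklets are assumed to have at least two masks.

```python
def select_references(track, scheme="1&5|2"):
    """Pick reference indices from a tracklet for the STM memory,
    mirroring the variants compared in Sec. 5.4."""
    last = len(track) - 1
    if scheme == "1":
        return [last]                          # closest frame only
    if scheme == "1&2":
        return [last, last - 1]                # two closest frames
    if scheme == "1&5|2":
        fifth = last - 4                       # the 5th closest frame
        return [last, fifth if fifth >= 0 else last - 1]
    if scheme == "1&2&(5)":
        refs = [last, last - 1]
        if last - 4 >= 0:
            refs.append(last - 4)              # add the 5th closest if it exists
        return refs
    raise ValueError(f"unknown scheme: {scheme}")
```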

Figure 4. Comparison of some strategies of long-term association (STM-based, RGB_2x2, ReID_2x2, ReID_NxP, RGB_NxP) for KITTIMOTS-car, KITTIMOTS-pedestrian and MOTSChallenge: HOTA as a function of the threshold θl.

Figure 5. Ablation studies on the selection of the reference frames in the STM-based technique on the validation set of KITTIMOTS: HOTA as a function of the threshold θl for the variants Frame 1, Frames 1&2, Frames 1&5|2 and Frames 1&2&(5).

Figure 5 illustrates the performance in terms of HOTA for different selections of reference frames in the STM with regard to the similarity threshold θl. On KITTIMOTS, using only the closest frames as references leads to the lowest HOTA. For pedestrians, using at least two reference frames provides a small improvement; with more frames, the performance seems to saturate. We hypothesize that this is due to masks contaminated by other pedestrians, as people tend to walk in groups. As for cars, replacing the second closest frame by its fifth equivalent contributes to a boost in terms of HOTA, and considering three frames seems to saturate the HOTA. We hypothesize that using the fifth frame as a reference is a better choice than the second one because it is less similar to the first frame: more uncorrelated information is then available to the STM. These results are similar to the conclusion drawn by Lai et al. [16] on OSVOS: their Memory-Augmented Tracker used both short- and long-term memory to recover objects.

5.5. Why is there no propagation, since the STM is a propagation-based technique?

We tried to include a propagation step after the LTA one. Contrary to OSVOS or VIS, where there are few objects of the same class, in MOTS the number of same-class instances can reach a dozen, making propagation challenging. Indeed, same-class objects share similar parts that provoke some erroneous propagation, especially when an object already exists in the scene and another, undetected one appears. Moreover, to fill in the gap between two tracklets, the information for propagation at the closest extremities of the tracklets comes from frames where the detector was not able to detect any object (or where the detection was rejected due to a low confidence score, a small size or an absence of consecutive associations during the STA step). Hence, propagating a mask in such a situation is by nature a complicated task.

6. Conclusion

We proposed a method to solve MOTS using a STM network originally developed for solving OSVOS. It is mainly based on a hierarchical association where masks are first associated in consecutive frames to form tracklets. Then, these tracklets are associated using a STM network, leveraging the ability of the network to match similar parts. Experiments on KITTIMOTS and MOTSChallenge show that our approach gives good performance with regard to the association quality. We demonstrated that our approach for long-term association using a STM network is better than the recent approach based on ReID networks. Nevertheless, extensive analyses show that the long-term association can still be improved given the same initial detections.

References

[1] Martin Ahrnbom, Mikael Nilsson, and Håkan Ardö. Real-time and online segmentation multi-target tracking with track revival re-identification. In International Conference on Computer Vision Theory and Applications (VISAPP), 2021.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.
[3] Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-Shot Video Object Segmentation. In CVPR, 2017.
[4] Sergi Caelles, Alberto Montes, Kevis-Kokitsi Maninis, Yuhua Chen, Luc Van Gool, Federico Perazzi, and Jordi Pont-Tuset. The 2018 DAVIS Challenge on Video Object Segmentation. arXiv:1803.00557, Mar. 2018.
[5] Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. The 2019 DAVIS Challenge on VOS: Unsupervised Multi-Object Segmentation. arXiv:1905.00737, May 2019.
[6] Patrick Dendorfer, Aljosa Osep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, and Laura Leal-Taixé. MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking. International Journal of Computer Vision (IJCV), 129(4):845–881, Apr. 2021.
[7] Ting Fu, Weichao Hu, Luis Miranda-Moreno, and Nicolas Saunier. Investigating secondary pedestrian-vehicle interactions at non-signalized intersections using vision-based trajectory data. Transportation Research Part C: Emerging Technologies, 105:222–240, Aug. 2019.
[8] Shubhika Garg and Vidit Goel. Mask Selection and Propagation for Unsupervised Video Object Segmentation. In WACV, 2021.
[9] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, and Barret Zoph. Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation. In CVPR, 2021.
[10] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR, 2020.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[12] Allan Jabri, Andrew Owens, and Alexei A. Efros. Space-Time Correspondence as a Contrastive Random Walk. In NeurIPS, 2020.
[13] Aleksandr Kim, Aljoša Ošep, and Laura Leal-Taixé. EagerMOT: Real-time 3D multi-object tracking and segmentation via sensor fusion. In CVPR - Workshops, 2020.
[14] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955.
[15] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016.
[16] Zihang Lai, Erika Lu, and Weidi Xie. MAST: A Memory-Augmented Self-supervised Tracker. In CVPR, 2020.
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[18] J. Luiten, T. Fischer, and B. Leibe. Track to Reconstruct and Reconstruct to Track. IEEE Robotics and Automation Letters, 5(2):1803–1810, Apr. 2020.
[19] Jonathon Luiten, Arne Hoffhues, Blin Beqa, Paul Voigtlaender, István Sárándi, Patrick Dendorfer, Aljosa Osep, Achal Dave, Tarasha Khurana, Tobias Fischer, Xia Li, Yuchen Fan, Pavel Tokmakov, Song Bai, Linjie Yang, Federico Perazzi, Ning Xu, Alex Bewley, Jack Valmadre, Sergi Caelles, Jordi Pont-Tuset, Xinggang Wang, Andreas Geiger, Fisher Yu, Deva Ramanan, Laura Leal-Taixé, and Bastian Leibe. RobMOTS: A Benchmark and Simple Baselines for Robust Multi-Object Tracking and Segmentation. In CVPR RVSU Workshop, 2021.
[20] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. International Journal of Computer Vision (IJCV), Oct. 2020.
[21] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. PReMVOS: Proposal-Generation, Refinement and Merging for Video Object Segmentation. In ACCV, 2018.
[22] Mehdi Miah, Justine Pepin, Nicolas Saunier, and Guillaume-Alexandre Bilodeau. An Empirical Analysis of Visual Features for Multiple Object Tracking in Urban Scenes. In International Conference on Pattern Recognition (ICPR), 2021.
[23] Seoung Wug Oh, Joon-Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. Fast Video Object Segmentation by Reference-Guided Mask Propagation. In CVPR, 2018.
[24] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video Object Segmentation Using Space-Time Memory Networks. In ICCV, 2019.
[25] Federico Perazzi, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. Learning Video Object Segmentation From Static Images. In CVPR, 2017.
[26] Siyuan Qiao, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation. In CVPR, 2021.
[27] Ergys Ristani and Carlo Tomasi. Features for Multi-Target Multi-Camera Tracking and Re-Identification. In CVPR, 2018.
[28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, Dec. 2015.
[29] Gurinderbeer Singh, Sreeraman Rajan, and Shikharesh Majumdar. A Greedy Data Association Technique for Multiple Object Tracking. In International Conference on Multimedia Big Data (BigMM), 2017.
[30] Young-min Song and Moongu Jeon. Online Multi-Object Tracking and Segmentation with GMPHD Filter and Simple Affinity Fusion. In CVPR - Workshops, 2020.
[31] Rainer Stiefelhagen, Keni Bernardin, Rachel Bowers, John Garofolo, Djamel Mostefa, and Padmanabhan Soundararajan. The CLEAR 2006 Evaluation. In Multimodal Technologies for Perception of Humans, Lecture Notes in Computer Science, pages 1–44, Berlin, Heidelberg, 2007. Springer.
[32] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-To-End Memory Networks. In NIPS, 2015.
[33] Chufeng Tang, Hang Chen, Xiao Li, Jianmin Li, Zhaoxiang Zhang, and Xiaolin Hu. Look Closer To Segment Better: Boundary Patch Refinement for Instance Segmentation. In CVPR, 2021.
[34] Zachary Teed and Jia Deng. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In ECCV, 2020.
[35] Jack Valmadre, Alex Bewley, Jonathan Huang, Chen Sun, Cristian Sminchisescu, and Cordelia Schmid. Local Metrics for Multi-Object Tracking. arXiv:2104.02631, Apr. 2021.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In NIPS, 2017.
[37] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. MOTS: Multi-Object Tracking and Segmentation. In CVPR, 2019.
[38] Zhongdao Wang, Hengshuang Zhao, Ya-Li Li, Shengjin Wang, Philip H. S. Torr, and Luca Bertinetto. Do Different Tracking Tasks Require Different Appearance Models? arXiv:2107.02156, July 2021.
[39] Dongxu Wei, Jiashen Hua, Hualiang Wang, Baisheng Lai, Kejie Huang, Chang Zhou, Jianqiang Huang, and Xiansheng Hua. RobTrack: A Robust Tracker Baseline towards Real-World Robustness in Multi-Object Tracking and Segmentation. In CVPR RVSU Workshop, 2021.
[40] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In IEEE International Conference on Image Processing (ICIP), 2017.
[41] Junliang Xing, Haizhou Ai, and Shihong Lao. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In CVPR, pages 1200–1207, 2009.
[42] Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Huan Huang, Shilei Wen, Errui Ding, and Liusheng Huang. Segment as Points for Efficient Online Multi-Object Tracking and Segmentation. In ECCV, 2020.
[43] Fan Yang, Xin Chang, Chenyu Dang, Ziqiang Zheng, Sakriani Sakti, Satoshi Nakamura, and Yang Wu. ReMOTS: Self-Supervised Refining Multi-Object Tracking and Segmentation. In CVPR - Workshops, 2020.
[44] Linjie Yang, Yanran Wang, Xuehan Xiong, Jianchao Yang, and Aggelos K. Katsaggelos. Efficient Video Object Segmentation via Network Modulation. In CVPR, 2018.
[45] Sohail Zangenehpour, Jillian Strauss, Luis F. Miranda-Moreno, and Nicolas Saunier. Are signalized intersections with cycle tracks safer? A case–control study based on automated surrogate safety analysis using video data. Accident Analysis & Prevention, 86:161–172, Jan. 2016.
[46] Haotian Zhang, Yizhou Wang, Jiarui Cai, Hung-Min Hsu, Haorui Ji, and Jenq-Neng Hwang. LIFTS: Lidar and monocular image fusion for multi-object tracking and segmentation. In CVPR - Workshops, 2020.
[47] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. Omni-Scale Feature Learning for Person Re-Identification. In ICCV, 2019.
