Multi-Object Tracking and Segmentation with a Space-Time Memory Network
Mehdi Miah, Guillaume-Alexandre Bilodeau and Nicolas Saunier
Polytechnique Montréal
{mehdi.miah, gabilodeau, nicolas.saunier}@polymtl.ca

arXiv:2110.11284v1 [cs.CV] 21 Oct 2021

Abstract

We propose a method for multi-object tracking and segmentation that does not require fine-tuning or per-benchmark hyper-parameter selection. The proposed tracker, MeNToS, particularly addresses the data association problem. Indeed, the recently introduced HOTA metric, which aligns better with human visual assessment by evenly balancing detection and association quality, has shown that improvements are still needed in data association. After creating tracklets using instance segmentation and optical flow, the proposed method relies on a space-time memory network developed for one-shot video object segmentation to improve the association of tracklets with temporal gaps. We evaluated our tracker on KITTIMOTS and MOTSChallenge and show the benefit of our data association strategy with the HOTA metric. The project page is www.mehdimiah.com/mentos+.

1. Introduction

Analyzing road traffic videos is an effective way to develop and evaluate urban policies and traffic management methods [45, 7]. For example, it enables the accurate detection of hazardous behaviors such as exceeding the speed limit, overtaking cyclists at high speed or committing traffic violations. Detecting such events is possible through video data analysis: given a video, the objective is to detect the road users, track them and compute key indicators, such as speed and trajectories. This paper focuses on the first two aspects, with an emphasis on the data association component of tracking, which aims to assign unique identities to all the objects in a video. Keeping the same identity for an object after an occlusion is very important, especially for safety purposes. Indeed, occlusions happen when two objects, for example a car and a pedestrian, are close, leading to a situation of possible collision. Not attributing the same identity to a car before and after an occlusion makes the recognition of hazardous events and their analysis harder.

Until recently, multi-object tracking and segmentation (MOTS) [37] was generally evaluated with measures heavily biased against the association step [20, 35]. Luiten et al. [20] proved that some metrics, such as MOTA [31] and its variants, largely ignore the association quality. Hence, they introduced the HOTA metric, which not only balances the detection and association steps but also aligns better with human visual assessment. Therefore, evaluating MOTS with the HOTA metric should lead to methods that are more robust and that could, in turn, improve the quality of the analysis of traffic videos.

In addition to MOTS, some researchers consider another tracking task: one-shot video object segmentation (OSVOS) [4, 3, 5], where the segmentation masks of all objects of interest are provided at the first frame. Even if it is similar to MOTS, where the objective is to detect and track the objects belonging to a predefined set of classes (usually pedestrians, cyclists and cars), the two tasks are evaluated differently and their standard solving strategies vary significantly. OSVOS is mainly addressed with propagation methods [25, 44, 23] and cannot rely on detections since the classes of the objects of interest are unknown, whereas MOTS is heavily detection-oriented since only the classes are known. Hence, MOTS is regarded as an association problem.

Our proposed tracker, named MeNToS (Memory Network-based Tracker of Segments), uses a propagation method originally developed for OSVOS to solve the association problem in MOTS. We solve the association problem hierarchically, as in [43, 46]. First, given instance segmentation masks, we associate masks between consecutive frames. Such masks are spatially close and visually similar, so a method based on low-level information (that is, colors) may be sufficient. That is why the short-term association is based on optical flow. After this first association, we obtain continuous tracklets. Second, we use a space-time memory (STM) network [24], originally developed to solve OSVOS, to associate the tracklets. An STM network can be viewed as a spatio-temporal attention mechanism [36, 2] able to find a correspondence between a target and a set of queries. As opposed to OSVOS, where memory networks [15, 32] are used to propagate masks to the next frames, here we use them to link tracklets that are temporally apart.

The main contributions of our work are as follows:

• We propose MeNToS, a method to solve MOTS based on a space-time memory network with a new similarity measure between tracklets;

• We evaluate our method on KITTIMOTS and MOTSChallenge to show that it is competitive, especially on the association part. We demonstrate that our approach for long-term association using an STM network is better than a recent approach based on re-identification networks;

• We show the usefulness of the HOTA metric for MOTS to capture the improvement resulting from improved data association.

2. Related works

MOTS. Similarly to multi-object tracking, where the “tracking-by-detection” paradigm is popular [6], MOTS is mainly solved by creating tracklets from segmentation masks and then building long-term tracks by merging the tracklets [43, 46, 18]. Usually, methods use an instance segmentation method to generate binary masks; ReMOTS [43] used two advanced instance segmentation methods and self-supervision to refine masks. As for the association step, many methods require a re-identification (ReID) step. For example, Voigtlaender et al. [37] extended Mask R-CNN with an association head to return an embedding for each detection; Yang et al. [43] associated two tracklets if they were temporally close, without any temporal overlap, and with similar appearance features based on all their observations and a hierarchical clustering; while Zhang et al. [46] used temporal attention to lower the weight of frames with occluded objects. More recently, Wei et al. [39] proposed to solve MOTS by improving the quality of detected masks with massive instance augmentation during training [9] and mask refinement during inference [33], then associating detections with an ensemble method.

OSVOS. Closely related to MOTS, OSVOS requires tracking objects whose segmentation masks are only provided at the first frame. OSVOS is mainly solved by propagation-based methods: a model learns a representation of the initial mask and tries to find correspondences in the next frames. MAST [16] used a memory component to predict the future location of objects. STM [24] was proposed to solve OSVOS by storing some previous frames and masks in a memory that is later read by an attention mechanism to predict the new mask in a target image. Such a network was recently used [8] to solve video instance segmentation (VIS), a problem in which no prior knowledge is given about the objects to track. However, it is unclear how an STM network behaves when multiple instances of the same class appear in a video. We show in this work that an STM network performs well and can help to solve a ReID problem by taking advantage of information at the pixel level and of the presence of other objects.

Bridging the gap between MOTS and OSVOS. Despite some clear similarities between the two tasks (prediction of the future position of an object at the pixel level), there are some differences, such as the evaluation (the sMOTSA and HOTA measures evaluate the quality of detection and association, versus J&F, which evaluates mask quality with a region similarity and a contour accuracy) and the specification at the first frame (predefined classes without any initial mask versus class-agnostic with an initial mask). Very recently, Wang et al. [38] addressed this issue by considering a common shared appearance model to solve several tracking problems, including MOTS and OSVOS. They compared the performance of several pre-trained vision models, such as an ImageNet-pretrained [28] architecture, MoCo [10] or CRW [12], to extract visual features. The authors then used these representations to either propagate instances as in OSVOS or associate them as in MOTS. Moreover, for the association step, they computed a similarity score between tracklets from an attention perspective. Their method reconstructs, at the patch level, all features of existing tracklets and current detections; the similarity is then the average cosine similarity between the forward and backward reconstructions.

3. Proposed method

As illustrated in Figure 1, our pipeline for tracking multiple objects is based on three main steps: detection of all objects with classes in the COCO dataset [17], a short-term association of segmentation masks in consecutive frames, and a greedy long-term association of tracklets using a memory network.

3.1. Detections

Our method follows the “tracking-by-detection” paradigm. First, we use the non-overlapping detections provided by the RobMOTS challenge [19], obtained from a Mask R-CNN X-152 [11] and the Box2Seg Network [21] for all 80 categories of COCO. Objects with a detection score higher than θd and an area bigger than θa are kept.

3.2. Short-term association (STA)

During the short-term association, we associate temporally close segmentation masks in consecutive frames by computing the optical flow with RAFT [34]. Masks from the previous frame are warped, and a mask IoU (mIoU) is computed between these warped masks and the masks from the next frame.
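As a concrete illustration of this short-term association, the sketch below matches flow-warped masks to next-frame masks with the Hungarian algorithm on a negative mask-IoU cost. It is a minimal sketch, not the authors' implementation: the helper names `mask_iou` and `short_term_associate` are hypothetical, masks are assumed to be boolean NumPy arrays, and the optical-flow warping is assumed to have been done upstream (e.g., by RAFT).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a, b):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def short_term_associate(warped_masks, next_masks, theta_s=0.15):
    """Class-agnostic matching of flow-warped masks to next-frame masks.

    Returns the list of (i, j) pairs whose mIoU exceeds theta_s;
    unmatched next-frame masks would start new tracklets.
    """
    cost = np.zeros((len(warped_masks), len(next_masks)))
    for i, wm in enumerate(warped_masks):
        for j, nm in enumerate(next_masks):
            cost[i, j] = -mask_iou(wm, nm)   # Hungarian minimizes total cost
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] > theta_s]
```

Matching on the negative IoU makes the minimization of the assignment cost equivalent to maximizing the total mask overlap; the threshold θs then rejects weak matches after the assignment.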
Figure 1. Illustration of our MeNToS method. Given an instance segmentation, binary masks are matched in consecutive frames to create tracklets. Very short tracklets are deleted. An appearance similarity, based on a memory network, is computed between two admissible tracklets. Then, tracklets are gradually merged, starting with the pair having the highest similarity, while respecting the updated constraints. Finally, low-confidence tracks are deleted. Illustration inspired by [43].

As some classes of COCO are visually similar (for example, car and truck), only associating objects of the same class may lead to missing some objects due to classification errors. That is why this step is a class-agnostic association, letting the model match a car with a truck if the optical flow corresponds.

The Hungarian algorithm [14] is used to associate masks, where the cost matrix is based on the negative mIoU. Matched detections with a mIoU above a threshold θs are connected to form a tracklet, and the remaining detections form new tracklets. Finally, the class of a tracklet is defined as the most frequent one among its detections, and tracklets with only one detection are deleted since they often correspond to false positives.

3.3. Long-term association (LTA)

The greedy long-term association and the use of an STM network for the re-identification of tracklets are the novelties of our approach. Once tracklets have been created, it is necessary to link them in case of fragmentation caused, for example, by occlusion. In this long-term association, we use a memory network as a similarity measure between tracklets to propagate some masks of a tracklet into the past and the future. If a propagated mask of a tracklet sufficiently overlaps a mask of another tracklet, the two tracklets are linked together. Given that this procedure is applied at the pixel level on the whole image, this similarity is only computed on a selection of admissible tracklet pairs to reduce the computational cost. At this step, we point out that all tracklets have a length of at least two.

3.3.1 Measure of similarity between tracklets

Our similarity measure is based on the ability to match parts of two different tracklets (say T^A and T^B) and can be interpreted as a pixel-level visual-spatial alignment rather than a patch-level visual alignment [43, 46]. For that, we propagate some masks of tracklet T^A to frames where tracklet T^B is present and then compare the masks of T^B with the propagated versions of the mask heatmaps, computed before binarization, for T^A. The more they are spatially aligned, the higher the similarity.

In detail, let us consider two tracklets T^A = (M_1^A, M_2^A, ..., M_N^A) and T^B = (M_1^B, M_2^B, ..., M_P^B) of length N and P respectively, such that T^A appears first, and where M_1^A denotes the first segmentation mask of tracklet T^A. We use a pre-trained STM network [24] to store two binary masks as references (along with their corresponding frames): the closest ones (M_N^A for T^A and M_1^B for T^B) and a second mask a little farther away (M_{N-n-1}^A for T^A and M_n^B for T^B). The farther masks are used because the first and last masks of a tracklet are often incomplete due, for example, to occlusions. Then, the reference frames are used as queries to produce heatmaps with continuous values between 0 and 1 (H_N^A, H_{N-n-1}^A, H_1^B, H_n^B). Finally, the average cosine similarity between these four heatmaps and the four masks (M_N^A, M_{N-n-1}^A, M_1^B, M_n^B) is the final similarity between the two tracklets, denoted sim(T^A, T^B). Figure 2 illustrates a one-frame version of this similarity measure between tracklets.

Figure 2. Similarity used at the long-term association step. For simplicity, only one mask and frame are used as reference and as target in the space-time memory network.

3.3.2 Selection of pairs of tracklets

Instead of estimating a similarity measure between all pairs of tracklets, a selection is made to reduce the computational cost [43]. The selection is based on the following heuristic: two tracklets may belong to the same object if they belong to the same object class, are temporally close, spatially close, and have a small temporal overlap.

In detail, let us denote by f(M) the frame where the mask M is present, by M̄ its center, and by fps, H and W respectively the number of frames per second, the height and the width of the video. The temporal cost C_t(T^A, T^B), the spatial cost C_s(T^A, T^B) and the temporal overlap cost C_o(T^A, T^B) between T^A and T^B are defined respectively as:

    C_t(T^A, T^B) = |f(M_N^A) − f(M_1^B)| / fps,                    (1)

    C_s(T^A, T^B) = (2 / (H + W)) × ||M̄_N^A − M̄_1^B||_1,           (2)

    C_o(T^A, T^B) = |{f(M), ∀M ∈ T^A} ∩ {f(M), ∀M ∈ T^B}|.          (3)

A pair (T^A, T^B) is admissible if the tracklets belong to the same object class, C_t(T^A, T^B) ≤ τt, C_s(T^A, T^B) ≤ τs and C_o(T^A, T^B) ≤ τo.

3.3.3 Greedy association

Similarly to Singh et al. [29], we gradually merge the admissible pairs with the highest cosine similarity sim(T^A, T^B), provided it is above a threshold θl, while continuously updating the admissible pairs using Equation 3. A tracklet can therefore be repeatedly merged with other tracklets. Finally, tracks whose highest detection score is lower than 90% are deleted.

4. Experiments

4.1. Implementation details

At the detection step, θd is 0.5 and small masks whose area is less than θa = 128 pixels are removed. For the LTA step, the selection is done with (τt, τs, τo) = (1.5, 0.2, 1). To measure similarity, the second reference frame is picked using n = 5; if that frame is not available, n = 2 is used instead. As for the thresholds at the STA and LTA steps, we selected θs = 0.15 and θl = 0.30. These hyper-parameters were selected through cross-validation and remain fixed regardless of the dataset and object classes.

4.2. Datasets and performance evaluations

We evaluated our method on KITTIMOTS and MOTSChallenge [37], two common MOTS datasets. KITTIMOTS contains 21 training videos and 29 test videos with cars and pedestrians, while MOTSChallenge contains 4 training videos and 4 test videos with pedestrians only.

Recently, the HOTA metric was introduced to fairly balance the quality of detections and associations. It can be decomposed into the DetA and AssA metrics, which measure the quality of these two components. The higher the HOTA, the more the tracker is aligned with human visual assessment. Contrary to MOTSChallenge, which is primarily evaluated with the sMOTSA and MOTSA measures, KITTIMOTS has adopted this new metric. For the sake of standardization, when evaluating on the training set, we also report HOTA measures for MOTSChallenge.
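The three admissibility costs of Section 3.3.2 translate directly into code. Below is a minimal sketch, assuming each tracklet is represented by its sorted frame indices and its per-mask (x, y) centers; the function name and argument layout are hypothetical, and the same-class test is omitted for brevity.

```python
import numpy as np

def admissible(frames_a, centers_a, frames_b, centers_b,
               fps, H, W, tau_t=1.5, tau_s=0.2, tau_o=1):
    """Admissibility test for a tracklet pair (T^A appearing before T^B).

    frames_*: sorted frame indices of each tracklet;
    centers_*: per-mask (x, y) centers, aligned with frames_*.
    Implements the temporal (Eq. 1), spatial (Eq. 2) and
    temporal-overlap (Eq. 3) costs.
    """
    c_t = abs(frames_b[0] - frames_a[-1]) / fps                   # Eq. (1)
    diff = np.asarray(centers_a[-1], float) - np.asarray(centers_b[0], float)
    c_s = 2.0 / (H + W) * np.abs(diff).sum()                      # Eq. (2), L1 norm
    c_o = len(set(frames_a) & set(frames_b))                      # Eq. (3)
    return c_t <= tau_t and c_s <= tau_s and c_o <= tau_o
```

With the paper's thresholds (τt, τs, τo) = (1.5, 0.2, 1), this keeps pairs whose gap is at most 1.5 seconds, whose endpoint centers are within 20% of the average image dimension, and which share at most one frame.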
4.3. Results

Results¹ in Tables 1 and 2 indicate that our method is competitive on the association performance. We rank 3rd on this criterion on KITTIMOTS, and our number of identity switches is the lowest on MOTSChallenge. We point out that our method does not require any additional fine-tuning on the benchmarks. Note also that the methods in the tables do not all use the same detection inputs. The ablation studies in the next section give a better understanding of our contribution by fixing the detection inputs to assess only the data association component.

¹ Full results at http://www.cvlibs.net/datasets/kitti/eval_mots.php and https://motchallenge.net/results/MOTS/

                          Cars                  Pedestrians
Method                    HOTA  DetA  AssA      HOTA  DetA  AssA
UW IPL ETRI AIRL [46]     79.6  79.7  80.0      64.4  66.7  63.8
ReID MOT                  78.3  79.3  77.9      65.6  67.6  65.0
ViP-DeepLab [26]          76.4  82.7  70.9      64.3  70.7  59.5
EagerMOT [13]             74.7  76.1  73.8      57.7  60.3  56.2
MOTSFusion [18]           73.6  75.4  72.4      54.0  60.8  49.5
ReMOTS [43]               71.6  78.3  66.0      58.8  68.0  52.4
PointTrack [42]           62.0  79.4  48.8      54.4  62.3  48.1
MeNToS (ours)             75.8  77.1  74.9      65.4  68.7  63.5

Table 1. Results on the test set of KITTIMOTS. Bold red and italic blue indicate respectively the first and second best methods. Contains unpublished results.

Method            sMOTSA  IDF1  MOTSA  FP    FN     IDs
ReMOTS [43]       70.4    75.0  84.4   819   3999   231
COSTA_st          69.5    70.3  83.3   806   4176   421
GMPHD [30]        69.4    66.4  83.3   935   3985   484
SORTS [1]         55.0    57.3  68.3   1076  8598   552
TrackRCNN [37]    40.6    42.4  55.2   1261  12641  567
MeNToS (ours)     64.8    73.2  76.9   654   6704   110

Table 2. Results on the test set of MOTSChallenge. Bold red and italic blue indicate respectively the first and second best methods. Contains unpublished results.

Some qualitative results are provided in Figure 3. We notice that our method can retrieve some objects even after occlusions, like the green, yellow and purple cars in the first sequence and the blue pedestrian in the second sequence. The main limitations are contaminated segmentation masks and short tracklets, since they provide little information and their appearances are unstable. For example, pedestrians usually walk in groups, provoking more occlusions and eventually lower-quality masks, as in the third and fourth sequences. Moreover, a large spatial displacement of the center of the masks may happen in the case of an occlusion, as in the fifth sequence. In that particular case, the spatial distance between the centers of the two masks (the yellow one in the first frame and the green one in the second one) is higher than the threshold used for the spatial constraint (Equation 2).

Figure 3. Qualitative results on KITTIMOTS and MOTSChallenge. Each row corresponds to a subsequence of a video clip.

5. Ablation studies

5.1. Contribution of each step

To examine the effect of each step of our method, we evaluate the performance obtained by successively adding one component at a time.

          KT-car        KT-ped        MOTSChallenge
Step      HOTA          HOTA          HOTA         sMOTSA
STA       79.4          62.0          57.9         62.8
+ filter  79.8 (+0.4)   63.3 (+1.3)   58.2 (+0.3)  64.0 (+1.2)
+ LTA     84.1 (+4.3)   66.2 (+2.9)   64.5 (+6.3)  64.9 (+0.9)
+ filter  84.1 (+0.0)   67.6 (+1.4)   65.4 (+0.9)  67.7 (+2.8)

Table 3. Ablation studies on the validation set of KITTIMOTS (KT) and the train set of MOTSChallenge. Each step of our approach leads to an improvement in terms of HOTA and sMOTSA.

Results in Table 3 indicate that all components improve the tracking performance (either in terms of HOTA or sMOTSA). More precisely, on KITTIMOTS and MOTSChallenge, the HOTA metric is highly sensitive to the quality of the association: starting from the tracking results obtained at the STA step, on average, two-thirds of the total improvement in HOTA is made at the LTA step. As for sMOTSA, unsurprisingly, it downplays the effect of the long-term association. Consequently, improving the LTA step leads to a boost in terms of HOTA; this improvement in the data association is totally missed by the previous sMOTSA measure.

5.2. Upper bound with oracle

Since HOTA gives more importance to the data association part of MOTS than sMOTSA does, we also conducted experiments to measure the limits of our method and to identify the steps that can lead to the biggest gain in HOTA. Using the ground truth annotations of KITTIMOTS, we built three methods that inject ground truth knowledge step by step.

Oracle at the STA. The oracle at the STA step is obtained with perfect short-term assignments based on the detections obtained by the multiclass object detector. This means that repeated false positives are also linked together.

Precisely, an initial identity is provided for all masks at the first frame. Then, for each mask at time t corresponding to a ground truth object (mIoU over 50%), we look for the mask that corresponds to the same ground truth object at time t+1. If it is present, we attribute the same identity. Afterwards, for masks that do not correspond to a ground truth mask, we look for a mask at time t+1 that sufficiently overlaps with the optical flow-warped mask at time t; if this mask does not correspond to a ground truth object, the same identity is attributed. Finally, a new identity is provided for non-associated masks at time t+1. The next steps follow those described in Section 3.

Oracle at the LTA. The oracle at the LTA step is obtained with perfect long-term assignments of the tracklets obtained with our method into tracks.

Precisely, after associating masks using the optical flow from RAFT and filtering short tracklets, we compute a ground truth identity for each tracklet as the most frequent one among its detections. Then, tracklets sharing the same ground truth identity are given the same identity. This oracle ignores the selection step of candidate pairs. Similarly to Section 3, tracklets with low scores are removed.

Full oracle. The full oracle is obtained with perfect short- and long-term assignments of the detections that match the ground truth, and a perfect deletion of false positives.

Precisely, this is the highest upper bound using ground truth knowledge from the detection step. All segmentation masks that correspond to a ground truth mask with a mIoU higher than 50% are kept and receive the ground truth identity; otherwise, they are removed.

Results. We evaluated the HOTA of these three oracle upper bounds on the validation set of KITTIMOTS and compared them with our method. Results in Table 4 indicate that there is still room for improvement given the same detections: a potential gain of 2.7 and 5.1 points is still possible on the car and pedestrian classes of KITTIMOTS. Moreover, the first step of short-term association with RAFT is nearly perfect: the possible gain there does not exceed 0.5 point. On the contrary, the long-term association step is the one where a substantial gain is available: 67% and 53% of the total gain for cars and pedestrians respectively is located at this step. We hypothesize that the difference between these classes comes from the fact that there are more distractors for pedestrians than for cars. Indeed, the model fully using the ground truth annotations is able to eliminate all distractors.

Method          KT-car        KT-ped
MeNToS          84.1          67.6
Oracle at STA   84.5 (+0.4)   68.1 (+0.5)
Oracle at LTA   85.9 (+1.8)   70.3 (+2.7)
Full oracle     86.8 (+2.7)   72.7 (+5.1)

Table 4. Oracle methods on the validation set of KITTIMOTS (KT). Given the same detections, the largest future gain is at the long-term association step.

5.3. Comparison with other strategies of long-term association

In the proposed method, we use an STM network to compute a similarity measure between tracklets. Other methods traditionally use re-identification features [43] or color histograms [41]. Moreover, they also tend to use more reference frames, whereas ours only takes two frames into account. In order to measure the contribution of the STM network combined with a cosine similarity, we replace them with one of the following methods:

1. RGB 2x2: for two tracklets, we extract the color histograms of the same two reference frames, computed at the pixel level. The similarity measure is then the average of four Bhattacharyya coefficients;

2. RGB NxP: for two tracklets, we extract the color histograms of all reference frames, computed on the mask. The similarity measure is then the average Bhattacharyya coefficient over all possible combinations of histograms between the two tracklets;

3. ReID 2x2: for two tracklets, we compute the ReID features of the same two reference frames. The similarity measure is then the average of four cosine similarities;

4. ReID NxP: for two tracklets, we compute the ReID features of all reference frames. The similarity measure is then the average cosine similarity over all possible combinations of ReID features between the two tracklets.

The ReID features are computed with the OsNet-AIN [47] network. This choice is motivated by works [22, 40, 27] which indicate that ReID features are suitable for associating far-apart detections and that color histograms are efficient, especially for small objects.

Figure 4 illustrates the behavior of these four techniques for LTA, alongside the one based on the STM network, in terms of HOTA with regard to the similarity threshold θl. First, it indicates that for the color histogram-based and ReID-based methods, using all frames of a tracklet as reference leads to a higher HOTA. Second, the four additional techniques are sensitive to the parameter θl: the interval where the HOTA is highest is narrow. Third, on the contrary, our technique based on the STM network is less sensitive to the parameter θl: the HOTA is high on the interval [0.1; 0.5].

However, our STM-based association struggles when different parts of the same object are present in the tracklets. For example, if one tracklet displays only the hood of a car and another only the car trunk, even if they belong to the same object, our STM-based approach cannot associate such tracklets since no particular parts are in common. As for ReID features, they are trained to overlook such events, but in the case of heavy occlusion, the features would probably be misaligned.

To address this issue of misalignment, Wang et al. [38] used an alignment based on a linear transformation. As opposed to their approach, we use an STM network to propagate masks, performing the propagation at the pixel level and on the whole image for better precision when matching tracklets. Working on the entire image is advantageous when a detection is missing. Moreover, this is the closest form of use of an STM network to its original OSVOS framework.

5.4. Number of reference frames

From the results of Figure 4, it seems that taking more reference frames into account leads to better results. Is this trend still relevant for our STM-based technique? Since it is not technically possible with our GPU to load all frames in memory and compute the attention, we compared the following approaches:

1. Frame 1: only the closest frames (f(M_N^A) and f(M_1^B), with the notation used in Section 3.3.1) are used as reference, as illustrated in Figure 2;

2. Frames 1&2: the two closest frames (f(M_N^A), f(M_{N-1}^A) and f(M_1^B), f(M_2^B)) are used as reference;

3. Frames 1&5|2: this is the technique currently used in our approach. The closest frames are used, plus f(M_{N-4}^A) (respectively f(M_5^B)) if it exists for T^A (respectively T^B), otherwise f(M_{N-1}^A) (respectively f(M_2^B));

4. Frames 1&2&(5): the two closest frames are used, plus f(M_{N-4}^A) (respectively f(M_5^B)) if it exists for T^A (respectively T^B).
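The final score sim(T^A, T^B) of Section 3.3.1 is an average of cosine similarities between STM heatmaps and binary masks. A minimal sketch follows, with the STM propagation itself assumed to be provided externally (the `heatmaps` argument holds continuous values in [0, 1]); the function names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two arrays, flattened to vectors."""
    u, v = u.ravel().astype(float), v.ravel().astype(float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def tracklet_similarity(heatmaps, masks):
    """sim(T^A, T^B): average cosine similarity between propagated
    heatmaps and the corresponding binary masks of the other tracklet."""
    return float(np.mean([cosine(h, m) for h, m in zip(heatmaps, masks)]))
```

Because both the heatmaps and the masks are non-negative, the score lies in [0, 1]: it is 1 when a propagated heatmap reproduces the other tracklet's mask exactly, and 0 when they do not overlap at all.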
Figure 4. Comparison of some strategies of long-term association for KITTIMOTS-car, KITTIMOTS-pedestrian and MOTSChallenge.

Figure 5. Ablation studies on the selection of the reference frames in the STM-based technique on the validation set of KITTIMOTS.

Figure 5 illustrates the performance in terms of HOTA for different selections of reference frames in the STM with regard to the similarity threshold θl. On KITTIMOTS, using only the closest frames as reference leads to the lowest HOTA. For pedestrians, using at least two reference frames provides a small improvement; with more frames, the performance seems to saturate. We hypothesize that this is due to masks contaminated by other pedestrians, as people tend to walk in groups. As for cars, replacing the second closest frame by its fifth equivalent contributes to a boost in terms of HOTA; considering three frames seems to saturate the HOTA. We hypothesize that using the fifth frame as a reference is a better choice than the second one because it is less similar to the first frame: more uncorrelated information is thus available to the STM. These results are similar to the conclusion drawn by Lai et al. [16] on OSVOS: their Memory-Augmented Tracker used both short- and long-term memory to recover objects.

5.5. Why is there no propagation, since the STM is a propagation-based technique?

We tried to include a propagation step after the LTA one. Contrary to OSVOS or VIS, where there are few objects of the same class, in MOTS the number of same-class instances can reach a dozen, making propagation challenging. Indeed, same-class objects share similar parts, which provokes erroneous propagation, especially when an object already exists in the scene and another, undetected one appears. Moreover, to fill the gap between two tracklets, the information available for propagation at the closest extremities of the tracklets comes from frames where the detector was not able to detect any object (or where the detection was rejected due to a low confidence score, a small size, or an absence of consecutive association during the STA step). Hence, propagating a mask in such a situation is by nature a complicated task.

6. Conclusion

We proposed a method to solve MOTS using an STM network originally developed for solving OSVOS. It is mainly based on a hierarchical association where masks are first associated in consecutive frames to form tracklets. Then, these tracklets are associated using an STM network, leveraging the ability of the network to match similar parts. Experiments on KITTIMOTS and MOTSChallenge show that our approach gives good performance with regard to association quality. We demonstrated that our approach for long-term association using an STM network is better than recent approaches based on ReID networks. Nevertheless, extensive analyses show that the long-term association can still be improved given the same initial detections.
References

[1] Martin Ahrnbom, Mikael Nilsson, and Håkan Ardö. Real-time and online segmentation multi-target tracking with track revival re-identification. In International Conference on Computer Vision Theory and Applications (VISAPP), 2021.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.
[3] Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixe, Daniel Cremers, and Luc Van Gool. One-Shot Video Object Segmentation. In CVPR, 2017.
[4] Sergi Caelles, Alberto Montes, Kevis-Kokitsi Maninis, Yuhua Chen, Luc Van Gool, Federico Perazzi, and Jordi Pont-Tuset. The 2018 DAVIS Challenge on Video Object Segmentation. arXiv:1803.00557, Mar. 2018.
[5] Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. The 2019 DAVIS Challenge on VOS: Unsupervised Multi-Object Segmentation. arXiv:1905.00737, May 2019.
[6] Patrick Dendorfer, Aljosa Osep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, and Laura Leal-Taixé. MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking. International Journal of Computer Vision (IJCV), 129(4):845–881, Apr. 2021.
[7] Ting Fu, Weichao Hu, Luis Miranda-Moreno, and Nicolas Saunier. Investigating secondary pedestrian-vehicle interactions at non-signalized intersections using vision-based trajectory data. Transportation Research Part C: Emerging Technologies, 105:222–240, Aug. 2019.
[8] Shubhika Garg and Vidit Goel. Mask Selection and Propagation for Unsupervised Video Object Segmentation. In WACV, 2021.
[9] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, and Barret Zoph. Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation. In CVPR, 2021.
[10] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR, 2020.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[12] Allan Jabri, Andrew Owens, and Alexei A. Efros. Space-Time Correspondence as a Contrastive Random Walk. In NeurIPS, 2020.
[13] Aleksandr Kim, Aljoša Ošep, and Laura Leal-Taixé. EagerMOT: Real-time 3D multi-object tracking and segmentation via sensor fusion. In CVPR - Workshops, 2020.
[14] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955.
[15] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016.
[16] Zihang Lai, Erika Lu, and Weidi Xie. MAST: A Memory-Augmented Self-supervised Tracker. In CVPR, 2020.
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[18] J. Luiten, T. Fischer, and B. Leibe. Track to Reconstruct and Reconstruct to Track. IEEE Robotics and Automation Letters, 5(2):1803–1810, Apr. 2020.
[19] Jonathon Luiten, Arne Hoffhues, Blin Beqa, Paul Voigtlaender, István Sárándi, Patrick Dendorfer, Aljosa Osep, Achal Dave, Tarasha Khurana, Tobias Fischer, Xia Li, Yuchen Fan, Pavel Tokmakov, Song Bai, Linjie Yang, Federico Perazzi, Ning Xu, Alex Bewley, Jack Valmadre, Sergi Caelles, Jordi Pont-Tuset, Xinggang Wang, Andreas Geiger, Fisher Yu, Deva Ramanan, Laura Leal-Taixé, and Bastian Leibe. RobMOTS: A Benchmark and Simple Baselines for Robust Multi-Object Tracking and Segmentation. In CVPR RVSU Workshop, 2021.
[20] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixe, and Bastian Leibe. HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. International Journal of Computer Vision (IJCV), Oct. 2020.
[21] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. PReMVOS: Proposal-Generation, Refinement and Merging for Video Object Segmentation. In ACCV, 2018.
[22] Mehdi Miah, Justine Pepin, Nicolas Saunier, and Guillaume-Alexandre Bilodeau. An Empirical Analysis of Visual Features for Multiple Object Tracking in Urban Scenes. In International Conference on Pattern Recognition (ICPR), 2021.
[23] Seoung Wug Oh, Joon-Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. Fast Video Object Segmentation by Reference-Guided Mask Propagation. In CVPR, 2018.
[24] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video Object Segmentation Using Space-Time Memory Networks. In ICCV, 2019.
[25] Federico Perazzi, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. Learning Video Object Segmentation From Static Images. In CVPR, 2017.
[26] Siyuan Qiao, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation. In CVPR, 2021.
[27] Ergys Ristani and Carlo Tomasi. Features for Multi-Target Multi-Camera Tracking and Re-Identification. In CVPR, 2018.
[28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, Dec. 2015.
[29] Gurinderbeer Singh, Sreeraman Rajan, and Shikharesh Majumdar. A Greedy Data Association Technique for Multiple Object Tracking. In International Conference on Multimedia Big Data (BigMM), 2017.
[30] Young-min Song and Moongu Jeon. Online Multi-Object Tracking and Segmentation with GMPHD Filter and Simple Affinity Fusion. In CVPR - Workshops, 2020.
[31] Rainer Stiefelhagen, Keni Bernardin, Rachel Bowers, John Garofolo, Djamel Mostefa, and Padmanabhan Soundararajan. The CLEAR 2006 Evaluation. In Multimodal Technologies for Perception of Humans, Lecture Notes in Computer Science, pages 1–44, Berlin, Heidelberg, 2007. Springer.
[32] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-To-End Memory Networks. In NIPS, 2015.
[33] Chufeng Tang, Hang Chen, Xiao Li, Jianmin Li, Zhaoxiang Zhang, and Xiaolin Hu. Look Closer To Segment Better: Boundary Patch Refinement for Instance Segmentation. In CVPR, 2021.
[34] Zachary Teed and Jia Deng. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In ECCV, 2020.
[35] Jack Valmadre, Alex Bewley, Jonathan Huang, Chen Sun, Cristian Sminchisescu, and Cordelia Schmid. Local Metrics for Multi-Object Tracking. arXiv:2104.02631, Apr. 2021.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In NIPS, 2017.
[37] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. MOTS: Multi-Object Tracking and Segmentation. In CVPR, 2019.
[38] Zhongdao Wang, Hengshuang Zhao, Ya-Li Li, Shengjin Wang, Philip H. S. Torr, and Luca Bertinetto. Do Different Tracking Tasks Require Different Appearance Models? arXiv:2107.02156, July 2021.
[39] Dongxu Wei, Jiashen Hua, Hualiang Wang, Baisheng Lai, Kejie Huang, Chang Zhou, Jianqiang Huang, and Xiansheng Hua. RobTrack: A Robust Tracker Baseline towards Real-World Robustness in Multi-Object Tracking and Segmentation. In CVPR RVSU Workshop, 2021.
[40] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In IEEE International Conference on Image Processing (ICIP), 2017.
[41] Junliang Xing, Haizhou Ai, and Shihong Lao. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In CVPR, pages 1200–1207, 2009.
[42] Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Huan Huang, Shilei Wen, Errui Ding, and Liusheng Huang. Segment as Points for Efficient Online Multi-Object Tracking and Segmentation. In ECCV, 2020.
[43] Fan Yang, Xin Chang, Chenyu Dang, Ziqiang Zheng, Sakriani Sakti, Satoshi Nakamura, and Yang Wu. ReMOTS: Self-Supervised Refining Multi-Object Tracking and Segmentation. In CVPR - Workshops, 2020.
[44] Linjie Yang, Yanran Wang, Xuehan Xiong, Jianchao Yang, and Aggelos K. Katsaggelos. Efficient Video Object Segmentation via Network Modulation. In CVPR, 2018.
[45] Sohail Zangenehpour, Jillian Strauss, Luis F. Miranda-Moreno, and Nicolas Saunier. Are signalized intersections with cycle tracks safer? A case–control study based on automated surrogate safety analysis using video data. Accident Analysis & Prevention, 86:161–172, Jan. 2016.
[46] Haotian Zhang, Yizhou Wang, Jiarui Cai, Hung-Min Hsu, Haorui Ji, and Jenq-Neng Hwang. LIFTS: Lidar and monocular image fusion for multi-object tracking and segmentation. In CVPR - Workshops, 2020.
[47] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. Omni-Scale Feature Learning for Person Re-Identification. In ICCV, 2019.