Occluded Video Instance Segmentation


Jiyang Qi1,2*  Yan Gao2*  Yao Hu2  Xinggang Wang1  Xiaoyu Liu2  Xiang Bai1
Serge Belongie3  Alan Yuille4  Philip H.S. Torr5  Song Bai2,5†
1Huazhong University of Science and Technology  2Alibaba Group  3Cornell University  4Johns Hopkins University  5University of Oxford
arXiv:2102.01558v4 [cs.CV] 30 Mar 2021

 Abstract

Can our video understanding systems perceive objects when heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand those occluded instances by contextual reasoning and association, our experiments suggest that current video understanding systems are not satisfactory. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 14.4, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario. In experiments, a simple plug-and-play module that performs temporal feature calibration is proposed to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain an AP of 15.1 and 14.5 on the OVIS dataset and achieve 32.1 and 35.1 on the YouTube-VIS dataset respectively, a remarkable improvement over the state-of-the-art methods. The OVIS dataset is released at http://songbai.site/ovis, and the project code will be available soon.

* indicates equal contributions.
† Corresponding author. E-mail: songbai.site@gmail.com

Figure 1. Sample video clips from OVIS. Click them to watch the animations (best viewed with Acrobat/Foxit Reader).

1. Introduction

In the visual world, objects rarely occur in isolation. Psychophysical and computational studies have demonstrated [32, 15] that human vision systems can perceive heavily occluded objects with contextual reasoning and association. The question then becomes: can our video understanding systems perceive objects that are severely obscured?

Our work aims to explore this matter in the context of video instance segmentation, a popular task recently proposed in [50] that targets a comprehensive understanding of objects in videos. To this end, we explore a new and challenging scenario called Occluded Video Instance Segmentation (OVIS), which requires a model to simultaneously detect, segment, and track object instances in occluded scenes.

As the major contribution of this work, we collect a large-scale dataset called OVIS, specifically for video instance segmentation in occluded scenes. While being the second video instance segmentation dataset after YouTube-VIS [50], OVIS consists of 296k high-quality instance masks from 25 commonly seen semantic categories. Some example clips are given in Fig. 1. The most distinctive property of our OVIS dataset is that most objects are under severe occlusions, and the occlusion level of each object is also labeled (as shown in Fig. 2). Therefore, OVIS is a useful testbed for evaluating how well video instance segmentation models deal with heavy object occlusions.
Figure 2. Different occlusion levels in OVIS. Unoccluded objects are colored green, slightly occluded objects are colored yellow, and severely occluded objects are colored red.

To dissect the OVIS dataset, we conduct a thorough evaluation of five state-of-the-art algorithms whose code is publicly available, including FEELVOS [39], IoUTracker+ [50], MaskTrack R-CNN [50], SipMask [3], and STEm-Seg [1]. However, the experimental results suggest that current video understanding systems fall behind the capability of human beings in terms of occlusion perception. The highest AP is only 14.4, achieved by [1]. In this sense, we are still far from deploying those techniques in practical applications, especially considering the complexity and diversity of scenes in the real visual world.

To address the occlusion issue, we also propose a plug-and-play module called temporal feature calibration. For a given query frame in a video, we resort to a reference frame to complement its missing object cues. Specifically, the proposed module learns a calibration offset for the reference frame with the guidance of the query frame, and then the offset is used to adjust the feature embedding of the reference frame via deformable convolution [7]. The refined reference embedding is used in turn to assist the object recognition of the query frame. Our module is a highly flexible plug-in. When applied to MaskTrack R-CNN [50] and SipMask [3] respectively, we obtain an AP of 15.1 and 14.5, significantly outperforming the corresponding baselines by 3.3 and 2.8 in AP, respectively.

To summarize, our contributions are three-fold:

• We advance video instance segmentation by releasing a new benchmark dataset named OVIS (short for Occluded Video Instance Segmentation). OVIS is designed with the philosophy of perceiving object occlusions in videos, which could reveal the complexity and the diversity of real-world scenes.

• We streamline the research over the OVIS dataset by conducting a comprehensive evaluation of five state-of-the-art video instance segmentation algorithms, which could serve as a baseline reference for future research on OVIS.

• We propose a plug-and-play module to alleviate the occlusion issue. Using MaskTrack R-CNN [50] and SipMask [3] as baselines, this module obtains remarkable improvements on both OVIS and YouTube-VIS.

2. Related Work

Our work focuses on video instance segmentation in occluded scenes. The most relevant work to ours is [50], which formally defines the concept of video instance segmentation and releases the first dataset, called YouTube-VIS. Built upon the large-scale video object segmentation dataset YouTube-VOS [48], the 2019 version of the YouTube-VIS dataset contains a total of 2,883 videos, 4,883 instances, and 131k masks in 40 categories. Its latest 2021 version contains a total of 3,859 videos, 8,171 instances, and 232k masks. However, YouTube-VIS is not designed to study the occluded video understanding problem. Most objects in our OVIS dataset are under severe occlusions. The experimental results show that OVIS is more challenging than YouTube-VIS.

Since the release of the YouTube-VIS dataset, video instance segmentation has attracted great attention in the computer vision community, giving rise to a series of algorithms recently [50, 3, 1, 2, 28, 31, 42, 10, 44, 11, 18]. MaskTrack R-CNN [50] is the first unified model for video instance segmentation. It achieves video instance segmentation by adding a tracking branch to the popular image instance segmentation method Mask R-CNN [13]. MaskProp [2] is also a video extension of Mask R-CNN, which adds a mask propagation branch to track instances by the propagated masks. SipMask [3] extends single-stage image instance segmentation to the video level by adding a fully-convolutional branch for tracking instances. Different from those top-down methods, STEm-Seg [1] proposes a bottom-up method, which performs video instance segmentation by clustering the pixels of the same instance. Built upon Transformers, VisTR [44] supervises and segments instances at the sequence level as a whole.
In experiments, a feature calibration module is proposed, in which calibrated features from neighboring frames are fused with the current frame for reasoning about occluded objects. Based on strong baselines, i.e., MaskTrack R-CNN [50] and SipMask [3], this simple plug-and-play module obtains significant improvements in occluded scenes. Different from MaskProp [2], which uses deformable convolution to predict a local mask sequence for better tracking, the deformable convolutional layer in our method is used to calibrate the features from reference frames to complete missing object cues caused by occlusion.

Figure 3. Number of instances per category in the OVIS dataset.

Meanwhile, our work is also relevant to several other tasks, including:

Video Object Segmentation. Video object segmentation (VOS) is a popular task in video analysis. According to whether the mask of the first frame is provided, VOS can be divided into semi-supervised and unsupervised scenarios. Semi-supervised VOS [41, 20, 26, 35, 16, 19, 36, 27] aims to track and segment a given object specified by a mask. Many semi-supervised VOS methods [41, 20, 26] adopt an online learning manner, which fine-tunes the network on the mask of the first frame during inference. Recently, some other works [35, 16, 19, 36, 27] aim to avoid online learning for the sake of faster inference. Unsupervised VOS methods [25, 43, 38] aim to segment the primary objects in a video without first-frame annotations. Different from video instance segmentation, which needs to classify objects, neither unsupervised nor semi-supervised VOS distinguishes semantic categories.

Video Semantic Segmentation. Video semantic segmentation requires semantic segmentation of each frame in a video. LSTMs [9], GRUs [34], and optical flow [52] have been introduced to leverage temporal contextual information for more accurate or faster video semantic segmentation. Video semantic segmentation does not require distinguishing instances or tracking objects across frames.

Video Panoptic Segmentation. Kim et al. [21] define a video extension of panoptic segmentation [22], which requires generating consistent panoptic segmentation and, in the meantime, associating instances across frames.

Multi-Object Tracking and Segmentation. The multi-object tracking and segmentation (MOTS) task [40] extends multi-object tracking (MOT) [37] from the bounding-box level to the pixel level. Voigtlaender et al. [40] release the KITTI MOTS and MOTSChallenge datasets, and propose Track R-CNN, which extends Mask R-CNN with 3D convolutions to incorporate temporal context and an extra tracking branch for object tracking. Xu et al. [49] release the ApolloScape dataset, which provides more crowded scenes, and propose a new track-by-points paradigm.

Our work is of course also relevant to some image-level recognition tasks, such as semantic segmentation [30, 5, 6], instance segmentation [13, 17, 23], panoptic segmentation [22, 47, 24], large vocabulary instance segmentation [12, 46], etc.

3. OVIS Dataset

Given an input video, video instance segmentation requires detecting, segmenting, and tracking object instances simultaneously from a predefined set of object categories. An algorithm is supposed to output the class label, confidence score, and a sequence of binary masks for each instance.

The focus of this work is on collecting a large-scale benchmark dataset for video instance segmentation with severe object occlusions. In this section, we mainly review the data collection process, the annotation process, and the dataset statistics.
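To make the expected output concrete, below is a minimal sketch of one predicted instance as just described: a class label, a confidence score, and a sequence of binary masks (one per frame, absent where the instance does not appear). This is our own illustration; the field names and the use of None for missing frames are assumptions, not an official OVIS format.

```python
# A minimal sketch (not an official OVIS/YouTube-VIS API) of a single
# video-instance prediction: class label, confidence score, and one binary
# mask per frame (None where the instance is absent in that frame).
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class VISPrediction:
    category_id: int                    # index into the predefined category set
    score: float                        # confidence score for the whole instance
    masks: List[Optional[np.ndarray]]   # per-frame binary masks (H x W, bool)

# Example: an instance visible in frames 0 and 2 of a 3-frame clip.
h, w = 360, 640
pred = VISPrediction(category_id=3, score=0.87,
                     masks=[np.zeros((h, w), bool), None, np.zeros((h, w), bool)])
```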
3.1. Video Collection

We begin with 25 semantic categories, including Person, Bird, Cat, Dog, Horse, Sheep, Cow, Elephant, Bear, Zebra, Giraffe, Poultry, Giant panda, Lizard, Parrot, Monkey, Rabbit, Tiger, Fish, Turtle, Bicycle, Motorcycle, Airplane, Boat, and Vehicle. The categories are carefully chosen for three main reasons: 1) most of them are animals, among which object occlusions happen extensively; 2) they are commonly seen in daily life; and 3) these categories have a high overlap with popular large-scale image instance segmentation datasets [29, 12], so that models trained on those datasets are easier to transfer. The number of instances per category is given in Fig. 3.

As the dataset is meant to study the capability of video understanding systems to perceive occlusions, we ask the annotation team to 1) exclude videos where only one single object stands in the foreground; 2) exclude videos with a clean background; and 3) exclude videos where the complete contour of objects is visible all the time. Some other objective rules include: 1) the video length is generally between 5s and 60s, and 2) the video resolution is generally 1920 × 1080.

After applying these rules, the annotation team delivers 8,644 video candidates, and our research team accepts only 901 challenging videos after a careful re-check.
It should be mentioned that, due to the stringent standard of video collection, the pass rate is as low as 10%.

Dataset               YTVIS 2019   YTVIS 2021   OVIS
Masks                 131k         232k         296k
Instances             4,883        8,171        5,223
Categories            40           40           25
Videos                2,883        3,859        901
Video duration⋆       4.61s        5.03s        12.77s
Instance duration     4.47s        4.73s        10.05s
mBOR⋆                 0.07         0.06         0.22
Objects / frame⋆      1.57         1.95         4.72
Instances / video⋆    1.69         2.10         5.80

Table 1. Comparing OVIS with YouTube-VIS in terms of statistics. See Eq. (1) for the definition of mBOR. ⋆ means the value for YouTube-VIS is estimated from the training set.

Figure 4. Comparison of OVIS with YouTube-VIS, including the distribution of instance duration (a), BOR (b), the number of instances per video (c), and the number of objects per frame (d).
3.2. Annotation

Given an accepted video, the annotation team is asked to exhaustively annotate all the objects belonging to the predefined category set. Each object is given an instance identity and a class label. In addition to some common rules (e.g., no ID switch, mask fitness ≤ 1 pixel), the annotation team is trained with several criteria particularly about occlusions: 1) if an existing object disappears because of full occlusion and then re-appears, the instance identity should remain the same; 2) if a new instance appears in an in-between frame, a new instance identity is needed; and 3) the cases of "object re-appears" and "new instance" should be distinguishable after watching the contextual frames. All the videos are annotated every 5 frames, so the annotation granularity ranges from 3 to 6 fps.

To deeply analyze the influence of occlusion levels on model performance, OVIS provides an occlusion level annotation for every object in each frame. The occlusion levels are defined as follows: no occlusion, slight occlusion, and severe occlusion. As illustrated in Fig. 2, no occlusion means the object is fully visible, slight occlusion means that more than 50% of the object is visible, and severe occlusion means that more than 50% of the object area is occluded.
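The three occlusion levels map directly onto the visible fraction of an object. A minimal sketch of that mapping is given below; the thresholds follow the definition above, while the function name and the use of a visible-area fraction as input are our own choices.

```python
# Maps the visible fraction of an object (visible area / full object area) to
# its OVIS occlusion level, following the definition above: fully visible ->
# no occlusion; more than 50% visible -> slight occlusion; otherwise severe.
def occlusion_level(visible_fraction: float) -> str:
    if visible_fraction >= 1.0:
        return "no occlusion"
    if visible_fraction > 0.5:
        return "slight occlusion"
    return "severe occlusion"

assert occlusion_level(1.0) == "no occlusion"
assert occlusion_level(0.7) == "slight occlusion"
assert occlusion_level(0.3) == "severe occlusion"
```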
Each video is handled by one annotator to get the initial annotation, and the initial annotation is then passed to another annotator to check and correct if necessary. The final annotations are examined by our research team and sent back for revision if deemed below the required quality.

While being designed for video instance segmentation, it should be noted that OVIS is also suitable for evaluating video object segmentation in either a semi-supervised or unsupervised fashion, as well as object tracking, since bounding-box annotations are also provided. The relevant experimental settings will be explored as part of our future work.

3.3. Dataset Statistics

As YouTube-VIS [50] is currently the only dataset specifically designed for video instance segmentation, we analyze the statistics of our OVIS dataset with YouTube-VIS as a reference in Table 1. We compare OVIS with two versions of YouTube-VIS: YouTube-VIS 2019 and YouTube-VIS 2021. Note that some statistics of YouTube-VIS, marked with ⋆, are calculated only from the training set, because only the annotations of the training set are publicly available. Nevertheless, considering that the training set occupies 78% of the whole dataset, those statistics can still roughly reflect the properties of YouTube-VIS.

In terms of basic and high-level statistics, OVIS contains 296k masks and 5,223 instances. The number of masks in OVIS is larger than in YouTube-VIS 2019 and YouTube-VIS 2021, which have 131k and 232k masks, respectively. The number of instances in OVIS is larger than in YouTube-VIS 2019, which has 4,883 instances, and smaller than in YouTube-VIS 2021, which has 8,171 instances. Note that there are fewer categories in OVIS, so the mean instance count per category is larger than that of YouTube-VIS 2021. Nonetheless, OVIS has fewer videos than YouTube-VIS, as our design philosophy favors long videos and instances so as to preserve enough motion and occlusion scenarios.

As shown, the average video duration and the average instance duration of OVIS are 12.77s and 10.05s, respectively. Fig. 4(a) presents the distribution of instance duration, which shows that all instances in YouTube-VIS last less than 10s. Long videos and instances increase the difficulty of tracking, and the ability of long-term tracking is required.

As for occlusion levels, the proportions of objects with no occlusion, slight occlusion, and severe occlusion in OVIS are 18.2%, 55.5%, and 26.3%, respectively.
Moreover, 80.2% of the instances are severely occluded in at least one frame, and only 2% of the instances are not occluded in any frame. This supports the focus of our work, that is, to explore the ability of video instance segmentation models to handle occluded scenes.

In order to compare the occlusion degree with the YouTube-VIS dataset, we define a metric named Bounding-box Occlusion Rate (BOR) to approximate the degree of occlusion. Given a video frame with N objects denoted by bounding boxes {B_1, B_2, ..., B_N}, we compute the BOR for this frame as

    BOR = |⋃_{1≤i<j≤N} (B_i ∩ B_j)| / |⋃_{1≤i≤N} B_i|.    (1)
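Below is a minimal sketch of Eq. (1), assuming axis-aligned boxes given as (x1, y1, x2, y2) tuples and using shapely for the exact rectangle unions; it is our own illustration rather than the released evaluation code. The mBOR reported in Table 1 is, presumably, the mean of this per-frame value.

```python
# Sketch of the Bounding-box Occlusion Rate in Eq. (1): the area covered by at
# least two boxes divided by the area covered by at least one box.
from itertools import combinations
from shapely.geometry import box
from shapely.ops import unary_union

def bor(boxes):
    rects = [box(x1, y1, x2, y2) for x1, y1, x2, y2 in boxes]
    if len(rects) < 2:
        return 0.0
    pairwise = [a.intersection(b) for a, b in combinations(rects, 2)]
    return unary_union(pairwise).area / unary_union(rects).area

def mbor(frames):
    # Presumed definition of mBOR: the mean per-frame BOR over all frames.
    return sum(bor(f) for f in frames) / len(frames)

# Two heavily overlapping boxes plus one isolated box: BOR = 50 / 250 = 0.2.
print(bor([(0, 0, 10, 10), (5, 0, 15, 10), (30, 30, 40, 40)]))
```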
Figure 5. Visualization of occlusions with different BOR values (BOR = 0.17 and BOR = 0.51).

Figure 6. The pipeline of temporal feature calibration, which can be inserted into different video instance segmentation models by changing the following prediction head. (In the diagram, C denotes correlation and D denotes deformable convolution; the calibrated reference feature is fused with the query feature by element-wise addition.)

4. Experiments

In this section, we comprehensively study the newly collected OVIS dataset by conducting experiments on five existing video instance segmentation algorithms and the proposed baseline method.

4.1. Implementation Details

Datasets and Metrics. On the newly collected OVIS
Methods                 OVIS validation set                 OVIS test set                       YouTube-VIS validation set
                        AP    AP50  AP75  AR1   AR10        AP    AP50  AP75  AR1   AR10        AP    AP50  AP75  AR1   AR10
FEELVOS [39]            9.6   22.0  7.3   7.4   14.8        10.8  23.4  8.7   9.0   16.2        26.9  42.0  29.7  29.9  33.4
IoUTracker+ [50]        7.3   17.9  5.5   6.1   15.1        9.5   18.8  10.0  6.6   16.5        23.6  39.2  25.5  26.2  30.9
MaskTrack R-CNN [50]    10.8  25.3  8.5   7.9   14.9        11.8  25.4  10.4  7.9   16.0        30.3  51.1  32.6  31.0  35.5
SipMask [3]             10.2  24.7  7.8   7.9   15.8        11.7  23.7  10.5  8.1   16.6        32.5  53.0  33.3  33.5  38.9
STEm-Seg [1]            13.8  32.1  11.9  9.1   20.0        14.4  30.0  13.0  10.1  20.6        30.6  50.7  33.5  31.6  37.1
VisTR [44]              -     -     -     -     -           -     -     -     -     -           34.4  55.7  36.5  33.5  38.9
CSipMask                14.3  29.9  12.5  9.6   19.3        14.5  31.1  13.5  9.0   19.4        35.1  55.6  38.1  35.8  41.7
CMaskTrack R-CNN        15.4  33.9  13.1  9.3   20.0        15.1  31.6  13.2  9.8   20.5        32.1  52.8  34.9  33.2  37.9

Table 2. Quantitative comparison with state-of-the-art methods on the OVIS dataset and the YouTube-VIS dataset.

After enumerating all the positions in F_q, we obtain C ∈ R^{H×W×d²} and forward it into multiple stacked convolution layers to get the spatial calibration offset D ∈ R^{H×W×18}. We then obtain a calibrated version of F_r by applying deformable convolution with D as the spatial calibration offset. Finally, we fuse the calibrated reference feature with the query feature F_q by element-wise addition, and the fused feature is used afterwards for the localization, classification, and segmentation of the current frame.

During training, for each query frame F_q we randomly sample a reference frame F_r from the same video. In order to ensure that the reference frame has a strong spatial correspondence with the query frame, the sampling is done only locally, within a window of 5 frames. Since the temporal feature calibration is differentiable, it can be trained end-to-end with the original detection and segmentation losses. During inference, all frames adjacent to the query frame within a range of 5 frames are taken as reference frames. We linearly fuse the classification confidences, regressed bounding-box coordinates, and segmentation masks obtained from each reference frame and output the final results for the query frame.

In the experiments, we denote our method as CMaskTrack R-CNN and CSipMask when Calibrating MaskTrack R-CNN [50] models and Calibrating SipMask [3] models, respectively.
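To make the calibration step concrete, here is a minimal PyTorch sketch of the pipeline described above (our own reading of the description, not the authors' released code). The correlation window size d, the hidden width of the offset branch, and the similarity normalization are assumptions; the 18 offset channels correspond to the two coordinates of the 3 × 3 deformable-convolution sampling grid, and the three 3 × 3 convolutions follow the experimental setup below.

```python
# A minimal sketch (not the authors' released code) of temporal feature
# calibration: correlate query and reference features, predict an 18-channel
# offset map D, deform the reference feature with D, and fuse by addition.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class TemporalFeatureCalibration(nn.Module):
    def __init__(self, channels=256, corr_window=7):  # d = 7 is an assumption
        super().__init__()
        self.d = corr_window
        # Stacked 3x3 convolutions mapping the d*d correlation volume C to the
        # spatial calibration offset D (2 coordinates x 3x3 kernel = 18).
        self.offset_convs = nn.Sequential(
            nn.Conv2d(corr_window * corr_window, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 18, 3, padding=1),
        )
        self.deform_conv = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def correlation(self, f_q, f_r):
        # Dot-product similarity between each query position and the d x d
        # neighborhood around the same position in the reference feature.
        b, c, h, w = f_q.shape
        neigh = F.unfold(f_r, self.d, padding=self.d // 2)       # (b, c*d*d, h*w)
        neigh = neigh.view(b, c, self.d * self.d, h, w)
        return (f_q.unsqueeze(2) * neigh).sum(dim=1) / c ** 0.5  # (b, d*d, h, w)

    def forward(self, f_q, f_r):
        corr = self.correlation(f_q, f_r)         # C in R^{H x W x d^2}
        offset = self.offset_convs(corr)          # D in R^{H x W x 18}
        f_r_cal = self.deform_conv(f_r, offset)   # calibrated reference feature
        return f_q + f_r_cal                      # passed on to the prediction heads

# Usage on one feature level: fuse a reference frame into the query frame.
tfc = TemporalFeatureCalibration()
f_q, f_r = torch.randn(1, 256, 48, 84), torch.randn(1, 256, 48, 84)
fused = tfc(f_q, f_r)  # same shape as f_q
```

In practice the module would presumably be applied per FPN level before the prediction heads of MaskTrack R-CNN or SipMask.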
Experimental Setup. For all our experiments, we adopt ResNet-50-FPN [14] as the backbone. The models are initialized from a Mask R-CNN pre-trained on MS-COCO [29]. All frames are resized to 640 × 360 during both training and inference for fair comparison with previous work [50, 3, 1]. For our new baselines (CMaskTrack R-CNN and CSipMask), we use three convolution layers of kernel size 3 × 3 in the module for temporal feature calibration. The number of training epochs is set to 12, and the initial learning rate is set to 0.005 and decays by a factor of 10 at epochs 8 and 11.
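The schedule just described is a standard step schedule; a short sketch follows. Only the epoch count, initial learning rate, and decay points come from the text; the choice of SGD with momentum 0.9 and weight decay 1e-4 is an assumption.

```python
# Sketch of the training schedule above: 12 epochs, initial learning rate
# 0.005, decayed by a factor of 10 at epochs 8 and 11. The optimizer type and
# its momentum/weight-decay values are assumptions, not taken from the paper.
import torch

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):
    # ... run one training epoch, calling optimizer.step() per iteration ...
    scheduler.step()
```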
4.2. Main Results

On the OVIS dataset, we first report the performance of several state-of-the-art algorithms whose code is publicly available¹, including mask propagation methods (e.g., FEELVOS [39]), track-by-detect methods (e.g., IoUTracker+ [50]), and recently proposed end-to-end methods (e.g., MaskTrack R-CNN [50], SipMask [3], and STEm-Seg [1]).

¹Since VisTR [44] is very slow to train on long videos (about 1k GPU hours in total to train on OVIS), its results will be updated in the future.

As presented in Table 2, all those methods encounter a performance degradation of at least 50% on OVIS compared with YouTube-VIS. For example, the AP of SipMask [3] decreases from 32.5 to 11.7, and that of STEm-Seg [1] decreases from 30.6 to 14.4. This firmly suggests that further attention should be paid to video instance segmentation in the real world, where occlusions extensively happen.

Benefiting from 3D convolutional layers and a bottom-up architecture, STEm-Seg surpasses the other methods on OVIS and obtains an AP of 14.4. Our interpretation is that 3D convolution is conducive to sensing temporal context, and the bottom-up architecture avoids the detection process, which is difficult in occluded scenes. While achieving higher performance than MaskTrack R-CNN on YouTube-VIS, SipMask is inferior to MaskTrack R-CNN on OVIS. We think this is because the one-stage detector adopted in SipMask is inferior to two-stage detectors in dealing with occluded objects whose geometric centers are very close to each other.

By leveraging the feature calibration module, the performance on OVIS is significantly improved. Our CMaskTrack R-CNN leads to an AP improvement of 3.3 over MaskTrack R-CNN (11.8 vs. 15.1), and our CSipMask leads to an AP improvement of 2.8 over SipMask (11.7 vs. 14.5).
Figure 7. Qualitative evaluation of CMaskTrack R-CNN on OVIS. Each row presents the results of 5 frames in a video sequence. (a)-(c) are successful cases and (d) and (e) are failure cases.

Some evaluation examples of CMaskTrack R-CNN on OVIS are given in Fig. 7, including three successful cases, (a)-(c), and two failure cases, (d) and (e). In (a), the car in the yellow mask first blocks the car in the red mask entirely in the 2nd frame, then is entirely blocked by the car in the purple mask in the 4th frame. It is surprising that even in this extreme case, all the cars are well tracked. In (b), we present a crowded scene where almost all the ducks are correctly detected and tracked. In (c), our model successfully tracks the bear in the yellow mask, which is partially occluded by another object, i.e., the bear in the purple mask, and by the background, i.e., the tree. In (d), two persons and two bicycles heavily overlap with each other; our model fails to track the person and segment the bicycle. In (e), when two cars are intersecting, severe occlusion leads to failures of detection and tracking.

We further evaluate the proposed CMaskTrack R-CNN and CSipMask on the YouTube-VIS dataset. As shown in Table 2, CMaskTrack R-CNN and CSipMask surpass their corresponding baselines by 1.8 and 2.6 in terms of AP, respectively, which demonstrates the flexibility and generalization power of the proposed feature calibration module. Moreover, our methods also beat other representative methods by a larger margin, including DeepSORT [45], STEm-Seg [1], etc. In [2], Bertasius and Torresani propose MaskProp by replacing the bounding-box level tracking in MaskTrack R-CNN with a novel mask propagation mechanism. By using a better detection network (Hybrid Task Cascade Network [4]), higher-resolution inputs for the segmentation network, and more training iterations, it obtains a much higher AP of 40.0 on YouTube-VIS. We believe that our module is also pluggable into this strong baseline and better performance could be achieved. Meanwhile, it is also interesting to evaluate the performance of MaskProp on OVIS after its code is released.

4.3. Discussions

Ablation Study. We study the temporal feature calibration module against a few alternatives. The first option is a naive combination, which sums up the features of the query frame and the reference frame without any feature alignment. The second option replaces the correlation operation in our module by calculating the element-wise difference between feature maps, which is similar to the operation used in [2]. We denote the two options as "+ Uncalibrated" and "+ Calibrationdiff", respectively, and our module as "+ Calibrationcorr" in Table 3.

Methods                   AP    AP50  AP75  AR1   AR10
MaskTrack R-CNN [50]      10.8  25.3  8.5   7.9   14.9
[50] + Uncalibrated       12.9  29.4  11.5  8.2   16.6
[50] + Calibrationdiff    14.4  32.6  12.3  8.6   18.9
[50] + Calibrationcorr    15.4  33.9  13.1  9.3   20.0

Table 3. Results of fusing uncalibrated and calibrated features with MaskTrack R-CNN. "+ Uncalibrated" means adding feature maps directly without calibration. "+ Calibrationdiff" means generating the calibration offset based on the element-wise difference between feature maps, similar to what [2] did. "+ Calibrationcorr" is the proposed method in this paper.
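For completeness, the "+ Calibrationdiff" variant can be expressed as a one-line change to the TemporalFeatureCalibration sketch given earlier (again an illustration, not the authors' code): the correlation volume is replaced by an element-wise difference, so the offset branch only sees same-position evidence and its first convolution takes the feature channels rather than d² inputs.

```python
import torch.nn as nn

# "+ Calibration_diff": replace the d*d correlation volume with an element-wise
# difference between the query and reference features.
def difference_volume(f_q, f_r):
    return f_q - f_r  # (b, channels, h, w), fed to the offset convolutions

# The first offset convolution then takes `channels` inputs instead of d*d:
first_offset_conv = nn.Conv2d(256, 256, 3, padding=1)
```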
As we can see, with MaskTrack R-CNN as the base model, even the naive "+ Uncalibrated" combination achieves a decent performance boost, which shows that some form of feature fusion between different frames is necessary and beneficial for accurate video instance segmentation. When feature calibration is applied, the performance is further improved: "+ Calibrationcorr" achieves an AP of 15.4, an improvement of 2.6 over "+ Uncalibrated" and 1.0 over "+ Calibrationdiff".
We argue that the correlation operation provides a richer context for feature calibration because it calculates the similarity between the query position and its neighboring positions, while the element-wise difference only considers differences between the same positions.

We also conduct experiments to analyze the influence of the reference frame range used at inference time. A range of 0 means applying the deformable convolutional layer to the query frame itself. As can be seen in Fig. 8, AP increases as the range increases, reaching the highest value at a range of 5. Even with a range of 1, the performance exceeds the setting with a range of 0, which demonstrates that calibrating features from adjacent frames is beneficial to video instance segmentation.

Figure 8. Results with different reference frame ranges on the OVIS validation set. Notably, a range of 0 indicates applying the deformable convolutional layer to the query frame itself, without leveraging adjacent frames.

Error Analysis. We analyze the error rates of classification, segmentation, and tracking under different occlusion levels to explore the influence of occlusion on video instance segmentation. A segmentation error means that the IoU between the predicted mask of an object and its ground truth is less than 0.5, and tracking errors are reflected by the ID switch rate.
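A hedged sketch of how these per-object error types could be counted is given below. The exact matching protocol behind Table 4 is not specified in the surviving text, so the assumption here is that each ground-truth object in a frame has already been matched to a prediction; per-occlusion-level rates then follow by filtering the matches on the annotated level.

```python
# Counts the three error types described above over ground-truth objects that
# are already matched to predictions: a classification error when the label
# differs, a segmentation error when mask IoU < 0.5, and an ID switch when the
# predicted track identity covering this object changes between frames.
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def error_rates(matches):
    """`matches`: dicts with keys 'gt_label', 'pred_label', 'gt_mask',
    'pred_mask', 'pred_track', and 'prev_pred_track' (None in the first frame
    the object appears)."""
    cls_err = seg_err = id_switch = 0
    for m in matches:
        cls_err += int(m["pred_label"] != m["gt_label"])
        seg_err += int(mask_iou(m["pred_mask"], m["gt_mask"]) < 0.5)
        id_switch += int(m["prev_pred_track"] is not None
                         and m["pred_track"] != m["prev_pred_track"])
    n = max(len(matches), 1)
    return cls_err / n, seg_err / n, id_switch / n
```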
Error type                  Methods                     No occlusion   Slight occlusion   Severe occlusion   All
Classification error rate   MaskTrack R-CNN             41.3%          41.0%              50.1%              42.5%
                            MaskTrack R-CNN + DCN       30.2%          32.9%              45.8%              34.5%
                            CMaskTrack R-CNN (ours)     27.5%          29.0%              39.2%              30.2%
Segmentation error rate     MaskTrack R-CNN             12.1%          25.6%              34.1%              28.3%
                            MaskTrack R-CNN + DCN       13.2%          25.1%              33.2%              28.0%
                            CMaskTrack R-CNN (ours)     13.0%          22.2%              29.5%              25.3%
ID switch rate              MaskTrack R-CNN             18.6%          22.5%              32.6%              22.9%
                            MaskTrack R-CNN + DCN       12.5%          16.1%              26.1%              16.8%
                            CMaskTrack R-CNN (ours)     11.2%          14.0%              21.6%              14.4%

Table 4. Error analysis under different occlusion levels. DCN means applying a deformable convolutional layer on the query frame itself.

As shown in Table 4, under slight occlusion, the error rates of all three types are higher than those under no occlusion. Among them, the segmentation error rate increases the most, from 13.0% to 22.2%. Under severe occlusion, the error rates of classification, segmentation, and tracking all increase significantly, which demonstrates that severe occlusion greatly increases the difficulty of video instance segmentation.

By comparing the error rates of our method with those of the baseline, we find that our method significantly improves classification, segmentation, and tracking in occluded scenes. In the case of severe occlusion, compared with MaskTrack R-CNN, our method reduces the classification error rate from 50.1% to 39.2%, the segmentation error rate from 34.1% to 29.5%, and the ID switch rate from 32.6% to 21.6%.

5. Conclusions

In this work, we target video instance segmentation in occluded scenes and accordingly contribute a large-scale dataset called OVIS. OVIS consists of 296k high-quality instance masks of 5,223 heavily occluded instances. While being the second benchmark dataset after YouTube-VIS, OVIS is designed to examine the ability of current video understanding systems to handle object occlusions. A general conclusion is that baseline performance on OVIS is far below that on YouTube-VIS, which suggests that more effort should be devoted in the future to tackling object occlusions or de-occluding objects [51]. We also explore ways of leveraging temporal context cues to alleviate the occlusion issue, and obtain an AP of 15.1 on OVIS and 35.1 on YouTube-VIS, a remarkable gain over state-of-the-art algorithms.

In the future, we are interested in formalizing the experimental tracks of OVIS for video object segmentation, either in an unsupervised, semi-supervised, or interactive setting. It is also of paramount importance to extend OVIS to video panoptic segmentation [21]. Finally, synthetic occluded data [33] requires further exploration. We believe the OVIS dataset will trigger more research in understanding videos in complex and diverse scenes.
References

[1] Ali Athar, Sabarinath Mahadevan, Aljoša Ošep, Laura Leal-Taixé, and Bastian Leibe. Stem-seg: Spatio-temporal embeddings for instance segmentation in videos. In ECCV, 2020.
[2] Gedas Bertasius and Lorenzo Torresani. Classifying, segmenting, and tracking object instances in video with mask propagation. In CVPR, 2020.
[3] Jiale Cao, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Sipmask: Spatial information preservation for fast image and video instance segmentation. In ECCV, 2020.
[4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, pages 4974–4983, 2019.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI, 40(4):834–848, 2017.
[6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[7] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
[8] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In CVPR, 2015.
[9] Mohsen Fayyaz, Mohammad Hajizadeh Saffar, Mohammad Sabokrou, Mahmood Fathy, Reinhard Klette, and Fay Huang. Stfcn: Spatio-temporal fcn for semantic video segmentation. In ACCV, 2016.
[10] Qianyu Feng, Zongxin Yang, Peike Li, Yunchao Wei, and Yi Yang. Dual embedding learning for video instance segmentation. In ICCVW, 2019.
[11] Yang Fu, Linjie Yang, Ding Liu, Thomas S Huang, and Humphrey Shi. Compfeat: Comprehensive feature aggregation for video instance segmentation. arXiv preprint arXiv:2012.03400, 2020.
[12] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In CVPR, 2017.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] Jay Hegdé, Fang Fang, Scott O Murray, and Daniel Kersten. Preferential responses to occluded objects in the human visual cortex. JOV, 8(4):16–16, 2008.
[16] Yuan Ting Hu, Jia Bin Huang, and Alexander G. Schwing. Videomatch: Matching based video object segmentation. In ECCV, 2018.
[17] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring r-cnn. In CVPR, 2019.
[18] Joakim Johnander, Emil Brissman, Martin Danelljan, and Michael Felsberg. Learning video instance segmentation with recurrent graph neural networks. arXiv preprint arXiv:2012.03911, 2020.
[19] Joakim Johnander, Martin Danelljan, Emil Brissman, Fahad Shahbaz Khan, and Michael Felsberg. A generative appearance model for end-to-end video object segmentation. In CVPR, 2019.
[20] Anna Khoreva, Federico Perazzi, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. Learning video object segmentation from static images. In CVPR, 2017.
[21] Dahun Kim, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Video panoptic segmentation. In CVPR, 2020.
[22] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
[23] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. Pointrend: Image segmentation as rendering. In CVPR, 2020.
[24] Qizhu Li, Xiaojuan Qi, and Philip HS Torr. Unifying training and inference for panoptic segmentation. In CVPR, 2020.
[25] Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, and C. C. Jay Kuo. Instance embedding transfer to unsupervised video object segmentation. In CVPR, 2018.
[26] Xiaoxiao Li and Chen Change Loy. Video object segmentation with joint re-identification and attention-aware mask propagation. In ECCV, 2018.
[27] Yuxi Li, Ning Xu, Jinlong Peng, John See, and Weiyao Lin. Delving into the cyclic mechanism in semi-supervised video object segmentation. NeurIPS, 33, 2020.
[28] Chung-Ching Lin, Ying Hung, Rogerio Feris, and Linglin He. Video instance segmentation tracking with a modified vae architecture. In CVPR, 2020.
[29] Tsung-Yi Lin, Michael Maire, Serge J Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[31] Jonathon Luiten, Philip Torr, and Bastian Leibe. Video instance segmentation 2019: A winning approach for combined detection, segmentation, classification and tracking. In ICCVW, 2019.
[32] Ken Nakayama, Shinsuke Shimojo, and Gerald H Silverman. Stereoscopic depth: its relation to image segmentation, grouping, and the recognition of occluded objects. Perception, 18(1):55–68, 1989.
[33] Sergey I Nikolenko. Synthetic data for deep learning. arXiv, 2019.
[34] David Nilsson and Cristian Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. In CVPR, 2018.
[35] Seoung Wug Oh, Joon Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. Fast video object segmentation by reference-guided mask propagation. In CVPR, 2018.
[36] Seoung Wug Oh, Joon Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, 2019.
[37] Arnold WM Smeulders, Dung M Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah. Visual tracking: An experimental survey. IEEE TPAMI, 36(7):1442–1468, 2013.
[38] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. In CVPR, 2017.
[39] Paul Voigtlaender, Yuning Chai, Florian Schroff, Hartwig Adam, Bastian Leibe, and Liang-Chieh Chen. Feelvos: Fast end-to-end embedding learning for video object segmentation. In CVPR, 2019.
[40] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. Mots: Multi-object tracking and segmentation. In CVPR, 2019.
[41] Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017.
[42] Qiang Wang, Yi He, Xiaoyun Yang, Zhao Yang, and Philip Torr. An empirical study of detection-based video instance segmentation. In ICCVW, 2019.
[43] Wenguan Wang, Hongmei Song, Shuyang Zhao, Jianbing Shen, and Haibin Ling. Learning unsupervised video object segmentation through visual attention. In CVPR, 2019.
[44] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503, 2020.
[45] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In ICIP, 2017.
[46] Jialian Wu, Liangchen Song, Tiancai Wang, Qian Zhang, and Junsong Yuan. Forest r-cnn: Large-vocabulary long-tailed object detection and instance segmentation. In ACM Multimedia, 2020.
[47] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. Upsnet: A unified panoptic segmentation network. In CVPR, 2019.
[48] Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In ECCV, 2018.
[49] Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Huan Huang, Shilei Wen, Errui Ding, and Liusheng Huang. Segment as points for efficient online multi-object tracking and segmentation. In ECCV, 2020.
[50] Linjie Yang, Yuchen Fan, and Ning Xu. Video instance segmentation. In ICCV, 2019.
[51] Xiaohang Zhan, Xingang Pan, Bo Dai, Ziwei Liu, Dahua Lin, and Chen Change Loy. Self-supervised scene de-occlusion. In CVPR, 2020.
[52] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In CVPR, 2017.