DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection

 (The paper is under consideration at Pattern Recognition Letters)

Yuting Su (a), Weikang Wang (a), Jing Liu (a,*), Peiguang Jing (a), Xiaokang Yang (b)

(a) School of Electrical and Information Engineering, Tianjin University, Tianjin, China
(b) MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

* Corresponding author. Email address: jliu tju@tju.edu.cn (Jing Liu)

arXiv:2012.04886v2 [cs.CV] 7 Sep 2021

Abstract

As moving objects draw more attention from human eyes, motion information is always exploited complementarily with static information to detect salient objects in videos. Although effective tools such as optical flow have been proposed to extract temporal motion information, they often encounter difficulties when used for saliency detection due to camera movement or the partial movement of salient objects. In this paper, we investigate the complementary roles of spatial and temporal information and propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of spatiotemporal information. We construct a symmetric two-bypass network to explicitly extract spatial and temporal features. A dynamic weight generator (DWG) is designed to automatically learn the reliability of the corresponding saliency branch, and a top-down cross attentive aggregation (CAA) procedure is designed to facilitate dynamic complementary aggregation of spatiotemporal features. Finally, the features are modulated by spatial attention under the guidance of the coarse saliency map and then pass through the decoder to produce the final saliency map. Experimental results on five benchmarks, VOS, DAVIS, FBMS, SegTrack-v2, and ViSal, demonstrate that the proposed method achieves superior performance compared with state-of-the-art algorithms. The source code is available at https://github.com/TJUMMG/DS-Net.

Keywords: Video salient object detection, Convolutional neural network, Spatiotemporal feature fusion, Dynamic aggregation

1. Introduction

Video Salient Object Detection (VSOD) aims at discovering the most visually distinctive objects in a video. The task is not only to locate the salient region but also to segment the salient object, that is, to separate the salient object pixels from the background pixels. Therefore, the results of VSOD can be applied to a variety of subsequent computer vision tasks, such as motion detection [1], image segmentation [2] and image retrieval [3].

Compared with image salient object detection, VSOD is much more challenging due to the influence of time variance between video frames, camera shake, object movement and other factors. In recent years, with the development of convolutional neural networks (CNNs), a variety of learning based salient object detection networks have been proposed [4, 5, 6, 7, 8], which have made remarkable achievements in this field. In the VSOD task, the motion information between frames plays a significant role, since research has shown that human eyes pay more attention to moving objects in a video [9]. Therefore, motion information contained in optical flow is always used to guide the feature extraction process for more accurate saliency prediction [10, 5]. However, in cases

where the camera moves, or the movement of the salient object is too small, or only part of the salient object is moving, the motion information contained in the optical flow is less correlated with the salient objects. Therefore, indiscriminately aggregating the less reliable motion information with spatial information can hardly improve, and may even ruin, the saliency results.

Figure 1: Illustration of the proposed algorithm for two common yet challenging cases. (a) Noisy optical flow with high spatial contrast (top). (b) Reliable optical flow with complex spatial background (bottom).

To address the changing reliability of static and motion saliency cues, dynamic aggregation of static and motion information according to their reliability is desired. In this paper, we propose a dynamic spatiotemporal network (DS-Net), which is composed of four parts: a symmetric spatial and temporal saliency network, dynamic weight generation, top-down cross attentive aggregation, and final saliency prediction. The symmetric spatial and temporal saliency network explicitly extracts saliency-relevant features from the static frame and the optical flow respectively. Given the multi-scale spatial and temporal features, a novel dynamic weight generator (DWG) is designed to learn dynamic weight vectors representing the reliability of the multi-scale features, which are then used for top-down cross attentive aggregation (CAA). The spatial and temporal saliency maps are also adaptively fused to obtain a coarse saliency map based on the dynamic weight vectors, which is further utilized to eliminate background noise via spatial attention.

As shown in Fig. 1, the reliability of the extracted motion information largely depends on the quality of the optical flow. When it fails to reflect the motion of foreground objects (the top row), the proposed DS-Net can adjust the aggregation process so that the spatial branch plays a dominant role. On the contrary, when the video frame contains complex background noise while the temporal information is more highly correlated with the foreground object (the bottom row), the proposed DS-Net performs equally well by emphasizing the temporal branch via cross attentive aggregation. Experimental results on five large-scale VSOD datasets validate the effectiveness of the proposed algorithm. To summarize, our main contributions are:

1) To tackle the dilemma of the dynamic reliability of static and motion saliency cues, we propose a dynamic spatiotemporal network (DS-Net) which automatically learns the reliability of the spatial and temporal branches and complementarily aggregates the corresponding features in a cross attentive way.
2) A dynamic weight generator is designed to fuse multi-scale features and explicitly learn the weight vector representing the reliability of the input features, which can be used for aggregating multi-scale spatiotemporal features as well as coarse saliency maps.
3) A top-down cross attentive aggregation procedure with a non-linear cross threshold is designed to progressively fuse the complementary spatial and temporal features according to their reliability.

2. Related Work

2.1. Video Salient Object Detection

In the early days, handcrafted spatiotemporal features were commonly employed to locate the salient regions [11] in video sequences. Recently, rapidly developing convolutional neural networks have greatly improved the performance of VSOD algorithms. Wang et al. [7] first applied the fully convolutional network (FCN) to VSOD and achieved remarkable performance. After that, more and more deep learning based methods have been proposed. Li et al. [5] extended an FCN-based image saliency model with flow guided feature warping and a recurrent feature refinement module. Afterwards, Song et al. [4] proposed a parallel dilated bi-directional ConvLSTM structure to effectively fuse static and motion information. Based on the previous works, Fan et al. [6] equipped ConvLSTM with a saliency-shift-aware attention mechanism to solve the saliency shift problem. Some works also investigated the guidance role of motion features. Li et al. [10] proposed a novel motion guided network taking a single static frame and optical flow as input, which achieved excellent performance. Recently, some works have started to focus on designing lightweight VSOD networks.
Figure 2: Overview of the proposed dynamic spatiotemporal network (DS-Net). DWG-S and DWG-T: dynamic weight generator for spatial and
temporal branch. CAA: cross attentive aggregation module.

Gu et al. [12] proposed a novel local self-attention mechanism to capture motion information, which achieved state-of-the-art performance as well as the highest inference speed.

2.2. Fusion of Spatiotemporal Information

The fusion of static and motion information is of primary importance in video saliency detection. In the early days, conventional algorithms mainly relied on techniques including local gradient flow [13], object proposals [14] and so on. At present, there are generally three main categories of learning based spatiotemporal fusion strategies: direct temporal fusion, recurrent fusion, and bypass fusion.

Representative models for direct temporal fusion are [15, 7], which used FCNs to extract spatial features of different frames and fused them via direct concatenation. This simple connection fails to effectively reflect the correlation between spatial and temporal features. For recurrent fusion, convolutional memory units such as ConvLSTM and ConvGRU are usually used [6, 16, 17] to integrate sequence information in a recurrent manner. Though effective in extracting temporal continuity, these memory units suffer from high computational and memory cost. The last type of spatiotemporal fusion employs different bypasses to explicitly extract static and motion information and then fuses the features from the different bypasses. Considering the effectiveness of optical flow, Li et al. [10] proposed a flow-guided feature fusion method, which is simple and effective but highly sensitive to the accuracy of the optical flow. Noticing this drawback and being aware of the complex human visual attention mechanism, we propose to dynamically fuse the multi-modal information so that both spatial and temporal features can show their strength without interfering with each other.

3. The Proposed Algorithm

The whole network structure is illustrated in Fig. 2. The video frame and the optical flow between the current and next frame are input into the symmetric saliency network to extract static and motion features respectively. After that, a dynamic weight generator is used to fuse features of different scales and generate weight vectors. The weight vectors are used to dynamically aggregate the multi-scale spatial and temporal features into the final spatiotemporal features. Then, through a top-down cross attentive feature fusion procedure, the multi-scale spatiotemporal features are progressively fused to get the final saliency map. In addition, the spatial and temporal saliency maps are also fused adaptively into a coarse saliency map based on the weight vectors. For quick convergence and superior performance, the saliency predictions of the spatial and temporal branches, the aggregated coarse saliency prediction, as well as the final saliency prediction are all supervised by the saliency groundtruth with the widely used cross-entropy loss.

3.1. Symmetric Spatial and Temporal Saliency Network

As shown in Fig. 2, DS-Net mainly includes two symmetric branches, which are used to explicitly extract static and motion features respectively. Motivated by [18], both branches employ the DeeplabV3+ structure with ResNet34 [19] as the backbone.
Let the current and the next frame be $I_t$ and $I_{t+1}$ respectively. These two frames are input into FlowNet2.0 [20] to generate the optical flow, denoted as $O_{t,t+1}$. Then $I_t$ and $O_{t,t+1}$ go through the symmetric spatial and temporal saliency network to extract features and predict coarse saliency maps. The spatial features output by the four residual stages of ResNet34 and the ASPP module are denoted as $F_s = \{ f_s^i \mid i = 1, \cdots, 5 \}$. Similarly, the output features of the temporal branch are denoted as $F_t = \{ f_t^i \mid i = 1, \cdots, 5 \}$. The coarse spatial and temporal saliency maps are denoted as $S_s$ and $S_t$.
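To make the symmetric two-bypass design concrete, the following PyTorch sketch shows one plausible way to organize a single branch (a ResNet34 encoder plus a simplified stand-in for the DeeplabV3+/ASPP head). This is a minimal illustration under our own assumptions: the class name `SaliencyBranch`, the single dilated convolution used in place of the full ASPP module, and the toy shapes are not taken from the authors' released code, which should be consulted for the exact architecture.

```python
import torch
import torch.nn as nn
import torchvision

class SaliencyBranch(nn.Module):
    """One bypass of the symmetric network: ResNet34 encoder + a light ASPP-style head.

    Returns five feature maps f^1..f^5 (four residual stages plus the head output)
    and a coarse saliency map, mirroring F_s / S_s (or F_t / S_t) in the paper.
    """
    def __init__(self):
        super().__init__()
        # weights=None here; the paper initializes the backbone from ImageNet-pretrained weights.
        backbone = torchvision.models.resnet34(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stage1, self.stage2 = backbone.layer1, backbone.layer2
        self.stage3, self.stage4 = backbone.layer3, backbone.layer4
        # Simplified stand-in for the ASPP module of DeeplabV3+ (single dilated conv here).
        self.head = nn.Sequential(nn.Conv2d(512, 256, 3, padding=6, dilation=6), nn.ReLU(inplace=True))
        self.predict = nn.Conv2d(256, 1, 1)  # low-resolution coarse saliency logits

    def forward(self, x):
        x = self.stem(x)
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.head(f4)
        coarse = torch.sigmoid(self.predict(f5))
        return [f1, f2, f3, f4, f5], coarse

# The spatial branch takes the RGB frame, the temporal branch takes the optical flow
# (rendered as a 3-channel image), as in the symmetric design described above.
spatial_branch, temporal_branch = SaliencyBranch(), SaliencyBranch()
frame = torch.randn(1, 3, 512, 512)   # I_t
flow = torch.randn(1, 3, 512, 512)    # O_{t,t+1}, e.g. produced by FlowNet2.0
F_s, S_s = spatial_branch(frame)
F_t, S_t = temporal_branch(flow)
```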
3.2. Dynamic Weight Generator

The structure of the proposed dynamic weight generator is illustrated in Fig. 3. Research has shown that features from different layers of a network contain complementary global context and local detailed information [21]. Therefore, the multi-scale features captured from the spatial or temporal branch are jointly employed to generate weight vectors which can better reflect the relative reliability of the different features.

Figure 3: The structure of the dynamic weight generator. F#i denotes the features extracted from the i-th layer of the spatial or temporal saliency branch.

Given the feature pyramid $F_s$ or $F_t$ extracted from the symmetric saliency network, a convolution layer is first applied to transform all feature maps to 64 channels. Afterward, the four deeper but coarser-scale features are upsampled to the finest scale and concatenated with the feature at the finest scale to get a feature with 320 channels. Finally, it goes through a channel reduction layer and global average pooling (GAP) to get a 5-d weight vector. Let $w_s = [w_s^1, w_s^2, w_s^3, w_s^4, w_s^5]$ and $w_t = [w_t^1, w_t^2, w_t^3, w_t^4, w_t^5]$ represent the 5-d weight vectors for the spatial and temporal branches respectively. As will be shown later, the 5-d weight vectors correspond to the reliability of the 5 input features, and they are used for the subsequent attentive feature fusion as well as the coarse saliency map fusion.
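A minimal sketch of the dynamic weight generator described above: per-scale convolutions to 64 channels, upsampling of the four coarser scales, concatenation into 320 channels, channel reduction and global average pooling into a 5-d vector. The default channel list, the 1×1 per-scale convolutions and the 3×3 reduction kernel are assumptions on our part; only the overall flow follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicWeightGenerator(nn.Module):
    """Sketch of the DWG: multi-scale features -> one 5-d reliability vector."""
    def __init__(self, in_channels=(64, 128, 256, 512, 256)):
        super().__init__()
        # One convolution per scale to map every feature map to 64 channels.
        self.transforms = nn.ModuleList([nn.Conv2d(c, 64, 1) for c in in_channels])
        # Channel reduction from the 320-channel concatenation to 5 channels.
        self.reduce = nn.Conv2d(64 * 5, 5, 3, padding=1)

    def forward(self, feats):                # feats: [f^1, ..., f^5], finest first
        target = feats[0].shape[-2:]         # spatial size of the finest scale
        maps = [self.transforms[0](feats[0])]
        for conv, f in zip(self.transforms[1:], feats[1:]):
            # Upsample the four coarser scales to the finest resolution.
            maps.append(F.interpolate(conv(f), size=target, mode='bilinear', align_corners=False))
        fused = torch.cat(maps, dim=1)                 # B x 320 x H x W
        w = self.reduce(fused).mean(dim=(2, 3))        # GAP -> B x 5 weight vector
        return w

# One DWG per branch (DWG-S and DWG-T in Fig. 2); toy multi-scale features below.
feats = [torch.randn(1, c, s, s) for c, s in zip((64, 128, 256, 512, 256), (128, 64, 32, 16, 16))]
w = DynamicWeightGenerator()(feats)   # tensor of shape (1, 5): one reliability weight per scale
```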
3.3. Cross Attentive Aggregation

As mentioned above, human attention is driven by both static and motion cues, but their contributions to salient object detection vary from sequence to sequence. Therefore, it is desirable to dynamically aggregate the spatial and temporal information. Recall that the dynamic weights $w_s^i$ and $w_t^i$ are learned to stand for the relative reliability of their corresponding features $f_s^i$ and $f_t^i$. A straightforward way to get the spatiotemporal feature is the weighted summation:

$F_{agg}^{i} = w_{s}^{i} \cdot f_{s}^{i} + w_{t}^{i} \cdot f_{t}^{i}.$    (1)

Although it seems reasonable, there is a key risk of unstable weighting. As $w_s$ and $w_t$ are derived separately from two independent networks with totally different inputs (i.e., the static frame and the optical flow), there is no guarantee that $w_s$ and $w_t$ can serve as weighting factors, in other words, that they are comparable.

Figure 4: Illustration of cross attentive aggregation.

To address this issue, instead of direct aggregation using independent weight vectors, we propose a simple yet efficient cross attentive aggregation scheme to jointly evaluate the relative reliability of the spatial and temporal features. Fig. 4 illustrates the top-down cross attentive aggregation of the coarser scale feature $F_{i+1}$ with the spatial feature $f_s^i$ and the temporal feature $f_t^i$. As shown in the figure, the spatial and temporal dynamic weights $w_s^i$ and $w_t^i$ are first normalized with a 'cross' softmax to obtain the regularized dynamic weights $v_{s/t}^{i} = \frac{e^{w_{s/t}^{i}}}{e^{w_{s}^{i}} + e^{w_{t}^{i}}}$.

As mentioned above, it is highly likely that one single saliency cue (spatial or temporal) is obviously much better than the other. In this case, indiscriminate aggregation of the poor saliency branch will heavily degrade the overall performance even if the other saliency cue performs excellently. To further suppress the distracting impact from the noisy or invalid saliency branch, a non-linear cross
thresholding is proposed as:

$(u_{s}^{i}, u_{t}^{i}) = \begin{cases} (0,\, v_{t}^{i}) & \text{if } v_{t}^{i} - v_{s}^{i} > \tau \\ (v_{s}^{i},\, 0) & \text{if } v_{t}^{i} - v_{s}^{i} < -\tau \\ (v_{s}^{i},\, v_{t}^{i}) & \text{otherwise,} \end{cases}$    (2)

where $\tau$ is a threshold. The generated dynamic weights $u_{s}^{i}$ and $u_{t}^{i}$ are then used to weight the corresponding features extracted from the spatial and temporal branches to form the complementary spatiotemporal features, which can be expressed as:

$F_{agg}^{i} = u_{s}^{i} \cdot f_{s}^{i} + u_{t}^{i} \cdot f_{t}^{i}.$    (3)

It is worth noting that, compared with existing cross attention modules [22, 23] where complex operations such as correlations are generally carried out on the whole feature maps, the proposed cross attentive aggregation is of extremely low complexity, since the interaction between the two bypasses is carried out on the 5-d dynamic weight vectors, which can be easily obtained as described above.
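The 'cross' softmax and the non-linear cross thresholding of Eq. (2), together with the weighted sum of Eq. (3), translate almost directly into code. Below is a small sketch assuming batched 5-d weight vectors and feature maps; τ = 0.6 follows the value reported later in Section 4.4.

```python
import torch

def cross_attentive_weights(w_s, w_t, tau=0.6):
    """Cross softmax of the two 5-d weight vectors followed by the non-linear
    cross thresholding of Eq. (2); w_s and w_t have shape (B, 5)."""
    exp_s, exp_t = torch.exp(w_s), torch.exp(w_t)
    v_s = exp_s / (exp_s + exp_t)            # 'cross' softmax, v^i_s
    v_t = exp_t / (exp_s + exp_t)            # 'cross' softmax, v^i_t
    u_s = torch.where(v_t - v_s > tau, torch.zeros_like(v_s), v_s)   # spatial cue suppressed
    u_t = torch.where(v_t - v_s < -tau, torch.zeros_like(v_t), v_t)  # temporal cue suppressed
    return u_s, u_t

def aggregate(f_s, f_t, u_s, u_t, i):
    """Eq. (3): F^i_agg = u^i_s * f^i_s + u^i_t * f^i_t for scale i (0-based here)."""
    us = u_s[:, i].view(-1, 1, 1, 1)         # broadcast the scalar weight over the map
    ut = u_t[:, i].view(-1, 1, 1, 1)
    return us * f_s + ut * f_t

# toy check with a batch of one
w_s = torch.tensor([[0.2, 1.5, 0.1, 0.3, 0.9]])
w_t = torch.tensor([[0.1, 0.2, 0.4, 0.2, 0.8]])
u_s, u_t = cross_attentive_weights(w_s, w_t)
f_agg = aggregate(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128), u_s, u_t, 0)
```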
Table 1: The parameters of the three convolution kernels in the five CAA modules. 'ks', 'pd' and 'chn' denote the kernel size, padding and output channel number, respectively.

         Conv1            Conv2            Conv3
         ks  pd  chn      ks  pd  chn      ks  pd  chn
CAA 1    3   1   64       3   1   64       3   1   64
CAA 2    3   1   128      3   1   128      3   1   64
CAA 3    5   2   256      5   2   256      5   2   128
CAA 4    7   3   512      7   3   512      7   3   256
CAA 5    3   1   256      3   1   256      3   1   512

For the coarse scale feature $F_{i+1}$, it first goes through three convolution layers to gradually adjust the channel number to match that of the spatiotemporal feature at the current scale, $F_{agg}^{i}$. The parameters of the convolution kernels in the five CAA modules are listed in Table 1. Then, the coarse scale feature is upsampled to match the resolution of $F_{agg}^{i}$. The final progressive aggregation can then be formulated as:

$F_{i} = T(F_{i+1}) + F_{agg}^{i}, \quad i = 1, 2, 3, 4,$    (4)

where $T(\cdot)$ stands for three convolution layers and one upsampling layer. For the coarsest scale $i = 5$, only the current scale of features is available, so $F_{5} = F_{agg}^{5}$. The final output feature after the proposed top-down progressive aggregation scheme can be denoted as $F_{final} = T(F_{1})$.
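A sketch of one top-down step of Eq. (4), where T(·) is realized as three convolutions followed by upsampling. The ReLU activations, the single shared output channel number, and the example shapes are our assumptions (the gradual channel adjustment is simplified to one step here); the per-module kernel sizes, paddings and channels of the actual CAA modules are those of Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownStep(nn.Module):
    """One step of the top-down aggregation of Eq. (4): F_i = T(F_{i+1}) + F^i_agg."""
    def __init__(self, in_ch, out_ch, ks=3):
        super().__init__()
        pad = ks // 2
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, ks, padding=pad), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, ks, padding=pad), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, ks, padding=pad), nn.ReLU(inplace=True))

    def forward(self, coarse, f_agg):
        x = self.convs(coarse)                                   # three convolutions: T(.) part 1
        x = F.interpolate(x, size=f_agg.shape[-2:],              # upsampling: T(.) part 2
                          mode='bilinear', align_corners=False)
        return x + f_agg                                         # residual-style sum with F^i_agg

# Illustrative channel choice near the finest scale (see Table 1 for the actual settings).
step1 = TopDownStep(in_ch=128, out_ch=64, ks=3)
f2 = torch.randn(1, 128, 64, 64)        # already-aggregated coarser-scale feature F_2
f1_agg = torch.randn(1, 64, 128, 128)   # fused spatiotemporal feature at scale 1
f1 = step1(f2, f1_agg)                  # Eq. (4): F_1 = T(F_2) + F^1_agg
```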
3.4. Spatial Attention based on Coarse Saliency Map

In the dynamic weight generation module, the 5-d weight vectors are learned to represent the reliability of the input saliency features. When the feature extracted by one branch is better than that from the other branch, larger weights are assigned to it; otherwise, smaller weights are assigned. Under this circumstance, the 5-d vectors $u_s$ and $u_t$ as a whole can be regarded as the reliability of the corresponding spatial and temporal saliency branch. Therefore, we sum up each 5-d vector to get a global dynamic reliability weight for the corresponding saliency map. Given the spatial saliency map $S_s$ and the temporal saliency map $S_t$, the coarse saliency map is derived as:

$S_{c} = \epsilon_{s} \cdot S_{s} + \epsilon_{t} \cdot S_{t},$    (5)

where $\epsilon_{s} = \frac{\sum_{i} v_{s}^{i}}{\sum_{i} v_{s}^{i} + \sum_{i} v_{t}^{i}}$ and $\epsilon_{t} = \frac{\sum_{i} v_{t}^{i}}{\sum_{i} v_{s}^{i} + \sum_{i} v_{t}^{i}}$ are the reliability weights for the corresponding branches. With this attentive aggregation strategy, the model can not only generate a more reliable coarse saliency map, but also promote the adaptive learning of the weight vectors.

As shown in Fig. 2 and motivated by [10], the coarse saliency map is adopted as a guidance to suppress background noise in the spatiotemporal feature $F_{final}$. The output feature $F_{att}$ can be written as:

$F_{att} = F_{final} \cdot S_{c} + F_{final}.$    (6)

Finally, the feature map $F_{att}$ is passed to a decoder to get the final saliency map $S_{f}$.
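Eqs. (5) and (6) amount to a weighted blend of the two coarse saliency maps followed by residual saliency-guided attention; a direct sketch with assumed tensor shapes is given below.

```python
import torch

def coarse_saliency(S_s, S_t, v_s, v_t):
    """Eq. (5): S_c = eps_s * S_s + eps_t * S_t, where eps_s/eps_t are obtained by
    summing the regularized 5-d weight vectors of the two branches."""
    sum_s = v_s.sum(dim=1, keepdim=True)               # (B, 1)
    sum_t = v_t.sum(dim=1, keepdim=True)
    eps_s = (sum_s / (sum_s + sum_t)).view(-1, 1, 1, 1)
    eps_t = (sum_t / (sum_s + sum_t)).view(-1, 1, 1, 1)
    return eps_s * S_s + eps_t * S_t

def spatial_attention(F_final, S_c):
    """Eq. (6): F_att = F_final * S_c + F_final (residual, saliency-guided attention)."""
    return F_final * S_c + F_final

# toy shapes: saliency maps (B,1,H,W), fused feature (B,64,H,W), regularized weights (B,5)
S_s, S_t = torch.rand(1, 1, 128, 128), torch.rand(1, 1, 128, 128)
v_s, v_t = torch.rand(1, 5), torch.rand(1, 5)
S_c = coarse_saliency(S_s, S_t, v_s, v_t)
F_att = spatial_attention(torch.randn(1, 64, 128, 128), S_c)
```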
4. Experimental Results

4.1. Implementation

The proposed model is built on the PyTorch framework. We initialize the ResNet34 backbone with weights pretrained on ImageNet, and we train our model on the training sets of one image dataset, DUTS [24], and three video datasets, DAVIS [25], FBMS [26] and VOS [27]. Considering that the input of our network includes optical flow images, for the image dataset the input optical flow images are filled with zeros, which means that these frames do not contain motion information.

During training, all input frames are resized to 512 × 512. The Adam optimizer with an initial learning rate of 5e-5, a weight decay of 0.0005 and a momentum of 0.9 is used to train our model, and the training batch size is set to 8. The proposed model is trained with a simple one-step end-to-end strategy. Experiments are performed on a workstation with an NVIDIA GTX 1080Ti GPU and a 2.1 GHz Intel CPU.
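The training setup described above can be summarized in a short configuration sketch. The interpretation of the reported momentum of 0.9 as Adam's β1, the equal weighting of the four supervision terms, and the use of `BCELoss` on sigmoid outputs are our assumptions; the paper only states the hyper-parameters listed above and that all four predictions are supervised with the widely used cross-entropy loss.

```python
import torch
import torch.nn as nn

def build_optimizer(model):
    # Hyper-parameters as reported in Section 4.1 (lr 5e-5, weight decay 5e-4);
    # the momentum of 0.9 is interpreted here as Adam's first-moment coefficient beta1.
    return torch.optim.Adam(model.parameters(), lr=5e-5,
                            betas=(0.9, 0.999), weight_decay=0.0005)

bce = nn.BCELoss()

def total_loss(S_s, S_t, S_c, S_f, gt):
    """All four predictions (spatial, temporal, coarse, final) are supervised with the
    ground truth; equal weighting of the terms is an assumption of this sketch."""
    return bce(S_s, gt) + bce(S_t, gt) + bce(S_c, gt) + bce(S_f, gt)
```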
4.2. Datasets and Evaluation Criteria

We evaluate our model on five public video salient object detection benchmarks, including DAVIS [25], FBMS [26], ViSal [13], SegTrackV2 [28], and VOS [27]. Mean absolute error (MAE), Structure-measure (S-m) [29] and max F-measure (maxF) [30] are adopted as the evaluation metrics.

4.3. Performance Comparison

Our proposed method is compared with 11 state-of-the-art methods, including 4 conventional methods, MBD [31], MSTM [32], STBP [33] and SCOM [34], and 7 learning based methods, SCNN [35], FCNS [7], FGRNE [5], PDB [4], SSAV [6], MGA [10] and PCSA [12]. We use the evaluation code provided by [6] for fair comparison. The maxF, S-measure, and MAE results are listed in Table 2.

Table 2: Comparison with state-of-the-art VSOD algorithms. '↑' means larger is better and '↓' means smaller is better. In the backbone row, '-' indicates a traditional (non-learning) method and '*' denotes an unfixed backbone; in the result rows, '-' indicates that the model has been trained on that dataset. The top two methods in each row are marked in bold in the original paper.

Method           MBD    MSTM   STBP   SCOM   SCNN   FCNS   FGRNE  PDB       SSAV      MGA        PCSA       Ours
Backbone         -      -      -      -      VGG    VGG    *      ResNet50  ResNet50  ResNet101  MobileNet  ResNet34

DAVIS   maxF ↑   0.470  0.429  0.544  0.783  0.714  0.708  0.783  0.855     0.861     0.892      0.880      0.891
        S ↑      0.597  0.583  0.677  0.832  0.783  0.794  0.838  0.882     0.893     0.912      0.902      0.914
        MAE ↓    0.177  0.165  0.096  0.048  0.064  0.061  0.043  0.028     0.028     0.022      0.022      0.018
FBMS    maxF ↑   0.487  0.500  0.595  0.797  0.762  0.759  0.767  0.821     0.865     0.907      0.831      0.875
        S ↑      0.609  0.613  0.627  0.794  0.794  0.794  0.809  0.851     0.879     0.910      0.866      0.895
        MAE ↓    0.206  0.177  0.152  0.079  0.095  0.091  0.088  0.064     0.040     0.026      0.041      0.034
ViSal   maxF ↑   0.692  0.673  0.622  0.831  0.831  0.852  0.848  0.888     0.939     0.940      0.940      0.950
        S ↑      0.726  0.749  0.629  0.762  0.847  0.881  0.861  0.907     0.939     0.941      0.946      0.949
        MAE ↓    0.129  0.095  0.163  0.122  0.071  0.048  0.045  0.032     0.020     0.016      0.017      0.013
VOS     maxF ↑   0.562  0.336  0.403  0.690  0.609  0.675  0.669  0.742     0.742     0.745      0.747      0.801
        S ↑      0.661  0.551  0.614  0.712  0.704  0.760  0.715  0.818     0.819     0.806      0.827      0.855
        MAE ↓    0.158  0.145  0.105  0.162  0.109  0.099  0.097  0.078     0.073     0.070      0.065      0.060
SegV2   maxF ↑   0.554  0.526  0.640  0.764  -      -      -      0.800     0.801     0.828      0.810      0.832
        S ↑      0.618  0.643  0.735  0.815  -      -      -      0.864     0.851     0.885      0.865      0.875
        MAE ↓    0.146  0.114  0.061  0.030  -      -      -      0.024     0.023     0.026      0.025      0.028

As shown in Table 2, our method achieves the best performance on DAVIS, ViSal and VOS, and the second best performance on FBMS and SegV2. For DAVIS, the proposed method outperforms the second best model MGA by 0.4% in S-measure and 18.2% in MAE. For ViSal, our algorithm surpasses the second best
model PCSA by 1.1% in maxF and 0.3% in S-measure. As for VOS, the proposed method significantly outperforms the second best method PCSA by 7.2% in maxF and 3.4% in S-measure.

Fig. 5 gives a visual comparison of saliency maps with state-of-the-art algorithms. Five diverse yet challenging cases are included. From top to bottom are cases with a complex background (e.g., the crowd in the 1st row), a poor static saliency cue (e.g., low color contrast in the 2nd row), a poor dynamic saliency cue (e.g., a moving camera in the 3rd row), multiple salient objects (e.g., two rabbits in the 4th row), and a distracting non-salient object (e.g., vehicles on the road in the 5th row). It is obvious that our method can accurately segment the salient objects in all these cases, demonstrating its effectiveness and generalization capability.

Figure 5: Visual comparison with state-of-the-art VSOD methods.

4.4. Sensitivity of Hyper-parameter

Since the proposed algorithm automatically generates the dynamic weights based on the spatial and temporal features, the only key hyper-parameter of our algorithm is the non-linear threshold τ in CAA. The non-linear cross thresholding is designed to suppress the interference from poor saliency cues, and τ affects the degree of suppression. A small τ may lead to over-suppression of an informative saliency cue, while a large τ may weaken the ability of the non-linear cross thresholding to suppress distracting information. We train the proposed model with different thresholds and the results are shown in Fig. 6. As τ = 0.6 achieves the best performance, it is set to 0.6 empirically throughout this paper.

Figure 6: Sensitivity analysis on the non-linear threshold τ. The performance is tested on the DAVIS dataset.

4.5. Ablation Study

In this section, we first explore the effectiveness of each key module of the proposed algorithm in an incremental manner and then specifically study the structure of each key module in detail. We employ the DeeplabV3+ structure as the spatial and temporal feature extractor, and the symmetric feature extraction structure with a direct top-down sum-up operation is adopted as our baseline. The maxF and S-measure performance on the DAVIS dataset is listed in Table 3. It is obvious that the performance is gradually improved as the various modules are successively added to the baseline model. In order to further prove the effectiveness of each module, we conduct a series of experiments for each component on three representative datasets, including the commonly used DAVIS dataset, the large-scale VOS dataset and the ViSal dataset, which is used only for testing.

Table 3: Ablation study of the proposed DS-Net on the DAVIS dataset.

                    M1     M2     M3     M4     M5     M6     M7 (Ours)
Spatial             X
Temporal                   X
Symmetric                         X      X      X      X      X
DWG                                      X      X      X      X
CAA                                             X      X      X
Attention                                              X      X
Multi-Supervision                                             X
maxF ↑              0.815  0.772  0.844  0.861  0.877  0.882  0.891
S ↑                 0.867  0.811  0.886  0.897  0.907  0.910  0.914

4.5.1. Dynamic Weight Generator

To demonstrate the effectiveness of the proposed dynamic weight generator, we explore two variants of the weight generation strategy. The first variant is denoted as DWG-SEP, where the feature of each scale is separately processed by global average pooling and convolution to generate a scalar. The second variant is denoted as DWG-FC, where the multi-scale features are concatenated but directly passed through a global average pooling layer and a fully connected layer. The performance of these two variants on three datasets is compared with the proposed model in Table 4. The effectiveness of the proposed DWG for boosting performance is clearly validated by the consistent improvement on all three datasets. It can also be observed that DWG-SEP performs worse than DWG-FC because it fails to consider the correlation among multi-scale features.

Table 4: Ablation study of the proposed dynamic weight generator.

         DAVIS                 ViSal                 VOS
         maxF   S      MAE     maxF   S      MAE     maxF   S      MAE
M-SEP    0.881  0.906  0.021   0.932  0.035  0.019   0.792  0.848  0.061
M-FC     0.877  0.907  0.020   0.932  0.939  0.018   0.796  0.852  0.064
Ours     0.891  0.914  0.018   0.950  0.949  0.013   0.801  0.855  0.060
In addition, we display the feature maps of five different scales in the spatial and temporal branches, as well as the corresponding weights generated by the proposed DWG module, in Fig. 7. Two different situations are considered: the temporal features greatly reflect the movement of the foreground object in 'car-roundabout' (the first two rows), and the features extracted from the flow map are quite noisy in 'boat2' (the last two rows). The visualization results show that higher weights are assigned to features that are more highly correlated with the foreground objects. Therefore, the generated weights can effectively guide the feature fusion process by emphasizing saliency-relevant features.

Figure 7: Multi-scale feature maps and their corresponding weights generated by the proposed DWG. F#i represents the feature maps of the i-th layer.

4.5.2. Cross Attentive Aggregation

To demonstrate the effectiveness of the proposed cross attentive aggregation module, three different variations of the aggregation method are taken into account. The first and second variants discard the softmax and the non-linear operation in CAA respectively, denoted as CAA-woS and CAA-woN. The third variant is denoted as CAA-BU, which adopts a bottom-up fusion strategy, i.e., the spatial and temporal features at the same scale are fused and progressively fed into the coarser scale. The results of these three variants and the proposed model are shown in Table 5. The proposed module shows a significant performance gain on the three datasets. Moreover, we can also find that the non-linear operation plays a more important role for the SOD task than the softmax operation, which we ascribe to its ability to get rid of noisy features. The top-down fusion strategy also brings a large performance gain compared with the bottom-up fusion strategy.

Table 5: Ablation study of the proposed cross attentive aggregation module.

         DAVIS                 ViSal                 VOS
         maxF   S      MAE     maxF   S      MAE     maxF   S      MAE
M-WoS    0.883  0.909  0.019   0.949  0.948  0.014   0.794  0.855  0.062
M-WoN    0.877  0.905  0.022   0.947  0.947  0.016   0.780  0.843  0.061
M-BU     0.866  0.898  0.023   0.933  0.938  0.017   0.793  0.852  0.060
Ours     0.891  0.914  0.018   0.950  0.949  0.013   0.801  0.855  0.060

4.5.3. Coarse Saliency Map

As analyzed in Section 3.4, the operations based on the coarse saliency map, e.g., the multiple supervision, the weighted aggregation and the spatial attention, can further improve the performance. To prove this, we conducted a group of ablation experiments: M-SS represents the model trained in a single-supervision manner; M-AGGS denotes a simplified coarse saliency aggregation method which aggregates the spatial and temporal coarse saliency maps by direct addition; M-woATT denotes the model without spatial attention. As shown in Table 6, the proposed method obtains the best performance on all three datasets, proving the effectiveness of the various operations based on the coarse saliency map. Besides, the final saliency maps of the proposed DS-Net and M-woATT are shown in Fig. 8. The proposed saliency maps contain less background noise and enjoy higher accuracy, which also demonstrates the effectiveness of the proposed spatial attention.

Table 6: Ablation study of the coarse saliency based operations.

          DAVIS                 ViSal                 VOS
          maxF   S      MAE     maxF   S      MAE     maxF   S      MAE
M-SS      0.882  0.910  0.022   0.950  0.949  0.014   0.794  0.854  0.056
M-AGGS    0.887  0.905  0.021   0.946  0.942  0.016   0.789  0.851  0.063
M-woATT   0.884  0.911  0.019   0.943  0.945  0.016   0.795  0.854  0.061
Ours      0.891  0.914  0.018   0.950  0.949  0.013   0.801  0.855  0.060

Figure 8: Visual comparison of M-woATT and the proposed method.
5. Conclusion

Noticing the high variation in the reliability of static and motion saliency cues for video salient object detection, in this paper we propose a dynamic spatiotemporal network (DS-Net) to complementarily aggregate spatial and temporal information. Unlike existing methods that treat saliency cues in an indiscriminate way, the proposed algorithm is able to automatically learn the relative reliability of spatial and temporal information and then carry out dynamic fusion of features as well as coarse saliency maps. Specifically, a dynamic weight generator and a top-down cross attentive aggregation approach are designed to obtain dynamic weights and facilitate complementary aggregation. Extensive experimental results on five benchmark datasets show that our DS-Net achieves superior performance compared with state-of-the-art VSOD algorithms.

References

[1] L. Maczyta, P. Bouthemy, O. L. Meur, CNN-based temporal detection of motion saliency in videos, Pattern Recognit. Lett. 128 (2019) 298–305.
[2] F. Sun, W. Li, Saliency guided deep network for weakly-supervised image segmentation, Pattern Recognit. Lett. 120 (2019) 62–68.
[3] H. Wang, Z. Li, Y. Li, B. B. Gupta, C. Choi, Visual saliency guided complex image retrieval, Pattern Recognit. Lett. 130 (2020) 64–72.
[4] H. Song, W. Wang, S. Zhao, J. Shen, K.-M. Lam, Pyramid dilated deeper ConvLSTM for video salient object detection, in: ECCV, 2018, pp. 744–760.
[5] G. Li, Y. Xie, T. Wei, K. Wang, L. Lin, Flow guided recurrent neural encoder for video salient object detection, in: IEEE CVPR, 2018, pp. 3243–3252.
[6] D. Fan, W. Wang, M. Cheng, J. Shen, Shifting more attention to video salient object detection, in: IEEE CVPR, 2019, pp. 8546–8556.
[7] W. Wang, J. Shen, L. Shao, Video salient object detection via fully convolutional networks, IEEE Transactions on Image Processing 27 (2018) 38–49.
[8] W. Zou, S. Zhuo, Y. Tang, S. Tian, X. Li, C. Xu, STA3D: Spatiotemporally attentive 3D network for video saliency prediction, Pattern Recognit. Lett. 147 (2021) 78–84.
[9] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 1254–1259.
[10] H. Li, G. Chen, G. Li, Y. Yu, Motion guided attention for video salient object detection, in: IEEE ICCV, 2019, pp. 7273–7282.
[11] Z. Chen, R. Wang, M. Yu, H. Gao, Q. Li, H. Wang, Spatial-temporal multi-task learning for salient region detection, Pattern Recognit. Lett. 132 (2020) 76–83.
[12] Y. Gu, L. Wang, Z. Wang, Y. Liu, M. Cheng, S. Lu, Pyramid constrained self-attention network for fast video salient object detection, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020, pp. 10869–10876.
[13] W. Wang, J. Shen, L. Shao, Consistent video saliency using local gradient flow optimization and global refinement, IEEE Transactions on Image Processing 24 (2015) 4185–4196.
[14] F. Guo, W. Wang, J. Shen, L. Shao, J. Yang, D. Tao, Y. Y. Tang, Video saliency detection using object proposals, IEEE Transactions on Cybernetics 48 (2018) 3159–3170.
[15] Y. Wei, X. Liang, Y. Chen, X. Shen, M. Cheng, J. Feng, Y. Zhao, S. Yan, STC: A simple to complex framework for weakly-supervised semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017) 2314–2320.
[16] L. Jiang, M. Xu, T. Liu, M. Qiao, Z. Wang, DeepVS: A deep learning based video saliency prediction approach, in: ECCV, 2018, pp. 625–642.
[17] P. Yan, G. Li, Y. Xie, Z. Li, C. Wang, T. Chen, L. Lin, Semi-supervised video salient object detection using pseudo-labels, in: IEEE ICCV, 2019, pp. 7283–7292.
[18] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, in: ECCV, 2018, pp. 833–851.
[19] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE CVPR, 2016, pp. 770–778.
[20] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox, FlowNet 2.0: Evolution of optical flow estimation with deep networks, in: IEEE CVPR, 2017, pp. 1647–1655.
[21] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, P. H. S. Torr, Deeply supervised salient object detection with short connections, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2019) 815–828.
[22] X. Wei, T. Zhang, Y. Li, Y. Zhang, F. Wu, Multi-modality cross attention network for image and sentence matching, in: IEEE CVPR, 2020, pp. 10938–10947.
[23] L. Ye, M. Rochan, Z. Liu, Y. Wang, Cross-modal self-attention network for referring image segmentation, in: IEEE CVPR, 2019, pp. 10494–10503.
[24] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, X. Ruan, Learning to detect salient objects with image-level supervision, in: IEEE CVPR, 2017, pp. 3796–3805.
[25] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, A. Sorkine-Hornung, A benchmark dataset and evaluation methodology for video object segmentation, in: IEEE CVPR, 2016, pp. 724–732.
[26] T. Brox, J. Malik, Object segmentation by long term analysis of point trajectories, in: ECCV, 2010, pp. 282–295.
[27] P. Chockalingam, N. Pradeep, S. Birchfield, Adaptive fragments-based tracking of non-rigid objects using level sets, in: IEEE ICCV, 2009, pp. 1530–1537.
[28] F. Li, T. Kim, A. Humayun, D. Tsai, J. M. Rehg, Video segmentation by tracking many figure-ground segments, in: IEEE ICCV, 2013, pp. 2192–2199.
[29] D. Fan, M. Cheng, Y. Liu, T. Li, A. Borji, Structure-measure: A new way to evaluate foreground maps, in: IEEE ICCV, 2017, pp. 4558–4567.
[30] R. Achanta, S. Hemami, F. Estrada, S. Susstrunk, Frequency-tuned salient region detection, in: IEEE CVPR, 2009, pp. 1597–1604.
[31] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, R. Mech, Minimum barrier salient object detection at 80 fps, in: IEEE ICCV, 2015, pp. 1404–1412.
[32] W. Tu, S. He, Q. Yang, S. Chien, Real-time salient object detection with a minimum spanning tree, in: IEEE CVPR, 2016, pp. 2334–2342.
[33] T. Xi, W. Zhao, H. Wang, W. Lin, Salient object detection with spatiotemporal background priors for video, IEEE Transactions on Image Processing 26 (2017) 3425–3436.
[34] Y. Chen, W. Zou, Y. Tang, X. Li, C. Xu, N. Komodakis, SCOM: Spatiotemporal constrained optimization for salient object detection, IEEE Transactions on Image Processing 27 (2018) 3345–3357.
[35] Y. Tang, W. Zou, Z. Jin, Y. Chen, Y. Hua, X. Li, Weakly supervised salient object detection with spatiotemporal cascade neural networks, IEEE Transactions on Circuits and Systems for Video Technology 29 (2019) 1973–1984.