Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding


Shentong Mo (Carnegie Mellon University, Pittsburgh, United States) shentonm@andrew.cmu.edu
Daizong Liu (Peking University, Beijing, China) dzliu@stu.pku.edu.cn
Wei Hu* (Peking University, Beijing, China) forhuwei@pku.edu.cn
* Corresponding author.
arXiv:2203.03838v1 [cs.CV] 8 Mar 2022

Abstract

Query-based video grounding is an important yet challenging task in video understanding, which aims to localize the target segment in an untrimmed video according to a sentence query. Most previous works achieve significant progress by addressing this task in a fully-supervised manner with segment-level labels, which require high labeling cost. Although some recent efforts develop weakly-supervised methods that only need video-level knowledge, they generally match multiple pre-defined segment proposals with the query and select the best one, which lacks the fine-grained frame-level details needed for distinguishing frames with high repeatability and similarity within the entire video. To alleviate the above limitations, we propose a self-contrastive learning framework to address the query-based video grounding task under a weakly-supervised setting. Firstly, instead of utilizing redundant segment proposals, we propose a new grounding scheme that learns frame-wise matching scores referring to the query semantics to predict the possible foreground frames by only using the video-level annotations. Secondly, since some predicted frames (i.e., boundary frames) are relatively coarse and exhibit similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm to learn more discriminative frame-wise representations for distinguishing the false positive frames. In particular, we iteratively explore multi-scale hard negative samples that are close to positive samples in the representation space for distinguishing fine-grained frame-wise details, thus enforcing more accurate segment grounding. Extensive experiments on two challenging benchmarks demonstrate the superiority of our proposed method compared with the state-of-the-art methods.

1. Introduction

Query-based video grounding has attracted increasing attention due to its wide spectrum of applications in video understanding [5, 11, 24, 30]. This task aims to determine the start and end timestamps of a target segment in an untrimmed video that contains an activity semantically corresponding to a given sentence description, as shown in Figure 1. Most previous works [6, 10, 19, 20, 36, 39] have achieved significant performance by addressing query-based video grounding in a fully-supervised manner, which however requires a large amount of segment-level annotations (the location of the target segment in the video according to the semantics of the matched query). Such manual annotation is quite labor-intensive and time-consuming, thus limiting the wide applicability of query-based video grounding.

Recently, some weakly-supervised works [8, 14, 25, 26, 47] have been proposed to alleviate the above issue by only leveraging the video-level knowledge of matched video-query pairs without detailed segment labels. These methods generally pre-define multiple segment proposals, and employ video-level annotations as supervision to learn the segment-query matching scores for selecting the best one. However, the generated segment proposals are redundant and contain many negative (i.e., false) samples, resulting in inferior effectiveness and efficiency of the models. Further, the positive (i.e., correct) proposals covering the accurate foreground frames are of high similarity [41] and require more sophisticated intra-modal recognition capabilities to distinguish. Especially for the boundary frames, which exhibit similar visual appearance to the foreground frames in a certain segment, some of them are background frames that are hard to recognize. Once such a segment is selected as the best one, the grounding performance will be degraded by the background noise.

To this end, we propose a novel Multi-scale Self-Contrastive Learning (MSCL) paradigm with a hard negative mining strategy for weakly-supervised query-based video grounding, aiming to learn fine-grained frame-wise semantic matching by progressively sampling harder negative-positive frames for discriminative feature learning.
[Figure 1: an untrimmed video with the query "Person walks through the doorway." and its target segment spanning 1.90s to 8.10s. A frame-scale panel shows the frame-wise score curve and a segment-scale panel the segment-wise score curve; on both, negative samples are mined iteratively (steps 1 to n), with positive, negative, and hard negative samples marked.]

Figure 1. Illustration of the proposed multi-scale self-contrastive learning for weakly-supervised query-based video grounding.

In particular, instead of relying on redundant segment proposals for matching and selection, we propose to learn more fine-grained frame-wise matching scores to predict whether each frame is a foreground frame. Once the scores of successive frames are larger than a learnable threshold, they are taken to construct the predicted segment, leading to more efficient grounding. The threshold is acquired from the frame-wise scores learned and enhanced by our developed multi-scale self-contrastive learning paradigm. We achieve such fine-grained frame-wise representation learning with the following twofold novelties.

Firstly, since we only resort to video-level annotations with no access to frame-level knowledge, we propose frame-wise matching score prediction to estimate the score of each frame matched to the video-level annotations in a weakly-supervised manner, as well as the frame-wise matching weight via an attention mechanism, in order to choose possible foreground frames with query semantics. Secondly, in order to improve the frame-wise score prediction by enhancing the frame-wise representations, we propose a self-contrastive learning with a multi-scale hard negative mining strategy, which especially discriminates frames adjacent to the ground-truth target segment, referred to as hard negative samples as shown in Figure 1. In particular, we dynamically set a range according to the previously estimated scores so as to select positive and negative samples, where samples with scores within the range serve as negative frames and those above the range as positive ones. The range is progressively updated to exploit harder negative samples that are more similar to the positive ones, which leads to more discriminative feature learning. Note that our contrastive learning strategy is quite different from the previous vanilla ones [28, 43, 47] in this task, since they all utilize a one-step algorithm to define constant negative samples with coarse frame-level representations. Compared to them, we employ a multi-step process to iteratively mine the negative samples in a coarse-to-fine manner, defining harder negative samples and thus leading to more discriminative frame-wise representation learning. Further, we explore hard negative samples from different hierarchies, namely the local frame-scale and the nonlocal segment-scale, thus learning multi-scale intrinsic features.

Specifically, given the input video and query, we first encode both visual and textual features, and align their semantics by a video-query attention mechanism to learn the cross-modal interactions. Next, we predict frame-wise matching scores referring to the interacted features for foreground frame (inside the target segment) localization, as well as the frame-wise contribution weights for their importance estimation, which are aggregated to compute the overall semantic dependency between the video-query pair. Further, we enhance the frame-wise representations by learning more discriminative features via the proposed multi-scale self-contrastive learning strategy, where both frame-scale and segment-scale hard negative sampling are deployed in a coarse-to-fine manner.
Our main contributions are summarized as follows:

• We propose a novel self-contrastive learning framework for weakly-supervised query-based video grounding, which predicts fine-grained frame-wise matching scores referring to the query semantics for more accurate segment localization.

• We propose a multi-scale hard negative mining in the self-contrastive learning to learn discriminative frame-wise representations by adaptively sampling hard negatives at the frame-scale and the segment-scale respectively, which captures both local and nonlocal intrinsic patterns.

• Extensive experiments demonstrate that the proposed MSCL model outperforms the state-of-the-art methods significantly on two challenging benchmarks.

2. Related Work

Fully-supervised query-based video grounding. Most existing works address the video grounding task in a fully-supervised manner, where both the annotations of video-sentence pairs and the corresponding segment boundaries are given. Traditional methods [6, 10] utilize the proposal-based framework that samples video segment proposals through dense sliding windows and subsequently integrates the query with these proposal representations via a matrix operation. To further mine the cross-modal interaction more effectively, some works [15–18, 21–23, 36, 37, 39, 42, 45, 46] integrate the sentence representation with those pre-defined segment proposals individually, and then evaluate their matching relationships. The proposal with the highest matching score is selected as the target segment. Although the proposal-based methods can achieve significant performance, they severely rely on the quality of the proposals and are very time-consuming. Instead of utilizing segment proposals, recent proposal-free methods [1, 2, 27, 40] directly regress the temporal locations of the target segment. Specifically, they either regress the start/end timestamps based on the entire video representation [1, 27], or predict at each frame whether this frame is a start or end boundary [2, 40]. These works are much more efficient than the proposal-based ones, but achieve relatively lower performance. However, both proposal-based and proposal-free methods heavily rely on a large amount of human annotations that are hard to collect in practice.

Weakly-supervised query-based video grounding. As manually annotating temporal boundaries of target moments is time-consuming, recent research attention has shifted to developing weakly-supervised video grounding models [4, 26, 32, 33], which only require video-level annotations. [26] proposed the first weakly-supervised model to learn a joint embedding space for video and query representations. [8] developed a two-stream structure to measure the moment-query consistency and conduct moment selection simultaneously. Although the above methods have achieved promising performance, they are two-stage approaches that utilize multi-scale sliding windows to generate moment candidates, therefore suffering from inferior effectiveness and efficiency. To address this issue, [14, 25, 47] further improve the segment-sentence matching accuracy, and score all the moments sampled at different scales in a single pass. [38] employs a reinforcement learning framework to refine the segment boundary. However, almost all of the existing methods rely on segment proposals for matching and selection, which fail to capture and distinguish more fine-grained details among visually similar frames for acquiring more accurate segment boundaries.

Contrastive Learning. Contrastive learning [3, 9] is a self-supervised learning paradigm that has demonstrated its effectiveness in many tasks, such as image classification, object detection, and point cloud classification. Previous works [28, 43, 47] also showed promising results of contrastive learning in video grounding. Typically, [28] proposed a dual contrastive learning loss function by utilizing video-level samples for video-to-video and video-to-query representation learning. A Counterfactual Contrastive Learning framework [47] is designed to distinguish video-level embeddings between counterfactual positive and negative samples for hard negative sampling. Our model differs significantly from these methods: 1) we leverage only the single input video to perform contrastive learning over frames and segments, without resorting to different video instances as in previous works; 2) we propose multi-scale hard negative sampling at the frame scale and the segment scale to iteratively capture both local and non-local intrinsic feature representations.

3. Methodology

3.1. Overview

We focus on weakly-supervised query-based video grounding. Given an untrimmed video and a language query, the goal is to localize the start and end time of the temporal moment corresponding to the query. As illustrated in Figure 2, the proposed MSCL model mainly consists of four modules:

• Multi-modal encoding. Given the multi-modal input, we first employ video and query encoders to encode both visual and textual features, and then interact the cross-modal information for semantic alignment.
[Figure 2: the MSCL pipeline. A video encoder and a query encoder (with shared feature encoders after fully-connected projections) feed a video-query interaction module (multi-modal encoding). A frame score head and a frame weight head then predict frame-wise matching scores and weights, which are aggregated into the final video-query score (frame-wise matching score prediction). Multi-scale self-contrastive learning iteratively mines negative samples at the frame scale and the segment scale over the frame-wise and segment-wise score curves. At inference, segment localization splits the frame-wise score curve at the upper bound to output the grounded segment (e.g., 1.83s to 8.17s).]

Figure 2. The overall framework of our proposed MSCL model.

• Frame-wise matching score prediction. After generating query-specific video representations by the cross-modal interaction, we predict frame-wise matching scores and frame-wise matching weights referring to the aligned multi-modal features for choosing the possible foreground frames within the video.

• Multi-scale self-contrastive learning. In order to improve the frame-wise score prediction by enhancing the frame-wise representations, we perform self-contrastive learning at both the frame-scale and the segment-scale with progressive hard negative mining to distinguish more fine-grained frame-wise details, thus enforcing more accurate segment grounding.

• Segment localization. At inference time, we first construct possible segments by choosing the consecutive frames with scores higher than the threshold, and then select the best segment by comparing the average scores of the internal frames within each segment.

We elaborate on the four modules in order as follows.

3.2. Multi-Modal Encoding

Video and query encoders. For each video, following previous works, we first employ a pre-trained C3D model [34] to extract its frame-level features $V = \{v_i\}_{i=1}^{n} \in \mathbb{R}^{n \times D_v}$, where $n$ is the number of frames and $D_v$ is the feature dimension. For each query, we deploy the GloVe model [29] to obtain word-level embeddings $Q = \{q_i\}_{i=1}^{m} \in \mathbb{R}^{m \times D_q}$, where $m$ is the number of words and $D_q$ is the feature dimension. Then both video and query features are projected into the same latent space by two fully-connected layers to generate $V'$ and $Q'$ with the same dimension $D$, where $V' \in \mathbb{R}^{n \times D}$ and $Q' \in \mathbb{R}^{m \times D}$. After that, we feed $V'$, $Q'$ into the modality-specific encoders $f_v(\cdot)$, $f_q(\cdot)$ to generate the final visual representations $\tilde{V}$ and query embeddings $\tilde{Q}$, that is, $\tilde{V} = f_v(V')$, $\tilde{Q} = f_q(Q')$. Here, $f_v(\cdot)$ and $f_q(\cdot)$ share weights and consist of four convolution layers, followed by a multi-head attention layer [35].

Video-query interaction. We further apply a video-query attention mechanism to learn the cross-modal interactions, where we calculate the similarity score $\mathcal{S} \in \mathbb{R}^{n \times m}$ between video and query features, and use the SoftMax operation along the rows and columns to generate $\mathcal{S}_r$ and $\mathcal{S}_c$, respectively. Next, we compute the video-to-query ($\bar{V}$) and query-to-video ($\bar{Q}$) attention contexts [44] as:

  $\bar{V} = \mathcal{S}_r \cdot \tilde{Q} \in \mathbb{R}^{n \times D}, \quad \bar{Q} = \mathcal{S}_r \cdot \mathcal{S}_c^{\top} \cdot \tilde{V} \in \mathbb{R}^{n \times D}$.   (1)

Then a single feed-forward layer FFN (composed of multiple linear layers) is applied to generate the interacted output features $V^q \in \mathbb{R}^{n \times D}$:

  $V^q = \mathrm{FFN}\big([\tilde{V}; \bar{V}; \tilde{V} \odot \bar{V}; \tilde{V} \odot \bar{Q}]\big)$,   (2)

where $\odot$ denotes the Hadamard product.
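To make the interaction concrete, below is a minimal PyTorch sketch of Eqs. (1)-(2), assuming batched tensors for the encoded features $\tilde{V}$ and $\tilde{Q}$; the two-layer FFN, the ReLU activation, and the hidden width are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoQueryInteraction(nn.Module):
    """Sketch of Eqs. (1)-(2): similarity, row/column softmax, and FFN fusion."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # FFN over the concatenation [V~; V_bar; V~*V_bar; V~*Q_bar] (4*dim -> dim).
        self.ffn = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v: (B, n, D) encoded video features, q: (B, m, D) encoded query features.
        s = torch.bmm(v, q.transpose(1, 2))                         # similarity S in R^{n x m}
        s_r = F.softmax(s, dim=2)                                   # row-wise softmax
        s_c = F.softmax(s, dim=1)                                   # column-wise softmax
        v_bar = torch.bmm(s_r, q)                                   # video-to-query context, Eq. (1)
        q_bar = torch.bmm(torch.bmm(s_r, s_c.transpose(1, 2)), v)   # query-to-video context, Eq. (1)
        fused = torch.cat([v, v_bar, v * v_bar, v * q_bar], dim=-1)
        return self.ffn(fused)                                      # V^q in R^{n x D}, Eq. (2)

# Usage with random features (batch of 2, 128 frames, 20 words, D = 512):
interaction = VideoQueryInteraction(512)
out = interaction(torch.randn(2, 128, 512), torch.randn(2, 20, 512))
print(out.shape)  # torch.Size([2, 128, 512])
```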
3.3. Frame-wise Matching Score Prediction

In the weakly-supervised setting, we only have access to the knowledge of the matched video-query pair without the corresponding detailed segment-level annotations. In order to determine which frame is matched with the query semantics and how much the frame contributes to the final grounding, we introduce a score-based self-supervised branch to predict frame-wise matching scores and frame-wise matching weights for choosing the most possible foreground frames. Specifically, we devise a frame score head $h_s(\cdot)$ and a frame weight head $h_w(\cdot)$ to predict the corresponding matching score $s_i$ and weight $w_i$ for each frame $i$, respectively. Here, both $h_s(\cdot)$ and $h_w(\cdot)$ are composed of three linear layers. The $\mathcal{S} = \{s_i\}_{i=1}^{n}$ and $\mathcal{W} = \{w_i\}_{i=1}^{n}$ are formulated as:

  $\mathcal{S} = \mathrm{Sigmoid}(h_s(V^q)); \quad \mathcal{W} = \mathrm{Softmax}(h_w(V^q))$.   (3)

Then, for the $k$-th video and $k$-th query in each batch, their final semantic matching score $\hat{s}_{k,k}$ is calculated as $\hat{s}_{k,k} = \sum_{i=1}^{n} s_i \cdot w_i$. In addition, we also estimate the similarity score (utilizing dot-product attention) between the video features $\tilde{V}$ and the query features $\tilde{Q}$ to measure their distance. The overall score objective is defined as:

  $\mathcal{L}_{score} = -\log \dfrac{\sum_{k=1}^{K} \big(\hat{s}_{k,k} + \tilde{V}_k \cdot \tilde{Q}_k\big)}{\sum_{k=1}^{K} \sum_{j=1}^{K} \big(\hat{s}_{k,j} + \tilde{V}_k \cdot \tilde{Q}_j\big)}$,   (4)

where $\hat{s}_{k,j}$ represents the overall video score corresponding to the $j$-th query features and the $k$-th video features in the same batch, $\hat{s}_{k,k}$ denotes the score of the matched video-query pair, and $K$ denotes the batch size. In this way, we maximize the overall score of video and query features from correct pairs while minimizing the score of false pairs. After getting the matching scores of all frames, we take them as pseudo labels to provide better supervision for iteratively training the following contrastive learning module, and the learned discriminative features in turn further lead to more precise matching score prediction.
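A minimal sketch of the score heads in Eq. (3) and the score objective in Eq. (4) could look as follows; the pooled features used for the similarity term and the way the pairwise scores $\hat{s}_{k,j}$ are assembled are assumptions made for illustration (in the full model each $\hat{s}_{k,j}$ comes from interacting video $k$ with query $j$).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameHeads(nn.Module):
    """Sketch of Eq. (3): frame score head h_s and frame weight head h_w over V^q."""
    def __init__(self, dim: int = 512):
        super().__init__()
        def head():
            # three linear layers, as described in Sec. 3.3
            return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))
        self.h_s, self.h_w = head(), head()

    def forward(self, vq: torch.Tensor):
        # vq: (n, D) interacted features of one video-query pair
        s = torch.sigmoid(self.h_s(vq)).squeeze(-1)           # frame scores, (n,)
        w = torch.softmax(self.h_w(vq).squeeze(-1), dim=-1)   # frame weights, (n,)
        return s, w

def score_loss(s_hat: torch.Tensor, v_feat: torch.Tensor, q_feat: torch.Tensor):
    """Sketch of Eq. (4). s_hat[k, j] = sum_i s_i * w_i for video k interacted with query j;
    v_feat, q_feat: (K, D) pooled video / query features (assumed L2-normalized)."""
    sim = v_feat @ q_feat.t()                  # V~_k . Q~_j for all pairs
    pos = (s_hat.diag() + sim.diag()).sum()    # matched pairs (numerator)
    return -torch.log(pos / (s_hat + sim).sum())

# Toy usage: K = 4 pairs, n = 128 frames, D = 512.
heads = FrameHeads(512)
rows = []
for k in range(4):
    row = []
    for j in range(4):
        s, w = heads(torch.randn(128, 512))    # stands in for V^q of pair (k, j)
        row.append((s * w).sum())
    rows.append(torch.stack(row))
s_hat = torch.stack(rows)                      # (K, K) pairwise matching scores
loss = score_loss(s_hat, F.normalize(torch.randn(4, 512), dim=-1),
                  F.normalize(torch.randn(4, 512), dim=-1))
print(loss.item())
```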
3.4. Multi-scale Self-contrastive Learning

In order to discriminate the frame-wise representations for more accurate prediction of matching scores, we propose a multi-scale self-contrastive learning paradigm with hard negative mining to capture more discriminative frame-wise representations in a coarse-to-fine manner. Specifically, we iteratively mine hard negative samples that are close to positive samples in the representation space with a multi-step strategy. In each step, we first choose the positive and negative frames according to their predicted frame-wise scores at the frame-scale, and then consider the one positive segment with the highest segment score while taking the other segments as negative samples at the segment-scale. Then, we perform both frame-scale and segment-scale contrastive learning to learn more discriminative fine-grained frame-wise details. The updated frame-wise features in turn provide more accurate matching scores to mine harder negative samples in the next step of learning. By performing multi-scale self-contrastive learning with such an iterative strategy, our model is able to enforce more accurate segment grounding. We illustrate the details of both frame- and segment-scale negative mining of each step in the following.

Frame-scale. In order to mine hard negative frames that are close to positive frames in the representation space, we iteratively assign a lower bound $b_l$ and an upper bound $b_u$ to select positive and negative frames. The lower bound $b_l$ and the upper bound $b_u$ are defined from the frame-wise scores as:

  $b_l = b_l^0 \cdot \delta^{\,e - e_0}, \quad b_u = \frac{1}{n}\sum_{i=1}^{n} s_i$,   (5)

where $\delta$ is the increasing step, and $e$ and $e_0$ denote the current epoch and the update cycle of epochs, respectively. $b_l^0$ is the initial value of $b_l$. We set $b_l^0 = 10^{-8}$, $e_0 = 50$, $\delta = 10$ during training, that is, we increase $b_l$ exponentially by a factor of 10 every 50 epochs after the warm-up stage.

Accordingly, we consider frames with scores greater than $b_u$ as positive frames, and other frames with scores ranging from $b_l$ to $b_u$ as negative frames. The loss function of the frame-scale contrastive learning is defined as:

  $\mathcal{L}_{fra} = -\log \dfrac{\sum_{k=1}^{K} \tilde{V}_k \cdot \tilde{Q}_k \cdot p_k^f}{\sum_{k=1}^{K} \tilde{V}_k \cdot \tilde{Q}_k \cdot n_k^f}$,   (6)

where $p_k^f$ and $n_k^f$ denote the binary index masks of positive and negative frames at batch index $k$, respectively. The entries of the corresponding indices are 1 and the others are 0.
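The bound update of Eq. (5) and the frame-scale loss of Eq. (6) can be sketched as below. Two points are assumptions for illustration: the bound is raised by a factor of $\delta$ once every $e_0$ epochs (following the description in the text and Figure 4, rather than the literal exponent $e - e_0$), and the contrastive term is taken as the per-frame dot product between frame features and the pooled query feature.

```python
import torch

def update_bounds(scores: torch.Tensor, epoch: int,
                  b_l0: float = 1e-8, e0: int = 50, delta: float = 10.0):
    """Sketch of Eq. (5): exponentially growing lower bound, mean-score upper bound.
    Assumes epoch >= e0, i.e., mining starts after the warm-up stage."""
    b_l = b_l0 * delta ** ((epoch - e0) // e0)   # raise b_l by a factor of delta every e0 epochs
    b_u = scores.mean(dim=-1, keepdim=True)      # (B, 1) per-video mean frame score
    return b_l, b_u

def frame_scale_loss(scores, v_feat, q_feat, b_l, b_u, eps: float = 1e-8):
    """Sketch of Eq. (6): contrast positives (score > b_u) against hard negatives (b_l < score <= b_u)."""
    pos_mask = (scores > b_u).float()                        # p^f, (B, n)
    neg_mask = ((scores > b_l) & (scores <= b_u)).float()    # n^f, (B, n)
    sim = (v_feat * q_feat.unsqueeze(1)).sum(-1)             # per-frame similarity V~_k . Q~_k, (B, n)
    pos = (sim * pos_mask).sum()
    neg = (sim * neg_mask).sum()
    return -torch.log(pos / (neg + eps) + eps)

# Usage with toy data (B = 4 videos, n = 128 frames, D = 512); the video features are
# correlated with the query so the similarities stay positive in this demo.
scores = torch.rand(4, 128)
q = torch.randn(4, 512)                              # pooled query features Q~
v = q.unsqueeze(1) + 0.1 * torch.randn(4, 128, 512)  # frame-wise video features V~
b_l, b_u = update_bounds(scores, epoch=100)
print(frame_scale_loss(scores, v, q, b_l, b_u))
```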

Segment-scale. In order to enforce more accurate segment grounding predictions, we locate the predicted segments $\{g_t\}_{t=1}^{T}$ with consecutive indices and calculate the segment score by averaging the scores of the internal frames located in the segment. Then we consider the segment with the highest segment score as the positive segment and the other segments as negative samples. The loss function of the segment-scale contrastive learning is formulated as:

  $\mathcal{L}_{seg} = -\log \dfrac{\sum_{k=1}^{K} \tilde{V}_k \cdot \tilde{Q}_k \cdot p_k^g}{\sum_{k=1}^{K} \tilde{V}_k \cdot \tilde{Q}_k \cdot n_k^g}$,   (7)

where $p_k^g$ and $n_k^g$ denote the binary index masks of frames located in positive and negative segments at batch index $k$, respectively. The entries of the corresponding indices are 1 and the others are 0.

The overall objective of our model is minimized in an end-to-end manner and formulated as:

  $\mathcal{L} = \mathcal{L}_{score} + \lambda_{fra} \cdot \mathcal{L}_{fra} + \lambda_{seg} \cdot \mathcal{L}_{seg}$,   (8)

where $\lambda_{fra}$ and $\lambda_{seg}$ denote the weighting hyper-parameters of the frame loss and the segment loss, respectively. In the experiments, we set $\lambda_{fra} = 10$ and $\lambda_{seg} = 5$.

The overall algorithm of our training approach is summarized in Algorithm 1, where we utilize an iterative strategy to gradually mine the hard negative samples. In order to enforce the model to predict accurate positive samples with higher confidence, we first warm up our model in the first 50 epochs without the procedure of hard negative sampling. Then, we iteratively mine the hard negative samples with high similarity to the positive ones.

Algorithm 1  Multi-scale contrastive learning algorithm
Input: video and query features $V$ and $Q$, iteration number $L$
 1: Initialize the parameters of $f_v(\cdot)$, $f_q(\cdot)$, $h_s(\cdot)$, $h_w(\cdot)$ and the bounds $b_l$, $b_u$
 2: Warm up the model for 50 epochs without hard negative sampling
 3: for iteration $l \leftarrow 1$ to $L$ do
 4:   Encode features $V$, $Q$ and calculate $V^q$ as in Eq. 2
 5:   Predict frame scores and weights as in Eq. 3
 6:   Calculate the score loss as in Eq. 4
 7:   Update $b_l$, $b_u$ as in Eq. 5
 8:   Calculate the frame and segment losses as in Eq. 6 and Eq. 7
 9:   Compute the total loss as in Eq. 8
10:   Update the parameters of $f_v(\cdot)$, $f_q(\cdot)$, $h_s(\cdot)$, $h_w(\cdot)$
11: end for
Output: $f_v(\cdot)$, $f_q(\cdot)$, $h_s(\cdot)$, $h_w(\cdot)$

3.5. Segment Localization

At inference time, we select the segment with the highest segment score as the final prediction. Specifically, we extract all possible segments by choosing consecutive frames with scores higher than the upper bound $b_u$, and calculate the segment score by averaging the scores of the internal frames located in the segment. Then we take the segment with the highest segment score as the final output.
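The inference-time grouping described above can be sketched as follows; treating one score per fixed-length unit (the hypothetical `frame_duration` argument) and returning (start, end) in seconds are assumptions for illustration.

```python
from typing import List, Tuple
import torch

def localize_segment(scores: torch.Tensor, b_u: float,
                     frame_duration: float = 1.0) -> Tuple[float, float]:
    """Sketch of Sec. 3.5: group consecutive frames with score > b_u into candidate
    segments, score each candidate by its mean internal frame score, return the best."""
    keep = (scores > b_u).tolist()
    segments: List[Tuple[int, int]] = []
    start = None
    for i, k in enumerate(keep):                 # collect runs of consecutive kept frames
        if k and start is None:
            start = i
        elif not k and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(keep) - 1))
    if not segments:                             # fall back to the single best frame
        i = int(scores.argmax())
        segments = [(i, i)]
    best = max(segments, key=lambda se: scores[se[0]:se[1] + 1].mean().item())
    return best[0] * frame_duration, (best[1] + 1) * frame_duration

# Usage: scores peaking between frames 20 and 40, one frame per second.
scores = torch.zeros(128)
scores[20:41] = 0.8
print(localize_segment(scores, b_u=float(scores.mean())))  # (20.0, 41.0)
```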
4. Experiments

4.1. Datasets and Evaluation Metrics

Charades-STA. The Charades-STA dataset [7] is built based on the Charades [31] dataset, which contains 6,672 videos of indoor activities and involves 16,128 query-video pairs. There are 12,408 pairs used for training and 3,720 used for testing. The average duration of each video is 29.76 seconds. Each video has 2.4 annotated moments and each annotated moment lasts for 8 seconds on average.

ActivityNet-Caption. The ActivityNet-Caption dataset [13] contains 20,000 videos with 100,000 queries, where 37,421 query-video pairs are used for training and 34,536 are used for testing. The average duration of the videos is 1 minute and 50 seconds. On average, each video in ActivityNet-Caption has 3.65 annotated moments and each annotated moment lasts for 36 seconds.

Evaluation metrics. Following previous works, we adopt the metrics "R@n, IoU=m" to evaluate our model, where "R@n, IoU=m" denotes the proportion of queries for which at least one of the top-n moment candidates has an IoU with the ground truth larger than m. Specifically, we set n to 1, 5 and set m to 0.3, 0.5, 0.7 on the Charades-STA dataset and to 0.1, 0.3, 0.5 on the ActivityNet-Caption dataset.
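For reference, here is a minimal sketch of the "R@n, IoU=m" metric as described above, assuming each prediction is a ranked list of (start, end) candidates in seconds; the function and variable names are illustrative.

```python
from typing import List, Tuple

def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def recall_at_n(predictions: List[List[Tuple[float, float]]],
                ground_truths: List[Tuple[float, float]],
                n: int, iou_threshold: float) -> float:
    """R@n, IoU=m: fraction of queries whose top-n candidates contain at least one
    candidate with IoU > m against the ground-truth segment."""
    hits = sum(
        any(temporal_iou(cand, gt) > iou_threshold for cand in cands[:n])
        for cands, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Usage with two queries: ranked candidate lists vs. ground-truth segments.
preds = [[(1.8, 8.2), (10.0, 15.0)], [(30.0, 40.0)]]
gts = [(1.9, 8.1), (26.0, 37.0)]
print(recall_at_n(preds, gts, n=1, iou_threshold=0.5))  # 0.5
```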
4.2. Experimental Settings

To make a fair comparison with previous methods like [47], we extract video features from the pre-trained C3D network [34] and query features from the 300-d GloVe embedding [29]. We train the model for 200 epochs with a batch size of 16. We use warm-up training without hard negative sampling for 50 epochs. The dimension of the encoded features is set to 512. Our model is optimized by Adam [12] with an initial learning rate of 0.01 and linear decay of the learning rate. All experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU.

                          Charades-STA                                     ActivityNet-Caption
Method             R@1                     R@5                       R@1                      R@5
            IoU=0.3  0.5    0.7     IoU=0.3  0.5    0.7       IoU=0.1  0.3    0.5      IoU=0.1  0.3    0.5
TGA          32.14  19.94   8.84     86.58  65.52  33.51          -      -      -          -      -      -
CTF          39.80  27.30  12.90        -      -      -         74.20  44.30  23.60        -      -      -
ReLoCLNet       -      -      -         -      -      -            -   42.65  28.54        -      -      -
SCN          42.96  23.58   9.97     95.56  71.80  38.87        71.48  47.23  29.22        -   71.45  55.69
MARN            -   31.94  14.81        -   70.00  37.40           -   47.01  29.95        -   72.02  57.49
RTBPN        60.04  32.36  13.24     97.48  71.85  41.18        73.73  49.77  29.63     93.89  79.89  60.56
VGN+CCL         -   33.21  15.68        -   73.50  41.87           -   50.12  31.07        -   77.36  61.29
Ours         58.92  43.15  23.49     98.02  81.23  48.45        75.61  55.05  38.23     95.26  82.72  68.05

Table 1. Comparison results on the Charades-STA and ActivityNet-Caption datasets.

4.3. Comparison with State-of-the-art Methods

Charades-STA. We compare our method against the current state-of-the-art methods under the weakly-supervised setting on the Charades-STA dataset in Table 1. As can be seen, we achieve the best performance over all baselines. Particularly, our MSCL outperforms VGN+CCL [47], the current state-of-the-art method, by large margins of 9.94% and 7.81% in terms of R@1, IoU=0.5 and 0.7, and of 7.73% and 6.58% in terms of R@5, IoU=0.5 and 0.7, respectively. This shows the superiority of our multi-scale hard negative sampling strategy in weakly-supervised video grounding.

ActivityNet-Caption. Table 1 also reports the comparison results with the state-of-the-art methods on the ActivityNet-Caption dataset. We can observe that our MSCL achieves superior performance against all previous works. This further demonstrates the advantage of the multi-scale hard negative sampling on distinguishing fine-grained frame-wise details and enforcing more accurate segment grounding.

4.4. Ablation Study

In this part, we conduct extensive ablation studies on our MSCL, covering each module (frame-wise matching score prediction and frame-/segment-scale hard negative sampling), the effect of the batch size, and the hyper-parameters. Unless specified, we perform all ablation studies on the Charades-STA benchmark.

Effect of each module. In order to understand how each module in our MSCL affects the final performance, we explore the effect of each proposed loss as shown in Table 2. Our model with frame-wise matching score prediction outperforms the baseline by 6.72%, 1.18%, and 0.94% in terms of the three criteria, which shows the effectiveness of this module. Introducing frame-scale and segment-scale hard negative sampling separately boosts the performance of our model with frame-wise matching score prediction only. Furthermore, by adding frame-scale and segment-scale hard negative sampling together, we observe the highest gains of 7.74%, 21.79%, and 13.71%. This demonstrates the superiority of our multi-scale hard negative sampling over baselines.

Lscore  Lfra  Lseg         R@1
                    IoU=0.3      IoU=0.5      IoU=0.7
  ✗      ✗     ✗   44.19±0.25   19.95±0.21    8.60±0.18
  ✓      ✗     ✗   50.91±0.19   21.13±0.18    9.54±0.16
  ✓      ✓     ✗   55.22±0.12   29.73±0.09   14.52±0.08
  ✓      ✗     ✓   51.94±0.16   33.66±0.11   16.77±0.09
  ✓      ✓     ✓   58.65±0.08   42.92±0.05   23.25±0.03

Table 2. Ablation study for the effect of each module.

Analysis of frame and segment loss. Furthermore, we conduct extensive experiments to explore the weighting hyper-parameters of the frame and segment losses used in multi-scale hard negative sampling in Table 3. Specifically, we take the values of λfra and λseg from (1, 5, 10) for different control settings. With the increase of λfra, the performance of our model decreases since we fail to consider the more global information of segments in the whole video. However, with the increase of λseg, more global information of segments is introduced such that the results of our MSCL are improved. As can be seen, our MSCL reaches the best performance when λfra=10 and λseg=5, which implies the importance of balancing the weights of the frame-wise and segment-wise hard negative sampling during training.

λfra  λseg         R@1
            IoU=0.3      IoU=0.5      IoU=0.7
  1     1   58.65±0.08   42.92±0.05   23.25±0.03
  5     1   58.01±0.07   42.58±0.05   22.75±0.03
 10     1   58.22±0.07   42.75±0.04   22.96±0.02
  1     5   58.85±0.05   43.07±0.03   23.36±0.01
  1    10   58.52±0.07   42.87±0.04   23.18±0.02
 10     5   58.92±0.04   43.15±0.02   23.49±0.01

Table 3. Ablation study for λfra and λseg.

Robustness to the batch size. In this part, we analyze the effect of the batch size on the final performance of our MSCL, as shown in Table 4, where we set the batch size to 4, 8, 16, 32, and 48. From Table 4, we observe that our MSCL with a batch size of 16 achieves the best results in terms of all metrics. Meanwhile, the performance of our model does not change much as the batch size varies. This further validates the robustness of our MSCL to the choice of the batch size. In other words, we do not need the large batch size that is desired in previous contrastive-learning-based methods [43, 47] for weakly-supervised video grounding.

4.5. Visualization Results

In this section, we provide more detailed visualization results on how our MSCL predicts more accurate segment grounding results given the frame score curves. Qualitative examples of hard negative sampling and foreground frame predictions on the two benchmarks are visualized to validate the superiority of our MSCL.

Frame-wise matching scores. In order to better understand the effectiveness of the segment localization in our MSCL, we plot the frame-wise score curves with respect to the frame index among the positive frames in Figure 3. As can be seen, many hard negative samples with high frame scores appear in the training process. With the multi-scale hard negative sampling, our MSCL predicts the segment with the highest score among those greater than the upper bound $b_u$ as the final output, which matches the target segment. This shows the importance of the segment localization for more accurate segment grounding.

Hard Negative Mining. In Figure 4, we also visualize the hard negative samples at epoch = 0, 50, 100, 150. We observe that the hard negative samples with high scores are progressively closer to the positive frames, which validates the effectiveness of our multi-scale hard negative sampling.

Qualitative Results. To validate the superiority of our MSCL in a qualitative manner, we visualize qualitative examples from the Charades-STA and ActivityNet-Caption benchmarks in Figure 5.
Batch Size            R@1                                   R@5
           IoU=0.3     IoU=0.5     IoU=0.7     IoU=0.3     IoU=0.5     IoU=0.7
  4      58.53±0.08  42.81±0.06  23.19±0.04  97.88±0.05  80.97±0.04  48.12±0.04
  8      58.81±0.05  43.03±0.03  23.38±0.02  97.96±0.04  81.14±0.04  48.36±0.03
 16      58.92±0.03  43.15±0.01  23.49±0.01  98.02±0.02  81.23±0.02  48.45±0.01
 32      58.91±0.02  43.12±0.01  23.45±0.01  97.98±0.01  81.19±0.01  48.42±0.01
 48      58.85±0.02  43.09±0.02  23.41±0.01  97.92±0.02  81.13±0.02  48.36±0.02

Table 4. Exploration study on the effect of the batch size.

Figure 3. Visualization of frame score curves used for segment localization. Red dotted lines denote the upper bound $b_u$. The ground-truth frame indices are 49-60, 53-62, 6-23, and 26-37.

[Figure 4: for a video with ground-truth segment [27, 112], the frame-wise scores of sampled frame indices (1 to 235) are shown at epoch 0 ($b_l$ = 0, $b_u$ = 1e-6, all scores around 1e-6), epoch 50 ($b_l$ = 1e-8, $b_u$ = 1e-5), epoch 100 ($b_l$ = 1e-7, $b_u$ = 1e-3), and epoch 150 ($b_l$ = 1e-6, $b_u$ = 1e-1); the gray-shadowed hard negatives concentrate ever closer to the ground-truth boundaries.]

Figure 4. An illustration of the dynamic process of hard negative sampling (gray shadow) at epoch = 0, 50, 100, 150. GT denotes the ground-truth (red indices denote the segment boundaries of the ground-truth). It shows that our iterative mining strategy mines harder negative samples as the steps go on, leading to more discriminative frame-wise representation learning.

By comparison, our MSCL achieves better performance than SCN [14] and VGN+CCL [47]. Particularly, we achieve more accurate results on the boundary of the ground-truth segment due to the effectiveness of our multi-scale hard negative sampling strategy.

5. Conclusion

In this work, we propose a novel multi-scale self-contrastive learning model for weakly-supervised query-based video grounding. Instead of utilizing redundant segment proposals for semantic matching, we predict frame-wise scores and weights for matching fine-grained frame-wise features with query semantics. In order to learn more discriminative frame-wise representations for predicting accurate frame-wise scores, we further introduce a multi-scale self-contrastive learning with a multi-step hard negative mining strategy to progressively discriminate hard negative samples that are close to positive samples in the representation space. This iterative approach captures fine-grained frame-scale details as well as segment-scale semantics for distinguishing frames with high repeatability and similarity within the entire video. Experimental results show that our proposed model outperforms state-of-the-art methods on two challenging benchmarks.
[Figure 5 examples]
Query: "The person immediately opened a window."
  GT: 8.00s-15.20s   SCN: 7.23s-16.33s   VGN+CCL: 8.00s-15.78s   Ours: 8.00s-15.23s
Query: "Next, the man raises the woman and they turn around and continue dancing."
  GT: 167.33s-209.17s   SCN: 95.12s-189.51s   VGN+CCL: 153.67s-209.17s   Ours: 167.21s-209.17s

Figure 5. Visualization of examples on the Charades-STA and ActivityNet-Caption benchmarks. GT denotes the ground-truth.

References

[1] Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, and Jiebo Luo. Localizing natural language in videos. In Proceedings of the American Association for Artificial Intelligence, 2019.
[2] Long Chen, Chujie Lu, Siliang Tang, Jun Xiao, Dong Zhang, Chilie Tan, and Xiaolin Li. Rethinking the bottom-up framework for query-based video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10551–10558, 2020.
[3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
[4] Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, and Kwan-Yee K Wong. Look closer to ground better: Weakly-supervised temporal grounding of sentence in video. arXiv preprint arXiv:2001.09308, 2020.
[5] Yu Cheng, Quanfu Fan, Sharath Pankanti, and Alok Choudhary. Temporal sequence modeling for video event detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2227–2234, 2014.
[6] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5267–5275, 2017.
[7] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5267–5275, 2017.
[8] Mingfei Gao, Larry Davis, Richard Socher, and Caiming Xiong. Wslln: Weakly supervised natural language localization networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1481–1487, 2019.
[9] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9729–9738, 2020.
[10] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with temporal language. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1380–1390, 2018.
[11] Wenhao Jiang, Lin Ma, Yu-Gang Jiang, Wei Liu, and Tong Zhang. Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 499–515, 2018.
[12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 706–715, 2017.
[14] Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, and Huasheng Liu. Weakly-supervised video moment retrieval via semantic completion network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11539–11546, 2020.
[15] Daizong Liu, Xiang Fang, Wei Hu, and Pan Zhou. Exploring optical-flow-guided motion and detection-based appearance for temporal sentence grounding. arXiv preprint, 2022.
[16] Daizong Liu, Xiaoye Qu, Xing Di, Yu Cheng, Zichuan Xu, and Pan Zhou. Memory-guided semantic learning network for temporal sentence grounding. In AAAI, 2022.
[17] Daizong Liu, Xiaoye Qu, Jianfeng Dong, and Pan Zhou. Reasoning step-by-step: Temporal sentence localization in videos via deep rectification-modulation network. In COLING, 2020.
[18] Daizong Liu, Xiaoye Qu, Jianfeng Dong, and Pan Zhou. Adaptive proposal generation network for temporal sentence localization in videos. In EMNLP, 2021.
[19] Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Yu Cheng, Wei Wei, Zichuan Xu, and Yulai Xie. Context-aware biaffine localizing network for temporal sentence grounding. In CVPR, 2021.
[20] Daizong Liu, Xiaoye Qu, Xiao-Yang Liu, Jianfeng Dong, Pan Zhou, and Zichuan Xu. Jointly cross- and self-modal graph attention network for query-based moment localization. In ACM MM, 2020.
[21] Daizong Liu, Xiaoye Qu, Yinzhen Wang, Xing Di, Kai Zou, Yu Cheng, Zichuan Xu, and Pan Zhou. Unsupervised temporal video grounding with deep semantic clustering. In AAAI, 2022.
[22] Daizong Liu, Xiaoye Qu, and Pan Zhou. Progressively guide to attend: An iterative alignment framework for temporal sentence grounding. In EMNLP, 2021.
[23] Daizong Liu, Xiaoye Qu, Pan Zhou, and Yang Liu. Exploring motion and appearance information for temporal sentence grounding. In AAAI, 2022.
[24] Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, and Jingjing Liu. Violin: A large-scale dataset for video-and-language inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10900–10910, 2020.
[25] Minuk Ma, Sunjae Yoon, Junyeong Kim, Youngjoon Lee, Sunghun Kang, and Chang D Yoo. Vlanet: Video-language alignment network for weakly-supervised video moment retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pages 156–171, 2020.
[26] Niluthpol Chowdhury Mithun, Sujoy Paul, and Amit K Roy-Chowdhury. Weakly supervised video moment retrieval from text queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11592–11601, 2019.
[27] Jonghwan Mun, Minsu Cho, and Bohyung Han. Local-global video-text interactions for temporal grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[28] Guoshun Nan, Rui Qiao, Yao Xiao, Jun Liu, Sicong Leng, Hao Zhang, and Wei Lu. Interventional video grounding with dual contrastive learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2765–2775, 2021.
[29] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[30] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1049–1058, 2016.
[31] Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision (ECCV), pages 510–526. Springer, 2016.
[32] Yijun Song, Jingwen Wang, Lin Ma, Zhou Yu, and Jun Yu. Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. arXiv preprint arXiv:2003.07048, 2020.
[33] Reuben Tan, Huijuan Xu, Kate Saenko, and Bryan A Plummer. Logan: Latent graph co-attention network for weakly-supervised video moment retrieval. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2083–2092, 2021.
[34] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008, 2017.
[36] Jingwen Wang, Lin Ma, and Wenhao Jiang. Temporally grounding language queries in videos by contextual boundary-aware prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12168–12175, 2020.
[37] Jingwen Wang, Lin Ma, and Wenhao Jiang. Temporally grounding language queries in videos by contextual boundary-aware prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[38] Jie Wu, Guanbin Li, Xiaoguang Han, and Liang Lin. Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1283–1291, 2020.
[39] Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, and Wenwu Zhu. Semantic conditioned dynamic modulation for temporal sentence grounding in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[40] Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, and Chuang Gan. Dense regression network for video grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[41] Yawen Zeng, Da Cao, Xiaochi Wei, Meng Liu, Zhou Zhao, and Zheng Qin. Multi-modal relational graph for cross-modal video moment retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2215–2224, 2021.
[42] Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S Davis. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1247–1257, 2019.
[43] Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. Video corpus moment retrieval with contrastive learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 685–695, 2021.
[44] Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Span-based localizing network for natural language video localization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6543–6554, 2020.
[45] Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12870–12877, 2020.
[46] Zhu Zhang, Zhijie Lin, Zhou Zhao, and Zhenxin Xiao. Cross-modal interaction networks for query-based moment retrieval in videos. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 655–664, 2019.
[47] Zhu Zhang, Zhou Zhao, Zhijie Lin, Xiuqiang He, et al. Counterfactual contrastive learning for weakly-supervised vision-language grounding. Advances in Neural Information Processing Systems (NIPS), 33:18123–18134, 2020.