Few-shot Action Recognition with Prototype-centered Attentive Learning


Xiatian Zhu1, Antoine Toisoul1, Juan-Manuel Pérez-Rúa1, Li Zhang2*, Brais Martinez1, Tao Xiang3*
1 Samsung AI Center Cambridge, UK   2 Fudan University   3 University of Surrey
* Work done while at Samsung AI Center Cambridge.

arXiv:2101.08085v1 [cs.CV] 20 Jan 2021

Abstract

Few-shot action recognition aims to recognize action classes with few training samples. Most existing methods adopt a meta-learning approach with episodic training. In each episode, the few samples in a meta-training task are split into support and query sets. The former is used to build a classifier, which is then evaluated on the latter using a query-centered loss for model updating. There are however two major limitations: lack of data efficiency due to the query-centered-only loss design, and inability to deal with the support-set outlying samples and inter-class distribution overlapping problems. In this paper, we overcome both limitations by proposing a new Prototype-centered Attentive Learning (PAL) model composed of two novel components. First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective, in order to make full use of the limited training samples in each episode. Second, PAL further integrates a hybrid attentive learning mechanism that can minimize the negative impacts of outliers and promote class separation. Extensive experiments on four standard few-shot action benchmarks show that our method clearly outperforms previous state-of-the-art methods, with the improvement particularly significant (> 10%) on the most challenging fine-grained action recognition benchmark.

Figure 1: Illustration of (a) conventional query-centered learning vs. (b) our prototype-centered learning in a 3-way 2-shot case. Specifically, the former classifies a query sample against the space of prototypes via computing pairwise similarity and softmax normalization. In contrast, the latter considers the opposite direction by comparing a prototype with all the query samples, pulling close those of the same class whilst pushing away the others. Each class is color coded. Circles and diamonds represent the query samples and per-class prototypes respectively.

1. Introduction

The universal adoption of smartphones has led to large quantities of videos being produced and shared on social media platforms, which are in urgent need of automated analysis. As a result, video action recognition [27, 3, 32, 17] has been studied intensively, with a recent focus on fine-grained action recognition [12]. Most recent action recognition models employ deep convolutional neural networks (CNNs), which are known to be data hungry. In particular, a large number (at least hundreds) of annotated training samples need to be collected for each action class. Collecting and annotating such a large amount of data, however, is expensive, tedious and sometimes even infeasible for some rare fine-grained action classes. Therefore, few-shot action recognition, which aims to construct a video action classifier with few training samples (e.g., 1-5) per class, has started to receive increasing interest [39, 36, 1].

Few-shot action recognition is a special case of few-shot learning (FSL). Most FSL methods follow a meta-learning (or learning-to-learn) paradigm [23, 29]. It is characterized by episodic training that aims to learn a model or optimizer from a set of base/seen tasks, in order to generalize well to new tasks with few labeled training samples/shots. Specifically, a meta-training set containing abundant training samples per seen/base class is used to sample a large number of training episodes. In each episode, the training data is split into a support set with N classes and K samples per class, to mimic the setting of target meta-test tasks, and a query set from the same N classes. A classifier is built using the support set; it is then evaluated on each query-set sample with a classification loss for model updating. Many state-of-the-art FSL methods [33, 35] are based on the popular prototypical network (ProtoNet) [23] for its simplicity and effectiveness. With ProtoNet, each support-set class mean is computed as a class prototype for classifying each query sample (see Figure 1(a)).
Figure 2: Illustration of the outlying video samples and inter-class overlapping problems in a support set containing three music-instrument-playing classes (playing drums, playing trombone, and playing trumpet). There is a clear outlier in the trombone class due to its unusual viewpoint. There is also plenty of class overlapping due to the intrinsic visual similarity between the three classes, particularly when these instruments are played in groups.

There are, however, two major limitations in existing few-shot action recognition methods [39, 36, 1]. The first limitation is the lack of data efficiency due to the query-centered learning objective adopted by existing methods. This is evident in Figure 1(a): with only K samples for each of the N classes in the support set, these N × K samples are first reduced to N prototypes; a loss term (e.g., cross-entropy) is then computed for each query sample individually based on its distances to the prototypes, without any consideration of how the whole set of query samples should be distributed. This query-centered-only learning objective thus does not make full use of the limited training data in each episode.

The second limitation is their inability to address two fundamental challenges in FSL, i.e., outlying samples and inter-class distribution overlapping in the support set. As illustrated in Figure 2, outliers are caused by unusual viewpoint, background, occlusion, etc. They are problematic for any learning task, but particularly so in few-shot action recognition – with few samples per class, a single outlier could have an immense effect on the class distribution. The problem of inter-class overlapping is also commonplace when the training samples of different classes have very similar background, or are simply visually similar. This problem is especially acute for fine-grained action classes. Neither problem is addressed when a class is simply represented as its class mean, without considering the intra-class relationships needed to identify outliers and the inter-class relationships needed to avoid class overlapping.

To overcome the aforementioned limitations, we propose a novel Prototype-centered Attentive Learning (PAL) model with two key components. First, a prototype-centered contrastive learning loss is introduced. This loss is computed by comparing a given prototype against all the query-set samples for discriminative learning (see Figure 1(b)). This new learning objective is clearly complementary to the conventional query-centered learning objective, and combining the two enables PAL to make full use of the limited training data and hence improve data efficiency in training. Second, a hybrid attentive learning component is developed, consisting of self-attention on support-set samples and cross-attention from query-set to support-set samples. This design aims to mitigate the negative effect of outlying samples and promote class separation. Importantly, it can be seamlessly integrated with the query- and prototype-centered learning objectives in a unified meta-learning pipeline.

We make the following contributions in this paper: (1) We propose a novel Prototype-centered Attentive Learning (PAL) model for few-shot action recognition, specifically designed to address the data efficiency, outlier and class overlapping problems that existing methods suffer from. (2) A novel prototype-centered contrastive learning objective is introduced to complement the existing query-centered objective, in order to make full use of the limited few-shot training data. (3) We further introduce a hybrid attentive learning strategy dedicated to solving the under-studied outlying sample and class overlapping problems. (4) Extensive experiments show that our model achieves new state-of-the-art results on four few-shot action datasets. Its performance is particularly compelling on the more challenging fine-grained benchmark, yielding around 10% improvement.

2. Related work

Action recognition Previous efforts have focused on training action recognition models on large-scale video datasets (e.g., the coarse-grained Kinetics [3] and fine-grained Something-Something [12]). Computationally, utilizing 2D CNNs [22, 30, 17] for action recognition is more efficient than the 3D counterparts [27, 3, 32, 17]. As one of the most popular action classification models, Temporal Segment Networks (TSN) [30] extract features from temporally sparsely sampled frames using a 2D CNN, and average pooling is then used to obtain the video-level feature representation and prediction. We use TSN as the feature extractor/embedding network in our few-shot action recognition model, as we found that other more complex feature extractors [27, 3, 32, 17] do not fare better on performance but are computationally more demanding.
Few-shot action recognition As the research focus in action recognition has shifted towards fine-grained action classes lately, the problem of collecting sufficient training samples per class has become an obstacle. Few-shot action recognition emerged as a potential solution to this problem [39, 36, 1]. CMN [39] employs a compound memory network to store video representations and classifies videos by matching and ranking. OTAM [1] measures the query-centered distance with respect to support-set samples by explicitly leveraging the temporal ordering information in the query video via ordered temporal alignment. ARN [36] learns the query-centered similarity between query and support video clips with a pipeline following RelationNet [25]. It exploits augmentation-guided spatial and temporal attention with auxiliary self-supervision training losses. Despite their differences in model design, all deploy a meta-learning framework with a query-centered learning objective, and are thus unable to make full use of the limited training data in each episode. Further, none is designed to address the outlying sample and class overlapping problems, as our model does.

Few-shot learning Few-shot action recognition is a special case of few-shot learning (FSL). Most existing FSL models focus on recognizing static images. They can be roughly categorized into two groups: metric-based methods [29, 25, 23, 7, 33, 19, 35] and optimization-based methods [9, 20]. All existing few-shot action recognition methods, as well as our PAL, belong to the first group, where what is meta-learned is a feature embedding network for video. Note that the idea of using self-attention for few-shot learning has previously been explored in [33]. The model in [33] only applies self-attention to class prototypes. In contrast, our model applies it to the support set as well as the query samples, and is therefore capable of dealing with sample outliers and improving data efficiency. Following the practice of most recent FSL methods [33, 35], feature embedding network pretraining is incorporated into our model.

Self-attention Our attentive learning is based on self-attention across data instances, which was first introduced in transformers as a global self-attention module for machine translation tasks [28]. Non-local neural networks [31] applied the core self-attention block from transformers for context modeling and feature learning in computer vision tasks. They learn an affinity map among all pixels in an image and allow the neural network to effectively increase its receptive field to the global context. State-of-the-art performance has been achieved in classification [8], self-supervised learning [5], semantic segmentation [10, 37] and object detection [34, 2, 40] by using this transformer-based attention model. Different from these works, in this paper we propose a hybrid attentive learning mechanism that exploits support-set self-attention and query-to-support cross-attention in a unified meta-learning manner.

3. Method

3.1. Problem definition

We consider the standard few-shot video action classification problem definition [1, 39]. Given a meta-test dataset Dtest, we sample an N-way K-shot classification task to test a learned FSL action model θ. To train the model θ so that it performs well on such sampled classification tasks, episodic training is adopted to meta-learn the model. Concretely, a large number of N-way K-shot tasks are randomly sampled from a meta-training set Dtrain and then used to train the model in an episodic manner. In each episode, we start by sampling N classes C from Dtrain at random, from which labeled training samples are then randomly drawn to create a support set S and a query set Q consisting of K and Q samples per class, respectively. Formally, the support and query sets are defined as:

    S = \{(x_i, y_i) \mid y_i \in C\}_{i=1}^{NK},    (1)
    Q = \{(x_i, y_i) \mid y_i \in C\}_{i=1}^{NQ}.    (2)

Note that S ∩ Q = ∅, i.e., the two sets are sample-wise non-overlapping.

Typically, episodic training is conducted in a two-loop manner [23]: the support set is used in the inner loop to construct a classifier for the N classes, and the query set is then used in the outer loop to evaluate this classifier and update the model parameters θ. It is noteworthy that, as the objective is to obtain a learner able to recognize novel classes each with only a few labeled examples, Dtrain and Dtest are set to be disjoint in the class space, i.e., Dtrain ∩ Dtest = ∅. Unlike the sparsely annotated meta-test classes, each meta-training class comes with abundant labeled training data, which allows sampling sufficient episodes for meta-training.
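To make the episodic protocol concrete, the sketch below shows how a single N-way K-shot episode could be sampled from a meta-training set. It is a minimal illustration under assumptions – the dictionary-based dataset layout and the function name are ours, not the authors' released code.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=5):
    """Sample one N-way K-shot episode (support + query sets).

    `dataset` is assumed to map class labels to lists of video samples,
    e.g. {"playing drums": [vid0, vid1, ...], ...}.
    """
    classes = random.sample(sorted(dataset.keys()), n_way)  # the episode's label space C
    support, query = [], []
    for label, cls in enumerate(classes):
        videos = random.sample(dataset[cls], k_shot + q_queries)
        # The first K samples form the support set S, the next Q form the query set Q;
        # labels are re-indexed to 0..N-1 within the episode.
        support += [(v, label) for v in videos[:k_shot]]
        query += [(v, label) for v in videos[k_shot:]]
    return support, query  # S and Q are sample-wise disjoint by construction

# Example: a 3-way 2-shot episode with 2 queries per class.
# support, query = sample_episode(meta_train_set, n_way=3, k_shot=2, q_queries=2)
```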
3.2. Pretraining

As in all existing few-shot action classification models [39, 36, 1], one of the key objectives is to learn a feature embedding network through meta-learning, so that it can generalize to unseen action classes. The embedding networks adopted are based on those used by existing video action recognition models, with TSN [30] being the most popular choice. For example, in the state-of-the-art OTAM [1] model, a ResNet-50 based TSN model, pretrained on ImageNet, was meta-trained together with a time-warping based video distance. However, in most recent FSL methods for static image classification [33, 35], pretraining the embedding network on the whole meta-training set before episodic training starts has become a must-have step. In this work, we also adopt such a pretraining step and show in our experiments that this step is vital (see Sec. 4).
Figure 3: Schematic illustration of the proposed Prototype-centered Attentive Learning (PAL). Given an episode of training video data including a support set (3-way 3-shot) and a query set (2 samples per class), a CNN feature embedding network is first used to extract their feature vectors Xs and Xq. The feature vectors Xs of all support samples are then selected to form the Key and Value via respective linear transformations. For support-set self-attention learning, the transformation of Xs is set as the Query and used to conduct attentive learning with the Key and Value. For query-to-support cross-attention learning, Xq is instead used as the input of the Query. The two processes form a hybrid attentive learning strategy for solving the outlying sample problem, which outputs attentive support-set features Xs^ctx and query-set features Xq^ctx. After obtaining the per-class prototypes xc from Xs^ctx as feature means, Xq^ctx can be classified by computing pairwise similarity against xc and softmax normalization (i.e., query-centered learning). To further improve discriminative learning from the limited training data, we additionally conduct a novel prototype-centered contrastive learning in the opposite direction, by comparing a given prototype with all the query-set samples and conducting discriminative contrastive learning.

Specifically, we use a TSN action model [30] as our feature embedding network. It is pretrained on the whole training set Dtrain with a cosine-similarity based cross-entropy loss. Given a video sample xi ∈ Dtrain with varying length and redundant temporal information, we first sample video frames. We adopt the same sparse sampling strategy as introduced in [30]: splitting the video into T equal-sized segments and randomly selecting one frame from each segment. This way, each video can be represented using a fixed number of frames xi = {f1, f2, ..., fT}, where fj is a random frame from the j-th segment. As the sampled frames cover the majority of the original time span, long-term temporal information is kept whereas the spatio-temporally redundant information is reduced significantly.
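A minimal sketch of this sparse segment sampling is given below; the frame-indexing convention is our assumption and the actual TSN implementation may differ in details.

```python
import random

def sparse_sample_indices(num_frames, t_segments=8):
    """Split a video of `num_frames` frames into T equal-sized segments and
    pick one random frame index from each segment (TSN-style sparse sampling)."""
    seg_len = num_frames / t_segments
    indices = []
    for j in range(t_segments):
        start = int(j * seg_len)
        end = max(int((j + 1) * seg_len), start + 1)  # guard against empty segments
        indices.append(random.randrange(start, min(end, num_frames)))
    return indices  # e.g. [3, 17, 29, ...] – spans the whole video

# indices = sparse_sample_indices(num_frames=120, t_segments=8)
```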
Next, each sampled video frame fj is individually encoded by a feature embedding network h() and classified by a cosine-distance based classifier, giving us frame-level classification score vectors s^f_j = [s^f_1, s^f_2, ..., s^f_Z], with Z the total number of classes in Dtrain. In particular, for each frame, s^f_z = cos(h(fj), wz), where cos() denotes the cosine similarity function and wz (z ∈ {1, ..., Z}) the classifier weights of each class. Then, these per-frame score vectors are averaged to get video-level scores:

    s_i = \frac{1}{T} \sum_{j=1}^{T} s^f_j = [s_1, s_2, \cdots, s_Z].    (3)

Finally, a softmax is applied to the video-level scores to get video-level action probabilities:

    p_z = \frac{\exp(s_z)}{\sum_{i=1}^{Z} \exp(s_i)} \quad \text{for } z \in \{1, \cdots, Z\}.    (4)

In training, pi = [p1, p2, ..., pZ] and its ground-truth class label yi are then utilized to optimize the model using a cross-entropy loss:

    L_{ce} = - \sum_{z=1}^{Z} I(z == y_i) \log p_z,    (5)

where I() is an indicator function which returns 1 when the argument is true, 0 otherwise.
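The pretraining objective of Eqs. (3)-(5) could be sketched as follows in PyTorch; the embedding network is left abstract, the tensor shapes are illustrative assumptions, and a temperature/scale on the cosine logits (commonly used in practice) is omitted since the text does not specify one.

```python
import torch
import torch.nn.functional as F

def pretrain_loss(frame_features, class_weights, labels):
    """Cosine-classifier pretraining loss (Eqs. 3-5).

    frame_features: (B, T, d)  per-frame embeddings h(f_j)
    class_weights:  (Z, d)     one weight vector w_z per meta-training class
    labels:         (B,)       ground-truth class indices y_i
    """
    f = F.normalize(frame_features, dim=-1)            # unit-norm frame features
    w = F.normalize(class_weights, dim=-1)              # unit-norm classifier weights
    frame_scores = torch.einsum('btd,zd->btz', f, w)    # cos(h(f_j), w_z), shape (B, T, Z)
    video_scores = frame_scores.mean(dim=1)              # Eq. (3): average over the T frames
    # Eq. (4) softmax and Eq. (5) cross-entropy are fused in F.cross_entropy.
    return F.cross_entropy(video_scores, labels)

# Example with random tensors (8 videos, 8 frames, 2048-d features, 64 classes):
# loss = pretrain_loss(torch.randn(8, 8, 2048), torch.randn(64, 2048), torch.randint(0, 64, (8,)))
```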
After the pretraining stage, the feature embedding network together with a cosine distance can be used directly for meta-test, without going through the episodic training stage. Our experiments show that this turns out to be a surprisingly strong baseline that even achieves better results than the current state-of-the-art OTAM [1] (see Table 1). This verifies for the first time that a good feature embedding (or feature reuse) is also critical for few-shot action modeling – a finding that has been reported in recent static-image FSL works [6, 26, 18]. Nonetheless, this baseline is still limited for FSL since it lacks a "learning to learn" or task adaptation capability to better deal with unseen new tasks.

To this end, we next introduce our Prototype-centered Attentive Learning (PAL) method. The overview of PAL is depicted in Figure 3. PAL is built upon ProtoNet [23] but consists of two new components: (1) hybrid attentive learning, including self-attention on support-set samples and cross-attention from query-set samples to support-set samples (Sec. 3.3); and (2) prototype-centered contrastive learning (Sec. 3.4).

3.3. Hybrid attentive learning

Hybrid attentive learning (HAL) is designed to mitigate inter-class ambiguity as well as the intra-class outlying sample problem by allowing each support-set sample to examine all other samples, in order to identify and fix both problems. It relies on a transformer self-attention module whose parameters are meta-learned together with those of the embedding network h(). Once learned, it is used during meta-test to exploit task-specific contextual information for superior task adaptation. Concretely, given an episode we first extract a d-dimensional feature representation using the TSN h(). The video-level feature vectors are obtained for both the support set and the query set by average pooling along frames.

Support-set self-attention Formally, let Xs ∈ R^{NK×d} and Xq ∈ R^{NQ×d} be the support-set and query-set feature matrices respectively. The input to HAL is in the triplet form of (Query, Key, Value). To learn discriminative contextual information per episode, the input is designed based on the support-set samples as:

    Query = X_s W_Q, \quad Key = X_s W_K, \quad Value = X_s W_V,    (6)

where W_Q, W_K, W_V ∈ R^{d×d_a} are the learnable parameters (each represented by a fully connected layer) that project the TSN feature to a d_a-dimensional latent space. As Query, Key and Value share the same input source (i.e., the support-set data), self-attention can be formulated as:

    X_s^{ctx} = X_s + \mathrm{softmax}\left( \frac{X_s W_Q (X_s W_K)^\top}{\sqrt{d_a}} \right) X_s W_V,    (7)

where softmax() is a row-wise softmax function for attention normalization. Residual learning is adopted for more stable model convergence. As written in Eq. (7), pairwise similarity defines the attentive scores between any two support-set samples, which are further used for weighted aggregation in the Value space. The intuition is that, statistically, sample pairs from the same class usually enjoy greater similarity than pairs from different classes, except for a small number of outlier instances. As a result, this attentive learning reinforces the importance of class-sensitive information, subject to the context of the current task's classes. Consequently, the effect of class-irrelevant information from outlier samples is well controlled during feature transformation. Finally, intra-class variation shrinks and inter-class ambiguity is reduced accordingly.

Query-set cross-attention For consistency, the query-set samples should also be contextualized in a task-specific manner, because they will be classified into a label space formed by the support-set samples. To that end, we introduce a cross-attention process from query-set samples Xq to support-set samples Xs, formulated as:

    X_q^{ctx} = X_q + \mathrm{softmax}\left( \frac{X_q W_Q (X_s W_K)^\top}{\sqrt{d_a}} \right) X_s W_V.    (8)

By doing this, in the same spirit as support-set self-attention, query-set samples are also enriched by contextual information. Note that each query-set sample's transformation is done independently of the other query samples, as the Keys and Values are provided by the support set only. Our model thus remains inductive, like existing FSL action models.

Learning objective To obtain the supervision for our HAL module, we adopt the popular prototype-based objective [23]. With the attentive support-set feature matrix X_s^{ctx}, we first form the prototype of each class c as the mean feature:

    w_c = \frac{1}{K} \sum_{i=1}^{NK} I(y_i == c) \, X_s^{ctx}(i, :),    (9)

where y_i denotes the class label of the i-th support-set sample. We then compute the cosine similarity between any query-set feature X_q^{ctx}(j, :) and all prototypes {w_c}_{c=1}^{N}, and obtain the classification probability vector p̃_j = [p̃_1, ..., p̃_N] ∈ R^N by the softmax function (Eq. (4)). With the query-set class labels y_j, a cross-entropy loss can then be derived over the N classes of the current episode as the meta-training objective:

    L_{meta} = - \sum_{n=1}^{N} I(n == y_j) \log p̃_n,    (10)

which is used to update the parameters of both the HAL module and the TSN embedding network.
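To make Sec. 3.3 concrete, here is a minimal PyTorch sketch of the hybrid attention of Eqs. (6)-(8) followed by the prototype-based, query-centered objective of Eqs. (9)-(10). It is an illustration under assumptions – single-head attention, d_a = d so that the residual connection type-checks, and no temperature on the cosine logits – rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAttention(nn.Module):
    """Support-set self-attention (Eq. 7) and query-to-support cross-attention (Eq. 8)."""

    def __init__(self, d, d_a):
        super().__init__()
        self.w_q = nn.Linear(d, d_a, bias=False)  # W_Q
        self.w_k = nn.Linear(d, d_a, bias=False)  # W_K
        self.w_v = nn.Linear(d, d_a, bias=False)  # W_V
        self.scale = d_a ** 0.5

    def attend(self, x_in, x_support):
        # softmax(Q K^T / sqrt(d_a)) V, with Keys and Values always taken from the support set.
        attn = torch.softmax(self.w_q(x_in) @ self.w_k(x_support).t() / self.scale, dim=-1)
        # Residual connection as in Eqs. (7)/(8); requires d_a == d as the equations imply.
        return x_in + attn @ self.w_v(x_support)

    def forward(self, x_support, x_query):
        return self.attend(x_support, x_support), self.attend(x_query, x_support)

def query_centered_loss(xs_ctx, ys, xq_ctx, yq, n_way):
    """Prototype formation (Eq. 9) and query-centered cross-entropy (Eq. 10)."""
    # Mean of each class's K attentive support features = that class's prototype w_c.
    prototypes = torch.stack([xs_ctx[ys == c].mean(dim=0) for c in range(n_way)])
    # Cosine similarity between every query feature and every prototype, shape (N*Q, N).
    logits = F.normalize(xq_ctx, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    # Softmax over the N prototypes + cross-entropy on the query labels (Eqs. 4 and 10).
    return F.cross_entropy(logits, yq), prototypes

# Example shapes for a 5-way 5-shot episode with 5 queries per class and 2048-d TSN features:
# hal = HybridAttention(d=2048, d_a=2048)
# xs_ctx, xq_ctx = hal(torch.randn(25, 2048), torch.randn(25, 2048))
# ys = yq = torch.arange(5).repeat_interleave(5)
# loss_meta, prototypes = query_centered_loss(xs_ctx, ys, xq_ctx, yq, n_way=5)
```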
3.4. Prototype-centered contrastive learning

The meta-loss in Eq. (10) is still the conventional query-centered loss (see Figure 1(a)). That is, only query-to-prototype discrimination is considered, whilst the other direction is ignored. We hypothesize that this design fails to make full use of the already-limited training data in each episode and would lead to sub-optimal task adaptation.

To overcome this limitation, we further introduce a complementary prototype-centered contrastive learning. More specifically, for a prototype w_c, we define the query-set samples Q_c from the same class as the positive matches and all the others as the negatives. We then compute the cosine similarity between w_c and all the query-set samples, and devise the following prototype-centered contrastive loss function:

    L_{pcc} = \frac{1}{N} \sum_{c=1}^{N} \log \frac{\sum_{i \in Q_c} \cos(w_c, X_q^{ctx}(i, :))}{\sum_{j \in Q} \cos(w_c, X_q^{ctx}(j, :))},    (11)

where N is the number of class prototypes. This design attempts to pull positive query-set samples closer to their corresponding prototype, whilst pushing away the negative ones. This helps further reduce intra-class variation and simultaneously better separate different classes. Eq. (11) is similar in spirit to the supervised contrastive learning objective [15] and neighbourhood component analysis [11]. However, our loss function fundamentally differs from them because (1) it acts uniquely as a meta-learning loss, and (2) it is specially designed for posing prototype-centered discrimination in an FSL context, with prototypes as anchors.
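A sketch of the prototype-centered term of Eq. (11) follows. Two details are assumptions on our part: the log-ratio is negated so that minimizing the combined objective pulls same-class queries toward their prototype as the text describes (Eq. (11) is printed without an explicit sign), and the cosine similarities are clamped to be positive so the logarithm is well defined.

```python
import torch
import torch.nn.functional as F

def prototype_centered_loss(prototypes, xq_ctx, yq):
    """Prototype-centered contrastive loss (Eq. 11) with prototypes as anchors.

    prototypes: (N, d)   per-class prototypes w_c
    xq_ctx:     (N*Q, d) attentive query-set features
    yq:         (N*Q,)   query labels in 0..N-1
    """
    # cos(w_c, x_j) for every prototype/query pair, shape (N, N*Q).
    sims = F.normalize(prototypes, dim=-1) @ F.normalize(xq_ctx, dim=-1).t()
    sims = sims.clamp(min=1e-6)  # assumption: keep similarities positive for the log-ratio
    terms = []
    for c in range(prototypes.size(0)):
        pos = sims[c, yq == c].sum()        # query samples of the same class as prototype c
        terms.append(torch.log(pos / sims[c].sum()))
    # Negated log-ratio (sign convention assumed), averaged over the N prototypes.
    return -torch.stack(terms).mean()

# Combined with the query-centered loss as loss_meta + lambda * loss_pcc (Sec. 3.5, lambda = 1).
```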
3.5. Objective loss, model training and inference

The overall objective loss function of our PAL model for meta-training is defined as:

    L = L_{meta} + \lambda L_{pcc},    (12)

where λ is a weight hyper-parameter which we set to 1.

In summary, the proposed method is trained in two sequential stages. First, the baseline TSN is trained on the meta-training set Dtrain in a standard supervised learning manner (Sec. 3.2). Second, the feature embedding network (TSN) and our PAL model are further optimized end-to-end in the episodic training stage. During test, the model is fixed and used in the same way as in training: prototypes are formed with the labeled support-set samples, and each unlabeled query-set sample of the meta-test set Dtest is classified with the cosine-similarity based classifier.

4. Experiments

Datasets We used four few-shot action benchmarks in our evaluations. (1) Kinetics-100 is a 100-class subset of Kinetics-400 [14], and was first introduced in [39] for few-shot action classification. We followed the same protocol: 64/12/24 classes and 13063/2210/4472 videos for meta-training, meta-validation and meta-testing respectively. Whilst Kinetics is one of the most commonly evaluated datasets, visual appearance and background encapsulate most class-related information, rather than motion patterns [21]. With less need for temporal modeling and with coarse-grained action classes, it presents a relatively easy action classification task. We hence further evaluated (2) Sth-Sth-100, created in a similar way from the Something-Something-V2 dataset [12]. For this dataset we adopted the same protocol as [1], where 64/12/24 classes and 66939/1925/2854 videos are included for train/val/test respectively. By considering fine-grained actions involving human-object interactions with subtle differences between classes, it presents a significantly more challenging action recognition task. To compare with very recent models [36], we also tested two more YouTube video datasets under the same setting. (3) HMDB-51 [16] contains 51 action classes with 6,849 videos. We used 31/10/10 action classes with 4280/1194/1292 videos for train/val/test, respectively. (4) UCF101 [24] consists of 101 action categories with 13,320 video clips. We took 70/10/21 classes with 9154/1421/2745 videos for train/val/test, respectively.

Evaluation metrics We adopted the standard 5-way 1-shot and 5-shot FSL evaluation setting [39, 1]. We randomly sampled 150,000 episodes from the meta-test set and report the mean classification accuracy.

Implementation details We used the ImageNet-pretrained ResNet-50 [13] as the backbone network. We replaced the original fully-connected layer with a new layer that performs cosine-similarity based classification. For the first training stage, we optimized the TSN feature embedding [30] (Sec. 3.2) with SGD, with a starting learning rate of 0.001, decayed by 0.1 every 30 epochs, for a total of 70 epochs. For the second stage, we conducted meta-training of both the TSN feature backbone and our PAL model end-to-end. On Sth-Sth-100, we started from an initial learning rate of 0.0001 and trained for a total of 35 epochs with decay at epochs 15 and 30, each epoch consisting of 200 episodes. For the other datasets, we found that training for 10 epochs sufficed, with decay points at epochs 5, 7 and 9. For both stages, we used the cosine-based classifier model. During training, we resized each video frame to 256 × 256, from which a random 224 × 224 region was cropped to form the input. For the three coarse-grained datasets, we applied random horizontal flipping in training. However, many classes in the fine-grained Sth-Sth-100 are direction-sensitive (e.g., pulling something from left to right vs. pulling something from right to left), so horizontal flipping was not applied there. During test, we used the center crop only to obtain the predictions on the test data.

Competitors Three groups of methods are compared in the performance evaluation: (1) classical FSL models originally proposed for image classification, including Matching Net [29], MAML [9] and ProtoNet [23] – all of these FSL methods used the same TSN feature embedding for fair comparison; (2) stronger action recognition models, including TRN [38] and TARN [4]; and (3) state-of-the-art FSL action models, including CMN [39], OTAM [1] and ARN [36].
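For completeness, a minimal meta-test loop consistent with the protocol above might look like the sketch below; `sample_episode` is the illustrative helper given earlier and `model.predict` is a hypothetical placeholder for forming prototypes from the support set and classifying a query video, not an API from the authors' code.

```python
def evaluate(model, meta_test_set, n_way=5, k_shot=1, q_queries=5, n_episodes=150_000):
    """Mean accuracy over randomly sampled N-way K-shot meta-test episodes."""
    accuracies = []
    for _ in range(n_episodes):
        support, query = sample_episode(meta_test_set, n_way, k_shot, q_queries)
        correct = sum(model.predict(support, video) == label for video, label in query)
        accuracies.append(correct / len(query))
    return sum(accuracies) / len(accuracies)
```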
Method             | Kinetics-100     | Sth-Sth-100      | HMDB51           | UCF101
                   | 1-shot | 5-shot  | 1-shot | 5-shot  | 1-shot | 5-shot  | 1-shot | 5-shot
Matching Net† [29] | 53.3   | 74.6    | -      | -       | -      | -       | -      | -
MAML† [9]          | 54.2   | 75.3    | -      | -       | -      | -       | -      | -
ProtoNet++* [23]   | 64.5   | 77.9    | 33.6   | 43.0    | -      | -       | -      | -
TARN* [4]          | 64.8   | 78.5    | -      | -       | -      | -       | -      | -
TRN++* [38]        | 68.4   | 82.0    | 38.6   | 48.9    | -      | -       | -      | -
CMN [39]           | 60.5   | 78.9    | -      | -       | -      | -       | -      | -
CMN++* [39]        | 65.4   | 78.8    | 34.4   | 43.8    | -      | -       | -      | -
OTAM [1]           | 73.0   | 85.8    | 42.8   | 52.3    | -      | -       | -      | -
ARN [36]           | 63.7   | 82.4    | -      | -       | 45.5   | 60.6    | 66.3   | 83.1
PAL (Ours)         | 74.7   | 86.7    | 46.4   | 62.6    | 60.8   | 75.8    | 85.3   | 95.2

Table 1: Few-shot action classification results (5-way accuracy, %). *: results from [1]; †: results from [39]; ProtoNet++: cosine distance is used; TRN++: same as ProtoNet++ but with TRN feature embedding instead of TSN; CMN++: also with TRN embedding.

Figure 4: t-SNE feature projection of support samples and per-class prototypes (a) before and (b) after the proposed PAL in a 5-way 5-shot task on the Kinetics-100 dataset (classes shown: blasting sand, playing monopoly, riding elephant, throwing axe, side kick). Circles indicate the variation and boundary of each class. Each class is color coded. As can be seen, the boundary of each class becomes tighter and the overlap between different classes is clearly reduced.

4.1. Main results

Comparison to state-of-the-art The comparative results are reported in Table 1. We make the following observations:
(1) In comparison to all classical FSL methods, the proposed PAL is clearly superior under both settings and on all datasets.
(2) Whilst stronger action recognition models (e.g., TRN++ and TARN) improve the results, they still lag behind our model by a large margin.
(3) Interestingly, the first FSL action model, CMN, is shown to be inferior to both TRN++ and TARN, suggesting that its memory network's merit is less critical than better temporal structure modeling. Directly combining CMN with TRN helps little (CMN++).
(4) By taking a pairwise temporal alignment strategy, the state-of-the-art OTAM further improves the performance. Nonetheless, it is still outperformed by our PAL, particularly on the more challenging Sth-Sth-100 dataset (10.3% improvement under 5-shot). This is not surprising, because OTAM's temporal warping is functionally susceptible to outlier (less consistent) support-set videos, as typically encountered in fine-grained actions with human-object interactions. Besides, compared with our model, OTAM is inferior at leveraging the few shots per class during meta-learning due to its lack of prototype-centered discrimination. On the easier Kinetics-100 dataset, where class overlapping is less of a problem, our PAL model remains superior to OTAM by a smaller margin, indicating that tackling outlying samples is consistently a more effective strategy.
(5) On the two smallest datasets, HMDB51 and UCF101, the proposed method also sets a new state-of-the-art, due to its superior ability to deal with sparse training samples.
Why PAL works? As described in the Introduction and Figure 1, our PAL is specially designed to overcome the class overlap challenge caused by inter-class boundary ambiguity and the outlier support-set samples intrinsic to new tasks. To understand the internal mechanism of our model, we visualized the change in support samples' feature representations and per-class prototypes in a new 5-way 5-shot task. Concretely, this contrasts the feature distributions with and without our PAL model to reveal how the feature space is improved. It is evident in Figure 4 that PAL does improve the separation of different classes' decision boundaries through two effects: (1) reducing intra-class variation by mitigating the distracting effect of outlier samples on the class decision boundary, and (2) minimizing inter-class overlap by conducting query-centered and prototype-centered discriminative learning concurrently.

4.2. Ablation study

Contributions of model components To understand the benefits of the two components in our PAL, namely Hybrid Attentive Learning (HAL) and Prototype-centered Contrastive Learning (PCL), we conducted a detailed component analysis by comparing the performance with and without them. From Table 2 the following observations can be made. (1) Each component is useful, and the two components are complementary to each other in improving the classification accuracy on both datasets. (2) More performance gain is achieved on the more difficult Sth-Sth-100 than on Kinetics-100. This is not surprising considering that Kinetics' classes are coarse-grained, with less distribution overlap and more distinctive patterns between classes.

Method         | Kinetics-100     | Sth-Sth-100
               | 1-shot | 5-shot  | 1-shot | 5-shot
Pretrained PAL | 73.1   | 86.5    | 42.7   | 59.6
HAL            | 74.5   | 86.5    | 45.3   | 62.1
PCL            | 73.4   | 86.6    | 44.5   | 61.4
HAL+PCL        | 74.7   | 86.7    | 46.4   | 62.6

Table 2: Model component analysis. The two main components of our PAL were evaluated. HAL: Hybrid Attentive Learning; PCL: Prototype-centered Contrastive Learning.

Importance of pretraining In our model, the feature embedding TSN model is pretrained on the whole training set before the episodic training stage. In contrast, the current state-of-the-art OTAM [1] skipped this pretraining step. Table 3 shows, surprisingly, that this pretraining step is vital: our PAL model with pretraining alone is already significantly more effective than the state-of-the-art OTAM. This suggests for the first time that a strong feature embedding is significant for action classification in the FSL context, confirming the similar conclusion drawn for the image counterparts [6, 26, 18].

Method         | Kinetics-100     | Sth-Sth-100
               | 1-shot | 5-shot  | 1-shot | 5-shot
OTAM [1]       | 64.5   | 77.9    | 33.6   | 43.0
Pretrained PAL | 73.1   | 86.5    | 42.7   | 59.6

Table 3: Baseline comparison.

Attention design Similar to PAL, the recent FEAT model [33] also exploits transformer-based self-attention [28] for task adaptation, in the context of few-shot image classification. However, unlike our model, it directly adapts the class prototypes (i.e., feature means) by self-attention alone; it is thus unable to deal with outlier samples in the support set. Moreover, there is no cross-attention, and only conventional query-centered learning is performed during meta-training. Table 4 demonstrates that our PAL is consistently superior to FEAT.

Method    | Kinetics-100     | Sth-Sth-100
          | 1-shot | 5-shot  | 1-shot | 5-shot
FEAT [33] | 74.5   | 86.5    | 45.3   | 61.2
PAL       | 74.7   | 86.7    | 46.4   | 62.6

Table 4: Attention design comparison.

5. Conclusion

In this work we have proposed a novel Prototype-centered Attentive Learning (PAL) method for few-shot action recognition. It is designed specifically to address the data inefficiency and the inability to deal with outliers and class overlapping that existing methods suffer from. To that end, two complementary components are developed: prototype-centered contrastive learning, which makes better use of the few shots per class, and hybrid attentive learning, which mitigates the negative effect of outlier support samples as well as class overlapping. They are integrated in a single framework and trained end-to-end to maximize their complementarity. Extensive experiments validate that the proposed PAL yields new state-of-the-art performances on four action benchmarks, with the improvement on the more challenging fine-grained action recognition benchmark being the most compelling.
References

[1] Kaidi Cao, Jingwei Ji, Zhangjie Cao, Chien-Yi Chang, and Juan Carlos Niebles. Few-shot video classification via temporal alignment. In CVPR, 2020.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[4] Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, and Juan Carlos Niebles. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In CVPR, 2019.
[5] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.
[6] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In ICLR, 2019.
[7] Carl Doersch, Ankush Gupta, and Andrew Zisserman. CrossTransformers: spatially-aware few-shot transfer. In NeurIPS, 2020.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint, 2020.
[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[10] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
[11] Jacob Goldberger, Geoffrey E Hinton, Sam Roweis, and Russ R Salakhutdinov. Neighbourhood components analysis. In NeurIPS, 2004.
[12] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fründ, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[14] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Apostol Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset. arXiv preprint, 2017.
[15] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, 2020.
[16] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
[17] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In ICCV, 2019.
[18] Bin Liu, Yue Cao, Yutong Lin, Qi Li, Zheng Zhang, Mingsheng Long, and Han Hu. Negative margin matters: Understanding margin in few-shot classification. In ECCV, 2020.
[19] Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy M Hospedales, and Tao Xiang. Incremental few-shot object detection. In CVPR, 2020.
[20] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[21] Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, and Lorenzo Torresani. Only time can tell: Discovering temporal data for temporal modeling. arXiv preprint, 2019.
[22] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.
[23] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.
[24] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint, 2012.
[25] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
[26] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In ECCV, 2020.
[27] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[29] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, 2016.
[30] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[31] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[32] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
[33] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In CVPR, 2020.
[34] Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, and Han Hu. Disentangled non-local neural networks. In ECCV, 2020.
[35] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. DeepEMD: Few-shot image classification with differentiable earth mover's distance and structured classifiers. In CVPR, 2020.
[36] Hongguang Zhang, Li Zhang, Xiaojuan Qi, Hongdong Li, Philip HS Torr, and Piotr Koniusz. Few-shot action recognition with permutation-invariant attention. In ECCV, 2020.
[37] Li Zhang, Dan Xu, Anurag Arnab, and Philip HS Torr. Dynamic graph message passing networks. In CVPR, 2020.
[38] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, 2018.
[39] Linchao Zhu and Yi Yang. Compound memory networks for few-shot video classification. In ECCV, 2018.
[40] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint, 2020.