Generalized Few-Shot Semantic Segmentation


Zhuotao Tian†  Xin Lai†  Li Jiang†  Michelle Shu‡  Hengshuang Zhao§  Jiaya Jia†
† The Chinese University of Hong Kong   ‡ Cornell University   § University of Oxford

Abstract

Training semantic segmentation models requires a large amount of finely annotated data, making it hard to quickly adapt to novel classes that do not satisfy this condition. Few-Shot Segmentation (FS-Seg) tackles this problem, but with many constraints. In this paper, we introduce a new benchmark, called Generalized Few-Shot Semantic Segmentation (GFS-Seg), to analyze the ability of segmentation models to simultaneously recognize novel categories with very few examples and base categories with sufficient examples. Previous state-of-the-art FS-Seg methods fall short in GFS-Seg, and the performance discrepancy mainly comes from the constrained training setting of FS-Seg. To make GFS-Seg tractable, we set up a GFS-Seg baseline that achieves decent performance without structural change to the original model. Then, since context is key for boosting performance in semantic segmentation, we propose Context-Aware Prototype Learning (CAPL), which significantly improves performance by leveraging contextual information to update class prototypes with aligned features. Extensive experiments on Pascal-VOC and COCO demonstrate the effectiveness of CAPL, and CAPL also generalizes well to FS-Seg.

1. Introduction

The development of deep learning has provided significant performance gains on semantic segmentation tasks. Representative semantic segmentation methods [54, 4] have benefited a wide range of applications in robotics, autonomous driving, medical imaging, etc. However, once these frameworks are trained, they are unable to deal with unseen classes in new applications without sufficient fully-labeled data. Even if the required data of novel classes are ready, fine-tuning still costs additional time and resources.

To adapt quickly to novel classes with only limited labeled data, Few-Shot Segmentation (FS-Seg) [31] models are trained on well-labeled base classes and then tested on previously unseen novel classes. As shown in Figure 1, during training, FS-Seg divides data into support and query sets. Samples of the support set provide FS-Seg models with target class information for identifying target regions in query samples, with the purpose of mimicking the situation where only a few labeled data of novel classes are available. Without fine-tuning, both support and query samples are sent to FS-Seg models, which yield predictions on the query samples based on the information provided by the support samples.

There are several inherent limitations in applying FS-Seg in practice. First, FS-Seg requires support samples to contain the classes that exist in the query samples, and it may be overly strong in many situations to assume this prior knowledge. Also, providing support samples of the same classes requires manual selection. Second, FS-Seg models are evaluated only on novel classes, while test samples in normal semantic segmentation may contain both base and novel classes. Experiments in Section 5 show that SOTA FS-Seg models cannot handle the practical situation of evaluation on both base and novel classes well, due to the constraints of FS-Seg.

With these facts, in this paper, we set up a new benchmark, called Generalized Few-Shot Semantic Segmentation (GFS-Seg). The difference between GFS-Seg and FS-Seg is that, during evaluation, GFS-Seg does not require forwarding support samples that contain the same target classes as the test (query) samples to make predictions. GFS-Seg can also perform well on novel classes without sacrificing accuracy on base classes when making predictions on both simultaneously, an essential step towards the practical use of semantic segmentation in more challenging situations.

Inspired by [13, 27], we design a baseline for GFS-Seg with decent performance. Considering that contextual relation is essential for semantic segmentation, we further propose Context-Aware Prototype Learning (CAPL), which provides a significant performance gain to the baseline by updating the weights of base prototypes with aligned features. The baseline method and the proposed CAPL can be applied to normal semantic segmentation models, e.g., PSPNet [54] and DeepLab [4]. Also, CAPL demonstrates its effectiveness in the setting of FS-Seg by improving PANet [41] by a large margin. Our overall contribution is threefold.

• We extend classic Few-Shot Segmentation (FS-Seg) and propose a more practical setting – Generalized Few-Shot Semantic Segmentation (GFS-Seg).

• We design the first baseline for GFS-Seg that yields decent performance without major structural change to normal semantic segmentation models.

We also show that previous SOTA FS-Seg models cannot handle the task of GFS-Seg well.

• We propose Context-Aware Prototype Learning (CAPL), which brings significant performance gains to baseline models in both the GFS-Seg and FS-Seg settings, and which is general enough to be applied to various normal semantic segmentation models.

2. Related Work

Semantic Segmentation. Semantic segmentation is a fundamental yet challenging topic that asks models to accurately predict the label of each pixel. FCN [32] is the first framework designed for semantic segmentation, replacing the last fully-connected layer of a classification network with convolution layers. To get per-pixel predictions, encoder-decoder style approaches [26, 1, 29] are adopted to refine the outputs step by step. The receptive field is vital for semantic segmentation, so dilated convolution [4, 47] is introduced to enlarge it. Context information plays an important role in semantic segmentation, and context modeling architectures such as global pooling [22] and pyramid pooling [4, 54, 53, 44] have been introduced. Meanwhile, attention models [55, 50, 19, 18] are also shown to be effective for capturing long-range relationships inside scenes. Despite the success of these powerful segmentation frameworks, they cannot be easily adapted to unseen classes without fine-tuning on sufficient annotated data.

Few-Shot Learning. Few-shot learning aims at making predictions on novel classes with only a few labeled examples. Popular solutions include meta-learning based methods [3, 10, 30] and metric-learning ones [39, 36, 35, 13]. In addition, data augmentation helps models achieve better performance by combating overfitting, so synthesizing new training samples or features based on few labeled data is also a feasible solution to the few-shot problem [51, 15, 42].

Generalized few-shot learning was proposed in [15]. In the generalized few-shot setting, models are first trained on base classes (representation learning), and then a few labeled data of novel classes, together with the base ones, are used to help models gather information from the novel classes (low-shot learning). After these two training phases, models predict labels from both base and novel classes on a set of previously unseen images (test set). The proposed GFS-Seg is motivated by the generalized few-shot setting, but dense pixel labeling in semantic segmentation differs from single-image classification, which does not involve contextual information.

Few-Shot Segmentation. Few-Shot Segmentation (FS-Seg) places semantic segmentation in the few-shot scenario [31, 33, 25, 28, 48, 17, 45, 12, 9, 56, 2, 34, 46, 23, 37, 11, 24, 38, 43, 40, 20, 52], where dense pixel labeling is performed on new classes with only a few support samples. OSLSM [31] first introduced this setting to segmentation, providing a solution that yields the weights of the final classifier for each query-support pair during evaluation. The idea of the prototype was used in PL [7], where predictions are based on the cosine similarity between pixels and the prototypes. Additionally, prototype alignment regularization was introduced in PANet [41]. Predictions can also be generated by convolutions: PFENet [38] uses prior knowledge from the pre-trained backbone to help find the region of interest, and the spatial inconsistency between query and support samples is alleviated by its Feature Enrichment Module (FEM); CANet [49] concatenates the support and query features and applies an Iterative Optimization Module (IOM) to refine the prediction; PGNet [48] proposes a Graph Attention Unit (GAU) that establishes element-to-element correspondence between query and support features at multiple scales.

Though FS-Seg models perform well at identifying novel classes given the corresponding support samples, as shown in Section 5, even state-of-the-art FS-Seg models cannot handle the practical setting that contains both base and novel classes.

3. Task Description

We revisit previous few-shot segmentation before presenting our new benchmark – Generalized Few-Shot Semantic Segmentation.

3.1. Classic Few-Shot Segmentation

We call the task of [31] classic Few-Shot Segmentation (FS-Seg), where data is split into two sets: support $S$ and query $Q$. An FS-Seg model needs to make predictions on $Q$ based on the class information provided by $S$; it is trained on base classes $C^b$ and tested on previously unseen novel classes $C^n$ ($C^b \cap C^n = \emptyset$). To help adapt to novel classes, the episodic paradigm was proposed in [39] to train and evaluate few-shot models. Each episode is formed by a support set $S$ and a query set $Q$ of the same class $c$. In a K-shot task, the support set contains K samples $S = \{S_1, S_2, ..., S_K\}$ of class $c$. Each support sample $S_i$ is a pair $\{I_{S_i}, M_{S_i}\}$, where $I_{S_i}$ and $M_{S_i}$ are the support image and the label of class $c$.

For the query set, $Q = \{I_Q, M_Q\}$, where $I_Q$ is the input query image and $M_Q$ is the ground-truth mask of class $c$. The input data batch used for model training is the query-support pair $\{I_Q, S\} = \{I_Q, I_{S_1}, M_{S_1}, ..., I_{S_K}, M_{S_K}\}$. The ground-truth mask $M_Q$ of the query image $I_Q$ is not accessible to the model; it is used to evaluate the prediction on the query image in each episode. More details on the episodic training paradigm of FS-Seg can be found in [31].
[Figure 1 graphic: (a) "Classic 2-way K-shot Semantic Segmentation" and (b) "Generalized K-shot Semantic Segmentation"; diagram omitted.]
Figure 1. Illustrations of (a) classic Few-Shot Segmentation (FS-Seg) and (b) Generalized Few-Shot Semantic Segmentation (GFS-Seg). 'Dist' can be any method that measures the distance/similarity between each feature and each prototype and makes predictions based on that distance/similarity. GFS-Seg models make predictions on base and novel classes simultaneously without being affected by redundant classes, while FS-Seg models only predict the novel classes provided by the support set.

In summary, there are two key criteria in the classic Few-Shot Segmentation task: (1) samples of the testing classes $C^n$ are not seen by the model during training; (2) the model requires its support samples to contain the target classes that exist in the query samples in order to make predictions.
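For concreteness, the following is a minimal sketch of how a K-shot episode batch described above could be assembled. It is our illustration (PyTorch assumed; the function and tensor names are not from the authors' code):

```python
import torch

def build_episode(support_pairs, query_image):
    """Assemble a K-shot FS-Seg episode for one class c.

    support_pairs: list of K (image, mask) tuples, image (3, H, W), mask (H, W).
    query_image:   tensor (3, H, W); its ground-truth mask is held out
                   and only used for evaluation, as described above.
    """
    support_images = torch.stack([img for img, _ in support_pairs])  # (K, 3, H, W)
    support_masks = torch.stack([msk for _, msk in support_pairs])   # (K, H, W)
    return {"query": query_image.unsqueeze(0),                       # (1, 3, H, W)
            "support_images": support_images,
            "support_masks": support_masks}
```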
3.2. Generalized Few-Shot Semantic Segmentation

Criterion (1) of classic few-shot segmentation evaluates a model's ability to generalize to new classes with only a few provided samples. Criterion (2), however, makes the setting impractical in many cases: users may not know exactly how many and which classes are contained in each test image, so it is hard to feed the model support samples that contain the same classes as the query sample.

In addition, even if users already know that N classes are contained in the test image, relation-module [36] based FS-Seg models [49, 17] may need to process NK additional, manually selected support samples to make predictions over all possible classes in the test image. This is insufficient for real applications, where models are supposed to output predictions for all classes in the test image directly, without processing NK additional images.

In GFS-Seg, base classes $C^b$ have sufficient labeled training data, while each novel class $c^n_i \in C^n$ has only limited K labeled samples (e.g., K = 1, 5, 10). Similar to FS-Seg, models in GFS-Seg are first trained on the base classes $C^b$ to learn good representations. Then, after training is accomplished and the information of the N novel classes has been acquired from the limited NK support samples, GFS-Seg models are evaluated on images of the test set to predict labels from both base and novel classes $C^b \cup C^n$, rather than from only the novel classes $C^n$ as in FS-Seg. The major evaluation metric of GFS-Seg is therefore the mIoU over all classes. The episodic paradigm is not feasible in GFS-Seg, since the input during evaluation is no longer the query-support pair $\{I_Q, I_S, M_S\}$ used in FS-Seg, but only the query image $I_Q$, as in testing common semantic segmentation models.

To better distinguish between FS-Seg and GFS-Seg, we illustrate a 2-way K-shot task of FS-Seg and a case of GFS-Seg with the same query image in Figure 1, where Cow and Motorbike are novel classes, and Person and Car are base classes.

An FS-Seg model (Figure 1(a)) is limited to predicting binary segmentation masks only for the classes whose prototypes are available in the episode. The Person on the right and the Car at the top are missed when their prototypes are not provided, even if the model has been trained on these base classes for sufficient iterations. In addition, if redundant novel classes that do not appear in the query image (e.g., Aeroplane) are provided by the support set of (a), they adversely affect performance, because FS-Seg has the prerequisite that query images contain the classes provided by the support samples.

As shown in Section 5.2, FS-Seg models only learn to predict foreground masks for the given novel classes, so their performance degrades considerably in our proposed generalized setting, where all possible base and novel classes require prediction. In contrast, GFS-Seg (Figure 1(b)) identifies base and novel classes simultaneously without prior knowledge of the classes in each query image, and extra support classes (e.g., Aeroplane at the top-left of Figure 1(b)) do not affect the model.

4. Our Method

In this section, we present the motivation and details of Context-Aware Prototype Learning (CAPL), which aims to tackle GFS-Seg. CAPL is built upon Prototype Learning (PL) [35, 7], so the idea of PL and the baseline for GFS-Seg are introduced first.

4.1. Prototype Learning in FS-Seg

In Few-Shot Segmentation (FS-Seg) frameworks [7, 41], for an N-way K-shot FS-Seg task (N novel classes, each with K support samples), all support samples $s^i_j$ ($i \in \{1, 2, ..., N\}$, $j \in \{1, 2, ..., K\}$) are first processed by a feature extractor $\mathcal{F}$ and mask average pooling. They
are then averaged over the K shots to form N prototypes $p_i$ ($i \in \{1, 2, ..., N\}$):

$$p_i = \frac{1}{K} \sum_{j=1}^{K} \frac{\sum_{h,w} [m^i_j \circ \mathcal{F}(s^i_j)]_{h,w}}{\sum_{h,w} [m^i_j]_{h,w}}, \quad i \in \{1, 2, ..., N\}, \qquad (1)$$

where $m^i_j \in \mathbb{R}^{h,w,1}$ is the class mask for class $c_i$ on $\mathcal{F}(s^i_j) \in \mathbb{R}^{h,w,c}$, $s^i_j$ represents the j-th support image of class $c_i$, and $\circ$ is the Hadamard product. After acquiring the N prototypes, the query features of the feature map $\mathcal{F}(q)$ are assigned the labels of the prototypes they are closest to. The distance metric can be Euclidean distance [35], cosine similarity [39], or a relation module [36].
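A minimal sketch of the masked average pooling in Eq. (1), assuming PyTorch; `features` would come from the feature extractor $\mathcal{F}$ and `masks` are the binary class masks, both at the same spatial resolution (the names are our illustration, not the authors' code):

```python
import torch

def masked_average_prototype(features, masks, eps=1e-8):
    """Eq. (1): one prototype for class c_i by masked average pooling.

    features: (K, c, h, w) support features F(s_j^i) for K shots of class c_i.
    masks:    (K, 1, h, w) binary masks m_j^i of class c_i.
    Returns a (c,) prototype p_i averaged over the K shots.
    """
    # Hadamard product zeroes out features outside the class region,
    # then each shot is normalized by its own mask area.
    per_shot = (features * masks).sum(dim=(2, 3)) / (masks.sum(dim=(2, 3)) + eps)  # (K, c)
    return per_shot.mean(dim=0)  # average over the K shots
```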
4.2. Baseline for GFS-Seg

FS-Seg only requires identifying targets of novel classes, but our generalized setting in GFS-Seg requires predictions on both base and novel classes. However, it is hard and also inefficient for an FS-Seg model to form prototypes for base classes via Eq. (1) by forwarding all samples of the base classes through the feature extractor, especially when the training set is large.

Common semantic segmentation frameworks can be decomposed into two parts: a feature extractor and a classifier. The feature extractor projects the input image into a c-dimensional space, and then the classifier of size $N^b \times c$ makes predictions on the $N^b$ base classes. In other words, the classifier of size $N^b \times c$ can be seen as $N^b$ base prototypes ($P^b \in \mathbb{R}^{N^b \times c}$). Since forwarding all base samples to form the base prototypes is impractical, inspired by low-shot learning methods [13, 27], we learn the classifier via back-propagation during training on the base classes.

Specifically, our baseline for GFS-Seg is trained on base classes like a normal segmentation framework. After training, the prototypes $P^n \in \mathbb{R}^{N^n \times c}$ for the $N^n$ novel classes are formed via Eq. (1) from the $N^n \times K$ support samples. $P^n$ and $P^b$ are then concatenated to form $P^{all} \in \mathbb{R}^{(N^n + N^b) \times c}$, which simultaneously predicts the base and novel classes. Since the dot product used by the classifiers of common semantic segmentation frameworks produces different norm scales of $\mathcal{F}(s^i_j)$ that negatively impact the average operation in Eq. (1), like [41], we adopt cosine similarity as the distance metric $d$ to yield the output $O$ for pixels of a query sample $q \in \mathbb{R}^{h \times w \times 3}$:

$$O_{x,y} = \arg\max_i \frac{\exp(\alpha\, d(\mathcal{F}(q)_{x,y}, p^i))}{\sum_{p^i \in P^{all}} \exp(\alpha\, d(\mathcal{F}(q)_{x,y}, p^i))}, \qquad (2)$$

where $x \in \{1, ..., h\}$, $y \in \{1, ..., w\}$, $i \in \{1, ..., N^b + N^n\}$, and $\alpha$ is set to 10 in all experiments.
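A sketch of the baseline classifier in Eq. (2): the novel prototypes are concatenated with the learned base prototypes, and each pixel is assigned to its most similar prototype under scaled cosine similarity ($\alpha = 10$ as in the paper; the function and variable names are our own):

```python
import torch
import torch.nn.functional as F

def cosine_classifier(query_feat, base_protos, novel_protos, alpha=10.0):
    """Eq. (2): per-pixel prediction over base + novel prototypes.

    query_feat:   (c, h, w) features F(q) of one query image.
    base_protos:  (N_b, c) learned classifier weights P^b.
    novel_protos: (N_n, c) prototypes P^n from Eq. (1).
    Returns an (h, w) label map over the N_b + N_n classes.
    """
    protos = F.normalize(torch.cat([base_protos, novel_protos], dim=0), dim=1)
    c, h, w = query_feat.shape
    feat = F.normalize(query_feat.reshape(c, -1).t(), dim=1)  # (h*w, c), unit norm
    logits = alpha * feat @ protos.t()                        # scaled cosine similarity
    # Softmax is monotonic, so the argmax over logits gives the same labels.
    return logits.argmax(dim=1).view(h, w)
```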
4.3. Context-Aware Prototype Learning (CAPL)

Motivation. Prototype Learning (PL) is applicable to few-shot classification and FS-Seg, but it works inferiorly for GFS-Seg. In the setting of FS-Seg, the target labels of query samples come only from novel classes, so there is no essential co-occurrence interaction between novel and base classes that can be utilized for further improvement. In GFS-Seg, however, there is no such limitation on the classes contained in each test image, and contextual cues always play an important role in semantic segmentation (e.g., PSPNet [54] and DeepLab [6]), especially in the proposed setting.

For example, suppose Dog and People are base classes. The learned base prototypes only capture the contextual relation between Dog and People during training. If Sofa is a novel class and some instances of Sofa in the support samples appear together with Dog (e.g., a dog lying on a sofa), merely mask-pooling each support sample of Sofa to form the novel prototype leaves the base prototype of Dog without the contextual co-occurrence information of Sofa, and hence yields inferior results. Thus, for GFS-Seg, reasonable utilization of contextual information is the key to better performance.

Inference. We propose Context-Aware Prototype Learning (CAPL) to tackle GFS-Seg by effectively enriching the classifier with contextual information. Let $N^b$ and $N^n$ denote the total numbers of base classes $C^b$ and novel classes $C^n$ respectively. We use $n^b$ ($n^b \leq N^b$) to denote the number of extra base classes $c_{b,i} \in C^b$ ($i \in \{1, ..., n^b\}$) contained in the $N^n \times K$ support samples of the previously unseen novel classes $c_{n,u} \in C^n$ ($u \in \{1, ..., N^n\}$). Before evaluation, the novel prototypes are formed as in Eq. (1). The final prototype $p^{b,i}$ for base class $c_{b,i}$ is the weighted sum of the original classifier weights $p^{b,i}_{cls}$ and the new base prototype $p^{b,i}_{feat}$ generated from the support features:

$$p^{b,i}_{feat} = \frac{\sum_{u=1}^{N^n} \sum_{j=1}^{K} \sum_{h,w} [m^{i,u}_j \circ \mathcal{F}(s^u_j)]_{h,w}}{\sum_{u=1}^{N^n} \sum_{j=1}^{K} \sum_{h,w} [m^{i,u}_j]_{h,w}}, \quad i \in \{1, ..., n^b\}, \qquad (3)$$

$$p^{b,i} = \gamma^i \cdot p^{b,i}_{cls} + (1 - \gamma^i) \cdot p^{b,i}_{feat}, \quad i \in \{1, ..., n^b\}, \qquad (4)$$

where $m^{i,u}_j$ represents the binary mask of base class $c_{b,i}$ on $\mathcal{F}(s^u_j)$, and $s^u_j$ is the j-th support sample of novel class $c_{n,u}$. The dynamic weight $\gamma^i$ balances the impact of the old and new base prototypes. To facilitate understanding, we provide a visual illustration of a 7-class 2-shot example in Figure 2.

We note that $\gamma$ is data-dependent: its value is conditioned on each pair of old and new prototypes, as shown in Figure 2(b). For base class $c_{b,i}$, $\gamma^i = G(p^{b,i}_{cls}, p^{b,i}_{feat})$, where $G$ is a two-layer Multilayer Perceptron (a $2c \times c$ layer followed by a $c \times 1$ layer, where $c$ is the dimension of $p^{b,i}_{cls}$ and $p^{b,i}_{feat}$) with a sigmoid activation that predicts $\gamma^i$ from $p^{b,i}_{cls}$ and $p^{b,i}_{feat}$. The MLP is learned during training.

We further note that directly applying the weighted sum of $P^b_{feat}$ and $P^b_{cls}$ is difficult, because the former consists of averaged support features while the latter consists of classifier weights, so they lie in different feature spaces. To make the combination tractable, we modify the normal training scheme so that the feature extractor $\mathcal{F}$ learns to produce $P^b_{feat}$ that is well-aligned with $P^b_{cls}$.
[Figure 2 graphic: (a) "New Classifier Formation" and (b) "Generation of γ"; diagram omitted.]
Figure 2. Visual illustration of the new classifier formation process before the inference phase of CAPL. As shown in (a), the weights (prototypes) of the $N^n$ novel classes (e.g., motor and cow) are directly set to the averaged novel features. The weights of the $n^b$ base classes (e.g., person, car, sheep and bus) that appear together with novel classes in support samples are enriched by the weighted sum of the averaged features and the original weights. Finally, the weights of the remaining $N^b - n^b$ base classes (e.g., dog) are kept. The generation process of the adaptive weighting factor $\gamma$ is shown in (b).
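The sketch below puts Eqs. (3) and (4) together with the two-layer MLP G of Figure 2(b). It is our illustration of the described computation: the module and tensor names are ours, and the ReLU on the hidden layer is an assumption (the paper only specifies the layer sizes and the sigmoid output):

```python
import torch
import torch.nn as nn

class GammaMLP(nn.Module):
    """G in the paper: a 2c -> c -> 1 MLP with sigmoid output, predicting gamma^i.
    The hidden ReLU is our assumption; the paper does not state the hidden activation."""
    def __init__(self, c):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * c, c), nn.ReLU(),
                                 nn.Linear(c, 1), nn.Sigmoid())

    def forward(self, p_cls, p_feat):                 # both (c,)
        return self.net(torch.cat([p_cls, p_feat]))   # gamma^i in (0, 1)

def capl_update_base_prototype(p_cls, support_feats, support_masks, gamma_mlp, eps=1e-8):
    """Eqs. (3)-(4): refresh one base prototype from the novel-class support set.

    p_cls:         (c,) learned classifier weight p^{b,i}_cls of base class c_{b,i}.
    support_feats: (S, c, h, w) features of all N_n * K support samples.
    support_masks: (S, 1, h, w) binary masks of class c_{b,i} on those samples.
    """
    # Eq. (3): one masked average over *all* support pixels of the class
    # (a single global ratio, not a per-shot mean).
    num = (support_feats * support_masks).sum(dim=(0, 2, 3))  # (c,)
    p_feat = num / (support_masks.sum() + eps)
    # Eq. (4): adaptive convex combination of the old and new prototypes.
    gamma = gamma_mlp(p_cls, p_feat)
    return gamma * p_cls + (1 - gamma) * p_feat
```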

Training. Let $B$ denote the training batch size and $N^b$ the number of all base classes. We randomly select $\lfloor \frac{B}{2} \rfloor$ training samples as 'Fake Support' samples and the rest as 'Fake Query' ones. Let $N^b_f$ denote the number of base classes contained in the 'Fake Support' samples. We randomly select $\lfloor \frac{N^b_f}{2} \rfloor$ of them as the 'Fake Novel' classes $C^{FN}$ and set the remaining $N^b_f - \lfloor \frac{N^b_f}{2} \rfloor$ as the 'Fake Context' classes $C^{FC}$. The selected $C^{FN}$ and $C^{FC}$ classes are both base classes, but they mimic the behavior of real novel and base classes during inference, respectively. The updated prototypes $p^{b,i}_{update}$ ($i \in \{1, 2, ..., N^b\}$) during training are formed as

$$p^{b,i}_{update} = \begin{cases} \gamma^i \cdot p^{b,i}_{cls} + (1 - \gamma^i) \cdot p^{b,i}_{feat} & c_i \in C^{FC} \\ p^{b,i}_{feat} & c_i \in C^{FN} \\ p^{b,i}_{cls} & \text{otherwise} \end{cases} \qquad (5)$$

$$p^{b,i}_{feat} = \frac{\sum_{j=1}^{\lfloor B/2 \rfloor} \sum_{h,w} [m^i_j \circ \mathcal{F}(s_j)]_{h,w}}{\sum_{j=1}^{\lfloor B/2 \rfloor} \sum_{h,w} [m^i_j]_{h,w}}, \quad c_i \in C^{FC} \cup C^{FN}, \qquad (6)$$

where $s_j$ is the j-th 'Fake Support' sample and $m^i_j$ is the binary mask of class $c_i$ on $\mathcal{F}(s_j)$.

Specifically, the 'Fake Context' classes $C^{FC}$ act as the base classes contained in the support samples of novel classes during testing: the features of 'Fake Context' classes update $P^b_{cls}$ with $\gamma$ as shown in Eq. (5). Features of the 'Fake Novel' classes $C^{FN}$ yield 'Fake Novel' prototypes via Eq. (6), which directly replace the learned weights of those base classes in $P^b_{cls}$. Finally, the weights of the remaining classes are kept as the original ones. Following Eqs. (5) and (6), we obtain an updated classifier during training that helps the feature extractor learn to provide classifier-aligned features during inference.

Because the updated classifier is based on both the originally learned classifier and features extracted from support samples, the weights of base classes in the learned classifier are important. To help the model generate decent representations $p^{b,i}_{feat}$ for forming the new classifier without compromising the quality of the original classifier $p^{b,i}_{cls}$, we use both $P_{cls}$ and $P_{update}$ to generate the training loss. Let $L_{cls}$ denote the cross-entropy (CE) loss of the output generated by the original classifier $P_{cls}$, and let $L_{update}$ denote the CE loss produced by the updated weights $P_{update}$. The final training loss $L$ is the average of $L_{cls}$ and $L_{update}$.
5. Experiments

5.1. Settings

The baseline model shown in this section is based on PSPNet [54] with ResNet-50 [16]; more experiments, e.g., with DeepLab-V3 [5], are shown in the supplementary file. Our experiments are conducted on Pascal-VOC [8] with augmented data from [14], and on COCO [21]. Following Pascal-5i [31], the 20 classes of Pascal-VOC are evenly divided into four splits, each containing 5 classes, to cross-validate the performance of models: split_i = {5i − 4, ..., 5i} (i ∈ {1, 2, 3, 4}). When one split provides the novel classes, the classes of the other three splits plus the background class {0} are treated as the 16 base classes. Different from FS-Seg, where evaluation is performed only on the novel classes of test images, GFS-Seg models predict labels from both base and novel classes simultaneously on each image of the validation set.
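For clarity, the split bookkeeping just described can be written directly (a hypothetical helper of ours, covering both the Pascal-5i splits above and the Pascal-10i variant introduced below):

```python
def pascal_splits(classes_per_split, split_id):
    """Return (base_classes, novel_classes) for Pascal-5i (5 per split)
    or Pascal-10i (10 per split); class 0 is background and always base."""
    novel = set(range(classes_per_split * (split_id - 1) + 1,
                      classes_per_split * split_id + 1))     # split_i = {5i-4, ..., 5i}
    base = ({0} | set(range(1, 21))) - novel                 # remaining classes + background
    return sorted(base), sorted(novel)

# Example: Pascal-5i, split 2 -> novel {6..10}, 16 base classes (incl. background).
print(pascal_splits(5, 2))
```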
[Figure 3 graphic omitted.]
Figure 3. Visual comparison. Images from top to bottom are the input, ground truth, baseline and CAPL respectively. Novel classes are bus, car, cat, chair and cow. The white color stands for the label 'Do Not Care'.

As the number of either base or novel classes increases, the generalized few-shot task becomes more challenging. To further analyze performance in a setting with more novel classes, we propose a variant of Pascal-5i, Pascal-10i, that divides the 20 classes into two splits of 10 classes each: split_i = {10i − 9, ..., 10i} (i ∈ {1, 2}). When one split provides the 10 novel classes, the classes of the other split plus the background class {0} are treated as the 11 base classes. Experiments on COCO-20i are also included, because COCO is a rather challenging dataset with an overwhelming number of classes. Each split of COCO-20i contains 60 base and 20 novel classes. We follow PANet [41] to divide COCO into four splits and report the averaged mIoU as the final performance. For stable results, we follow [38] and evaluate 5,000 and 20,000 episodes on Pascal and COCO respectively.

In the following sections, we first apply FS-Seg models to the proposed GFS-Seg setting and compare them with our model. Then we demonstrate the effectiveness of CAPL by comparing the baseline with different training/inference strategies. For each dataset, we take the average over all splits as the final performance. The following experimental results are averaged over five different random seeds, i.e., {123, 321, 456, 654, 999}. Our code will be made publicly available.

5.2. Comparison with FS-Seg Models

To show that models in the setting of FS-Seg are unable to perform well when facing both base and novel classes, we evaluate two representative FS-Seg frameworks (PANet [41] and CANet [49]) in the setting of GFS-Seg.

The major difference between CANet and PANet lies in the 'Dist' component shown in Figure 1(a), i.e., the method for processing query and support features and making predictions: CANet is a relation-based model and PANet is a cosine-based model. Specifically, CANet uses convolutions as a relation module [36] to process the concatenated query and support features and yield a prediction on the query image, while PANet generates results by measuring the cosine similarity between every query feature and the prototypes obtained from support samples. Though FS-Seg only requires evaluation on novel classes, if the prototypes (averaged features) of base classes are available, FS-Seg models should also be able to identify the regions of base classes in query images.

We use the publicly available code and follow the default training configuration of these two frameworks. We modify the inference code to feed all prototypes (base and novel) for each query image; the base prototypes are formed by averaging the features belonging to base classes over all training samples. As shown in Table 1, CANet and PANet do not perform well compared to our CAPL, especially on the base classes. The performance gap between FS-Seg models and our baseline in the setting of GFS-Seg remains obvious when testing on Pascal-10i, as shown in Table 1. The results of the two representative FS-Seg models in Table 1 also indicate that cosine-based models can be more effective than relation-based ones in GFS-Seg.

We note that the novel-class mIoU results of PANet and CANet in Table 1 are obtained under the GFS-Seg setting and thus differ from those reported in their papers for the FS-Seg setting. This discrepancy is caused by the different evaluation protocols: in GFS-Seg, models are required to identify all classes in a given testing image, including both base and novel classes, while in FS-Seg, models only need to find the pixels belonging to one specific novel class, with prior knowledge of what the target class is. Therefore, it is much harder to identify the novel classes under the interference of base classes in GFS-Seg.
Table 1. Comparisons with FS-Seg models in GFS-Seg. Base: mIoU of all base classes. Novel: mIoU of all novel classes. Total: mIoU of all (base + novel) classes.

Methods  Shot | Pascal-5i: Base / Novel / Total | Pascal-10i: Base / Novel / Total
CANet    1    | 8.73 / 2.42 / 7.23              | 9.59 / 1.79 / 5.88
PANet    1    | 31.88 / 11.25 / 26.97           | 26.27 / 10.79 / 18.90
CAPL     1    | 63.98 / 16.44 / 52.66           | 52.10 / 15.04 / 34.45
CANet    5    | 9.05 / 1.52 / 7.26              | 9.55 / 1.66 / 5.79
PANet    5    | 32.95 / 15.25 / 28.74           | 28.75 / 15.40 / 22.39
CAPL     5    | 65.66 / 20.23 / 54.84           | 55.45 / 19.01 / 38.10
CANet    10   | 8.96 / 1.57 / 7.20              | 9.42 / 2.09 / 5.93
PANet    10   | 32.96 / 16.26 / 28.98           | 28.87 / 16.08 / 22.78
CAPL     10   | 65.62 / 23.32 / 55.55           | 55.90 / 19.07 / 38.37

More specifically, the main reason that FS-Seg models fall short is that the episodic training scheme of FS-Seg only adapts models to discriminate between background and foreground: the decision boundary of each episode lies only between one target class and the background in each query sample. Also, FS-Seg requires query images to contain the classes provided by the support samples, while GFS-Seg distinguishes not only multiple novel classes but also all possible base classes simultaneously, without prior knowledge of the classes contained in the query samples. The reproduced results of CANet and PANet in FS-Seg are placed in the supplementary.

5.3. Effectiveness of CAPL

To verify the proposed CAPL, we compare it with the baseline under different numbers of shots (i.e., 1, 5 and 10) and on different datasets.

Since the training of CAPL is modified to pick 'Fake Novel' and 'Fake Context' samples that align the averaged features with the learned weights for building prototypes, we modify the training scheme of the baseline accordingly (denoted as 'CAPL-Tr') for a fair comparison. Because the baseline only replaces the novel prototypes during evaluation, CAPL-Tr samples only 'Fake Novel' classes, whose prototypes replace the corresponding base prototypes during training. 'CAPL-Te' denotes applying only the inference strategy of CAPL on top of the baseline, without modifying the training phase; for CAPL-Te models we set γ to the converged mean values of the CAPL models.

The results in Table 2 show that the training and inference strategies of CAPL complement each other – both are indispensable. CAPL-Tr brings only minor improvement to the baseline when the training alignment alone is implemented, and CAPL-Te shows that, without the proposed training strategy, the inference method of CAPL produces sub-optimal results due to the misalignment between the original classifier and the new weights obtained by averaging support features.

Either more novel classes (11 base and 10 novel classes) in Pascal-10i or more base classes (60 base and 20 novel classes) in COCO-20i makes this setting more challenging, because models need to be more discriminative over both base and novel classes to make accurate predictions: more novel classes are detrimental to the predictions on base classes due to the additional noise brought by the limited support samples, and more base classes overwhelm the limited information of the novel classes. The results in Tables 2 and 3 still demonstrate the robustness of the proposed method in these cases. Also, as the shot number increases, more co-occurrence information can be captured, so CAPL brings more performance gain to the baseline.

5.4. Applying CAPL to FS-Seg

FS-Seg is an extreme case of GFS-Seg. Since PANet [41] is a cosine-based FS-Seg model, to validate the proposed CAPL in the setting of FS-Seg, in Table 4 we implement CAPL as an auxiliary loss when training PANet with ResNet-50. CAPL brings significant improvement to PANet (denoted PANet-C) by serving as an inter-class regularizer that helps the model be more robust to different contexts.

5.5. Comparison with AMP

Adaptive Masked Proxies (AMP) [34] has achieved promising performance on FS-Seg by fusing new proxies with the previously learned class signatures. Both CAPL and AMP use weight imprinting and a weighted sum for refining the base class representations. However, there are two inherent differences: 1) AMP involves an additional search for a fixed α shared by the weighted summations of all classes, while our weighting factor is adaptive to different classes because it is conditioned on the input prototypes, without additional searching epochs; 2) the weighted summation in AMP is performed only at the inference stage and the model is trained in a normal fashion, whereas our model meta-learns this behavior during training to better accomplish the context enrichment during inference. Therefore AMP sits between our baseline and CAPL-Te, and is less competitive than CAPL. More experiments are shown in the supplementary.
Table 2. Ablation study of CAPL in GFS-Seg. Base: mIoU of all base classes. Novel: mIoU of all novel classes. Total: mIoU of all (base + novel) classes.

Methods  Shot | Pascal-5i: Base / Novel / Total | Pascal-10i: Base / Novel / Total
Baseline 1    | 62.57 / 14.02 / 50.99           | 51.05 / 13.68 / 33.25
CAPL-Tr  1    | 62.79 / 15.53 / 51.54           | 49.91 / 14.10 / 32.86
CAPL-Te  1    | 63.43 / 14.09 / 51.68           | 51.38 / 13.30 / 33.25
CAPL     1    | 63.98 / 16.44 / 52.66           | 52.10 / 15.04 / 34.45
Baseline 5    | 64.04 / 17.01 / 52.96           | 52.40 / 16.44 / 35.32
CAPL-Tr  5    | 64.29 / 19.60 / 53.65           | 52.17 / 17.17 / 35.50
CAPL-Te  5    | 64.80 / 17.01 / 53.42           | 55.18 / 16.80 / 36.90
CAPL     5    | 65.66 / 20.23 / 54.84           | 55.45 / 19.01 / 38.10
Baseline 10   | 63.76 / 18.63 / 53.01           | 52.28 / 17.60 / 35.77
CAPL-Tr  10   | 63.52 / 21.08 / 53.41           | 51.93 / 19.41 / 36.45
CAPL-Te  10   | 65.02 / 18.76 / 54.00           | 55.04 / 17.89 / 37.35
CAPL     10   | 65.62 / 23.32 / 55.55           | 55.90 / 19.07 / 38.37

Table 3. The effectiveness of CAPL on COCO-20i in GFS-Seg. PANet-C: PANet implemented with CAPL. Base: mIoU of all base classes. Novel: mIoU of all novel classes. Total: mIoU of all (base + novel) classes.

Methods  Shot | Base / Novel / Total
PANet    1    | 11.39 / 9.69 / 10.97
PANet-C  1    | 11.56 / 17.13 / 15.94
Baseline 1    | 39.21 / 8.20 / 31.56
CAPL     1    | 43.51 / 10.56 / 35.38
PANet    5    | 11.68 / 12.62 / 11.91
PANet-C  5    | 16.00 / 19.04 / 16.75
Baseline 5    | 38.73 / 8.08 / 31.16
CAPL     5    | 42.30 / 9.67 / 34.24
PANet    10   | 11.74 / 12.78 / 11.99
PANet-C  10   | 16.07 / 19.37 / 16.88
Baseline 10   | 37.46 / 5.62 / 29.60
CAPL     10   | 40.20 / 6.11 / 31.79

Table 4. Class mIoU results in FS-Seg.

Methods  Shot | Pascal-5i | Pascal-10i | COCO-20i
PANet    1    | 49.49     | 46.69      | 26.17
PANet-C  1    | 53.91     | 53.70      | 31.39
PANet    5    | 56.56     | 53.24      | 33.51
PANet-C  5    | 60.30     | 61.14      | 36.34

5.6. Visual Illustration

Visual examples in GFS-Seg are shown in Figure 3. The major issue of the baseline is that its results are more likely to be negatively affected by the prototypes of classes that are not present in the test image, especially for predictions near boundary areas. With enriched context information, CAPL yields fewer false predictions than the baseline. Both our baseline method and CAPL can be easily applied to any normal semantic segmentation model without structural constraints.

Table 5. Comparison with the baseline model on base classes. Splits from Pascal-10i are marked with *, and the splits without * are from Pascal-5i. The mIoU results of base classes are shown.

Methods  | Split-1 | Split-2 | Split-3 | Split-4 | Split-1* | Split-2*
Baseline | 70.22   | 62.37   | 60.78   | 70.99   | 53.15    | 58.09
CAPL     | 70.57   | 62.04   | 60.86   | 71.39   | 53.37    | 57.03

5.7. Comparison on Base Classes

To demonstrate that the proposed CAPL does not affect base class training, we compare it with the baseline model on base classes in Table 5. Both models are evaluated without support images, so the mIoU results for novel classes are zero and we only show the mIoU results of base classes. Based on the results in Table 5, we conclude that CAPL not only quickly adapts to previously unseen novel classes, as shown in the other experiments, but also yields robust results on base classes.

6. Conclusion

We have presented the new benchmark of Generalized Few-Shot Semantic Segmentation (GFS-Seg) together with a novel solution – Context-Aware Prototype Learning (CAPL). Different from classic Few-Shot Segmentation (FS-Seg), GFS-Seg aims at identifying both base and novel classes, a task on which recent state-of-the-art FS-Seg models fall short. Our proposed CAPL brings the baseline significant performance improvement by enriching the base classes with the contextual information carried by novel classes, using aligned features. CAPL places no structural constraints on the base model and thus can be easily applied to normal semantic segmentation frameworks. CAPL also generalizes well to the classic setting of FS-Seg.
References

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.
[2] Ayan Kumar Bhunia, Ankan Kumar Bhunia, Shuvozit Ghose, Abhirup Das, Partha Pratim Roy, and Umapada Pal. A deep one-shot network for query-based logo retrieval. PR, 2019.
[3] Qi Cai, Yingwei Pan, Ting Yao, Chenggang Yan, and Tao Mei. Memory matching networks for one-shot image recognition. In CVPR, 2018.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018.
[5] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv, 2017.
[6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[7] Nanqing Dong and Eric P. Xing. Few-shot semantic segmentation with prototype learning. In BMVC, 2018.
[8] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes (VOC) challenge. IJCV, 2010.
[9] Zhibo Fan, Jin-Gang Yu, Zhihao Liang, Jiarong Ou, Changxin Gao, Gui-Song Xia, and Yuanqing Li. FGN: fully guided network for few-shot instance segmentation. In CVPR, 2020.
[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[11] Siddhartha Gairola, Mayur Hemani, Ayush Chopra, and Balaji Krishnamurthy. Simpropnet: Improved similarity propagation for few-shot image segmentation. In IJCAI, 2020.
[12] Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and Kaiqi Huang. SSAP: single-shot instance segmentation with affinity pyramid. In ICCV, 2019.
[13] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
[14] Bharath Hariharan, Pablo Arbelaez, Lubomir D. Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[15] Bharath Hariharan and Ross B. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[17] Tao Hu, Pengwan Yang, Chiliang Zhang, Gang Yu, Yadong Mu, and Cees G. M. Snoek. Attention-based multi-context guiding for few-shot semantic segmentation. In AAAI, 2019.
[18] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
[19] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
[20] Xiang Li, Tianhan Wei, Yau Pun Chen, Yu-Wing Tai, and Chi-Keung Tang. FSS-1000: A 1000-class dataset for few-shot segmentation. In CVPR, 2020.
[21] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014.
[22] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. Parsenet: Looking wider to see better. arXiv, 2015.
[23] Weide Liu, Chi Zhang, Guosheng Lin, and Fayao Liu. Crnet: Cross-reference networks for few-shot segmentation. In CVPR, 2020.
[24] Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xuming He. Part-aware prototype network for few-shot semantic segmentation. In ECCV, 2020.
[25] Khoi Nguyen and Sinisa Todorovic. Feature weighting and boosting for few-shot segmentation. In ICCV, 2019.
[26] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[27] Hang Qi, Matthew Brown, and David G. Lowe. Low-shot learning with imprinted weights. In CVPR, 2018.
[28] Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alexei A. Efros, and Sergey Levine. Few-shot segmentation propagation with guided networks. arXiv, 2018.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[30] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019.
[31] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. In BMVC, 2017.
[32] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. TPAMI, 2017.
[33] Mennatullah Siam and Boris N. Oreshkin. Adaptive masked weight imprinting for few-shot segmentation. In ICCV, 2019.
[34] Mennatullah Siam, Boris N. Oreshkin, and Martin Jägersand. AMP: adaptive masked proxies for few-shot segmentation. In ICCV, 2019.
[35] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.
[36] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
[37] Pinzhuo Tian, Zhangkai Wu, Lei Qi, Lei Wang, Yinghuan Shi, and Yang Gao. Differentiable meta-learning model for few-shot semantic segmentation. In AAAI, 2020.
[38] Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrichment network for few-shot segmentation. TPAMI, 2020.
[39] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In NeurIPS, 2016.
[40] Haochen Wang, Xudong Zhang, Yutao Hu, Yandan Yang, Xianbin Cao, and Xiantong Zhen. Few-shot semantic segmentation with democratic attention networks. In ECCV, 2020.
[41] Kaixin Wang, JunHao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. Panet: Few-shot image semantic segmentation with prototype alignment. In ICCV, 2019.
[42] Yu-Xiong Wang, Ross B. Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In CVPR, 2018.
[43] Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, and Qixiang Ye. Prototype mixture models for few-shot semantic segmentation. In ECCV, 2020.
[44] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In CVPR, 2018.
[45] Yuwei Yang, Fanman Meng, Hongliang Li, King N. Ngan, and Qingbo Wu. A new few-shot segmentation network based on class representation. In VCIP, 2019.
[46] Yuwei Yang, Fanman Meng, Hongliang Li, Qingbo Wu, Xiaolong Xu, and Shuai Chen. A new local transformation module for few-shot segmentation. In MMM, 2020.
[47] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[48] Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo, Qingyao Wu, and Rui Yao. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In ICCV, 2019.
[49] Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In CVPR, 2019.
[50] Hang Zhang, Kristin J. Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.
[51] Hongguang Zhang, Jing Zhang, and Piotr Koniusz. Few-shot learning via saliency-guided hallucination of samples. In CVPR, 2019.
[52] Xiaolin Zhang, Yunchao Wei, Yi Yang, and Thomas Huang. Sg-one: Similarity guidance network for one-shot semantic segmentation. arXiv, 2018.
[53] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. In ECCV, 2018.
[54] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[55] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. In ECCV, 2018.
[56] Kai Zhu, Wei Zhai, and Yang Cao. Self-supervised tuning for few-shot segmentation. In IJCAI, 2020.