DETReg: Unsupervised Pretraining with Region Priors for Object Detection

Amir Bar1, Xin Wang2, Vadim Kantorov1, Colorado J Reed2, Roei Herzig1,
Gal Chechik3,4, Anna Rohrbach2, Trevor Darrell2, Amir Globerson1
1 Tel-Aviv University  2 Berkeley AI Research  3 NVIDIA  4 Bar-Ilan University
amir.bar@cs.tau.ac.il

arXiv:2106.04550v1 [cs.CV] 8 Jun 2021

Abstract

Self-supervised pretraining has recently proven beneficial for computer vision tasks, including object detection. However, previous self-supervised approaches are not designed to handle a key aspect of detection: localizing objects. Here, we present DETReg, an unsupervised pretraining approach for object DEtection with TRansformers using Region priors. Motivated by the two tasks underlying object detection, localization and categorization, we combine two complementary signals for self-supervision. For an object localization signal, we use pseudo ground truth object bounding boxes from an off-the-shelf unsupervised region proposal method, Selective Search, which does not require training and can detect objects at a high recall rate and very low precision. The categorization signal comes from an object embedding loss that encourages invariant object representations, from which the object category can be inferred. We show how to combine these two signals to train the Deformable DETR detection architecture from large amounts of unlabeled data. DETReg improves the performance over competitive baselines and previous self-supervised methods on standard benchmarks like MS COCO and PASCAL VOC. DETReg also outperforms previous supervised and unsupervised baseline approaches in the low-data regime, when trained with only 1%, 2%, 5%, and 10% of the labeled data on MS COCO. For code and pretrained models, visit the project page https://amirbar.net/detreg.

1   Introduction

Object detection is a key task in machine vision that involves both localizing objects in an image and classifying them into categories. Achieving high detection accuracy typically requires training models with large datasets. However, such datasets are expensive to collect, since they require manual annotation of multiple bounding boxes per image, whereas unlabeled images are easy to collect and require no manual annotation. Recently, there has been growing interest in learning self-supervised representations, which substantially reduce the need for labeled data [24, 6, 23, 9]. These self-supervised representations are learned in a pretraining stage on large-scale datasets like ImageNet [13], and they have led to increased performance on a range of perception tasks [8], including object detection, even outperforming their supervised pretraining counterparts.

Despite this recent progress, we argue that current approaches are limited in their ability to learn good representations for object detection, as they do not focus on learning to detect objects. Most past works (e.g., MoCo [24] and SwAV [6]) learn only part of the detection architecture, usually a subnetwork of the detector (e.g., a convolutional network like ResNet [26]). Learning a backbone on its own is not enough for a detection model to succeed. While the recent UP-DETR [12] work trains a full detection architecture, it learns to detect random patches in an image and is therefore not geared towards detecting actual objects.

                                        Preprint. Under review.
(a) Def. DETR [59] w/ SwAV [6]             (b) UP-DETR [12]                   (c) DETReg (Ours)
Figure 1: Prediction examples of unsupervised pretraining approaches. Recent methods, shown
in (a) and (b), do not learn “objectness” during the pretraining stage. In contrast, our method DETReg
(c) learns to localize objects more accurately in its pretraining. The included prediction examples
were obtained after pretraining and before finetuning with annotated data.

Our approach to the problem is different and is based on the observation that learning good detectors
requires learning to detect objects in the pretraining stage. To accomplish this, we present a new
framework called “DEtection with TRansformers based on Region priors”, or DETReg. DETReg
can be used to train a detector on unlabeled data by introducing two key pretraining tasks: the “Object
Localization Task” and the “Object Embedding Task”. The goal of the first is to train the model to
localize objects, regardless of their categories. However, learning to localize objects is not enough,
and detectors must also classify objects. Towards this end, we introduce the “Object Embedding
Task”, which is geared towards understanding the categories of objects in the image. Inspired by the
simplicity of recent transformers for object detection [4, 59], we choose to base our approach on the
Deformable DETR [59] architecture, which simplifies the implementation and is fast to train.
But how can we learn to localize objects from unlabeled data? Luckily, the machine vision community
has worked extensively on the problem of region proposals, and there are effective methods like
Selective Search [47] to produce category-agnostic region proposals at a high recall, off-the-shelf,
and without the need for training. The key idea in Selective Search is that objects exhibit certain
structural properties (continuity, hierarchy, edges), and fairly simple programmatic (i.e., not trained)
procedures can leverage these cues to extract object proposals. As we show here, these classic
algorithms can be effectively used for unsupervised learning of detectors. Similarly, our “Object
Embedding Task” is based on the recent success of self-supervised methods in learning visual
representations from unlabeled data [6, 8, 10]. In these works, the key idea is to encourage learning
of visual representations that are not sensitive to transformations that preserve object categories, such
as translation or mild cropping. We use one such method, SwAV [6], to obtain the embeddings of
potential objects, and use them to supervise our DETReg object embeddings during pretraining.
We train DETReg on the above two tasks without using any manually annotated bounding boxes or
categories. A key advantage of this approach is that it trains all DETR model parameters, and thus
learns to produce meaningful detections even with no supervision — see Figure 1. We conduct an
extensive evaluation of DETReg on standard benchmarks: MS COCO [35] and PASCAL VOC [16]
under various settings, including “low-data” training regimes. We find that DETReg improves on
challenging baselines across the board, and especially when small amounts of annotated data are
available. For example, DETReg improves over the supervised pretrained Deformable DETR by 4
points in AP on PASCAL VOC and by 1.6 points on MS COCO. When using only 1% of the data,
it improves over the supervised counterpart by over 11 points in AP. Additionally, it improves on
the Deformable DETR initialized with SwAV by 2.5 points in AP on PASCAL VOC, and by 0.3 on
MS COCO. We also find it improves by 5.7 and 5.8 points in AP when using only 1% and 2% of
annotated data on MS COCO. Taken together, these results suggest that DETReg is a highly effective
approach to pretraining object detector models.

2   Related Work

Self-supervised pretraining. Recent work [8, 27, 24, 6, 10] has shown that self-supervised pretrain-
ing can generate powerful representations for transfer learning, even outperforming its supervised
counterparts on challenging vision benchmarks [52, 8]. Self-supervised learning often involves

various image restoration (e.g., inpainting [40], colorization [58], denoising [48]) and higher level
prediction tasks like image orientation [19], context [14], temporal ordering [38], and cluster assign-
ments [5]. The learned representations transfer well to image classification but the improvement is
less significant for instance-level tasks, such as object detection and instance segmentation [24, 41].
More recently, a few works [43, 27, 53, 55] studied instance-level self-supervised representation
learning. Roh et al. [43] propose a spatially consistent representation learning (SCRL) algorithm to
produce coherent spatial representations of a randomly cropped local region according to geometric
translations and zooming operations. Concurrent works, DetCon [27], ReSim [53] and DetCo [55]
adopt contrastive learning on image patches for region similarity learning. DetCon uses mask priors
to align patches from different views while ReSim applies two different transformations (e.g., random
cropping) to the image and constructs the positive alignments from the overlapping regions. Our
work is in line with these works on learning useful representations for object detection. In contrast
to DetCon, which requires object mask priors, our approach uses the region proposals from
off-the-shelf tools as weak supervision, rather than implicitly embedding them in
the contrastive learning formulation for constructing positive/negative pairs. Our intuition is that
contrastive learning on image patches does not necessarily empower the model to learn what and
where an object is, and adding weak supervision signals from the region priors could be beneficial.
End-to-end object detection. Detection with transformers (DETR) [4] is the first fully end-to-end
object detector, eliminating the need for components such as anchor generation and
non-maximum suppression (NMS) post-processing. This model has quickly gained traction in the
machine vision community. However, the original DETR suffers from slow convergence and limited
sample efficiency. Deformable DETR [59] introduces a deformable attention module to attend to a
sparsely sampled small set of prominent key elements, and achieves better performance compared to
DETR with reduced training epochs. We use Deformable DETR as our base detection architecture
given its improved training efficiency. Both DETR and Deformable DETR adopt the supervised
pretrained backbone (i.e., ResNet [26]) on ImageNet. UP-DETR [12] pretrains DETR in a self-
supervised way by detecting and reconstructing the random patches from the input image. Our work
shares the goal of UP-DETR of unsupervised pretraining for object detection, but our approach is
very different. In contrast to UP-DETR, we adopt region priors from off-the-shelf unsupervised
region proposal algorithms to provide weak supervision for pretraining; these priors carry an explicit
notion of objectness, unlike random image patches, which do not.
Region proposals. A rich body of work on region proposal methods [1, 44, 7, 15, 2, 60, 11, 32] exists in the
object detection literature. The grouping-based Selective Search [44] and the window-scoring-based
Objectness [1] are two early and well-known proposal methods, which have been widely
adopted and are supported in major libraries (e.g., OpenCV [3]). Selective Search greedily merges
superpixels to generate proposals. Objectness relies on visual cues such as multi-scale saliency,
color contrast, edge density, and superpixel straddling to identify likely regions. While the field has
largely drifted to learning-based approaches, one benefit of these classic region proposal approaches
is that they have no learned parameters, and thus can be a good source of “free” supervision.
Hosang et al. [29, 28] provide a comprehensive analysis of the various region proposal methods,
and Selective Search is among the top-performing approaches with a high recall rate. In this work,
we seek weak supervision from the region proposals generated by Selective Search, which has been
widely adopted and proven successful in well-known detectors such as R-CNN [21] and Fast
R-CNN [20]. Note, however, that our proposed approach is not limited to the Selective Search region
priors, and can employ other region proposal methods.

3   Region Proposals via Selective Search

Training a model for object detection requires learning to localize objects. To accomplish this, we
rely on classical region proposal approaches. Specifically, we use the Selective Search algorithm [47].
The goal of Selective Search is to propose candidate regions in the image that contain objects. These
regions are obtained via an iterative process that hierarchically groups smaller regions based on their
similarity and adjacency. The algorithm is fully programmatic: it does not require training and is
available “off-the-shelf” in the OpenCV Python library [3]. Furthermore, it captures multiple
attributes of objects and functions as an excellent prior for “objectness”.
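A minimal sketch of generating these proposals with OpenCV, assuming the opencv-contrib-python package is installed (the helper name below is ours, not from the released code):

```python
import cv2  # requires the opencv-contrib-python package

def selective_search_boxes(image_bgr, fast=True):
    """Return Selective Search proposals as (x, y, w, h) boxes."""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image_bgr)
    if fast:
        ss.switchToSelectiveSearchFast()      # the compact "fast" variant (see Section 6)
    else:
        ss.switchToSelectiveSearchQuality()
    # Proposals come out roughly ordered by the hierarchical grouping process.
    return ss.process()

# boxes = selective_search_boxes(cv2.imread("image.jpg"))  # typically thousands of boxes
```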

[Figure 2 diagram: an input image x passes through the detector to produce query vectors v, from which fbox, femb, and fcat predict (b̂, ẑ, p̂); these predictions are matched via bipartite matching to Selective Search region proposals b and their SwAV embeddings z.]

Figure 2: The DETReg pretext task and model. We pretrain a Deformable DETR [59] based detector
to predict region proposals and their corresponding object embeddings in the pretraining stage.

Next, we briefly describe the Selective Search procedure, in order to highlight the type of information
it captures. Given an image, a graph-based segmentation algorithm [17] is used to propose initial
image regions R = {r1 , ..., rn }. These regions are the result of an iterative grouping process of
super-pixels, where adjacent elements are grouped based on their similarity across the boundary
compared to their similarity with other neighboring components. Let S be the set of pairwise region
similarities of the neighboring regions, according to some similarity function s. In every iteration let
ri , rj ∈ R be the two regions such that s(ri , rj ) = max(S); these two regions are combined into a
new region rt = ri ∪ rj , which is added to the set of regions R: R = R ∪ {rt }. The old similarities
involving ri , rj are removed, and new similarities w.r.t. rt and its neighbours are added. When
the set of similarities S is empty, the algorithm stops and returns the locations of the regions in R.
The ranking of the output boxes follows the order in which they were generated, with a bit of
randomness added to diversify the results.
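For concreteness, the grouping loop can be sketched as follows. This is a schematic, not the actual OpenCV implementation: regions are simplified to sets of pixel indices, and s is a caller-supplied similarity as in Eq. (1).

```python
def hierarchical_grouping(regions, neighbors, s):
    """Schematic grouping loop. regions: list of sets of pixel indices from the
    initial segmentation; neighbors: adjacent index pairs (i, j); s: the region
    similarity of Eq. (1)."""
    S = {(i, j): s(regions[i], regions[j]) for (i, j) in neighbors}
    while S:
        i, j = max(S, key=S.get)                 # most similar adjacent pair
        t = len(regions)
        regions.append(regions[i] | regions[j])  # r_t = r_i ∪ r_j
        # Remove all similarities involving r_i or r_j ...
        stale = [p for p in S if i in p or j in p]
        touched = {k for p in stale for k in p} - {i, j}
        for p in stale:
            del S[p]
        # ... and add similarities between r_t and the old neighbours of r_i, r_j.
        for k in touched:
            S[(k, t)] = s(regions[k], regions[t])
    return regions  # every region ever created; boxes are their bounding rectangles
```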
To group regions correctly, the region similarity function s has to assign high scores to pairs of
regions which are more likely to comprise objects and low scores to ones that do not. This requires
some built-in “objectness” assumptions. It is defined by:

$$s(r_i, r_j) = s_{color}(r_i, r_j) + s_{texture}(r_i, r_j) + s_{size}(r_i, r_j) + s_{fill}(r_i, r_j), \quad (1)$$

where $s_{color}$ and $s_{texture}$ measure the similarity between the color histograms and the texture histograms using SIFT-like features, $s_{size}$ is the fraction of the image that $r_i$ and $r_j$ jointly occupy, and $s_{fill}$ scores how well the shapes of the two regions fit together (e.g., they fit if merging them is likely to fill holes, and they do not fit if they are hardly touching each other).
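As a toy illustration of one of these terms: in the original Selective Search formulation, the color similarity is a histogram intersection over L1-normalized per-region color histograms, which could be sketched as:

```python
import numpy as np

def s_color(hist_i: np.ndarray, hist_j: np.ndarray) -> float:
    """Histogram intersection of two L1-normalized per-region color histograms."""
    return float(np.minimum(hist_i, hist_j).sum())
```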

4   Selecting Bounding Box Proposals

As mentioned in Section 3, the Selective Search algorithm attempts to sort the region proposals such
that ones that are more likely to be objects appear first. However, the number of region proposals is
large and the ranking is not precise. Therefore, we need a mechanism to choose the best ones to be
used as proposals during training (see Section 5 below). We consider the following three policies for
selecting boxes: Top-K, Random-K, and Importance Sampling.
Top-K. We follow the object ranking determined by the Selective Search algorithm. Specifically,
regions that are grouped earlier are ranked as ones that are more likely to be objects. We select the
top K ones as input to DETReg (see Section 5).
Random-K. We randomly select K candidates from the full list of proposals generated by Selective
Search. This yields lower quality candidates but encourages exploration.
Importance Sampling. In this approach, we aim to rely on the ranking of Selective Search, but also
utilize lower-ranked and more diverse proposals. More formally, let $b_1, \ldots, b_n$ be the set of $n$ sorted
region proposals, as calculated by the Selective Search algorithm, where $b_i$ has rank $i$. Let $X_i$
be a random variable indicating whether we include $b_i$ in the output proposals. Then we assign the
sampling probability for $X_i$ to be:

$$\Pr(X_i = 1) \propto -\log(i/n).$$

To determine if a box should be included, we randomly sample from its respective distribution. A sketch of the three policies is shown below.
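Here is a hedged sketch of the three policies, assuming boxes is already sorted by Selective Search rank (best first). The normalization inside importance_sample, which scales the weights so that roughly K boxes survive in expectation, is our reading of the proportionality above, not a detail stated in the paper.

```python
import numpy as np

def top_k(boxes, k):
    return boxes[:k]                           # keep the k highest-ranked proposals

def random_k(boxes, k, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(boxes), size=k, replace=False)
    return [boxes[i] for i in idx]

def importance_sample(boxes, k, rng=None):
    rng = rng or np.random.default_rng()
    n = len(boxes)
    w = -np.log(np.arange(1, n + 1) / n)       # weight decays with rank; w_n = 0
    p = np.minimum(1.0, k * w / w.sum())       # ~k boxes kept in expectation (our choice)
    keep = rng.random(n) < p                   # independent Bernoulli draw per box
    return [b for b, m in zip(boxes, keep) if m]
```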

5   The DETReg Model for Unsupervised Pretraining with Region Priors
Next, we turn to the key challenge this paper addresses: how to use unlabeled data for pretraining
an end-to-end detection model. Our approach uses unlabeled data to generate a pretraining task, or
pretext task, for DETR. The principal idea is to design this task to be as close as possible to object
detection, such that if our model succeeds on the pretext task, it is likely to transfer well to the object
detection task. Specifically, our goal is for the pretrained detector to understand both how to localize
objects and how to learn a good embedding of the object. The overall approach is shown in Fig. 2.
We use Deformable-DETR [59] as the detection architecture, although other architectures can also be
used. Recall that DETR detects up to N objects in an image, which is done by iteratively applying
attention and feedforward layers over the N object query vectors of the decoder and over the input
image features.
The last layer of the decoder results in N image-dependent query embeddings that are used to predict
bounding box coordinates and object categories. Formally, consider an input image x ∈ RH×W ×3 .
Then DETR uses x to calculate N image-dependent query embeddings v 1 , . . . , v N with v i ∈ Rd
(this is done by passing the image through a backbone, followed by a transformer, and processing of
the query vectors; see [4] for details). Then two prediction heads are applied to the v i . The first is
fbox : Rd → R4 , which predicts the bounding boxes. The second is fcat : Rd → RL , which outputs
a distribution over L object categories, including the background “no object” category. During
our unsupervised pretraining process, the fcat prediction head has only two outputs: object and
background, since we do not use any category labels. The two prediction heads are implemented via
MLPs, and in the finetuning phase (i.e., when training on a labeled target dataset) we drop the last
layer of fcat and replace it with a new fully-connected layer, setting the number of outputs according
to the number of categories in the target dataset.
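To make the heads concrete, here is a sketch following the shapes reported in Section 6 (MLPs with two hidden layers of size 256 and ReLU, output sizes 512 and 4, and a single fully-connected layer for fcat). The sigmoid on the box head is an assumption borrowed from the DETR convention of normalized box coordinates.

```python
import torch.nn as nn

d = 256  # query embedding dimension in Deformable DETR

def mlp(out_dim):
    return nn.Sequential(nn.Linear(d, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

f_box = nn.Sequential(mlp(4), nn.Sigmoid())  # normalized box coordinates (assumption)
f_emb = mlp(512)                             # predicted object descriptor ẑ_i
f_cat = nn.Linear(d, 2)                      # object vs. background during pretraining
```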
Recall that our goal is for the pretrained detector to both localize objects and to learn a good
embedding of the object, which should ideally capture the visual features and category of the object.
Accordingly, we devise two pretraining tasks, as follows.

Object Localization Task To teach the model to detect objects, we ideally need to provide it with
boxes that contain objects. Our key insight here is that this is precisely what Selective Search can do.
Namely, Selective Search can take an image and produce a large set of region proposals at a high
recall rate, i.e., most objects are likely to be covered by some region. However, it has very low precision
and it does not output category information (see [29, 28] for extensive evaluation of Selective Search
and other region proposal methods). Thus, the “Object Localization” pretraining task takes a set of M
boxes b1 , . . . , bM (where bi ∈ R4 ) output by Selective Search (see Section 4 on how to choose these
boxes) and optimizes a loss that minimizes the difference between the DETR predictions (i.e., the
outputs of the network fbox above) and these M boxes. As with DETR, the loss involves matching the
predicted boxes and the bi , as we explain later. We note that most of the Selective Search boxes will
clearly not contain actual objects. However, since the content of non-object boxes tends to be more
variable than that of object boxes, we expect that deep models can be trained to recognize objectness
even when given objectness labels that are very noisy, as in Selective Search. Our empirical results
support this intuition. In fact, we even show that after pretraining, DETReg outperforms Selective
Search in class-agnostic detection, suggesting that DETReg learned to ignore the incorrect proposals.

Object Embedding Task Recall that in the standard supervised training scheme of DETR, the
query embeddings v i are also used to classify the category of the object in the box via the prediction
head fcat . Thus we would like the v i embedding to capture information useful for category prediction.
Towards this end, we leverage existing region descriptors that provide good representations for
categorizing individual objects. Here we use SwAV [6], which obtains state-of-the-art unsupervised
image representations. For each box bi (of the M boxes output by Selective Search), we apply SwAV
to the region in the image specified by bi . Denote the corresponding SwAV descriptor by z i . Then
we introduce a network femb : Rd → Rd that tries to predict z i from the DETR embedding for this
box (namely v i ), and we minimize the loss of this prediction. Again, the loss here involves matching
the predicted boxes and bi as we explain below.
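A sketch of how such targets z_i could be computed: crop each proposal, resize it, and encode it with a frozen SwAV-pretrained ResNet-50. Loading the weights via torch.hub from facebookresearch/swav, the 128x128 crop size, and using raw backbone features rather than a projection head are all assumptions made for illustration; the paper also augments the patches before embedding them, which is omitted here.

```python
import torch
import torchvision.transforms as T

# Frozen encoder for the embedding targets; SwAV weights via torch.hub (assumed).
swav = torch.hub.load("facebookresearch/swav", "resnet50")
swav.fc = torch.nn.Identity()               # expose the backbone features
swav.eval()
for p in swav.parameters():
    p.requires_grad = False                 # targets come from a frozen network

to_tensor = T.Compose([T.Resize((128, 128)), T.ToTensor(),
                       T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

@torch.no_grad()
def embed_boxes(pil_image, boxes):
    """boxes: (x, y, w, h) proposals; returns one target z_i per box."""
    crops = [to_tensor(pil_image.crop((x, y, x + w, y + h))) for (x, y, w, h) in boxes]
    return swav(torch.stack(crops))         # [M, d] embedding targets
```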
Next, we describe how we train our model to optimize the above two tasks. Assume that Selective
Search always returns M object proposals. As explained above, these are used to generate M
bounding boxes b1 , . . . , bM and M SwAV descriptors z 1 , . . . , z M . Denote by yi = (bi , z i ) the

tuple containing bi and z i , and denote these M tuples by y. Recall that our goal is to train the DETR
model such that its N outputs are well aligned with y. Denote by v 1 , . . . , v N the image-dependent
query embeddings calculated by DETR (i.e., the output of the last layer of the DETR decoder). Recall
that we consider three prediction heads for DETR: fbox which outputs predicted bounding boxes,
fcat which predicts if the box is object or background, and femb which tries to reconstruct the SwAV
descriptor. We denote these three outputs as follows:
$$\hat{b}_i = f_{box}(v_i), \qquad \hat{z}_i = f_{emb}(v_i), \qquad \hat{p}_i = f_{cat}(v_i)$$

We use each such triplet to define a tuple $\hat{y}_i = (\hat{b}_i, \hat{z}_i, \hat{p}_i)$ and denote the set of N tuples by
$\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$. We assume that the number of DETR queries N is larger than M , and we therefore
pad y to obtain N tuples, and assign a label $c_i \in \{0, 1\}$ to each box in y to indicate whether it was a
Selective Search proposal ($c_i = 1$) or a padded proposal ($c_i = 0$). With the DETR family of object
detectors [59, 4], there are no assumptions on the order of the labels or the predictions, and therefore
we first match the objects of y to the ones in $\hat{y}$ via the Hungarian bipartite matching algorithm [33].
Specifically, we find the permutation σ that minimizes the matching cost between y and ŷ:

$$\sigma = \arg\min_{\sigma \in \Sigma_N} \sum_{i=1}^{N} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)}) \quad (2)$$

where $\mathcal{L}_{match}$ is the pairwise matching cost as defined in [4] and $\Sigma_N$ is the set of all
permutations over $\{1, \ldots, N\}$. Using the optimal σ, we define the loss as follows:
$$\mathcal{L}_{DETReg}(y, \hat{y}) = \sum_{i=1}^{N} \Big[ \lambda_f \mathcal{L}_{focal}(c_i, \hat{p}_{\sigma(i)}) + \mathbb{1}_{\{c_i \neq 0\}} \big( \lambda_b \mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)}) + \lambda_e \mathcal{L}_{emb}(z_i, \hat{z}_{\sigma(i)}) \big) \Big] \quad (3)$$
where $\mathcal{L}_{focal}$ is the focal loss [34], and $\mathcal{L}_{box}$ is based on the L1 loss and the Generalized
Intersection Over Union (GIoU) loss [42]. Finally, we define $\mathcal{L}_{emb}$ to be the L1 loss over the pairs $z_i$
and $\hat{z}_j$, which corresponds to the “Object Embedding” pretext task discussed above:

$$\mathcal{L}_{emb}(z_i, \hat{z}_j) = \| z_i - \hat{z}_j \|_1 \quad (4)$$
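Putting Eqs. (2)-(4) together, here is a simplified single-image sketch of the objective: Hungarian matching on a pairwise cost, followed by the three loss terms on the matched pairs. The cost keeps only an L1 box term and an object-probability term, plain cross-entropy stands in for the focal loss, boxes are assumed to be in xyxy format, tgt_labels is a 0/1 LongTensor, and the weights are illustrative rather than the paper's values.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou_loss

def detreg_loss(pred_boxes, pred_logits, pred_emb,   # [N,4], [N,2], [N,d] predictions
                tgt_boxes, tgt_emb, tgt_labels,      # [N,4], [N,d], [N] padded 0/1 targets
                lam_f=2.0, lam_b=5.0, lam_e=1.0):    # loss weights (illustrative)
    # Pairwise matching cost: L1 box distance minus the predicted "object" probability.
    cost = (torch.cdist(pred_boxes, tgt_boxes, p=1)
            - pred_logits.softmax(-1)[:, 1:2]).detach().cpu().numpy()
    pred_idx, tgt_idx = linear_sum_assignment(cost)  # the optimal permutation σ (Eq. 2)
    pred_idx, tgt_idx = torch.as_tensor(pred_idx), torch.as_tensor(tgt_idx)

    obj = tgt_labels[tgt_idx] == 1                   # c_i = 1: real (non-padded) proposals
    l_cat = F.cross_entropy(pred_logits[pred_idx], tgt_labels[tgt_idx])
    l_box = F.l1_loss(pred_boxes[pred_idx][obj], tgt_boxes[tgt_idx][obj]) + \
            generalized_box_iou_loss(pred_boxes[pred_idx][obj],
                                     tgt_boxes[tgt_idx][obj], reduction="mean")
    l_emb = F.l1_loss(pred_emb[pred_idx][obj], tgt_emb[tgt_idx][obj])    # Eq. 4
    return lam_f * l_cat + lam_b * l_box + lam_e * l_emb
```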

6   Experiments
In this section, we present an extensive evaluation of DETReg on standard benchmarks, MS COCO [35]
and PASCAL VOC [16], under both full- and low-data settings in Section 6.1. We visualize and analyze
DETReg in Section 6.2 to better illustrate the “objectness” encoded in the learned representations.
Datasets. We conduct the pretraining stage on ImageNet100 (IN100) [13] following prior work
[53, 46, 54], and evaluate the learned representation by fine-tuning the model on MS COCO 2017 [35]
or PASCAL VOC [16], with the full data or small subsets of the data (1%, 2%, 5%, and 10%) that were
randomly chosen and remained consistent for all models. IN100 is a subset of the ImageNet
(IN-1K) ILSVRC 2012 challenge data that contains around 125K images (∼10% of the full ImageNet
data) from 100 different classes. When using IN100, we only make use of the images and do not use
class information. We follow the standard protocols of earlier works [53, 25, 24] and train on the
train2017 partition and evaluate on the val2017 partition. Similarly, for PASCAL VOC, we
utilize the trainval07+12 partitions for fine-tuning and test07 for evaluation.
Baselines. We adopt the recent Deformable DETR [59] model with a ResNet-50 [26] backbone
as our base object detector architecture. We compare DETReg against architectures which utilize
supervised and unsupervised ImageNet pretrained backbones. For example, the standard Deformable
DETR uses a supervised backbone and “Deformable DETR w/ SwAV” uses a SwAV [6] backbone.
DETReg is pretrained on IN100 in an unsupervised way and uses a SwAV backbone that was trained
on IN-1K. We also compare to various past works that reported results on MS COCO and PASCAL
VOC [6, 53, 55] after pretraining on the full IN-1K.

Pretraining stage. We initialize the backbone of DETReg with a pretrained SwAV [6] model,
which is kept fixed throughout the pretraining. femb and fbox are MLPs
with 2 hidden layers of size 256, each followed by a ReLU [39] nonlinearity. The output sizes of femb

and fbox are 512 and 4, respectively. fcat is implemented as a single fully-connected layer with 2 outputs. We
select M = 30 proposals per image and run experiments on an NVIDIA DGX machine with 8 V100
GPUs. To compute the Selective Search boxes, we use the compact “fast” version implemented
in the OpenCV [3] Python package.
All our models are pretrained on unlabeled data for 50 epochs with a batch size of 28 per GPU. We
use a learning rate of 2 · 10−4 and decay it by a factor of 10 after 40 epochs. During
pretraining, we randomly flip, crop, and resize the images, similar to previous works [59, 12]. We
randomly choose a scale from 10 predefined sizes between 320 and 480, and resize the image so that
its shorter side matches this scale. We also perform similar image augmentations over the patches that
correspond to region proposals, before obtaining the embeddings z i .
Fine-tuning stage. When fine-tuning, we drop the femb branch and set the last fully-connected
layer of fcat to have output size equal to the number of classes in the target dataset plus background.
This model is then identical in structure to Deformable DETR. For MS COCO fine-tuning, we
use the exact hyperparameters suggested in [59]. For PASCAL VOC, which is smaller, we adopt a
different LR schedule: we train for 100 epochs and drop the LR after 70 epochs. Similarly, for low-data
regime experiments, we train all models for 150 epochs to ensure training converges. We run
every experiment once and therefore do not report confidence intervals, mainly because we operate
under a limited budget and our core experiments require a few days of training. A sketch of the head surgery follows.
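A minimal sketch of this surgery, assuming a model object whose heads are exposed as attributes named f_emb and f_cat (these names are illustrative, not taken from the released code):

```python
import torch.nn as nn

def prepare_for_finetuning(model, num_classes, hidden_dim=256):
    """Adapt a pretrained DETReg-style model for fine-tuning on labeled data."""
    model.f_emb = nn.Identity()  # the embedding branch is dropped after pretraining
    # f_cat had 2 outputs (object / background); replace it with a layer sized
    # for the target dataset's classes plus the background class.
    model.f_cat = nn.Linear(hidden_dim, num_classes + 1)
    return model
```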

6.1   Evaluation Results

Results on MS COCO. We present the evaluation results on MS COCO with various amounts of data
for fine-tuning in Figure 3. DETReg consistently outperforms previous pretraining strategies, with
larger margins when fine-tuning on less data. For example, DETReg improves the average precision
(AP) score from 8.7 to 15.6 points with 1% of the full training data of MS COCO.

[Figure 3: three panels plotting AP, AP50, and AP75 against the percentage of labeled data (1, 2, 5, 10, 100), comparing Deformable DETR, MoCoV2, Def. DETR + SwAV, and DETReg.]
Figure 3: Object detection results finetuned on MS COCO train2017 and evaluated on val2017.
DETReg consistently outperforms previous pretraining approaches by a large margin. When fine-tuning with 1% data, DETReg improves by 5 points in AP over prior methods.

Table 1: Object detection fine-tuned on PASCAL VOC trainval07+12 and evaluated on
test07. DETReg improves by 6 points in AP using 10% of the data and by 2.5 points in AP using the full data.
                                                       PASCAL VOC 10%               PASCAL VOC 100%
             Method
                                                      AP    AP50 AP75               AP   AP50 AP75
             Deformable DETR                          42.9  67.6  45.4              59.5 82.6   65.6
             Deformable DETR w/ SwAV [6]              46.5  71.2  50.5              61.0 83.0   68.1
             DETReg (ours)                            51.4  72.2  56.6              63.5 83.3   70.3

Results on PASCAL VOC. We pretrain DETReg over IN100 and finetune it on PASCAL VOC. As
mentioned earlier, we follow train/test splits reported in prior works. We finetune using 10% and
100% of the training data. Results in Table 1 demonstrate that DETReg consistently improves over
baselines on PASCAL VOC both in the low-data regime and with the full data. For example, it
improves by 2.5 (AP) points over Deformable DETR with SwAV and by 4 points over the standard
Deformable DETR. Furthermore, DETReg achieves improved overall performance compared to
previous approaches (see Table 2).

Few shot object detection. We pretrain DETReg over IN100 then follow the standard COCO
protocol for few-shot object detection. Specifically, we finetune over the full data of 60 base classes
and then finetune over 80 classes which include the 60 base classes plus 20 novel classes with only
k ∈ {10, 30} samples per class. We use the same splits as [49] and report the performance over the
novel 20 classes. For fine-tuning over the base classes, we use the standard hyperparameters as reported
in the main paper. For fine-tuning over the novel classes, for k = 10 we use a learning rate of 2 · 10−5 for
30 epochs, and for k = 30 we train for 50 epochs with a learning rate of 2 · 10−4 . In both settings, we do
not drop the learning rate. The results in Table 3 confirm the advantage of DETReg, with a very large
improvement of 7 points and almost 10 points in AP and AP75 in the 30-shot setting.
Table 2: PASCAL VOC leaderboard.

Method           AP     AP50   AP75
DETR             54.1   78.0   58.3
Faster RCNN      56.1   82.6   62.7
Def. DETR        59.5   82.6   65.6
InsDis [52]      55.2   80.9   61.2
Jigsaw [22]      48.9   75.1   52.9
Rotation [22]    46.3   72.5   49.3
NPID++ [37]      52.3   79.1   56.9
SimCLR [8]       51.5   79.4   55.6
PIRL [37]        54.0   80.7   59.7
BoWNet [18]      55.8   81.3   61.1
MoCo [25]        55.9   81.5   62.6
MoCo-v2 [10]     57.0   82.4   63.6
SwAV [6]         56.1   82.6   62.7
UP-DETR [12]     57.2   80.1   62.0
DenseCL [50]     58.7   82.8   65.2
DetCo [55]       58.2   82.7   65.0
ReSim [53]       59.2   82.9   65.9
DETReg (ours)    63.5   83.3   70.3

Table 3: Few-shot detection performance for the 20 novel categories on the COCO dataset. Our approach consistently outperforms baseline methods across all shots and metrics. We follow the data split used in [49].

Model                          novel AP        novel AP75
                               10      30      10      30
FSRW [31]                      5.6     9.1     4.6     7.6
MetaDet [51]                   7.1     11.3    6.1     8.1
FRCN+ft-full [56]              6.5     11.1    5.9     10.3
Meta R-CNN [56]                8.7     12.4    6.6     10.8
FRCN+ft-full (impl. by [49])   9.2     12.5    9.2     12.0
TFA [49]                       10.0    13.7    9.3     13.4
Meta-DETR [57]                 17.8    22.9    18.5    23.8
DETReg (ours)                  18.0    30.0    20.0    33.7

Comparison with semi-supervised approaches. Here we compare DETReg to previous works that
employ semi-supervised learning of both labeled and unlabeled data. Semi-supervised approaches
like [36], utilize k% of labeled data but also use the rest of the data without labels. We pretrain
DETReg on the entire COCO train2017 data without using labels, and fine-tune on a random k% of
train2017 data for k ∈ {1, 2, 5, 10}. After the pretraining stage, we fine-tune with a learning rate
of 2 · 10−4 . For k ∈ {5, 10} we fine-tune for 100 epochs and decay the learning rate by a factor of 10
after 40 epochs. For k ∈ {1, 2} we fine-tune for 200 epochs and drop the learning rate after 150 epochs.
In each setting, we train 4 different models, each with a different random seed, and report
the standard deviation of the results. The results in Table 4 confirm the advantage of the DETReg
pretraining stage, which results in improved performance for all settings.

Table 4: Comparison to semi-supervised detection methods, trained over train2017 with a limited
amount of labeled data. All approaches utilize only k% of labeled COCO while using the rest as
unlabeled data. Evaluation on val2017.
                                                                COCO
            Method
                                          1%             2%           5%                   10%
            CSD [30]                  10.51 ± 0.06   13.93 ± 0.12 18.63 ± 0.07         22.46 ± 0.08
            STAC [45]                 13.97 ± 0.35   18.25 ± 0.25 24.38 ± 0.12         28.64 ± 0.21
            Unbiased-Teacher [36]     20.75 ± 0.12   24.30 ± 0.07 28.27 ± 0.11         31.50 ± 0.10
            DETReg (ours)             22.90 ± 0.14   28.68 ± 0.26    30.19 ± 0.11      35.34 ± 0.43

6.2   Ablation and Visualization

Ablation studies. To assess the contribution of the various components in our system, we perform
an extensive ablation study. Specifically, we validate the contribution of using Selective Search
by replacing the proposals of an image with other proposals that were produced for a different
(randomly chosen) image; the goal is to check that the Selective Search proposals indeed
capture some objectness. Next, we assess the embedding loss Lemb contribution by training with
multiple coefficients λe ∈ {0, 1, 2}. Finally, we validate that there is no performance drop in the
model when freezing the backbone during training. All the models are trained on IN100 for 50 epochs
and finetuned on MS COCO. The results in Table 5 confirm our design choices.

Table 5: We test how pretraining with different settings on IN100 transfers when fine-tuned on
MS COCO train2017 and evaluated on val2017.
Model    Reg. Proposals   Embedding Loss   Frozen Backbone   Class Error ↓   Box Error ↓
         Random           λe = 0                                 11.3           .044
         ✓                λe = 0                                 9.50           .037
DETReg   ✓                λe = 1                                 8.81           .037
         ✓                λe = 2                                 9.14           .039
         ✓                λe = 1           ✓                     8.61           .037

Class Agnostic Object Detection. The goal of this experiment is to assess the performance of
DETReg in detecting objects right after the pretraining stage. We compare the pretrained DETReg de-
tectors using different box selection methods described in Section 4 to other unsupervised pretraining
methods and to the classical Selective Search algorithm, which does not utilize any annotated data.
We report the class-agnostic object detection performance in Table 6. Surprisingly, although trained
on Selective Search annotations, the DETReg variants achieve improved performance on this task,
which is likely due to helpful inductive biases in the training process and detection architecture. We
find that Top-K, the simplest setting of our model, performs best, and we therefore
use this approach throughout the rest of this work.

Table 6: Class agnostic object proposal evaluation on MS COCO val2017. For each method, we
consider the top 100 proposals.
            Method                       AP     AP50           AP75      R@1       R@10         R@100
            Def. DETR [59]               0.0        0.0          0.0      0.0         0.1          0.6
            w/ SwAV [6]                  0.0        0.0          0.0      0.0         0.1          0.6
            UP-DETR [12]                 0.0        0.0          0.0      0.0         0.0          0.4
            Rand. Prop.                  0.0        0.0          0.0      0.0         0.0          0.8
            Selective Search [47]        0.2        0.5          0.1      0.2         1.5         10.9
            DETReg-IS (ours)             0.7        2.0          0.1      0.3         1.8          9.0
            DETReg-RK (ours)             0.7        2.4          0.2      0.5         2.9         11.7
            DETReg-TK (ours)             1.0        3.1          0.6      0.6         3.6         12.7

Visualizing DETReg. Figure 5 shows qualitative examples of DETReg unsupervised box predic-
tions, and similar to [59], it also shows the pixel-level gradient norm for the x/y bounding box center
and the object embedding. These gradient norms indicate how sensitive the predicted values are to
perturbations of the input pixels. For the first three columns, DETReg attends to the object edges for
the x/y predictions and to the object region for the predicted embedding z. The final column shows a limitation,
where the background plays a more important role than the object in the embedding. We observed
this phenomenon occasionally, and it could be explained by the fact that we do not use any object class
labels during the pretraining. Furthermore, we examine the learned object queries (see Figure 4), and
observe that they specialize in a similar way to those in Deformable DETR, despite not using any human-annotated
data. The main difference is that the Deformable DETR slots have greater variance in the
predicted box locations and are typically more dominated by a particular shape of bounding box
(more points are of a single color).

7   Discussion
In this work, we presented DETReg, an unsupervised pretraining approach for object DEtection
with TRansformers using Region priors. Our model and proposed pretext tasks are based on the
observation that in order to learn meaningful representations in the unsupervised pretraining stage, the
detector model not only has to learn to localize, but also to obtain good embeddings of the detected
objects. Our results demonstrate that this approach is beneficial across the board on MS COCO and

Figure 4: Despite not using any labeled training data, each DETReg slot specializes to a specific area
of each image and uses a variety of box sizes much like Deformable DETR. Each square corresponds
to a DETR slot, and shows the location of its bounding box predictions. We compare 10 slots of the
supervised Deformable DETR (top) and unsupervised DETReg (bottom) decoder for the MS COCO
2017 val dataset. Each point shows the center coordinate of the predicted bounding box, where,
following a similar plot in [4], a green point represents a square bounding box, an orange point is
a large horizontal bounding box, and a blue point is a large vertical bounding box. Deformable
DETR has been trained on MS COCO 2017 data, while DETReg has only been trained on unlabeled
ImageNet data.

[Figure 5 rows, top to bottom: $\|\partial x / \partial I\|$, $\|\partial y / \partial I\|$, $\|\partial z / \partial I\|$]

Figure 5: Shown are the gradient norms from the unsupervised DETReg detection with respect to the
input image I for (top) the x coordinate of the object center, (middle) the y coordinate of the object
center, (bottom) the feature-space embedding, z.

PASCAL VOC, and especially on low-data regime settings, compared to challenging supervised
and unsupervised baselines. In general, unsupervised approaches can allow models to pretrain on
large amounts of unlabeled data, which is very advantageous in domains like medical imaging where
obtaining human-annotated data is expensive.

Acknowledgements
We would like to thank Sayna Ebrahimi for helpful feedback and discussions. This project has
received funding from the European Research Council (ERC) under the European Union's Horizon
2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell's group was
supported in part by DoD including DARPA's XAI, LwLL, and/or SemaFor programs, as well as
BAIR's industrial alliance programs. GC's group was supported by the Israel Science Foundation
(ISF 737/2018), and by an equipment grant to GC and Bar-Ilan University from the Israel Science
Foundation (ISF 2332/18). This work was completed in partial fulfillment of the Ph.D. degree of the
first author.

References
 [1] Alexe, B., Deselaers, T., Ferrari, V.: What is an object? In: CVPR. IEEE (2010) 3
 [2] Arbeláez, P., Pont-Tuset, J., Barron, J.T., Marques, F., Malik, J.: Multiscale combinatorial
     grouping. In: CVPR (2014) 3
 [3] Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000) 3, 7
 [4] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object
     detection with transformers. ECCV (2020) 2, 3, 5, 6, 10
 [5] Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of
     visual features. In: ECCV (2018) 3
 [6] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of
     visual features by contrasting cluster assignments. NeurIPS (2020) 1, 2, 5, 6, 7, 8, 9
 [7] Carreira, J., Sminchisescu, C.: Cpmc: Automatic object segmentation using constrained para-
     metric min-cuts. TPAMI 34(7), 1312–1328 (2011) 3
 [8] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning
     of visual representations. arXiv preprint arXiv:2002.05709 (2020) 1, 2, 8
 [9] Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.: Big self-supervised models are
     strong semi-supervised learners. NeurIPS (2020) 1
[10] Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive
     learning. arXiv preprint arXiv:2003.04297 (2020) 2, 8
[11] Cheng, M.M., Zhang, Z., Lin, W.Y., Torr, P.: Bing: Binarized normed gradients for objectness
     estimation at 300fps. In: CVPR (2014) 3
[12] Dai, Z., Cai, B., Lin, Y., Chen, J.: UP-DETR: Unsupervised pre-training for object detection
     with transformers. CVPR (2021) 1, 2, 3, 7, 8, 9
[13] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical
     image database. In: CVPR. Ieee (2009) 1, 6
[14] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context
     prediction. In: ICCV (2015) 3
[15] Endres, I., Hoiem, D.: Category-independent object proposals with diverse ranking. TPAMI
     36(2), 222–234 (2013) 3
[16] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual
     object classes (voc) challenge. IJCV 88(2), 303–338 (2010) 2, 6
[17] Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. IJCV 59(2),
     167–181 (2004) 4
[18] Gidaris, S., Bursuc, A., Komodakis, N., Pérez, P., Cord, M.: Learning representations by
     predicting bags of visual words. In: CVPR (2020) 8
[19] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image
     rotations. arXiv preprint arXiv:1803.07728 (2018) 3
[20] Girshick, R.: Fast r-cnn. In: ICCV (2015) 3
[21] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object
     detection and semantic segmentation. In: CVPR (2014) 3
[22] Goyal, P., Mahajan, D., Gupta, A., Misra, I.: Scaling and benchmarking self-supervised visual
     representation learning. In: ICCV (2019) 8
[23] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C.,
     Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent-a new approach
     to self-supervised learning. NeurIPS (2020) 1
[24] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual
     representation learning. In: CVPR (2020) 1, 2, 3, 6
[25] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual
     representation learning. In: CVPR (2020) 6, 8

[26] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR
     (2016) 1, 3, 6
[27] Hénaff, O.J., Koppula, S., Alayrac, J.B., Oord, A.v.d., Vinyals, O., Carreira, J.: Efficient visual
     pretraining with contrastive detection. arXiv preprint arXiv:2103.10957 (2021) 2, 3
[28] Hosang, J., Benenson, R., Dollár, P., Schiele, B.: What makes for effective detection proposals?
     TPAMI 38(4), 814–830 (2015) 3, 5
[29] Hosang, J., Benenson, R., Schiele, B.: How good are detection proposals, really? arXiv preprint
     arXiv:1406.6962 (2014) 3, 5
[30] Jeong, J., Lee, S., Kim, J., Kwak, N.: Consistency-based semi-supervised learning for object
     detection. In: NeurIPS (2019) 8
[31] Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., Darrell, T.: Few-shot object detection via feature
     reweighting. In: ICCV (2019) 8
[32] Krähenbühl, P., Koltun, V.: Geodesic object proposals. In: ECCV. pp. 725–739. Springer (2014)
     3
[33] Kuhn, H.W.: The hungarian method for the assignment problem. Naval research logistics
     quarterly 2(1-2), 83–97 (1955) 6
[34] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In:
     ICCV (2017) 6
[35] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.:
     Microsoft coco: Common objects in context. In: ECCV. Springer (2014) 2, 6
[36] Liu, Y.C., Ma, C.Y., He, Z., Kuo, C.W., Chen, K., Zhang, P., Wu, B., Kira, Z., Vajda, P.:
     Unbiased teacher for semi-supervised object detection. arXiv preprint arXiv:2102.09480 (2021)
     8
[37] Misra, I., Maaten, L.v.d.: Self-supervised learning of pretext-invariant representations. In:
     CVPR (2020) 8
[38] Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal
     order verification. In: ECCV. Springer (2016) 3
[39] Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: ICML
     (2010) 6
[40] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature
     learning by inpainting. In: CVPR (2016) 3
[41] Reed, C.J., Yue, X., Nrusimha, A., Ebrahimi, S., Vijaykumar, V., Mao, R., Li, B., Zhang, S.,
     Guillory, D., Metzger, S., et al.: Self-supervised pretraining improves self-supervised pretraining.
     arXiv preprint arXiv:2103.12718 (2021) 3
[42] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized in-
     tersection over union: A metric and a loss for bounding box regression. In: CVPR (2019)
     6
[43] Roh, B., Shin, W., Kim, I., Kim, S.: Spatially consistent representation learning. arXiv preprint
     arXiv:2103.06122 (2021) 3
[44] Van de Sande, K.E., Uijlings, J.R., Gevers, T., Smeulders, A.W.: Segmentation as selective
     search for object recognition. In: ICCV. IEEE (2011) 3
[45] Sohn, K., Zhang, Z., Li, C.L., Zhang, H., Lee, C.Y., Pfister, T.: A simple semi-supervised
     learning framework for object detection. arXiv preprint arXiv:2005.04757 (2020) 8
[46] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849
     (2019) 6
[47] Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object
     recognition. IJCV 104(2), 154–171 (2013) 2, 3, 9
[48] Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust
     features with denoising autoencoders. In: ICML (2008) 3
[49] Wang, X., Huang, T.E., Darrell, T., Gonzalez, J.E., Yu, F.: Frustratingly simple few-shot object
     detection. arXiv preprint arXiv:2003.06957 (2020) 8

[50] Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Dense contrastive learning for self-supervised
     visual pre-training. arXiv preprint arXiv:2011.09157 (2020) 8
[51] Wang, Y.X., Ramanan, D., Hebert, M.: Meta-learning to detect rare objects. In: Proceedings of
     the IEEE International Conference on Computer Vision. pp. 9925–9934 (2019) 8
[52] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance
     discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
     Recognition. pp. 3733–3742 (2018) 2, 8
[53] Xiao, T., Reed, C.J., Wang, X., Keutzer, K., Darrell, T.: Region similarity representation
     learning. arXiv preprint arXiv:2103.12902 (2021) 3, 6, 8
[54] Xiao, T., Wang, X., Efros, A.A., Darrell, T.: What should not be contrastive in contrastive
     learning. arXiv preprint arXiv:2008.05659 (2020) 6
[55] Xie, E., Ding, J., Wang, W., Zhan, X., Xu, H., Li, Z., Luo, P.: Detco: Unsupervised contrastive
     learning for object detection. arXiv preprint arXiv:2102.04803 (2021) 3, 6, 8
[56] Yan, X., Chen, Z., Xu, A., Wang, X., Liang, X., Lin, L.: Meta r-cnn: Towards general solver
     for instance-level low-shot learning. In: Proceedings of the IEEE International Conference on
     Computer Vision. pp. 9577–9586 (2019) 8
[57] Zhang, G., Luo, Z., Cui, K., Lu, S.: Meta-detr: Few-shot object detection via unified image-level
     meta-learning. arXiv preprint arXiv:2103.11731 (2021) 8
[58] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: ECCV. Springer (2016) 3
[59] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers
     for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020) 2, 3, 4, 5, 6, 7, 9
[60] Zitnick, C.L., Dollár, P.: Edge boxes: Locating object proposals from edges. In: ECCV. Springer
     (2014) 3
