KeypointNet: A Large-scale 3D Keypoint Dataset Aggregated from Numerous Human Annotations


Yang You, Yujing Lou∗, Chengkun Li∗, Zhoujun Cheng, Liangwei Li, Lizhuang Ma, Weiming Wang†, Cewu Lu
Shanghai Jiao Tong University, China
arXiv:2002.12687v6 [cs.CV] 7 Aug 2020

                                                                      Abstract

Detecting keypoints of 3D objects is of great interest to both the graphics and computer vision communities. There have been several 2D and 3D keypoint datasets aiming to address this problem in a data-driven way. These datasets, however, either lack scalability or bring ambiguity to the definition of keypoints. Therefore, we present KeypointNet: the first large-scale and diverse 3D keypoint dataset, containing 103,450 keypoints and 8,234 3D models from 16 object categories, built by leveraging numerous human annotations. To handle the inconsistency between annotations from different people, we propose a novel method to aggregate these keypoints automatically, through minimization of a fidelity loss. Finally, ten state-of-the-art methods are benchmarked on our proposed dataset. Our code and data are available at https://github.com/qq456cvb/KeypointNet.

Figure 1. We propose a large-scale KeypointNet dataset. It contains 8K+ models and 103K+ keypoint annotations.

1. Introduction

Detection of 3D keypoints is essential in many applications such as object matching, object tracking, shape retrieval and registration [22, 6, 37]. Using keypoints to match 3D objects has the advantage of providing semantically significant features, and such keypoints are usually made invariant to rotations, scales and other transformations.

In the trend of deep learning, 2D semantic point detection has been boosted with the help of a large quantity of high-quality datasets [3, 23]. However, there are few 3D datasets focusing on the keypoint representation of an object. Dutagaci et al. [11] collect 43 models and label them according to annotations from various persons; annotations from different persons are finally aggregated by geodesic clustering. The ShapeNetCore keypoint dataset [42] and a similar dataset [15] instead resort to an expert's annotation of keypoints, making them vulnerable and biased.

In order to alleviate the bias of experts' definitions of keypoints, we ask a large group of people to annotate various keypoints according to their own understanding. Challenges arise in that different people may annotate different keypoints, so we need to identify the consensus and patterns in these annotations. Finding such patterns is not trivial when a large set of keypoints spreads across the entire model. A simple clustering would require a predefined distance threshold and would fail to identify closely spaced keypoints. As shown in Figure 1, there are four closely spaced keypoints on each airplane empennage, and it is extremely hard for simple clustering methods to distinguish them. Besides, clustering algorithms do not give semantic labels of keypoints, since it is ambiguous to link clustered groups with each other. In addition, people's annotations are not always exact, and errors in annotated keypoint locations are inevitable. To solve these problems, we propose a novel method to aggregate a large number of keypoint annotations from distinct people by optimizing a fidelity loss. After this automatic aggregation process, we verify the generated keypoints based on some simple priors such as symmetry.

∗ These authors contributed equally.
† Weiming Wang is the corresponding author.

In this paper, we build the first large-scale and diverse dataset, named KeypointNet, which contains 8,234 models with 103,450 keypoints. These keypoints are of high fidelity and rich in structural or semantic meanings. Some examples are given in Figure 1. We hope this dataset can boost the semantic understanding of common objects.

In addition, we propose two large-scale keypoint prediction tasks: keypoint saliency estimation and keypoint correspondence estimation. We benchmark ten state-of-the-art algorithms with mIoU, mAP and PCK metrics. Results show that the detection and identification of keypoints remain a challenging task.

In summary, we make the following contributions:

• To the best of our knowledge, we provide the first large-scale dataset on 3D keypoints, both in number of categories and keypoints.

• We come up with a novel approach for aggregating people's annotations of keypoints, even though their annotations are independent of each other.

• We experiment with ten state-of-the-art benchmarks on our dataset, including point cloud, graph, voxel and local geometry based keypoint detection methods.

2. Related Work

2.1. Detection of Keypoints

Detection of 3D keypoints has been a very important task for 3D object understanding, with applications such as object pose estimation, reconstruction, matching and segmentation. Researchers have proposed various methods to produce interest points on objects to help further object processing. Traditional methods like 3D Harris [31], HKS [32], Salient Points [7], Mesh Saliency [18], Scale Dependent Corners [24], CGF [14] and SHOT [35] exploit local reference frames (LRF) to extract geometric features as local descriptors. However, these methods only consider local geometric information without semantic knowledge, which forms a gap between detection algorithms and human understanding.

Recent deep learning methods like SyncSpecCNN [42] and deep functional dictionaries [33] have been proposed to detect keypoints. Unlike traditional ones, these methods do not handle rotations well. Though some recent methods like S2CNN [9] and PRIN [43] try to fix this, deep learning methods still rely on ground-truth keypoint labels annotated by humans with expert verification.

2.2. Keypoint Datasets

Keypoint datasets have their origin in 2D images, where plenty of datasets on human skeletons and object interest points have been proposed. For human skeletons, the MPII human pose dataset [3], the MSCOCO keypoint challenge [1] and PoseTrack [2] annotate millions of keypoints on humans. For more general objects, SPair-71k [23] contains 70,958 image pairs with diverse variations in viewpoint and scale, with a number of corresponding keypoints on each image pair. CUB [36] provides 15 part locations on 11,788 images from 200 bird categories, and PASCAL [5] provides keypoint annotations for 20 object categories. HAKE [19] provides numerous annotations on human interactiveness keypoints. ADHA [25] annotates key adverbs in videos, which are sequences of 2D images.

Keypoint datasets on 3D objects include Dutagaci et al. [11], SyncSpecCNN [42] and Kim et al. [15]. Dutagaci et al. [11] aggregate multiple annotations from different people with an ad-hoc method, but the dataset is extremely small. Though SyncSpecCNN [42], Pavlakos et al. [26] and Kim et al. [15] give relatively large keypoint datasets, they rely on a manually designed template of keypoints, which is inevitably biased and flawed. GraspNet [12] gives dense annotations for 3D object grasping. The differences between these datasets and ours are illustrated in Table 1.

3. KeypointNet: A Large-scale 3D Keypoint Dataset

3.1. Data Collection

KeypointNet is built on ShapeNetCore [8]. ShapeNetCore covers 55 common object categories with about 51,300 unique 3D models.

We filter out those models that deviate from the majority and keep at most 1,000 instances per category in order to provide a balanced dataset. In addition, a consistent canonical orientation (e.g., upright and front) is established for every category because of the incomplete alignment in ShapeNetCore.

We let annotators determine which points are important, and the same keypoint indices should indicate the same meanings for each annotator. Though annotators are free to give their own keypoints, three general principles should be obeyed: (1) each keypoint should describe an object's semantic information shared across instances of the same object category, (2) keypoints of an object category should spread over the whole object, and (3) different keypoints should have distinct semantic meanings. After that, we utilize a heuristic method to aggregate these points, which will be discussed in Section 4.

Keypoints are annotated on meshes, and these annotated meshes are then downsampled to 2,048 points. Our final dataset is a collection of point clouds with keypoint indices.

3.2. Annotation Tools

We develop an easy-to-use web annotation tool based on NodeJS. Every user is allowed to click up to 24 interest points according to his/her own understanding.
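The downsampling step in Section 3.1 (each annotated mesh is reduced to 2,048 points) can be made concrete. The paper does not state which sampling scheme is used; farthest point sampling is a common choice for producing a fixed-size, evenly covering point cloud, and the sketch below assumes it (the mesh loader is hypothetical):

```python
import numpy as np

def farthest_point_sampling(vertices: np.ndarray, n_samples: int = 2048) -> np.ndarray:
    """Iteratively pick the vertex farthest from the already chosen set.

    vertices: (N, 3) array of mesh vertex positions, with N >= n_samples.
    Returns the indices of the sampled vertices.
    """
    n = vertices.shape[0]
    chosen = np.zeros(n_samples, dtype=np.int64)  # first index defaults to 0 (arbitrary seed)
    dist = np.full(n, np.inf)  # distance from each vertex to the chosen set
    for i in range(1, n_samples):
        # Refresh distances with the most recently added sample, then take the farthest vertex.
        dist = np.minimum(dist, np.linalg.norm(vertices - vertices[chosen[i - 1]], axis=1))
        chosen[i] = int(np.argmax(dist))
    return chosen

# Hypothetical usage:
# verts = load_mesh_vertices("model.obj")  # placeholder loader, not from the paper
# cloud = verts[farthest_point_sampling(verts, 2048)]
```

Unlike uniform random sampling, farthest point sampling covers thin structures evenly, which matters when closely spaced annotated keypoints must remain resolvable after downsampling.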
Dataset                 Domain    Correspondence   Template-free   Instances   Categories   Keypoints   Format
FAUST [4]               human     ✓                ×               100         1            689K        mesh
SyncSpecCNN [42]        chair     ✓                ×               6243        1            ∼60K        point cloud
Dutagaci et al. [11]    general   ×                ✓               43          16

Table 1. Comparison of keypoint datasets.
Figure 3. Dataset visualization: ground-truth keypoints plotted for several categories (airplane, bottle, cap, chair, skateboard and table). By utilizing our automatic aggregation method, keypoints of high fidelity are extracted.

where dθ is the squared L2 distance between two vectors in the embedding space:

$$d_\theta(a, b) = \|g_\theta(a) - g_\theta(b)\|_2^2,$$

and

$$n^* = \arg\min_n d_\theta(l_{m,n}^{(c)}, y_m).$$

Unlike previous methods such as Dutagaci et al. [11], where a simple geodesic average of human-labeled points is given as the ground truth, we seek a point whose expected embedding distance to all human-labeled points is smallest. The reason is that geodesic distance is sensitive to misannotated keypoints and cannot distinguish closely spaced keypoints, while embedding distance is more robust to noisy points, as the embedding space encodes the semantic information of an object.

Equation 1 involves both θ and Y, and it is impractical to solve this problem in closed form. In practice, we use alternating minimization with a deep neural network to approximate the embedding function gθ, so that we solve the following dual problem instead (by slightly loosening the constraints):

$$\theta^* = \arg\min_\theta \Big[ H(g_\theta) + \lambda \sum_{m_1, m_2} \sum_{k=1}^{K_m} \|g_\theta(y_{m_1,k}) - g_\theta(y_{m_2,k})\|_2^2 \Big], \tag{2}$$

$$Y^* = \arg\min_Y \sum_{m=1}^{M} \sum_{c=1}^{C} \sum_{k=1}^{K_m} \int_{\mathcal{M}_m} d_\theta(x, y_{m,k}) \, \frac{\phi(l_{m,n^*}^{(c)}, x)}{Z(\phi)} \, dx \quad \text{s.t.}\ g_\theta(y_{m_1,k}) \equiv g_\theta(y_{m_2,k}),\ \forall m_1, m_2, k, \tag{3}$$

and alternate between the two equations until convergence.

By solving this problem, we find both an optimal embedding function gθ and intra-class consistent ground-truth keypoints Y, while keeping their embedding distance to the human-labeled keypoints as small as possible. The ground-truth keypoints can be viewed as the projection of human-labeled data onto the embedding space.
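To make the notation concrete, the embedding distance dθ, the assignment n∗ and the consistency term of Equation 2 are straightforward to express in PyTorch. This is a sketch under our own naming, assuming gθ has already mapped points to D-dimensional embeddings; it is not the authors' released code:

```python
import torch

def d_theta(g_a: torch.Tensor, g_b: torch.Tensor) -> torch.Tensor:
    """Squared L2 embedding distance d_theta(a, b) = ||g(a) - g(b)||_2^2,
    for embeddings of shape (..., D)."""
    return ((g_a - g_b) ** 2).sum(dim=-1)

def nearest_annotation_index(ann_emb: torch.Tensor, y_emb: torch.Tensor) -> torch.Tensor:
    """n* = argmin_n d_theta(l_{m,n}, y_m): the index of the annotator's click
    closest to the candidate ground-truth point y_m in embedding space.

    ann_emb: (N, D) embeddings of one annotator's keypoints on model m.
    y_emb:   (D,) embedding of the candidate point y_m.
    """
    return torch.argmin(d_theta(ann_emb, y_emb.unsqueeze(0)))

def consistency_penalty(y_emb_1: torch.Tensor, y_emb_2: torch.Tensor) -> torch.Tensor:
    """The lambda-weighted term of Equation 2: corresponding ground-truth
    keypoints on two models should coincide in embedding space.

    y_emb_1, y_emb_2: (K, D) embeddings of the K keypoints of two models,
    aligned by the shared index k.
    """
    return d_theta(y_emb_1, y_emb_2).sum()
```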
Figure 4. Keypoint aggregation pipeline (raw annotations → embedding space → fidelity error map → NMS → potential keypoints → t-SNE clustering → human verification → aggregated keypoints). We first infer dense embeddings from human-labeled raw annotations. Then fidelity error maps are calculated by summing embedding distances to human-labeled keypoints. Non-minimum suppression is conducted to form a potential set of keypoints. These keypoints are then projected onto a 2D subspace with t-SNE and verified by humans.

Category      Train   Val    Test   All     #Annotators
Airplane      715     102    205    1022    26
Bathtub       344     49     99     492     15
Bed           102     14     30     146     8
Bottle        266     38     76     380     9
Cap           26      4      8      38      6
Car           701     100    201    1002    17
Chair         699     100    200    999     15
Guitar        487     70     140    697     13
Helmet        62      10     18     90      8
Knife         189     27     54     270     5
Laptop        307     44     88     439     12
Motorcycle    208     30     60     298     7
Mug           130     18     38     186     9
Skateboard    98      14     29     141     9
Table         786     113    225    1124    17
Vessel        637     91     182    910     18
Total         5757    824    1653   8234    194

Category      Train   Val     Test    All
Airplane      9695    1379    2756    13830
Bathtub       5519    772     1589    7880
Bed           1276    188     377     1841
Bottle        4366    625     1269    6260
Cap           154     24      48      226
Car           14976   2133    4294    21403
Chair         8488    1180    2395    12063
Guitar        4112    591     1197    5900
Helmet        529     85      162     776
Knife         946     136     276     1358
Laptop        1842    264     528     2634
Motorcycle    2690    394     794     3878
Mug           1427    198     418     2043
Skateboard    903     137     283     1323
Table         6325    913     1809    9047
Vessel        9168    1241    2579    12988
Total         72416   10260   20774   103450

Table 2. Keypoint dataset statistics. Top: number of models in each category. Bottom: number of keypoints in each category.

Non-Minimum Suppression  Equation 3 may be hard to solve since Km is also unknown beforehand. For each model Mm, the fidelity error associated with each potential keypoint ym ∈ Mm is

$$f(y_m, g_\theta) = \sum_{m' \neq m} \sum_{c=1}^{C} \int_{\mathcal{M}_{m'}} d_\theta(x, y_m) \, \frac{\phi(l_{m',n^*}^{(c)}, x)}{Z(\phi)} \, dx, \tag{4}$$

where $y_{m'} = \arg\min_{y_{m'} \in \mathcal{M}_{m'}} \|g_\theta(y_{m'}) - g_\theta(y_m)\|_2^2$. Then y*m is found by conducting non-minimum suppression (NMS), such that

$$f(y_m^*, g_\theta) \le f(y_m, g_\theta) \quad \forall y_m \in \mathcal{M}_m \ \text{with}\ d_\theta(y_m^*, y_m) < \delta, \tag{5}$$

where δ is some neighborhood threshold.
Figure 5. Visualizations of detected keypoints for six algorithms (RSNet, RSCNN, DGCNN, GraphCNN, ISS3D and Harris3D) against the ground truth, on table, chair and mug models.

             Airplane    Bathtub       Bed        Bottle       Cap         Car        Chair       Guitar      Helmet      Knife       Laptop      Motor        Mug         Skate      Table       Vessel     Average
PointNet     10.4/9.3     0.8/6.3     3.0/9.4     0.5/8.2     0.0/0.3     7.3/9.2     7.7/6.5     0.1/1.4     0.0/1.3     0.0/0.9    23.6/32.0    0.1/4.0     0.2/3.9      0.0/2.9   20.9/23.1    7.1/9.1     5.1/8.0
PointNet++   33.5/44.6   18.5/32.3   16.4/23.4   22.1/38.2   13.2/12.1   24.0/41.0   18.2/29.4   24.3/27.7    7.0/7.4    19.0/22.1   30.9/60.0   21.9/36.0   16.3/21.4   13.7/16.1   29.3/47.4   16.7/26.9   20.3/30.4
RSNet        33.7/38.8   23.1/36.0   28.4/42.5   29.4/44.3   26.7/26.9   30.4/42.3   22.7/26.2   30.5/34.2   15.8/21.3   21.7/31.5   39.8/62.0   31.5/43.2   24.1/35.2   21.6/25.9   39.7/54.6   18.3/22.3   27.3/36.7
SpiderCNN    24.6/24.7   13.0/16.4   16.4/17.1    3.6/8.2     0.0/0.5    11.8/17.2   18.2/20.4    7.4/9.9     0.0/0.5    14.9/18.7   25.6/40.0   12.0/13.3    2.2/4.0      1.8/3.3   27.9/40.5   12.9/14.6   12.0/15.6
PointConv    40.0/51.2   29.0/44.7   25.1/43.7   33.6/46.5    0.0/7.0    35.9/55.6   25.5/39.8   28.0/39.3    7.7/8.3    26.6/38.1   43.1/63.5   35.9/50.6   19.8/32.0    9.3/10.1   44.1/64.0   21.7/31.8   26.6/39.1
RSCNN        31.1/45.0   20.4/34.9   26.2/40.1   25.6/40.1    7.0/8.9    25.4/44.7   22.0/33.0   27.3/34.1    8.9/10.8   21.1/27.2   35.2/58.8   23.6/37.0   15.9/18.6   21.9/30.1   27.4/49.8   18.1/28.9   22.3/33.9
DGCNN        39.2/52.0   25.5/40.6   32.4/42.8   40.3/61.6    0.0/2.0    27.0/48.6   24.9/34.6   31.9/44.1   17.1/19.9   31.5/39.9   44.8/69.6   30.3/41.4   23.9/33.2   22.6/29.8   41.0/59.0   24.4/35.6   28.5/40.9
GraphCNN      0.7/0.7    17.0/23.7   22.2/31.5   23.3/42.6    0.0/4.1    23.0/36.5    0.6/0.6    17.0/21.6    9.7/12.5   18.7/23.3   33.3/49.7    0.6/0.7     0.5/0.6    11.7/12.8   24.5/34.7   15.1/19.3   13.6/19.7
Harris3D       1.6/-       0.3/-       0.9/-       0.5/-       0.0/-       0.9/-       0.7/-       1.2/-        0.0/-      1.9/-       0.3/-       0.7/-       0.0/-        1.0/-      0.4/-       1.2/-       0.7/-
SIFT3D         1.2/-       1.0/-       0.5/-       1.1/-       0.4/-       1.0/-       0.7/-       0.7/-        0.4/-      0.6/-       0.2/-       0.8/-       0.5/-        0.7/-      0.3/-       1.0/-       0.7/-
ISS3D          1.1/-       1.4/-       0.9/-       1.8/-       0.6/-       1.7/-       0.7/-       0.7/-        0.8/-      0.4/-       0.1/-       0.8/-       0.9/-        0.8/-      0.3/-       1.1/-       0.9/-

                          Table 3. mIoU and mAP results (in percentage) for compared methods with distance threshold 0.01.

After NMS, we obtain several ground-truth points ym,1, ym,2, ..., ym,k for each manifold Mm. However, the arbitrarily assigned index k within each model does not provide a consistent semantic correspondence across different models. Therefore, we cluster these points according to their embeddings by first projecting them onto a 2D subspace with t-SNE [21].
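The clustering algorithm applied after the t-SNE projection is not named here; the sketch below uses scikit-learn with DBSCAN as a stand-in, where the eps and min_samples values are placeholders rather than values from the paper:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

def assign_semantic_indices(keypoint_emb: np.ndarray) -> np.ndarray:
    """Project the embeddings of all aggregated keypoints of one category
    onto a 2D subspace with t-SNE, then cluster them; each cluster id acts
    as a category-wide semantic keypoint index.

    keypoint_emb: (N, D) embeddings pooled over all models of a category.
    Returns (N,) semantic labels; DBSCAN marks outliers with -1.
    """
    coords = TSNE(n_components=2).fit_transform(keypoint_emb)
    return DBSCAN(eps=3.0, min_samples=5).fit_predict(coords)
```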
                                                                                                              space with t-SNE and verified by humans.
Ground-truth Verification  Though the above method automatically aggregates a set of potential keypoints with high precision, it omits some keypoints in certain cases. As the last step, experts manually verify these keypoints based on some simple priors such as the rotational symmetry and centrosymmetry of an object.

4.4. Implementation Details

At the start of the alternating minimization, we initialize Y by sampling from the raw annotations and then run one iteration, which is enough for convergence. We choose PointConv with hidden dimension 128 as the embedding function g. During the optimization of Equation 3, we classify each point into K classes with a softmax layer and extract the features of the second-to-last layer as the embedding. The learning rate is 1e-3 and the optimizer is Adam [17].
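Read literally, this setup amounts to a small amount of glue code. The sketch below uses a minimal MLP as a stand-in for the PointConv backbone (the real network is PointConv [39]; the class name and K are placeholders) to illustrate how the classification head and the embedding share layers:

```python
import torch
import torch.nn as nn

K = 10  # placeholder number of keypoint classes, not from the paper

class EmbeddingNet(nn.Module):
    """Stand-in for the Section 4.4 setup: a backbone with hidden dimension
    128, a K-way classification head (softmax applied in the loss), and the
    second-to-last layer's activations reused as the embedding g_theta."""
    def __init__(self, hidden_dim: int = 128, num_classes: int = K):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, points: torch.Tensor):
        emb = self.backbone(points)   # per-point embedding, used as g_theta(x)
        logits = self.head(emb)       # K-way classification logits
        return emb, logits

model = EmbeddingNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # as stated in Section 4.4
```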
4.5. Pipeline

The whole pipeline is shown in Figure 4. We first infer dense embeddings from human-labeled raw annotations. Then fidelity error maps are calculated by summing embedding distances to human-labeled keypoints. Non-minimum suppression is conducted to form a potential set of keypoints. These keypoints are then projected onto a 2D subspace with t-SNE and verified by humans.

5. Tasks and Benchmarks

In this section, we propose two keypoint prediction tasks: keypoint saliency estimation and keypoint correspondence estimation. Keypoint saliency estimation requires evaluated methods to give a set of potentially indistinguishable keypoints, while keypoint correspondence estimation asks them to localize a fixed number of distinguishable keypoints.

5.1. Keypoint Saliency Estimation

Dataset Preparation  For keypoint saliency estimation, we only consider whether a point is a keypoint or not, without giving its semantic label. Our dataset is split into train, validation and test sets with the ratio 70%, 10%, 20%.

Evaluation Metrics  Two metrics are adopted to evaluate the performance of keypoint saliency estimation. First, we evaluate the mean Intersection over Union [34] (mIoU), calculated as

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}. \tag{6}$$

mIoU is computed under different error tolerances from 0 to 0.1. Second, for those methods that output keypoint probabilities, we evaluate their mean Average Precision (mAP) over all categories.
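As an illustration of how Equation 6 is applied under a tolerance, the sketch below counts a prediction as a true positive when it lies within the threshold of some ground-truth keypoint. We use Euclidean distance for brevity, whereas the benchmark thresholds are geodesic, and the benchmark's exact matching protocol may differ from this simplification:

```python
import numpy as np

def keypoint_iou(pred: np.ndarray, gt: np.ndarray, tol: float) -> float:
    """IoU = TP / (TP + FP + FN) under a distance tolerance.

    pred: (P, 3) predicted keypoints; gt: (G, 3) ground-truth keypoints.
    A prediction within tol of any ground-truth point counts as a true
    positive; unmatched predictions are false positives and unmatched
    ground-truth points are false negatives.
    """
    if len(pred) == 0:
        return 0.0
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (P, G) pairwise distances
    tp = int((d.min(axis=1) <= tol).sum())   # predictions with a ground-truth match
    fp = len(pred) - tp
    fn = int((d.min(axis=0) > tol).sum())    # ground-truth points left unmatched
    return tp / (tp + fp + fn)
```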

Benchmark Algorithms  We benchmark eight state-of-the-art algorithms for point cloud semantic analysis: PointNet [27], PointNet++ [28], RSNet [13], SpiderCNN [41], PointConv [39], RSCNN [20], DGCNN [38] and GraphCNN [10]. Three traditional local geometric keypoint detectors are also considered: Harris3D [31], SIFT3D [29] and ISS3D [30].

Evaluation Results  For the deep learning methods, we use the default network architectures and hyperparameters to predict the keypoint probability of each point, and mIoU and mAP are adopted to evaluate their performance. For the local geometry based methods, mIoU is used. Each method is tested with various geodesic error thresholds. In Table 3, we report mIoU and mAP results under a restrictive threshold of 0.01. Figure 6 shows the mIoU curves under distance thresholds from 0 to 0.1, and Figure 7 shows the mAP results. We can see that under a restrictive distance threshold of 0.01, both geometric and deep learning methods fail to predict qualified keypoints.

Figure 6. mIoU results under various distance thresholds (0-0.1) for compared algorithms.

Figure 7. mAP results under various distance thresholds (0-0.1) for compared algorithms.

Figure 5 shows some visualizations of the results from RSNet, RSCNN, DGCNN, GraphCNN, ISS3D and Harris3D. The deep learning methods can predict some of the ground-truth keypoints, though predicted keypoints are sometimes missing. Local geometry based methods like ISS3D and Harris3D give many more interest points spread over the entire model, but these points are agnostic of semantic information. Learning discriminative features for localizing accurate and distinct keypoints across various objects is still a challenging task.

5.2. Keypoint Correspondence Estimation

Keypoint correspondence estimation is a more challenging task, where one needs to predict not only the keypoints but also their semantic labels. The semantic labels should be consistent across different objects in the same category.

Dataset Preparation  For keypoint correspondence estimation, each keypoint is labeled with a semantic index. For those keypoints that do not exist on some objects, index -1 is given. Similar to SyncSpecCNN [42], the maximum number of keypoints is fixed, and the data split is the same as for keypoint saliency estimation.

Evaluation Metric  The prediction of the network is evaluated by the percentage of correct keypoints (PCK), which is used to evaluate the accuracy of keypoint prediction in many previous works [42, 33].

Benchmark Algorithms  We benchmark eight state-of-the-art algorithms: PointNet [27], PointNet++ [28], RSNet [13], SpiderCNN [41], PointConv [39], RSCNN [20], DGCNN [38] and GraphCNN [10].

Evaluation Results  Similarly, we use the default network architectures. Table 4 shows the PCK results under an error distance threshold of 0.01.
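For reference, a sketch of the PCK metric as described above, assuming predictions and ground truth are arrays ordered by semantic index, with keypoints absent from an object (index -1 in the dataset) encoded as NaN rows:

```python
import numpy as np

def pck(pred: np.ndarray, gt: np.ndarray, tol: float) -> float:
    """Percentage of correct keypoints: the prediction for semantic index k
    is correct if it lies within tol of the ground-truth location for k.

    pred, gt: (K, 3) arrays ordered by semantic index; NaN rows in gt mark
    keypoints that do not exist on this object.
    """
    valid = ~np.isnan(gt).any(axis=1)
    if not valid.any():
        return float("nan")  # no annotated keypoints on this object
    d = np.linalg.norm(pred[valid] - gt[valid], axis=1)
    return float((d <= tol).mean())
```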
Figure 8. Visualizations of detected keypoints and their semantic labels for different methods (ground truth, PointNet, RSCNN, PointConv, SpiderCNN, DGCNN and GraphCNN) on bathtub, helmet and motorbike models. Same colors indicate same semantic labels.

              Airplane   Bath   Bed    Bottle   Cap    Car    Chair Guitar Helmet Knife Laptop Motor   Mug    Skate   Table   Vessel Average
 PointNet       65.4     57.2   55.4    78.9    16.7   66.6    43.2 65.4    42.6 32.6 76.1      62.1   60.7    50.4    63.8    42.5   55.0
 PointNet++     64.1     45.4   44.2    10.2    18.8   55.1    38.0 55.4    23.6 43.8 64.6      49.9   36.4    42.3    57.1    37.9   42.9
 RSNet          68.9     56.0   69.2    78.2    47.9   65.7    49.4 63.7    33.8 56.9 80.3      67.4   61.7    69.2    72.3    46.7   61.7
 SpiderCNN      54.9     36.8   43.9    50.4     0.0   47.1    36.9 37.7    5.6   32.2 63.4     36.6   16.4    15.8    61.0    36.8   36.0
 PointConv      70.3     54.4   58.6    64.6    25.0   65.7    51.1 62.6    16.2 47.4 74.2      61.7   51.2    51.8    69.5    46.7   54.4
 RSCNN          66.8     47.9   52.4    59.4    18.8   56.5    45.0 60.0    21.3 51.4 66.9      52.8   29.5    51.7    64.8    41.2   49.2
 DGCNN          66.8     51.5   56.3     8.9    39.6   58.6    46.0 62.4    36.1 52.1 73.9      57.7   51.4    52.3    70.5    46.8   51.9
 GraphCNN       41.8     19.5   35.4    32.1     8.3   36.9    27.0 28.9    8.8   20.0 65.9     34.4   15.6    26.2    17.3    22.1   27.5

                          Table 4. PCK results under distance threshold 0.01 for various deep learning networks.

Figure 9 illustrates the percentage-of-correct-points curves with distance thresholds varied from 0 to 0.1. RSNet performs relatively better than the other methods under all thresholds. However, all eight methods face difficulty in giving exactly consistent semantic keypoints under a threshold of 0.01.

Figure 9. PCK results under various distance thresholds (0-0.1) for compared algorithms.

Figure 8 shows some visualizations of the results for different methods, where same colors denote same semantic labels. We can see that most methods can accurately predict some of the keypoints. However, there are still some missing keypoints and inaccurate localizations.

Keypoint saliency estimation and keypoint correspondence estimation are both important for object understanding. Keypoint saliency estimation gives a sparse representation of an object by extracting meaningful points, while keypoint correspondence estimation establishes relations between points on different objects. From the results above, we can see that these two tasks remain challenging. The reason is that object keypoints, from the human perspective, are not simply geometrically salient points but abstract the semantic meanings of the object.

6. Conclusion

In this paper, we propose the large-scale, high-quality KeypointNet dataset. In order to generate ground-truth keypoints from raw human annotations, where identifying their modes is non-trivial, we transform the problem into an optimization problem and solve it in an alternating fashion. By optimizing a fidelity loss, ground-truth keypoints together with their correspondences are generated. In addition, we evaluate and compare several state-of-the-art methods on our proposed dataset, and we hope this dataset can boost the semantic understanding of 3D objects.
Acknowledgements

This work is supported in part by the National Key R&D Program of China (No. 2017YFA0700800), the National Natural Science Foundation of China under Grants 61772332, 51675342 and 51975350, SHEITC (2018-RGZN-02046) and the Shanghai Science and Technology Committee.

References

[1] MSCOCO keypoint challenge 2016, 2016.
[2] Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5167–5176, 2018.
[3] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[4] Federica Bogo, Javier Romero, Matthew Loper, and Michael J. Black. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[5] Lubomir Bourdev and Jitendra Malik. Poselets: Body part detectors trained using 3D human pose annotations. In 2009 IEEE 12th International Conference on Computer Vision, pages 1365–1372. IEEE, 2009.
[6] M. Bueno, J. Martínez-Sánchez, H. González-Jorge, and H. Lorenzo. Detection of geometric keypoints and its application to point cloud coarse registration. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, 41, 2016.
[7] Umberto Castellani, Marco Cristani, Simone Fantoni, and Vittorio Murino. Sparse points matching by combining 3D mesh saliency with statistical descriptors. In Computer Graphics Forum, volume 27, pages 643–652. Wiley Online Library, 2008.
[8] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[9] Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.
[10] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[11] Helin Dutagaci, Chun Pan Cheung, and Afzal Godil. Evaluation of 3D interest point detection techniques via human-generated ground truth. The Visual Computer, 28(9):901–917, 2012.
[12] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. GraspNet: A large-scale clustered and densely annotated dataset for object grasping. arXiv preprint arXiv:1912.13470, 2019.
[13] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3D segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2018.
[14] Marc Khoury, Qian-Yi Zhou, and Vladlen Koltun. Learning compact geometric features. In Proceedings of the IEEE International Conference on Computer Vision, pages 153–161, 2017.
[15] Vladimir G. Kim, Wilmot Li, Niloy J. Mitra, Siddhartha Chaudhuri, Stephen DiVerdi, and Thomas Funkhouser. Learning part-based templates from large collections of 3D shapes. ACM Transactions on Graphics (TOG), 32(4):70, 2013.
[16] Vladimir G. Kim, Wilmot Li, Niloy J. Mitra, Stephen DiVerdi, and Thomas Funkhouser. Exploring collections of 3D models using fuzzy correspondences. ACM Transactions on Graphics (TOG), 31(4):1–11, 2012.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Chang Ha Lee, Amitabh Varshney, and David W. Jacobs. Mesh saliency. ACM Transactions on Graphics (TOG), 24(3):659–666, 2005.
[19] Yong-Lu Li, Liang Xu, Xijie Huang, Xinpeng Liu, Ze Ma, Mingyang Chen, Shiyi Wang, Hao-Shu Fang, and Cewu Lu. HAKE: Human activity knowledge engine. arXiv preprint arXiv:1904.06539, 2019.
[20] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8895–8904, 2019.
[21] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[22] Ajmal S. Mian, Mohammed Bennamoun, and Robyn Owens. Three-dimensional model-based object recognition and segmentation in cluttered scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1584–1601, 2006.
[23] Juhong Min, Jongmin Lee, Jean Ponce, and Minsu Cho. SPair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543, 2019.
[24] John Novatnack and Ko Nishino. Scale-dependent 3D geometric features. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
[25] Bo Pang, Kaiwen Zha, Yifan Zhang, and Cewu Lu. Further understanding videos through adverbs: A new video task. In AAAI, 2020.
[26] Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis, and Kostas Daniilidis. 6-DoF object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2011–2018. IEEE, 2017.
[27] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[28] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[29] Blaine Rister, Mark A. Horowitz, and Daniel L. Rubin. Volumetric image registration from invariant keypoints. IEEE Transactions on Image Processing, 26(10):4900–4910, 2017.
[30] Samuele Salti, Federico Tombari, Riccardo Spezialetti, and Luigi Di Stefano. Learning a descriptor-specific 3D keypoint detector. In Proceedings of the IEEE International Conference on Computer Vision, pages 2318–2326, 2015.
[31] Ivan Sipiran and Benjamin Bustos. Harris 3D: A robust extension of the Harris operator for interest point detection on 3D meshes. The Visual Computer, 27(11):963, 2011.
[32] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In Computer Graphics Forum, volume 28, pages 1383–1392. Wiley Online Library, 2009.
[33] Minhyuk Sung, Hao Su, Ronald Yu, and Leonidas J. Guibas. Deep functional dictionaries: Learning consistent semantic structures on 3D models from functions. In Advances in Neural Information Processing Systems, pages 485–495, 2018.
[34] Leizer Teran and Philippos Mordohai. 3D interest point detection via discriminative learning. In European Conference on Computer Vision, pages 159–173. Springer, 2014.
[35] Federico Tombari, Samuele Salti, and Luigi Di Stefano. Unique signatures of histograms for local surface description. In European Conference on Computer Vision, pages 356–369. Springer, 2010.
[36] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[37] Hanyu Wang, Jianwei Guo, Dong-Ming Yan, Weize Quan, and Xiaopeng Zhang. Learning 3D keypoint descriptors for non-rigid shape matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[38] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):146, 2019.
[39] Wenxuan Wu, Zhongang Qi, and Li Fuxin. PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019.
[40] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.
[41] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. SpiderCNN: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pages 87–102, 2018.
[42] Li Yi, Hao Su, Xingwen Guo, and Leonidas J. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2282–2290, 2017.
[43] Yang You, Yujing Lou, Qi Liu, Yu-Wing Tai, Weiming Wang, Lizhuang Ma, and Cewu Lu. PRIN: Pointwise rotation-invariant network. arXiv preprint arXiv:1811.09361, 2018.