Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation


Zicong Fan1,2, Adrian Spurr1, Muhammed Kocabas1,2, Siyu Tang1, Michael J. Black2, Otmar Hilliges1
1 ETH Zürich, Switzerland   2 Max Planck Institute for Intelligent Systems, Tübingen

arXiv:2107.00434v1 [cs.CV] 1 Jul 2021

Figure 1. When estimating the 3D pose of interacting hands, state-of-the-art methods struggle to disambiguate the appearance of the two hands and their parts. In this example, significant uncertainty between the left and right wrist arises (1.1), resulting in erroneous pose estimation (1.2). Our model, DIGIT, reduces the ambiguity by predicting and leveraging a probabilistic part segmentation volume (2.1) to produce reliable pose estimates even when the two hands are in direct contact and under significant occlusion (2.2, 2.3).

Abstract

In natural conversation and interaction, our hands often overlap or are in contact with each other. Due to the homogeneous appearance of hands, this makes estimating the 3D pose of interacting hands from images difficult. In this paper we demonstrate that self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands and their parts, is a major cause of the final 3D pose error. Motivated by this insight, we propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image. The method consists of two interwoven branches that process the input imagery into a per-pixel semantic part segmentation mask and a visual feature volume. In contrast to prior work, we do not decouple the segmentation from the pose estimation stage, but rather leverage the per-pixel probabilities directly in the downstream pose estimation task. To do so, the part probabilities are merged with the visual features and processed via fully-convolutional layers. We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M [30] dataset for both single and interacting hands across all metrics. We provide detailed ablation studies to demonstrate the efficacy of our method and to provide insights into how the modelling of pixel ownership affects single and interacting hand pose estimation. Our code will be released for research purposes.

1. Introduction

Hands are our primary means of interacting with the physical world, for example to manipulate objects. Consequently, a method for estimating 3D hand pose from monocular images would have many applications in human-computer interaction, AR/VR, and robotics. We often use both hands in a concerted manner and, as a consequence, our hands are often close to or in contact with each other. The vast majority of 3D hand pose estimation methods assume that inputs contain only a single hand [5, 9, 10, 18, 30, 32, 47, 48, 50, 54, 58]. This is for good reason: hands display a large amount of self-similarity and are very dexterous. This leads to self-occlusion, which, together with the inherent depth ambiguities, results in a challenging pose reconstruction problem. Estimating two interacting hands is even more difficult due to the self-similar appearance and complex occlusion patterns, where often large areas of the hands are unobservable.

Recently, Moon et al. [30] proposed a large-scale annotated dataset, captured via a massive multi-view setup, allowing for the study of the 3D interacting hand pose estimation task. Their method shows feasibility but struggles with interacting hands. One of the main sources of difficulty in the task is the ambiguity caused by the relatively homogeneous appearance of hands and fingers. Even the fingers of a single hand can be difficult to tell apart if only parts of the hand are visible. Considering hands in close interaction only makes this problem more pronounced. Consider the example in Fig. 1. Here an interacting hand pose estimator struggles to disambiguate the wrists of the two hands, reflected in a bi-modal and dispersed heatmap, resulting in a poor 3D pose estimate.

To address this problem, we introduce DIGIT (DIsambiGuating hands in InTeraction), a novel method for learning-based reconstruction of 3D hand poses for interacting hands. The key insight is to explicitly reason about the per-pixel segmentation of the images into the separate hands and their parts, thus assigning ownership of each pixel to a specific part of one of the hands. We show that this reduces the ambiguities brought on by the self-similarity of hands and, in turn, significantly improves the accuracy of 3D pose estimates. While prior work on hand- [4, 58] and body-pose estimation [38, 41, 55] and hand tracking [12, 33, 53] has leveraged some form of segmentation, most often silhouettes or per-pixel masks, this is typically done as a pre-processing step. In contrast, our ablations show that integrating a semantic segmentation branch into an end-to-end trained architecture already increases pose estimation accuracy. We also demonstrate that leveraging the per-pixel probabilities, rather than class labels, alongside the image features further improves the accuracy of the 3D pose estimation task. Finally, our experiments reveal that the proposed approach not only helps to disambiguate interacting hands but also improves the accuracy of single hands and the estimation of relative positions between hands, via a reduction of uncertainty due to self-similarity.

More precisely, DIGIT is an end-to-end trainable network architecture (see Fig. 2) that uses two separate, but interwoven, branches for the tasks of semantic segmentation and pose estimation respectively. Importantly, the output of the segmentation branch is per-pixel logits (i.e., the full probability distribution) rather than the more commonly used discrete class labels. These probabilities are then merged with the visual features and processed via fully convolutional layers to attain a fused feature representation that is ultimately used for the final pose estimates. The network is supervised via a 3D pose estimation loss and a semantic segmentation loss. We show in ablation studies that all design choices are necessary to attain the best-performing architecture, and that the final proposed method reaches state-of-the-art performance on the InterHand2.6M [30] dataset across all metrics. In summary, we contribute:

 1. An analysis showing that SOTA hand-pose methods are sensitive to self-occlusions and ambiguities brought on by interacting hands.
 2. A novel end-to-end trainable architecture for 3D pose estimation from monocular images that depict two hands, often under self-contact.
 3. An approach to incorporate a semantic part-segmentation network and means to combine the per-pixel probabilities with visual features for the final task of 3D pose estimation.
 4. Detailed ablation studies revealing a reduction of uncertainty due to self-similarity in interacting hands and improvements in the accuracy of single hands and in the estimation of relative depth between hands.
 5. Our method reaches state-of-the-art performance across all metrics in single and interacting hand pose estimation on the InterHand2.6M [30] dataset.

2. Related work

Here we briefly review related work in monocular hand pose estimation, reconstruction of the 3D pose of bi-manual interaction, and the use of segmentation in related tasks.

Monocular 3D hand pose estimation. Monocular RGB 3D hand pose estimation has a long history beginning with Rehg and Kanade [42]. Surface-based approaches estimate dense hand surfaces by either fitting a hand model to observations or by regressing model parameters directly from pixels [7, 11, 13, 14, 16, 17, 23, 25, 28, 29, 36, 39, 42, 43, 57]. More closely related to ours are keypoint-based approaches that regress the 3D joint positions [5, 9, 10, 18, 30, 32, 47, 48, 50, 54, 58]. For example, Zimmermann et al. [58] propose the first convolutional network for RGB hand pose estimation. Iqbal et al. [18] introduce a 2.5D representation, allowing training on in-the-wild 2D annotations. However, all of the above approaches assume single-hand images. Recently, Moon et al. [30] introduced a large-scale dataset and a 3D hand pose estimator for both single and interacting hands. In our work, we show that existing approaches struggle with occlusions and appearance ambiguity. To this end, we propose a novel method that can better disambiguate strongly interacting hands and thus improves interacting hand pose estimation.
Figure 2. An illustration of our hand pose estimation model (DIGIT). Given an image, DIGIT extracts visual features (F) and predicts a part segmentation probability volume (S). The segmentation volume is projected into latent semantic features (S′). The visual features (F) and the semantic features (S′) are fused across multiple scales and are used for interacting hand pose estimation (illustrated in Fig. 3).

Interacting hand tracking and pose estimation. Model-based approaches to tracking interacting hands have been proposed [3, 33, 37, 46, 51, 53], as well as multi-view methods to reconstruct the pose of interacting hands [15, 30, 45]. Oikonomidis et al. [37] provide a formulation to track interacting hands using Particle Swarm Optimization from RGB-D videos. Ballan et al. [3] introduce an offline method to capture hand motion during hand-hand and hand-object interaction in a multi-camera setup. Tzionas et al. [51] extend the idea in [3] with a physical model. Mueller et al. [33] and Wang et al. [53] propose interacting hand tracking methods by predicting left/right-hand silhouettes and correspondence masks used in a post-processing energy minimization step. Smith et al. [46] propose a multi-view system that constrains a vision-based tracking algorithm with a physical model. In 3D interacting hand pose estimation, Simon et al. [45] propose a multi-view bootstrapping technique to triangulate full-body 2D keypoints into 3D. He et al. [15] incorporate epipolar geometry into a transformer network [52]. Moon et al. [30] propose a large-scale dataset and a model for interacting hand pose estimation. The model from Moon et al. [30] is the most closely related to this paper since it is the only prior work that estimates the 3D hand pose of interacting hands from a single RGB image. Compared to hand tracking [3, 33, 37, 46, 51, 53], our method does not require RGB video or depth image sequences. In contrast to the existing interacting pose estimation frameworks [15, 30, 45], we explicitly model the uncertainty caused by appearance ambiguity in interacting hands and we do not require multi-view supervision [15, 45].

Segmentation in pose estimation. Segmentation has been used in 3D hand pose estimation, 3D human pose estimation, and hand tracking, and can be grouped into four categories: as a localization step [1, 19, 34, 35, 56, 58], as a training loss [2, 4], as an optimization term [6, 33, 53], or as an intermediate representation [38, 41, 55]. Most single-hand pose estimation approaches follow Zimmermann et al. [58] in localizing a hand in an image by predicting the hand silhouette, which is used to crop the input image before performing pose estimation. Boukhayma et al. [4] predict a dense hand surface and use a neural rendering technique to obtain a silhouette loss. In contrast, we leverage part segmentation to explicitly address self-similarity in hand pose estimation. In tracking interacting hands, left- and right-hand masks can be predicted from either depth images [33] or monocular RGB images [53], which are used in an optimization-based post-processing step. Our method assumes neither RGB nor depth image sequences. In 3D human pose and shape estimation, existing methods predict part segmentation maps [38, 55] or silhouettes [41] from RGB images and use the predicted masks as an intermediate representation. Specifically, they decouple the image-to-pose problem into image-to-segmentation and segmentation-to-pose, and the two models of the subtasks are trained separately. Our method, on the other hand, trains image-to-pose in an end-to-end fashion. The most closely related method is that of Omran et al. [38]. In addition to not training the networks end-to-end, they use discrete part segments while we preserve uncertainty with a probabilistic segmentation map. Our experiments show that end-to-end training and the use of probabilistic segmentation maps significantly reduce hand pose estimation errors.

3. Method

3.1. Overview

At the core of DIGIT lies the observation that the self-similarity between joints and the similarity between hands, which are especially pronounced during hand-to-hand interaction, are major sources of error for monocular 3D hand pose estimation. Our experiments (see Fig. 8) show that standard approaches do not have a mechanism to cope with such ambiguity. Embracing this challenge, we propose a simple yet effective framework for interacting hand pose estimation that models per-pixel ownership via probabilistic part segmentation maps. Our method leverages both visual features and distinctive semantic features to address the ambiguity caused by self-similarity.

In contrast to prior work, which separates the image-to-segmentation and segmentation-to-pose steps [38, 41, 55], we propose a holistic approach that is trained to jointly reason about pixel-to-part assignment and 3D joint locations. In particular, given an input image, our model identifies individual parts of each hand in the form of probabilistic segmentation maps, which are used to encourage the local influence of visual features for estimating the corresponding 3D joint. We fuse the probabilistic segmentation maps and the visual feature maps across multiple scales using a convolutional fusion layer. Our experiments show that end-to-end training and the use of probabilistic segmentation maps significantly improve hand pose estimation.

3.2. Segmentation-aware pose estimation

Fig. 2 illustrates the main components of our framework. Given an image region I ∈ R^{W_I×H_I×3}, cropped by a bounding box including all hands, the goal of our model is to estimate the 3D hand pose P_3D ∈ R^{2J×3} in the camera coordinate frame for 2J joints, where J is the number of joints in one hand.
In particular, we first extract a feature map F ∈ R^{W_F×H_F×D_F} from the image I using a CNN backbone network to provide visual features for pose estimation and part segmentation. Here W_F, H_F, and D_F denote the width, height, and channel dimension of the feature map.

Probabilistic segmentation. Since there is an inherent self-similarity between different parts of the hands, we learn a part segmentation network to predict a probabilistic segmentation volume S ∈ R^{W_S×H_S×C}, which is directly supervised by groundtruth part segmentation maps. Each pixel on the segmentation volume S is a channel of probability logits over C classes, where C is the number of categories including the parts of the two hands and the background (see Fig. 4). Note that, to preserve the uncertainty in the segmentation prediction, we do not pick the class with the highest response among the C classes for each pixel in the segmentation volume. For display purposes only, we show the class with the highest probability in Fig. 2. Finally, since the segmentation volume S has a higher resolution than F, we perform a series of convolution and downsampling operations to obtain semantic features S′ ∈ R^{W_F×H_F×D_S}, where D_S is the channel dimension.

Figure 4. Part segmentation classes. Each of the left hand and the right hand is partitioned into 16 classes shown in different colors. Including the background, there are 33 classes in total.
Visual semantic fusion. The visual features F and the semantic features S′ are concatenated along the channel dimension to provide rich visual cues for estimating accurate 3D hand poses and distinctive semantic features for avoiding appearance ambiguity. However, a naive concatenation does not provide global context from the semantic features. Therefore, we fuse the visual and semantic features across different scales using a custom and lightweight UNet [44] to obtain a fused feature map F′ ∈ R^{W_F×H_F×(D_F+D_S)} for pose estimation (see Sup. Mat. for the UNet details).
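To make the fusion step concrete, below is a minimal PyTorch sketch of the idea just described: the segmentation logits S are downsampled and projected to a D_S-channel semantic map S′, concatenated with the visual features F, and mixed by convolution. The module name, channel sizes, and the single-scale fusion are illustrative assumptions; the actual model fuses across multiple scales with a lightweight UNet (see Sup. Mat.).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegVisualFusion(nn.Module):
    """Sketch of fusing probabilistic part segmentation with visual features.

    Assumed shapes: visual features (B, D_F, H_F, W_F) and segmentation
    logits (B, C, H_S, W_S) with H_S >= H_F. Channel sizes are illustrative.
    """

    def __init__(self, c_seg=33, d_seg=32, d_feat=256):
        super().__init__()
        # Project the C-class logits to a D_S-dim semantic feature map S'.
        self.seg_proj = nn.Sequential(
            nn.Conv2d(c_seg, d_seg, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(d_seg, d_seg, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Mix the concatenated features (a stand-in for the multi-scale UNet).
        self.fuse = nn.Conv2d(d_feat + d_seg, d_feat + d_seg, kernel_size=3, padding=1)

    def forward(self, feat, seg_logits):
        sem = self.seg_proj(seg_logits)                     # S -> S'
        sem = F.interpolate(sem, size=feat.shape[-2:],      # match (H_F, W_F)
                            mode="bilinear", align_corners=False)
        fused = torch.cat([feat, sem], dim=1)               # concat along channels
        return self.fuse(fused)                             # F'


# Example: 33 part classes (16 per hand + background), 256-d visual features.
fusion = SegVisualFusion()
feat = torch.randn(2, 256, 64, 64)         # F
seg_logits = torch.randn(2, 33, 128, 128)  # S (unnormalized logits)
fused = fusion(feat, seg_logits)           # F', shape (2, 288, 64, 64)
```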
Figure 3. Our interacting hand pose estimator.

Interacting hand pose estimation. To estimate the final 3D hand pose of both hands, we learn a function F : F′ → P_2.5D that maps the fused feature map F′ to a 2.5D pose. The 2.5D representation P_2.5D ∈ R^{2J×3} consists of individual 2.5D joints (x_i, y_i, z_i) ∈ R^3, where (x_i, y_i) is the 2D projection of the 3D joint (X_i, Y_i, Z_i) ∈ R^3 and z_i = Z_i − Z_root(i). The notation root(i) denotes the hand root of joint i. During inference, the 3D pose can be recovered by applying an inverse perspective projection to (x_i, y_i) using the depth estimate Z_i. To model the function F, we use a custom estimator inspired by [18]. We found that the 2.5D representation of [18] performs equally well to the interacting pose estimator by Moon et al. [30], while being more memory efficient because it does not require a volumetric heatmap representation (see Sup. Mat.).

Figure 3 shows a schematic of our proposed pose estimator. Similar to [30], our model estimates the handedness (h^L, h^R) ∈ [0, 1]^2, the 2.5D left- and right-hand pose P_2.5D, and the right-hand-relative left-hand depth z^{R→L} ∈ R, where L and R denote the left and right hands. Since our model predicts 2.5D joints (x_i, y_i, z_i), and converting from 2.5D to 3D requires an inverse perspective projection, we need to estimate the depth Z_i for a joint i by Z_i = z_i + Z_root(i). The root-relative depth z^{R→L} is used to obtain the left-hand root depth when both hands are present.
Handedness and relative root depth. The handedness (h^L, h^R) detects the presence of the two hands and z^{R→L} measures the depth of the left root relative to the right root. We repeatedly convolve and downsample F′ to a latent vector x, which is used to estimate (h^L, h^R) and z^{R→L} by two separate multi-layer perceptron (MLP) networks. For z^{R→L}, we use the MLP to estimate a 1D heatmap p ∈ R^{D_z} that is softmax-normalized, representing the probability distribution over D_z possible values for z^{R→L}. The final relative depth z^{R→L} is obtained by

    z^{R→L} = Σ_{k=0}^{D_z−1} k · p[k].    (1)

2.5D hand pose estimator (F). Inspired by [18], our pose estimator predicts the latent 2D heatmap H*_2D ∈ R^{W_F×H_F×2J} for the 2D joint locations and the latent root-relative depth map H*_z ∈ R^{W_F×H_F×2J} for the root-relative depth of each joint. The heatmap H*_2D is spatially softmax-normalized into a probability map H_2D ∈ R^{W_F×H_F×2J}. Since H_2D indicates potential 2D joint locations, to focus the depth values on the joint locations, H_2D is element-wise multiplied with the latent depth map H*_z to obtain the depth map H_z = H*_z ⊙ H_2D. To keep our network fully differentiable, we use soft-argmax [26] to convert the 2D heatmap H_2D to the 2D keypoints (x_i, y_i) of the 2J joints. Finally, we sum the values of each slice of the depth map H_z to obtain the root-relative depth z_i for a joint i.
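The differentiable read-out just described can be sketched as follows, assuming the latent heatmaps and depth maps have already been predicted by convolutional heads (the heads themselves and the exact tensor shapes are assumptions); the same soft-expectation trick also yields the relative root depth of Eq. 1.

```python
import torch
import torch.nn.functional as F


def decode_2p5d(h2d_latent, hz_latent):
    """h2d_latent, hz_latent: (B, 2J, H, W) latent 2D heatmaps and depth maps."""
    b, j2, h, w = h2d_latent.shape
    # Spatial softmax over each joint's heatmap -> probability map H_2D.
    h2d = F.softmax(h2d_latent.view(b, j2, -1), dim=-1).view(b, j2, h, w)
    # Soft-argmax: expected (x, y) location under H_2D.
    ys = torch.arange(h, dtype=h2d.dtype, device=h2d.device).view(1, 1, h, 1)
    xs = torch.arange(w, dtype=h2d.dtype, device=h2d.device).view(1, 1, 1, w)
    x = (h2d * xs).sum(dim=(2, 3))            # (B, 2J)
    y = (h2d * ys).sum(dim=(2, 3))            # (B, 2J)
    # H_z = H*_z ⊙ H_2D, then sum each slice -> root-relative depth.
    z = (hz_latent * h2d).sum(dim=(2, 3))     # (B, 2J)
    return torch.stack([x, y, z], dim=-1)     # (B, 2J, 3) 2.5D pose


def relative_root_depth(logits_1d):
    """Eq. 1: z^{R->L} as the expectation over a softmax-normalized 1D heatmap.

    logits_1d: (B, D_z) unnormalized scores; the result is in bin units, so
    mapping bins to metric depth is left out here as an assumption.
    """
    p = F.softmax(logits_1d, dim=-1)
    k = torch.arange(p.shape[-1], dtype=p.dtype, device=p.device)
    return (p * k).sum(dim=-1)                # sum_k k * p[k]
```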
From 2.5D pose to 3D. To convert P_2.5D to the 3D pose P_3D, following [30], we apply an inverse perspective projection to map the 2D keypoints to 3D camera coordinates:

    P^L_3D = Π(T^{-1} P^L_2.5D + Z^L),    (2)
    P^R_3D = Π(T^{-1} P^R_2.5D + Z^R),    (3)

where Π and T^{-1} are the camera back-projection operation and the inverse affine transformation (undoing cropping and resizing). The projection requires the absolute depths of the left and right roots, Z^L and Z^R (written in vector form):

    Z^L = [0, 0, z^L]^T              if h^R < 0.5
    Z^L = [0, 0, z^R + z^{R→L}]^T    otherwise,    (4)

    Z^R = [0, 0, z^R]^T,    (5)

where z^L and z^R are the absolute depths of the roots of the left and right hands. Following [30], we use the estimates from RootNet [27] for z^L and z^R.
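A minimal sketch of this recovery for one hand is given below; the pinhole back-projection Π and the inverse crop transform T^{-1} are written out explicitly, and the intrinsics and interface (per-hand call, 2×3 affine matrix) are assumptions rather than the released implementation.

```python
import torch


def p2p5d_to_3d(p2p5d, z_root, inv_affine, fx, fy, cx, cy):
    """Convert the 2.5D joints of one hand to 3D camera coordinates.

    p2p5d:      (J, 3) joints (x, y, z); x, y in cropped-image pixels,
                z the root-relative depth in metres.
    z_root:     absolute root depth (from RootNet for the right hand, or the
                right root depth plus z^{R->L} for the left hand when both
                hands are present).
    inv_affine: (2, 3) inverse of the crop/resize affine transform T.
    """
    ones = torch.ones_like(p2p5d[:, :1])
    xy1 = torch.cat([p2p5d[:, :2], ones], dim=1)      # (J, 3) homogeneous
    xy_full = xy1 @ inv_affine.T                      # undo cropping/resizing
    z_abs = p2p5d[:, 2] + z_root                      # absolute depth Z_i
    # Back-project with the pinhole model (the operation denoted by Pi).
    x3d = (xy_full[:, 0] - cx) / fx * z_abs
    y3d = (xy_full[:, 1] - cy) / fy * z_abs
    return torch.stack([x3d, y3d, z_abs], dim=-1)     # (J, 3)
```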
                                                                 the validation and the test set, there are 1 and 8 subjects
Training loss. The loss used to train our model is:
                                                                 respectively. We use the initial release of the 5 frames-
         L = Lh + L2.5D + Lz + λs Ls + λb Lb ,            (6)    per-second (FPS) subset for our experiments because the
                                                                 full dataset has not been released at the time of submission.
where the terms are the handedness loss Lh , the 2.5D hand       The subset uses an official split [31] containing 371K single
pose loss L2.5D , the right-hand relative left hand depth loss   hand images and 367K interacting hand images for training,
Lz , the segmentation loss Ls , and a bone regularization        113K single hand images, and 71K interacting hand images
loss Lb . In particular, we use the multi-label binary cross-    for validation, and 198K single hand images and 155K in-
entropy loss to supervise the handedness prediction. For         teracting hand images for testing.
segmentation, we use the multi-class cross-entropy loss:         Evaluation metrics. We use the three metrics from [30]
                                                                 for hand pose evaluation. The average precision of hand-
             WF X
                HF X
                   C
             X                                                   edness estimation (AP) measures the accuracy of hand-
    Ls = −                 Tj [m, n] log(σ(Sj [m, n]))    (7)
                                                                 edness prediction. The root-relative mean per joint posi-
             m=1 n=1 j=1
                                                                 tion error (MPJPE) measures the error in root-relative 3D
where Tj [m, n] ∈ IRC is a one-hot vector with 1                 hand pose estimation. It is the Euclidean distance between
positive class and C − 1 negative classes according to           the predicted 3D joint locations and the groundtruth after
the groundtruth for the segmentation pixel at (m, n)             root alignment. For interacting sequences, the alignment
and σ(·) is a softmax normalization operation so that            is done on the two hands separately. To measure the per-
PC                                                               formance in estimating the relative position between the
   j=1 Sj [m, n] = 1. We use the L1 loss to supervise the
2.5D hand pose and the relative root depth.                      left and the right root in interacting sequences, we use the
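In PyTorch, the segmentation term of Eq. 7 reduces to a standard per-pixel cross-entropy over the C-class logits, and the full objective of Eq. 6 is a weighted sum. The sketch below mirrors the loss choices stated above; the dictionary interface is an assumption, the weights are those reported in Sec. 4.1, and `l_bone` is the Eq. 8 regularizer sketched after that equation below.

```python
import torch.nn.functional as F


def total_loss(pred, gt, l_bone, lambda_s=10.0, lambda_b=1.0):
    """Sketch of Eq. 6. `pred` and `gt` are dicts with the tensors named
    below (an assumed interface, not the released code); `l_bone` is the
    bone regularizer of Eq. 8."""
    # Handedness: multi-label binary cross-entropy over (h^L, h^R).
    l_hand = F.binary_cross_entropy_with_logits(pred["handedness"], gt["handedness"])
    # 2.5D hand pose and relative root depth: L1 losses.
    l_pose = F.l1_loss(pred["pose_2p5d"], gt["pose_2p5d"])
    l_rel = F.l1_loss(pred["rel_root_depth"], gt["rel_root_depth"])
    # Eq. 7: per-pixel multi-class cross-entropy over the C-class logits
    # (B, C, H, W) against integer part labels (B, H, W); the softmax in
    # Eq. 7 is applied internally by F.cross_entropy.
    l_seg = F.cross_entropy(pred["seg_logits"], gt["part_labels"])
    return l_hand + l_pose + l_rel + lambda_s * l_seg + lambda_b * l_bone
```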
Kinematic consistency. In our experiments we observe that both the method by Moon et al. [30] and our own baseline yield asymmetric predictions in terms of bone length between the left and right hand, due to the appearance ambiguities (with an average difference of 8mm and 10mm for the baseline and the InterHand2.6M [30] model on the validation set). To encourage more physically plausible predictions we propose a bone vector loss:

    L_b = Σ_{(i,j)∈E} ‖(P^{2.5D}_i − P^{2.5D}_j) − (P̄^{2.5D}_i − P̄^{2.5D}_j)‖,    (8)

where (i, j) ∈ E denotes a bone from the edge set E of the directed kinematic tree connecting the groundtruth joints P̄^{2.5D}_i and P̄^{2.5D}_j. This loss encourages the predicted bones to have lengths similar to those in the groundtruth.
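A sketch of this bone-vector regularizer is shown below. The edge list is a placeholder for the directed kinematic tree of the two hands, and, as noted in the ablation discussion, the loss is only accumulated for bones whose two joints are both annotated.

```python
import torch


def bone_loss(pred_2p5d, gt_2p5d, edges=None, valid=None):
    """Eq. 8: penalize differences between predicted and groundtruth bone vectors.

    pred_2p5d, gt_2p5d: (B, 2J, 3) 2.5D joints.
    edges: list of (parent, child) joint indices of the kinematic tree
           (placeholder; the real edge set covers all joints of both hands).
    valid: optional (B, 2J) mask of annotated joints.
    """
    if edges is None:
        edges = [(0, 1), (1, 2), (2, 3)]               # illustrative only
    i = torch.tensor([e[0] for e in edges])
    j = torch.tensor([e[1] for e in edges])
    pred_bones = pred_2p5d[:, i] - pred_2p5d[:, j]     # (B, |E|, 3)
    gt_bones = gt_2p5d[:, i] - gt_2p5d[:, j]
    diff = (pred_bones - gt_bones).norm(dim=-1)        # (B, |E|)
    if valid is not None:
        both = valid[:, i] * valid[:, j]               # only bones with both joints labeled
        return (diff * both).sum() / both.sum().clamp(min=1)
    return diff.mean()
```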
4. Experiments

In this paper we want to demonstrate that self-similarity, and the resulting ambiguities between joints, are a major cause of 3D hand pose error, and that the ambiguity becomes more severe during interaction. We hypothesize that the ambiguity problem can be alleviated by modelling pixel ownership via part segmentation. To demonstrate the efficacy of our approach, we first compare our model with the state of the art. We then investigate the benefits of modelling part segmentation in an ablation study that sheds further light on when and why our approach works.

Dataset. We evaluate our interacting hand pose estimation model using the InterHand2.6M [30] dataset. It is the only large-scale dataset for modeling hand interaction, and it includes images of both single-hand (SH) and interacting-hand (IH) sequences of 21 subjects in the training set. For the validation and the test set, there are 1 and 8 subjects respectively. We use the initial release of the 5 frames-per-second (FPS) subset for our experiments because the full dataset had not been released at the time of submission. The subset uses an official split [31] containing 371K single-hand and 367K interacting-hand images for training, 113K single-hand and 71K interacting-hand images for validation, and 198K single-hand and 155K interacting-hand images for testing.

Evaluation metrics. We use the three metrics from [30] for hand pose evaluation. The average precision of handedness estimation (AP) measures the accuracy of handedness prediction. The root-relative mean per joint position error (MPJPE) measures the error in root-relative 3D hand pose estimation. It is the Euclidean distance between the predicted 3D joint locations and the groundtruth after root alignment. For interacting sequences, the alignment is done on the two hands separately. To measure the performance in estimating the relative position between the left and the right root in interacting sequences, we use the mean relative-root position error (MRRPE). It is defined as the Euclidean distance between the predicted and the groundtruth left-hand root position after aligning the left-hand root by the root of the right hand.
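For reference, a minimal sketch of the two pose metrics as just described (the handedness AP is omitted); the joint ordering and root indices are assumptions.

```python
import torch


def mpjpe(pred, gt, root_idx=0):
    """Root-relative mean per joint position error.
    pred, gt: (J, 3) joints of one hand in mm; alignment is done per hand."""
    pred_rel = pred - pred[root_idx:root_idx + 1]
    gt_rel = gt - gt[root_idx:root_idx + 1]
    return (pred_rel - gt_rel).norm(dim=-1).mean()


def mrrpe(pred_left_root, pred_right_root, gt_left_root, gt_right_root):
    """Mean relative-root position error: error of the left-hand root after
    aligning it by the right-hand root."""
    pred_rel = pred_left_root - pred_right_root
    gt_rel = gt_left_root - gt_right_root
    return (pred_rel - gt_rel).norm(dim=-1)
```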

4.1. Implementation details

We implement our models in PyTorch [40] using an HRNet-W32 [49] backbone pre-trained on the ImageNet dataset [8]. Following [30], we crop the hand region using the groundtruth bounding box for both training and testing images and resize the cropped image to 256x256 before feeding it to the network. The spatial dimensions of the 2D heatmap H_2D and the 2D depth map H_z are 64x64. We obtain groundtruth part segmentation for training our segmentation network by rendering groundtruth hand meshes from InterHand2.6M [30] with a neural renderer [20, 22] using a custom texture map (see Fig. 4). The details of our network are in the Sup. Mat. To balance the loss in Eq. 6, we choose λ_b = 1.0 and λ_s = 10.0 based on the average MPJPE for single and interacting images on the validation set. We do not apply L_s for models without a segmentation network.

Training procedure. We train all models with both single-hand and interacting-hand sequences using the Adam optimizer [21] with an initial learning rate of 10^-4 and a batch size of 64. For experiments comparing to the state of the art, we train our models for 50 epochs and decay the learning rate at epoch 40. For the ablation experiments, since the InterHand2.6M [30] subset has a large number of frames (738K frames), for time efficiency we train the models for 30 epochs and decay the learning rate at epochs 10 and 20. We use a factor of 10 for all learning rate decays.
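This optimization setup corresponds to a standard Adam plus step-decay schedule; a sketch is shown below under the assumption that `model`, `train_loader`, `criterion` (computing Eq. 6), and the batch keys exist with these names, which are not part of the released code.

```python
import torch

# `model` is assumed to be a DIGIT-style network (HRNet-W32 backbone + heads).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Comparison-to-SOTA setting: 50 epochs, decay by 10x at epoch 40.
# (Ablation setting: 30 epochs, decay at epochs 10 and 20.)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40], gamma=0.1)

for epoch in range(50):
    for batch in train_loader:              # assumed InterHand2.6M data loader
        pred = model(batch["image"])        # assumed forward interface
        loss = criterion(pred, batch)       # Eq. 6, e.g. the loss sketches above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```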
                                                                                      both joints of a bone are labeled.
Figure 5. Qualitative results from our model. The lighting is adjusted for display purposes (not model input). Best viewed zoomed in.

4.2. Comparison with the state-of-the-art

Methods                     MPJPE Val     MRRPE Val   MPJPE Test    MRRPE Test
InterHand2.6M [30]          14.82/20.59   35.99       12.63/17.36   34.49
Baseline                    14.64/20.24   35.05       12.32/17.23   32.70
Ours                        13.54/18.28   32.21       11.32/15.57   30.51
% improvement over [30]     8.64/11.22    10.50       10.37/10.31   11.54

Table 1. Comparison with the state-of-the-art. The left and right of the slash are for the single and interacting images.

Table 1 shows the root-relative mean per joint position error (MPJPE) in millimeters for both single and interacting hand sequences. The results show that, despite having fewer parameters, our baseline network slightly improves over InterHand2.6M [30]. For estimating root-relative hand pose, measured by MPJPE, our proposed model outperforms InterHand2.6M [30] by 1.31mm/1.79mm for single and interacting hand images on the test set. For estimating the relative position between the left and the right-hand roots, our model outperforms InterHand2.6M [30] by 3.98mm on the test set. The average precision of InterHand2.6M [30], our baseline, and our proposed model on the validation set is 98.35, 98.13, and 98.12 percent, respectively. Fig. 5 shows qualitative results from our model. The figure shows the input image, 2D keypoints, and the segmentation masks overlaid on the image. The predicted 3D pose is shown in two views. More examples are in the Sup. Mat.
4.3. Ablation study

Figure 6. Pose estimators for the ablation studies: (a) without segmentation, (b) with segmentation but only for applying a loss, (c) segmentation for pose estimation (no visual features).

Ablation Study              MPJPE Val     MRRPE Val   MPJPE Test    MRRPE Test
Baseline                    15.27/21.91   36.73       13.15/18.71   34.05
Baseline + BL               14.97/20.43   38.84       12.83/17.54   37.36
Baseline + BL + SL          14.14/19.72   34.51       11.99/17.05   32.10
Baseline + BL + SF*         14.50/20.31   40.09       12.46/17.60   38.20
Baseline + BL + SF (ours)   13.82/19.05   34.14       11.64/16.55   31.39

Table 2. Effects of bone loss (BL), segmentation loss (SL), and segmentation features (SF). The symbol * denotes not using segm. supervision. Left/right denotes single and interacting images.

Here we aim to provide further insights into how, when, and why our proposed method (Fig. 2) improves over the baseline (Fig. 6a). In particular, we examine the impact of the bone loss L_b (BL), the part segmentation loss L_s (SL), and the segmentation features S (SF) for hand pose estimation.

From Table 2, comparing the performance of the baseline with and without the bone loss, using the same architecture (Fig. 6a), we see a 0.32mm and 1.17mm improvement in 3D pose estimation for single and interacting hands on the test set. The bone loss improves both single and interacting hand pose because the loss is applied whenever both joints of a bone are labeled.

Inspired by the multi-task learning paradigm [24], we investigate whether predicting part segmentation regularizes hand pose estimation. In particular, we train a pose estimator with an additional head to predict the part segmentation of hands (see Fig. 6b) but do not use segmentation for pose estimation. The result shows that the segmentation loss improves pose estimation over the baseline with bone loss by 0.84mm/0.49mm for single and interacting hands on the test set. For the relative root position, the segmentation loss dramatically improves the MRRPE metric by 5.26mm on the test set over the baseline with bone loss. The reason is that, to satisfy the loss L_s, the backbone has to distinguish the left and the right roots, and this distinction reduces appearance ambiguity, resulting in better hand root localization.

Finally, in addition to the bone loss and the segmentation loss, our model (see Fig. 2) makes use of the segmentation and visual features for pose estimation. Compared to the baseline with bone loss, our model reduces MPJPE for single and interacting hands further by 1.15mm/1.38mm on the validation set, and 1.19mm/0.99mm on the test set. Further, we improve MRRPE by 5.97mm on the test set. We also trained a network with the same architecture but not supervised by the segmentation loss (Baseline + BL + SF*). Since the performance of Baseline + BL is similar to Baseline + BL + SF*, we can assume that our improvement is not due to the additional network parameters.
4.4. Analysis of our proposed model

Here we first provide a qualitative analysis to show how segmentation helps with appearance ambiguity. We then investigate hand pose performance of InterHand2.6M [30], our baseline, and our final model under different degrees of interaction to show that modeling pixel ownership via segmentation helps to reduce errors in hand pose estimation.

Qualitative analysis. We investigate how segmentation helps to reduce appearance ambiguity in hand pose estimation. To build intuition, we provide the 2D estimations of individual joints in Fig. 7. The cross sign indicates the groundtruth 2D location of the joint of interest and the plus sign is the predicted location. The example in the first column shows that, without segmentation, the baseline model's predictions (Baseline + BL in Table 2) contain significant uncertainty due to the presence of the other hand in the image, as indicated by the dispersed 2D heatmap with modes on both hands (see the same behaviour for the InterHand2.6M [30] model in the Sup. Mat). As a result, the 2D prediction is centered between the hands after the soft-argmax [26]. In contrast, with segmentation, our network (Baseline + BL + SF in Table 2) disambiguates the different hands and provides a single-mode estimate. A similar observation can be made in the single-hand case for ambiguity between fingers.

Figure 7. Qualitative results of how segmentation reduces appearance ambiguity. The image lighting is adjusted for display purposes (not model input). The dispersed 2D heatmap issue also arises in the model proposed in InterHand2.6M [30]. See the Sup. Mat for a more in-depth analysis. The skeleton notation is also provided in the Sup. Mat. Plus: prediction; Cross: groundtruth. Best viewed zoomed in.
Impact of interaction and occlusion. We study how pose estimation performance is affected by the degree of interaction. In particular, we use the IoU between the groundtruth left/right masks (not the part segmentation) to measure the degree of interaction and occlusion; a higher IoU implies more occlusion. Fig. 8 compares the results on the validation set. The bars show the MPJPE over annotated joints for each IoU range, while the half-length of the error bars corresponds to 0.5 times (for better display) the MPJPE standard deviation in that range. Typical hand masks are shown above the bars of each IoU range. We observe more errors in interacting hand cases even when the two hands do not intersect, which indicates that appearance ambiguity applies as long as two hands are present. For non-degenerate occlusion (IoU ≤ 0.67), our method shows consistent improvement over InterHand2.6M [30]. The improvement on single hands is smaller because appearance ambiguity is amplified in interacting hands. In the high-IoU regime (> 0.67), the improvement levels off, which is expected since the second hand is almost entirely invisible and the problem is no longer caused by ambiguities; it would be extremely challenging to reliably estimate the correct pose from a single image.

Figure 8. Comparing pose estimation performance by the degree of interaction/occlusion. The IoU between groundtruth left/right masks measures the degree of interaction. SH and IH denote single and interacting images. The left (yellow) and right (blue) hand masks provide interaction examples in each IoU range.
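The degree of interaction in this analysis is simply the IoU of the two groundtruth hand silhouettes; a sketch of the measurement and the per-bin aggregation is given below (the bin edges are illustrative assumptions, not the ones used in Fig. 8).

```python
import torch


def hand_iou(left_mask, right_mask):
    """IoU between binary left/right hand masks (H, W), used as the
    degree of interaction/occlusion."""
    inter = (left_mask & right_mask).sum().float()
    union = (left_mask | right_mask).sum().float()
    return (inter / union.clamp(min=1)).item()


def bucket_errors(ious, errors, edges=(0.0, 0.1, 0.3, 0.5, 0.67, 1.0)):
    """Group per-sample MPJPE values by IoU range (edges are illustrative)."""
    buckets = {lo: [] for lo in edges[:-1]}
    for iou, err in zip(ious, errors):
        for lo, hi in zip(edges[:-1], edges[1:]):
            if lo <= iou <= hi:
                buckets[lo].append(err)
                break
    return {lo: sum(v) / len(v) for lo, v in buckets.items() if v}
```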
Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation
Figure 7. Qualitative results of how segmentation reduces appearance ambiguity. The image lighting is adjusted for display purposes
(not model input). The dispersed 2D heatmap issue also arises in the model proposed in InterHand2.6M [30] . See Sup. Mat for a more
in-depth analysis. The skeleton notation is also provided in Sup. Mat. Plus: prediction; Cross: groundtruth. Best viewed zoomed in.

Figure 8. Comparing pose estimation performance by the degree of interaction/occlusion. The IoU between the groundtruth left and right hand masks measures the degree of interaction. SH and IH denote single-hand and interacting-hand images. The left (yellow) and right (blue) hand masks provide interaction examples in each IoU range.
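As a rough illustration of the interaction measure used in Fig. 8, the sketch below computes the IoU between binary left- and right-hand masks; the helper name and mask format are illustrative assumptions, not taken from the released evaluation code.

import torch

def hand_mask_iou(left_mask, right_mask):
    # left_mask, right_mask: (H, W) boolean tensors of the two hands.
    inter = (left_mask & right_mask).sum().float()
    union = (left_mask | right_mask).sum().float()
    return (inter / union).item() if union > 0 else 0.0

left = torch.zeros(64, 64, dtype=torch.bool)
right = torch.zeros(64, 64, dtype=torch.bool)
left[10:40, 10:40] = True
right[30:60, 30:60] = True                 # overlapping hands fall into a higher IoU bin
print(hand_mask_iou(left, right))          # ~0.06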
Ablation Study      MPJPE Val     MRRPE Val   MPJPE Test    MRRPE Test
LR segm.†           28.72/36.05   50.85       25.75/31.46   46.98
Part segm.†         17.69/25.49   46.00       15.16/22.08   41.46
LR segm. (ours)     14.87/21.19   34.70       12.92/18.40   32.13
Part segm. (ours)   14.03/20.01   35.26       12.29/17.23   32.88

Table 4. Different segmentation types as intermediate representations. With the model in Fig. 6c, we show the effect of left/right masks and part segmentation. Entries marked with † use class-label maps [38, 55] instead of probabilistic maps. Values left and right of the slash are for single-hand and interacting-hand images, respectively.
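For reference, MPJPE in Table 4 is the mean per-joint position error between predicted and groundtruth 3D joints. The helper below is a hedged sketch of that computation (an illustrative function, not the official evaluation script, and assuming root-aligned joints reported in millimetres); MRRPE analogously measures the error of the predicted position of one hand's root joint relative to the other's.

import torch

def mpjpe(pred, gt):
    # pred, gt: (J, 3) 3D joint coordinates, e.g., root-aligned and in mm.
    return torch.norm(pred - gt, dim=-1).mean()

pred = torch.randn(21, 3) * 10.0
gt = torch.randn(21, 3) * 10.0
print(mpjpe(pred, gt))                     # scalar error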
5. Discussion

Our insight is that self-similarity in hands can be addressed by modeling pixel ownership. While our approach resembles 3D body-pose methods [38, 55] in terms of leveraging part segmentation, the key difference is that we incorporate the segmentation task in an end-to-end fashion, leading to more informative representations and a significant improvement in the main task. In particular, we pass the unnormalized segmentation probability distribution (i.e., logits) to the pose estimator, preserving the uncertainty for the downstream task. In contrast, the methods from [38, 55] take the quantized information (i.e., class labels). Our simple yet effective formulation also enables fully-differentiable end-to-end learning of hand pose estimation in conjunction with hand segmentation. We empirically show that our end-to-end multi-task setup achieves better performance compared to separate training of tasks (see Table 3) and class-label inputs (see Table 4) as in [38, 55].
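To make the end-to-end coupling concrete, the following is a minimal PyTorch sketch of the multi-task idea (a toy two-branch model with made-up layer sizes, not the DIGIT architecture): the pose branch consumes the unnormalized segmentation logits, so the segmentation and pose losses are optimized jointly and pose gradients reach the segmentation branch and the shared backbone.

import torch
import torch.nn.functional as F

class TinyTwoBranch(torch.nn.Module):
    def __init__(self, num_parts=33, num_joints=42):
        super().__init__()
        self.backbone = torch.nn.Conv2d(3, 16, 3, padding=1)      # stand-in feature extractor
        self.seg_head = torch.nn.Conv2d(16, num_parts, 1)         # per-pixel part logits
        self.pose_head = torch.nn.Linear(16 + num_parts, num_joints * 3)

    def forward(self, img):
        feat = torch.relu(self.backbone(img))
        seg_logits = self.seg_head(feat)                           # kept unnormalized
        fused = torch.cat([feat, seg_logits], dim=1).mean(dim=(2, 3))
        joints = self.pose_head(fused).view(img.shape[0], -1, 3)
        return seg_logits, joints

model = TinyTwoBranch()
img = torch.randn(2, 3, 64, 64)
seg_gt = torch.randint(0, 33, (2, 64, 64))
joints_gt = torch.randn(2, 42, 3)

seg_logits, joints = model(img)
loss = F.cross_entropy(seg_logits, seg_gt) + F.l1_loss(joints, joints_gt)
loss.backward()                            # pose gradients flow into the segmentation branch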
6. Conclusion

In this paper, we introduce a framework for interacting 3D hand pose estimation that explicitly addresses self-similarity between joints. Our method consists of two interwoven branches that process an input image into a per-pixel semantic part segmentation mask and a visual feature volume. The part segmentation mask provides semantic features for visually-similar hand regions, while the visual feature volume provides rich visual cues for accurate pose estimation. Our experiments show that our proposed method achieves state-of-the-art performance on the InterHand2.6M [30] dataset across all metrics. Detailed ablation studies show the efficacy of our method and provide insights into how the modeling of pixel ownership addresses self-ambiguity in single and interacting hand pose estimation.

Acknowledgement. The authors want to thank Emre Aksan and Dimitrios Tzionas for their valuable feedback.
Disclosure. MJB has received research gift funds from Adobe, Intel, Nvidia, Facebook, and Amazon. While MJB is a part-time employee of Amazon, his research was performed solely at, and funded solely by, Max Planck. MJB has financial interests in Amazon, Datagen Technologies, and Meshcapade GmbH.
References

[1] Vassilis Athitsos and Stan Sclaroff. Estimating 3d hand pose from a cluttered image. In CVPR, volume 2, pages II–432. IEEE, 2003. 3
[2] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Pushing the envelope for rgb-based dense 3d hand pose estimation via neural rendering. In CVPR, pages 1067–1076. Computer Vision Foundation / IEEE, 2019. 3
[3] Luca Ballan, Aparna Taneja, Jürgen Gall, Luc Van Gool, and Marc Pollefeys. Motion capture of hands in action using discriminative salient points. In ECCV, pages 640–653. Springer, 2012. 2, 3
[4] Adnane Boukhayma, Rodrigo de Bem, and Philip H. S. Torr. 3d hand shape and pose from images in the wild. In CVPR, pages 10843–10852. Computer Vision Foundation / IEEE, 2019. 2, 3
[5] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham, Junsong Yuan, and Nadia Magnenat-Thalmann. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In ICCV, pages 2272–2281. IEEE, 2019. 1, 2
[6] Yunlong Che and Yue Qi. Dynamic projected segmentation networks for hand pose estimation. In ICRA, pages 477–482. IEEE, 2018. 3
[7] Martin de La Gorce, David J Fleet, and Nikos Paragios. Model-based 3d hand pose estimation from monocular video. TPAMI, 33(9):1793–1805, 2011. 2
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009. 5
[9] Bardia Doosti, Shujon Naha, Majid Mirbagheri, and David J. Crandall. Hope-net: A graph-based model for hand-object pose estimation. In CVPR, pages 6607–6616. IEEE, 2020. 1, 2
[10] Zhipeng Fan, Jun Liu, and Yao Wang. Adaptive computationally efficient network for monocular 3d hand pose estimation. In ECCV, pages 127–144. Springer, 2020. 1, 2
[11] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan. 3D hand shape and pose estimation from a single rgb image. In CVPR, pages 10833–10842, 2019. 2
[12] Shangchen Han, Beibei Liu, Randi Cabezas, Christopher D Twigg, Peizhao Zhang, Jeff Petkau, Tsz-Ho Yu, Chun-Jung Tai, Muzaffer Akbay, Zheng Wang, et al. Megatrack: monochrome egocentric articulated hand-tracking for virtual reality. TOG, 39(4):87–1, 2020. 2
[13] Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, and Cordelia Schmid. Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In CVPR, pages 568–577. IEEE, 2020. 2
[14] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In CVPR, pages 11807–11816. Computer Vision Foundation / IEEE, 2019. 2
[15] Yihui He, Rui Yan, Katerina Fragkiadaki, and Shoou-I Yu. Epipolar transformers. In CVPR, pages 7779–7788, 2020. 2, 3
[16] Tony Heap and David Hogg. Towards 3d hand tracking using a deformable model. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pages 140–145. IEEE, 1996. 2
[17] Umar Iqbal, Andreas Doering, Hashim Yasin, Björn Krüger, Andreas Weber, and Juergen Gall. A dual-source approach for 3d human pose estimation from single images. Comput. Vis. Image Underst., 172:37–49, 2018. 2
[18] Umar Iqbal, Pavlo Molchanov, Thomas Breuel, Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5D heatmap regression. In ECCV, pages 118–134, 2018. 1, 2, 4
[19] Byeongkeun Kang, Kar-Han Tan, Nan Jiang, Hung-Shuo Tai, Daniel Tretter, and Truong Nguyen. Hand segmentation for hand-object interaction from depth map. In GlobalSIP, pages 259–263. IEEE, 2017. 3
[20] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In CVPR, 2018. 6
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015. 6
[22] Nikos Kolotouros. Pytorch implementation of the neural mesh renderer, 2018. 6
[23] Dominik Kulon, Riza Alp Güler, Iasonas Kokkinos, Michael M. Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In CVPR, pages 4989–4999. IEEE, 2020. 2
[24] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In CVPR, pages 10437–10446, 2020. 6
[25] Shan Lu, Dimitris Metaxas, Dimitris Samaras, and John Oliensis. Using multiple cues for hand tracking and model refinement. In CVPR, volume 2, pages II–443. IEEE, 2003. 2
[26] Diogo C Luvizon, Hedi Tabia, and David Picard. Human pose regression by combining indirect part detection and contextual information. Computers & Graphics, 85:15–22, 2019. 4, 7
[27] Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. In ICCV, pages 10133–10142, 2019. 5
[28] Gyeongsik Moon and Kyoung Mu Lee. I2l-meshnet: Image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image. In ECCV, 2020. 2
[29] Gyeongsik Moon, Takaaki Shiratori, and Kyoung Mu Lee. Deephandmesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. In CVPR, 2020. 2
[30] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. InterHand2.6M: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In ECCV, 2020. 1, 2, 3, 4, 5, 6, 7, 8
[31] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. InterHand2.6M: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. https://github.com/facebookresearch/InterHand2.6M, 2020. 5
[32] Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. Ganerated hands for real-time 3d hand tracking from monocular RGB. In CVPR, pages 49–59. IEEE Computer Society, 2018. 1, 2
[33] Franziska Mueller, Micah Davis, Florian Bernard, Oleksandr Sotnychenko, Mickeal Verschoor, Miguel A Otaduy, Dan Casas, and Christian Theobalt. Real-time pose and shape reconstruction of two interacting hands with a single depth camera. TOG, 38(4):1–13, 2019. 2, 3
[34] Markus Oberweger and Vincent Lepetit. Deepprior++: Improving fast and accurate 3d hand pose estimation. In ICCV Workshops, pages 585–594, 2017. 3
[35] Markus Oberweger, Paul Wohlhart, and Vincent Lepetit. Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807, 2015. 3
[36] Iasonas Oikonomidis, Nikolaos Kyriazis, and Antonis A. Argyros. Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints. In Dimitris N. Metaxas, Long Quan, Alberto Sanfeliu, and Luc Van Gool, editors, ICCV, pages 2088–2095. IEEE Computer Society, 2011. 2
[37] Iasonas Oikonomidis, Nikolaos Kyriazis, and Antonis A Argyros. Tracking the articulated motion of two strongly interacting hands. In CVPR, pages 1862–1869. IEEE, 2012. 2, 3
[38] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 3DV, pages 484–494. IEEE, 2018. 2, 3, 7, 8
[39] Paschalis Panteleris, Iason Oikonomidis, and Antonis Argyros. Using a single rgb frame for real time 3d hand pose estimation in the wild. In WACV, pages 436–445. IEEE, 2018. 2
[40] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035, 2019. 5
[41] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In CVPR, pages 459–468, 2018. 2, 3
[42] James M. Rehg and Takeo Kanade. Visual tracking of high dof articulated structures: An application to human hand tracking. In Jan-Olof Eklundh, editor, ECCV, pages 35–46, Berlin, Heidelberg, 1994. Springer Berlin Heidelberg. 2
[43] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: modeling and capturing hands and bodies together. ACM Trans. Graph., 36(6):245:1–245:17, 2017. 2
[44] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. 4
[45] Tomas Simon, Hanbyul Joo, Iain A. Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, pages 4645–4653. IEEE Computer Society, 2017. 2, 3
[46] Breannan Smith, Chenglei Wu, He Wen, Patrick Peluse, Yaser Sheikh, Jessica K Hodgins, and Takaaki Shiratori. Constraining dense hand surface tracking with elasticity. TOG, 39(6):1–14, 2020. 2, 3
[47] Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Otmar Hilliges, and Jan Kautz. Weakly supervised 3d hand pose estimation via biomechanical constraints. In ECCV, 2020. 1, 2
[48] Adrian Spurr, Jie Song, Seonwook Park, and Otmar Hilliges. Cross-modal deep variational hand pose estimation. In CVPR, pages 89–98. IEEE Computer Society, 2018. 1, 2
[49] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019. 5
[50] Bugra Tekin, Federica Bogo, and Marc Pollefeys. H+O: unified egocentric recognition of 3d hand-object poses and interactions. In CVPR, pages 4511–4520. Computer Vision Foundation / IEEE, 2019. 1, 2
[51] Dimitrios Tzionas, Luca Ballan, Abhilash Srikantha, Pablo Aponte, Marc Pollefeys, and Juergen Gall. Capturing hands in action using discriminative salient points and physics simulation. IJCV, 118(2):172–193, 2016. 2, 3
[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017. 3
[53] Jiayi Wang, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A Otaduy, Dan Casas, and Christian Theobalt. Rgb2hands: real-time tracking of 3d hand interactions from monocular rgb video. TOG, 39(6):1–16, 2020. 2, 3
[54] Linlin Yang and Angela Yao. Disentangling latent hands for image synthesis and pose estimation. In CVPR, pages 9877–9886. Computer Vision Foundation / IEEE, 2019. 1, 2
[55] Andrei Zanfir, Eduard Gabriel Bazavan, Hongyi Xu, William T Freeman, Rahul Sukthankar, and Cristian Sminchisescu. Weakly supervised 3d human pose and shape reconstruction with normalizing flows. In ECCV, pages 465–481. Springer, 2020. 2, 3, 7, 8
[56] Cairong Zhang, Guijin Wang, Xinghao Chen, Pengwei Xie, and Toshihiko Yamasaki. Weakly supervised segmentation guided hand pose estimation during interaction with unknown objects. In ICASSP, pages 2673–2677. IEEE, 2020. 3
[57] Xiong Zhang, Qiang Li, Hong Mo, Wenbo Zhang, and Wen Zheng. End-to-end hand mesh recovery from a monocular RGB image. In ICCV, pages 2354–2364. IEEE, 2019. 2
[58] Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single RGB images. In ICCV, pages 4913–4921. IEEE Computer Society, 2017. 1, 2, 3