High-quality First-person Rendering Mixed Reality Gaming System for In Home Setting

2020 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR)

Yu-Yen Chung, Hung-Jui Guo, Hiranya Garbha Kumar and Balakrishnan Prabhakaran
The University of Texas at Dallas, Richardson, Texas 75080-3021
{Yu-Yen.Chung, Hung-Jui.Guo, hiranya, bprabhakaran}@utdallas.edu

Abstract—With the advent of low-cost RGB-D cameras, mixed reality serious games using 'live' 3D human avatars have become popular. Here, RGB-D cameras are used for capturing and transferring the user's motion and texture onto the 3D human avatar in virtual environments. A system with a single camera is more suitable for such mixed reality games deployed in homes, considering the ease of setting up the system. In these mixed reality games, users can have either a third-person or a first-person perspective of the virtual environments used in the games. Since the first-person perspective provides a better Sense of Embodiment (SoE), in this paper we explore the problem of providing a first-person perspective for mixed reality serious games played in homes. We propose a real-time textured humanoid-avatar framework to provide a first-person perspective and address the challenges involved in setting up such a gaming system in homes. Our approach comprises: (a) SMPL humanoid model optimization for capturing the user's movements continuously; (b) a real-time texture transferring and merging OpenGL pipeline to build a global texture atlas across multiple video frames. We target the proposed approach towards a serious game for amputees, called Mr.MAPP (Mixed Reality-based framework for MAnaging Phantom Pain), where the amputee's intact limb is mirrored in real-time in the virtual environment. For this purpose, our framework also introduces a mirroring method to generate a textured phantom limb in the virtual environment. We carried out a series of visual and metrics-based studies to evaluate the effectiveness of the proposed approaches for skeletal pose fitting and texture transfer to SMPL humanoid models, as well as for mirroring and texturing the missing limb (for future amputee-based studies).

Index Terms—Mixed Reality, Real-time Textured Humanoids, First-Person Perspective Rendering

I. INTRODUCTION

With the advent of RGB-D cameras such as the Microsoft Kinect, mixed reality systems and applications have incorporated "live" 3D human models [1]. These "live" 3D human models are generated in real-time by capturing the RGB (texture), human skeleton, and depth data associated with a human in a 3D scene, using these data to create a 3D mesh in real-time, and applying texture over the mesh. Such "live" 3D human models give users a better sense of immersion, as they can see details such as the dress the human is wearing, their facial expressions, etc. Sense of Embodiment (SoE) is defined as "the ensemble of sensations that arise in conjunction with being inside, having, and controlling a body" [2]. SoE is an essential part of enhancing user experience in an immersive virtual environment. Recent studies [3], [4] have indicated that users experiencing a first-person perspective (1PP) in immersive environments had a better SoE toward their virtual body than those with a third-person perspective (3PP).

In-home Mixed Reality Serious Games: Once the "live" 3D human mesh models are generated, they can be manipulated just like other 3D graphical models. This fact was exploited by Mr.MAPP (Mixed Reality-based framework for MAnaging Phantom Pain) [5], [6] to create an in-home serious game. Phantom pain is typically felt after amputation (or even paralysis) of limbs. Such phantom pain patients experience vivid sensations from the missing body part, such as frozen movement or extreme pain. For intervention, mirror therapy [7] shows that perceptual exercise such as a mirror box helps the patient's brain to learn that the limb is paralyzed or amputated. Mr.MAPP is a suite of serious games targeting upper and lower limb amputees to manage phantom pain. It 'replicates' a 3D graphical illusion of the intact limb, to create an illusion similar to the one in mirror-box therapy.

A. Challenges for First-person Perspective In-home Serious Games

Mr.MAPP is successful in rendering 3PP but insufficient to reconstruct the portion invisible from the camera in 1PP. The person's mesh looks intact in 3PP (Fig. 1(b)) even though the right leg was folded as in Fig. 1(a). In 1PP as in Fig. 1(c), however, holes appear on the lower limbs because the upper side of the feet and the lower part of the leg were blocked by the sole. One possible solution is to use pre-defined humanoid models. The Skinned Multi-Person Linear model (SMPL) [8] is a parametric humanoid model which contains shape and pose parameters for customizing a 3D model of a person without texture. With SMPL, users are able to see a personalized humanoid avatar in the virtual environment. However, the model will not be as vivid as the "live" 3D human mesh generated in real-time using RGB-D cameras, since the SMPL model is an avatar without any texture.

B. Real-time Textured Humanoid Framework

In this paper, we target the problem of generating high-quality first-person rendering in mixed reality serious games, especially for in-home applications. Such a rendering uses the texture data (in real-time) pertaining to the human in the 3D scene captured by a Kinect RGB-D camera. This real-time texture is transferred from the human in the scene to a customized SMPL humanoid model and provides an enhanced SoE in 1PP.

Fig. 1. Comparison of the mirrored model from Mr.MAPP and the proposed method, with the Kinect at 80 inches high. (a) RGB image from Kinect; (b) Mr.MAPP in third-person perspective; (c) Mr.MAPP in first-person perspective; (d) textured SMPL in third-person perspective; (e) textured SMPL in first-person perspective.

The entire series of operations to achieve this goal includes: a two-stage SMPL model optimization for matching the humanoid to the human in the scene, and real-time texture registration and transfer from the human captured by a Kinect RGB-D camera. We target the proposed approach towards the serious game for amputees, Mr.MAPP. While the proposed approach works for non-disabled people (with intact limbs) in a straightforward manner, we outline an approach that can: (a) replicate the 3D graphical limb from the customized humanoid; (b) transfer the texture data from the intact human limb. Through the sample resulting model in 3PP in Fig. 1(d), we can see that the positions of the body, hands, and legs are mostly aligned with the SMPL model. Further, in 1PP as in Fig. 1(e), most of the visible surface is textured. Hence, this approach can be incorporated in Mr.MAPP to provide better SoE for amputees using the serious games.

For validation, we carried out a series of visual and metrics-based studies to evaluate the effectiveness of the proposed approaches for: (i) real-time skeletal pose fitting and (ii) texture transfer to SMPL humanoid models, as well as (iii) mirroring and texturing the missing limb (for future amputee-based studies).

The main contributions of this work include:
• A two-stage SMPL optimization strategy incorporating shape initialization and near real-time pose update. To enhance the stability of initialization, a randomized repetition procedure was introduced and validated on Otte's Kinect V2 dataset [9].
• An approach for real-time mirroring of the missing limb (in the case of amputees) so that the approach can be incorporated in Mr.MAPP for providing better SoE.
• A novel approach of simulating the visible area to evaluate texture retrieval quality for different positions of the Kinect RGB-D camera setup.

II. RELATED WORKS

Reconstructing a live 3D human model is a challenging task due to the non-rigid deformation of body shape, potentially fast motion, and a large range of possible poses. Typically, to achieve a viewpoint-free model, many approaches have been developed for fusing information from multiple cameras. On the contrary, for systems targeting regular in-home users, reconstructing from a single camera is a good alternative.

A. Multi-view systems

With a set of RGB-D cameras, a point cloud is a simple representation for volumetric video, since it does not require much preprocessing for rendering. The point clouds from the various cameras can be aligned based on the intrinsic and extrinsic parameters or the Iterative Closest Point (ICP) algorithm [10]. Mekuria et al. [11] introduced a point cloud compression codec to transfer the thousands up to millions of points efficiently. The truncated signed distance function (TSDF) [12] is another data structure, which fuses the depth data into a predefined volume. Holoportation [13] reconstructs both the shape and color of the whole scene based on the TSDF. Fusion4D [1] additionally maintained key volumes to deal with radical surface changes, and smooth non-rigid motion was applied within the key volume sequence. Dou et al. [14] further improved the frame rate of Fusion4D through spectral embedding and a more compact parameterization. For texture reconstruction, Du et al. [15] apply majority voting to enhance the quality of the resulting texture. Targeting human performance capture specifically, Xu et al. [16] use SMPL [8] as a core shape and an outer layer to represent the structure of the outfit.

B. Single view systems

In the case of a single view, the limited viewpoint and occlusion make the reconstruction even harder. Hence, to build a first-person perspective game with a single camera placed at a different perspective, a prior model becomes more important in the setup. SMPL [8] is a parameterized human body model which has been applied in much research to achieve better human shape or pose estimation. Further, techniques used by single view systems vary depending on whether depth information is provided (using RGB-D cameras) or not (only RGB cameras).

1) RGB Camera-based Single View: With a proper model, some research could even build the reconstruction on top of RGB data without depth information. For a single photo, Bogo et al. [17] fit an SMPL model toward 2D skeleton joints with other shape constraints for reconstructing the pose and shape of the person. Pavlakos et al. [18] further extend SMPL with fully articulated hands and an expressive face to retrieve facial expression and hand gestures. For the case with a video, Alldieck et al. [19], [20] applied SMPL with an extra offset layer to reconstruct a textured model. For achieving finer texture, they further trained a CNN (Convolutional Neural Network) for precise texture registration.

However, the time taken by the method is on the order of minutes, and not really real-time. With a prebuilt textured mesh and parameterized motion, Xu et al. [21] and Habermann et al. [22] pre-trained a CNN to convert the 2D skeletal joints to 3D space and then perform a non-rigid deformation based on the joints. Yet, the texture stays the same unless the user reconstructs another model.

2) RGB-D Camera-based Single View: The depth sensor provides information about the 3D structure. Some approaches could still build on a model-free framework. DynamicFusion [23] simultaneously reconstructed a canonical scene and tracked a non-rigid deforming scene using a TSDF in real-time, without a prior model. VolumeDeform [24] extended the work by using SIFT (Scale Invariant Feature Transform) features to align the images. Yu et al. [25] added skeleton information to DynamicFusion to better fit a full body shape when targeting humans. DoubleFusion [26] further leveraged an SMPL layer to improve the shape. Guo et al. [27] applied estimated albedo under a Lambertian surface assumption to improve the tracking. These methods use a TSDF with the GPU to achieve real-time performance. Yet, the voxel grid may require a large amount of GPU memory and thus bounds the shape to a predefined volume [28]. To sum up, with a single-RGB-camera-only system, the user will not be able to change their texture without a model re-building process. Moreover, it is harder to achieve real-time performance without depth information. Since RGB-D cameras are easily available, our system uses a single RGB-D camera with the SMPL model and provides real-time performance.

III. PROPOSED METHOD

Fig. 2. Proposed approach for high-quality first-person rendering with SMPL optimization and OpenGL.

Figure 2 summarizes the modular framework of our proposed system for high-quality first-person rendering in mixed reality, along with the evaluation strategies (used in Section IV); a high-level sketch of this per-frame loop is given after the list:
1) Fitting the SMPL model with the joints of each frame to represent the user's model in action. This operation includes model initialization along with pose updates.
2) For rendering, texture is transferred to the global atlas using RGB image data from the Kinect with the fitted SMPL model.
3) In the case of serious games for amputees, we mirror the intact limb of the amputee to create an illusion of the missing limb. This mirroring process includes creation of the missing skeletal joints and the associated texture information.
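A minimal Python skeleton of this per-frame loop is shown below. All helper functions are stubs standing in for the components detailed in the following subsections; names and signatures are illustrative assumptions, not the actual implementation.

def shape_initialization(joints):      # Sec. III-A.1 (stub)
    return [0.0] * 10, [0.0] * 72      # (beta, theta)

def pose_update(beta, theta, joints):  # Sec. III-A.2 (stub)
    return theta

def transfer_texture(rgb, beta, theta, atlas):  # Sec. III-B (stub)
    pass

def mirror_missing_limb(theta, atlas):          # Sec. III-B (stub)
    pass

def process_frame(rgb, joints, state):
    # 1) Fit SMPL: one-time shape initialization, then per-frame pose updates.
    if state.get("beta") is None:
        state["beta"], state["theta"] = shape_initialization(joints)
    else:
        state["theta"] = pose_update(state["beta"], state["theta"], joints)
    # 2) Transfer texture from the RGB frame into the global atlas.
    transfer_texture(rgb, state["beta"], state["theta"], state["atlas"])
    # 3) For amputee games, mirror the intact limb's joints and texture.
    if state.get("mirror"):
        mirror_missing_limb(state["theta"], state["atlas"])
    return state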
A. SMPL model optimization

The proposed approach fits the SMPL joints to the 3D joints detected by a Kinect RGB-D camera, in near real-time, for each frame. The optimization is carried out in two stages: (i) shape initialization; (ii) real-time pose update. L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) optimization [29] is performed on each frame in both stages.

1) Shape Initialization: Shape initialization determines the shape (β) and initial pose (θ) for the SMPL model from a set of frames of 3D human skeleton joints at the beginning. Due to the small pose differences between consecutive frames, the set of optimized parameters can be applied as the initial value for the next incoming frame. Moreover, to avoid the objective function being trapped in a local minimum, the optimization is repeated an extra $n_{rep}$ times on randomized parameters for each frame in this stage, to determine the best result for the current frame. The objective function of the initializing stage, $E_{init}(\beta, \theta, \hat{J})$, considers the mismatch of joint positions and regularization terms for both θ and β:

$$E_{init}(\beta, \theta, \hat{J}) = \lambda_j E_j(\beta, \theta, \hat{J}) + \lambda_p E_p(\theta) + \lambda_t E_t(\theta) + \lambda_b E_b(\beta) \quad (1)$$

wherein $\lambda_j$, $\lambda_t$, $\lambda_p$ and $\lambda_b$ are the weighting parameters of the objective function. $E_j(\beta, \theta, \hat{J})$ is the loss corresponding to the mismatch between the SMPL joints and those estimated through the Kinect V2. This term is the major source of the whole energy function:

$$E_j(\beta, \theta, \hat{J}) = \sum_{j \in J_m} \rho\big(R_\theta(J(\beta))_j - \hat{J}_j\big) \quad (2)$$

Here $J_m$ denotes the set of joints that can be mapped between SMPL and the joints extracted by the Kinect camera. $\hat{J}$ refers to the joint set estimated by the Kinect. $R_\theta$ is the global rigid transformation derived from θ; hence the 3D positions of the SMPL joints are $R_\theta(J(\beta))$. $\rho$ is the robust, differentiable Geman-McClure penalty function.

$$E_p(\theta) = \min_j\big(-\ln(g_j\, \mathcal{N}(\theta; \mu_{\theta,j}, \Sigma_{\theta,j}))\big) \quad (3)$$

$$E_t(\theta) = \sum_{i \in J_s} \exp(A(\theta)_i) + \sum_{j \in J_r} \exp(A(\theta)_j) \quad (4)$$

$E_p(\theta)$ is the Gaussian-mixture pose prior introduced in Bogo et al.'s work [17]. $E_t(\theta)$ is the term that regularizes θ to prevent twisted or unnatural poses, where $A(\theta)_i$ is the angle of the $i$-th component of θ. To avoid cumulative error over a few consecutive spine joints, angles corresponding to the spine ($J_s$) and the rest ($J_r$) (except the global transformation) are calculated differently. Last, $E_b(\beta)$ is a ridge regularization term for β:

$$E_b(\beta) = \|\beta\|^2 \quad (5)$$

2) Pose update: The pose update stage can be regarded as a simplified initialization, for achieving real-time performance. Once the shape has been initialized, β becomes a set of constants for the following pose update stage. Also, no extra repetition is required, since θ has been properly initialized in the previous stage. $E_{pose}(\beta, \theta, \hat{J})$ is the objective function for this stage:

$$E_{pose}(\beta, \theta, \hat{J}) = \lambda_j E_j(\beta, \theta, \hat{J}) + \lambda_t E_t(\theta) \quad (6)$$
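The following is a minimal PyTorch sketch of the random-restart L-BFGS fitting described above. It is a sketch under stated assumptions: smpl_joints is a placeholder for a differentiable SMPL joint regressor, the Gaussian-mixture pose prior $E_p$ of Eq. (3) is omitted, the angle regularizer is a crude stand-in for Eq. (4), and all weights, sizes, and n_rep are illustrative rather than the paper's tuned values.

import torch

def smpl_joints(beta, theta):
    """Placeholder for the differentiable SMPL joint regressor R_theta(J(beta))."""
    return (beta.sum() + theta).reshape(-1, 1).expand(-1, 3)

def geman_mcclure(residual, sigma=1.0):
    """Robust rho from Eq. (2); residual has shape (num_joints, 3)."""
    sq = (residual ** 2).sum(dim=-1)
    return (sq / (sq + sigma ** 2)).sum()

def e_init(beta, theta, j_hat, lam_j=1.0, lam_t=0.01, lam_b=0.001):
    """Simplified Eq. (1): joint term + angle regularizer + ridge on beta."""
    e_j = geman_mcclure(smpl_joints(beta, theta) - j_hat)
    e_t = torch.exp(theta.abs()).sum()   # crude stand-in for Eq. (4)
    e_b = (beta ** 2).sum()              # Eq. (5)
    return lam_j * e_j + lam_t * e_t + lam_b * e_b

def fit_init(j_hat, n_rep=5, steps=20):
    """Random-restart L-BFGS over (beta, theta); j_hat is a (72, 3) target here."""
    best, best_loss = None, float("inf")
    for _ in range(n_rep):  # repetitions to escape local minima
        beta = torch.zeros(10, requires_grad=True)
        theta = (0.1 * torch.randn(72)).requires_grad_()
        opt = torch.optim.LBFGS([beta, theta], max_iter=steps)

        def closure():
            opt.zero_grad()
            loss = e_init(beta, theta, j_hat)
            loss.backward()
            return loss

        opt.step(closure)
        with torch.no_grad():
            loss = e_init(beta, theta, j_hat).item()
        if loss < best_loss:
            best, best_loss = (beta.detach(), theta.detach()), loss
    return best

# The pose update (Eq. (6)) would re-run the same loop with beta frozen and
# no random restarts, warm-starting theta from the previous frame.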

                                                               341
High-quality First-person Rendering Mixed Reality Gaming System for In Home Setting
Fig. 3. SMPL texture atlas with limbs unwrapped symmetrically. (a) Unwrap of the neutral SMPL; (b) example atlas [19].

B. Texture Atlas and Missing Limb Generation

To render the fitted model with a vivid appearance, texturing is required. Rather than applying a synthetic, pre-defined texture, our approach seeks to transfer the texture from the 'live' human user captured by the Kinect RGB-D camera to the fitted SMPL model. Specifically, we back-project each triangle from world space to the RGB image to retrieve the corresponding texture piece in the texture atlas of SMPL (see the sketch below). To create a textured and interactable representation of the phantom limb for the amputee, both joints and texture are mirrored from the intact counterpart. For joint mirroring, a method based on the hip, shoulder, or spine joints introduced in Mr.MAPP [5], [6] is applied to generate the joints of the amputee's missing limb. For texture mirroring, an SMPL was unwrapped symmetrically on both the upper and lower limbs using Autodesk Maya¹. Fig. 3 shows the unwrapped neutral SMPL. In Fig. 3a, the regions labeled by the green and blue boxes are the regions for limb mirroring. The red margin labels the unsymmetrical area, since the SMPL mesh is not exactly symmetrical on both sides. Thus, with the limb-mirroring texture atlas, the texture mirroring can be integrated into OpenGL efficiently.
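As a rough illustration of the back-projection, the following NumPy sketch projects camera-space mesh vertices into the color frame and copies the sampled pixels to their atlas coordinates. It simplifies the actual OpenGL pipeline in two assumed ways: it transfers per-vertex rather than rasterizing per-triangle, and it omits occlusion handling; K (color-camera intrinsics), uv (atlas coordinates from the symmetric unwrap), and the function name are illustrative.

import numpy as np

def backproject_texture(vertices, uv, rgb, K, atlas):
    """Project camera-space vertices into the RGB frame; copy pixels to the atlas."""
    h_img, w_img, _ = rgb.shape
    h_atl, w_atl, _ = atlas.shape
    p = (K @ vertices.T).T                      # pinhole projection, (N, 3)
    z = np.maximum(p[:, 2], 1e-6)               # guard against division by zero
    px = (p[:, 0] / z).astype(int)
    py = (p[:, 1] / z).astype(int)
    vis = (px >= 0) & (px < w_img) & (py >= 0) & (py < h_img) & (vertices[:, 2] > 0)
    au = (uv[vis, 0] * (w_atl - 1)).astype(int) # atlas texel coordinates
    av = (uv[vis, 1] * (h_atl - 1)).astype(int)
    atlas[av, au] = rgb[py[vis], px[vis]]       # write sampled colors
    return atlas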
IV. EXPERIMENTAL RESULTS

The prototype of the proposed approach was developed in Python 3.6. The optimization runs on an Intel i7-8750H CPU with 32GB RAM using PyTorch 1.3.1. The rendering engine is Panda3D², running on a dedicated NVIDIA GeForce RTX 2070 graphics card with OpenGL 4.5. The human body skeletal joints and RGB images from the Kinect camera are extracted using Kinect SDK 2.0. We conducted several experiments to evaluate the proposed approach with the following goals: (a) skeletal pose to SMPL model fitting and pose updates; (b) effectiveness of real-time texture transfer; (c) missing limb generation and the associated texture transfer.

¹ https://autodesk.com/maya
² https://www.panda3d.org/

Fig. 4. Comparison of average loss after different repeat times of random initialization using Otte's activity dataset.

A. Shape Initialization

To enhance stability, we introduced a random repetition strategy in the initialization process and validated the strategy using Otte's activity dataset [9]. In Otte et al.'s work, they created a Kinect RGB-D skeleton dataset with six different activities: Stand up And Sit down (SAS), Short Comfortable Speed Walk (SCSW), Short Maximum Speed Walk (SMSW), Short Line Walk (SLW), Stance with closed feet and open and closed eyes (POCO), and Walking on the spot (STEPO). Data were collected from a total of 19 user trials, with 3 repetitions for STEPO and 5 repetitions for the remaining activities. Fig. 4 compares the initialized results for different numbers of random repetitions. In this experiment, the first 10 frames of each trial were used for shape initialization, and the average loss over the following 50 frames of pose update was collected to evaluate the SMPL optimization. Each activity (with the above acronyms) is listed on the X axis as a group and considered separately, since the fitting result can vary for different poses. The Y axis is the distribution of the average loss across trials. In each group, the blue boxes are results obtained without repetitions in the initialization stage, and the others involve different numbers of repetitions. For all activities, the averaged loss and its variation in the following pose update stage are much lower when we have repetitions (as opposed to no repetitions). Thus, the repetitions result in better and more stable performance in the initialization stage.

B. Simulation of Texture Atlas Reconstruction with Different Camera Heights

In this experiment, the goal is to estimate the effect of camera height on texture atlas reconstruction in a first-person game. The example gaming scenario is kicking the lower limb forward in a sitting pose, as in Bahirat et al.'s work [6]. In Alldieck et al.'s work [19], they released a dataset of a total of 24 textured SMPL models with shape parameters and additional personal offsets. We first recorded movements of the body joints in the example gaming scenario to fit the pose parameters for operating these models.

Fig. 5. Visible texture displayed ratio at different camera setups. The sub-figures below show the corresponding overlap of texture on the lower limb portion with the front camera at 5, 30, 55, and 80 inches high.

Then, to simulate the retrievable and visible texture areas, we use two virtual cameras: (i) a front camera set 2.5 m in front of these personalized SMPL models; (ii) a virtual eye camera with a FoV of 95°×75°, located 0.2 m in front of the head joint and looking toward the fitted SMPL model's lower limbs to mimic the eyes' vision. For evaluating the effectiveness of texture reconstruction with respect to the height at which the virtual front camera is placed, we defined a visible texture displayed ratio $R_{vt}(g, c)$:

$$R_{vt}(g, c) = \frac{area(T_{vis}(g) \cap T_{ret}(c))}{area(T_{vis}(g))} \quad (7)$$

$R_{vt} \in [0, 1]$. $R_{vt} = 1$ means all the portions of the 3D avatar seen by the user are textured; on the contrary, $R_{vt} = 0$ means nothing the user sees is textured, no matter how many texels have been retrieved. $T_{vis}(g)$ refers to the set of texels visible to the user in a gaming scenario ($g$); $T_{ret}(c)$ refers to the set of texels retrieved using a camera configuration ($c$). The camera configuration may in general involve camera properties such as FoV, position, and orientation, but we focus on the effect of camera height in our simulation. To calculate $R_{vt}$ here, the two sets are estimated as below:
1) Visible to the virtual front camera at the testing height ($T_{ret}$): the texels correctly retrieved in over 80 percent of the trials.
2) Visible to the virtual eye camera ($T_{vis}$): the surface seen by the virtual eye camera in any trial.
The resulting $R_{vt}$ in Figure 5 demonstrates this simple concept. As the camera height increases from 5 to 80 inches, the corresponding $R_{vt}$ increases from around 0.48 to 0.75. Moreover, the overlapped texture areas on the lower limb with cameras at 5, 30, 55, and 80 inches high are shown in Figure 5. The meaning of each color is as follows:
• Green: the area visible to both the eye camera and the virtual front camera. These textures contribute to the user's vision in the scene.
• Red: the area visible only to the eye camera. More red area appears on the upper face of the feet and limbs in the sub-figures for the lower virtual front camera setups. In other words, the user will not see texture on these areas, which might lower the fidelity of the game.
• Yellow: the area visible only to the virtual camera at the front position, which means the user will not notice this area in the given gaming scenario.
• Blue: the area captured by neither the eye camera nor the virtual front camera.
Through this series of sub-figures, as the height of the virtual front camera increases, the camera view gradually aligns with the person's perspective, since the green area increases. As a result, raising the camera's height improves the texture retrieval in this gaming scenario. This result is consistent with intuition, since the user sees more of the upper face of their lower limbs in 1PP in our example gaming scenario.
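As a small illustration, the following NumPy sketch computes $R_{vt}$ from boolean texel masks and reproduces the four-color overlap coding described above; the masks t_vis and t_ret are assumed to be precomputed per the two estimation rules (seen in any trial, and retrieved in over 80 percent of trials, respectively).

import numpy as np

def visible_texture_ratio(t_vis, t_ret):
    """Eq. (7): fraction of user-visible texels that were also retrieved."""
    return (t_vis & t_ret).sum() / t_vis.sum()

def classify_texels(t_vis, t_ret):
    """Color-coded map matching the overlap sub-figures."""
    color = np.zeros(t_vis.shape + (3,), dtype=np.uint8)
    color[t_vis & t_ret] = (0, 255, 0)       # green: visible and retrieved
    color[t_vis & ~t_ret] = (255, 0, 0)      # red: visible but untextured
    color[~t_vis & t_ret] = (255, 255, 0)    # yellow: retrieved, never seen
    color[~t_vis & ~t_ret] = (0, 0, 255)     # blue: neither
    return color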
Fig. 6. Example of a lower-limb-mirrored SMPL model. (a) RGB frame from Kinect; (b), (c) textured SMPL in 3PP and 1PP.

C. Missing Limb Generation

One main motivation of our system is to have a textured humanoid avatar with a generated phantom limb. Lower-limb-mirrored examples in 3PP and 1PP are provided in Fig. 1d, e and Fig. 6b, c. Basically, all the visible surfaces in 1PP are hole-free and textured. Still, there are some areas without proper texture on the feet, hands, and pants, since the SMPL may not exactly match the body in the image. In summary, our system provides a vivid limb-mirrored textured model for the lower limb actions in first-person perspective with the Kinect RGB-D camera at the front.
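For the joint-mirroring step behind these results, a minimal NumPy sketch of reflecting intact-limb joints across a sagittal plane is shown below. The plane estimation from the hip and spine joints follows the idea borrowed from Mr.MAPP [5], [6], while the joint names, coordinates, and pairing are illustrative assumptions.

import numpy as np

def sagittal_plane(hip_l, hip_r, spine):
    """Mirror plane through the spine joint, normal to the hip-to-hip axis."""
    n = hip_r - hip_l
    n = n / np.linalg.norm(n)
    return n, spine          # unit normal and a point on the plane

def mirror_joint(p, n, q):
    """Reflect a 3D point p across the plane with unit normal n through q."""
    return p - 2.0 * np.dot(p - q, n) * n

# Usage: synthesize a missing left knee from the intact right knee.
joints = {"hip_l": np.array([-0.1, 0.9, 0.0]),
          "hip_r": np.array([0.1, 0.9, 0.0]),
          "spine": np.array([0.0, 1.1, 0.0]),
          "knee_r": np.array([0.12, 0.5, 0.05])}
n, q = sagittal_plane(joints["hip_l"], joints["hip_r"], joints["spine"])
joints["knee_l"] = mirror_joint(joints["knee_r"], n, q)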
V. CONCLUDING REMARKS

In this work, we introduced a real-time textured humanoid avatar framework for mixed reality games in 1PP using a single RGB-D camera, especially for in-home deployment. The framework reduces the effect of the camera position through a skeleton-based virtual floor calibration. We employed a two-stage optimization strategy to update the SMPL model pose with respect to the pose of the user. Texture transfer to the SMPL model was carried out by capturing a set of frames and incorporating the accumulated texture into a global atlas, in real-time. In the case of serious games for managing phantom pain, the system can generate a vivid phantom limb by mirroring the joints, limb, and associated texture from the amputee's intact limb.

To increase the visible texture display ratio, we can ask users to perform some additional initializing activity to let the Kinect RGB-D camera capture more texture, or adjust the front camera to a higher position. However, for texture transfer, we found that some portions of the face, hands, and legs may not be well aligned if we perform a back projection directly. Thus, a more sophisticated registration approach may improve the texture retrieval of the system. Our prototype runs in near real-time, with SMPL optimization being carried out on the CPU. Therefore, optimizing the system to fully leverage GPU capabilities will be a future step to reduce the execution time. Due to the COVID-19 pandemic, we were limited in our ability to integrate our system for carrying out a human-participant study for evaluating SoE (Sense of Embodiment) with the proposed approach for the first-person perspective. We will take that (human participant study) also as future work.

VI. ACKNOWLEDGEMENTS

This material is based upon work supported by the US Army Research Office (ARO) Grant W911NF-17-1-0299. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the ARO.

REFERENCES

[1] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello, A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Taylor et al., "Fusion4D: Real-time performance capture of challenging scenes," ACM Transactions on Graphics (TOG), vol. 35, no. 4, pp. 1–13, 2016.
[2] K. Kilteni, R. Groten, and M. Slater, "The sense of embodiment in virtual reality," Presence: Teleoperators and Virtual Environments, vol. 21, no. 4, pp. 373–387, 2012.
[3] G. Gorisse, O. Christmann, E. A. Amato, and S. Richir, "First- and third-person perspectives in immersive virtual environments: Presence and performance analysis of embodied users," Frontiers in Robotics and AI, vol. 4, p. 33, 2017.
[4] H. G. Debarba, S. Bovet, R. Salomon, O. Blanke, B. Herbelin, and R. Boulic, "Characterizing first and third person viewpoints and their alternation for embodied interaction in virtual reality," PLoS ONE, vol. 12, no. 12, 2017.
[5] K. Bahirat, T. Annaswamy, and B. Prabhakaran, "Mr. MAPP: Mixed reality for managing phantom pain," in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 1558–1566.
[6] K. Bahirat, Y.-Y. Chung, T. Annaswamy, G. Raval, K. Desai, B. Prabhakaran, and M. Riegler, "Using Mr. MAPP for lower limb phantom pain management," in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1071–1075.
[7] V. S. Ramachandran and D. Rogers-Ramachandran, "Synaesthesia in phantom limbs induced with mirrors," Proceedings of the Royal Society of London. Series B: Biological Sciences, vol. 263, no. 1369, pp. 377–386, Apr. 1996. [Online]. Available: https://doi.org/10.1098/rspb.1996.0058
[8] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "SMPL: A skinned multi-person linear model," ACM Transactions on Graphics (TOG), vol. 34, no. 6, pp. 1–16, 2015.
[9] K. Otte, B. Kayser, S. Mansow-Model, J. Verrel, F. Paul, A. U. Brandt, and T. Schmitz-Hübsch, "Accuracy and reliability of the Kinect version 2 for clinical measurement of motor function," PLoS ONE, vol. 11, no. 11, 2016.
[10] S. Rusinkiewicz and M. Levoy, "Efficient variants of the ICP algorithm," in Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling. IEEE, 2001, pp. 145–152.
[11] R. Mekuria, K. Blom, and P. Cesar, "Design, implementation, and evaluation of a point cloud codec for tele-immersive video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 4, pp. 828–842, 2016.
[12] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in 2011 10th IEEE International Symposium on Mixed and Augmented Reality. IEEE, 2011, pp. 127–136.
[13] S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang, A. Kowdle, Y. Degtyarev, D. Kim, P. L. Davidson, S. Khamis, M. Dou et al., "Holoportation: Virtual 3D teleportation in real-time," in Proceedings of the 29th Annual Symposium on User Interface Software and Technology, 2016, pp. 741–754.
[14] M. Dou, P. Davidson, S. R. Fanello, S. Khamis, A. Kowdle, C. Rhemann, V. Tankovich, and S. Izadi, "Motion2Fusion: Real-time volumetric performance capture," ACM Transactions on Graphics (TOG), vol. 36, no. 6, p. 246, 2017.
[15] R. Du, M. Chuang, W. Chang, H. Hoppe, and A. Varshney, "Montage4D: Interactive seamless fusion of multiview video textures," 2018.
[16] L. Xu, Z. Su, L. Han, T. Yu, Y. Liu, and F. Lu, "UnstructuredFusion: Realtime 4D geometry and texture reconstruction using commercial RGBD cameras," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[17] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, "Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image," in European Conference on Computer Vision. Springer, 2016, pp. 561–578.
[18] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black, "Expressive body capture: 3D hands, face, and body from a single image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10975–10985.
[19] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll, "Video based reconstruction of 3D people models," in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[20] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll, "Detailed human avatars from monocular video," in 2018 International Conference on 3D Vision (3DV). IEEE, 2018, pp. 98–109.
[21] W. Xu, A. Chatterjee, M. Zollhöfer, H. Rhodin, D. Mehta, H.-P. Seidel, and C. Theobalt, "MonoPerfCap: Human performance capture from monocular video," ACM Transactions on Graphics (TOG), vol. 37, no. 2, pp. 1–15, 2018.
[22] M. Habermann, W. Xu, M. Zollhoefer, G. Pons-Moll, and C. Theobalt, "LiveCap: Real-time human performance capture from monocular video," ACM Transactions on Graphics (TOG), vol. 38, no. 2, pp. 1–17, 2019.
[23] R. A. Newcombe, D. Fox, and S. M. Seitz, "DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 343–352.
[24] M. Innmann, M. Zollhöfer, M. Nießner, C. Theobalt, and M. Stamminger, "VolumeDeform: Real-time volumetric non-rigid reconstruction," in European Conference on Computer Vision. Springer, 2016, pp. 362–379.
[25] T. Yu, K. Guo, F. Xu, Y. Dong, Z. Su, J. Zhao, J. Li, Q. Dai, and Y. Liu, "BodyFusion: Real-time capture of human motion and surface geometry using a single depth camera," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 910–919.
[26] T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll, and Y. Liu, "DoubleFusion: Real-time capture of human performances with inner body shapes from a single depth sensor," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7287–7296.
[27] K. Guo, F. Xu, T. Yu, X. Liu, Q. Dai, and Y. Liu, "Real-time geometry, albedo, and motion reconstruction using a single RGB-D camera," ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 1, 2017.
[28] M. Zollhöfer, P. Stotko, A. Görlitz, C. Theobalt, M. Nießner, R. Klein, and A. Kolb, "State of the art on 3D reconstruction with RGB-D cameras," in Computer Graphics Forum, vol. 37, no. 2. Wiley Online Library, 2018, pp. 625–652.
[29] J. Nocedal and S. Wright, Numerical Optimization. Springer Science & Business Media, 2006.
