SelfRecon: Self Reconstruction Your Digital Avatar from Monocular Video

Boyi Jiang1,2      Yang Hong1      Hujun Bao3      Juyong Zhang1*

1University of Science and Technology of China    2Image Derivative Inc    3Zhejiang University

CVPR 2022 (Oral)

We propose SelfRecon, a clothed human body reconstruction method that combines implicit and explicit representations to recover space-time coherent geometries from a monocular video of a self-rotating human. Explicit methods require a predefined template mesh for a given sequence, and such a template is hard to acquire for a specific subject. Moreover, the fixed topology limits both the reconstruction accuracy and the clothing types that can be handled. Implicit methods support arbitrary topology and achieve high quality thanks to their continuous geometric representation. However, it is difficult for them to integrate multi-frame information into a consistent registration sequence for downstream applications. We propose to combine the advantages of both representations. We use a differentiable mask loss on the explicit mesh to obtain a coherent overall shape, while the details on the implicit surface are refined with differentiable neural rendering. Meanwhile, the explicit mesh is updated periodically to accommodate topology changes, and a consistency loss is designed to keep the two representations closely matched. Compared with existing methods, SelfRecon can produce high-fidelity surfaces for arbitrary clothed humans with self-supervised optimization. Extensive experimental results demonstrate its effectiveness on real captured monocular videos.

The pipeline of SelfRecon

We maintain the explicit and implicit geometry representations simultaneously and use forward deformation fields to transform the canonical geometry into the current frame space. For the explicit representation, we mainly use a differentiable mask loss to recover the overall shape. For the implicit representation, a sampled neural rendering loss and predicted normals are used to refine geometric details. Finally, a consistency loss keeps the two geometric representations matched.
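The two shape-level terms in the pipeline can be illustrated with a minimal sketch. This is not the SelfRecon implementation: the function names, the soft-IoU form of the mask loss, the mean-absolute-SDF form of the consistency loss, and the toy sphere used as the implicit surface are all assumptions made here for illustration. It only shows the idea that the explicit mesh is compared against a silhouette, and that its vertices are pulled toward the zero level set of the implicit surface.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Toy implicit surface (assumption): signed distance to an origin-centred sphere."""
    x, y, z = point
    return math.sqrt(x * x + y * y + z * z) - radius


def consistency_loss(vertices, sdf):
    """Mean absolute signed distance at the explicit mesh vertices.

    The value is zero when every vertex lies exactly on the zero level
    set of the implicit surface, i.e. when both representations agree.
    """
    return sum(abs(sdf(v)) for v in vertices) / len(vertices)


def soft_iou_mask_loss(rendered, target):
    """1 - soft IoU between a rendered silhouette and a target mask.

    Both inputs are flat lists of per-pixel values in [0, 1].
    """
    intersection = sum(r * t for r, t in zip(rendered, target))
    union = sum(r + t - r * t for r, t in zip(rendered, target))
    return 1.0 - intersection / union if union > 0 else 0.0


# Hypothetical data: three vertices on the unit sphere, two off it.
on_surface = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, -1.0)]
off_surface = [(2.0, 0.0, 0.0), (0.0, 0.5, 0.0)]

print(consistency_loss(on_surface, sphere_sdf))    # 0.0
print(consistency_loss(off_surface, sphere_sdf))   # 0.75
print(soft_iou_mask_loss([1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]))  # 0.5
```

In an actual optimization both terms would be computed with a differentiable renderer and an automatic differentiation framework so that gradients reach the mesh and the implicit network; plain Python is used here only to keep the sketch self-contained.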

Visualization of forward deformations

[Video panels: Input video | Non-rigid deformation | Whole deformations]

Results on People-Snapshot [1]

[Video panel: Input video]

[1] Thiemo Alldieck, et al. Video based reconstruction of 3D people models. In CVPR, 2018.

Results on Sequences from ZJU-MoCap [2] (4 views) and MonoPerfCap [3]

[2] Sida Peng, et al. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In CVPR, 2021.
[3] Weipeng Xu, et al. MonoPerfCap: Human performance capture from monocular video. ACM Transactions on Graphics (ToG), 37(2):1–15, 2018.

The paper focuses on clothed body reconstruction from self-rotating videos, but the algorithm framework can also be applied to general motion sequences. The results show that plausible reconstructions can be obtained in this setting. We believe that further exploration based on our framework can refine the results on general motions in the future.


     author     = {Boyi Jiang and Yang Hong and Hujun Bao and Juyong Zhang},
     title      = {SelfRecon: Self Reconstruction Your Digital Avatar from Monocular Video},
     booktitle = {{IEEE/CVF} Conference on Computer Vision and Pattern Recognition (CVPR)},
     year       = {2022}