PAD3R
Pose-Aware Dynamic 3D Reconstruction from Casual Videos

University of Maryland, College Park

SIGGRAPH Asia 2025



Teaser: input casual video and the resulting dynamic 3D object reconstruction.



Our method reconstructs dynamic 3D objects from a single casual monocular video,
coupling object deformation with camera motion.
This webpage showcases qualitative results and comparisons of our method on in-the-wild videos and the Artemis dataset. We also present visualizations from our composite demo and ablation study, and highlight representative failure cases. Please refer to our main paper for more details on the results. To explore the content, scroll down or use the navigation buttons below.

Comparisons on In-the-wild Videos


Interactive viewer: select a sequence and a baseline; each result shows the input video, a reference view, and a 360° view.


Comparisons on Artemis

We present qualitative results for four sequences from the Artemis dataset (Panda, Wolf, Cat, and Duck), each with a different amount of view coverage.

Interactive viewer: select a sequence and a baseline; each result shows the input video, a reference view, and a 360° view.


Method

Our method consists of two main stages. In the first stage, we select a frame from the video sequence as the canonical frame (keyframe) and use an image-to-3D model to obtain a static 3D Gaussian representation of the object. We then render these Gaussians from a set of randomly sampled camera poses to fine-tune a lightweight image-to-pose estimator, PoseNet, built on a DINO-v2 backbone. In the second stage, we use PoseNet to initialize the camera pose of every input video frame and optimize a deformable 3D object model.
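To make the first stage concrete, the sketch below shows one way such a pose estimator could look: a DINO-v2 ViT-S/14 backbone (loaded from torch.hub) followed by a small MLP head that regresses a camera rotation and translation. The 6D rotation parameterization, head sizes, and commented training loop are illustrative assumptions, not the exact architecture or training procedure from the paper.

# Minimal PoseNet sketch (illustrative; the exact head, losses, and training
# details in the paper may differ). A DINO-v2 backbone encodes a rendered image
# and a small MLP regresses an object-centric camera pose.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rot6d_to_matrix(x: torch.Tensor) -> torch.Tensor:
    # Map a 6D rotation representation (Zhou et al. 2019) to a 3x3 rotation matrix.
    a1, a2 = x[:, :3], x[:, 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)          # (B, 3, 3)


class PoseNet(nn.Module):
    def __init__(self, embed_dim: int = 384):
        super().__init__()
        # DINO-v2 ViT-S/14 from torch.hub; forward() returns CLS-token features.
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 9),                         # 6D rotation + 3D translation
        )

    def forward(self, img: torch.Tensor):
        # img: (B, 3, H, W), ImageNet-normalized, H and W multiples of 14.
        feat = self.backbone(img)                      # (B, embed_dim)
        out = self.head(feat)                          # (B, 9)
        return rot6d_to_matrix(out[:, :6]), out[:, 6:]


# Fine-tuning sketch: render the static 3D Gaussians from randomly sampled poses
# (any Gaussian-splatting renderer can supply `rgb`, `R_gt`, `t_gt`) and supervise
# the predicted pose against the sampled one.
# for rgb, R_gt, t_gt in sampled_render_batches:      # hypothetical data loader
#     R_pred, t_pred = posenet(rgb)
#     loss = F.mse_loss(R_pred, R_gt) + F.mse_loss(t_pred, t_gt)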

Robustness Across View Coverage

To further assess PAD3R’s capacity to capture camera movements, we analyze how input view coverage affects reconstruction quality. We evaluate on five input sequences, each providing a different extent of viewpoint variation: 0° (single view), 40°, 90°, 140°, and 180°.
PAD3R maintains consistently high reconstruction quality across varying view coverage angles. In contrast, due to its static camera assumption, DreamMesh4D exhibits a steady decline in performance as the range of viewpoints expands. Conversely, BANMo shows improved results with broader view coverage, but performs poorly under single-view or narrow-view settings.

Composite Demo

Our model estimates object-centric camera poses, i.e., camera poses expressed relative to the object. By combining these with scene-centric camera poses from off-the-shelf estimators and a simple background reconstruction, we can re-project the dynamic 3D object back into the full 3D scene.
This enables rendering with large-scale camera trajectories. Below, we showcase a demo where the camera smoothly navigates through the reconstructed 3D world.
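As a minimal illustration of this composition, the sketch below assumes both pose estimates are given as 4x4 extrinsic matrices (object-to-camera from PAD3R, scene-to-camera from an off-the-shelf SfM/SLAM method); the actual conventions in our pipeline may differ.

import numpy as np


def object_to_scene(E_obj_cam: np.ndarray, E_scene_cam: np.ndarray) -> np.ndarray:
    # E_obj_cam:   4x4 extrinsics from PAD3R (object coords -> camera coords).
    # E_scene_cam: 4x4 extrinsics from an off-the-shelf SfM/SLAM method
    #              (scene coords -> camera coords), for the same frame.
    # Both describe the same physical camera, so
    #     x_cam = E_obj_cam @ x_obj = E_scene_cam @ x_scene
    # and therefore x_scene = inv(E_scene_cam) @ E_obj_cam @ x_obj.
    return np.linalg.inv(E_scene_cam) @ E_obj_cam


# Example: place the object's Gaussian centers (homogeneous, Nx4) into the scene.
# pts_scene = (object_to_scene(E_obj_cam, E_scene_cam) @ pts_obj.T).T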


Ablation Study

We present qualitative ablation results on three Artemis sequences, gradually introducing key components of our method. PoseNet initialization improves camera pose estimation and leads to better novel view consistency. Multi-block tracking supervision helps capture fine-grained motion, particularly around articulated limbs. Incorporating bi-directional multi-block tracking (full model) further improves reconstruction quality, producing more consistent object dynamics and camera motion.

Interactive viewer: select a sequence; columns show the input and the reconstructions from the ablation variants (cam, cam+P_init, cam+P_init+track, and the full method), each with a reference view and a 360° view.

Limitations

Inaccurate initial geometry leads to pose errors and degraded reconstruction.


Failure-case viewer: initial static model, input video, reference view, and 360° view.






References
[1] Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao and Yao Yao. STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians. In ECCV, 2024.
[2] Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim and Huan Ling. L4GM: Large 4D Gaussian Reconstruction Model. In NeurIPS, 2024.
[3] Zhiqi Li, Yiming Chen and Peidong Liu. DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation. In NeurIPS, 2024.
[4] Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi and Hanbyul Joo. BANMo: Building Animatable 3D Neural Models from Many Casual Videos. In CVPR, 2022.


BibTeX

 
      @article{pad3r,
        author    = {Liao, Ting-Hsuan and Liu, Haowen and Xu, Yiran and Ge, Songwei and Yang, Gengshan and Huang, Jia-Bin},
        title     = {PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos},
        journal   = {SIGGRAPH ASIA},
        year      = {2025},
      }