PAD3R
Pose-Aware Dynamic 3D Reconstruction from Casual Videos

University of Maryland, College Park

SIGGRAPH Asia 2025



Teaser: input casual video and the resulting dynamic 3D object reconstruction.



Our method reconstructs dynamic 3D objects from a single casual monocular video,
coupling object deformation with camera motion.
This webpage showcases qualitative results and comparisons of our method on in-the-wild videos and the Artemis dataset. We also present visualizations from our composite demo and ablation study, and highlight representative failure cases. Please refer to our main paper for more details on the results. To explore the content, scroll down or use the navigation buttons below.

Comparisons on In-the-wild Videos


Interactive viewer: select a sequence and a baseline; each result shows the input video, a reference view, and a 360° view.


Comparisons on Artemis

We present qualitative results for four sequences from the Artemis dataset (Panda, Wolf, Cat, and Duck), each with a different amount of view coverage.

Interactive viewer: select a sequence and a baseline; each result shows the input video, a reference view, and a 360° view.


Method

Our method consists of two main stages. In the first stage, we select a frame from the video sequence as the canonical frame (keyframe) and use an image-to-3D model to obtain a static 3D Gaussian representation of the object. We then render these Gaussians from a set of randomly sampled camera poses to fine-tune a lightweight image-to-pose estimator, PoseNet, built on a DINO-v2 backbone. In the second stage, we use PoseNet to initialize the camera pose of every input video frame and optimize a deformable 3D object model.
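To make the first stage concrete, the sketch below shows one way such a pose estimator could look: a DINO-v2 ViT-S/14 backbone (loaded from torch.hub) followed by a small MLP head that regresses a camera rotation and translation. The 6D rotation parameterization, head sizes, and commented training loop are illustrative assumptions, not the exact architecture or training procedure from the paper.

# Minimal PoseNet sketch (illustrative; the exact head, losses, and training
# details in the paper may differ). A DINO-v2 backbone encodes a rendered image
# and a small MLP regresses an object-centric camera pose.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rot6d_to_matrix(x: torch.Tensor) -> torch.Tensor:
    # Map a 6D rotation representation (Zhou et al. 2019) to a 3x3 rotation matrix.
    a1, a2 = x[:, :3], x[:, 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)          # (B, 3, 3)


class PoseNet(nn.Module):
    def __init__(self, embed_dim: int = 384):
        super().__init__()
        # DINO-v2 ViT-S/14 from torch.hub; forward() returns CLS-token features.
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 9),                         # 6D rotation + 3D translation
        )

    def forward(self, img: torch.Tensor):
        # img: (B, 3, H, W), ImageNet-normalized, H and W multiples of 14.
        feat = self.backbone(img)                      # (B, embed_dim)
        out = self.head(feat)                          # (B, 9)
        return rot6d_to_matrix(out[:, :6]), out[:, 6:]


# Fine-tuning sketch: render the static 3D Gaussians from randomly sampled poses
# (any Gaussian-splatting renderer can supply `rgb`, `R_gt`, `t_gt`) and supervise
# the predicted pose against the sampled one.
# for rgb, R_gt, t_gt in sampled_render_batches:      # hypothetical data loader
#     R_pred, t_pred = posenet(rgb)
#     loss = F.mse_loss(R_pred, R_gt) + F.mse_loss(t_pred, t_gt)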

Robustness Across View Coverage

To further assess PAD3R’s capacity to capture camera movements, we analyze how input view coverage affects reconstruction quality. We evaluate on five input sequences, each providing a different extent of viewpoint variation: 0° (single view), 40°, 90°, 140°, and 180°.
PAD3R maintains consistently high reconstruction quality across varying view coverage angles. In contrast, due to its static camera assumption, DreamMesh4D exhibits a steady decline in performance as the range of viewpoints expands. Conversely, BANMo shows improved results with broader view coverage, but performs poorly under single-view or narrow-view settings.

Composite Demo

Our model estimates object-centric camera poses, i.e., camera poses expressed relative to the object. By combining these with scene-centric camera poses from off-the-shelf estimators and a simple background reconstruction, we can re-project the dynamic 3D object back into the full 3D scene.
This enables rendering with large-scale camera trajectories. Below, we showcase a demo where the camera smoothly navigates through the reconstructed 3D world.
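As a minimal illustration of this composition, the sketch below assumes both pose estimates are given as 4x4 extrinsic matrices (object-to-camera from PAD3R, scene-to-camera from an off-the-shelf SfM/SLAM method); the actual conventions in our pipeline may differ.

import numpy as np


def object_to_scene(E_obj_cam: np.ndarray, E_scene_cam: np.ndarray) -> np.ndarray:
    # E_obj_cam:   4x4 extrinsics from PAD3R (object coords -> camera coords).
    # E_scene_cam: 4x4 extrinsics from an off-the-shelf SfM/SLAM method
    #              (scene coords -> camera coords), for the same frame.
    # Both describe the same physical camera, so
    #     x_cam = E_obj_cam @ x_obj = E_scene_cam @ x_scene
    # and therefore x_scene = inv(E_scene_cam) @ E_obj_cam @ x_obj.
    return np.linalg.inv(E_scene_cam) @ E_obj_cam


# Example: place the object's Gaussian centers (homogeneous, Nx4) into the scene.
# pts_scene = (object_to_scene(E_obj_cam, E_scene_cam) @ pts_obj.T).T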


Ablation Study

We present qualitative ablation results on three Artemis sequences, gradually introducing key components of our method. PoseNet initialization improves camera pose estimation and leads to better novel view consistency. Multi-block tracking supervision helps capture fine-grained motion, particularly around articulated limbs. Incorporating bi-directional multi-block tracking (full model) further improves reconstruction quality, producing more consistent object dynamics and camera motion.

Interactive viewer: select a sequence; columns show the input and the reconstructions from the ablation variants (cam, cam+P_init, cam+P_init+track, and the full method), each with a reference view and a 360° view.

Limitations

Inaccurate initial geometry leads to pose errors and degraded reconstruction.


Failure-case viewer: initial static model, input video, reference view, and 360° view.






References
[1] Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao and Yao Yao. STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians. In ECCV, 2024.
[2] Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim and Huan Ling. L4GM: Large 4D Gaussian Reconstruction Model. In NeurIPS, 2024.
[3] Zhiqi Li, Yiming Chen and Peidong Liu. DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation. In NeurIPS, 2024.
[4] Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi and Hanbyul Joo. BANMo: Building Animatable 3D Neural Models from Many Casual Videos. In CVPR, 2022.


BibTeX

 
      @article{pad3r,
        author    = {Liao, Ting-Hsuan and Liu, Haowen and Xu, Yiran and Ge, Songwei and Yang, Gengshan and Huang, Jia-Bin},
        title     = {PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos},
        journal   = {SIGGRAPH ASIA},
        year      = {2025},
      }