We tackle the critical issues of (a) extrapolation and (b) occlusion in sparse-input 3DGS by leveraging a video diffusion model. Vanilla generation often suffers from inconsistencies within the generated sequences (highlighted by the yellow arrows), leading to black shadows in the rendered images. In contrast, our scene-grounding generation produces consistent sequences, effectively addressing these issues and enhancing overall quality (c), as indicated by the blue boxes. The numbers refer to PSNR values. All visualization results below are rendered from 3DGS models optimized with 6 input views, following the setting of an indoor benchmark [1].
Despite recent successes in novel view synthesis using 3D Gaussian Splatting (3DGS), modeling scenes with sparse inputs remains a challenge. In this work, we address two critical yet overlooked issues in real-world sparse-input modeling: extrapolation and occlusion. To tackle these issues, we propose a reconstruction-by-generation pipeline that leverages learned priors from video diffusion models to provide plausible interpretations for regions that are outside the field of view or occluded. However, the generated sequences exhibit inconsistencies that limit their benefit to subsequent 3DGS modeling. To address this inconsistency, we introduce a novel scene-grounding guidance based on rendered sequences from an optimized 3DGS, which tames the diffusion model into generating consistent sequences. This guidance is training-free and requires no fine-tuning of the diffusion model. To facilitate holistic scene modeling, we also propose a trajectory initialization method that effectively identifies regions outside the field of view or occluded. We further design an optimization scheme tailored to 3DGS training with generated sequences. Experiments demonstrate that our method improves upon the baseline and achieves state-of-the-art performance on challenging benchmarks.
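For intuition, the following is a minimal Python sketch of how such training-free, scene-grounding guidance could be folded into the reverse diffusion loop: at each denoising step, the predicted clean sequence is blended toward the sequence rendered from the optimized 3DGS inside the regions the renderings cover. The API (diffusion.unet, predict_x0, step, the coverage mask, and the guidance weight) is hypothetical and only illustrates the idea, not the paper's exact formulation.

import torch

def scene_grounded_denoise(diffusion, latents, rendered_seq, coverage_mask,
                           timesteps, guidance_weight=0.5):
    """Training-free scene-grounding guidance (illustrative sketch).

    At each denoising step, the predicted clean sequence is nudged toward the
    sequence rendered from the optimized baseline 3DGS, but only inside regions
    the renderings cover (coverage_mask == 1); uncovered regions (outside the
    field of view or occluded) are left to the diffusion prior.
    All diffusion.* calls are hypothetical placeholders.
    """
    rendered_latents = diffusion.encode(rendered_seq)            # anchor derived from the baseline 3DGS
    for t in timesteps:
        noise_pred = diffusion.unet(latents, t)                  # standard denoising prediction
        x0_pred = diffusion.predict_x0(latents, noise_pred, t)   # estimate of the clean latents
        x0_guided = torch.where(                                 # blend toward the rendering where covered
            coverage_mask.bool(),
            (1.0 - guidance_weight) * x0_pred + guidance_weight * rendered_latents,
            x0_pred,
        )
        latents = diffusion.step(x0_guided, latents, t)          # continue the reverse process
    return diffusion.decode(latents)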
Our method consists of three parts: scene-grounding guidance, trajectory initialization, and an optimization scheme for generated sequences. Initially, a baseline 3DGS is trained on the sparse inputs, initialized with the point cloud from DUSt3R [2]. Yellow regions denote uncovered areas, e.g., those outside the field of view or occluded. The trajectory initialization determines the camera paths for sequence generation based on renderings from the baseline 3DGS, facilitating holistic scene modeling. The video diffusion model receives an input image along with a trajectory for sequence generation, incorporating scene-grounding guidance during the denoising process to ensure consistent outputs. The guidance is based on the rendered sequences. Finally, the generated sequences are used to optimize the final 3DGS through a tailored optimization scheme. We use ViewCrafter [3], an open-source pose-controllable video diffusion model, for sequence generation.
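As an illustration of the trajectory initialization step, the sketch below scores candidate camera paths by how much uncovered area (outside the field of view or occluded) they reveal when rendering the baseline 3DGS, and keeps the highest-scoring paths for generation. The render_alpha call, opacity threshold, and scoring rule are assumptions made for illustration rather than the actual implementation.

import numpy as np

def init_trajectories(baseline_gs, candidate_trajectories, top_k=3, opacity_thresh=0.1):
    """Select generation trajectories that expose the most uncovered regions (sketch).

    For each candidate camera path, render the baseline 3DGS and measure how many
    pixels are uncovered, i.e. outside the field of view of the sparse inputs or
    occluded (approximated here by near-zero accumulated opacity). Trajectories
    with the largest average uncovered area are kept for sequence generation.
    render_alpha, the threshold, and the scoring rule are illustrative only.
    """
    scores = []
    for traj in candidate_trajectories:
        uncovered = [float((baseline_gs.render_alpha(cam) < opacity_thresh).mean())
                     for cam in traj]                      # fraction of uncovered pixels per view
        scores.append(sum(uncovered) / len(uncovered))     # average over the path
    order = np.argsort(scores)[::-1]                       # most uncovered first
    return [candidate_trajectories[i] for i in order[:top_k]]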
Comparisons with FSGS [4] and DNGaussian [5], the two leading approaches for sparse-input 3DGS modeling based on monocular depth regularization. All methods are initialized with the DUSt3R point cloud [2].
Comparison with FSGS. Please click the videos and drag the slider for better comparisons.
Comparison with DNGaussian. Please click the videos and drag the slider for better comparisons.
Optimizing 3DGS with generated sequences but without the proposed guidance results in black shadows in the renderings, due to inconsistencies within the sequences produced by the video diffusion model.
Please click the videos and drag the slider for better comparisons.
The baseline 3DGS is optimized from the DUSt3R point cloud [2] initialization and incorporates the Gaussian unpooling from FSGS [4]. It is a strong baseline, as indicated by the performance reported in the main paper.
Please click the videos and drag the slider for better comparisons.
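For reference, below is a minimal sketch of how such a baseline training loop might look, assuming Gaussians already initialized from the DUSt3R point cloud and a hypothetical GaussianModel-style API with an FSGS-style unpool() operation; it is illustrative only, not the actual training code.

import torch.nn.functional as F

def train_baseline_3dgs(gaussians, train_views, num_iters=10000, unpool_every=500):
    """Illustrative baseline training loop (hypothetical `gaussians` API).

    `gaussians` is assumed to be initialized from the DUSt3R point cloud of the
    sparse input views; unpool() mimics FSGS-style Gaussian unpooling to densify
    under-reconstructed regions.
    """
    for it in range(1, num_iters + 1):
        view = train_views[it % len(train_views)]              # cycle over the sparse inputs
        loss = F.l1_loss(gaussians.render(view), view.image)   # photometric loss
        loss.backward()
        gaussians.optimizer_step()
        if it % unpool_every == 0:
            gaussians.unpool()                                 # grow new Gaussians near existing ones
    return gaussians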
[1] Zhong et al. Empowering sparse-input neural radiance fields with dual-level semantic guidance from dense novel views. arXiv:2503.02230, 2025.
[2] Wang et al. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024.
[3] Yu et al. ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv:2409.02048, 2024.
[4] Zhu et al. FSGS: Real-time few-shot view synthesis using Gaussian splatting. In ECCV, 2024.
[5] Li et al. DNGaussian: Optimizing sparse-view 3D Gaussian radiance fields with global-local depth normalization. In CVPR, 2024.

BibTeX
@inproceedings{zhong2025taming,
  title     = {Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs},
  author    = {Zhong, Yingji and Li, Zhihao and Chen, Dave Zhenyu and Hong, Lanqing and Xu, Dan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025}
}