CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent
Radiance Fields from Sparse Inputs
CVPR 2024

  • HKUST
  • Huawei Noah's Ark Lab
Illustration of the learned 3D radiance fields and rendered images of the proposed CVT (Contrastive In-Voxel Transformer)-xRF applied to three baselines trained on sparse inputs of three views. The `xRF` indicates that the proposed CVT module can be plugged into different baselines. The radiance fields of the three baselines exhibit different levels of 3D inconsistency (marked with red boxes), which results in failures or artifacts in the rendered images. With CVT, we obtain radiance fields with better 3D consistency and render images of much higher quality.

Abstract

Neural Radiance Fields (NeRF) have shown impressive capabilities for photorealistic novel view synthesis when trained on dense inputs. However, when trained on sparse inputs, NeRF typically produces incorrect density or color predictions, mainly because insufficient coverage of the scene yields sparse supervision, causing significant performance degradation. While existing works mainly exploit ray-level consistency to construct 2D regularization based on rendered color, depth, or semantics on image planes, in this paper we propose a novel approach that models 3D spatial field consistency to improve NeRF's performance with sparse inputs. Specifically, we first adopt a voxel-based ray sampling strategy to ensure that each sampled ray intersects a certain voxel in 3D space. We then randomly sample additional points within the voxel and apply a Transformer to infer the properties of other points on each ray, which are then incorporated into the volume rendering. By backpropagating through the rendering loss, we enhance the consistency among neighboring points. Additionally, we apply a contrastive loss on the encoder output of the Transformer to further improve consistency within each voxel. Experiments demonstrate that our method yields significant improvements over different radiance fields in the sparse-input setting, and achieves performance comparable to current works.
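As a rough illustration of the voxel-based ray sampling strategy described above (a minimal sketch, not the paper's implementation; function names and the uniform in-voxel sampler are our own assumptions), one can view it as a standard ray-AABB slab test to check that a ray intersects a chosen voxel, plus uniform sampling of auxiliary 3D points inside that voxel:

```python
import numpy as np

def ray_aabb_intersect(origin, direction, vox_min, vox_max):
    """Slab test: does the ray origin + t * direction (t >= 0) hit the
    axis-aligned voxel [vox_min, vox_max]?  Returns (hit, t_near, t_far).
    Assumes no exactly-zero direction components for brevity."""
    inv_d = 1.0 / direction
    t0 = (vox_min - origin) * inv_d
    t1 = (vox_max - origin) * inv_d
    t_near = np.minimum(t0, t1).max()   # latest entry across the three slabs
    t_far = np.maximum(t0, t1).min()    # earliest exit across the three slabs
    hit = bool(t_far >= max(t_near, 0.0))
    return hit, t_near, t_far

def sample_points_in_voxel(vox_min, vox_max, n, rng):
    """Uniformly sample n auxiliary 3D points inside the voxel."""
    return vox_min + rng.random((n, 3)) * (vox_max - vox_min)
```

A sampler built on this would keep only rays for which the slab test succeeds for the selected voxel, and feed the ray points in [t_near, t_far] together with the auxiliary in-voxel points to the Transformer.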

Framework


The proposed CVT (Contrastive In-Voxel Transformer)-xRF for learning radiance fields from sparse inputs consists of three parts: a voxel-based ray sampling strategy, a local implicit constraint module, and a global explicit constraint module. For simplicity, two voxels are shown, along with two rays for each. The local implicit constraint is implemented by a lightweight In-Voxel Transformer, which infers the colors and densities of ray points by letting them interact with surrounding 3D points; these ray points are then inserted among the points from the importance sampler for rendering. The global explicit constraint is realized by a voxel contrastive regularization, which constrains the radiance properties of points within a voxel to be more similar than those of points across voxels.
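To make the global explicit constraint concrete, here is a hedged sketch of an InfoNCE-style voxel contrastive regularizer (our own simplified stand-in, not the paper's exact loss): per-point features are L2-normalized, points sharing a voxel act as positives, and points from other voxels act as negatives.

```python
import numpy as np

def voxel_contrastive_loss(feats, voxel_ids, tau=0.1):
    """InfoNCE-style regularizer over point features.
    feats: (N, D) per-point features; voxel_ids: (N,) voxel assignment.
    For each anchor, other points in the same voxel are positives and
    all non-self points form the softmax denominator."""
    voxel_ids = np.asarray(voxel_ids)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T / tau                       # cosine similarity / temperature
    n = len(voxel_ids)
    same = voxel_ids[:, None] == voxel_ids[None, :]
    np.fill_diagonal(same, False)             # exclude self-pairs
    not_self = ~np.eye(n, dtype=bool)
    losses = []
    for i in range(n):
        if not same[i].any():                 # anchor with no positives
            continue
        log_denom = np.log(np.exp(sim[i][not_self[i]]).sum())
        losses.append((log_denom - sim[i][same[i]]).mean())
    return float(np.mean(losses))
```

By construction, features that cluster by voxel yield a lower loss than features whose voxel assignments are mixed, which is the behavior the regularization encourages.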

Novel View Synthesis

The following videos illustrate the effectiveness of the CVT module on different baselines. In each video, the left side is synthesized by the baseline, while the right side is synthesized by CVT-xRF.

BARF vs. CVT-xRF (w/ BARF)

           BARF                 CVT-xRF (w/ BARF)                 BARF                 CVT-xRF (w/ BARF)
3 input views
6 input views
9 input views

SPARF vs. CVT-xRF (w/ SPARF)

          SPARF                 CVT-xRF (w/ SPARF)                SPARF                 CVT-xRF (w/ SPARF)
3 input views
6 input views
9 input views

Radiance Fields Visualization

We apply the visualization tool from SwitchNeRF to visualize the learned 3D radiance fields. The visualized 3D fields differ from 2D volume-rendered synthesis, since a 3D field directly records the density/color of 3D points. Visualizing the 3D fields helps us analyze artifacts in the 2D images that are caused by incorrect density distributions. The following video illustrates the learned radiance fields of SPARF and CVT-xRF (w/ SPARF).

Quantitative Improvements on Baselines

The following table shows the improvements brought by the CVT module over three baselines, i.e., NeRF, BARF, and SPARF. The results indicate that the CVT module greatly enhances their performance. These improvements can mainly be attributed to the module's ability to model a more accurate density distribution, as illustrated in the radiance fields visualization above, leading to significantly fewer artifacts in the synthesized videos.


Citation

If you find this project helpful, please kindly cite:
@article{zhong2024cvt,
  title={CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs},
  author={Zhong, Yingji and Hong, Lanqing and Li, Zhenguo and Xu, Dan},
  journal={arXiv preprint arXiv:2403.16885},
  year={2024}
}

Acknowledgements

This webpage integrates components from various project websites, including Mip-NeRF, FreeNeRF, and RegNeRF. We sincerely thank the authors for their remarkable work and impressive webpages.