Abstract
Recently, Neural Radiance Fields (NeRF) has shown promising performance on reconstructing 3D scenes and synthesizing novel views from a sparse set of 2D images. Albeit effective, the performance of NeRF is highly influenced by the quality of training samples. With limited posed images from the scene, NeRF fails to generalize well to novel views and may collapse to trivial solutions in unobserved regions. This makes NeRF impractical under resource-constrained scenarios. In this paper, we present a novel learning framework, ActiveNeRF, aiming to model a 3D scene with a constrained input budget. Specifically, we first incorporate uncertainty estimation into a NeRF model, which ensures robustness under few observations and provides an interpretation of how NeRF understands the scene. On this basis, we propose to supplement the existing training set with newly captured samples based on an active learning scheme. By evaluating the reduction of uncertainty given new inputs, we select the samples that bring the most information gain. In this way, the quality of novel view synthesis can be improved with minimal additional resources. Extensive experiments validate the performance of our model on both realistic and synthetic scenes, especially when training data is scarce. Code will be released at https://github.com/LeapLabTHU/ActiveNeRF.
1 Introduction
Despite its success in synthesizing high-quality images, the learning scheme of a NeRF model places high demands on the training data. First, NeRF usually requires a large number of posed images and has been shown to generalize poorly with limited inputs [36]. Second, training a well-generalized NeRF requires observations that cover the whole scene. As illustrated in Figure 2, if we remove the observations of a particular part of the scene, NeRF fails to model that region and tends to collapse (e.g., predicting zero density everywhere in the scene) instead of making reasonable predictions. This poses challenges in real-world applications such as robot localization and mapping, where capturing training data can be costly and perception of the entire scene is required [23,11,31].
3 Background
4 NeRF with Uncertainty Estimation
Instead of treating the emitted radiance of a location as a deterministic value, we assume that the radiance of location $r(t)$ follows a Gaussian distribution parameterized by mean $\bar{c}(r(t))$ and variance $\bar{\beta}^2(r(t))$, where a hyper-parameter $\beta_0^2$ imposes a minimal variance on the prediction. Formally, we model the radiance at location $r(t)$ as

$$c(r(t)) \sim \mathcal{N}\big(\bar{c}(r(t)),\ \bar{\beta}^2(r(t))\big).$$

Since the sampled locations along a ray are processed independently, the rendered color of the ray also follows a Gaussian distribution, whose mean and variance are composited through the volume rendering weights:

$$\bar{C}(r) = \sum_{i=1}^{N_s} \alpha_i\, \bar{c}(r(t_i)), \qquad \bar{\beta}^2(r) = \sum_{i=1}^{N_s} \alpha_i^2\, \bar{\beta}^2(r(t_i)),$$

where the $\alpha_i$s are the same as in Eq. (4), and $\bar{C}(r)$, $\bar{\beta}^2(r)$ denote the mean and variance of the rendered color through the sampled ray $r$. The model can then be trained by minimizing the negative log-likelihood of the observed colors over all rays $\{r_i\}_{i=1}^{N}$ from a batch $B$:

$$\mathcal{L} = \sum_{i=1}^{N} \left[ \frac{\big\|\bar{C}(r_i) - C(r_i)\big\|_2^2}{2\bar{\beta}^2(r_i)} + \frac{\log \bar{\beta}^2(r_i)}{2} \right].$$
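As a concrete illustration of this rendering scheme and training objective, a minimal sketch is given below. It composites the per-location means and variances into the rendered color distribution and evaluates the negative log-likelihood loss. The tensor layout, the softplus parameterization of the minimal variance, and the function names are our own assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def render_with_uncertainty(sigma, c_mean, c_var_raw, deltas, beta0_sq=0.01):
    """Composite per-location radiance Gaussians along each ray.

    sigma:     (R, Ns)    predicted densities
    c_mean:    (R, Ns, 3) predicted radiance means  c_bar(r(t_i))
    c_var_raw: (R, Ns)    raw output of the variance branch
    deltas:    (R, Ns)    distances between adjacent samples
    """
    # Enforce the minimal variance beta_0^2 (softplus is our assumption).
    var = beta0_sq + F.softplus(c_var_raw)

    # Standard NeRF weights: alpha_i = T_i * (1 - exp(-sigma_i * delta_i)).
    a = 1.0 - torch.exp(-sigma * deltas)
    T = torch.cumprod(
        torch.cat([torch.ones_like(a[:, :1]), 1.0 - a[:, :-1]], dim=-1), dim=-1)
    alpha = T * a                                        # (R, Ns)

    # Independent Gaussians: means add with alpha_i, variances with alpha_i^2.
    C_mean = (alpha.unsqueeze(-1) * c_mean).sum(dim=1)   # (R, 3)
    C_var = (alpha ** 2 * var).sum(dim=1)                # (R,)
    return C_mean, C_var, alpha, var

def nll_loss(C_mean, C_var, C_gt):
    """Negative log-likelihood of observed colors under the rendered Gaussians."""
    sq_err = ((C_mean - C_gt) ** 2).sum(dim=-1)
    return (sq_err / (2.0 * C_var) + 0.5 * torch.log(C_var)).mean()
```

Note that the variance is composited with $\alpha_i^2$ rather than $\alpha_i$: uncertain locations that contribute little to the ray also contribute little to the rendered variance.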
5 ActiveNeRF
5.1 Prior and Posterior Distribution
Let $D_1$ denote the existing training set and $F_{\theta}$ denote the trained NeRF model given $D_1$. For simplicity, we first consider the influence of a single ray $r_2$ from the new input $D_2$. Thus, for the $k$-th sampled location $r_2(t_k)$ on the ray, the trained model provides a prior distribution of its radiance, $c(r_2(t_k)) \sim \mathcal{N}\big(\bar{c}(r_2(t_k)),\ \bar{\beta}^2(r_2(t_k))\big)$. Since the other sampled locations on the ray $r_2$ are independent of $r_2(t_k)$, we can represent the unrelated part in the mean of the rendered color as a constant $b(t_k) = \sum_{i \neq k} \alpha_i\, \bar{c}(r_2(t_i))$, so that $\bar{C}(r_2) = \alpha_k\, \bar{c}(r_2(t_k)) + b(t_k)$. Observing the ground-truth color $C(r_2)$ of the new ray then induces a posterior distribution over the radiance at $r_2(t_k)$.
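Because the rendered color is linear in the unknown radiance, the posterior follows from a standard Gaussian conjugate (Kalman-style) update, with the remaining locations contributing observation noise of variance $\bar{\beta}^2(r_2) - \alpha_k^2\, \bar{\beta}^2(r_2(t_k))$. The sketch below is our reading of this one-dimensional update, not code from the released repository; the paper's own derivation is in its appendix.

```python
def posterior_update(mu_k, var_k, alpha_k, b_k, ray_var, C_obs):
    """Conjugate Gaussian update for the radiance at one location r2(t_k).

    mu_k, var_k : prior mean and variance of c(r2(t_k)) under F_theta
    alpha_k     : compositing weight of the location on ray r2
    b_k         : constant contribution of the other locations to the mean
    ray_var     : rendered variance of the whole ray, beta_bar^2(r2)
    C_obs       : observed ground-truth color C(r2)
    """
    # Kalman gain for the linear observation C = alpha_k * c + b_k + noise,
    # where the noise variance is ray_var - alpha_k**2 * var_k, so the
    # gain denominator alpha_k**2 * var_k + noise_var equals ray_var.
    gain = alpha_k * var_k / ray_var
    mu_post = mu_k + gain * (C_obs - (alpha_k * mu_k + b_k))
    var_post = var_k - alpha_k ** 2 * var_k ** 2 / ray_var
    return mu_post, var_post
```

Since `var_post = var_k - alpha_k**2 * var_k**2 / ray_var`, the update can only shrink the variance; this is the property the acquisition function in Sec. 5.2 builds on.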
5.2 Acquisition Function
While the mean of this posterior depends on the ground-truth color $C(r_2)$, the variance is independent of the ground-truth value and therefore can be precisely computed based on the current model $F_{\theta}$. Additionally, it is worth noting that the variance of the posterior distribution of a newly observed location $r_2(t_k)$ is never larger than its prior variance: adding the observation of $r_2(t_k)$ from the new ray $r_2$ can only reduce the uncertainty of the model. We therefore evaluate a candidate input by the total reduction from prior to posterior variance that it brings. Given a candidate view with image size $H, W$, we can sample $N = H \times W$ independent rays, with $N_s$ sampled locations on each ray, and accumulate the variance reduction over all sampled locations. The derivation extends to the case where a spatial location is observed by multiple new views, where $r_i$ denotes rays from different images and $x = r_i(t_{k_i}),\ \forall i$. Please refer to Appendix B for details.
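A minimal sketch of how such a score could be assembled for one candidate view is shown below, reusing the per-location variance reduction from the update above. Shapes and names are our own assumptions; Appendix B of the paper gives the exact formulation.

```python
import torch

def acquisition_score(alpha, var, ray_var):
    """Sum of prior-to-posterior variance reductions over one candidate view.

    alpha:   (N, Ns) compositing weights of all sampled locations
    var:     (N, Ns) prior variances beta_bar^2(r_i(t_k))
    ray_var: (N,)    rendered variances beta_bar^2(r_i) of the N rays
    """
    # Per-location reduction alpha_k^2 * var_k^2 / ray_var, summed over the view.
    reduction = alpha ** 2 * var ** 2 / ray_var.unsqueeze(-1)  # (N, Ns)
    return reduction.sum()
```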
In practice, we first sample candidate views from a spherical space and choose the top-$k$ candidates that score highest under the acquisition function to supplement the current training set. In this way, the newly captured inputs bring the most information gain and improve the current model with the highest efficiency.
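A possible selection loop, reusing `render_with_uncertainty` and `acquisition_score` from the sketches above, might look as follows; `camera_rays` and `query_fn` are hypothetical stand-ins for the implementation's ray generation and network query.

```python
import torch

def select_views(query_fn, camera_rays, candidate_poses, k=4):
    """Score every candidate pose with the acquisition function; keep the top-k."""
    scores = []
    for pose in candidate_poses:
        rays = camera_rays(pose)                   # rays through the candidate view
        sigma, c_mean, c_var_raw, deltas = query_fn(rays)
        _, ray_var, alpha, var = render_with_uncertainty(
            sigma, c_mean, c_var_raw, deltas)
        scores.append(acquisition_score(alpha, var, ray_var))
    topk = torch.topk(torch.stack(scores), k=min(k, len(scores))).indices
    return [candidate_poses[int(i)] for i in topk]
```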
Besides, a quality-efficiency trade-off can be achieved by evaluating new inputs at a lower resolution. For example, instead of casting new rays through all $H \times W$ pixels of a candidate image, we can sample $H/r \times W/r$ rays to approximate the influence of the whole image at only $1/r^2$ of the evaluation cost.
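Assuming the rays of a candidate view are laid out on the pixel grid, this subsampling is a simple strided slice; `r` here is the downsampling factor from the text.

```python
def subsample_rays(rays, H, W, r):
    """Keep every r-th pixel in both directions: H/r x W/r rays, ~1/r^2 the cost."""
    return rays.view(H, W, -1)[::r, ::r].reshape(-1, rays.shape[-1])
```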
5.3 Optimization and Inference