Master's student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
3D OBJECT RECONSTRUCTION FROM SINGLE RGB IMAGE: DEPTH-AWARE NEURAL RADIANCE FIELDS
ABSTRACT
Reconstructing accurate 3D structures from a single RGB image remains a significant challenge in computer vision. While Neural Radiance Fields (NeRF) have shown promise in multi-view scenarios, their performance in single-view settings is limited. In this work, we propose an approach that integrates monocular depth priors, obtained from a pre-trained state-of-the-art model, into a NeRF-based framework to enhance single-view 3D object reconstruction. By incorporating these depth priors, our method guides the radiance field optimization process, leading to more accurate geometry and improved rendering quality. We evaluate our approach on the ShapeNet dataset, demonstrating that the integration of depth information improves reconstruction fidelity compared to baseline NeRF models. Our results highlight the potential of combining monocular depth estimation with neural rendering techniques to advance single-image 3D reconstruction.
Keywords: Neural Radiance Fields, Monocular Depth Estimation, Single-View 3D Reconstruction, Volume Rendering, Depth Priors, Novel View Synthesis.
Introduction
Image-based 3D object reconstruction aims to derive accurate three-dimensional representations of objects and scenes from single or multiple two-dimensional RGB images. Effective 3D representation is crucial for numerous applications, such as robotic navigation in complex environments, autonomous driving systems, augmented and virtual reality, and digital content creation. Formally, the reconstruction problem is defined as follows: given a set of RGB images $\mathcal{I} = \{I_1, \dots, I_n\}$, where typically $n = 1$ in single-view scenarios, our goal is to learn a model $f_\theta$ capable of accurately predicting a 3D representation $\hat{S}$ that closely approximates the true, unknown shape $S$. Mathematically, this is achieved by minimizing a reconstruction objective function:

$$\theta^{*} = \arg\min_{\theta} \; \mathcal{L}\big(f_\theta(\mathcal{I}),\, S\big),$$

where $\theta$ denotes the parameters of the model $f_\theta$, and $\mathcal{L}$ represents a distance metric quantifying the similarity between the predicted shape $\hat{S} = f_\theta(\mathcal{I})$ and the ground-truth shape $S$. The function $\mathcal{L}$ is commonly referred to as the loss function.
Recently, Neural Radiance Fields (NeRF) [1] have demonstrated substantial success in synthesizing novel views and capturing complex representations of objects and scenes from multi-view images using implicit volumetric representations. However, their effectiveness significantly deteriorates in single-view reconstruction tasks due to the inherent ambiguity of inferring complete 3D shape from a single 2D observation without additional priors or constraints. To address these limitations, we propose a novel depth-aware approach that integrates monocular depth priors extracted from the pre-trained DepthAnythingV2-Base model into a NeRF-based single-view reconstruction pipeline. Our primary goal is to explicitly incorporate reliable depth cues into the implicit volumetric representation, effectively mitigating depth ambiguity and substantially improving the accuracy and realism of reconstructed 3D objects. We rigorously evaluate our methodology on the ShapeNet dataset [2], benchmarking against strong single-view reconstruction methods. Our experiments demonstrate consistent improvements in standard metrics, including Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), confirming the effectiveness of integrating explicit depth information into neural radiance-based methods.
Materials and methods.
An essential consideration in designing machine learning models for 3D reconstruction is the choice of data representation. In 3D machine learning, there is no universally accepted representation that balances compactness, computational efficiency, and ease of acquisition from real-world data. Existing methods can be categorized on the basis of their output representations: voxel-based, point-based, mesh-based, and implicit representations. Early works employed volumetric pixels (voxels) [3; 4] and point cloud representations [5]. Voxels and point clouds offer straightforward and uniform structures, as they do not require representing multiple primitives or intricate connectivity patterns. Representing objects as polygonal meshes is more practical for downstream use, but requires complex loss functions and sophisticated neural network architectures [6]. Alternatively, functional (implicit) representations [7] can reduce the memory footprint during training. The rendering of scenes and objects is another critical component. Recent advances utilize differentiable 3D-aware architectures that can be refined from 2D multi-view images via neural rendering techniques [8; 9]. Most existing methods adopt encoder-decoder architectures [10; 6], where the encoder learns a low-dimensional representation of the 3D object and the decoder reconstructs the shape. Incorporating diffusion models into the decoder can enhance reconstruction quality [11]. These architectures perform well, and further improvements can be achieved through modifications to hidden layers and additional loss functions. Several recent works achieve state-of-the-art results using NeRF and 3D Gaussian Splatting (3DGS) methods [11; 12].
Neural Radiance Fields and Single-View Reconstruction. Neural Radiance Fields (NeRF) [1] have emerged as a powerful method for novel view synthesis and 3D reconstruction, representing scenes as continuous volumetric functions. Traditional NeRF requires multiple calibrated views and significant computational resources for per-scene optimization, limiting its applicability in single-view scenarios. To address these limitations, pixelNeRF [13] introduces a framework that predicts a continuous neural scene representation conditioned on one or few input images. By conditioning NeRF on image inputs in a fully convolutional manner, pixelNeRF enables the network to learn scene priors across multiple scenes, facilitating novel view synthesis in a feed-forward manner from sparse views. This approach allows training directly from images without explicit 3D supervision and demonstrates superior performance on ShapeNet benchmarks for single-image novel view synthesis. Building upon this, VisionNeRF [14] leverages both global and local features to form an expressive 3D representation. Global features are extracted using a vision transformer, while local features are obtained from a 2D convolutional network. A multilayer perceptron (MLP) conditioned on the learned 3D representation performs volume rendering for novel view synthesis. This method enables rendering novel views from a single input image and generalizes across multiple object categories using a single model, achieving state-of-the-art performance with richer detail rendering. Several recent studies have explored various strategies to incorporate depth information into NeRF training, aiming to enhance reconstruction quality, accelerate convergence, and improve generalization from sparse or single-view inputs. DS-NeRF [15] introduces a depth supervision loss that leverages sparse 3D points obtained from structure-from-motion (SfM) pipelines. By encouraging the rendered depth along rays to align with these sparse depth points, DS-NeRF significantly improves geometry learning, enabling accurate novel view synthesis with fewer input images and achieving 2–6× faster training compared to the original NeRF. DINER [16] integrates depth predictions into both the feature fusion and sampling processes. By conditioning the radiance field on the deviation between sample locations and estimated depths, DINER enhances sampling efficiency and reconstruction quality, particularly in scenarios with large viewpoint disparities. This approach allows for more complete scene capture without additional hardware requirements. Another work [17] addresses the challenge of novel view synthesis from a single image. The method employs a depth teacher network to generate dense pseudo-depth maps, which supervise a joint rendering mechanism combining coarse planar and fine volumetric rendering. This strategy improves geometry consistency and rendering quality. These approaches demonstrate the efficacy of incorporating depth information (whether from sparse SfM points, predicted depth maps, or pseudo-depth supervision) in enhancing NeRF performance, especially under limited input conditions.
Monocular Depth Estimation. Monocular depth estimation focuses on predicting depth information from a single RGB image, a task that is inherently ill-posed due to the absence of stereoscopic cues. Recent advances in deep learning have led to significant progress in this area, with several models achieving remarkable accuracy and generalization capabilities. MiDaS [18] introduced a robust approach by training on a diverse mixture of datasets, enabling zero-shot cross-dataset transfer. The model employs a multi-objective optimization framework to handle varying depth ranges and scales across datasets, resulting in improved generalization to unseen data. ZoeDepth [19] introduces an approach built on top of MiDaS that combines relative and metric depth estimation to achieve zero-shot generalization. The model is pre-trained on twelve datasets using relative depth and fine-tuned on two datasets with metric depth. A key component is the metric bins module, which adjusts depth predictions to maintain metric scale. ZoeDepth achieves state-of-the-art performance on multiple benchmarks, demonstrating significant improvements in relative absolute error (REL) and robust generalization to unseen datasets. Depth Anything V2 [20] presents a scalable solution for monocular depth estimation by leveraging large-scale synthetic data and pseudo-labeled real images. It offers models of varying scales, ranging from 25M to 1.3B parameters, catering to different application needs. Compared to diffusion-based models, it achieves over 10× faster inference and superior accuracy, making it suitable for real-time applications. Integrating depth priors from models such as ZoeDepth and Depth Anything V2 into 3D reconstruction pipelines can enhance the accuracy and robustness of single-view reconstructions, particularly in scenarios with limited or no depth information.
Challenges in Single-View 3D Reconstruction. Despite these advancements, single-view 3D reconstruction remains challenging due to the inherent ambiguity of inferring a complete 3D shape from a single 2D observation without additional priors or constraints. While methods such as pixelNeRF and VisionNeRF have made significant progress, they still face difficulties in accurately reconstructing fine details and complex geometries. Our work aims to address these challenges by integrating monocular depth priors from pre-trained models into the NeRF framework, enhancing reconstruction quality from single-view images.
The methodology pipeline is shown in Figure 1. For our experiments, we utilize the ShapeNet dataset [2], a comprehensive repository of richly annotated 3D CAD models spanning over 3,000 object categories. Specifically, we focus on the ShapeNet Single-Category (SRN [21]) subset, which provides rendered images of objects from consistent viewpoints along with corresponding camera parameters. Following the protocol established in the pixelNeRF framework [13], we select the "car" and "chair" categories from the SRN subset to train category-specific models. Each object in these categories is rendered at a resolution of 128 × 128 pixels. During training, each object instance is rendered from 50 viewpoints uniformly distributed on the upper hemisphere, simulating diverse perspectives under simple lighting conditions. For testing, objects are rendered from 251 viewpoints arranged along an Archimedean spiral, maintaining the same illumination as in training. In the evaluation phase, the 64th view is designated as the input, while the remaining 250 views serve as target views for assessing novel view synthesis performance. To enhance the learning process, we precompute depth maps for each rendered image using a pretrained monocular depth estimation model. These depth maps serve as additional conditioning signals, providing valuable geometric cues that guide the network towards more accurate 3D reconstructions from single-view inputs.
Figure 1. Overview of our methodology
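As an illustration of the depth pre-computation step described above, depth maps can be produced with the Hugging Face depth-estimation pipeline. The checkpoint name and output layout below are assumptions about one possible setup, not the exact script used for our experiments.

```python
from pathlib import Path

import numpy as np
from PIL import Image
from transformers import pipeline

# Minimal sketch of depth pre-computation, assuming the Hugging Face
# "depth-estimation" pipeline and a Depth-Anything-V2-Base checkpoint id.
depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Base-hf",  # assumed checkpoint id
)

def precompute_depth(image_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(image_dir).glob("*.png")):
        image = Image.open(img_path).convert("RGB")        # 128x128 SRN rendering
        result = depth_estimator(image)
        # "depth" is a relative-depth map resized to the input resolution.
        depth = np.asarray(result["depth"], dtype=np.float32)
        depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # normalize to [0, 1]
        np.save(out / f"{img_path.stem}_depth.npy", depth)

# Example (hypothetical paths): precompute_depth("srn_cars/renderings", "srn_cars/depth")
```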
To quantitatively assess the quality of our 3D reconstructions and novel view synthesis, we employ two widely recognized full-reference image quality metrics: Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). These metrics compare the rendered images from our reconstructed 3D models against the ground truth images.
Peak Signal-to-Noise Ratio (PSNR). PSNR measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. It is expressed in decibels (dB) and is commonly used to evaluate the quality of reconstruction in image processing tasks.
Given two images $I$ (ground truth) and $\hat{I}$ (reconstructed), each of size $m \times n$, the Mean Squared Error (MSE) is computed as:

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \big( I(i, j) - \hat{I}(i, j) \big)^2 .$$

The PSNR is then defined as:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left( \frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}} \right),$$

where $\mathrm{MAX}_I$ is the maximum possible pixel value of the image (e.g., 255 for 8-bit images). Higher PSNR values indicate better reconstruction quality.
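A direct NumPy implementation of this definition, assuming 8-bit images, is the following short sketch:

```python
import numpy as np

def psnr(gt: np.ndarray, pred: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between a ground-truth and a reconstructed image of the same shape."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```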
Structural Similarity Index Measure (SSIM). SSIM is a perceptual metric that quantifies image quality degradation caused by processing such as data compression or transmission losses. Unlike PSNR, which considers absolute errors, SSIM assesses the structural similarity between images, incorporating luminance, contrast, and structural comparisons.
The SSIM between two image patches $x$ and $y$ is calculated as:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

where:
- $\mu_x$, $\mu_y$ are the average pixel values of $x$ and $y$;
- $\sigma_x^2$, $\sigma_y^2$ are the variances of $x$ and $y$;
- $\sigma_{xy}$ is the covariance between $x$ and $y$;
- $C_1 = (k_1 L)^2$, $C_2 = (k_2 L)^2$ are constants to stabilize the division with weak denominators, with $L$ being the dynamic range of the pixel values (e.g., 255 for 8-bit images), and $k_1 = 0.01$, $k_2 = 0.03$ by default.

SSIM values range from -1 to 1, where 1 indicates perfect similarity between the two images.
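In practice, both metrics can be computed with scikit-image; the snippet below is a sketch of such an evaluation helper and may differ from our exact evaluation settings (e.g., SSIM window size).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_view(gt: np.ndarray, pred: np.ndarray) -> dict:
    """gt, pred: uint8 RGB images of shape (H, W, 3)."""
    return {
        "psnr": peak_signal_noise_ratio(gt, pred, data_range=255),
        # channel_axis requires scikit-image >= 0.19
        "ssim": structural_similarity(gt, pred, data_range=255, channel_axis=-1),
    }
```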
Model. Our approach extends the pixelNeRF framework [13] by incorporating monocular depth priors from the frozen Depth Anything V2 estimator [20]. As illustrated in Figure 2, given an input RGB image $I$, we compute a depth map

$$D = f_{\mathrm{depth}}(I),$$

where $f_{\mathrm{depth}}$ is the pretrained Depth Anything V2 network (frozen during training). We then extract feature maps from both modalities:

$$F_{\mathrm{rgbd}} = E_{\mathrm{img}}\big([I; D]\big), \qquad F_{D} = E_{\mathrm{depth}}(D),$$

where $E_{\mathrm{img}}$ is the strong convolutional encoder (the ResNet backbone used in pixelNeRF, with its first convolutional layer modified to accept a 4-channel tensor formed by concatenating the image and the depth map) and $E_{\mathrm{depth}}$ is a lightweight convolutional encoder for the depth map. We fuse these features by concatenation:

$$F = \big[F_{\mathrm{rgbd}};\, F_{D}\big].$$

The fused feature map $F$ is then used during volumetric rendering to condition the NeRF MLP, which predicts radiance and density along each query ray (a PyTorch sketch of this fusion is given after Figure 2).
Figure 2. Overview of our architecture
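As a concrete illustration of this fusion, the sketch below builds the two encoders in PyTorch. The ResNet-34 backbone and 4-channel stem follow the description above, but the truncation point, the depth-branch layers, and the feature dimensions are illustrative assumptions rather than the exact configuration of our implementation.

```python
import torch
import torch.nn as nn
import torchvision

class DepthAwareEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet34(weights=None)
        # First convolution widened to 4 input channels (RGB concatenated with depth).
        backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.rgbd_encoder = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.layer1, backbone.layer2
        )
        # Lightweight convolutional encoder operating on the depth map alone.
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W), depth: (B, 1, H, W)
        f_rgbd = self.rgbd_encoder(torch.cat([rgb, depth], dim=1))   # (B, 128, H/4, W/4)
        f_depth = self.depth_encoder(depth)                          # (B, 64, H/4, W/4)
        f_depth = nn.functional.interpolate(
            f_depth, size=f_rgbd.shape[-2:], mode="bilinear", align_corners=False
        )
        return torch.cat([f_rgbd, f_depth], dim=1)                   # fused feature map F, (B, 192, H/4, W/4)

# feats = DepthAwareEncoder()(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))
```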
Neural Radiance Fields (NeRF). NeRF [1] represents a scene as a continuous volumetric function

$$F_{\Theta}: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma),$$

which receives a 3D point $\mathbf{x} \in \mathbb{R}^3$ and a viewing direction vector $\mathbf{d}$, and returns the volume density $\sigma$ and the emitted color $\mathbf{c}$ (RGB). For a specific target camera pose we can render the resulting image by obtaining a color for each pixel. A single camera ray is parameterized as

$$\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d},$$

where $\mathbf{o}$ is the origin of the ray (the camera center), $\mathbf{d}$ is its direction, and $t$ is the distance from the origin to the point. 3D points along a camera ray are sampled at $N$ depths $\{t_i\}_{i=1}^{N}$ between the near and far bounds $t_n$ and $t_f$. Each sample $\mathbf{x}_i = \mathbf{r}(t_i)$ can be projected onto the source image plane using the known input-view camera pose and intrinsics. The coordinates of the projected sample $\pi(\mathbf{x}_i)$ are then used to extract the corresponding feature vector $F\big(\pi(\mathbf{x}_i)\big)$, which conditions the NeRF MLP. To enable the network to represent high-frequency scene details, the position $\mathbf{x}$ is lifted via a sinusoidal positional encoding $\gamma(\cdot)$:

$$\gamma(\mathbf{x}) = \big(\sin(2^{0}\pi\mathbf{x}), \cos(2^{0}\pi\mathbf{x}), \dots, \sin(2^{L-1}\pi\mathbf{x}), \cos(2^{L-1}\pi\mathbf{x})\big),$$

where $L$ is the number of base frequencies. We use $L = 6$ base frequencies during training and evaluation, as in the default pixelNeRF setup. Finally, we pass the positionally encoded position, along with the view direction and the corresponding feature vector, into the NeRF MLP:

$$(\sigma_i, \mathbf{c}_i) = f\big(\gamma(\mathbf{x}_i), \mathbf{d};\, F(\pi(\mathbf{x}_i))\big),$$

where $\sigma_i$ is the volume density and $\mathbf{c}_i$ the emitted color. The continuous volume rendering integral

$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma\big(\mathbf{r}(t)\big)\,\mathbf{c}\big(\mathbf{r}(t), \mathbf{d}\big)\,dt,$$

with transmittance $T(t) = \exp\!\big(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\big)$, is approximated by the discrete sum

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \big(1 - \exp(-\sigma_i \delta_i)\big)\,\mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big),$$

where $\delta_i = t_{i+1} - t_i$. This ray-marching formulation allows NeRF to synthesize photorealistic novel views by learning both geometry (via $\sigma$) and appearance (via $\mathbf{c}$).
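To make the rendering step concrete, the following PyTorch sketch implements the positional encoding and the discrete compositing sum above; names and tensor shapes are illustrative rather than taken from the pixelNeRF implementation.

```python
import math
import torch

def positional_encoding(x: torch.Tensor, num_freqs: int = 6) -> torch.Tensor:
    """x: (..., 3) -> (..., 3 * 2 * num_freqs) sinusoidal features."""
    freqs = (2.0 ** torch.arange(num_freqs, device=x.device)) * math.pi   # 2^k * pi, k = 0..L-1
    angles = x[..., None] * freqs                                         # (..., 3, L)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

def composite_rays(sigma: torch.Tensor, rgb: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """sigma: (R, N) densities, rgb: (R, N, 3) colors, t: (R, N) sample depths -> (R, 3) colors."""
    delta = torch.cat([t[:, 1:] - t[:, :-1],
                       torch.full_like(t[:, :1], 1e10)], dim=-1)          # delta_i = t_{i+1} - t_i
    alpha = 1.0 - torch.exp(-sigma * delta)                               # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]                                                             # accumulated transmittance T_i
    weights = trans * alpha                                               # T_i * (1 - exp(-sigma_i * delta_i))
    return (weights[..., None] * rgb).sum(dim=1)                          # sum_i w_i * c_i

# Example: composite_rays(torch.rand(8, 64), torch.rand(8, 64, 3), torch.sort(torch.rand(8, 64))[0])
```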
Training Loss. We supervise the rendered color $\hat{C}(\mathbf{r})$ against the ground-truth pixel color $C(\mathbf{r})$ using an $\ell_2$ loss:

$$\mathcal{L}_{\mathrm{rgb}} = \sum_{\mathbf{r} \in \mathcal{R}} \big\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2,$$

where $\mathcal{R}$ is the set of training rays.
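A minimal PyTorch version of this objective over a ray batch (using the mean over rays, which is a scaled form of the sum above) could look as follows:

```python
import torch

def rgb_loss(pred_rgb: torch.Tensor, gt_rgb: torch.Tensor) -> torch.Tensor:
    """pred_rgb, gt_rgb: (R, 3) rendered and ground-truth colors for a batch of R rays."""
    # Squared L2 distance per ray, averaged over the batch.
    return ((pred_rgb - gt_rgb) ** 2).sum(dim=-1).mean()
```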
Implementation and Training Details. All experiments build upon the public pixelNeRF codebase [13], with two main extensions:
- Dataset module: augmented to load precomputed monocular depth maps (Depth Anything V2) alongside RGB.
- Model module: modified encoders to accept and fuse depth features as described earlier.
Training Configuration (Single-View, Category-Specific). We adopt the default pixelNeRF training schedule for single-view, category-specific reconstruction, with the following specifics (a minimal schedule sketch is given after the list):
- Hardware: Single NVIDIA P100 GPU.
- Optimizer: Adam with the default pixelNeRF initial learning rate.
- Sampling:
– Bounding-box sampling enabled for the first 100,000 iterations.
– Batch size: 4 objects per batch, each with 128 rays.
- Fine-tuning: After 100,000 iterations:
– Disable bounding-box sampling (switch to full-image ray sampling).
– Increase batch size to 8 instances (still 128 rays each).
– Apply exponential learning-rate decay with decay factor γ = 0.995 per epoch.
- Total training time: Approximately 36 hours on the “cars” category.
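The following self-contained PyTorch sketch summarizes this schedule; the stand-in model, data, epoch/iteration counts, and the initial learning rate (assumed to be pixelNeRF's default) are placeholders rather than our actual training code.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 4)                                    # stand-in for the conditioned NeRF MLP
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed initial learning rate
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)  # per-epoch decay

global_step = 0
for epoch in range(2):                                     # toy epoch count for illustration
    for _ in range(10):                                    # toy iteration count
        bbox_sampling = global_step < 100_000              # bounding-box ray sampling for first 100k iters
        n_objects = 4 if bbox_sampling else 8              # batch grows from 4 to 8 instances
        rays_per_object = 128
        dummy_rays = torch.randn(n_objects * rays_per_object, 3)
        loss = model(dummy_rays).pow(2).mean()             # placeholder for the volume-rendering L2 loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        global_step += 1
    scheduler.step()                                       # exponential learning-rate decay per epoch
```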
Hierarchical Sampling (Coarse & Fine MLPs). To allocate samples efficiently along each ray, we follow NeRF's coarse-to-fine strategy. First, 64 stratified samples are drawn uniformly between the near and far bounds and fed into the coarse MLP. The output densities produce weights that define a probability distribution from which 16 importance samples are drawn. To further concentrate samples near the predicted surface, we also draw 16 additional points from a Gaussian centered at the coarse network's estimated depth (standard deviation 0.01). The union of these samples is then evaluated by the fine MLP to regress the final ray color.
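The sketch below illustrates this sampling scheme in PyTorch; it simplifies NeRF's inverse-transform (PDF) sampling to multinomial draws over the coarse sample locations, so it is an approximation of the procedure rather than a drop-in implementation.

```python
import torch

def hierarchical_samples(t_coarse: torch.Tensor, weights: torch.Tensor,
                         n_importance: int = 16, n_gauss: int = 16,
                         depth_std: float = 0.01) -> torch.Tensor:
    """t_coarse: (R, 64) stratified depths, weights: (R, 64) coarse compositing weights."""
    probs = weights + 1e-5
    probs = probs / probs.sum(dim=-1, keepdim=True)
    # 16 importance samples drawn according to the coarse weight distribution.
    idx = torch.multinomial(probs, n_importance, replacement=True)        # (R, 16)
    t_importance = torch.gather(t_coarse, 1, idx)
    # Expected ray-termination depth from the coarse pass.
    depth = (probs * t_coarse).sum(dim=-1, keepdim=True)                  # (R, 1)
    # 16 extra samples from a Gaussian around the predicted surface (std 0.01).
    t_gauss = depth + depth_std * torch.randn(depth.shape[0], n_gauss, device=t_coarse.device)
    # Union of coarse, importance, and Gaussian samples, sorted along the ray.
    t_all, _ = torch.sort(torch.cat([t_coarse, t_importance, t_gauss], dim=-1), dim=-1)
    return t_all                                                          # (R, 96) depths for the fine MLP
```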
In summary, our approach extends the pixelNeRF baseline by fusing monocular depth priors with image features, conditioning the NeRF MLP on the combined representation, and optimizing with standard volume rendering and an ℓ2 reconstruction loss. As a result, the model converges more quickly and achieves higher fidelity. As shown in Fig. 3, the depth-aware model's smoothed evaluation loss decreases faster than the baseline's (under identical training settings), and its validation PSNR rises more rapidly, reaching approximately 21.9 dB versus 20.1 dB for the standard pixelNeRF model. This demonstrates that incorporating depth guidance both accelerates learning and improves final rendering quality.
Figure 3. Comparison of training dynamics between the depth-aware and baseline pixelNeRF models.
Results and discussions.
We evaluate our depth‑aware NeRF in the category‑specific single‑view setting, using the exact data splits and rendering protocol defined in pixelNeRF [13]. In particular, we train separate models on the “car” and “chair” subsets of ShapeNet SRN, use the 64th view as input, and compare novel‑view renderings against the remaining 250 ground‑truth views.
Table 1 summarizes quantitative results on the ShapeNet SRN dataset for novel-view synthesis, comparing our depth-aware model with pixelNeRF and VisionNeRF. Our approach clearly improves upon the pixelNeRF baseline, achieving approximately a 0.54 dB increase in PSNR on chairs and a 0.33 dB increase on cars. Additionally, our model demonstrates comparable or superior performance relative to VisionNeRF, despite VisionNeRF's use of more sophisticated global and local feature encoders.
Table 1.
PSNR and SSIM comparison on ShapeNet SRN single-view novel-view synthesis for chair and car categories
| Method | Chairs PSNR(↑) | Chairs SSIM(↑) | Cars PSNR(↑) | Cars SSIM(↑) |
|---|---|---|---|---|
| SRN | 22.89 | 0.89 | 22.25 | 0.88 |
| pixelNeRF | 23.72 | 0.90 | 23.17 | 0.89 |
| VisionNeRF | 24.48 | 0.92 | 22.88 | 0.90 |
| Ours | 24.26 | 0.91 | 23.50 | 0.90 |
These results demonstrate that incorporating monocular depth priors yields a consistent boost in reconstruction quality compared to using only RGB. The improvement is more pronounced in PSNR (which is sensitive to pixel-wise differences), suggesting that the depth prior helps reduce color projection errors, likely by constraining the geometry and thereby reducing blurring or doubling of features in the renderings. The SSIM improvements are present but more modest, indicating that structural similarity was already quite high with pixelNeRF and that our method mainly refines details.
Notably, our model, despite its simplicity, achieves performance on par with the more complex VisionNeRF on the car category and only slightly below it on chairs. VisionNeRF's transformer-based global features likely help with chairs, which exhibit more self-similarity (e.g., four similar legs), whereas our depth prior helps more with cars, which have clear depth gradients (e.g., hood versus windshield). Overall, the depth-aware approach appears to be a cost-effective way to improve pixelNeRF.
Figure 4 presents qualitative results of our depth-aware NeRF model trained on single-view images. Since our work extends the pixelNeRF framework, it similarly tends to produce slightly blurry artifacts in novel-view renderings. Nevertheless, despite being trained for significantly shorter periods, our depth-aware model achieves comparable—and in some instances superior—visual quality. For minor changes in viewpoint, the model reliably generates clear novel views; however, as the viewing angle deviates more substantially from the source view, artifacts and blurring become increasingly noticeable.
/Valikhanov.files/image077.jpg)
Figure 4. Qualitative examples of novel-view synthesis using our depth-aware NeRF model. Columns represent the source view (input), target ground-truth view, and the generated view, respectively
Conclusion. We have presented a depth-aware extension of pixelNeRF that integrates monocular depth priors into the NeRF framework. By fusing depth maps with image features, our method enhances the conditioning of the NeRF MLP, leading to improved convergence speed and rendering quality. Experimental results on the ShapeNet SRN dataset demonstrate that our approach outperforms the pixelNeRF baseline and is competitive with VisionNeRF in single-view novel view synthesis. Despite these improvements, our method has several limitations. First, it does not leverage geometric symmetries or canonical object structures, which could aid in reconstructing occluded or unobserved regions. Second, the model lacks high-level semantic understanding, such as recognizing object parts (e.g., wheels on a car), which could improve reconstruction of unseen components. Third, similar to the original NeRF, our method suffers from slow rendering times, making real-time applications challenging. Lastly, the model's performance degrades when synthesizing views with significant angular deviations from the input, due to the limited information available from a single image. To address these limitations, future research could explore incorporating generative models, such as diffusion models, to infer plausible structures in occluded regions [22; 11]. Employing alternative scene representations such as 3D Gaussian splats [23] may offer more efficient and flexible rendering capabilities [12]. Additionally, adopting techniques from InstantNGP [24] or Re-ReND [25] could significantly reduce rendering times, facilitating real-time applications.
References:
- Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis, 2020.
- Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository, 2015.
- Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction, 2016.
- Haozhe Xie, Hongxun Yao, Xiaoshuai Sun, Shangchen Zhou, and Shengping Zhang. Pix2vox: Context-aware 3d reconstruction from single and multi-view images, 2019.
- Haoqiang Fan, Hao Su, and Leonidas Guibas. A point set generation network for 3d object reconstruction from a single image, 2016.
- Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer, 2017.
- Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space, 2019.
- Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization, 2022.
- Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning, 2019.
- Alexandre Boulch and Renaud Marlet. Poco: Point convolution for surface reconstruction, 2022.
- Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object, 2023.
- Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction, 2023.
- Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images, 2021.
- Kai-En Lin, Lin Yen-Chen, Wei-Sheng Lai, Tsung-Yi Lin, Yi-Chang Shih, and Ravi Ramamoorthi. Vision transformer for nerf-based view synthesis from a single input image, 2022.
- Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free, 2022.
- Malte Prinzler, Otmar Hilliges, and Justus Thies. Diner: Depth-aware image-based neural radiance fields, 2023.
- Yurui Chen, Chun Gu, Feihu Zhang, and Li Zhang. Single-view neural radiance fields with depth teacher, 2023.
- René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer, 2020.
- Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023.
- Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2, 2024.
- Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations, 2020.
- Jiatao Gu, Alex Trevithick, Kai-En Lin, Josh Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ramamoorthi. Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware diffusion, 2023.
- Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023.
- Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics, 41(4):1–15, July 2022.
- Sara Rojas, Jesus Zarzar, Juan Camilo Perez, Artsiom Sanakoyeu, Ali Thabet, Albert Pumarola, and Bernard Ghanem. Re-rend: Real-time rendering of nerfs across devices, 2023.