3D StreetUnveiler with Semantic-Aware 2DGS (2024)

Jingwei Xu1 Yikai Wang2 Yiqun Zhao1 Yanwei Fu2 Shenghua Gao1

1ShanghaiTech University 2Fudan University

xujw2023@shanghaitech.edu.cn gaoshh@shanghaitech.edu.cn

Corresponding author

Abstract

Unveiling an empty street from crowded observations captured by in-car cameras is crucial for autonomous driving. However, removing all temporary static objects, such as stopped vehicles and standing pedestrians, presents a significant challenge. Unlike object-centric 3D inpainting, which relies on thorough observation of a small scene, street scenes involve long trajectories that differ from previous 3D inpainting tasks. The camera-centric moving environment of the captured videos further complicates the task because each object is observed only from limited angles and for a short time. To address these obstacles, we introduce StreetUnveiler, which learns a 3D representation of the empty street from crowded observations. Our representation is based on hard-label semantic 2D Gaussian Splatting (2DGS) for its scalability and its ability to identify the Gaussians to be removed. We inpaint the rendered images after removing unwanted Gaussians to provide pseudo-labels and subsequently re-optimize the 2DGS. Given the temporally continuous camera movement, we divide the empty street scene into observed, partially observed, and unobserved regions, which we propose to locate through a rendered alpha map. This decomposition helps us minimize the regions that need to be inpainted. To enhance the temporal consistency of the inpainting, we introduce a novel time-reversal framework that inpaints frames in reverse order and uses later frames as references for earlier frames, fully exploiting the long-trajectory observations. Our experiments on street scene data successfully reconstruct a 3D representation of the empty street. A mesh representation of the empty street can be extracted for further applications. Project page and more visualizations can be found at: https://streetunveiler.github.io

1 Introduction

Accurate 3D reconstruction of an empty street scene from an in-car camera video would greatly facilitate autonomous driving by providing reliable digital environments that simulate real-world street scenarios. Although this is an important task, it has seldom been studied in previous work because of the following challenges: (1) the lack of ground-truth data for pre-training inpainting models specialized for street scenes; (2) the camera-centric motion captures objects only from limited angles and for brief periods; (3) the long trajectory of in-car videos leads to objects appearing and disappearing at different time points, complicating object removal.

There is, however, a blessing hidden in the long-trajectory, moving-forward nature of the capture. As the car moves forward, objects that disappear from a later frame remain visible only in earlier video frames. This provides a hint for maintaining the temporal consistency of the same regions.

To address the challenge of reconstructing an empty street, we introduce StreetUnveiler, a reconstruction method for unveiling the empty representation of long-trajectory street scenes. StreetUnveiler involves reconstructing the observed 3D representation, identifying unobserved regions occluded by objects, a time-reversal inpainting framework that consistently inpaints the unobserved regions as pseudo labels, and re-optimizing the 3D representation based on the pseudo labels.

[Figure 1]
[Figure 2]

StreetUnveiler first reconstructs the original crowded street with Gaussian Splatting (GS) for its scalability and editability. However, as illustrated in Fig.2, inpainting with the naïve object mask (orange mask) often results in blurring and loss of detail in large inpainted regions, a common issue in previous works[37, 68, 61, 67, 33]. Generating masks for completely unobservable regions (blue mask) that are invisible from any viewpoint remains a challenge. Recent work[33] requires user-provided masks, which is impractical for long trajectories. Moreover, the messy appearance of these regions after removing the Gaussians makes it difficult to use methods like SAM[22]. To address the difficulty of finding an ideal inpainting mask, we propose to generate the mask from the rendered alpha map and to reconstruct the scene using hard-label semantic 2DGS[17] instead of 3DGS[18]. In contrast to 3DGS, the 2D Gaussians tend to have high opacity, so completely unobservable regions render with low alpha values after removal. A semantic distortion loss and a shrinking loss are employed to further reduce the rendered alpha values of the completely unobservable regions. This approach automatically generates masks for unobservable regions without user input, leading to better inpainting results.

Furthermore, we propose a time-reversal inpainting framework to enhance the consistency of inpainting results in completely unobservable regions. By inpainting the video frames in reverse order, we use the later frame as a reference to inpaint the earlier frame. When the video is played in reverse, the occluded content transitions only from near to far in the camera view, as the camera moves away from the object in reversed time. This yields a high-to-low-resolution guiding scheme, instead of a low-to-high-resolution one that would have to fill an area larger than the reference region, and results in more consistent inpainting. Finally, the inpainted pixels are used as pseudo labels to guide the re-optimization of the 2DGS. This enables our method to learn a scalable 2DGS model that represents an empty street while preserving the appearance integrity of regions visible in other views.

Our contribution can be summarized as follows:

  • We propose representing the street as hard-label semantic 2DGS, optimizing the 3D scene with semantic guidance for scalable representation and improved instance decoupling.

  • We use a rendered alpha map to locate completely unobservable regions and apply a semantic distortion loss and a shrinking loss to create a reasonable inpainting mask for these regions.

  • We introduce a novel time-reversal inpainting strategy for long-trajectory scenes, enhancing the consistency of inpainting results for re-optimization. Experiments show that our method can reconstruct an empty street from in-car camera video containing obstructive elements.

2 Related work

Neural scene representation and reconstruction. The use of neural radiance fields (NeRF)[36] to represent 3D scenes inspired a lot of follow-up work based on the original approach. Some works[38, 6, 52, 48] explore explicit representations such as low-rank matrices, hash grids, or voxel grids to increase the model capacity of the original MLPs. Other works explore multiple separate MLPs[43, 24, 12] to represent instances and backgrounds separately. However, these scale-up strategies are complicated to implement at the scale of street scenes. Existing works[71, 55, 57, 34, 44, 76, 66, 58, 35, 62, 14, 86, 51] explore mesh-based, primitive-based, or grid-based representations for large-scale street scenes. However, both grid-based representation[14] and mesh-based representation[66] may be constrained by their limited topology, making it hard to decouple the scene into separate instances. Recent advances in point-based rendering techniques[18, 25, 72, 17] achieve both high quality and fast rendering speed, and the point-based nature of Gaussian Splatting enables scalability for street scenes. While recent works[8, 74, 28, 46, 9] have explored the reconstruction of large-scale scenes using Gaussian Splatting, our work focuses on the unveiling stage of a street scene, which is more challenging and more important for autonomous driving.

3D scene manipulation and inpainting. Early works[63, 82, 41, 56, 1, 29, 79, 80, 78, 89] explored street scene editing by leveraging single-view or multi-view image inpainting networks. With the rapid development of Neural Scene Representation, editing a 3D scene has been explored by lots of works[10, 88, 75, 81, 2, 23, 19, 40]. Edit-NeRF[32] pioneered shape and color editing of neural fields using latent codes. Subsequent works[2, 23, 19, 40] utilized CLIP models to provide editing guidance from text prompts or reference images. Recent works[68, 83, 37, 69, 7, 11, 77, 67, 65, 27] also explored 2D stylization and inpainting techniques, utilizing pretrained Diffusion Priors[47] for editing 3D scenes. Specifically,[7, 11, 77, 65] investigate these approaches in collaboration with Gaussian Splatting. Unlike them, our work focuses on street scene object removal and empty street reconstruction, which is more challenging.

Image and video inpainting. Image inpainting[3] aims to fill in the missing region within an image. Standard approaches include GAN-based methods[39, 87], attention-based methods[79, 30], transformer-based methods[59, 31], and more recently, diffusion-based methods[47]. ControlNet[84] enables generating images with additional conditions on frozen diffusion models. Recently, LeftRefill[4] learns to guide frozen diffusion inpainting models with the extra condition of a reference image, enabling multi-view inpainting on a frozen diffusion model. However, these image inpainting methods mainly focus on the static scenario. Video inpainting considers temporally consistent inpainting of a continuous image sequence, utilizing approaches such as 3D CNNs[60, 16], temporal shifting[94], flow guidance[20, 73, 26], and temporal attention[45], to name a few. However, these video inpainting methods hardly consider the long-trajectory movement of cameras. In contrast, our paper focuses on the inpainting of large-scale street scenes. Furthermore, the 2DGS representation used in our paper enables free-view rendering of the inpainted video.

3 Problem formulation

Given in-car camera videos and the Lidar data of a crowded street, our goal is to remove all temporary static objects in the street, such as stopped vehicles and standing pedestrians, and finally reconstruct an empty street. This task, named Street Unveiling, is to generate reconstructed scenes devoid of these static obstacles, providing an empty representation of the street environment. Such representations are mainly 3D models that support free-view rendering. This task holds significant implications for autonomous driving systems, urban planning, and scene understanding applications.

Street unveiling shares some similarities with related tasks but cannot be addressed by existing approaches. (1) 3D reconstruction primarily models a central object or scene with an object-centric camera. In contrast, street unveiling focuses on the background, aiming to remove foreground objects to reveal an empty street. The absence of ground truth further differentiates it from standard 3D reconstruction tasks. (2) Video inpainting typically deals with videos captured by fixed or minimally moving cameras, featuring one or a few central objects. Conversely, street unveiling involves long camera trajectories without central objects. These distinctions demand different capabilities and novel methods to address the unique challenges of street unveiling.

4 Semantic street reconstruction

We opt for 2D Gaussian Splatting (2DGS)[17] as our scene representation for its rendering speed and editability. We first introduce 2DGS in Sec.4.1. Subsequently, we elaborate on our algorithm tailored for street unveiling using 2DGS in Sec.4.2 and Sec.4.3.

4.1 Preliminary: 2D Gaussian Splatting

Our reconstruction stage builds upon the state-of-the-art point-based renderer with excellent geometry performance, 2D Gaussian Splatting[17]. A 2D splat is defined by several key components: the central point $\mathbf{p}_k$, two principal tangential vectors $\mathbf{t}_u$ and $\mathbf{t}_v$ that determine its orientation, and a scaling vector $\mathbf{S}=(s_u, s_v)$ controlling the variances of the 2D Gaussian distribution.

2D Gaussian splatting represents the scene’s geometry as a set of 2D Gaussians. A 2D Gaussian is defined in a local tangent plane in world space, parameterized as follows:

$$P(u,v) = \mathbf{p}_k + s_u \mathbf{t}_u u + s_v \mathbf{t}_v v. \qquad (1)$$

For a point $\mathbf{u}=(u,v)$ in $uv$ space, its 2D Gaussian value can then be evaluated using the standard Gaussian function:

$$\mathcal{G}(\mathbf{u}) = \exp\left(-\frac{u^{2}+v^{2}}{2}\right). \qquad (2)$$

The center $\mathbf{p}_k$, scaling $(s_u, s_v)$, and rotation $(\mathbf{t}_u, \mathbf{t}_v)$ are learnable parameters. Each 2D Gaussian primitive has an opacity $\alpha$ and a view-dependent appearance $\mathbf{c}$ represented with spherical harmonics. For volume rendering, Gaussians are sorted according to their depth values and composited into an image with front-to-back alpha blending:

$$\mathbf{c}(\mathbf{x}) = \sum_{i=1} \mathbf{c}_i\,\alpha_i\,\hat{\mathcal{G}}_i(\mathbf{u}(\mathbf{x})) \prod_{j=1}^{i-1}\bigl(1-\alpha_j\,\hat{\mathcal{G}}_j(\mathbf{u}(\mathbf{x}))\bigr). \qquad (3)$$

where $\mathbf{x}$ represents a homogeneous ray emitted from the camera and passing through $uv$ space.
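To make Eq. (3) concrete, the following Python sketch composites the sorted splats along a single ray front-to-back. It is a minimal illustration under assumed tensor shapes, not the official 2DGS rasterizer; the accumulated alpha it returns is the per-pixel quantity we later threshold to build inpainting masks (Sec. 5.1).

```python
import torch

def composite_ray(colors, alphas, gauss_vals):
    """Front-to-back alpha blending along one ray (Eq. 3).

    colors:     (N, 3) view-dependent colors c_i, sorted front-to-back by depth
    alphas:     (N,)   per-Gaussian opacities alpha_i
    gauss_vals: (N,)   evaluated 2D Gaussian values G_i(u(x))
    """
    pixel_color = torch.zeros(3)
    transmittance = 1.0  # running product of (1 - alpha_j * G_j)
    for c_i, a_i, g_i in zip(colors, alphas, gauss_vals):
        weight = a_i * g_i * transmittance  # blending weight omega_i
        pixel_color = pixel_color + weight * c_i
        transmittance = transmittance * (1.0 - a_i * g_i)
    accumulated_alpha = 1.0 - transmittance  # one pixel of the rendered alpha map
    return pixel_color, accumulated_alpha
```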

4.2 2DGS for street scene reconstruction

2DGS is known for its accurate geometry reconstruction of object surfaces. However, applying 2DGS to regions devoid of surfaces, such as the sky in an open-air street scene, remains unexplored. We aim to reconstruct the street scene as a radiance field and a semantic field using 2DGS. More details about radiance field reconstruction are provided in the supplementary.

Learning 2D Gaussians with semantic guidance. We aim to augment the radiance field of street scenes with editability. Inspired by[13, 74, 8, 90], we harness the power of 2D semantic segmentation and distill such knowledge back to the 2D Gaussians. To do so, we inject each 2D Gaussian with a 'hard' semantic label. 'Hard' means that the semantic label is non-trainable, which differs from the learnable 'soft' labels used in recent works[90, 74, 93]. Note that although our 'hard' semantic label is not trainable, it still allows rendering correct 2D semantic maps by altering each Gaussian's opacity, rotation, and scaling. This encourages points with the same semantic label to gather closer in 3D space, facilitating object removal in 3D space. Assuming each 2D Gaussian is associated with a one-hot encoded semantic label $\mathrm{s}_k$, we render the 2D semantic map as:

$$\hat{S}(\mathbf{x}) = \sum_{i=1} \mathrm{s}_i\,\alpha_i\,\hat{\mathcal{G}}_i(\mathbf{u}(\mathbf{x})) \prod_{j=1}^{i-1}\bigl(1-\alpha_j\,\hat{\mathcal{G}}_j(\mathbf{u}(\mathbf{x}))\bigr). \qquad (4)$$
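Rendering the semantic map reuses the same front-to-back blending weights, with the fixed one-hot labels taking the place of colors. A minimal sketch, under the same assumptions as the color-compositing example above:

```python
import torch

def composite_semantics(labels_onehot, alphas, gauss_vals):
    """Per-pixel semantic rendering with hard labels (Eq. 4).

    labels_onehot: (N, K) non-trainable one-hot labels s_i, sorted front-to-back
    """
    pixel_sem = torch.zeros(labels_onehot.shape[1])
    transmittance = 1.0
    for s_i, a_i, g_i in zip(labels_onehot, alphas, gauss_vals):
        pixel_sem = pixel_sem + a_i * g_i * transmittance * s_i
        transmittance = transmittance * (1.0 - a_i * g_i)
    return pixel_sem  # argmax over the K channels gives the rendered hard label
```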

4.3 Optimization of 2DGS for Street Unveiling

In this part, we first introduce the standard objectives used by previous approaches to optimize 2DGS. Then we discuss the inferiority of these objectives in the street scene and propose new objectives tailored for street unveiling. In summary, our objective consists of a photometric loss, a semantic loss, a normal consistency loss, two different depth distortion losses, and a shrinking loss.

Standard approach: As in 3DGS[18], we use an $\mathcal{L}_1$ loss and a D-SSIM loss for supervising RGB color, with $\lambda = 0.2$:

$$\mathcal{L}_{\text{rgb}} = (1-\lambda)\mathcal{L}_{1} + \lambda\mathcal{L}_{\text{D-SSIM}}. \qquad (5)$$

Following 2DGS[17], depth distortion loss and normal consistency loss are adopted to refine the geometry property of the 2DGS representation of the street scene.

$$\mathcal{L}_{\text{d}} = \sum_{i,j}\omega_{i}\omega_{j}\,|z_{i}-z_{j}|, \qquad \mathcal{L}_{n} = \sum_{i}\omega_{i}\bigl(1-\mathbf{n}_{i}^{\top}\mathbf{N}\bigr) \qquad (6)$$

Here, $\omega_i$ represents the blending weight of the $i$-th intersection, $z_i$ denotes the depth of the intersection point, $\mathbf{n}_i$ is the normal of the splat facing the camera, and $\mathbf{N}$ is the normal estimated from the nearby depth point $\mathbf{p}$.
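A direct per-ray reading of Eq. (6) is sketched below in its naive O(N²) form (practical rasterizers accumulate the pairwise term incrementally); tensor shapes and names are assumptions for illustration.

```python
import torch

def depth_distortion_loss(weights, depths):
    """L_d = sum_{i,j} w_i * w_j * |z_i - z_j| over one ray's intersections."""
    diff = (depths[:, None] - depths[None, :]).abs()          # |z_i - z_j|
    return (weights[:, None] * weights[None, :] * diff).sum()

def normal_consistency_loss(weights, splat_normals, estimated_normal):
    """L_n = sum_i w_i * (1 - n_i^T N), with N estimated from nearby depth points."""
    cos_sim = (splat_normals * estimated_normal[None, :]).sum(dim=-1)
    return (weights * (1.0 - cos_sim)).sum()
```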

We employ Cross-Entropy (CE) loss to supervise semantic labels:

$$\mathcal{L}_{s}(\mathbf{x}) = \text{CE}\bigl(\hat{S}(\mathbf{x}), S(\mathbf{x})\bigr) \qquad (7)$$

where $S$ is a pseudo semantic map extracted from a pre-trained segmenter[70].

Inferiority of standard objectives. In street unveiling, the scene semantics should be kept clean and consistent so that the Gaussians of the objects to remove can be reliably identified. However, the naïve depth distortion alone does not prevent 2D Gaussians with different semantic labels from merging, leading to noisy semantic information in the 3D world. Meanwhile, Gaussians in unseen regions will persist if we do not find a way to eliminate them. Both problems harm the generation of an ideal inpainting mask.

Clean-up objectives. To reduce the noise in the semantic field, we propose a semantic depth distortion loss $\mathcal{L}_{\text{ds}}$ and a shrinking loss $\mathcal{L}_{\alpha}$ on the opacity $\alpha$:

$$\mathcal{L}_{\text{ds}} = \sum_{k}\mathcal{L}_{\text{d}}^{k}, \qquad \mathcal{L}_{\alpha} = \sum_{p}\alpha_{p} \qquad (8)$$

where $k$ iterates over the semantic labels and $\mathcal{L}_{\text{d}}^{k}$ denotes the distortion loss of the 2DGS sharing label $k$. The depth distortion loss is exerted on the rendered result of the Gaussians with the same semantic label; intuitively, it encourages 2D Gaussians with the same label to have a more consistent depth at the pixel level. The shrinking loss further eliminates Gaussians that are actually unseen from any viewpoint. $\alpha_p$ represents the opacity $\alpha$ of each Gaussian.
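The clean-up objectives of Eq. (8) can be sketched as follows: the semantic depth distortion applies the same pairwise depth term separately per label, and the shrinking loss simply sums all opacities. This is an illustrative per-ray approximation of the per-label rendering described above, not the exact implementation.

```python
import torch

def pairwise_depth_distortion(weights, depths):
    """sum_{i,j} w_i * w_j * |z_i - z_j| (same form as L_d in Eq. 6)."""
    diff = (depths[:, None] - depths[None, :]).abs()
    return (weights[:, None] * weights[None, :] * diff).sum()

def semantic_depth_distortion(weights, depths, labels, num_classes):
    """L_ds (Eq. 8, left): depth distortion applied separately per semantic
    label, so only Gaussians sharing a label are pulled to a consistent depth."""
    loss = depths.new_zeros(())
    for k in range(num_classes):
        mask = labels == k
        if mask.sum() > 1:
            loss = loss + pairwise_depth_distortion(weights[mask], depths[mask])
    return loss

def shrinking_loss(opacities):
    """L_alpha (Eq. 8, right): sum of all Gaussian opacities; drives splats with
    no photometric support (completely unobserved regions) toward transparency."""
    return opacities.sum()
```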

The total loss is given as

$$\mathcal{L} = \mathcal{L}_{\text{rgb}} + \lambda_{d}\mathcal{L}_{\text{d}} + \lambda_{n}\mathcal{L}_{\text{n}} + \lambda_{ds}\mathcal{L}_{\text{ds}} + \lambda_{s}\mathcal{L}_{\text{s}} + \lambda_{\alpha}\mathcal{L}_{\alpha} \qquad (9)$$

We empirically set $\lambda_{d}=100$, $\lambda_{n}=0.05$, $\lambda_{ds}=100$, $\lambda_{s}=0.1$, and $\lambda_{\alpha}=0.001$.
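A hypothetical assembly of the total objective in Eq. (9) with these empirical weights, where `losses` is assumed to hold the individual terms:

```python
# Empirical weights from Eq. (9).
LAMBDA = {"d": 100.0, "n": 0.05, "ds": 100.0, "s": 0.1, "alpha": 0.001}

def total_loss(losses):
    """losses: dict with keys 'rgb', 'd', 'n', 'ds', 's', 'alpha'."""
    return (losses["rgb"]
            + LAMBDA["d"] * losses["d"]
            + LAMBDA["n"] * losses["n"]
            + LAMBDA["ds"] * losses["ds"]
            + LAMBDA["s"] * losses["s"]
            + LAMBDA["alpha"] * losses["alpha"])
```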

5 Empty street reconstruction

A common strategy[37, 68] for inpainting small scenes is to use 2D inpainting methods to fill in removed objects in image space and then re-optimize. However, many problems arise in street scenes. (1) Some views yield over-blurry inpainting results due to the huge size of the inpainting mask, as illustrated in Fig.2(b); (2) Some occluded regions of the street are hard to keep consistent because they are exposed to a large number of views along the long trajectory. These challenges make the re-optimization more vulnerable to inconsistent inpainting.

In the context of point-based scene representation, eliminating an object amounts to deleting its Gaussians. However, naïve removal often yields unsatisfactory results, particularly in the completely unobservable regions beneath the object. In this section, we first describe how to generate an ideal inpainting mask as in Fig.2(c). We then present our time-reversal inpainting algorithm and how its results are used to re-optimize the 2DGS.

5.1 Generation of ideal inpainting mask

In a street video captured by a moving car, we can divide the scene regions into three categories: (1) observable regions, which are not occluded by any object; (2) partially observable regions, which are occluded in some views but observable in others; (3) completely unobservable regions, which are unobservable in all recorded views. For regions in the second case, we can utilize information from other views to preserve more of the street scene's true appearance. As illustrated in Fig.2, naïvely inpainting with the object mask causes unexpectedly blurry results in partially observable regions, which are visible from other viewpoints but occluded from the current one.

To distinguish partially observable regions from completely unobservable regions and improve inpainting quality, we propose using the rendered alpha map to generate the mask for completely unobservable regions. For a given viewpoint, we first remove the target Gaussians and their neighbors from the 2DGS, then render the alpha map of the remaining scene. We identify the completely unobservable regions as the pixels with low alpha values: pixels whose alpha value is lower than a threshold are selected as the inpainting mask. The threshold is set to 0.99 in our implementation.
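A minimal sketch of this mask-generation step is given below; `gaussians.drop`, `render_alpha_fn`, and `view` are hypothetical placeholders for the actual 2DGS scene structure and renderer.

```python
ALPHA_THRESHOLD = 0.99  # pixels with lower accumulated alpha are treated as unobserved

def unobservable_mask(gaussians, removal_ids, view, render_alpha_fn):
    """Sec. 5.1: after deleting the target Gaussians (and their neighbors),
    pixels whose rendered alpha stays low were never observed from any view
    and therefore form the inpainting mask."""
    remaining = gaussians.drop(removal_ids)        # hypothetical removal API
    alpha_map = render_alpha_fn(remaining, view)   # (H, W) accumulated alpha
    return alpha_map < ALPHA_THRESHOLD             # boolean inpainting mask
```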

5.2 Time-reversal inpainting

The core challenge in reconstructing the empty street scene is ensuring consistency between different viewpoints over the long trajectory. However, current video inpainting methods do not generalize to our long-trajectory, complex scenarios, as validated in Tab.2 and the supplementary video comparison; moreover, video inpainting models usually lag behind image inpainting models in scaling up. To this end, we propose using a reference-based image inpainting method that is trained to ensure consistency between the inpainted region and a reference image. In particular, we adopt LeftRefill[4] for its Stable Diffusion backbone and matching-based training strategy. The Stable Diffusion backbone provides strong generation capacity in open-world scenarios, which fits the requirements of street unveiling. Furthermore, the matching-based training strategy ensures that the inpainting model fills the masked region based on the observations in the reference image, which encourages consistency between different views.

[Figure 3]

However, a typical time-forward inpainting order usually fails to produce consistent results. Given the moving-forward nature of data-collecting vehicles, objects to be removed transition from far to near in the camera view. (1) As illustrated in Fig.3, when the far-view image is used as a reference to inpaint the same region in the near-view image, the model may not correctly capture the matching relationships, causing inconsistent inpainting. Conversely, using the near-view image as the reference leads to more precise matching and naturally better inpainting results. (2) The near-view image captures more fine-grained information and a larger receptive field, so inpainting proceeds from high to low resolution rather than from low to high, which would require extra super-resolution capacity from the inpainting model. Besides, the objects removed in the final frames are consistently observed in the earlier frames.

Based on the above analysis, we propose the time-reversal inpainting framework. By reversing time, we turn the moving-forward nature into a moving-backward one: objects to be removed now transition from near to far in the camera view, because the camera moves away from the removed object in reversed time.

We aim to unconditionally inpaint each 3D region only once and then propagate the inpainted pixels to other views with reference-based inpainting. As illustrated in Fig.4, we first unconditionally inpaint[5] both frame $T_n$ and frame $T_{n+1}$. However, some regions of frame $T_n$ are also visible in $T_{n+1}$; by exploiting the implicit pixel-matching ability of the reference-based inpainting model[4], we expect them to share more matching pixels. We therefore use frame $T_{n+1}$ as a reference to inpaint frame $T_n$, masking only the regions visible in $T_{n+1}$.
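The overall procedure can be sketched as a reverse-order loop, where `inpaint_fn` stands for an unconditional single-image inpainter and `ref_inpaint_fn` for a reference-based inpainter such as LeftRefill; the names and the simplified control flow are illustrative assumptions rather than the exact pipeline.

```python
def time_reversal_inpaint(frames, masks, inpaint_fn, ref_inpaint_fn):
    """Inpaint a frame sequence in reverse temporal order (Sec. 5.2).

    frames/masks are ordered by capture time; processing them in reverse lets
    the nearer (later) view serve as reference for the farther (earlier) view.
    """
    results = [None] * len(frames)
    # The last frame has no later reference: inpaint it unconditionally.
    results[-1] = inpaint_fn(frames[-1], masks[-1])
    for i in range(len(frames) - 2, -1, -1):
        reference = results[i + 1]
        # Fill the regions of frame i that are visible in the reference;
        # anything the reference never sees falls back to unconditional inpainting.
        results[i] = ref_inpaint_fn(frames[i], masks[i], reference)
    return results
```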

[Figure 4]

5.3 Re-optimization of the 2D Gaussians

Once time-reversal inpainting is finished, we use the inpainting results as pseudo labels to guide the retraining of the 2DGS representation. We use the following loss to refine the 2DGS:

$$\mathcal{L}_{\text{retrain}} = \mathcal{L}_{1} + \lambda_{d}\mathcal{L}_{\text{d}} + \lambda_{n}\mathcal{L}_{\text{n}}. \qquad (10)$$

6 Experiments

Dataset. For the evaluation of our approach from both the reconstruction and the object removal aspects, we adopt real-world street scenes from the Waymo Open Perception Dataset[53]. The Waymo dataset collects data from 5 camera perspectives, encompassing roughly 230 degrees of field of view (FOV). We downscale the original images to 484×320 for efficiency and a fair comparison.

Metrics.To evaluate the effectiveness of object removal, we approach it from a multi-view inpainting perspective. Following previous works[37, 68, 33], we calculate the LPIPS[85] and Fréchet Inception Distance (FID)[15] scores to quantify the discrepancies between the ground-truth views and removed results. These metrics are computed for each frame and then averaged.

Baselines. We compare our approach to the 3D inpainting method SPIn-NeRF[37] and a recent Gaussian Splatting based inpainting method, Infusion[33]. As the original MLP implementation of SPIn-NeRF[37] works poorly on large-scale street scenes, we re-implement SPIn-NeRF[37] on top of 2DGS[17], so that our advantage can be attributed not only to 2DGS but also to the proposed time-reversal inpainting. Infusion[33] is run with its official implementation. However, Infusion[33] only performs Gaussian removal and projection once for the whole scene, which does not match our long-trajectory task; we instead apply it every 10 frames to fit our setting.

[Figure 5]

6.1 Comparison

Comparison with baselines:

Method | LPIPS ↓ | FID ↓
Single Image Inpainting
LaMa (2D) [54] | 0.251 | 164.247
SDXL [42] | 0.269 | 149.222
Video Inpainting
ProPainter [92] | 0.257 | 162.584
3D Inpainting
SPIn-NeRF [37] (in 2DGS) | 0.252 | 165.792
Infusion [33] | 0.330 | 189.586
Ours | 0.241 | 157.970

Ablation of inpainting methods used as pseudo labels:

Method | LPIPS ↓ | FID ↓
w/ LaMa [54] | 0.247 | 158.383
w/ SDXL [42] | 0.252 | 162.213
w/ ProPainter [92] | 0.252 | 160.809
Ours (time-reversal inpainting) | 0.241 | 157.970

Ablation of 3D representation:

Method | LPIPS ↓ | FID ↓
w/ 3DGS [18] | 0.245 | 160.783
Ours | 0.241 | 157.970

Peak GPU memory usage in our experiments is 16GB. The quantitative comparison results are shown in Tab.2, and the qualitative comparison of 3D inpainting methods is shown in Fig.5. Note that SPIn-NeRF[37] utilizes LaMa[54] and Infusion[33] utilizes SDXL[42] for inpainting. We observe that the 3D inpainting baselines produce worse results, especially in challenging cases. The results demonstrate that our proposed method achieves better 3D inpainting results in terms of appearance. The geometry of the removed regions is discussed in the supplementary, and video comparisons are also included there. In Tab.2, our proposed method outperforms all baselines in LPIPS and only trails SDXL on FID; yet SDXL does not maintain consistency between different video frames, which is easily observed in the supplementary videos. Infusion[33] suffers from SDXL's pseudo-appearance guidance.

6.2 Further analysis

[Figure 6]

Ablation of different inpainting methods as pseudo labels. We compare the reconstruction results obtained with pseudo labels from different inpainting methods. From Fig.6, we can observe that time reversal maintains the consistency between View 1 and View 2. Current single-image inpainting models, such as LaMa[54] and SDXL[42], fail to maintain consistency across video frames. Although video inpainting models[92] can be temporally consistent over nearby frames, the inpainted regions become blurred because they cannot guarantee 3D consistency. The supplementary videos demonstrate that our method is much more consistent than all baselines.

Ablation of 3D representation. We ablate the 3D representation by comparing the results obtained with 3DGS[18] and 2DGS[17]. From Fig.7, we observe that after the Gaussians are removed, the alpha map rendered with 3DGS fails to produce an ideal inpainting mask. The quantitative results in Tab.2(b) verify the necessity of the 2DGS representation.

[Figure 7]

7 Conclusion

We propose StreetUnveiler, a comprehensive pipeline for reconstructing empty streets from in-car camera videos. Our method represents the street scene using a hard-label semantic-aware 2D Gaussian Splatting[17], allowing us to remove each instance from the scene seamlessly. To create an ideal inpainting mask, we utilize the rendered alpha map after removing unwanted 2DGS. Additionally, we introduce a novel time-reversal inpainting framework that enhances consistency across different viewpoints, facilitating the reconstruction of empty streets. Extensive experiments demonstrate that our method effectively reconstructs empty street scenes and supports free-viewpoint rendering.

References

  • [1]Dragomir Anguelov, Carole Dulong, Daniel Filip, Christian Frueh, Stephane Lafon, Richard Lyon, Abhijit Ogale, Luc Vincent, and Josh Weaver.Google street view: Capturing the world at street level.Computer, 43, 2010.
  • [2]Chong Bao, Yinda Zhang, Bangbang Yang, et al. Sine: Semantic-driven image-based nerf editing with prior-guided editing field. In CVPR, pages 20919–20929, 2023.
  • [3]Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester.Image inpainting.In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424, 2000.
  • [4]Chenjie Cao, Yunuo Cai, Qiaole Dong, Yikai Wang, and Yanwei Fu.Leftrefill: Filling right canvas based on left reference through generalized text-to-image diffusion model.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [5]Chenjie Cao, Qiaole Dong, and Yanwei Fu.Zits++: Image inpainting by improving the incremental transformer on structural priors.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [6]Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su.Tensorf: Tensorial radiance fields.In European Conference on Computer Vision (ECCV), 2022.
  • [7]Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, and Guosheng Lin.Gaussianeditor: Swift and controllable 3d editing with gaussian splatting, 2023.
  • [8]Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, and LiZhang.Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering.arXiv:2311.18561, 2023.
  • [9]Kai Cheng, Xiaoxiao Long, Kaizhi Yang, Yao Yao, Wei Yin, Yuexin Ma, Wenping Wang, and Xuejin Chen.Gaussianpro: 3d gaussian splatting with progressive propagation.arXiv preprint arXiv:2402.14650, 2024.
  • [10]Chong Bao, Bangbang Yang, Junyi Zeng, Hujun Bao, Yinda Zhang, Zhaopeng Cui, and Guofeng Zhang. Neumesh: Learning disentangled neural mesh-based implicit field for geometry and texture editing. In European Conference on Computer Vision (ECCV), 2022.
  • [11]Jiemin Fang, Junjie Wang, Xiaopeng Zhang, Lingxi Xie, and QiTian.Gaussianeditor: Editing 3d gaussians delicately with text instructions, 2023.
  • [12]Xiao Fu, Shangzhan Zhang, Tianrun Chen, Yichong Lu, Lanyun Zhu, Xiaowei Zhou, Andreas Geiger, and Yiyi Liao.Panoptic nerf: 3d-to-2d label transfer for panoptic urban scene segmentation.In Proceedings of the International Conference on 3D Vision (3DV), 2022.
  • [13]Haoyu Guo, Sida Peng, Haotong Lin, Qianqian Wang, Guofeng Zhang, Hujun Bao, and Xiaowei Zhou.Neural 3d scene reconstruction with the manhattan-world assumption.In CVPR, 2022.
  • [14]Jianfei Guo, Nianchen Deng, Xinyang Li, Yeqi Bai, Botian Shi, Chiyu Wang, Chenjing Ding, Dongliang Wang, and Yikang Li.Streetsurf: Extending multi-view implicit surface reconstruction to street views.arXiv preprint arXiv:2306.04988, 2023.
  • [15]Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.Gans trained by a two time-scale update rule converge to a local nash equilibrium.In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6629–6640, Red Hook, NY, USA, 2017. Curran Associates Inc.
  • [16]Yuan-Ting Hu, Heng Wang, Nicolas Ballas, Kristen Grauman, and AlexanderG Schwing.Proposal-based video completion.In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, pages 38–54. Springer, 2020.
  • [17]Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao.2d gaussian splatting for geometrically accurate radiance fields.In SIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024.
  • [18]Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis.3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023.
  • [19]Justin Kerr, ChungMin Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik.Lerf: Language embedded radiance fields.In International Conference on Computer Vision (ICCV), 2023.
  • [20]Dahun Kim, Sanghyun Woo, Joon-Young Lee, and InSo Kweon.Deep video inpainting.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5792–5801, 2019.
  • [21]DiederikP Kingma and Max Welling.Auto-encoding variational bayes, 2022.
  • [22]Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, AlexanderC. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick.Segment anything.In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, October 2023.
  • [23]Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann.Decomposing nerf for editing via feature field distillation.In Advances in Neural Information Processing Systems, volume35, 2022.
  • [24]Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas Guibas, Andrea Tagliasacchi, Frank Dellaert, and Thomas Funkhouser.Panoptic neural fields: A semantic object-aware neural scene representation.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [25]Christoph Lassner and Michael Zollhofer.Pulsar: Efficient sphere-based neural rendering.In CVPR, 2021.
  • [26]Zhen Li, Cheng-Ze Lu, Jianhua Qin, Chun-Le Guo, and Ming-Ming Cheng.Towards an end-to-end framework for flow-guided video inpainting.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17562–17571, 2022.
  • [27]ChiehHubert Lin, Changil Kim, Jia-Bin Huang, Qinbo Li, Chih-Yao Ma, Johannes Kopf, Ming-Hsuan Yang, and Hung-Yu Tseng.Taming latent diffusion model for neural radiance field inpainting.2024.
  • [28]Jiaqi Lin, Zhihao Li, Xiao Tang, Jianzhuang Liu, Shiyong Liu, Jiayue Liu, Yangdi Lu, Xiaofei Wu, Songcen Xu, Youliang Yan, and Wenming Yang.Vastgaussian: Vast 3d gaussians for large scene reconstruction.In CVPR, 2024.
  • [29]Guilin Liu, KevinJ Shih, Ting-Chun Wang, FitsumA Reda, Karan Sapra, Zhiding Yu, Andrew Tao, and Bryan Catanzaro.Partial convolution based padding.Arxiv, 2018.
  • [30]Hongyu Liu, Bin Jiang, YiXiao, and Chao Yang.Coherent semantic attention for image inpainting.In Proceedings of the IEEE/CVF international conference on computer vision, pages 4170–4179, 2019.
  • [31]Qiankun Liu, Zhentao Tan, Dongdong Chen, QiChu, Xiyang Dai, Yinpeng Chen, Mengchen Liu, LuYuan, and Nenghai Yu.Reduce information loss in transformers for pluralistic image inpainting.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11347–11357, 2022.
  • [32]Steven Liu, Xiuming Zhang, Zhoutong Zhang, Richard Zhang, Jun-Yan Zhu, and Bryan Russell.Editing conditional radiance fields, 2021.
  • [33]Zhiheng Liu, Hao Ouyang, Qiuyu Wang, KaLeong Cheng, Jie Xiao, Kai Zhu, Nan Xue, YuLiu, Yujun Shen, and Yang Cao.Infusion: Inpainting 3d gaussians via learning depth completion from diffusion prior.arXiv preprint arXiv:2404.11613, 2024.
  • [34]Fan Lu, Yan Xu, Guang Chen, Hongsheng Li, Kwan-Yee Lin, and Changjun Jiang.Urban radiance field representation with deformable neural mesh primitives.In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • [35]Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, MinH Kim, and Johannes Kopf.Progressively optimized local radiance fields for robust view synthesis.In Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, pages 16539–16548, 2023.
  • [36]Ben Mildenhall, PratulP. Srinivasan, Matthew Tancik, JonathanT. Barron, Ravi Ramamoorthi, and Ren Ng.Nerf: Representing scenes as neural radiance fields for view synthesis.In European Conference on Computer Vision (ECCV), 2020.
  • [37]Ashkan Mirzaei, Tristan Aumentado-Armstrong, KonstantinosG. Derpanis, Jonathan Kelly, MarcusA. Brubaker, Igor Gilitschenski, and Alex Levinshtein.SPIn-NeRF: Multiview segmentation and perceptual inpainting with neural radiance fields.In CVPR, 2023.
  • [38]Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller.Instant neural graphics primitives with a multiresolution hash encoding.ACM Trans. Graph., 41(4):102:1–102:15, July 2022.
  • [39]Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and AlexeiA Efros.Context encoders: Feature learning by inpainting.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  • [40]Songyou Peng, Kyle Genova, Chiyu"Max" Jiang, Andrea Tagliasacchi, Marc Pollefeys, and Thomas Funkhouser.Openscene: 3d scene understanding with open vocabularies.2023.
  • [41]Julien Philip and George Drettakis.Plane-based multi-view inpainting for image-based rendering in large scenes.In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D), 2018.
  • [42]Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023.
  • [43]Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger.Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps.In International Conference on Computer Vision (ICCV), 2021.
  • [44]Konstantinos Rematas, Andrew Liu, PratulP. Srinivasan, JonathanT. Barron, Andrea Tagliasacchi, Tom Funkhouser, and Vittorio Ferrari.Urban radiance fields.CVPR, 2022.
  • [45]Jingjing Ren, Qingqing Zheng, Yuanyuan Zhao, Xuemiao Xu, and Chen Li.Dlformer: Discrete latent transformer for video inpainting.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3511–3520, 2022.
  • [46]Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and BoDai.Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians, 2024.
  • [47]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.High-resolution image synthesis with latent diffusion models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [48]Sara Fridovich-Keil and Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa.Plenoxels: Radiance fields without neural networks.In CVPR, 2022.
  • [49]Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [50]Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
  • [51]Yawar Siddiqui, Lorenzo Porzi, SamuelRota Bulò, Norman Müller, Matthias Nießner, Angela Dai, and Peter Kontschieder.Panoptic lifting for 3d scene understanding with neural fields.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9043–9052, June 2023.
  • [52]Cheng Sun, Min Sun, and Hwann-Tzong Chen.Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction.CVPR, 2022.
  • [53]Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, YuZhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov.Scalability in perception for autonomous driving: Waymo open dataset.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [54]Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor S. Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3172–3182, 2021.
  • [55]Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul Srinivasan, JonathanT. Barron, and Henrik Kretzschmar.Block-NeRF: Scalable large scene neural view synthesis.arXiv, 2022.
  • [56]Theo Thonat, Eli Shechtman, Sylvain Paris, and George Drettakis.Multi-view inpainting for image-based scene editing and rendering.In Proceedings of the International Conference on 3D Vision (3DV), 2016.
  • [57]Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan.Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs.In CVPR, pages 12922–12931, June 2022.
  • [58]Haithem Turki, JasonY Zhang, Francesco Ferroni, and Deva Ramanan.Suds: Scalable urban dynamic scenes.In Computer Vision and Pattern Recognition (CVPR), 2023.
  • [59]Ziyu Wan, Jingbo Zhang, Dongdong Chen, and Jing Liao.High-fidelity pluralistic image completion with transformers.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4692–4701, 2021.
  • [60]Chuan Wang, Haibin Huang, Xiaoguang Han, and Jue Wang.Video inpainting by jointly learning temporal structure and spatial details.In Proceedings of the AAAI conference on artificial intelligence, volume33, pages 5232–5239, 2019.
  • [61]Dongqing Wang, Tong Zhang, Alaa Abboud, and Sabine Susstrunk.Inpaintnerf360: Text-guided 3d inpainting on unbounded neural radiance fields.arXiv, 2023.
  • [62]Peng Wang, Yuan Liu, Zhaoxi Chen, Lingjie Liu, Ziwei Liu, Taku Komura, Christian Theobalt, and Wenping Wang.F2-nerf: Fast neural radiance field training with free camera trajectories.CVPR, 2023.
  • [63]Yifan Wang, Andrew Liu, Richard Tucker, Jiajun Wu, BrianL. Curless, StevenM. Seitz, and Noah Snavely.Repopulating street scenes.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [64]Yikai Wang, Chenjie Cao, Ke Fan, Xiangyang Xue, and Yanwei Fu. Towards context-stable and visual-consistent image inpainting, 2024.
  • [65]Yuxin Wang, Qianyi Wu, Guofeng Zhang, and Dan Xu.Gscream: Learning 3d geometry and feature consistent gaussian splatting for object removal.arXiv preprint arXiv:2404.13679, 2024.
  • [66]Zian Wang, Tianchang Shen, Jun Gao, Shengyu Huang, Jacob Munkberg, Jon Hasselgren, Zan Gojcic, Wenzheng Chen, and Sanja Fidler.Neural fields meet explicit geometric representations for inverse rendering of urban scenes.In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
  • [67]Ethan Weber, Aleksander Holynski, Varun Jampani, Saurabh Saxena, Noah Snavely, Abhishek Kar, and Angjoo Kanazawa.Nerfiller: Completing scenes via generative 3d inpainting.In CVPR, 2024.
  • [68]Silvan Weder, Guillermo Garcia-Hernando, Aron Monszpart, Marc Pollefeys, Gabriel Brostow, Michael Firman, and Sara Vicente.Removing objects from neural radiance fields.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [69]Tiange Xiang, Adam Sun, Jiajun Wu, Ehsan Adeli, and Fei-Fei Li.Rendering humans from object-occluded monocular videos.In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023.
  • [70]Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, JoseM Alvarez, and Ping Luo.Segformer: Simple and efficient design for semantic segmentation with transformers.In Neural Information Processing Systems (NeurIPS), 2021.
  • [71]Linning Xu, Yuanbo Xiangli, Sida Peng, Xingang Pan, Nanxuan Zhao, Christian Theobalt, BoDai, and Dahua Lin.Grid-guided neural radiance fields for large urban scenes.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [72]Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann.Point-nerf: Point-based neural radiance fields.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5438–5448, 2022.
  • [73]Rui Xu, Xiaoxiao Li, Bolei Zhou, and ChenChange Loy.Deep flow-guided video inpainting.In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [74]Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng.Street gaussians for modeling dynamic urban scenes.2023.
  • [75]Bangbang Yang, Wenqi Dong, Lin Ma, Wenbo Hu, Xiao Liu, Zhaopeng Cui, and Yuewen Ma.Dreamspace: Dreaming your room space with text-driven panoramic texture propagation.2023.
  • [76]ZeYang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, AnqiJoyce Yang, and Raquel Urtasun.Unisim: A neural closed-loop sensor simulator.In CVPR, 2023.
  • [77]Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke.Gaussian grouping: Segment and edit anything in 3d scenes, 2023.
  • [78]Zili Yi, Qiang Tang, Shekoofeh Azizi, Daesik Jang, and Zhan Xu.Contextual residual aggregation for ultra high-resolution image inpainting.In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pages 7508–7517, 2020.
  • [79]Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and ThomasS Huang.Generative image inpainting with contextual attention.In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5505–5514, 2018.
  • [80]Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and ThomasS Huang.Free-form image inpainting with gated convolution.In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4471–4480, 2019.
  • [81]Yu-Jie Yuan, Yang-Tian Sun, Yu-Kun Lai, Yuewen Ma, Rongfei Jia, and Lin Gao.Nerf-editing: Geometry editing of neural radiance fields.In CVPR, 2022.
  • [82]Zefeng Yuan, Hengyu Li, Jingyi Liu, and Jun Luo.Multiview scene image inpainting based on conditional generative adversarial networks.IEEE Transactions on Intelligent Vehicles, 5(2), June 2020.
  • [83]Kai Zhang, Nick Kolkin, Sai Bi, Fujun Luan, Zexiang Xu, Eli Shechtman, and Noah Snavely.Arf: Artistic radiance fields, 2022.
  • [84]Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.Adding conditional control to text-to-image diffusion models.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • [85]Richard Zhang, Phillip Isola, AlexeiA. Efros, Eli Shechtman, and Oliver Wang.The unreasonable effectiveness of deep features as a perceptual metric.In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  • [86]Xiaoshuai Zhang, Abhijit Kundu, Thomas Funkhouser, Leonidas Guibas, Hao Su, and Kyle Genova.Nerflets: Local radiance fields for efficient structure-aware 3d scene representation from 2d supervision.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [87]Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I-Chao Chang, and Yan Xu. Large scale image completion via co-modulated generative adversarial networks. In International Conference on Learning Representations, 2020.
  • [88]Yiqun Zhao, Zibo Zhao, Jing Li, Sixun Dong, and Shenghua Gao.Roomdesigner: Encoding anchor-latents for style-consistent and shape-compatible indoor scene generation.In Proceedings of the International Conference on 3D Vision (3DV), 2024.
  • [89]Zibo Zhao, Wen Liu, Yanyu Xu, Xianing Chen, Weixin Luo, Lei Jin, Bohui Zhu, Tong Liu, Binqiang Zhao, and Shenghua Gao.Prior based human completion.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7951–7961, 2021.
  • [90]Hongyu Zhou, Jiahao Shao, LuXu, Dongfeng Bai, Weichao Qiu, Bingbing Liu, Yue Wang, Andreas Geiger, and Yiyi Liao.Hugs: Holistic urban 3d scene understanding via gaussian splatting, 2024.
  • [91]Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun.Open3D: A modern library for 3D data processing.arXiv:1801.09847, 2018.
  • [92]Shangchen Zhou, Chongyi Li, KelvinC.K Chan, and ChenChange Loy.ProPainter: Improving propagation and transformer for video inpainting.In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2023.
  • [93]Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi.Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields.arXiv preprint arXiv:2312.03203, 2023.
  • [94]Xueyan Zou, Linjie Yang, Ding Liu, and YongJae Lee.Progressive temporal feature alignment network for video inpainting.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16448–16457, 2021.

Appendix A Implementation details

A.1 Details of hard-label semantic 2DGS reconstruction

Initialization with Lidar points. High-quality appearance and semantic reconstruction of the whole street scene is hard to achieve with only SfM points[49, 50] as initialization. We therefore leverage Lidar points to better reconstruct the street scene, as in [74, 8, 90]. We use an off-the-shelf 2D semantic segmenter[70] to process the 2D images and back-project the hard semantic labels onto the 2D Gaussians.

Environment map for street reconstruction. We empirically find that most 2D Gaussians' opacity becomes larger than 0.9 or lower than 0.1, which leads to imperfect reconstruction of the background environment, i.e., the sky. To better model the environment in the street scene, we employ a tiny MLP $f$ to query the color of the environment map, similar to [14, 58]. The queried environment color at $\mathbf{x}$ is denoted as $\mathbf{c}_{\text{env}}$. The final color of the ray is obtained by blending the color of the 2DGS projection with the environment map as follows:

\[
\mathbf{c}_{\text{env}}(\mathbf{x}) = f(\mathbf{M}, \mathbf{x}), \qquad
\mathbf{c}_{\text{final}}(\mathbf{x}) = \mathbf{c}(\mathbf{x}) + \bigl(1 - \alpha(\mathbf{x})\bigr)\,\mathbf{c}_{\text{env}}(\mathbf{x})
\tag{11}
\]

where $\mathbf{M}$ denotes the projection matrix from world coordinates to pixel coordinates, and $\alpha(\mathbf{x})$ is the rendered alpha map of the 2DGS rendering.
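As a concrete illustration of Eq. (11), the sketch below blends the 2DGS rendering with the environment color predicted by a tiny MLP. The network width and the use of per-ray directions as the MLP input (derived from $\mathbf{M}$ and the pixel coordinates $\mathbf{x}$) are assumptions made for this example.

```python
import torch
import torch.nn as nn

class EnvironmentMap(nn.Module):
    """Tiny MLP predicting a per-ray background color c_env (cf. Eq. (11))."""

    def __init__(self, in_dim: int = 3, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, ray_dirs: torch.Tensor) -> torch.Tensor:
        # ray_dirs: (N, 3) unit ray directions computed from the projection
        # matrix M and the pixel coordinates x (assumed parameterization).
        return self.mlp(ray_dirs)

def composite_with_env(rendered_rgb, rendered_alpha, env_rgb):
    """c_final(x) = c(x) + (1 - alpha(x)) * c_env(x), applied per pixel."""
    return rendered_rgb + (1.0 - rendered_alpha) * env_rgb
```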

Details of two-stage reconstruction training.

The optimization of our 2DGS reconstruction for street scenes contains two stages. (1) In the first stage, we employ the adaptive density control of 3DGS following [18], and $\mathcal{L}_{\text{d}}$, $\mathcal{L}_{\text{n}}$ and $\mathcal{L}_{\text{ds}}$ are deactivated to reach a more stable initialization of the 2DGS reconstruction. (2) In the second stage, $\mathcal{L}_{\text{d}}$, $\mathcal{L}_{\text{n}}$ and $\mathcal{L}_{\text{ds}}$ are activated. Empirically, most 2D Gaussians' opacity ends up above 0.9 or below 0.1, and noisy Gaussians with wrong semantic labels are driven toward low opacity by $\mathcal{L}_{\text{ds}}$. We prune the Gaussians with opacity lower than a threshold $\epsilon$ to further eliminate noisy semantics in the 3D world, with $\epsilon$ set to 0.3 in our experiments.
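A minimal sketch of the stage-dependent loss activation and the opacity-based pruning is given below. The attribute names (e.g. `opacity_logit`) and the unit loss weights are hypothetical and only serve to illustrate the schedule.

```python
import torch

# Epsilon from the paper: Gaussians whose opacity falls below this are pruned in stage 2.
PRUNE_EPS = 0.3

def stage_loss_weights(stage: int) -> dict:
    """Stage 1 trains with photometric supervision only; stage 2 additionally
    activates the depth (L_d), normal (L_n) and semantic (L_ds) terms."""
    if stage == 1:
        return {"rgb": 1.0, "depth": 0.0, "normal": 0.0, "semantic": 0.0}
    return {"rgb": 1.0, "depth": 1.0, "normal": 1.0, "semantic": 1.0}  # weights assumed

def prune_low_opacity(params: dict, eps: float = PRUNE_EPS) -> dict:
    """Remove Gaussians whose sigmoid-activated opacity is below eps.

    `params` maps attribute names (e.g. 'xyz', 'opacity_logit', 'label') to
    per-Gaussian tensors that share the same first dimension."""
    keep = torch.sigmoid(params["opacity_logit"]) >= eps
    return {name: tensor[keep] for name, tensor in params.items()}
```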

A.2 Details of time-reversal inpainting framework

As mentioned in [64], when using a latent-diffusion-based inpainting model, repeatedly encoding and decoding images with the KL-VAE [21, 47] introduces non-negligible shifts in the low-frequency components. Our method can be summarized as inpainting frame $T_i$ with $T_{i+1}$ as a reference/condition through LeftRefill [4], which is latent-diffusion-based. For a whole video sequence, if we simply inpaint every $T_i$ iteratively with $T_{i+1}$ as reference/condition, these low-frequency shifts are amplified by the KL-VAE and severely harm the quality of our 2D inpainting guidance. To alleviate this shift caused by the KL-VAE of the latent diffusion model [47], we first select keyframes in the video and then apply time-reversal inpainting to the keyframes iteratively in reversed time order, instead of inpainting every frame.

We first time-reversal inpaint the keyframes at timestamps $\{T_{k_1}, \ldots, T_{k_n}\}$. After inpainting all the keyframes in reversed time order, we generate the intermediate frames between keyframe $T_{k_{i+1}}$ and keyframe $T_{k_i}$ with keyframe $T_{k_i}$ as the reference image. Once the whole image sequence of the street scene has been inpainted, we use these results as pseudo labels to re-optimize the 2DGS of the empty street scene.
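The schedule can be sketched as follows, assuming a generic `inpaint_with_reference` wrapper around a reference-based inpainting model such as LeftRefill. How the last keyframe is handled and which already-inpainted keyframe conditions the intermediate frames follow the later-frames-reference-earlier-frames principle; read this as an illustrative interpretation rather than the exact implementation.

```python
def time_reversal_inpaint(frames, masks, keyframe_ids, inpaint_with_reference):
    """Sketch of the time-reversal inpainting schedule.

    frames, masks:          dicts keyed by timestamp; masks mark regions to inpaint.
    keyframe_ids:           keyframe timestamps sorted in increasing (forward) time.
    inpaint_with_reference: callable (rgb, mask, reference_rgb_or_None) -> inpainted rgb.
    """
    results = {}

    # The final keyframe has no later reference and is inpainted unconditionally
    # (assumption: the same model is used without a reference image here).
    last = keyframe_ids[-1]
    results[last] = inpaint_with_reference(frames[last], masks[last], None)

    # Keyframes are processed in reversed time order: each one is conditioned on
    # the already-inpainted, later keyframe (later frames serve as references
    # for earlier frames).
    for earlier, later in zip(reversed(keyframe_ids[:-1]), reversed(keyframe_ids[1:])):
        results[earlier] = inpaint_with_reference(frames[earlier], masks[earlier],
                                                  results[later])

    # Intermediate frames between two keyframes reuse an already-inpainted keyframe
    # as reference instead of chaining frame-by-frame, which limits the repeated
    # KL-VAE encode/decode passes that cause low-frequency drift.
    for earlier, later in zip(keyframe_ids[:-1], keyframe_ids[1:]):
        for t in range(earlier + 1, later):
            results[t] = inpaint_with_reference(frames[t], masks[t], results[later])
    return results
```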

Appendix B More experiments

B.1 Ablation of time-forward inpainting and time-reversal inpainting

To further validate the effectiveness of time-reversal inpainting, we conduct an additional ablation against time-forward inpainting, i.e., the reversed version of our proposed time-reversal inpainting. As shown in Tab. 3, our time-reversal inpainting achieves better quantitative results than time-forward inpainting.

For our time-reversal inpainting, we inpaint frame $T_n$ with $T_{n+1}$ as reference. For time-forward inpainting, frame $T_{n+1}$ is inpainted with frame $T_n$ as reference. Fig. 8 illustrates the two procedures in detail. The qualitative comparison in Fig. 9 showcases the high-to-low-resolution nature of time-reversal inpainting, which enhances the quality of the results.

[Figure 8: Illustration of time-forward and time-reversal inpainting.]
[Figure 9: Qualitative comparison of time-forward and time-reversal inpainting.]

B.2 Ablation of hard semantic label

We additionally ablate the effectiveness of the hard semantic label. From Fig. 10, we can observe that both the 2DGS representation and the hard semantic label contribute to a more stable reconstruction of the semantic field.

The comparison between (a) and (b) demonstrates that the use of hard semantic labels effectively reduces noise within the semantic fields. In addition, the comparison between (a) and (c) indicates that the 2DGS representation leads to more stable semantic fields. Finally, (d) illustrates the clean and stable semantic field achieved by employing hard-label semantic 2DGS in our method.

Table 3: Quantitative comparison between time-forward and time-reversal inpainting.

Method                              LPIPS ↓    FID ↓
Time-Forward Inpainting             0.249      160.666
Time-Reversal Inpainting (Ours)     0.241      157.970

By reconstructing a clean and stable semantic field of the street scene, we can more accurately identify the Gaussians that need to be removed. This facilitates obtaining a high-quality 2D inpainting result, which serves as effective guidance for re-optimization.

[Figure 10: Ablation of the hard semantic label and the 2DGS representation on the semantic field.]

B.3 Comparison of geometry performance

Since we aim to reconstruct the empty street, we also compare the geometric quality of our method in addition to its appearance. From Fig. 11, Fig. 12, Fig. 13, and Fig. 14, we can observe that our method produces both better appearance and better geometry, as shown by the rendered RGB and normal images.

[Figures 11–15: Qualitative comparisons of rendered RGB and normal images.]

Appendix C Empty street scene mesh extraction

We can further extract a mesh of our reconstructed empty street scene using TSDF fusion, following 2DGS [17], with Open3D [91]. In Fig. 16 and Fig. 17, we compare the extracted colored mesh before and after our unveiling. Our inpainting framework successfully removes the unwanted cars from the street and finally reconstructs an empty street.
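A minimal sketch of this extraction step with Open3D's TSDF integration is shown below, assuming rendered RGB and depth maps of the unveiled street together with shared camera intrinsics and per-frame extrinsics. The voxel size, truncation distance, and depth range are illustrative values, not the exact settings used in our experiments.

```python
import open3d as o3d

def extract_street_mesh(rgbs, depths, intrinsic, extrinsics,
                        voxel_length=0.1, sdf_trunc=0.4, depth_trunc=80.0):
    """Fuse rendered RGB-D frames of the unveiled street into a TSDF volume
    and extract a colored triangle mesh.

    rgbs:       list of (H, W, 3) uint8 renderings of the empty street.
    depths:     list of (H, W) float32 depth maps in meters.
    intrinsic:  o3d.camera.PinholeCameraIntrinsic shared by all frames.
    extrinsics: list of (4, 4) world-to-camera matrices.
    """
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_length, sdf_trunc=sdf_trunc,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

    for rgb, depth, extrinsic in zip(rgbs, depths, extrinsics):
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(rgb), o3d.geometry.Image(depth),
            depth_scale=1.0, depth_trunc=depth_trunc,
            convert_rgb_to_intensity=False)
        volume.integrate(rgbd, intrinsic, extrinsic)

    mesh = volume.extract_triangle_mesh()
    mesh.compute_vertex_normals()
    o3d.io.write_triangle_mesh("empty_street_mesh.ply", mesh)
    return mesh
```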

[Figures 16–17: Extracted colored meshes before and after unveiling.]

Appendix D Limitations

Although our proposed method can reconstruct the scene without unwanted static objects, it is not without limitations. (1) Our method depends on the precision of the 2D segmentation model to reconstruct a reliable semantic field and identify the Gaussians to remove; failures of the 2D semantic segmentation lead to low-quality Gaussian removal. Street unveiling without pseudo-semantic guidance would be an important step towards a more robust solution. (2) The consistency of our inpainting assumes that the reference-based inpainting model is sufficiently capable of matching content across views, so the pre-trained reference-based inpainting model may also be a bottleneck of our method. (3) Our method's computational cost grows linearly with the number of video frames, because every frame needs to be inpainted.

Appendix E Societal impacts

The technology can potentially distort the representation of public spaces in urban planning, leading to flawed decision-making. At the same time, it may also be misused to alter the representation of important archaeological sites in digital reconstructions, leading to misinformation and misunderstanding of historical facts.
