EmbodiedOcc: Embodied 3D Occupancy Prediction for
Vision-based Online Scene Understanding
Overview of our contributions. Targeting progressive embodied exploration of indoor scenes, we formulate an embodied 3D occupancy prediction task and propose a Gaussian-based EmbodiedOcc framework to address it. Our EmbodiedOcc maintains an explicit Gaussian memory of the current scene and updates this memory as the agent explores the scene. Both quantitative and qualitative results show that our EmbodiedOcc outperforms existing methods on local occupancy prediction and accomplishes the embodied occupancy prediction task with high accuracy and strong expandability.
We represent an indoor scene with a set of 3D semantic Gaussians and update this Gaussian-based representation according to semantic and structural features extracted from the input image. In the local occupancy prediction module, a depth-aware branch provides local structural information to guide the update of each Gaussian. Along a given camera ray, Gaussians located in front of the true depth point are likely to model empty space (as Gaussian A). Gaussians located just behind the true depth point are likely to model valid semantics (as Gaussian B). Gaussians located far behind the true depth point require more information to guide their updates (as Gaussian C).
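The depth-aware classification along a ray can be sketched as follows. This is a minimal illustrative helper, not the paper's implementation; the function name and the `margin` threshold separating "just behind" (Gaussian B) from "far behind" (Gaussian C) are assumptions made for the example.

```python
import numpy as np

def classify_gaussians_by_depth(gaussian_depths, true_depth, margin=0.5):
    """Classify Gaussians along a ray relative to the observed true depth.

    Hypothetical sketch of the depth-aware branch's intuition:
    - 'empty'  : in front of the true depth point (Gaussian A)
    - 'valid'  : closely behind the true depth point (Gaussian B)
    - 'unknown': far behind; needs more information (Gaussian C)
    """
    gaussian_depths = np.asarray(gaussian_depths, dtype=float)
    return np.where(
        gaussian_depths < true_depth, "empty",
        np.where(gaussian_depths <= true_depth + margin, "valid", "unknown"),
    )
```

In practice the branch supplies continuous structural features rather than hard labels, but the three regimes above capture which Gaussians should be pushed toward empty, refined, or left for later evidence.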
We initialize the global scene with uniform 3D semantic Gaussians and progressively update the local regions observed by the embodied agent. During each update, the Gaussians within the current frustum are taken from the memory and updated according to their confidence values. Confidence values of well-updated Gaussians are set to a value between 0 and 1, while the others are set to 0. The former then receive only slight updates, while the latter are updated substantially.
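One simple way to realize this confidence-gated behavior is to scale the predicted parameter update by one minus the confidence, so confident (well-updated) Gaussians change only slightly and fresh Gaussians take nearly the full update. This is a sketch under that assumption; the function and blending rule are illustrative, not the paper's exact formulation.

```python
import numpy as np

def confidence_gated_update(params, delta, confidence):
    """Blend a predicted update into Gaussian parameters.

    params     : (N, D) current Gaussian parameters in the frustum
    delta      : (N, D) update predicted from the current observation
    confidence : (N,) values in [0, 1]; 0 for never-updated Gaussians

    Hypothetical rule: high confidence -> slight update,
    zero confidence -> full update.
    """
    confidence = np.asarray(confidence)[:, None]
    return params + (1.0 - confidence) * delta
```

For example, a Gaussian with confidence 0.8 absorbs only 20% of the predicted change, while a freshly initialized Gaussian with confidence 0 absorbs all of it.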
We maintain an explicit global memory of 3D Gaussians during the exploration of the current scene. For each update, the Gaussians within the frustum are taken from the memory and updated using semantic and structural features extracted from the monocular RGB input. Each Gaussian carries a confidence value that determines the degree of this update. The updated Gaussians are then detached and written back into the memory. Throughout the continuous exploration, the current 3D occupancy prediction can be obtained at any time via a Gaussian-to-voxel splatting module.
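The Gaussian-to-voxel step can be pictured as follows: each voxel center accumulates the semantic logits of nearby Gaussians, weighted by the Gaussian density evaluated at that center, and the per-voxel argmax yields the occupancy label. This is a heavily simplified isotropic sketch for intuition only; the actual module uses full Gaussian covariances and an efficient CUDA implementation, and all names here are assumptions.

```python
import numpy as np

def gaussian_to_voxel_splat(means, scales, logits, grid_min, voxel_size, grid_shape):
    """Splat semantic Gaussians onto a voxel grid (isotropic sketch).

    means   : (N, 3) Gaussian centers
    scales  : (N,)   isotropic standard deviations (simplification)
    logits  : (N, C) per-Gaussian semantic logits
    Returns a (D, H, W) grid of per-voxel semantic class indices.
    """
    D, H, W = grid_shape
    idx = np.stack(np.meshgrid(np.arange(D), np.arange(H), np.arange(W),
                               indexing="ij"), axis=-1).reshape(-1, 3)
    centers = grid_min + (idx + 0.5) * voxel_size          # voxel centers (V, 3)
    acc = np.zeros((centers.shape[0], logits.shape[1]))
    for mu, s, l in zip(means, scales, logits):
        # Unnormalized isotropic Gaussian density at each voxel center.
        w = np.exp(-0.5 * np.sum(((centers - mu) / s) ** 2, axis=1))
        acc += w[:, None] * l
    return acc.argmax(axis=1).reshape(grid_shape)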
@article{wu2024embodiedoccembodied3doccupancy,
  title={EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding},
  author={Yuqi Wu and Wenzhao Zheng and Sicheng Zuo and Yuanhui Huang and Jie Zhou and Jiwen Lu},
  journal={arXiv preprint arXiv:2412.04380},
  year={2024}
}