Point3R: Streaming 3D Reconstruction with
Explicit Spatial Pointer Memory
Overview of our contributions. We propose Point3R, an online framework for dense streaming 3D reconstruction. Given streaming image inputs, our method maintains an explicit spatial pointer memory in which each pointer is assigned a 3D position and points to a continuously updated spatial feature. We perform pointer-image interaction to integrate new observations into the global coordinate system and update the spatial pointer memory accordingly. Our method achieves competitive or state-of-the-art performance across a range of tasks: dense 3D reconstruction, monocular and video depth estimation, and camera pose estimation.
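To make the memory structure concrete, below is a minimal PyTorch sketch of what such a pointer memory could look like, assuming a flat tensor layout; the class and attribute names (SpatialPointerMemory, positions, features) are our own illustrative stand-ins, not the authors' implementation.

```python
import torch

class SpatialPointerMemory:
    """Each pointer couples a fixed 3D position with a mutable spatial feature."""

    def __init__(self, feat_dim: int, device: str = "cpu"):
        self.positions = torch.empty(0, 3, device=device)        # (N, 3) pointer positions
        self.features = torch.empty(0, feat_dim, device=device)  # (N, C) pointed-to features

    def add(self, new_positions: torch.Tensor, new_features: torch.Tensor) -> None:
        # Append pointers created from a newly integrated frame.
        self.positions = torch.cat([self.positions, new_positions], dim=0)
        self.features = torch.cat([self.features, new_features], dim=0)

    def update(self, indices: torch.Tensor, updated_features: torch.Tensor) -> None:
        # Positions stay fixed; only the features the pointers point to change.
        self.features[indices] = updated_features

# Example: seed the memory with 128 pointers of 256-d features.
mem = SpatialPointerMemory(feat_dim=256)
mem.add(torch.rand(128, 3), torch.rand(128, 256))
```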
Given streaming image inputs, our method maintains an explicit spatial pointer memory that stores information observed about the current scene. A ViT encoder encodes the current input into image tokens, and ViT-based decoders perform the interaction between these image tokens and the spatial features in memory. Two DPT heads decode local and global pointmaps from the output image tokens. In addition, a learnable pose token is added at this stage, allowing us to directly decode the camera parameters of the current frame. Finally, a simple memory encoder encodes the current input and its integrated output into new pointers, and a memory fusion mechanism enriches and updates the spatial pointer memory.
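The following is a schematic sketch of a single streaming step as described above, with placeholder modules (plain linear layers and one cross-attention layer) standing in for the ViT encoder, ViT-based decoders, DPT heads, and memory encoder; all module names and the 7-DoF pose parameterization are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class StreamingStep(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)            # stands in for the ViT encoder
        self.interaction = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.local_head = nn.Linear(dim, 3)           # stands in for the local-pointmap DPT head
        self.global_head = nn.Linear(dim, 3)          # stands in for the global-pointmap DPT head
        self.pose_head = nn.Linear(dim, 7)            # assumed quaternion + translation pose
        self.pose_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.memory_encoder = nn.Linear(dim, dim)     # encodes outputs into new pointers

    def forward(self, image_tokens: torch.Tensor, mem_features: torch.Tensor):
        # image_tokens: (B, T, C); mem_features: (B, N, C) spatial features in memory.
        tokens = self.encoder(image_tokens)
        # Prepend the learnable pose token so camera parameters can be decoded directly.
        tokens = torch.cat([self.pose_token.expand(tokens.size(0), -1, -1), tokens], dim=1)
        # Pointer-image interaction: image (and pose) tokens attend to memory features.
        fused, _ = self.interaction(tokens, mem_features, mem_features)
        pose = self.pose_head(fused[:, 0])            # camera parameters of the current frame
        out_tokens = fused[:, 1:]
        local_pts = self.local_head(out_tokens)       # pointmap in the camera coordinate frame
        global_pts = self.global_head(out_tokens)     # pointmap in the global coordinate system
        new_pointers = self.memory_encoder(out_tokens)
        return local_pts, global_pts, pose, new_pointers
```

The fusion step that merges new_pointers into the existing memory is omitted here, since the text above only names it as a memory fusion mechanism without specifying its form; one plausible realization would merge pointers whose 3D positions fall within a distance threshold.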
@article{point3r,
  title={Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory},
  author={Yuqi Wu and Wenzhao Zheng and Jie Zhou and Jiwen Lu},
  journal={arXiv preprint arXiv:2507.02863},
  year={2025}
}