Point3R: Streaming 3D Reconstruction with
Explicit Spatial Pointer Memory

Yuqi Wu*, Wenzhao Zheng*, †, Jie Zhou, Jiwen Lu
Tsinghua University
[Paper (arXiv)]     [Code (GitHub)]
*Equal contribution. †Project Leader.

Overview of our contributions. We propose Point3R, an online framework for dense streaming 3D reconstruction. Given streaming image inputs, our method maintains an explicit spatial pointer memory in which each pointer is assigned a 3D position and points to a continuously updated spatial feature. Pointer-image interaction integrates each new observation into the global coordinate system and updates the spatial pointer memory accordingly. Our method achieves competitive or state-of-the-art performance across various tasks: dense 3D reconstruction, monocular and video depth estimation, and camera pose estimation.
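The pointer memory can be pictured as a growing set of (position, feature) pairs. Below is a minimal, hypothetical PyTorch sketch of such a structure; the class name, shapes, and methods are our own illustration, not the released implementation.

    import torch

    class SpatialPointerMemory:
        # Hypothetical sketch: each pointer couples a fixed 3D position with a
        # feature vector that is refreshed as new frames are integrated.
        def __init__(self, feat_dim: int = 768):
            self.positions = torch.empty(0, 3)        # (N, 3) global 3D positions
            self.features = torch.empty(0, feat_dim)  # (N, C) features the pointers point to

        def add(self, positions: torch.Tensor, features: torch.Tensor) -> None:
            # Append pointers for newly observed regions of the scene.
            self.positions = torch.cat([self.positions, positions], dim=0)
            self.features = torch.cat([self.features, features], dim=0)

        def update(self, indices: torch.Tensor, features: torch.Tensor) -> None:
            # Refresh existing pointers after pointer-image interaction.
            self.features[indices] = features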

Overall Framework of Point3R


Given streaming image inputs, our method maintains an explicit spatial pointer memory that stores the information observed so far in the current scene. A ViT encoder encodes the current input into image tokens, and ViT-based decoders perform interaction between these image tokens and the spatial features in the memory. Two DPT heads decode local and global pointmaps from the output image tokens. In addition, a learnable pose token is added at this stage so that the camera parameters of the current frame can be decoded directly. A simple memory encoder then encodes the current input and its integrated output into new pointers, and a memory fusion mechanism enriches and updates the spatial pointer memory.
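To make the data flow concrete, here is a hedged per-frame sketch of the pipeline described above. All module and attribute names (encoder, decoder, head_local, head_global, pose_head, memory_encoder, fuse, pose_token) are hypothetical stand-ins, not the released API.

    import torch

    def process_frame(image: torch.Tensor, memory, model):
        # One streaming step: encode, interact with memory, decode, update memory.
        tokens = model.encoder(image)                    # ViT image tokens (B, N, C)
        pose_token = model.pose_token.expand(tokens.shape[0], -1, -1)
        tokens = torch.cat([pose_token, tokens], dim=1)  # prepend learnable pose token

        # ViT-based decoder: image tokens attend to spatial features in memory.
        tokens = model.decoder(tokens, memory.features)

        pose = model.pose_head(tokens[:, 0])             # camera parameters of this frame
        local_pts = model.head_local(tokens[:, 1:])      # DPT head: camera-frame pointmap
        global_pts = model.head_global(tokens[:, 1:])    # DPT head: global-frame pointmap

        # Encode the current input and its integrated output into new pointers,
        # then fuse them into the spatial pointer memory.
        new_pos, new_feat = model.memory_encoder(image, global_pts, tokens[:, 1:])
        memory = model.fuse(memory, new_pos, new_feat)
        return local_pts, global_pts, pose, memory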

Results


3D Reconstruction


We evaluate 3D reconstruction performance on the 7-Scenes and NRGBD datasets. Our method achieves results comparable to or better than other memory-based online approaches.
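For reference, reconstruction quality on these benchmarks is commonly summarized by accuracy and completeness between predicted and ground-truth point clouds; the sketch below (our assumption, using SciPy's KD-tree) shows the usual definitions rather than the paper's exact protocol.

    import numpy as np
    from scipy.spatial import cKDTree

    def acc_comp(pred_pts: np.ndarray, gt_pts: np.ndarray):
        # Accuracy: mean distance from each predicted point to its nearest GT point.
        acc = cKDTree(gt_pts).query(pred_pts)[0].mean()
        # Completeness: mean distance from each GT point to its nearest prediction.
        comp = cKDTree(pred_pts).query(gt_pts)[0].mean()
        return acc, comp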

Monocular Depth Estimation


We evaluate zero-shot monocular depth estimation on the NYU-v2 (static), Sintel, Bonn, and KITTI datasets. Our method achieves state-of-the-art or competitive performance across static and dynamic as well as indoor and outdoor scenes.
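As a concrete example of the evaluation, the sketch below computes two standard depth metrics (Abs Rel and the delta < 1.25 accuracy) with per-image median scale alignment; the alignment choice is our assumption and may differ from the protocol used in the paper.

    import numpy as np

    def depth_metrics(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray):
        pred, gt = pred[mask], gt[mask]
        pred = pred * np.median(gt) / np.median(pred)  # per-image median alignment
        abs_rel = np.mean(np.abs(pred - gt) / gt)      # mean absolute relative error
        ratio = np.maximum(pred / gt, gt / pred)
        delta1 = np.mean(ratio < 1.25)                 # fraction within 1.25x of GT
        return abs_rel, delta1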

Video Depth Estimation


We align predicted depth maps to the ground truth with a single per-sequence scale (Per-sequence alignment) to evaluate both per-frame quality and inter-frame consistency. We also report results without any alignment against other metric pointmap methods such as MASt3R and CUT3R (Metric-scale alignment).
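A minimal sketch of per-sequence alignment, assuming a closed-form least-squares scale fitted over all valid pixels of a sequence (the paper's exact estimator, e.g. a median-based one, may differ):

    import numpy as np

    def align_sequence(preds, gts, masks):
        # Fit one scalar scale over every valid pixel of the whole sequence.
        p = np.concatenate([d[m] for d, m in zip(preds, masks)])
        g = np.concatenate([d[m] for d, m in zip(gts, masks)])
        scale = np.dot(p, g) / np.dot(p, p)  # least squares: argmin_s ||s*p - g||^2
        # Metric-scale evaluation skips this step and uses predictions as-is.
        return [d * scale for d in preds]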

Visualizations


We visualize the 3D reconstruction results of our method under sparse and dense inputs. More visualizations will be added soon.

Bibtex

	@article{point3r,
	  title={Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory},
	  author={Yuqi Wu and Wenzhao Zheng and Jie Zhou and Jiwen Lu},
	  journal={arXiv preprint arXiv:2507.02863},
	  year={2025}
	}
