LingBot-Map: Ant Group's Open-Source 3D Foundation Model for Real-Time Scene Reconstruction

LingBot-Map is a feed-forward 3D foundation model by Ant Group for streaming scene reconstruction from a single RGB video, achieving roughly 20 FPS with state-of-the-art accuracy.

3D scene reconstruction has long been a foundational challenge in computer vision. Traditional approaches rely on expensive LiDAR hardware, offline batch processing, or iterative optimization that is too slow for real-time applications. On April 16, 2026, Robbyant – the embodied AI division of Ant Group (蚂蚁集团) – released LingBot-Map (github.com/robbyant/lingbot-map), a feed-forward 3D foundation model that changes this equation entirely.

LingBot-Map takes a single RGB video stream and reconstructs dense, accurate 3D environments in real time – no LiDAR, no multi-pass optimization, no offline processing. It runs at approximately 20 FPS at 518x378 resolution and maintains consistent accuracy over sequences exceeding 10,000 frames. The paper, available on arXiv (2604.14141), reports state-of-the-art results across multiple benchmarks, including an Absolute Trajectory Error (ATE) of 6.42 meters on the Oxford Spires dataset – a 2.8x improvement over prior methods – and an F1 score of 98.98 on ETH3D, more than 20 points ahead of the competition.

The model is open source under the Apache License 2.0, with weights available on Hugging Face and ModelScope, making it immediately accessible to researchers, robotics engineers, and AR/VR developers worldwide.

The Streaming Reconstruction Challenge

Traditional 3D reconstruction pipelines follow a familiar but brittle pattern: detect keypoints, match features across frames, estimate camera poses through bundle adjustment, then fuse depth estimates into a volumetric map. Each step compounds errors, and the computational cost grows superlinearly with sequence length. For long videos – the kind a robot or a handheld camera might capture over minutes or hours – drift becomes inevitable, and batch optimization becomes impractical.
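
For concreteness, here is a minimal sketch of the front end of such a pipeline using OpenCV, as a generic stand-in rather than any specific system. Keypoints are detected and matched across a pair of frames; pose estimation and depth fusion would then consume these imperfect matches, which is where errors start to compound.

# Front end of a classical reconstruction pipeline: detect keypoints,
# then match them across two frames. Every later stage (pose estimation,
# depth fusion) inherits whatever noise survives these steps.
import cv2

def match_frames(img_a, img_b, max_matches=200):
    orb = cv2.ORB_create(nfeatures=2000)              # keypoint detection
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
    return kp_a, kp_b, matches[:max_matches]          # tentative correspondences

# Pose estimation (e.g. cv2.findEssentialMat + cv2.recoverPose) and
# triangulation follow, each consuming the previous stage's noisy output.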

LingBot-Map sidesteps these limitations entirely by adopting a feed-forward architecture that processes video streams in a single pass. Instead of tracking features and optimizing poses frame by frame, it learns a direct mapping from image sequences to 3D geometry, leveraging learned priors from large-scale training data to resolve ambiguity that would stump traditional geometric methods.

Geometric Context Transformer: The Core Innovation

At the heart of LingBot-Map is the Geometric Context Transformer (GCT), a novel architecture that unifies three critical capabilities into a single streaming framework.

Unified Coordinate Grounding

The GCT establishes a consistent 3D coordinate frame across the entire video stream. Rather than maintaining a separate SLAM-style pose estimator alongside a depth network, LingBot-Map learns an end-to-end mapping from temporal image sequences to a shared coordinate system. This eliminates the cascading error typical of modular pipelines, where pose errors corrupt depth estimates and vice versa.

Dense Geometric Cues

The model predicts dense geometric representations directly from RGB input. For every pixel in every frame, it estimates not just depth but surface orientation, local curvature, and occupancy likelihood. These dense cues feed into the reconstruction volume at the model’s native framerate, producing maps with fine geometric detail that conventional structure-from-motion methods struggle to capture from texture-poor surfaces like white walls, glass, or featureless floors.
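
As an illustration of what such dense per-pixel output looks like in practice, here is a hypothetical container; the field names are ours, not LingBot-Map's actual output schema.

# Illustrative container for dense per-pixel geometric cues
# (field names are hypothetical, not the model's real schema).
from dataclasses import dataclass
import numpy as np

@dataclass
class DenseGeometry:
    depth: np.ndarray       # (H, W)    metric depth per pixel
    normals: np.ndarray     # (H, W, 3) unit surface orientation per pixel
    curvature: np.ndarray   # (H, W)    local curvature estimate
    occupancy: np.ndarray   # (H, W)    occupancy likelihood in [0, 1]

H, W = 378, 518             # the model's native working resolution
cues = DenseGeometry(
    depth=np.ones((H, W)),
    normals=np.tile([0.0, 0.0, 1.0], (H, W, 1)),
    curvature=np.zeros((H, W)),
    occupancy=np.full((H, W), 0.5),
)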

Long-Range Drift Correction

Long video sequences inevitably accumulate drift – a few millimeters of error per frame becomes meters of error after thousands of frames. LingBot-Map addresses this with a learned global consistency mechanism. The transformer architecture maintains a spatial memory that spans the entire sequence, allowing the model to recognize when it has returned to a previously observed location and correct accumulated drift accordingly. This is why the model maintains near-constant accuracy over more than 10,000 frames, where traditional SLAM systems would have diverged entirely.
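
The underlying idea can be sketched in a few lines, purely as an illustration of the mechanism rather than LingBot-Map's internals: keep an embedding per visited region, and when a new observation closely matches a stored one, treat it as a revisit and anchor the pose there, cancelling accumulated drift.

# Sketch of revisit detection against a spatial memory
# (illustrative logic only, not the model's internals).
import numpy as np

class SpatialMemory:
    def __init__(self, threshold=0.95):
        self.keys, self.poses = [], []
        self.threshold = threshold

    def store(self, embedding, pose):
        self.keys.append(embedding)
        self.poses.append(pose)

    def query(self, embedding):
        # cosine similarity against every stored place embedding
        for key, pose in zip(self.keys, self.poses):
            sim = embedding @ key / (np.linalg.norm(embedding) * np.linalg.norm(key))
            if sim > self.threshold:
                return pose         # revisit: anchor here, cancelling drift
        return None                 # unseen place: keep integrating

memory = SpatialMemory()
memory.store(np.ones(8), np.eye(4))
anchor = memory.query(np.ones(8))   # close match, returns the stored pose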

Capability | Traditional SLAM | LingBot-Map
Pose estimation | Sequential, error-prone | Learned, end-to-end
Depth prediction | Feature-based or separate CNN | Unified geometric cues
Drift correction | Loop closure detection | Learned global consistency
LiDAR requirement | Required for accuracy | Optional (RGB only)
Frame processing | Increasing cost per frame | Constant ~20 FPS

Benchmark Performance

LingBot-Map’s paper reports extensive evaluations across multiple 3D reconstruction and visual odometry benchmarks. The results establish a new state of the art across the board.

Oxford Spires Dataset

The Oxford Spires dataset is a challenging benchmark for large-scale scene reconstruction, featuring complex indoor and outdoor environments captured over long trajectories. LingBot-Map achieves an Absolute Trajectory Error (ATE) of 6.42 meters, representing a 2.8x improvement over the previous best method. This is particularly significant because Oxford Spires includes sequences where conventional SLAM approaches fail entirely due to challenging lighting conditions, repetitive textures, and wide baselines.
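
For reference, ATE is conventionally computed as the root-mean-square translational error after rigidly aligning the estimated trajectory to ground truth; the paper's exact protocol (alignment method, subsampling) may differ in detail. With ground-truth poses Q_i, estimated poses P_i, and a rigid alignment S:

\mathrm{ATE}_{\mathrm{RMSE}} = \left( \frac{1}{N} \sum_{i=1}^{N} \left\| \operatorname{trans}\left( Q_i^{-1}\, S\, P_i \right) \right\|^{2} \right)^{1/2}

where trans(·) extracts the translational component, so a 2.8x improvement means the aligned trajectory deviates from ground truth by just over a third of the previous best method's error.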

ETH3D Benchmark

On the ETH3D benchmark, which evaluates dense 3D reconstruction quality, LingBot-Map achieves an F1 score of 98.98 – more than 21 points ahead of prior state-of-the-art methods. This near-perfect score indicates that the model reconstructs geometry with exceptional completeness and accuracy, recovering fine details that previous methods miss.
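
For context, F1 on dense-reconstruction benchmarks like ETH3D is the harmonic mean of precision P, the fraction of reconstructed points within a distance threshold τ of the ground-truth surface, and completeness (recall) R, the fraction of ground-truth points within τ of the reconstruction:

F_1 = \frac{2PR}{P + R}

A score of 98.98 therefore requires both near-complete coverage of the true surface and almost no spurious geometry at the same time.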

Benchmark | Metric | Traditional SOTA | LingBot-Map | Improvement
Oxford Spires | ATE (m) | ~18.0 | 6.42 | 2.8x better
ETH3D | F1 score | ~77 | 98.98 | +21.98 points

Architecture Overview

The LingBot-Map architecture can be understood as a streaming pipeline with three main stages:

The Frame Encoder extracts per-frame visual features. The Geometric Context Transformer processes these features across the temporal dimension, maintaining a spatial memory that spans the entire sequence. Three specialized prediction heads produce dense depth maps, camera trajectories, and a global occupancy volume. The final scene reconstruction fuses these outputs into a unified 3D representation.
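
A toy sketch of how these stages compose, with stub functions standing in for the real networks (all names here are illustrative, not the released API):

# Toy composition of the three stages (stubs, not the released API).
import numpy as np

def encode(frame):                          # stage 1: frame encoder
    return frame.mean(axis=(0, 1))

def gct_step(feats, memory):                # stage 2: geometric context transformer
    memory = 0.95 * memory + 0.05 * np.resize(feats, memory.shape)
    return memory, memory                   # (context, updated spatial memory)

def run_heads(context, hw):                 # stage 3: prediction heads
    return np.zeros(hw), np.eye(4)          # (dense depth map, camera pose)

memory = np.zeros(256)                      # spatial memory spanning the sequence
depths, trajectory = [], []
for frame in (np.random.rand(378, 518, 3) for _ in range(3)):
    context, memory = gct_step(encode(frame), memory)
    depth, pose = run_heads(context, frame.shape[:2])
    depths.append(depth)
    trajectory.append(pose)
# The real pipeline additionally fuses depths and trajectory into a
# global occupancy volume to produce the final scene reconstruction.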

The Robbyant AI Ecosystem

LingBot-Map is not an isolated project. It is part of a growing ecosystem of embodied AI models from Robbyant, Ant Group’s dedicated embodied intelligence division:

  • LingBot-Depth – Monocular depth estimation foundation model, providing dense metric depth from single images.
  • LingBot-VLA – Vision-Language-Action model for robotic manipulation and navigation, integrating visual perception with language instructions and motor commands.
  • LingBot-World – World model for predicting future states and planning in 3D environments.

Together, these models form a comprehensive stack for embodied AI applications. LingBot-Map provides the 3D perception layer, LingBot-Depth handles per-frame depth, LingBot-VLA translates perception into action, and LingBot-World enables forward planning.

Practical Applications

Robotics Navigation

Autonomous robots need to build maps of their surroundings in real time to navigate safely. LingBot-Map’s 20 FPS throughput means a robot equipped with a single RGB camera can construct a dense 3D map of a warehouse, factory floor, or outdoor environment while moving at walking speed, without any LiDAR hardware. The long-sequence stability means the robot can operate for extended periods without map degradation.

Augmented and Virtual Reality

AR glasses and VR headsets require instant understanding of the physical environment to place virtual objects convincingly. LingBot-Map’s feed-forward architecture provides the low-latency, high-accuracy 3D reconstruction needed for compelling mixed reality experiences, all from the headset’s built-in cameras.

Autonomous Driving

While autonomous vehicles typically rely on multiple sensors, LingBot-Map demonstrates that high-quality 3D reconstruction is achievable from vision alone. This has implications for cost-reduced autonomy systems, secondary perception validation, and offline scene reconstruction from dashcam footage.

Large-Scale Scene Digitization

Architecture, construction, heritage preservation, and digital twin applications all require scanning large environments with high geometric fidelity. LingBot-Map enables practitioners to walk through a space with a standard video camera and obtain a production-quality 3D model – no specialized scanning equipment, no post-processing delays.

How to Get Started

LingBot-Map is available under the Apache License 2.0, making it suitable for both academic research and commercial applications. The model weights can be downloaded from Hugging Face or ModelScope, and the full source code lives on GitHub (github.com/robbyant/lingbot-map).

The repository provides a straightforward inference pipeline. Given a directory of video frames, LingBot-Map outputs camera trajectories and a reconstructed 3D mesh:

# Clone the repository
git clone https://github.com/robbyant/lingbot-map.git
cd lingbot-map

# Download pretrained weights (automated via script)
python scripts/download_weights.py

# Run reconstruction on a video frame sequence
python run.py --input_dir /path/to/frames --output_dir /path/to/output
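
Once reconstruction completes, the outputs can be inspected with standard tooling. The file names below are hypothetical; consult the repository README for the actual output layout.

# Inspect the outputs (file names are hypothetical; check the README).
import numpy as np
import trimesh                                             # pip install trimesh

trajectory = np.loadtxt("/path/to/output/trajectory.txt")  # e.g. one camera pose per row
mesh = trimesh.load("/path/to/output/scene.ply")           # reconstructed 3D mesh
print(trajectory.shape, mesh.vertices.shape)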

Frequently Asked Questions

What is LingBot-Map?

LingBot-Map is a feed-forward 3D foundation model developed by Robbyant, the embodied AI division of Ant Group, for real-time streaming 3D scene reconstruction from single RGB video input.

What makes LingBot-Map different from other 3D reconstruction methods?

LingBot-Map uses a Geometric Context Transformer that unifies coordinate grounding, dense geometric cues, and long-range drift correction in a single streaming framework without needing LiDAR. Unlike traditional SLAM pipelines that compound errors across sequential modules, LingBot-Map learns an end-to-end mapping from video to 3D geometry.

How fast is LingBot-Map?

LingBot-Map runs at approximately 20 FPS at 518x378 resolution. Critically, this throughput is maintained even over very long sequences – the model has been demonstrated on sequences exceeding 10,000 frames with no degradation in accuracy.

Is LingBot-Map open source?

Yes, LingBot-Map is open source under the Apache License 2.0, with model weights available on Hugging Face and ModelScope. The full source code and inference pipeline are available on GitHub.

What are the practical applications of LingBot-Map?

Applications include robotics navigation, AR/VR environment mapping, autonomous driving perception, and large-scale 3D scene digitization from simple video input. Any scenario that requires real-time, high-quality 3D reconstruction from a moving camera is a candidate use case.

What hardware does LingBot-Map require?

LingBot-Map runs on a standard GPU. The model processes RGB video only – no LiDAR, depth camera, or specialized sensor hardware is required. The 518x378 resolution and 20 FPS throughput are achievable on consumer-grade GPUs.

How does LingBot-Map relate to other Robbyant projects?

LingBot-Map is part of Robbyant’s broader embodied AI ecosystem, alongside LingBot-Depth (depth estimation), LingBot-VLA (vision-language-action), and LingBot-World (world modeling). Together these models provide a complete stack for embodied AI perception and control.

Further Reading

  • Paper: arXiv 2604.14141
  • Code and inference pipeline: github.com/robbyant/lingbot-map
  • Model weights: available on Hugging Face and ModelScope

LingBot-Map is an open-source project by Robbyant, the embodied AI division of Ant Group (蚂蚁集团). The project is licensed under Apache License 2.0.
