SP-VLA

Spatially Guided Training for Vision-Language-Action Model

Abstract

Large vision–language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce SP-VLA, a dual-system Vision–Language–Action framework that leverages Spatial Priors as a bridge between linguistic instructions and embodiment-specific control. SP-VLA aligns action learning with spatial priors through two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, SP-VLA achieves substantial improvements over the vanilla VLA baseline, with success rates increasing from 66.1% to 84.6% on Google Robot and from 54.7% to 73.2% on WidowX, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. We will release code, data, and model checkpoints to support future research.
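In pseudocode, the two-stage recipe can be summarized roughly as follows. This is a minimal PyTorch-style sketch under assumed names (the batch fields, `spatial_prompt_ids`, the simple MSE action loss), not the released implementation; the actual action expert may use a different objective, such as flow matching or diffusion.

```python
# Minimal sketch of the two-stage SP-VLA recipe (illustrative assumptions only).
import torch
import torch.nn.functional as F

def pretrain_spatial_grounding(vlm, spatial_qa_loader, optimizer):
    """Stage (i): train the VLM to predict points, boxes, and trajectories
    (serialized as text) from web-scale and robot-specific grounding QA."""
    for batch in spatial_qa_loader:
        # labels mask question tokens with -100 so the loss covers only the
        # spatial answer (point / box / trajectory tokens).
        logits = vlm(images=batch["images"], input_ids=batch["input_ids"])
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            batch["labels"].view(-1),
            ignore_index=-100,
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def post_train_spatially_guided(vlm, action_expert, robot_loader, optimizer, w=1.0):
    """Stage (ii): keep the spatial-prior objective alive during policy
    learning so spatial and action objectives are optimized consistently."""
    for batch in robot_loader:
        # Spatial prompting: the VLM is asked for spatial priors before acting
        # and also exposes a latent plan for the action expert.
        spatial_logits, latent_plan = vlm(
            images=batch["images"],
            input_ids=batch["spatial_prompt_ids"],
            return_latent_plan=True,
        )
        spatial_loss = F.cross_entropy(
            spatial_logits.view(-1, spatial_logits.size(-1)),
            batch["spatial_labels"].view(-1),
            ignore_index=-100,
        )
        # A simple regression loss stands in for the actual action objective.
        pred_actions = action_expert(latent_plan, batch["proprio"])
        action_loss = F.mse_loss(pred_actions, batch["expert_actions"])
        optimizer.zero_grad()
        (spatial_loss + w * action_loss).backward()
        optimizer.step()
```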

Model Overview

SP-VLA Model Architecture

SP-VLA integrates spatial grounding into the vision–language–action training pipeline. Given a task instruction, the VLM planner produces latent plans through explicit spatial prompting, and these plans then guide the action expert in generating low-level control signals.
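As a rough illustration of this dual-system flow, the sketch below shows one control step; the method and argument names (`vlm.generate`, `return_latent_plan`, `action_expert`) are assumptions for illustration, not SP-VLA's released interface.

```python
# Illustrative single control step of the dual system (names are assumptions).
def sp_vla_step(vlm, action_expert, image, instruction, proprio):
    # System 2: explicit spatial prompting. The VLM is asked to predict
    # spatial priors (points, boxes, a visual trace) for the task, and its
    # decoder states are kept as a latent plan.
    spatial_answer, latent_plan = vlm.generate(
        images=[image],
        prompt=f"Task: {instruction}. Predict the relevant points, boxes, "
               "and trajectory before acting.",
        return_latent_plan=True,
    )
    # System 1: the action expert decodes a chunk of low-level actions
    # conditioned on the latent plan and the robot's proprioceptive state.
    action_chunk = action_expert(latent_plan, proprio)
    return spatial_answer, action_chunk
```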

Results

Watch SP-VLA perform instruction-following manipulation tasks in both large-scale simulated environments and real-world tasks.

Instruction-Following Manipulation

Cluttered-scene Pick-and-Place

Long-horizon and Reasoning Manipulation

Experimental Results

SP-VLA demonstrates superior performance across various challenging scenarios

Performance Comparison on SimplerEnv and LIBERO Benchmarks

Success Rate (%)     GR00T   π0      SP-VLA
Google Robot VM      35.2    58.8    84.6
Google Robot VA      44.5    54.8    75.9
WidowX VM            61.9    27.1    73.2
LIBERO               93.9    94.2    95.9

Effects of Spatially Guided VLA Training

Success Rate (%)     Vanilla VLA   SP-VLA
Google Robot VM      66.1          84.6
Google Robot VA      63.5          75.9
WidowX VM            54.7          73.2
LIBERO               91.6          95.9

System 2 Spatial Reasoning Results

Demonstrating SP-VLA's System 2 capabilities in box detection, point localization, and visual trace prediction.

📦 Box Detection

Precise bounding-box detection for object localization

📍 Point Localization

Precise keypoint localization for spatial grounding

📈 Trajectory Prediction

Visual trace prediction for motion path planning
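As a concrete illustration, the snippets below show one plausible way such predictions can be serialized as text for QA-style training and evaluation; the tags, questions, and coordinate conventions are assumptions, not SP-VLA's actual token scheme.

```python
# Example (assumed) text serializations for the three System-2 outputs.
# Coordinates are illustrative and normalized to the image size.
box_qa = {
    "question": "Where is the coke can? Answer with a bounding box.",
    "answer": "<box>(0.41, 0.32), (0.58, 0.55)</box>",  # (x1, y1), (x2, y2)
}
point_qa = {
    "question": "Point to the grasp position on the spoon handle.",
    "answer": "<point>(0.47, 0.63)</point>",
}
trajectory_qa = {
    "question": "Predict the end-effector trace for 'put the carrot on the plate'.",
    "answer": "<traj>(0.47, 0.63), (0.50, 0.48), (0.61, 0.40), (0.72, 0.38)</traj>",
}
```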

VLM Pre-training Data Distribution

Comprehensive dataset composition for spatial grounding pre-training

VLM pre-training data: 3,032K samples in total.

  General VQA            21.0%
  Spatial Grounding QA   79.0%
    Trajectory-QA        22.6%
    Point-QA             27.4%
    Box-QA               29.0%
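For reference, the distribution above can be expressed as a simple mixture configuration; the structure and field names below are hypothetical and do not come from the released code.

```python
# Hypothetical mixture specification matching the reported pre-training
# distribution (3,032K samples in total); field names are illustrative.
PRETRAIN_MIXTURE = {
    "total_samples": 3_032_000,
    "general_vqa":   0.210,  # 21.0%
    "trajectory_qa": 0.226,  # 22.6%
    "point_qa":      0.274,  # 27.4%
    "box_qa":        0.290,  # 29.0% (spatial grounding QA totals 79.0%)
}
ratios = [v for k, v in PRETRAIN_MIXTURE.items() if k != "total_samples"]
assert abs(sum(ratios) - 1.0) < 1e-6
```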

Simulation Data Generation

Simulation Data Pipeline

The pipeline automatically generates diverse instruction-following robotic manipulation data from a large asset library, incorporating intermediate representations such as boxes, points, and trajectories, which are then converted into VLM spatial grounding data.
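A minimal sketch of that conversion step is shown below, assuming a hypothetical episode record with box, grasp-point, and end-effector-trace annotations; the field names are illustrative, not the pipeline's actual schema.

```python
# Sketch: convert one simulated episode's annotations into spatial grounding
# QA pairs (illustrative field names and question templates).
def episode_to_spatial_qa(episode):
    qa_pairs = []
    instr = episode["instruction"]
    frame = episode["frames"][0]

    # Box-QA from the target object's bounding box in the first frame.
    x1, y1, x2, y2 = episode["target_box"]
    qa_pairs.append({
        "image": frame,
        "question": f"{instr} Where is the target object? Answer with a box.",
        "answer": f"<box>({x1:.2f}, {y1:.2f}), ({x2:.2f}, {y2:.2f})</box>",
    })

    # Point-QA from the annotated grasp point.
    gx, gy = episode["grasp_point"]
    qa_pairs.append({
        "image": frame,
        "question": f"{instr} Point to where the gripper should grasp.",
        "answer": f"<point>({gx:.2f}, {gy:.2f})</point>",
    })

    # Trajectory-QA from the end-effector trace projected into the image.
    trace = ", ".join(f"({x:.2f}, {y:.2f})" for x, y in episode["ee_trace_2d"])
    qa_pairs.append({
        "image": frame,
        "question": f"{instr} Predict the end-effector trajectory in the image.",
        "answer": f"<traj>{trace}</traj>",
    })
    return qa_pairs
```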