SP-VLA

Spatially Guided Training for Vision-Language-Action Model

Abstract

Large vision–language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce SP-VLA, a dual-system Vision–Language–Action framework that leverages Spatial Priors as a bridge between linguistic instructions and embodiment-specific control. SP-VLA aligns action learning with spatial priors through two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, SP-VLA achieves substantial improvements over the vanilla VLA baseline, with success rates increasing from 66.1% to 84.6% on Google Robot and from 54.7% to 73.2% on WidowX, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. We will release code, data, and model checkpoints to support future research.
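In pseudocode, the two-stage recipe can be summarized roughly as follows. This is a minimal PyTorch-style sketch under assumed names (the batch fields, `spatial_prompt_ids`, the simple MSE action loss), not the released implementation; the actual action expert may use a different objective, such as flow matching or diffusion.

```python
# Minimal sketch of the two-stage SP-VLA recipe (illustrative assumptions only).
import torch
import torch.nn.functional as F

def pretrain_spatial_grounding(vlm, spatial_qa_loader, optimizer):
    """Stage (i): train the VLM to predict points, boxes, and trajectories
    (serialized as text) from web-scale and robot-specific grounding QA."""
    for batch in spatial_qa_loader:
        # labels mask question tokens with -100 so the loss covers only the
        # spatial answer (point / box / trajectory tokens).
        logits = vlm(images=batch["images"], input_ids=batch["input_ids"])
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            batch["labels"].view(-1),
            ignore_index=-100,
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def post_train_spatially_guided(vlm, action_expert, robot_loader, optimizer, w=1.0):
    """Stage (ii): keep the spatial-prior objective alive during policy
    learning so spatial and action objectives are optimized consistently."""
    for batch in robot_loader:
        # Spatial prompting: the VLM is asked for spatial priors before acting
        # and also exposes a latent plan for the action expert.
        spatial_logits, latent_plan = vlm(
            images=batch["images"],
            input_ids=batch["spatial_prompt_ids"],
            return_latent_plan=True,
        )
        spatial_loss = F.cross_entropy(
            spatial_logits.view(-1, spatial_logits.size(-1)),
            batch["spatial_labels"].view(-1),
            ignore_index=-100,
        )
        # A simple regression loss stands in for the actual action objective.
        pred_actions = action_expert(latent_plan, batch["proprio"])
        action_loss = F.mse_loss(pred_actions, batch["expert_actions"])
        optimizer.zero_grad()
        (spatial_loss + w * action_loss).backward()
        optimizer.step()
```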

Model Overview

SP-VLA Model Architecture

SP-VLA integrates spatial grounding into the vision–language–action training pipeline. Given a task instruction, the VLM planner produces latent plans through explicit spatial prompting, and these plans then guide the action expert in generating low-level control signals.
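As a rough illustration of this dual-system flow, the sketch below shows one control step; the method and argument names (`vlm.generate`, `return_latent_plan`, `action_expert`) are assumptions for illustration, not SP-VLA's released interface.

```python
# Illustrative single control step of the dual system (names are assumptions).
def sp_vla_step(vlm, action_expert, image, instruction, proprio):
    # System 2: explicit spatial prompting. The VLM is asked to predict
    # spatial priors (points, boxes, a visual trace) for the task, and its
    # decoder states are kept as a latent plan.
    spatial_answer, latent_plan = vlm.generate(
        images=[image],
        prompt=f"Task: {instruction}. Predict the relevant points, boxes, "
               "and trajectory before acting.",
        return_latent_plan=True,
    )
    # System 1: the action expert decodes a chunk of low-level actions
    # conditioned on the latent plan and the robot's proprioceptive state.
    action_chunk = action_expert(latent_plan, proprio)
    return spatial_answer, action_chunk
```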

Results

Watch SP-VLA perform instruction-following manipulation tasks in both large-scale simulated environments and real-world tasks.

Instruction-Following Manipulation

Cluttered-scene Pick-and-Place

Long-horizon and Reasoning Manipulation

Experimental Results

SP-VLA demonstrates superior performance across various challenging scenarios

Performance Comparison on SimplerEnv and LIBERO Benchmarks

Success Rate (%)     GR00T   π0      SP-VLA
Google Robot VM      35.2    58.8    84.6
Google Robot VA      44.5    54.8    75.9
WidowX VM            61.9    27.1    73.2
LIBERO               93.9    94.2    95.9

Effects of Spatially Guided VLA Training

Success Rate (%)     Vanilla VLA   SP-VLA
Google Robot VM      66.1          84.6
Google Robot VA      63.5          75.9
WidowX VM            54.7          73.2
LIBERO               91.6          95.9

System 2 Spatial Reasoning Results

Demonstrating SP-VLA's System 2 capabilities in box detection, point localization, and visual trace prediction.

📦 Box Detection

Precise bounding-box detection for object localization

📍 Point Localization

Precise keypoint localization for spatial grounding

📈 Trajectory Prediction

Visual trace prediction for motion path planning
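As a concrete illustration, the snippets below show one plausible way such predictions can be serialized as text for QA-style training and evaluation; the tags, questions, and coordinate conventions are assumptions, not SP-VLA's actual token scheme.

```python
# Example (assumed) text serializations for the three System-2 outputs.
# Coordinates are illustrative and normalized to the image size.
box_qa = {
    "question": "Where is the coke can? Answer with a bounding box.",
    "answer": "<box>(0.41, 0.32), (0.58, 0.55)</box>",  # (x1, y1), (x2, y2)
}
point_qa = {
    "question": "Point to the grasp position on the spoon handle.",
    "answer": "<point>(0.47, 0.63)</point>",
}
trajectory_qa = {
    "question": "Predict the end-effector trace for 'put the carrot on the plate'.",
    "answer": "<traj>(0.47, 0.63), (0.50, 0.48), (0.61, 0.40), (0.72, 0.38)</traj>",
}
```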

VLM Pre-training Data Distribution

Comprehensive dataset composition for spatial grounding pre-training

VLM pre-training data: 3,032K samples in total.

  General VQA            21.0%
  Spatial Grounding QA   79.0%
    Trajectory-QA        22.6%
    Point-QA             27.4%
    Box-QA               29.0%
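For reference, the distribution above can be expressed as a simple mixture configuration; the structure and field names below are hypothetical and do not come from the released code.

```python
# Hypothetical mixture specification matching the reported pre-training
# distribution (3,032K samples in total); field names are illustrative.
PRETRAIN_MIXTURE = {
    "total_samples": 3_032_000,
    "general_vqa":   0.210,  # 21.0%
    "trajectory_qa": 0.226,  # 22.6%
    "point_qa":      0.274,  # 27.4%
    "box_qa":        0.290,  # 29.0% (spatial grounding QA totals 79.0%)
}
ratios = [v for k, v in PRETRAIN_MIXTURE.items() if k != "total_samples"]
assert abs(sum(ratios) - 1.0) < 1e-6
```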

Simulation Data Generation

Simulation Data Pipeline

The pipeline automatically generates diverse instruction-following robotic manipulation data from a large asset library, incorporating intermediate representations such as boxes, points, and trajectories, which are then converted into VLM spatial grounding data.
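A minimal sketch of that conversion step is shown below, assuming a hypothetical episode record with box, grasp-point, and end-effector-trace annotations; the field names are illustrative, not the pipeline's actual schema.

```python
# Sketch: convert one simulated episode's annotations into spatial grounding
# QA pairs (illustrative field names and question templates).
def episode_to_spatial_qa(episode):
    qa_pairs = []
    instr = episode["instruction"]
    frame = episode["frames"][0]

    # Box-QA from the target object's bounding box in the first frame.
    x1, y1, x2, y2 = episode["target_box"]
    qa_pairs.append({
        "image": frame,
        "question": f"{instr} Where is the target object? Answer with a box.",
        "answer": f"<box>({x1:.2f}, {y1:.2f}), ({x2:.2f}, {y2:.2f})</box>",
    })

    # Point-QA from the annotated grasp point.
    gx, gy = episode["grasp_point"]
    qa_pairs.append({
        "image": frame,
        "question": f"{instr} Point to where the gripper should grasp.",
        "answer": f"<point>({gx:.2f}, {gy:.2f})</point>",
    })

    # Trajectory-QA from the end-effector trace projected into the image.
    trace = ", ".join(f"({x:.2f}, {y:.2f})" for x, y in episode["ee_trace_2d"])
    qa_pairs.append({
        "image": frame,
        "question": f"{instr} Predict the end-effector trajectory in the image.",
        "answer": f"<traj>{trace}</traj>",
    })
    return qa_pairs
```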