TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng*1    Yongyuan Liang*1    Shuaiyi Huang1   
Jianfeng Gao2    Hal Daumé III1    Andrey Kolobov2    Furong Huang1    Jianwei Yang2   
1University of Maryland          2Microsoft Research
* Equal contribution.

Visual trace prompting enhances VLA models' spatial-temporal understanding, boosting manipulation performance.
In the input to TraceVLA, the first image is the robot's original observation, while the second is the same image with visual traces overlaid. A separator token is inserted between the visual tokens of these two images; the resulting sequence is then concatenated with the text tokens and fed into the underlying vision-language model backbone, which outputs action tokens.
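As a concrete illustration, the sketch below shows how such an input sequence could be assembled in PyTorch. The names vision_encoder, embed_tokens, text_token_ids, and sep_token_id are placeholders introduced for this example, not the exact TraceVLA/OpenVLA interfaces.

# Minimal PyTorch sketch of assembling TraceVLA-style inputs.
# `vision_encoder`, `embed_tokens`, `text_token_ids`, and `sep_token_id`
# are illustrative placeholders, not the exact TraceVLA/OpenVLA API.
import torch

def build_input_embeddings(original_img, traced_img, text_token_ids,
                           vision_encoder, embed_tokens, sep_token_id):
    """Return the embedding sequence
    [original visual tokens | SEP | traced visual tokens | text tokens]."""
    vis_orig  = vision_encoder(original_img.unsqueeze(0))   # (1, N, D)
    vis_trace = vision_encoder(traced_img.unsqueeze(0))     # (1, N, D)

    # Separator token embedded with the language model's embedding table.
    sep = embed_tokens(torch.tensor([[sep_token_id]]))      # (1, 1, D)

    # Tokenized language instruction.
    text = embed_tokens(text_token_ids)                     # (1, L, D)

    # Concatenate along the sequence dimension; the VLM backbone then
    # autoregressively decodes action tokens from this sequence.
    return torch.cat([vis_orig, sep, vis_trace, text], dim=1)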

Abstract

Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective at complex tasks such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach that facilitates VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks, and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment dataset and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.

Visual Trace Generation & Closed-loop Control with TraceVLA


Given a sequence of historical image observations, we first use CoTracker to extract dense point trajectories and keep the active trajectories with significant movement. We then overlay these active trajectories on the robot's initial observation frame as the visual trace prompt, and feed both the trace-overlaid image and the original image into the VLA as model input, as sketched below.
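The following sketch outlines this pipeline, assuming the public CoTracker checkpoint available through torch.hub; the grid size and motion threshold are illustrative values, not the paper's exact settings.

# Illustrative visual-trace generation with CoTracker.
# grid_size and min_motion_px are placeholder values, not the paper's settings.
import numpy as np
import torch
import cv2

device = "cuda" if torch.cuda.is_available() else "cpu"
tracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2").to(device)

def visual_trace_prompt(frames, grid_size=30, min_motion_px=20.0):
    """frames: list of H x W x 3 uint8 observations (history window).
    Returns the initial frame with active point trajectories drawn on it."""
    video = torch.from_numpy(np.stack(frames)).permute(0, 3, 1, 2)  # (T, C, H, W)
    video = video[None].float().to(device)                          # (1, T, C, H, W)

    pred_tracks, _ = tracker(video, grid_size=grid_size)  # (1, T, N, 2)
    tracks = pred_tracks[0].cpu().numpy()                 # (T, N, 2)

    # Keep "active" points whose accumulated motion exceeds the threshold.
    motion = np.linalg.norm(np.diff(tracks, axis=0), axis=-1).sum(axis=0)  # (N,)
    active = tracks[:, motion > min_motion_px]                             # (T, K, 2)

    overlay = frames[0].copy()
    for k in range(active.shape[1]):
        pts = active[:, k].round().astype(np.int32).reshape(-1, 1, 2)
        cv2.polylines(overlay, [pts], isClosed=False, color=(255, 0, 0), thickness=2)
    return overlay  # fed to the VLA together with the original observation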

Real-Robot Rollouts

Below are videos of TraceVLA and OpenVLA on physical WidowX-250 robot manipulation tasks with different manipulation skills and objects. (Videos are sped up by 2.5x.)

TraceVLA masters soft object manipulation, pick-and-place operations, and object movement, demonstrating reliable performance in both in-distribution and out-of-distribution generalization tasks.

Fold Cloth

(In-Distribution)

TraceVLA
OpenVLA
Pickplace Corn Pot

(Out-of-Distribution: Unseen Task)

TraceVLA
OpenVLA
Pick Banana to the Right of Plate

(Out-of-Distribution: Unseen Task)

TraceVLA
OpenVLA
Lift AAA Battery

(Out-of-Distribution: Unseen Object)

TraceVLA
OpenVLA
Pick Eggplant on Plate

(Out-of-Distribution: Unseen Task)

TraceVLA
OpenVLA
Push Cloth Left to Right

(Out-of-Distribution: Distracting Object, Inverse Motion)

TraceVLA
OpenVLA

We design 8 real-world robot tasks covering different manipulation skills and objects, including unseen tasks with novel objects, goals, and language instructions, to evaluate generalization in real-robot settings.


TraceVLA consistently outperforms OpenVLA across diverse tasks, including soft object manipulation, pick-and-place operations, and object movement, and demonstrates superior generalization.

Simulation Benchmark: SimplerEnv


TraceVLA consistently outperforms OpenVLA across various tasks and evaluation metrics in the SimplerEnv Google robot tasks. The improvements are evident in both the full-scale 7B models (TraceVLA vs OpenVLA) and their 4B versions (TraceVLA-Phi3 vs OpenVLA-Phi3).


Environmental Variant Aggregation: TraceVLA shows substantial enhancements under camera orientation changes, distractor presence, and background alterations, with an average improvement exceeding 20% in these categories.

Training Memory Cost and Inference Speed


TraceVLA's training memory overhead is manageable, at less than 10 GB extra when training on 8 H100 GPUs, and the gap shrinks at smaller batch sizes. At inference time, the method adds three main components per timestep: processing the additional image/text tokens (0.002 s), CoTracker tracking (0.03 s), and dense point tracking (0.004 s), for roughly 0.036 s of extra latency in total. These additional costs remain small thanks to GPU-optimized attention.

BibTeX

@article{zheng2024tracevla,
  title={TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies},
  author={Zheng, Ruijie and Liang, Yongyuan and Huang, Shuaiyi and Gao, Jianfeng and Daum{\'e} III, Hal and Kolobov, Andrey and Huang, Furong and Yang, Jianwei},
  journal={arXiv preprint arXiv:2412.10345},
  year={2024}
}