TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng*1    Yongyuan Liang*1    Shuaiyi Huang1   
Jianfeng Gao2    Hal Daumé III1    Andrey Kolobov2    Furong Huang1    Jianwei Yang2   
1University of Maryland          2Microsoft Research
* Equal contribution.

Visual trace prompting enhances VLA models' spatial-temporal understanding, boosting manipulation performance.
In the input to TraceVLA, the first image is the robot's original observation, while the second is the same image with visual traces overlaid. A separator token is inserted between the visual tokens of these two images; the resulting sequence is then concatenated with the text tokens and fed into the underlying vision-language model backbone, which outputs action tokens.
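As a concrete illustration, the sketch below shows how such an input sequence could be assembled in PyTorch. The names vision_encoder, embed_tokens, text_token_ids, and sep_token_id are placeholders introduced for this example, not the exact TraceVLA/OpenVLA interfaces.

# Minimal PyTorch sketch of assembling TraceVLA-style inputs.
# `vision_encoder`, `embed_tokens`, `text_token_ids`, and `sep_token_id`
# are illustrative placeholders, not the exact TraceVLA/OpenVLA API.
import torch

def build_input_embeddings(original_img, traced_img, text_token_ids,
                           vision_encoder, embed_tokens, sep_token_id):
    """Return the embedding sequence
    [original visual tokens | SEP | traced visual tokens | text tokens]."""
    vis_orig  = vision_encoder(original_img.unsqueeze(0))   # (1, N, D)
    vis_trace = vision_encoder(traced_img.unsqueeze(0))     # (1, N, D)

    # Separator token embedded with the language model's embedding table.
    sep = embed_tokens(torch.tensor([[sep_token_id]]))      # (1, 1, D)

    # Tokenized language instruction.
    text = embed_tokens(text_token_ids)                     # (1, L, D)

    # Concatenate along the sequence dimension; the VLM backbone then
    # autoregressively decodes action tokens from this sequence.
    return torch.cat([vis_orig, sep, vis_trace, text], dim=1)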

Abstract

Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective at complex tasks such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach that facilitates VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks, and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment dataset and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.

Visual Trace Generation & Closed-loop Control with TraceVLA


Given a sequence of historical image observations, we first use CoTracker to extract dense point trajectories and keep the active trajectories with significant movement. We then overlay these active trajectories on the robot's initial observation frame as the visual trace prompt, and feed both the trace-overlaid image and the original image into the VLA as model input, as sketched below.
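The following sketch outlines this pipeline, assuming the public CoTracker checkpoint available through torch.hub; the grid size and motion threshold are illustrative values, not the paper's exact settings.

# Illustrative visual-trace generation with CoTracker.
# grid_size and min_motion_px are placeholder values, not the paper's settings.
import numpy as np
import torch
import cv2

device = "cuda" if torch.cuda.is_available() else "cpu"
tracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2").to(device)

def visual_trace_prompt(frames, grid_size=30, min_motion_px=20.0):
    """frames: list of H x W x 3 uint8 observations (history window).
    Returns the initial frame with active point trajectories drawn on it."""
    video = torch.from_numpy(np.stack(frames)).permute(0, 3, 1, 2)  # (T, C, H, W)
    video = video[None].float().to(device)                          # (1, T, C, H, W)

    pred_tracks, _ = tracker(video, grid_size=grid_size)  # (1, T, N, 2)
    tracks = pred_tracks[0].cpu().numpy()                 # (T, N, 2)

    # Keep "active" points whose accumulated motion exceeds the threshold.
    motion = np.linalg.norm(np.diff(tracks, axis=0), axis=-1).sum(axis=0)  # (N,)
    active = tracks[:, motion > min_motion_px]                             # (T, K, 2)

    overlay = frames[0].copy()
    for k in range(active.shape[1]):
        pts = active[:, k].round().astype(np.int32).reshape(-1, 1, 2)
        cv2.polylines(overlay, [pts], isClosed=False, color=(255, 0, 0), thickness=2)
    return overlay  # fed to the VLA together with the original observation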

Real-Robot Rollouts

Below are videos of TraceVLA and OpenVLA on physical WidowX-250 robot manipulation tasks with different manipulation skills and objects. (Videos are sped up by 2.5x.)

TraceVLA masters soft object manipulation, pick-and-place operations, and object movement, demonstrating reliable performance in both in-distribution and out-of-distribution generalization tasks.

Fold Cloth

(In-Distribution)

TraceVLA
OpenVLA
Pickplace Corn Pot

(Out-of-Distribution: Unseen Task)

TraceVLA
OpenVLA
Pick Banana to the Right of Plate

(Out-of-Distribution: Unseen Task)

TraceVLA
OpenVLA
Lift AAA Battery

(Out-of-Distribution: Unseen Object)

TraceVLA
OpenVLA
Pick Eggplant on Plate

(Out-of-Distribution: Unseen Task)

TraceVLA
OpenVLA
Push Cloth Left to Right

(Out-of-Distribution: Distracting Object, Inverse Motion)

TraceVLA
OpenVLA

We design 8 real-world robot tasks covering different manipulation skills and objects, including unseen tasks with novel objects, goals, and language instructions, to evaluate generalization in real-robot settings.


TraceVLA consistently outperforms OpenVLA across diverse tasks, including soft object manipulation, pick-and-place operations, and object movement, and demonstrates superior generalization.

Simulation Benchmark: SimplerEnv


TraceVLA consistently outperforms OpenVLA across various tasks and evaluation metrics in the SimplerEnv Google robot tasks. The improvements are evident in both the full-scale 7B models (TraceVLA vs OpenVLA) and their 4B versions (TraceVLA-Phi3 vs OpenVLA-Phi3).


Environmental Variant Aggregation: TraceVLA shows substantial enhancements under camera orientation changes, distractor presence, and background alterations, with an average improvement exceeding 20% in these categories.

Training Memory Cost and Inference Speed


TraceVLA's training memory overhead is manageable, at less than 10 GB extra when training on 8 H100 GPUs, and the gap shrinks at smaller batch sizes. At inference time, the method adds three main components per timestep: processing the additional image/text tokens (0.002 s), CoTracker tracking (0.03 s), and dense point tracking (0.004 s), for roughly 0.036 s of extra latency in total. These additional costs remain small thanks to GPU-optimized attention.

BibTeX

@article{zheng2024tracevla,
  title={TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies},
  author={Zheng, Ruijie and Liang, Yongyuan and Huang, Shuaiyi and Gao, Jianfeng and Daum{\'e} III, Hal and Kolobov, Andrey and Huang, Furong and Yang, Jianwei},
  journal={arXiv preprint arXiv:2412.10345},
  year={2024}
}