Long-Horizon Manipulation via Trace-Conditioned VLA Planning

¹UC San Diego, ²NVIDIA; equal advising

LoHo-Manip is a modular framework that enables short-horizon vision‑language‑action (VLA) policies to perform long-horizon manipulation tasks. It places a task-management vision‑language model above the action executor. At each step, the manager takes the current observation and predicts the plan ahead: a sequence of remaining subtasks and a visual trace that shows the executor where to move and what to approach next.

By re-conditioning the executor on an updated visual trace at each step, LoHo-Manip reduces long-horizon reasoning to a sequence of short-horizon control problems. This enables implicit progress tracking, re-planning, and recovery without hand-crafted failure detectors.
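The loop described above can be sketched in a few lines. This is a minimal, self-contained illustration with hypothetical stand-ins: `plan`, `execute_step`, and the `Plan` container are placeholders for the paper's learned VLM task manager and VLA executor, not their actual interfaces.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    subtasks: list  # remaining subtasks, e.g. ["pick mug", "place on shelf"]
    trace: list     # 2D image-space waypoints, e.g. [(120, 80), ...]


def plan(observation, instruction, memory):
    """Hypothetical task manager: predicts remaining subtasks and a visual trace
    from the current observation, instruction, and textual progress memory."""
    remaining = [s for s in instruction if s not in memory]
    # Dummy trace: one placeholder waypoint per remaining subtask.
    return Plan(subtasks=remaining, trace=[(i, i) for i in range(len(remaining))])


def execute_step(observation, current_plan):
    """Hypothetical short-horizon VLA executor: follows the trace toward the
    first subtask and returns the new observation plus the finished subtask."""
    finished = current_plan.subtasks[0]
    return observation + [finished], finished


def rollout(instruction, max_steps=10):
    """Receding-horizon loop: re-plan from the latest observation every step."""
    observation, memory = [], []
    for _ in range(max_steps):
        p = plan(observation, instruction, memory)
        if not p.subtasks:        # all subtasks complete
            break
        observation, finished = execute_step(observation, p)
        memory.append(finished)   # textual progress memory for the next plan
    return memory


# Usage: a three-subtask instruction is completed in order.
print(rollout(["pick mug", "move to shelf", "place mug"]))
```

Because the plan is recomputed from the latest observation at every step, a failed step simply leaves the subtask in the remaining list, so recovery falls out of the loop rather than requiring an explicit failure detector.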

LoHo-Manip pipeline overview

Planning & Reasoning Benchmarks

Benchmark categories: embodied reasoning, long-horizon planning, human-level planning.

Trajectory Prediction

Trajectory prediction in LoHo-Manip is more than spatial forecasting: it is an execution interface. The task manager predicts a 2D visual trace from the current observation, instruction, and textual progress memory, indicating where the robot should move next. This trace grounds language into image-space intent, letting the VLA executor solve short-horizon control by following the trace. The receding-horizon design updates traces after each step, enabling replanning and recovery from failures.
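One simple way to ground a predicted trace in image space is to rasterize its waypoints onto the observation before passing it to the executor. The sketch below illustrates that idea only; `draw_trace`, the single-channel image format, and the marker value are assumptions for illustration, not the paper's actual conditioning scheme.

```python
def draw_trace(image, trace, value=1.0):
    """Overlay 2D trace waypoints (x, y) on a copy of an H x W image,
    represented here as a plain list of lists. Out-of-bounds points are skipped."""
    out = [row[:] for row in image]  # copy so the raw observation is untouched
    h, w = len(out), len(out[0])
    for x, y in trace:
        if 0 <= y < h and 0 <= x < w:
            out[y][x] = value
    return out


# Usage: mark a short trace on a blank 3 x 4 "observation".
blank = [[0.0] * 4 for _ in range(3)]
conditioned = draw_trace(blank, [(0, 0), (1, 1), (3, 2)])
for row in conditioned:
    print(row)
```

The conditioned image carries the manager's spatial intent in the same coordinate frame the executor already perceives, which is what lets a short-horizon policy follow a long-horizon plan one trace at a time.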

Our high-level task manager generalizes across diverse robotic setups and environments, including different manipulators, objects, and scenes. By decoupling high-level task planning from low-level control, the manager can seamlessly plug into downstream executor policies trained on different embodiments and datasets.

Trajectory prediction qualitative results

Real-World Rollouts

Citation

@misc{liu2026longhorizonmanipulationtraceconditionedvla,
      title={Long-Horizon Manipulation via Trace-Conditioned VLA Planning},
      author={Isabella Liu and An-Chieh Cheng and Rui Yan and Geng Chen and Ri-Zhao Qiu and Xueyan Zou and Sha Yi and Hongxu Yin and Xiaolong Wang and Sifei Liu},
      year={2026},
      eprint={2604.21924},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2604.21924},
}