Long-Horizon Manipulation via Trace-Conditioned VLA Planning

LoHo-Manip is a modular framework that enables short-horizon vision-language-action (VLA) policies to perform long-horizon manipulation tasks. It places a task-management vision-language model (the task manager) above the action executor. At each step, the manager takes the current observation and predicts the remaining plan: a sequence of remaining subtasks and a visual trace that shows the executor where to move and what to approach next.

By re-conditioning the executor on an updated visual trace at each step, LoHo-Manip reduces long-horizon reasoning to a sequence of short-horizon control problems. This enables implicit progress tracking, re-planning, and recovery without hand-crafted failure detectors.
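The receding-horizon loop described above can be illustrated with a toy grid world. This is a minimal sketch under stated assumptions, not the authors' implementation: `Plan`, `toy_manager`, `toy_executor`, and `run` are hypothetical names, the "observation" is reduced to a grid position plus completion flags, and a hand-written rule stands in for the vision-language task manager.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[int, int]  # (x, y) grid cell; stands in for image coordinates


@dataclass
class Plan:
    subtasks: List[str]  # remaining subtasks, in order
    trace: List[Point]   # waypoints toward the next target


def toy_manager(position: Point, targets: List[Point], done: List[bool]) -> Plan:
    """Hypothetical stand-in for the task manager: rebuild the remaining
    plan from the current state every time it is called."""
    remaining = [i for i, d in enumerate(done) if not d]
    subtasks = [f"reach target {i}" for i in remaining]
    if not remaining:
        return Plan(subtasks, [])
    goal = targets[remaining[0]]
    trace: List[Point] = []
    x, y = position
    while (x, y) != goal:  # walk along x first, then along y
        if x != goal[0]:
            x += 1 if goal[0] > x else -1
        else:
            y += 1 if goal[1] > y else -1
        trace.append((x, y))
    return Plan(subtasks, trace)


def toy_executor(position: Point, trace: List[Point], horizon: int = 2) -> Point:
    """Hypothetical short-horizon executor: follow only the first few waypoints."""
    return trace[:horizon][-1] if trace else position


def run(targets: List[Point], start: Point = (0, 0),
        disturb: Optional[Tuple[int, Point]] = None,
        max_steps: int = 50) -> Tuple[int, Point]:
    """Alternate manager and executor calls; return (steps used, final position)."""
    position = start
    done = [False] * len(targets)
    for step in range(max_steps):
        plan = toy_manager(position, targets, done)  # re-plan every step
        if not plan.subtasks:
            return step, position
        position = toy_executor(position, plan.trace)
        if disturb is not None and step == disturb[0]:
            position = disturb[1]  # simulated failure: the robot is displaced
        nxt = done.index(False)
        if position == targets[nxt]:
            done[nxt] = True  # progress is read off the world state
    return max_steps, position


if __name__ == "__main__":
    print(run([(3, 0), (3, 2)]))
    print(run([(3, 0), (3, 2)], disturb=(0, (0, 2))))
```

In this toy setting the loop contains no failure detector. The displaced run finishes anyway, in more steps, because each new trace is computed from wherever the robot currently is.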

Planning & Reasoning Benchmarks

[Figure: benchmark results; panels: embodied reasoning, long-horizon planning, human-level planning]

Trajectory Prediction

Trajectory prediction in LoHo-Manip is more than spatial forecasting: it is an execution interface. The task manager predicts a 2D visual trace from the current observation, instruction, and textual progress memory, indicating where the robot should move next. This trace grounds language into image-space intent, letting the VLA executor solve short-horizon control by following the trace. The receding-horizon design updates traces after each step, enabling replanning and recovery from failures.
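One way to picture the trace as an image-space interface is to rasterize the predicted waypoints and overlay them on the observation before it reaches the executor. The source does not say how the trace is passed to the executor, so the overlay approach and the helper names `rasterize_trace` and `overlay` are assumptions made for this sketch. Images are plain nested lists to keep the example dependency-free.

```python
from typing import List, Tuple

Point = Tuple[int, int]      # (x, y) pixel coordinates
Pixel = Tuple[int, int, int]  # (r, g, b)


def rasterize_trace(trace: List[Point], width: int, height: int) -> List[List[int]]:
    """Return a height x width mask with 1 on pixels the trace passes through."""
    mask = [[0] * width for _ in range(height)]

    def mark(x: int, y: int) -> None:
        if 0 <= x < width and 0 <= y < height:  # ignore out-of-frame points
            mask[y][x] = 1

    if len(trace) == 1:
        mark(*trace[0])
    for (x0, y0), (x1, y1) in zip(trace, trace[1:]):
        # Sample each segment densely enough to touch every pixel along it.
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            mark(round(x0 + (x1 - x0) * i / steps),
                 round(y0 + (y1 - y0) * i / steps))
    return mask


def overlay(image: List[List[Pixel]], mask: List[List[int]],
            color: Pixel = (255, 0, 0)) -> List[List[Pixel]]:
    """Paint masked pixels in `color`; all other pixels are left unchanged."""
    return [[color if m else px for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


if __name__ == "__main__":
    blank = [[(0, 0, 0)] * 5 for _ in range(4)]
    mask = rasterize_trace([(0, 0), (3, 0), (3, 2)], width=5, height=4)
    for row in overlay(blank, mask):
        print("".join("#" if px != (0, 0, 0) else "." for px in row))
```

Under this assumption the executor's input stays an ordinary image, so the same conditioning could in principle be reused across executors that accept image observations.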

[Figure: trajectory prediction results; lower is better]

Our high-level task manager generalizes across diverse robotic setups and environments, including different manipulators, objects, and scenes. Because high-level task planning is decoupled from low-level control, the manager can plug into downstream executor policies trained on different embodiments and datasets.

“Take out the trash bag.”

“Grasp the pot in the sink.”

“Pick up the coke can.”

“Pick up the Pringles.”

“Put orange on the cutting board.”

“Grasp the bowl.”

“Take the cup from the table.”

“Pick up the cup.”

“Grasp the green can.”

Real-World Rollouts

“Help me organize the table, put all vegetables and fruits in the black bowl, and the rest of the items in the container.”

Citation

@misc{liu2026longhorizonmanipulationtraceconditionedvla,
      title={Long-Horizon Manipulation via Trace-Conditioned VLA Planning},
      author={Isabella Liu and An-Chieh Cheng and Rui Yan and Geng Chen and Ri-Zhao Qiu and Xueyan Zou and Sha Yi and Hongxu Yin and Xiaolong Wang and Sifei Liu},
      year={2026},
      eprint={2604.21924},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2604.21924},
}
