Generalizable Coarse-to-fine Robot Manipulation via Language-aligned 3D Keypoints
Abstract
Hierarchical coarse-to-fine policies, in which a coarse branch predicts a region of interest that guides a fine-grained action predictor, have shown significant potential in robotic 3D manipulation, particularly by improving sample efficiency and enabling more precise manipulation. However, even when augmented with pre-trained models, these hierarchical policies still struggle to generalize. To improve generalization to novel instructions and environment variations, we propose the Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that integrates three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) 3D-aware representation. Through comprehensive experiments in simulation and on a real robot, we demonstrate its superior generalization capability. Specifically, on GemBench, a benchmark designed for evaluating generalization, our approach achieves a 12% higher average success rate than the SOTA method while using only 1/5 of the training trajectories. In real-world experiments, our policy, trained on only 10 demonstrations, successfully generalizes to novel instructions and environments.
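To make the coarse-to-fine idea concrete, here is a minimal sketch of how a predicted 3D keypoint could define a region of interest for a fine-grained predictor. It is an illustration under stated assumptions, not CLAP's actual implementation: the function names, the cubic crop, the `half_extent` value, and the toy point cloud are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def crop_around_keypoint(
    points: Sequence[Point], keypoint: Point, half_extent: float
) -> List[Point]:
    """Keep only points inside an axis-aligned cube centered on the keypoint."""
    return [
        p
        for p in points
        if all(abs(p[i] - keypoint[i]) <= half_extent for i in range(3))
    ]


def recenter(points: Sequence[Point], keypoint: Point) -> List[Point]:
    """Express points relative to the keypoint so the fine stage sees local coordinates."""
    return [
        (p[0] - keypoint[0], p[1] - keypoint[1], p[2] - keypoint[2])
        for p in points
    ]


# Toy scene point cloud (hypothetical values, in meters).
scene = [
    (0.0, 0.0, 0.0),
    (0.5, 0.5, 0.5),
    (0.55, 0.45, 0.5),
    (1.0, 1.0, 1.0),
]

# Stand-in for the coarse branch output: a 3D keypoint near the target object.
coarse_keypoint = (0.5, 0.5, 0.5)

# Region of interest that a fine-grained action predictor would consume.
roi = crop_around_keypoint(scene, coarse_keypoint, half_extent=0.1)
local_roi = recenter(roi, coarse_keypoint)
```

In this sketch the fine stage only ever sees `local_roi`, the points near the keypoint expressed in keypoint-centered coordinates, which is one simple way a coarse prediction can narrow the input for precise action prediction.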
Experimental Results
CLAP for GemBench L1 Tasks
CLAP for GemBench L2 Tasks
CLAP for GemBench L3 Tasks
CLAP for GemBench L4 Tasks
BibTeX
@article{hu2025generalizablecoarsetofinerobotmanipulation,
title={Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints},
author={Jianshu Hu and Lidi Wang and Shujia Li and Yunpeng Jiang and Xiao Li and Paul Weng and Yutong Ban},
year={2025},
eprint={2509.23575},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2509.23575},
}