Generalizable Coarse-to-fine Robot Manipulation via Language-aligned 3D Keypoints

1Shanghai Jiao Tong University   2Duke Kunshan University

Under Review
Indicates Corresponding Authors
Our method achieves strong generalization ability by decomposing tasks into step-wise language instructions, each aligned with a 3D keypoint.

Abstract

Hierarchical coarse-to-fine policies, in which a coarse branch predicts a region of interest to guide a fine-grained action predictor, have demonstrated significant potential in robotic 3D manipulation tasks, notably by improving sample efficiency and enabling more precise manipulation. However, even when augmented with pre-trained models, these hierarchical policies still suffer from generalization issues. To enhance generalization to novel instructions and environment variations, we propose Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that integrates three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) 3D-aware representation. Through comprehensive experiments in simulation and on a real robot, we demonstrate its superior generalization capability. Specifically, on GemBench, a benchmark designed for evaluating generalization, our approach achieves a 12% higher average success rate than the SOTA method while using only 1/5 of the training trajectories. In real-world experiments, our policy, trained on only 10 demonstrations, successfully generalizes to novel instructions and environments.

We propose a novel coarse-to-fine 3D manipulation policy comprising a coarse task planner and a fine-grained action predictor. The coarse task planner reasons about the task plan and the positions of task-related objects to generate language-aligned 3D keypoints. The fine-grained action predictor fuses the corresponding step instruction with a 3D-aware visual representation from refined observations to predict the final action.
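The data flow described above can be illustrated with a minimal sketch. It is not the CLAP implementation: the names `Step`, `crop_around_keypoint`, and `run_plan`, the cubic crop, and the `half_extent` parameter are all assumptions made for illustration, and the fine-grained predictor is passed in as an arbitrary callable rather than a learned model. The sketch only shows how a plan of (step instruction, 3D keypoint) pairs could be used to refine the observation before each action is predicted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Step:
    """One plan step: a language instruction aligned with a 3D keypoint."""
    instruction: str
    keypoint: Point


def crop_around_keypoint(points: List[Point], keypoint: Point,
                         half_extent: float) -> List[Point]:
    """Keep only points inside an axis-aligned cube centred on the keypoint.

    The cubic region of interest is an assumption of this sketch.
    """
    return [
        p for p in points
        if all(abs(p[i] - keypoint[i]) <= half_extent for i in range(3))
    ]


def run_plan(plan: List[Step], points: List[Point], half_extent: float,
             fine_predictor: Callable[[str, List[Point]], object]) -> list:
    """For each step, refine the observation and query the fine predictor."""
    actions = []
    for step in plan:
        refined = crop_around_keypoint(points, step.keypoint, half_extent)
        actions.append(fine_predictor(step.instruction, refined))
    return actions


# Toy usage with a placeholder predictor that just reports what it received.
scene = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (1.0, 1.0, 1.0)]
plan = [Step("grasp the red block", (0.0, 0.0, 0.0)),
        Step("place it on the plate", (1.0, 1.0, 1.0))]
result = run_plan(plan, scene, 0.1, lambda text, pts: (text, len(pts)))
```

In this toy run the placeholder predictor sees only the points near each step's keypoint, which is the sense in which the coarse stage guides the fine stage.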

Experimental Results

We choose GemBench as the generalization benchmark for evaluation in simulation.
Our method achieves a 12% improvement in average success rate. CLAP shows strong generalization ability with respect to visual changes, object variations, and novel language instructions. CLAP is trained with only 20 episodes per task variation, while all other methods are trained with 100 episodes per task variation.

CLAP for GemBench L1 Tasks

CLAP for GemBench L2 Tasks

CLAP for GemBench L3 Tasks

CLAP for GemBench L4 Tasks

Real-world Experimental Settings and Evaluation Results

The agent can solve novel tasks under all perturbations in the real world.

BibTeX

@article{hu2025generalizablecoarsetofinerobotmanipulation,
      title={Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints},
      author={Jianshu Hu and Lidi Wang and Shujia Li and Yunpeng Jiang and Xiao Li and Paul Weng and Yutong Ban},
      year={2025},
      eprint={2509.23575},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.23575},
}