Any-point Trajectory Modeling for Policy Learning

1UC Berkeley, 2IIIS, Tsinghua University, 3Stanford University,
4Shanghai Artificial Intelligence Laboratory, 5Shanghai Qi Zhi Institute, 6CUHK

ATM learns point track proposals for arbitrary 2D points from actionless video datasets, enabling sample-efficient policy learning and cross-embodiment transfer. We visualize the predicted tracks, beginning from the set of blue points.



Abstract

Learning from demonstration is a powerful method for teaching robots new skills, and having more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across over 130 language-conditioned tasks, evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer learning of manipulation skills from human videos.


Method Overview

Prior work on policy learning from video reasons directly in image space, which is computationally expensive and prone to hallucination. We instead take inspiration from particle-based approaches and leverage a point representation. Compared to image generation, points naturally capture inductive biases such as object permanence, and decouple the relevant motion of objects from lighting and texture. ATM first pre-trains a language-conditioned track prediction model on video data to predict the future trajectories of arbitrary points within a video frame. Using these learned track proposals, we then train policies from limited human demonstrations (e.g., 10 demonstrations per task), using a simple fixed set of 64 points in both the third-person camera view and the wrist camera view.
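The two-stage recipe can be summarized with the minimal PyTorch sketch below. The module sizes, the flattened image feature, and the precomputed language embedding are illustrative assumptions, not the exact architecture used in the paper.

import torch
import torch.nn as nn


class TrackTransformer(nn.Module):
    """Stage 1: predict future 2D positions of arbitrary query points in a frame."""

    def __init__(self, horizon=16, d_model=256):
        super().__init__()
        self.horizon = horizon
        self.point_embed = nn.Linear(2, d_model)             # (x, y) query points
        self.frame_embed = nn.Linear(3 * 32 * 32, d_model)   # flattened image features (assumed shape)
        self.lang_embed = nn.Linear(512, d_model)             # precomputed language embedding (assumed size)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.track_head = nn.Linear(d_model, 2 * horizon)     # future (x, y) for each step

    def forward(self, points, frame_feat, lang_feat):
        # points: (B, N, 2); frame_feat: (B, 3*32*32); lang_feat: (B, 512)
        tokens = torch.cat(
            [
                self.point_embed(points),                      # one token per query point
                self.frame_embed(frame_feat).unsqueeze(1),     # one image token
                self.lang_embed(lang_feat).unsqueeze(1),       # one language token
            ],
            dim=1,
        )
        h = self.backbone(tokens)[:, : points.shape[1]]        # keep only the point tokens
        return self.track_head(h).view(points.shape[0], points.shape[1], self.horizon, 2)


class TrackGuidedPolicy(nn.Module):
    """Stage 2: map predicted tracks to an action (the full method also sees images and proprioception)."""

    def __init__(self, num_points=64, horizon=16, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_points * horizon * 2, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, tracks):
        return self.net(tracks.flatten(1))


# The track transformer is pre-trained on actionless videos; the policy is then
# trained on a handful of action-labeled demonstrations with the tracker frozen.
tracker = TrackTransformer()
policy = TrackGuidedPolicy()
points = torch.rand(1, 64, 2)             # fixed set of 64 query points, normalized coordinates
frame_feat = torch.rand(1, 3 * 32 * 32)   # placeholder visual features
lang_feat = torch.rand(1, 512)            # placeholder language embedding
with torch.no_grad():
    tracks = tracker(points, frame_feat, lang_feat)   # (1, 64, 16, 2) future point positions
action = policy(tracks)                               # (1, 7) robot action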


Results

We evaluate our method on a challenging simulation benchmark (LIBERO) comprising 130 language-conditioned manipulation tasks, and on 5 tasks in a real-world UR5 Kitchen environment. Our experiments demonstrate that trajectory-guided policies significantly surpass strong video pre-training baselines. With dense supervision from the predicted tracks, our trained policies can perform long-horizon tasks and reason about objects, spatial locations, and language instructions. We visualize policy rollouts on all 130 LIBERO tasks below:



Policy Rollout Visualization

Use the tabs and the dropdown menu to select the task suite in the benchmark and the language instruction. The colored curves indicate the locations (in the camera frame) that the points should move to in future time steps. For simulation tasks, a green border indicates successful task completion.


Tracks Enable Cross-embodiment Learning

ATM's track transformer can leverage videos of different embodiments accomplishing the same task to capture the relevant motion. In the examples below, we train the track transformer on a large dataset of actionless videos from the first embodiment together with 10 demonstrations from the second embodiment, and then learn the policy from only those 10 action-labeled demonstrations. By incorporating the cross-embodiment videos, ATM generates higher-fidelity tracks, and its policy performance increases substantially.
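A minimal sketch of this data recipe follows; the dataset paths and the label_tracks helper are hypothetical placeholders rather than the released training code.

from pathlib import Path


def label_tracks(video_dir: Path) -> list[dict]:
    """Annotate each actionless video clip with 2D point tracks.

    Placeholder: in practice an off-the-shelf point tracker is run over the
    frames, so track prediction can be supervised without any action labels.
    """
    return [{"video": str(p), "tracks": None} for p in sorted(video_dir.glob("*.mp4"))]


# Track-model pre-training sees videos from BOTH embodiments; no actions required.
human_clips = label_tracks(Path("data/human_videos"))    # e.g. 100 human videos
ur5_clips = label_tracks(Path("data/ur5_demos"))          # e.g. 10 UR5 demonstration videos
track_pretraining_set = human_clips + ur5_clips

# Policy learning then uses only the small action-labeled UR5 subset,
# guided by tracks predicted by the pre-trained track transformer.
policy_training_set = Path("data/ur5_demos")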

Fold the cloth and pull it to the right: tracks model changes in deformation.

Video pre-training data: 100 human videos + 10 UR5 videos.
ATM policy success rate: UR5 only 0%; Human+UR5 63%.

Put the tomato into the pan and close the cabinet door: tracks effectively guide long-horizon behaviors.

Video pre-training data: 100 human videos + 10 UR5 videos.
ATM policy success rate: UR5 only 0%; Human+UR5 63%.

Use the broom to sweep the toys into the dustpan and put it in front of the dustpan: tracks enable reasoning about tools.

Video pre-training data: 100 human videos + 10 UR5 videos.
ATM policy success rate: UR5 only 13%; Human+UR5 60%.

Pick up the can and place it in the bin: tracks transfer between robots.

Video pre-training data: 160 Franka videos + 10 UR5 videos.
ATM policy success rate: UR5 only 47%; Franka+UR5 80%.


Which is better for video pre-training? Generative Video Model vs. Trajectory Model

To better understand the advantages of ATM's track subgoals, we compare them qualitatively to image subgoals generated by UniPi (left). To disentangle the effects of open-loop versus closed-loop video generation, we additionally instantiate UniPi-Replan (right), which proposes new image subgoals every 8 actions.
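The two control loops being compared can be sketched roughly as follows; the propose_image_subgoals and act_toward callables, the episode length, and the subgoal pacing are hypothetical stand-ins, not UniPi's actual interface.

def rollout_unipi(propose_image_subgoals, act_toward, obs, episode_len=96):
    """Open-loop: generate the full image-subgoal plan once, then follow it."""
    subgoals = propose_image_subgoals(obs)                  # whole plan generated up front
    for t in range(episode_len):
        goal = subgoals[min(t // 8, len(subgoals) - 1)]     # advance through the fixed plan (pacing assumed)
        obs = act_toward(obs, goal)
    return obs


def rollout_unipi_replan(propose_image_subgoals, act_toward, obs,
                         episode_len=96, replan_every=8):
    """Closed-loop: propose a fresh image subgoal every `replan_every` actions."""
    goal = None
    for t in range(episode_len):
        if t % replan_every == 0:
            goal = propose_image_subgoals(obs)[0]           # re-generate from the latest observation
        obs = act_toward(obs, goal)
    return obs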


Qualitatively, UniPi suffers from motor-control failures caused by the lack of fine-grained detail in its image subgoals. UniPi-Replan additionally suffers from image-generation failures, producing noisy images when out of distribution or images that correspond to a different task.

BibTeX

@misc{wen2023anypoint,
      title={Any-point Trajectory Modeling for Policy Learning},
      author={Chuan Wen and Xingyu Lin and John So and Kai Chen and Qi Dou and Yang Gao and Pieter Abbeel},
      year={2023},
      eprint={2401.00025},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}