ATM learns track proposals for arbitrary 2D points from actionless video datasets, enabling sample-efficient policy learning and cross-embodiment transfer. We visualize the predicted tracks, which begin at the set of blue points.
Learning from demonstration is a powerful method for teaching robots new skills, and more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict the future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across more than 130 language-conditioned tasks, evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer learning of manipulation skills from human videos.
Prior work on policy learning from video reasons in the space of images, which is computationally expensive and prone to hallucination. We take inspiration from particle-based approaches and leverage a point representation. Compared to image generation, points naturally capture inductive biases such as object permanence, and decouple the relevant motion of objects from lighting and texture. ATM first pre-trains a language-conditioned track prediction model on video data to predict the future trajectories of arbitrary points within a video frame. Using these learned track proposals, we can then train policies from limited human demonstrations (e.g., 10 demonstrations per task), using a simple fixed set of 64 points across the third-person camera view and the wrist view.
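As a concrete illustration of this pipeline, below is a minimal Python sketch of track-guided control under stated assumptions: the `track_transformer` and `policy` interfaces are hypothetical stand-ins (the released code may differ), and the grid-sampling routine simply places a fixed number of evenly spaced query points on each camera view.

```python
# Minimal sketch of track-guided control in the style of ATM.
# Assumptions: `track_transformer` and `policy` are hypothetical callables,
# not the project's released models.
import numpy as np

def sample_fixed_grid(height: int, width: int, n_points: int = 32) -> np.ndarray:
    """Evenly spaced grid of 2D query points on one camera view."""
    side = int(np.ceil(np.sqrt(n_points)))
    xs = np.linspace(0, width - 1, side)
    ys = np.linspace(0, height - 1, side)
    grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)
    return grid[:n_points]  # (n_points, 2) pixel coordinates

def act_with_tracks(frames, language, track_transformer, policy):
    """One control step: predict future point tracks, then condition the policy on them.

    frames: dict mapping camera name -> HxWx3 image (e.g., third-person and wrist views).
    track_transformer(frame, points, language) -> (n_points, horizon, 2) future tracks.
    policy(frames, tracks, language) -> low-level robot action.
    """
    tracks = {}
    for cam, frame in frames.items():
        points = sample_fixed_grid(frame.shape[0], frame.shape[1])
        tracks[cam] = track_transformer(frame, points, language)
    return policy(frames, tracks, language)
```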
We evaluate our method on a challenging simulation benchmark (LIBERO) comprising 130 language-conditioned manipulation tasks, and on 5 tasks in a real-world UR5 Kitchen environment. Our experiments demonstrate that trajectory-guided policies significantly surpass strong video pre-training baselines. With dense supervision from the predicted tracks, our trained policies perform long-horizon tasks and reason about objects, spatial locations, and language instructions. We visualize policy rollouts on all 130 LIBERO tasks below:
Use the tabs and the dropdown menu to select the task suite in the benchmark and the language instruction. The colored curves indicate the locations (in the camera frame) that the points should move to in future time steps. For simulation tasks, a green border indicates successful task completion.
ATM's track transformer can leverage videos of different embodiments accomplishing the same task to capture the relevant motion. In the examples below, we train the track transformer on a large dataset of actionless videos from the first embodiment together with 10 demonstrations from the second embodiment, and then perform policy learning on those 10 action-labeled demonstrations. By incorporating the cross-embodiment videos, ATM generates higher-fidelity tracks, and ATM's policy performance increases substantially.
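A hypothetical sketch of this two-stage cross-embodiment recipe is shown below; `pretrain_tracks` and `train_policy` are placeholder callables, not the project's actual training scripts.

```python
# Sketch of the two-stage cross-embodiment recipe described above.
# `pretrain_tracks` and `train_policy` are hypothetical placeholders.
def train_cross_embodiment(source_videos, target_demos, pretrain_tracks, train_policy):
    """source_videos: actionless videos from another embodiment (e.g., 100 human videos).
    target_demos: ~10 action-labeled demonstrations on the target robot (e.g., UR5)."""
    # Stage 1: track pre-training needs only video frames, so the action-free
    # source videos and the target demos (with actions dropped) can be mixed.
    track_model = pretrain_tracks(source_videos + [d["video"] for d in target_demos])
    # Stage 2: the policy is trained only on the small action-labeled set,
    # guided by tracks predicted by the pre-trained track model.
    return train_policy(track_model, target_demos)
```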
Fold the cloth and pull it to the right: tracks model changes in deformation.
Video pre-training: 100 human videos + 10 UR5 videos.
ATM policy learning success rate: UR5 only 0%; Human+UR5 63%.
Put the tomato into the pan and close the cabinet door: tracks effectively guide long-horizon behaviors.
Video pre-training: 100 human videos + 10 UR5 videos.
ATM policy learning success rate: UR5 only 0%; Human+UR5 63%.
Use the broom to sweep the toys into the dustpan and put it in front of the dustpan: tracks enable reasoning about tools.
Video pre-training: 100 human videos + 10 UR5 videos.
ATM policy learning success rate: UR5 only 13%; Human+UR5 60%.
Pick up the can and place it in the bin: tracks transfer between robots.
Video pre-training: 160 Franka videos + 10 UR5 videos.
ATM policy learning success rate: UR5 only 47%; Franka+UR5 80%.
To better understand the advantages of ATM's track subgoals, we compare them qualitatively to image subgoals generated by UniPi (left). To decouple the effects of open-loop and closed-loop video generation, we additionally instantiate UniPi-Replan (right), which proposes new image subgoals every 8 actions.
Qualitatively, UniPi suffers from motor control failures caused by the lack of fine-grained detail in its image subgoals. UniPi-Replan additionally experiences failures in image generation, producing noisy images when out of distribution or images that correspond to a different task.
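For clarity, here is a hedged sketch of the closed-loop replanning protocol described above; `env`, `video_model`, and `goal_policy` are hypothetical stand-ins, not the actual UniPi-Replan implementation.

```python
# Sketch of closed-loop replanning (UniPi-Replan style): regenerate image
# subgoals every k actions. All objects passed in are hypothetical stand-ins.
def rollout_replan(env, video_model, goal_policy, language, k=8, max_steps=300):
    obs = env.reset()
    subgoals = []
    for t in range(max_steps):
        if t % k == 0:
            # Replan: generate fresh image subgoals from the current observation.
            subgoals = video_model(obs["image"], language)
        action = goal_policy(obs, subgoals, language)
        obs, _, done, _ = env.step(action)  # gym-style step interface assumed
        if done:
            break
    return obs
```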
Pick up the alphabet soup and place it in the basket: UniPi-Replan fails to pick up the soup. It is difficult to determine whether the generated image subgoals reach for the soup can in the back, or the carton in the front.
ATM
UniPi
UniPi-Replan
Pick up the bbq sauce and place it in the basket: UniPi fails to pick up the bbq sauce, as image subgoals lack finer details relevant to motor control.
ATM
UniPi
UniPi-Replan
Open the middle drawer of the cabinet: UniPi-Replan's diffusion model generates subgoals corresponding to a different task, indicating the increased difficulty of closed-loop video generation.
ATM
UniPi
UniPi-Replan
Open the top drawer of the cabinet: Both UniPi and UniPi-Replan experience motor control failure, as it is difficult to tell when to close the gripper from noisy image subgoals.
ATM
UniPi
UniPi-Replan
@misc{wen2023anypoint,
title={Any-point Trajectory Modeling for Policy Learning},
author={Chuan Wen and Xingyu Lin and John So and Kai Chen and Qi Dou and Yang Gao and Pieter Abbeel},
year={2023},
eprint={2401.00025},
archivePrefix={arXiv},
primaryClass={cs.RO}
}