NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos
Hongyu Li*1,2, Lingfeng Sun*1, Yafei Hu1, Duy Ta1, Jennifer Barry1, George Konidaris2, Jiahui Fu1
1Robotics and AI Institute   2Brown University
*These authors contributed equally to this work
Abstract

Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into actionable 3D object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate NovaFlow on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, achieving effective zero-shot execution without demonstrations or embodiment-specific training.
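As a concrete illustration of the rigid-object step mentioned above (computing relative poses from the object flow), the following minimal Python sketch recovers a rigid transform from tracked 3D points with the standard Kabsch (SVD) alignment. The function name and interface are illustrative assumptions, not NovaFlow's actual implementation.

```python
# Minimal sketch: recover a rigid object's relative pose from 3D object flow
# using Kabsch (SVD-based point-set) alignment. Illustrative only.
import numpy as np

def relative_pose_from_flow(points_t0: np.ndarray, points_t1: np.ndarray):
    """Estimate (R, t) mapping tracked object points at time t0 onto their
    flowed positions at time t1. Both inputs have shape (N, 3)."""
    c0 = points_t0.mean(axis=0)
    c1 = points_t1.mean(axis=0)
    H = (points_t0 - c0).T @ (points_t1 - c1)       # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c1 - R @ c0
    return R, t
```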

Interactive Viewer

3D Point Cloud & Actionable Flow from Generated Video
Note: The visualized end-effector is offset from the physical one because longer gripper fingers were used in the real-world experiments.

Initial Observation

Generated Video

Execution Video (1x speed)

Methods

Flow generator pipeline. Given an initial image and a task prompt, a video generation model produces a video of plausible object motion. Pretrained perception modules then distill this video into an actionable 3D object flow by (1) lifting the 2D video to 3D with monocular depth estimation, (2) calibrating the estimated depth against the initial depth observation, (3) tracking dense per-point motion with 3D point tracking, and (4) extracting the object-centric 3D flow via object grounding.
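The sketch below outlines these four steps in Python, with the video generation, depth estimation, point tracking, and grounding models passed in as callables. All names and signatures are illustrative assumptions rather than NovaFlow's actual interfaces.

```python
# Illustrative outline of the flow-generation steps, assuming the off-the-shelf
# models are supplied as callables with the shapes noted in the comments.
import numpy as np

def distill_actionable_flow(rgb0, depth0, prompt,
                            generate_video,    # (rgb, prompt) -> (T, H, W, 3) frames
                            estimate_depth,    # frame -> (H, W) relative depth
                            track_points_3d,   # (frames, depths) -> (T, N, 3) tracks
                            ground_object):    # (rgb, prompt) -> (N,) boolean mask
    # Generate a video of plausible object motion for the task.
    frames = generate_video(rgb0, prompt)

    # (1) Lift each generated frame to 3D with monocular depth estimation.
    depths = np.stack([estimate_depth(f) for f in frames])

    # (2) Calibrate the estimated depth against the sensed initial depth
    #     with a least-squares scale and shift.
    scale, shift = np.polyfit(depths[0].ravel(), depth0.ravel(), deg=1)
    depths = scale * depths + shift

    # (3) Track dense per-point motion in 3D across the lifted frames.
    tracks = track_points_3d(frames, depths)

    # (4) Keep only points belonging to the task-relevant object.
    mask = ground_object(rgb0, prompt)
    return tracks[:, mask]    # actionable 3D object flow, shape (T, M, 3)
```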
Flow executor pipeline. The initial end-effector pose is selected from grasp proposal candidates. Robot trajectories are then planned from the actionable flow, subject to costs and constraints, and tracked by the robots.
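A minimal sketch of this step, assuming a stable grasp so the gripper moves rigidly with the object: select_grasp is a hypothetical callable, relative_pose_from_flow is the helper sketched after the abstract, and the resulting waypoints would still be handed to a trajectory optimizer that enforces the costs and constraints mentioned above.

```python
# Hedged sketch: derive end-effector waypoints by composing each rigid
# transform recovered from the object flow with the chosen grasp pose.
import numpy as np

def plan_ee_waypoints(object_flow, grasp_candidates, select_grasp,
                      relative_pose_from_flow):
    """object_flow: (T, N, 3) actionable 3D flow of the grasped object.
    grasp_candidates: list of 4x4 end-effector poses in the world frame."""
    # Initial end-effector pose from the grasp proposal candidates.
    T_grasp = select_grasp(grasp_candidates, object_flow[0])

    waypoints = [T_grasp]
    for t in range(1, object_flow.shape[0]):
        # Rigid transform moving the object from its initial to current state.
        R, p = relative_pose_from_flow(object_flow[0], object_flow[t])
        T_obj = np.eye(4)
        T_obj[:3, :3], T_obj[:3, 3] = R, p
        # With a stable grasp, the gripper follows the object's motion.
        waypoints.append(T_obj @ T_grasp)
    return waypoints    # to be refined by trajectory optimization and tracked
```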

Experiments
Experiment Results
Failure Modes