Whole-Body Conditioned Egocentric Video Prediction

• Predicting Ego-centric Video from human Actions (PEVA).
• Given past video frames and an action specifying a desired change in 3D pose, PEVA predicts the next video frame.
• Our results show that, given the first frame and a sequence of actions, our model can generate videos of atomic actions (a), simulate counterfactuals (b), and support long video generation (c).
• Recent years have brought significant advances in world models that learn to simulate future outcomes for planning and control.
• From intuitive physics to multi-step video prediction, these models have grown increasingly powerful and expressive.
• But few are designed for truly embodied agents.
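
The prediction setup described above (past frames plus a pose-change action in, next frame out, repeated over a sequence of actions) can be illustrated with a minimal autoregressive rollout sketch. This is not PEVA's implementation: `rollout`, `toy_predictor`, `context_len`, and the list-of-floats stand-ins for frames and actions are all hypothetical names and simplifications chosen for illustration.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]   # stand-in for an image; a real frame would be a pixel array
Action = List[float]  # stand-in for a desired change in 3D pose


def rollout(
    predict_next: Callable[[Sequence[Frame], Action], Frame],
    first_frame: Frame,
    actions: Sequence[Action],
    context_len: int = 4,  # assumed context window size, not taken from the source
) -> List[Frame]:
    """Generate one frame per action, feeding each prediction back as context."""
    context: Deque[Frame] = deque([first_frame], maxlen=context_len)
    video = [first_frame]
    for action in actions:
        frame = predict_next(list(context), action)
        context.append(frame)  # oldest frame is dropped once the window is full
        video.append(frame)
    return video


def toy_predictor(past: Sequence[Frame], action: Action) -> Frame:
    # Placeholder dynamics (hypothetical): shift the latest frame by the action.
    return [p + a for p, a in zip(past[-1], action)]


video = rollout(toy_predictor, [0.0, 0.0], [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
```

In this sketch, a counterfactual amounts to calling `rollout` again with the same `first_frame` and a different `actions` sequence, and a longer video is produced by passing more actions.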

Sources:

http://bair.berkeley.edu/blog/2025/07/01/peva/