

Paper Club #3 - Quentin Garrido
Join us on Feb 11th for the third edition of the Unaite Paper Club, featuring Quentin Garrido, research scientist at Meta FAIR!
This talk explores recent developments in World Models and the learning of expressive and efficient latent spaces from video.
We will begin by discussing V-JEPA 2[1], a state-of-the-art video encoder trained with self-supervised learning by predicting missing parts of videos in latent space. We will then study the model's understanding of intuitive physical knowledge when predicting the future[2].
Finally, we will discuss how we can learn a world model that predicts physical actions from videos that contain no action information, by learning a Latent Action World Model[3]. We will demonstrate how such a model can be used to solve planning tasks in robotics and navigation.
[1] Assran, Mido, et al. "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning." https://arxiv.org/abs/2506.09985
[2] Garrido, Quentin, et al. "Intuitive physics understanding emerges from self-supervised pretraining on natural videos." https://arxiv.org/abs/2502.11831
[3] Garrido, Quentin, et al. "Learning Latent Action World Models In The Wild." https://arxiv.org/abs/2601.05230