Presented by

Every week we pick one paper and go deep — video generation, world models, physical reasoning, diffusion, flow matching, and everything in between.

VChain: Chain-of-Visual-Thought for Reasoning in Video Generation by Ziqi Huang

Video Model Journal Club

Virtual

About Event

Abstract: Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time visual-state adaptation of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.

Speaker: Ziqi Huang — Ph.D. candidate at MMLab@NTU, advised by Prof. Ziwei Liu. Her research focuses on generative models and their evaluation for image and video generation. Apple Scholar in AI/ML, Google PhD Fellow, Microsoft Research Fellow.

Website: https://journal.video-reason.com/

To join over Zoom, please subscribe to receive the Zoom link: https://forms.gle/ebgyvtLRz8ABTfdX6
