

Multimodal Weekly 73: Video-Language Models, Reasoning-Across-Time in Videos, Long-Horizon Multimodal Inference, and Scaling Vision Encoders
In the 73rd session of Multimodal Weekly, we have four exciting presentations on video-language models, reasoning-across-time in videos, long-horizon multimodal inference, and scaling vision encoders for multimodal models.
✅ Peter Yu will present Espresso, a novel method that separately extracts and compresses spatial and temporal information in videos.
✅ Jr-Jen Chen will present ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events.
✅ Zhuoyi Huang will present MARPLE, a benchmark for evaluating long-horizon inference capabilities using multi-modal evidence.
✅ Jieneng Chen will present a study analyzing the redundancy of visual tokens and efficient training in large multimodal models.
Join the Multimodal Minds community to connect with the speakers!
Multimodal Weekly is organized by Twelve Labs, a startup building multimodal foundation models for video understanding. Learn more about Twelve Labs here: https://twelvelabs.io/