How Multimodal Models Are Actually Built: From CLIP to Qwen2.5-VL
Multimodal models have evolved rapidly from early image-text alignment systems into assistants that can read documents, interpret charts, ground visual details, and reason over video. This session looks beyond leaderboard hype to focus on the design ideas that shaped that evolution: contrastive alignment, vision-language bridging, multimodal instruction tuning, and grounded long-context modeling.
The main discussion paper will be the Qwen2.5-VL Technical Report, supported by a short lineage overview of CLIP, Flamingo, BLIP-2, and LLaVA to show how the field arrived here.
The session aims to give attendees a practical mental model of modern multimodal systems: how they are structured, which architectural choices matter most, and what those choices enable in real-world use cases. We will close with a brief forward-looking discussion of an adjacent direction: Qwen3.5 as a signal of the shift toward native multimodal agents.
Suggested pre-reading:
· Main paper: Qwen2.5-VL Technical Report
· Lineage papers: CLIP, Flamingo, BLIP-2, LLaVA
· Optional forward-looking reference: Qwen3.5 announcement/blog
Important notice
Participants are expected to read the material in advance so they can engage more fully with the technical and methodological details during the session.
If you have relevant knowledge or experience, or have come across something online (a blog post, article, etc.) that would add to the discussion, please contribute!
More About the Host
Anshu Singh is an AI and Data Privacy Research Engineer at the Government Technology Agency (GovTech), Singapore. Before joining GovTech, her research focused on the intersection of computer vision and privacy at the NUS Centre for Research in Privacy Technologies. She holds a Master’s degree in AI from the National University of Singapore and enjoys building practical, user-centric solutions by putting research into practice.
More About the Series
Paper Club is Lorong AI’s community-driven initiative where members gather to discuss and analyze academic papers, research articles, or key developments in artificial intelligence.
Get involved: Learn more about Lorong AI | Speaker Sign-up | WhatsApp Community | LinkedIn | X
