Presented by
Lorong AI

How Multimodal Models Are Actually Built: From CLIP to Qwen2.5-VL

Registration
Approval Required
Your registration is subject to host approval.
About Event

Multimodal models have evolved rapidly from early image-text alignment systems into assistants that can read documents, interpret charts, ground visual details, and reason over video. This session looks beyond leaderboard hype to focus on the design ideas that shaped that evolution: contrastive alignment, vision-language bridging, multimodal instruction tuning, and grounded long-context modeling.
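To make the first of those design ideas concrete, here is a minimal sketch of CLIP-style contrastive alignment (an illustrative simplification, not the actual CLIP implementation): image and text embeddings are L2-normalized, and a symmetric cross-entropy over the cosine-similarity matrix pulls matching pairs together while pushing mismatched pairs apart. The function name, shapes, and temperature value are assumptions for illustration.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss over a batch of
    (image, text) embedding pairs; row i of each matrix is a matching pair."""
    # Normalize embeddings to unit length so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by temperature.
    logits = img @ txt.T / temperature
    # Matching pairs sit on the diagonal.
    labels = np.arange(len(logits))

    def xent(l):
        # Cross-entropy with the diagonal as the target class (stable log-softmax).
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Symmetric: image-to-text and text-to-image directions averaged.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))  # near-identical pairs -> low loss
print(float(contrastive_loss(img, txt)))
```

Later systems in the lineage (Flamingo, BLIP-2, LLaVA) keep a contrastively pretrained vision encoder of this kind and focus instead on bridging its outputs into a language model.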

The main discussion paper will be the Qwen2.5-VL Technical Report, supported by a short lineage overview of CLIP, Flamingo, BLIP-2, and LLaVA to show how the field arrived here.

The session aims to give attendees a practical mental model of modern multimodal systems: how they are structured, which architectural choices matter most, and what those choices enable in real-world use cases. We will close with a brief forward-looking discussion on the adjacent direction: Qwen3.5 as a signal of the shift toward native multimodal agents.

Suggested pre-reading:

· Main paper: Qwen2.5-VL Technical Report

· Lineage papers: CLIP, Flamingo, BLIP-2, LLaVA

· Optional forward-looking reference: Qwen3.5 announcement/blog.


Important notice

  • Participants should read the material in advance so they can engage more fully with the technical and methodological details during the session.

  • If you have relevant knowledge or experience, or have come across something online (a blog post, article, etc.) that would add to the discussion, please contribute!


More About the Host

Anshu Singh is an AI and Data Privacy Research Engineer at the Government Technology Agency (GovTech), Singapore. Before joining GovTech, her research focused on the intersection of computer vision and privacy at the NUS Centre for Research in Privacy Technologies. She holds a Master’s degree in AI from the National University of Singapore and enjoys building practical, user-centric solutions by putting research into practice.


More About the Series

Paper Club is Lorong AI’s community-driven initiative where members gather to discuss and analyze academic papers, research articles, or key developments in artificial intelligence.

Get involved: Learn more about Lorong AI | Speaker Sign-up | WhatsApp Community | LinkedIn | X

Location
Lorong AI @ One-North
69 Ayer Rajah Crescent, Level 3, Singapore 139961