Presented by

Lorong AI

A space for AI practitioners to connect, learn, and grow through curated programs and a supportive community.

More to come in 2026, watch this space!

More info here: https://lorong.ai

How Multimodal Models Are Actually Built: From CLIP to Qwen2.5-VL

Name: How Multimodal Models Are Actually Built: From CLIP to Qwen2.5-VL
Start: 2026-04-20T15:00:00.000+08:00
End: 2026-04-20T17:00:00.000+08:00
Location: Lorong AI @ One-North

About Event

Multimodal models have evolved rapidly from early image-text alignment systems into assistants that can read documents, interpret charts, ground visual details, and reason over video. This session looks beyond leaderboard hype to focus on the design ideas that shaped that evolution: contrastive alignment, vision-language bridging, multimodal instruction tuning, and grounded long-context modeling.

The main discussion paper will be the Qwen2.5-VL Technical Report, supported by a short lineage overview of CLIP, Flamingo, BLIP-2, and LLaVA to show how the field arrived here.

The session aims to give attendees a practical mental model of modern multimodal systems: how they are structured, which architectural choices matter most, and what those choices enable in real-world use cases. We will close with a brief forward-looking discussion of an adjacent direction: Qwen3.5 as a signal of the shift toward native multimodal agents.

Suggested pre-reading:

· Main paper: Qwen2.5-VL Technical Report

· Lineage papers: CLIP, Flamingo, BLIP-2, LLaVA

· Optional forward-looking reference: Qwen3.5 announcement/blog

Important notice

Participants should read the material in advance so they can engage more fully with the technical and methodological details during the session.
If you have relevant knowledge or experience, or have come across something online (a blog post, article, etc.) that would add to the discussion, do contribute!

More About the Host

Anshu Singh is an AI and Data Privacy Research Engineer at the Government Technology Agency (GovTech), Singapore. Before joining GovTech, her research focused on the intersection of computer vision and privacy at the NUS Centre for Research in Privacy Technologies. She holds a Master’s degree in AI from the National University of Singapore and enjoys building practical, user-centric solutions by putting research into practice.

More About the Series

Paper Club is Lorong AI’s community-driven initiative where members gather to discuss and analyze academic papers, research articles, or key developments in artificial intelligence.

Get involved: Learn more about Lorong AI | Speaker Sign-up | WhatsApp Community | LinkedIn | X

Location

Lorong AI @ One-North

69 Ayer Rajah Crescent, Level 3, Singapore 139961
