Led by Mayank Bhaskar and Benedict Emoe-Kabu. Part of the Cohere Labs Open Science initiative https://cohere.com/research/open-science
Hosted By

Bingyi Cao & Koert Chen - TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Google Meet
About Event

This session presents TIPSv2 (a CVPR 2026 paper), a vision-language pretraining framework designed to improve dense patch-text alignment. Weak patch-text alignment is a core limitation of existing image-text encoders and affects several downstream tasks. We will discuss evidence that patch-level distillation substantially improves patch-text alignment, including the finding that a distilled student can outperform its teacher on this capability. This motivates a new self-supervised learning loss, dubbed iBOT++, which extends the masked image modeling objective by allowing unmasked tokens to contribute directly to the loss. Additional enhancements to the pretraining recipe include a more efficient exponential moving average setup and caption sampling across multiple granularities. Together, these components produce a family of image-text encoders evaluated on 9 tasks and 20 datasets, with performance generally matching or exceeding recent vision encoder baselines.
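To make the iBOT++ idea concrete, here is a minimal sketch of a patch-level distillation loss that can be averaged either over masked patches only (the usual masked image modeling setup) or over all patches. It is an illustration under stated assumptions, not the formulation from the paper: the function names, the `include_unmasked` flag, the temperature values, and the uniform weighting of masked and unmasked patches are all hypothetical choices made for this example.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax over one patch's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Numerically stable temperature-scaled log-softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def patch_loss(student_logits, teacher_logits, mask, include_unmasked,
               student_temp=0.1, teacher_temp=0.04):
    """Mean cross-entropy between teacher and student patch distributions.

    mask[i] is True where patch i was masked in the student's input.
    include_unmasked=False: only masked patches count (masked-only baseline).
    include_unmasked=True: every patch counts, with equal weight
    (an assumption; the source does not specify the weighting).
    Temperatures are illustrative defaults, not values from the paper.
    """
    losses = []
    for s, t, is_masked in zip(student_logits, teacher_logits, mask):
        if is_masked or include_unmasked:
            target = softmax(t, teacher_temp)
            log_pred = log_softmax(s, student_temp)
            losses.append(-sum(p * lp for p, lp in zip(target, log_pred)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: two patches, three prototype dimensions, first patch masked.
student = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
teacher = [[0.9, 0.1, -1.0], [0.0, 1.0, 0.0]]
mask = [True, False]

masked_only = patch_loss(student, teacher, mask, include_unmasked=False)
all_patches = patch_loss(student, teacher, mask, include_unmasked=True)
```

In this toy case the student disagrees with the teacher mainly on the unmasked patch, so the masked-only loss barely registers the mismatch while the all-patch variant does. That is the intuition behind letting unmasked tokens contribute directly to the loss.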
