Bingyi Cao & Koert Chen - TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Computer Vision - Cohere Labs Open Science Community

Google Meet

About Event

This session presents TIPSv2 (CVPR 2026 paper), a vision-language pretraining framework designed to improve dense patch-text alignment. Weak patch-text alignment is a core limitation of existing image-text encoders and affects several downstream tasks. We will discuss evidence that patch-level distillation substantially improves patch-text alignment, including the finding that a distilled student can outperform its teacher on this capability. This finding motivates a new self-supervised learning loss, dubbed iBOT++, which extends the masked image modeling objective so that unmasked tokens also contribute directly to the loss. Further enhancements to the pretraining recipe include a more efficient exponential moving average setup and caption sampling across multiple granularities. Together, these components produce a family of image-text encoders evaluated on 9 tasks and 20 datasets, with performance generally matching or exceeding recent vision encoder baselines.
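To make the idea behind the loss concrete, here is a minimal sketch of a patch-level distillation loss in which a flag controls whether unmasked patches are counted. It is an illustration under stated assumptions, not the formulation from the TIPSv2 paper: the function names, the temperatures, and the plain averaging over patches are all hypothetical choices made for this example.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax over one patch's logits."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    log_z = top + math.log(sum(math.exp(s - top) for s in scaled))
    return [s - log_z for s in scaled]


def patch_cross_entropy(student_logits, teacher_logits,
                        student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions for one patch.

    The temperature values are assumptions for illustration only.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_logp = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_logp))


def patch_loss(student_patches, teacher_patches, masked, include_unmasked):
    """Average the per-patch loss over the selected patches.

    include_unmasked=False: only masked patches count (masked image modeling).
    include_unmasked=True: every patch counts, which is the kind of extension
    the abstract describes. The uniform weighting here is an assumption.
    """
    per_patch = [patch_cross_entropy(s, t)
                 for s, t in zip(student_patches, teacher_patches)]
    selected = [loss for loss, is_masked in zip(per_patch, masked)
                if is_masked or include_unmasked]
    return sum(selected) / len(selected) if selected else 0.0


# Three patches with three prototype logits each; only the first is masked.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]
teacher = [[1.5, 0.0, -0.5], [0.3, 0.1, 0.0], [0.0, 2.0, 0.0]]
masked = [True, False, False]

masked_only = patch_loss(student, teacher, masked, include_unmasked=False)
all_patches = patch_loss(student, teacher, masked, include_unmasked=True)
```

In this sketch, changing the student's output on an unmasked patch leaves `masked_only` untouched but changes `all_patches`, which is the sense in which unmasked tokens receive a direct training signal.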

Presented by

Computer Vision - Cohere Labs Open Science Community

Led by Mayank Bhaskar and Benedict Emoe-Kabu. Part of the Cohere Labs Open Science initiative: https://cohere.com/research/open-science