



Advancing the Frontier of Multilingual Multimodality
About the talk:
Aya Vision is a family of open-weight vision-language models that brings strong multilingual performance to multimodal inputs across 23 languages. The work introduces a synthetic annotation framework and a cross-modal model-merging method that preserves text-only skills while enhancing multimodal generative performance. The result is Aya-Vision-8B and Aya-Vision-32B, which achieve best-in-class results at their sizes and competitive win rates against much larger models on generative benchmarks.
We’ll cover:
• Two-stage training for multilingual multimodal alignment across 23 languages
• Multilingual multimodal data pipeline using synthetic instructions and translation/rewriting for quality
• Model merging to preserve strong text-only skills while improving image-grounded generation
• Benchmarks and results: AyaVisionBench and m-WildVision
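As a rough illustration of the model-merging idea above, one common approach is linear interpolation of parameters between two checkpoints that share an architecture, e.g. a text-only model and a multimodal fine-tune. This is a minimal sketch under that assumption; the function name, the use of plain floats in place of weight tensors, and the interpolation weight `alpha` are illustrative, not the talk's actual method.

```python
# Minimal sketch of linear weight merging between two checkpoints with
# identical parameter names. In practice the values would be tensors;
# plain floats are used here to keep the example self-contained.
def merge_state_dicts(text_only, multimodal, alpha=0.5):
    """Return a merged state dict: (1 - alpha) * text + alpha * multimodal."""
    assert text_only.keys() == multimodal.keys(), "checkpoints must match"
    return {
        name: (1 - alpha) * text_only[name] + alpha * multimodal[name]
        for name in text_only
    }

# Toy example: merging a single "weight" halfway between the two models.
merged = merge_state_dicts({"w": 1.0}, {"w": 3.0}, alpha=0.5)
print(merged["w"])  # 2.0
```

The appeal of this style of merging is that it needs no extra training: the merged model can retain much of the text-only model's behavior while gaining image-grounded capability from the multimodal checkpoint.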
About the Speaker:
Saurabh Dash is a researcher at Cohere Labs. His work focuses on multimodal models and efficiency. Prior to Cohere, he was a PhD student at Georgia Tech and a research intern at Apple AI/ML.
Website: https://saurabhdash.com
X: https://x.com/TheyCallMeMr
