Strange Evals - VLMs
a paper reading club, but we deep dive into benchmarks
---
Each session we deep dive into a battery of related, widely cited benchmarks so we can build an intuition for what we're up against. That means reading the papers, but also eyeballing raw benchmark samples (a surprising number of them don't make any sense).
This week we are discussing the big VLM benchmarks to find out why almost all of them are already fully saturated.
---
Pre-reading:
Read the ZeroBench paper: https://zerobench.github.io
OR
Spend 30 minutes eyeballing samples from some of these benchmarks (https://github.com/Lewington-pitsos/brokenevals-vlm) and come with something interesting to share.
---
Special thanks to HUD for hosting us: https://www.hud.ai
Attendance will be limited to keep discussion focused.
Food web image source:
MLA (9th edition): Kembhavi, Aniruddha, et al. "A Diagram Is Worth A Dozen Images." Computer Vision – ECCV 2016, Springer, 2016, pp. 235–251.
