

Lost in Translation: The Future of Multilingual AI Evaluations - #NYTechWeek
An invite-only evening on agentic AI evaluation across languages
As AI transitions from static chatbots to autonomous agents capable of multi-step reasoning and tool use, we've hit a critical wall: the English-Centric Evaluation Gap.
Most multilingual benchmarks are just translated English datasets — and that translation introduces noise, hallucinations, and "translationese" that quietly breaks tasks the best agents should be able to solve. The result: we're mistaking measurement errors for capability gaps.
Over drinks and a closed-door conversation with Spence Green (CEO, LILT), we'll get into the actual taxonomy of pitfalls - from instructional leakage to cultural anchor bias - and the frameworks needed to clean the yardstick before the next generation of global models is built on top of it.
This is a researcher-to-researcher conversation, not a product pitch. We're keeping the room small and the discussion technical.
What We'll Cover
The "Fluent yet Broken" Paradox: Why a translation can be grammatically perfect yet functionally flawed when tool behaviours, locale conventions, or cultural contexts are lost.
GAIA-v2-LILT: How re-auditing the GAIA benchmark recovered an average of 20.7 percentage points in measured performance, showing that apparent "capability gaps" are often just measurement errors.
Terminal-Bench & τ³-bench: Evaluating agentic coding and multi-turn customer support conversations in non-English environments.
Functional and Cultural Alignment: The key requirements and pitfalls when transforming English benchmarks into other languages.
Schedule
5:00 PM - Arrivals, drinks, and networking
6:30 PM - Welcome from The AI Collective and fireside discussion with Spence Green
7:00 PM - Drinks & food continue
LILT is the only AI-native multilingual solution for frontier AI data and enterprise localisation. Specialising in language-grounded alignment and multimodal evaluation, LILT provides research-grade expertise to govern AI systems at scale across 200+ languages.
The AI Collective is a community of practitioners across research and deployment advancing the frontier of AI.