Roundtable Dinner: AI Benchmarking Across Languages hosted by AI Circle & LILT
A Deep Dive with Joern and LILT’s Applied AI Research Team
As AI transitions from static chatbots to autonomous agents capable of multi-step reasoning and tool use, we have hit a critical wall: the English-Centric Evaluation Gap. Most multilingual benchmarks today are built on "translated" versions of English datasets—a process that introduces noise, hallucinations, and "translationese" that makes tasks impossible for even the most capable agents to solve.
In this session, Joern will lead a technical discussion on AI benchmarking across languages and how LILT is redefining what it means to benchmark agentic performance at the frontier.
What We’ll Discuss:
The "Fluent yet Broken" Paradox: Why a translation can be grammatically perfect yet functionally flawed if tool behaviors, locale conventions, or cultural contexts are lost.
GAIA-v2-LILT: A breakdown of how re-auditing the GAIA benchmark recovered an average of +20.7 percentage points in measured performance—proving that current "capability gaps" are often just measurement errors.
Terminal-Bench & tau(3)-bench: Evaluating agentic coding and multi-turn customer support conversations in non-English environments.
Functional and Cultural Alignment: What are the key requirements and pitfalls when transforming English benchmarks into other languages?
The Experience
A Multi-Course Intellectual Tasting
We are pairing a Michelin-starred Mexican dinner with a structured technical "Engagement."
5:30 PM The Warm Up - Cocktails, arrivals, and networking.
6:00 PM The Thesis - Opening note on AI benchmarking.
6:15 PM The Engagement - A curated roundtable. "Bouncers" will be served with each course to drive deep-dive debate.
9:00 PM The Commitment - Closing remarks and the path toward frontier model safety.
About LILT:
LILT is the only AI-native multilingual solution for frontier AI data and enterprise localisation. We help make your data and content multilingual—faster, more accurately, securely, and at scale. Specialising in language-grounded alignment and multimodal evaluation, we provide research-grade expertise to govern AI systems. Unlike crowdsourced options, our curated expert network and continuous quality calibration provide the high-fidelity signals needed to build reliable models ready for global deployment.