

Break Frontier AI — In Your Language
LILTBench Hackathon (Hosted by LILT × The AI Collective)
Registration
Your registration is subject to host approval.
About Event
Can you design a coding task that breaks the world’s best AI models — in your language?
While frontier LLMs are highly proficient on English-centric benchmarks, their capabilities often degrade sharply when they process complex instructions, nuances, or data in other languages.
LILTBench invites applied AI researchers and evaluation engineers to identify, formalize, and benchmark these cross-lingual vulnerabilities. Your objective is to design rigorous evaluation tasks that expose systematic non-English performance gaps in today’s most advanced models.
The Evaluation Architecture
To ensure scientific rigor, all submissions will undergo automated evaluation via a production-grade benchmarking pipeline:
Target Model: Claude Opus 4.6
Framework: Terminal-Bench
Agent Harness: Terminus 2
Execution: Accepted tasks are run through 15 deterministic iterations to map exact pass/fail boundaries.
📅 Schedule
June 15 (Mon) — Kickoff webinar: rules, workflow, evaluation rubric, live demo. Repo made public
June 15–21 — Hackathon week: design, develop, test, and submit tasks
June 21 (Sun) 11:59 PM UTC — Code freeze (all PRs must be passing CI by this deadline)
June 22–23 — Evaluation: accepted tasks run against Claude Opus 4.6 (15 iterations each)
June 24 (Wed) — Awards webinar: winners announced, top tasks showcased
🏆 Prizes & Recognition
Beyond contributing to the advancement of multilingual AI safety and evaluation, top performers will receive:
Global Visibility: The top 5 winners will be featured prominently in the AI Collective Newsletter and across corporate channels.
Cash Prizes: Tiered awards of $1,500 for 1st place, $1,000 for 2nd, $500 for 3rd, etc.
Scoring Rubric
Points are weighted heavily by task difficulty—we value profound, formalized edge cases over volume.
Easy (13–15 passes out of 15): 1 point
Medium (9–12 passes): 2 points
Hard (4–8 passes): 4 points
Very hard (0–3 passes): 8 points
💡 Note: Quality over quantity. A single "Very Hard" task (8 points) nets a higher score than four "Easy" tasks (4 points).
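The rubric above can be sketched as a small lookup from pass count to points. This is only an illustration, not the organizers' scoring code: the function names `points_for_passes` and `total_score` are hypothetical, and summing points across a participant's tasks is an assumption inferred from the note above.

```python
from typing import Iterable

RUNS_PER_TASK = 15  # each accepted task is run 15 times, per the event description


def points_for_passes(passes: int) -> int:
    """Map a task's pass count (out of 15) to rubric points.

    Hypothetical helper; thresholds are copied from the Scoring Rubric.
    """
    if not 0 <= passes <= RUNS_PER_TASK:
        raise ValueError(f"passes must be between 0 and {RUNS_PER_TASK}")
    if passes >= 13:  # Easy: 13-15 passes
        return 1
    if passes >= 9:   # Medium: 9-12 passes
        return 2
    if passes >= 4:   # Hard: 4-8 passes
        return 4
    return 8          # Very hard: 0-3 passes


def total_score(pass_counts: Iterable[int]) -> int:
    """Sum rubric points over several tasks (assumed aggregation rule)."""
    return sum(points_for_passes(p) for p in pass_counts)


if __name__ == "__main__":
    one_very_hard = total_score([2])             # one task the model passes 2/15 times
    four_easy = total_score([15, 14, 13, 15])    # four tasks the model almost always passes
    print(one_very_hard, four_easy)
```

Under these assumptions, the single very hard task outscores the four easy ones, which matches the "quality over quantity" note.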
Location
Virtual
(Exact joining details provided after registration approval.)
For detailed information about submission and evaluation, please visit this Notion page: https://www.notion.so/lilt/LILTBench-Hackathon-361c66a75a508039bf00c9303a85ed3b