AI Interpretability Mini-Hackathon
Join us for a hands-on mini-hackathon focused on interpretability and truthfulness in language models!
We will work in small groups to replicate results from the paper "The Internal State of an LLM Knows When It's Lying" using a Google Colab notebook, which will be shared before the hackathon along with instructions and a tutorial. You'll train a probe on the internal activations of Gemma 3 270M, Google's lightweight open-weight LLM, to detect whether a statement is true or false, and compare your probe's performance to the model's explicit answers.
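To give a flavor of the task, here is a minimal sketch (not the official hackathon notebook) of training a linear probe on hidden-state activations. The Hugging Face model id, the layer choice, the toy statements, and the use of scikit-learn's LogisticRegression as the probe are all assumptions for illustration.

```python
# Minimal sketch, assuming the Hugging Face id "google/gemma-3-270m" and a
# logistic-regression probe on last-token activations; not the official notebook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_id = "google/gemma-3-270m"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)
model.eval()

def last_token_activation(statement: str, layer: int = -1):
    """Return the hidden state of the final token at the chosen layer."""
    inputs = tokenizer(statement, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[layer][0, -1].float().numpy()

# Toy labeled statements (1 = true, 0 = false); the real exercise uses a larger dataset.
statements = [
    ("Paris is the capital of France.", 1),
    ("The Moon is made of cheese.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Two plus two equals five.", 0),
]
X = [last_token_activation(text) for text, _ in statements]
y = [label for _, label in statements]

# Linear probe: logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("Training accuracy:", probe.score(X, y))
```

In the hackathon you would evaluate the probe on held-out statements and compare its accuracy against the model's own true/false answers when asked directly.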
We’ll also introduce several popular interpretability techniques and host a short brainstorming session for new research ideas 💡!
🏆 Prizes will be awarded to the team with the highest probe accuracy!
Catered dinner will be provided.