Cornell AI Alignment
A community of students and researchers conducting research and outreach to mitigate risks from advanced AI systems.
11 Went

AI Interpretability Mini-Hackathon

Past Event
About Event

Join us for a hands-on mini-hackathon focused on interpretability and truthfulness in language models!

We will work in small groups to replicate results from The Internal State of an LLM Knows When It’s Lying using a Google Colab notebook, which will be provided with instructions and a tutorial before the hackathon starts. You’ll train a probe on the internal activations of Gemma 3 270M, Google’s lightweight open-weight LLM, to detect whether a statement is true or false, then compare your probe’s performance to the model’s explicit answers.
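
For a rough sense of what the exercise involves, here is a minimal sketch in Python. It assumes the Hugging Face model ID google/gemma-3-270m, a tiny hand-written set of labeled statements, and a simple logistic-regression probe on the last-token hidden state; the actual Colab notebook provided at the event may differ in all of these details.

```python
# Minimal sketch of a truthfulness probe on LLM activations (not the official notebook).
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-3-270m"  # assumed model ID for Gemma 3 270M
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, output_hidden_states=True)
model.eval()

# Toy labeled data; the real exercise would use a much larger true/false dataset.
statements = [
    ("The Eiffel Tower is in Paris.", 1),
    ("The Eiffel Tower is in Rome.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
]

def last_token_activation(text, layer=-1):
    """Hidden state of the final token at a chosen layer (default: last layer)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple: (embeddings, layer_1, ..., layer_N)
    return out.hidden_states[layer][0, -1, :].float().numpy()

X = [last_token_activation(text) for text, _ in statements]
y = [label for _, label in statements]

# Fit a linear probe on the activations and check how well it separates true from false.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Probe accuracy:", probe.score(X_test, y_test))
```

Comparing the probe's accuracy against the model's own answers when asked whether each statement is true is the core comparison from the paper.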

We’ll also introduce several popular interpretability techniques and host a short brainstorming session for new research ideas💡!

🏆 Prizes will be awarded to the team with the highest probe accuracy!

Catered dinner will be provided.

Location
Bowers Hall
Gerhart Dr, Cortland, NY 13045, USA
Room 324