AI Interpretability Mini-Hackathon
Join us for a hands-on mini-hackathon focused on interpretability and truthfulness in language models!
We will work in small groups to replicate results from the paper "The Internal State of an LLM Knows When It's Lying" using a Google Colab notebook, which will be shared before the hackathon along with instructions and a tutorial. You'll train a probe on the internal activations of Gemma 3 270M, Google's lightweight open-weight LLM, to detect whether a statement is true or false, and compare your probe's performance to the model's explicit answers.
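To give a flavor of the task, here is a minimal sketch (not the official hackathon notebook) of training a linear probe on hidden-state activations. The Hugging Face model id, the layer choice, the toy statements, and the use of scikit-learn's LogisticRegression as the probe are all assumptions for illustration.

```python
# Minimal sketch, assuming the Hugging Face id "google/gemma-3-270m" and a
# logistic-regression probe on last-token activations; not the official notebook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_id = "google/gemma-3-270m"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)
model.eval()

def last_token_activation(statement: str, layer: int = -1):
    """Return the hidden state of the final token at the chosen layer."""
    inputs = tokenizer(statement, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[layer][0, -1].float().numpy()

# Toy labeled statements (1 = true, 0 = false); the real exercise uses a larger dataset.
statements = [
    ("Paris is the capital of France.", 1),
    ("The Moon is made of cheese.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Two plus two equals five.", 0),
]
X = [last_token_activation(text) for text, _ in statements]
y = [label for _, label in statements]

# Linear probe: logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("Training accuracy:", probe.score(X, y))
```

In the hackathon you would evaluate the probe on held-out statements and compare its accuracy against the model's own true/false answers when asked directly.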
We’ll also introduce several popular interpretability techniques and host a short brainstorming session for new research ideas 💡!
🏆 Prizes will be awarded to the team with the highest probe accuracy!
Catered dinner will be provided.