

Featured in Toronto
Workshop: Steering AI Models through Mechanistic Interpretability
About Event
AI models are often described as black boxes, but that is changing. Labs like Anthropic publish research that pushes toward a 'glass box', in which humans can observe what is happening inside a model. This workshop covers the specifics of that research, with introductions and demos on:
Reverse engineering neural networks (mechanistic interpretability)
How Anthropic and OpenAI mapped the 'features' of their models using sparse autoencoders
Steering models in real time (how Anthropic and OpenAI steer their models, demonstrated on a 3B-parameter model)
Approaches from the scientific literature will be explained in depth, with the goal of making the material accessible to a general audience.