AI Evals w/ Mike Merrill: Terminal-Bench, a benchmark for AI agents in terminal environments
When: Thursday, Oct 9, 11 am PT
Where: Zoom (link created by alphaXiv); the recording will later be uploaded to the alphaXiv YouTube channel
Zoom link: https://stanford.zoom.us/j/95904059062?pwd=0ErKmwUCab6qBSNls8oUhmeF1pzeIo.1&from=addon
About Event
🔬 AI Evals on alphaXiv
🗓 Thursday, October 9th, 2025 · 11 AM PT
🎙 Featuring Mike Merrill
💬 Casual Talk + Open Discussion
Terminal-Bench: A benchmark for AI agents in terminal environments
We are excited to have Mike Merrill join us to discuss his work on Terminal-Bench, a widely used benchmark for evaluating agents in terminal environments. He will also present his broader work and perspectives on evaluations. Mike is a Postdoctoral Researcher in Computer Science at Stanford, working with Ludwig Schmidt on empirical evaluations of reasoning LLMs.
Hosted by: alphaXiv x Vals AI
AI Evals: join the community
