NeurIPS Social: Evaluating Agentic Systems: Bridging Research Benchmarks and Real-World Impact
Agentic AI systems - LLM-driven agents that plan, use tools, and act autonomously - are rapidly evolving, but evaluating them remains one of the hardest problems in AI. Traditional benchmarks fail to capture long-term reasoning, goal shifts, or real-world impact. This event brings together researchers and practitioners to discuss how we can meaningfully assess success, alignment, and reliability in autonomous agents.
Agenda
⚡ Lightning talks
🎤 Panel discussion
🍽️ Food & Networking
⚡ Lightning talks
Reducing Uncertainty in Evaluation for Better Open Language Models
Why do so many benchmarks fail, and why do some succeed? This talk presents a simple indicator of whether an experimental result in language model pretraining will hold from small to large compute scales: the ratio of signal, a benchmark's ability to separate better models from worse ones, to noise, a benchmark's sensitivity to random variability between training steps. The talk will then share how better evaluation methods built on signal and noise changed our experimental design across thousands of preliminary training runs evaluated with 43 benchmarks while developing the OLMo 3 suite of language models.
David Heineman, Allen AI - working to improve language model pretraining and evaluation
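For context, here is a minimal sketch of how a signal-to-noise ratio like the one described above might be computed; the specific definitions of signal and noise below are illustrative assumptions, not the talk's actual implementation.

    # Illustrative sketch only (assumed formulation, not the speaker's code):
    # "signal" = how far apart the benchmark places different models,
    # "noise"  = how much one model's score wobbles across nearby checkpoints.
    import numpy as np

    def signal_to_noise(scores_across_models, scores_across_checkpoints):
        signal = max(scores_across_models) - min(scores_across_models)  # spread between models
        noise = np.std(scores_across_checkpoints)                       # step-to-step variability
        return signal / noise

    # A benchmark that separates models widely and stays stable across training
    # steps gets a high ratio, suggesting small-scale results are more likely
    # to hold at larger compute scales.
    print(signal_to_noise([0.42, 0.47, 0.55, 0.61], [0.548, 0.552, 0.550, 0.551]))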
Evaluating and Optimizing AI Agents with Harbor: Terminal-Bench and Beyond
This talk introduces Terminal-Bench, the Harbor framework, and Harbor Adapters, which enable simple, scalable evaluation of agents across diverse datasets and benchmarks. We'll share practical insights gained from building Terminal-Bench and Harbor (e.g., task quality audits, agent bottlenecks) and demonstrate how adapters help the community more easily evaluate, compare, and optimize agent performance.
Lin Shi, Terminal-Bench and Harbor Team, Adapters Team Lead - Datasets, Benchmarks, and Agent Evaluation.
Teaching AI Agents to Learn from Their Mistakes
This talk walks through a full learning loop for AI agents: simulating realistic user scenarios, evaluating behavior with rich metrics, and automatically optimizing both the configuration and the structure of the agent. I'll demo how the RELAI-SDK plugs this loop into your own stack so agents improve over time instead of relying on one-off prompt hacks.
Soheil Feizi, Founder & CEO, RELAI
Building Reliable Agentic Benchmarks: Insights from AssetOpsBench
This talk will focus on designing and evaluating agentic benchmarks with a strong emphasis on in-domain evaluation and real-world task reliability. Drawing from the development of AssetOpsBench, we’ll discuss practical considerations for measuring agent behavior, task completion quality, and decision robustness. The session will highlight what works, what doesn’t, and what matters most when building benchmarks for agent-based systems.
Dhaval Patel, IBM Research, IEEE, ACM - AI for Industrial Automation, Agentic Workflows, Reliability Analytics
Persona Consistency as an Evaluation Signal for Agentic Systems
This talk explores how consistent persona simulations can be used to evaluate and improve agentic systems. Building on recent work using multi-turn RL to reduce persona drift, it highlights how consistency serves as both a diagnostic tool and a way to train more stable, trustworthy simulated users.
Marwa Abdulhai, UC Berkeley, ex-MIT - AI Safety & Ethics, RL, Language modelling
🎤 Panel Discussion
Ofir Press, Princeton University, SWE-bench
Yu Su, OSU NLP Group, MMMU
Polina Kirichenko, AI at Meta, AbstentionBench
Shannon Sands, NousResearch
🍽️ Food & Networking
Huge thanks to Comet Opik for the food - well-fed minds network better.