2,063 Went

NeurIPS Social: Evaluating Agentic Systems: Bridging Research Benchmarks and Real-World Impact

Hosted by Tatia Tsmindashvili, Rapael Kalandadze & Abigail Morgan
Past Event
About Event

Agentic AI systems - LLM-driven agents that plan, use tools, and act autonomously - are rapidly evolving, but evaluating them remains one of the hardest problems in AI. Traditional benchmarks fail to capture long-term reasoning, goal shifts, or real-world impact. This event brings together researchers and practitioners to discuss how we can meaningfully assess success, alignment, and reliability in autonomous agents.

Agenda


⚡ Lightning talks
🎤 Panel discussion
🍽️ Food & Networking


⚡ Lightning talks


Reducing Uncertainty in Evaluation for Better Open Language Models

Why do so many benchmarks fail while some don't? This talk presents a simple indicator, for evaluation during language model pretraining, of whether an experimental result will hold from small to large compute scales: the ratio of signal (a benchmark's ability to separate better models from worse ones) to noise (a benchmark's sensitivity to random variability between training steps). It then shares how evaluation methods based on signal and noise changed our experimental design across thousands of preliminary training runs, evaluated with 43 benchmarks, while developing the Olmo 3 suite of language models.

David Heineman, Allen AI - working to improve language model pre-training and evaluation
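
To make the signal-to-noise framing concrete, here is a minimal sketch (not the speaker's actual code) of how such a ratio might be computed for one benchmark, assuming you have final scores from several distinct models (signal: how well the benchmark separates them) and scores from consecutive late checkpoints of a single training run (noise: step-to-step jitter):

```python
import numpy as np

def benchmark_signal_to_noise(final_scores_by_model, checkpoint_scores_one_run):
    """Rough signal-to-noise estimate for a single benchmark (illustrative only).

    final_scores_by_model: scores of several distinct models on the benchmark
        (signal: how well the benchmark separates better models from worse ones).
    checkpoint_scores_one_run: scores of one model across consecutive late
        training checkpoints (noise: random variability between training steps).
    """
    signal = np.std(final_scores_by_model)     # dispersion across models
    noise = np.std(checkpoint_scores_one_run)  # jitter across checkpoints
    return signal / noise if noise > 0 else float("inf")

# A benchmark that separates models cleanly and is stable across checkpoints
# gets a high ratio; a flat or jittery benchmark gets a low one.
snr = benchmark_signal_to_noise(
    final_scores_by_model=[0.42, 0.51, 0.58, 0.66],
    checkpoint_scores_one_run=[0.57, 0.58, 0.585, 0.575, 0.58],
)
print(f"signal/noise = {snr:.1f}")
```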


Evaluating and Optimizing AI Agents with Harbor: Terminal-Bench and Beyond

This talk introduces Terminal-Bench, the Harbor framework, and Harbor Adapters, which enable simple, scalable evaluation of agents across diverse datasets and benchmarks. We'll share practical insights gained from building Terminal-Bench and Harbor (e.g., task quality audits, agent bottlenecks) and demonstrate how adapters help the community more easily evaluate, compare, and optimize agent performance.

Lin Shi, Terminal-Bench and Harbor Team, Adapters Team Lead - Datasets, Benchmarks, and Agent Evaluation.


Teaching AI Agents to Learn from Their Mistakes

This talk walks through a full learning loop for AI agents: simulating realistic user scenarios, evaluating behavior with rich metrics, and automatically optimizing both the configs and structure of the agent. I'll demo how the RELAI-SDK plugs this loop into your own stack so agents improve over time instead of relying on one-off prompt hacks.

Soheil Feizi, Founder & CEO, RELAI
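
As an illustration of the kind of simulate -> evaluate -> optimize loop the talk describes (this is not the RELAI-SDK API; every function and config knob below is a hypothetical stand-in), such a cycle might look like:

```python
import random

def simulate_user_scenarios(n=20):
    """Generate synthetic multi-turn user scenarios (hypothetical stub)."""
    return [f"scenario-{i}" for i in range(n)]

def run_agent(config, scenario):
    """Run the agent on one scenario (stub: treats 'persistence' as a success probability)."""
    return {"scenario": scenario, "success": random.random() < config["persistence"]}

def evaluate(results):
    """Aggregate rich metrics into a single score (here: plain success rate)."""
    return sum(r["success"] for r in results) / len(results)

def propose_config(best_config):
    """Mutate the current best config (stand-in for a real optimizer)."""
    delta = random.uniform(-0.1, 0.1)
    return {**best_config, "persistence": max(0.0, min(1.0, best_config["persistence"] + delta))}

best_config, best_score = {"persistence": 0.5}, 0.0
for iteration in range(10):                      # the learning loop
    scenarios = simulate_user_scenarios()        # 1. simulate realistic users
    candidate = propose_config(best_config)      # 2. propose a config change
    results = [run_agent(candidate, s) for s in scenarios]
    score = evaluate(results)                    # 3. evaluate behavior
    if score > best_score:                       # 4. keep only improvements
        best_config, best_score = candidate, score
print(best_config, best_score)
```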


Building Reliable Agentic Benchmarks: Insights from AssetOpsBench

This talk will focus on designing and evaluating agentic benchmarks with a strong emphasis on in-domain evaluation and real-world task reliability. Drawing from the development of AssetOpsBench, we’ll discuss practical considerations for measuring agent behavior, task completion quality, and decision robustness. The session will highlight what works, what doesn’t, and what matters most when building benchmarks for agent-based systems.

Dhaval Patel, IBM Research, IEEE, ACM - AI for Industrial Automation, Agentic Workflows, Reliability Analytics


Persona Consistency as an Evaluation Signal for Agentic Systems

This talk explores how consistent persona simulations can be used to evaluate and improve agentic systems. Building on recent work using multi-turn RL to reduce persona drift, it highlights how consistency serves as both a diagnostic tool and a way to train more stable, trustworthy simulated users.

Marwa Abdulhai, UC Berkeley, ex-MIT - AI Safety & Ethics, RL, Language Modelling
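
A rough sketch of what a persona-consistency signal could look like in practice (an illustration under assumptions, not the method from the talk): embed the persona description and each agent turn, then track how similarity drifts over the conversation. The placeholder embedding below would be swapped for a real sentence encoder.

```python
import numpy as np

def embed(text):
    """Placeholder embedding; replace with a real sentence encoder in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def persona_consistency(persona_description, agent_turns):
    """Mean cosine similarity between each agent turn and the persona description.

    A drop in similarity over later turns is one simple signal of persona drift.
    """
    p = embed(persona_description)
    sims = [float(np.dot(p, embed(turn))) for turn in agent_turns]
    return float(np.mean(sims)), sims

score, per_turn = persona_consistency(
    "A patient, formal customer-support agent for a bank.",
    ["Hello, how may I assist you today?",
     "Certainly, I can help with that transfer.",
     "lol idk just click stuff"],
)
print(f"overall consistency: {score:.2f}; per-turn: {[round(s, 2) for s in per_turn]}")
```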


🎤 Panel Discussion


🍽️ Food & Networking

Huge thanks to Comet Opik for the food - well-fed minds network better.

Sponsors / Friends

Location
San Diego Convention Center
111 Harbor Dr, San Diego, CA 92101, USA
Upper Level Ballroom 20AB