NeurIPS Social: Evaluating Agentic Systems: Bridging Research Benchmarks and Real-World Impact
Agentic AI systems - LLM-driven agents that plan, use tools, and act autonomously - are rapidly evolving, but evaluating them remains one of the hardest problems in AI. Traditional benchmarks fail to capture long-term reasoning, goal shifts, or real-world impact. This event brings together researchers and practitioners to discuss how we can meaningfully assess success, alignment, and reliability in autonomous agents.
Agenda
⚡ Lightning talks
🎤 Panel discussion
🍽️ Food & Networking
⚡ Lightning talks
Reducing Uncertainty in Evaluation for Better Open Language Models
Why do so many benchmarks fail, and why do some succeed? This talk presents a simple indicator of whether an experimental result in language model pretraining will hold from small to large compute scales: the ratio of signal, a benchmark's ability to separate better models from worse ones, to noise, a benchmark's sensitivity to random variability between training steps. The talk will then share how better evaluation methods built on signal and noise changed our experimental design across thousands of preliminary training runs evaluated with 43 benchmarks while developing the OLMo 3 suite of language models.
David Heineman, Allen AI - working to improve language model pretraining and evaluation
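For context, here is a minimal sketch of how a signal-to-noise ratio like the one described above might be computed; the specific definitions of signal and noise below are illustrative assumptions, not the talk's actual implementation.

    # Illustrative sketch only (assumed formulation, not the speaker's code):
    # "signal" = how far apart the benchmark places different models,
    # "noise"  = how much one model's score wobbles across nearby checkpoints.
    import numpy as np

    def signal_to_noise(scores_across_models, scores_across_checkpoints):
        signal = max(scores_across_models) - min(scores_across_models)  # spread between models
        noise = np.std(scores_across_checkpoints)                       # step-to-step variability
        return signal / noise

    # A benchmark that separates models widely and stays stable across training
    # steps gets a high ratio, suggesting small-scale results are more likely
    # to hold at larger compute scales.
    print(signal_to_noise([0.42, 0.47, 0.55, 0.61], [0.548, 0.552, 0.550, 0.551]))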
Evaluating and Optimizing AI Agents with Harbor: Terminal-Bench and Beyond
This talk introduces Terminal-Bench, the Harbor framework, and Harbor Adapters, which enable simple, scalable evaluation of agents across diverse datasets and benchmarks. We'll share practical insights gained from building Terminal-Bench and Harbor (e.g., task quality audits, agent bottlenecks) and demonstrate how adapters help the community more easily evaluate, compare, and optimize agent performance.
Lin Shi, Terminal-Bench and Harbor Team, Adapters Team Lead - Datasets, Benchmarks, and Agent Evaluation.
Teaching AI Agents to Learn from Their Mistakes
This talk walks through a full learning loop for AI agents: simulating realistic user scenarios, evaluating behavior with rich metrics, and automatically optimizing both the configuration and the structure of the agent. I'll demo how the RELAI-SDK plugs this loop into your own stack so agents improve over time instead of relying on one-off prompt hacks.
Soheil Feizi, Founder & CEO, RELAI
Building Reliable Agentic Benchmarks: Insights from AssetOpsBench
This talk will focus on designing and evaluating agentic benchmarks with a strong emphasis on in-domain evaluation and real-world task reliability. Drawing from the development of AssetOpsBench, we’ll discuss practical considerations for measuring agent behavior, task completion quality, and decision robustness. The session will highlight what works, what doesn’t, and what matters most when building benchmarks for agent-based systems.
Dhaval Patel, IBM Research, IEEE, ACM - AI for Industrial Automation, Agentic Workflows, Reliability Analytics
Persona Consistency as an Evaluation Signal for Agentic Systems
This talk explores how consistent persona simulations can be used to evaluate and improve agentic systems. Building on recent work using multi-turn RL to reduce persona drift, it highlights how consistency serves as both a diagnostic tool and a way to train more stable, trustworthy simulated users.
Marwa Abdulhai, UC Berkeley, ex-MIT - AI Safety & Ethics, RL, Language modelling
🎤 Panel Discussion
Ofir Press, Princeton University, SWE-bench
Yu Su, OSU NLP Group, MMMU
Polina Kirichenko, AI at Meta, AbstentionBench
Shannon Sands, NousResearch
🍽️ Food & Networking
Huge thanks to Comet Opik for the food - well-fed minds network better.