

Build, Compare, and Evaluate MCP-powered AI Agents
In this hands-on session, you’ll create an MCP server using Snowflake Managed MCP, build an AI agent prototype, and connect the agent to the server.
You’ll then evaluate the agent end-to-end with TruLens: analyze results, identify failure modes, and improve the prototype by refining tool calling and tool selection. Finally, you’ll compare the original and improved versions using TruLens traces and evaluation metrics.
Rather than focusing only on agent construction, the workshop highlights how data access, tool design, and observability shape agent performance. You’ll see how relatively small changes, especially in tool definitions, can lead to measurable improvements in tool selection and tool calling.
The session uses a concrete example: a health research agent, grounded in clinical trials and PubMed data available from the Snowflake Marketplace.
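In MCP, a tool's description travels with the tool itself, which is why refining tool definitions can change agent behavior so directly. As a rough illustration (not the lab's actual setup — the workshop uses Snowflake-managed servers, and `search_trials` is a hypothetical stub), here is a minimal server-side tool definition using the open-source MCP Python SDK's FastMCP helper:

```python
# Sketch only: the workshop uses Snowflake-managed MCP servers; this uses
# the open-source MCP Python SDK to show how a description travels with a tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("health-research")

@mcp.tool()
def search_trials(condition: str, phase: str | None = None) -> str:
    """Search clinical trial records for studies of a medical condition.

    Use this tool for questions about planned, ongoing, or completed
    clinical trials. `condition` is a disease or intervention name;
    `phase` optionally filters to a phase such as "Phase 3".
    """
    # Hypothetical stub; a real tool would query the underlying dataset.
    return "stub result"

if __name__ == "__main__":
    mcp.run()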
What you’ll learn:
How to build an AI agent backed by Snowflake-managed MCP servers
How agents discover and choose tools through MCP (see the discovery sketch after this list)
How to design tool descriptions that influence agent behavior
How to evaluate agent quality using structured metrics
How to compare agent versions using observability and traces
Why data grounding matters for reliable agents
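For context on the discovery step mentioned above: an MCP client asks the server to enumerate its tools, and the agent's model chooses among them based on each tool's name, description, and input schema. Here is a minimal client-side sketch using the open-source MCP Python SDK; the URL is a placeholder, not an actual Snowflake endpoint:

```python
# Minimal sketch of MCP tool discovery; the URL is a placeholder.
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client

SERVER_URL = "https://example.com/mcp/sse"  # hypothetical endpoint

async def main() -> None:
    async with sse_client(SERVER_URL) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            # The server advertises its tools; this is the raw material
            # the agent uses when selecting which tool to call.
            result = await session.list_tools()
            for tool in result.tools:
                print(tool.name, "-", tool.description)

asyncio.run(main())
```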
What we’ll do:
Build an initial agent version connected to Snowflake MCP
Evaluate its performance using TruLens metrics (sketched after this list)
Identify failure modes in tool selection and tool calling
Improve MCP tool definitions
Rebuild and re-evaluate a second agent version
Compare both versions side by side using their traces and evaluation data
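As a rough preview of the evaluation and comparison steps, here is a minimal sketch assuming the TruLens 1.x API; `run_agent` is a hypothetical stand-in for the agent built in the lab, and an OpenAI key is assumed for the LLM-as-judge provider:

```python
# Sketch assuming TruLens 1.x; `run_agent` is a hypothetical stand-in.
from trulens.apps.basic import TruBasicApp
from trulens.core import Feedback, TruSession
from trulens.providers.openai import OpenAI as OpenAIProvider

session = TruSession()       # local store for traces and evaluation results
provider = OpenAIProvider()  # LLM-as-judge backend (needs OPENAI_API_KEY)

# Feedback functions score each recorded call, e.g. answer relevance.
f_relevance = Feedback(provider.relevance).on_input_output()

def run_agent(query: str) -> str:
    # Hypothetical stub for the MCP-backed agent built in the lab.
    return f"stub answer for: {query}"

tru_agent = TruBasicApp(
    run_agent,
    app_name="health-research-agent",
    app_version="v1",  # bump to "v2" for the improved agent
    feedbacks=[f_relevance],
)

with tru_agent:
    tru_agent.app("What Phase 3 trials exist for GLP-1 agonists?")

# Aggregate scores per app version, for the side-by-side comparison.
print(session.get_leaderboard())
```

Running the same queries under "v1" and "v2" and comparing the leaderboard rows mirrors the side-by-side comparison in the last step above.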
By the end of the session, you will have a clear, practical understanding of how to build, evaluate, and iterate on agents, and of how observability makes agent development more structured and transparent.
Please come prepared with a fresh Python environment (such as Jupyter) to run the lab.
About the speaker:
Josh is a developer advocate for AI and Open Source at Snowflake, previously at TruEra (acquired by Snowflake). He is also a maintainer of TruLens, an open-source library for systematically tracking and evaluating LLM-based applications.
Josh regularly delivers tech talks and workshops at events including PyData, Devoxx, AI_Dev, AI DevWorld, AI Camp meetups, and more. He has also developed and taught courses on a variety of platforms, including Coursera, DeepLearning.ai, Udemy, and DataCamp, and has served as an advisor for Trustworthy Machine Learning at Stanford.
DataTalks.Club is the place to talk about data. Join our Slack community!
This post is sponsored by Snowflake.