DataTalks.Club is a global online community of people who love data.

Running Durable Agents in Production

YouTube
About Event

Agent demos often fail at the exact moment they become useful, because of process crashes, interrupted deployments, or approvals that never trigger. This workshop shows how to turn working LLM agents into durable production workflows. You will learn how to persist state outside the agent process, resume agents after failure, retry tool calls, pause safely for approvals, inspect every step of execution, and deploy real multi-agent apps to real infrastructure.

We’ll cover the following steps:

● Understand why agent demos break in production - Explore the common failure modes of LLM agents, including lost state, brittle long-running tasks, unreliable tool calls, unclear recovery paths, and lack of operational visibility.

● Design agents as durable workflows - Learn how to structure agentic systems so each step is stateful, observable, retryable, and recoverable instead of being trapped inside a single fragile process.

● Build a production-oriented agent pattern - We’ll walk through an example agent that uses planning, tool execution, decision points, and external services while preserving execution history and state across failures.

● Add reliability controls for real-world execution - See how retries, timeouts, compensation logic, human approval, and event-driven continuation can make agents safer and more dependable.

● Handle human-in-the-loop requirements - Learn where human review, approval, escalation, or correction should be inserted into agent workflows without breaking the overall execution.

● Observe, debug, and improve agent behavior - Understand what needs to be visible when an agent runs in production, including execution paths, intermediate decisions, failed steps, tool responses, and recovery attempts.

● Leave with a reusable production blueprint - By the end, participants will have a practical mental model for building durable agents that can be adapted to RAG systems, workflow automation, customer operations, data tasks, and enterprise AI applications.
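To make the idea of stateful, retryable, recoverable steps concrete, here is a minimal sketch in plain Python. It is not the workshop's code or any vendor's API: `DurableRun`, its `step` method, the SQLite table layout, and the `plan` / `flaky_search` functions are all hypothetical names chosen for illustration. The sketch assumes step results are JSON-serializable and that re-running a failed step is safe.

```python
import json
import sqlite3
import time


class DurableRun:
    """Hypothetical helper: persists each finished step outside the process."""

    def __init__(self, db_path, run_id):
        self.run_id = run_id
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "run_id TEXT, name TEXT, result TEXT, attempts INTEGER, "
            "PRIMARY KEY (run_id, name))"
        )
        self.db.commit()

    def step(self, name, fn, max_attempts=3, backoff_s=0.0):
        # If an earlier process already finished this step, reuse its result.
        row = self.db.execute(
            "SELECT result FROM steps WHERE run_id = ? AND name = ?",
            (self.run_id, name),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])

        last_error = None
        for attempt in range(1, max_attempts + 1):
            try:
                result = fn()
            except Exception as exc:  # retry on any failure (an assumption)
                last_error = exc
                time.sleep(backoff_s * attempt)
                continue
            # Commit the result before returning so a crash cannot lose it.
            self.db.execute(
                "INSERT INTO steps VALUES (?, ?, ?, ?)",
                (self.run_id, name, json.dumps(result), attempt),
            )
            self.db.commit()
            return result
        raise RuntimeError(
            f"step {name!r} failed after {max_attempts} attempts"
        ) from last_error

    def history(self):
        # Per-step execution history: step name and attempts it took.
        rows = self.db.execute(
            "SELECT name, attempts FROM steps WHERE run_id = ? ORDER BY rowid",
            (self.run_id,),
        )
        return [(name, attempts) for name, attempts in rows]


def demo(db_path):
    calls = {"plan": 0, "search": 0}

    def plan():
        calls["plan"] += 1
        return ["search"]

    def flaky_search():
        calls["search"] += 1
        if calls["search"] == 1:
            raise TimeoutError("simulated tool timeout")
        return {"hits": 2}

    run = DurableRun(db_path, "run-1")
    run.step("plan", plan)
    run.step("search", flaky_search)  # first attempt fails, second succeeds

    # A fresh object stands in for a restarted process: nothing is re-run.
    resumed = DurableRun(db_path, "run-1")
    resumed.step("plan", plan)
    resumed.step("search", flaky_search)
    return calls, resumed.history()
```

Calling `demo` with a path to a scratch database file runs the flaky tool twice in the first "process" and not at all after the simulated restart. A production workflow engine would add timeouts, compensation logic, and concurrency control that this sketch leaves out.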

By the end of the workshop, attendees should be able to explain and implement the execution layer that separates a clever agent loop from a reliable agent service.

● Recognize the production failure modes of in-process agent loops: process death, deploys, flaky tools, slow approvals, missing history, and distributed state.

● Convert a basic tool-calling agent into a durable workflow with server-side state and per-step execution history.

● Add a human approval step that can wait safely and resume without losing context.

● Use execution traces to debug tool calls, LLM calls, timing, token usage, and failures.

● Understand where durable agent execution fits alongside RAG, evaluation, monitoring, and capstone project expectations.
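As a rough illustration of an approval step that waits safely and resumes with its context intact, the sketch below keeps the whole run as a JSON-serializable dictionary with a per-step trace. Everything in it is an assumption made for the example: the `draft` / `approval` / `send` step names, the status strings, and the `new_run`, `record`, and `advance` helpers are hypothetical, and a real system would store the state in a database or workflow engine rather than a local string.

```python
import json
import time

STEPS = ["draft", "approval", "send"]  # hypothetical fixed plan


def new_run(request):
    return {"request": request, "status": "RUNNING", "cursor": 0, "trace": []}


def record(state, step, **details):
    # Append one trace entry per executed (or paused) step.
    state["trace"].append({"step": step, "at": time.time(), **details})


def advance(state, tools):
    """Run steps until the workflow finishes or must wait for a human."""
    while state["status"] == "RUNNING" and state["cursor"] < len(STEPS):
        step = STEPS[state["cursor"]]
        if step == "approval":
            decision = state.get("decision")
            if decision is None:
                # No decision yet: stop here; the caller persists the state.
                state["status"] = "WAITING_FOR_APPROVAL"
                record(state, step, outcome="paused")
                return state
            record(state, step, outcome=decision)
            if decision != "approved":
                state["status"] = "REJECTED"
                return state
        else:
            started = time.perf_counter()
            state[step] = tools[step](state)
            record(state, step, outcome="ok",
                   seconds=time.perf_counter() - started)
        state["cursor"] += 1
    if state["status"] == "RUNNING":
        state["status"] = "COMPLETED"
    return state


def demo():
    tools = {
        "draft": lambda s: f"Refund for {s['request']}",
        "send": lambda s: f"sent: {s['draft']}",
    }
    paused = advance(new_run("order 42"), tools)
    saved = json.dumps(paused)  # stored outside the process while waiting

    # Later, possibly in another process: load, attach the decision, resume.
    restored = json.loads(saved)
    restored["decision"] = "approved"
    restored["status"] = "RUNNING"
    final = advance(restored, tools)
    steps = [entry["step"] for entry in final["trace"]]
    return paused["status"], final["status"], steps, final["send"]
```

The trace here records only step names, outcomes, and timings. Token usage and raw tool responses could be added as extra fields on each entry in the same way.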

About the Speaker:
Nicholas Lotz is a DevSecOps Engineer and technical enablement specialist dedicated to removing the organizational barriers that keep engineers from shipping great software. Currently a Technical Marketing Engineer at Voxel51 and a freelance DevSecOps consultant, Nick has built a career at the intersection of infrastructure automation and product education, including impactful roles at GitLab and Harness.

He is the author of the second edition of Automating DevOps with GitLab Pipelines and is a recognized expert in Kubernetes, Terraform, and CI/CD modernization. With a unique academic interest in applying control theory to digital networks, he focuses on building transparent, secure software stacks that solve real-world business problems.

This post is sponsored by Orkes. Thank you for supporting our community!

DataTalks.Club is the place to talk about data. Join our Slack community!
