

Running Durable Agents in Production
Agent demos often fail at the exact moment they become useful: the process crashes, a deployment interrupts a run, or an approval never arrives. This workshop shows how to turn working LLM agents into durable production workflows.
You will learn how to persist state outside the agent process, resume agents after failure, retry tool calls, pause safely for approvals, inspect every step of execution, and deploy real multi-agent apps to real infrastructure.
We’ll cover the following steps:
Understand why agent demos break in production - Explore the common failure modes of LLM agents, including lost state, brittle long-running tasks, unreliable tool calls, unclear recovery paths, and lack of operational visibility.
Design agents as durable workflows - Learn how to structure agentic systems so each step is stateful, observable, retryable, and recoverable instead of being trapped inside a single fragile process.
Build a production-oriented agent pattern - We’ll walk through an example agent that uses planning, tool execution, decision points, and external services while preserving execution history and state across failures.
Add reliability controls for real-world execution - See how retries, timeouts, compensation logic, human approval, and event-driven continuation can make agents safer and more dependable.
Handle human-in-the-loop requirements - Learn where human review, approval, escalation, or correction should be inserted into agent workflows without breaking the overall execution.
Observe, debug, and improve agent behavior - Understand what needs to be visible when an agent runs in production, including execution paths, intermediate decisions, failed steps, tool responses, and recovery attempts.
Leave with a reusable production blueprint - By the end, participants will have a practical mental model for building durable agents that can be adapted to RAG systems, workflow automation, customer operations, data tasks, and enterprise AI applications.
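To make "stateful, retryable, and recoverable" concrete, here is a minimal sketch of the idea rather than the workshop's actual implementation. It assumes state is kept in a local JSON file standing in for a server-side store, and every name in it (`load_state`, `save_state`, `run_step`, `demo`, the step names) is hypothetical. Each step's result and every attempt are persisted, so a restarted process skips completed steps instead of repeating them.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the persisted run state, or a fresh one if none exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"completed": {}, "history": []}


def save_state(path, state):
    """Write to a temp file and rename, so a crash mid-write cannot corrupt state."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_step(state, path, name, fn, max_attempts=3):
    """Run a named step with retries; skip it if an earlier run completed it."""
    if name in state["completed"]:
        return state["completed"][name]
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
        except Exception as exc:
            # Record the failed attempt so it is visible after the fact.
            state["history"].append(
                {"step": name, "attempt": attempt, "error": str(exc)}
            )
            save_state(path, state)
            if attempt == max_attempts:
                raise
        else:
            state["completed"][name] = result
            state["history"].append(
                {"step": name, "attempt": attempt, "status": "ok"}
            )
            save_state(path, state)
            return result


def demo():
    """Run two steps, then simulate a restart by reloading state from disk."""
    path = os.path.join(tempfile.mkdtemp(), "run.json")
    calls = {"n": 0}

    def flaky_tool():
        # Hypothetical tool that fails on its first call only.
        calls["n"] += 1
        if calls["n"] == 1:
            raise RuntimeError("transient failure")
        return "tool output"

    state = load_state(path)
    run_step(state, path, "plan", lambda: ["look up order"])
    run_step(state, path, "call_tool", flaky_tool)

    # A new process would start here: reload state and re-run the same step.
    resumed = load_state(path)
    result = run_step(resumed, path, "call_tool", flaky_tool)
    return calls["n"], result, resumed
```

Calling `demo()` runs the flaky tool twice in total (one failure, one success); the re-run after the simulated restart returns the stored result without calling the tool again. A production system would replace the JSON file with a workflow engine or database and add backoff between attempts.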
By the end of the workshop, attendees should be able to explain and implement the execution layer that separates a clever agent loop from a reliable agent service. In particular, they should be able to:
Recognize the production failure modes of in-process agent loops: process death, deploys, flaky tools, slow approvals, missing history, and distributed state.
Convert a basic tool-calling agent into a durable workflow with server-side state and per-step execution history.
Add a human approval step that can wait safely and resume without losing context.
Use execution traces to debug tool calls, LLM calls, timing, token usage, and failures.
Understand where durable agent execution fits alongside RAG, evaluation, monitoring, and capstone project expectations.
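The approval outcome above can be illustrated with a small sketch, assuming the run's state lives in a JSON file that stands in for a durable store. The function names (`start_run`, `record_decision`, `resume`) and the status values are hypothetical and do not come from any particular workflow engine. The point is that nothing stays in memory while waiting: the run is parked as data, and any later process can pick it up once a decision is recorded.

```python
import json
import os
import tempfile


def save(path, state):
    """Persist state atomically via a temp file and rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load(path):
    """Read the persisted run state."""
    with open(path) as f:
        return json.load(f)


def start_run(path, proposed_action):
    """Record the agent's proposed action and park the run for review."""
    save(path, {
        "status": "waiting_for_approval",
        "proposed_action": proposed_action,
        "history": ["proposed"],
    })


def record_decision(path, approved, reviewer):
    """Store the human decision; this can happen minutes or days later."""
    state = load(path)
    if state["status"] != "waiting_for_approval":
        raise ValueError("run is not waiting for approval")
    state["status"] = "approved" if approved else "rejected"
    state["history"].append(f"{state['status']} by {reviewer}")
    save(path, state)


def resume(path, execute):
    """Continue the run if approved; otherwise leave it untouched."""
    state = load(path)
    if state["status"] == "approved":
        state["result"] = execute(state["proposed_action"])
        state["status"] = "done"
        state["history"].append("executed")
        save(path, state)
    return state


def demo():
    """Show that resuming before approval is a no-op, and after it executes."""
    path = os.path.join(tempfile.mkdtemp(), "approval.json")
    start_run(path, "refund order 123")
    before = resume(path, lambda action: f"ran: {action}")
    record_decision(path, approved=True, reviewer="alice")
    after = resume(path, lambda action: f"ran: {action}")
    return before["status"], after
```

In `demo()`, the first `resume` call leaves the run in `waiting_for_approval`, and the second executes the stored action with its original context intact. A real deployment would typically trigger `resume` from an event or callback rather than by polling.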
About the Speaker:
Nicholas Lotz is an engineer focused on technical enablement, helping organizations use software to solve real business problems. His work centers on infrastructure automation and operational transparency, with an emphasis on removing technical and organizational barriers so teams can build and ship high-quality software.
DataTalks.Club is the place to talk about data. Join our Slack community!
This post is sponsored by Orkes. Thank you for supporting our community!