

From Prototype to Production: The Hidden Engineering of AI Inference
Every developer can spin up an LLM app in an afternoon. Almost no one can run it efficiently in production. This talk unpacks the gap between "it works on my laptop" and "it handles 10,000 requests per day without exploding your GPU bill." We go under the hood on the actual mechanics of modern AI inference (KV cache pressure, continuous batching, quantization tradeoffs, and GPU utilization math) and show how each translates directly into cost and latency decisions. Every concept is grounded in real benchmarks and code that attendees can run against a live API the same day.
Key Highlights
Why KV cache memory, not compute, is the real bottleneck in LLM serving (back-of-envelope sizing sketch after this list)
Continuous batching vs. static batching: the optimization that changed production inference (toy scheduler sketch below)
Quantization tradeoffs: FP16, INT8, AWQ, GPTQ, and when each makes sense
Reading model FLOPs utilization (MFU) and what it actually means for your cloud bill (worked calculation below)
Live benchmark walkthrough: 3.7x throughput, 5.1x faster inference, 30% lower cost
Drop-in OpenAI-compatible code patterns for serverless and dedicated endpoints (minimal client sketch below)
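To ground the first highlight, here is a back-of-envelope KV cache sizing sketch in Python. It assumes a Llama-2-7B-style configuration (32 layers, 32 attention heads, head dim 128, FP16); the GPU size and weight footprint are illustrative assumptions, not figures from the talk's benchmarks.

```python
# Back-of-envelope KV cache sizing, assuming a Llama-2-7B-style config.
# Illustrative numbers only, not the talk's measured results.

N_LAYERS = 32
N_KV_HEADS = 32        # Llama-2-7B uses full multi-head attention (no GQA)
HEAD_DIM = 128
BYTES_FP16 = 2

# Each token stores one K and one V vector per layer.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")        # ~512 KiB

seq_len = 4096
kv_bytes_per_seq = kv_bytes_per_token * seq_len
print(f"KV cache per 4k sequence: {kv_bytes_per_seq / 2**30:.1f} GiB")   # ~2 GiB

# On a hypothetical 80 GB GPU, ~14 GB of FP16 weights leaves ~66 GB for cache:
gpu_mem_gib = 80
weights_gib = 14
max_concurrent = (gpu_mem_gib - weights_gib) * 2**30 // kv_bytes_per_seq
print(f"Max concurrent 4k sequences: {max_concurrent}")                  # ~33
```

Half a megabyte per token means one 4k-token conversation pins roughly 2 GiB of GPU memory, which is why the cache, not the matmuls, caps concurrency.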
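The second highlight is easiest to see as a toy simulation: with continuous batching, the scheduler admits and retires requests at every decode step instead of waiting for an entire static batch to drain. This sketch is our own illustration, not any serving engine's actual scheduler.

```python
import random
from collections import deque

# Toy continuous-batching scheduler: requests join and leave the running
# batch at token granularity, so short requests never wait on long ones.
# Purely illustrative; real engines add paged KV memory, preemption, etc.

MAX_BATCH = 4
random.seed(0)

# (request_id, tokens_left_to_generate)
waiting = deque((i, random.randint(2, 8)) for i in range(8))
running = {}
step = 0

while waiting or running:
    # Admit new requests whenever a slot frees up (the key difference from
    # static batching, which only refills once the whole batch drains).
    while waiting and len(running) < MAX_BATCH:
        req_id, tokens = waiting.popleft()
        running[req_id] = tokens

    # One decode step: every running request emits one token.
    for req_id in list(running):
        running[req_id] -= 1
        if running[req_id] == 0:
            del running[req_id]          # evict immediately, freeing a slot
            print(f"step {step:2d}: request {req_id} finished")
    step += 1

print(f"All requests done in {step} decode steps")
```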
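For the MFU highlight, the standard back-of-envelope uses roughly 2 × parameter-count FLOPs per generated token (matmul work only, attention ignored). Every input below (model size, throughput, GPU peak) is a placeholder assumption, not a measured result.

```python
# Rough MFU (model FLOPs utilization) estimate for LLM decoding, using the
# common ~2 * n_params FLOPs-per-token approximation. Hypothetical inputs.

n_params = 7e9                 # 7B-parameter model
tokens_per_second = 2500       # assumed aggregate decode throughput
peak_tflops = 312              # A100 FP16 dense peak, TFLOP/s

achieved_tflops = 2 * n_params * tokens_per_second / 1e12
mfu = achieved_tflops / peak_tflops
print(f"Achieved: {achieved_tflops:.1f} TFLOP/s, MFU: {mfu:.1%}")
# -> Achieved: 35.0 TFLOP/s, MFU: 11.2%
# Low MFU is common for memory-bound decoding, which is why reading
# utilization correctly matters for the cloud bill.
```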
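Finally, the "drop-in" pattern from the last highlight, sketched with the official openai Python SDK (v1+) pointed at a custom endpoint. The base URL, API key, and model name are placeholders to replace with your provider's values.

```python
# Minimal OpenAI-compatible client pattern. The SDK calls are standard
# (openai>=1.0); the endpoint and model id below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.example.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

# Streaming keeps time-to-first-token low on both serverless and
# dedicated endpoints.
stream = client.chat.completions.create(
    model="your-model-name",   # placeholder model id
    messages=[{"role": "user", "content": "Explain KV cache in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

Because the endpoint speaks the OpenAI wire protocol, swapping providers is a one-line base_url change rather than a rewrite.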
Speaker
Roan Weigert is a Developer Relations engineer at GMI Cloud, where he works at the intersection of AI infrastructure and the developer community. He helps engineers navigate the gap between model development and production deployment, with a focus on LLM inference performance, GPU cloud economics, and hands-on technical enablement. At GMI Cloud, he builds the tools, content, and community programs that help AI teams ship faster on NVIDIA-powered infrastructure.
Please join DataPhoenix Slack and follow us on LinkedIn and YouTube to stay updated on our community events and the latest AI and data news.