

ai that works: Understanding Latency
ai that works
A weekly conversation about how we can all get the most juice out of today's models, with @vaibcode & @dexhorthy
https://github.com/ai-that-works/ai-that-works

This episode is all about latency. LLM APIs are getting faster, but they're still too slow, so how do we keep users from twiddling their thumbs? The answer shouldn't be "LLMs will eventually get faster."
We'll talk about:
why time-to-first-token is not time-to-useful-content
why streaming partially-complete JSON is hard from an engineering perspective (see the sketch after this list)
balancing perceived performance and actual utility with semantic streaming
designing experiences that keep users engaged during longer operations
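To make the streaming problem concrete, here is a minimal Python sketch (our illustration, not from the episode; the chunk values are made up) of why a naive consumer can't do much with partially-streamed JSON: json.loads only succeeds once the payload is complete, so nothing is usable until the final chunk arrives.

```python
# Minimal sketch of the partial-JSON problem (illustrative chunks, not real API output).
import json

chunks = ['{"title": "Unders', 'tanding Latency", ', '"topics": ["ttft", "streaming"]}']

buffer = ""
for chunk in chunks:
    buffer += chunk
    try:
        # Naive approach: re-parse the whole buffer on every chunk.
        parsed = json.loads(buffer)
        print("usable:", parsed)  # only succeeds once the final chunk lands
    except json.JSONDecodeError:
        print("still waiting; buffered", len(buffer), "chars")
```

Semantic streaming is one way around this: instead of waiting for the whole object, emit each field or list item as soon as it is semantically complete, so the UI can render useful content before the full response finishes. How tooling like BAML approaches this is part of what the episode digs into.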
βPre-reading
To avoid repeating the basics, we recommend coming in with some familiarity with the tooling we'll be using:
Discord
Cursor or VS Code
Programming languages
Application Logic: Python, TypeScript, or Go
Prompting: BAML (recommended video)
Meet the Speakers
βββMeet Vaibhav Gupta, one of the creators of BAML and YC alum. He spent 10 years in AI performance optimization at places like Google, Microsoft, and D. E. Shaw. He loves diving deep and chatting about anything related to Gen AI and Computer Vision!Β
βMeet Dex Horthy, founder at HumanLayer and coiner of the term Context Engineering. He spent 10+ years building devops tools at Replicated, Sprout Social and JPL. DevOps junkie turned AI Engineer.