

BaseThesis x smallest.ai Teardown: Building Conversational Voice Agents That Remember YOU
The Physics of Real-Time Conversation
A hands-on investigation into multi-session memory, persistent context, state management, and how production voice agents maintain understanding across days and weeks.
Current conversational memory systems fail because they record comprehensively, not intelligently.
When someone says "I'm from Bangalore," most systems store: {"location": "Bangalore"}.
But a great salesman stores startup ecosystem context, infrastructure challenges, and tech talent density, then surfaces it naturally three conversations later.
The difference is comprehension vs recording.
Recording is cheap; comprehension requires deciding what information means for future interactions.
This Saturday, we're investigating how to build memory that demonstrates understanding, not just recall.
The core questions: when a user returns for their third conversation with your agent, what should it remember from Sessions 1 and 2? And how do you architect memory to make that possible without drowning in context?
We're bringing Akshat Mandloi, Co-Founder and CTO at smallest.ai, to tear down these constraints with you and work through the question: what should an agent remember about a human?
The constraint is a <100 ms retrieval budget that forces architectural discipline. You cannot:
Store full transcripts + semantic search at query time
Retrieve 50 documents and hope the LLM finds the relevant bits
Defer "what matters?" decisions to inference time
You must decide at storage time: What did this person reveal? What will matter three sessions from now? How should context surface naturally?
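The storage-time decision above can be sketched as a minimal extraction pass. Everything here is illustrative: the `MemoryFact` schema, the `extract_facts` function, and the salience score are hypothetical names of our own, not Smallest's API, and in production the enrichment step would be an LLM pass run once at session end, not a keyword check.

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemoryFact:
    subject: str          # what the fact is about
    content: str          # the enriched interpretation, not the raw quote
    source_quote: str     # what the user actually said
    salience: float       # how likely this matters in future sessions (0-1)
    session_id: int
    stored_at: float = field(default_factory=time.time)

def extract_facts(utterance: str, session_id: int) -> list[MemoryFact]:
    """Storage-time comprehension: turn a raw utterance into enriched facts.

    Hard-codes the "I'm from Bangalore" example from the writeup; a real
    system would replace this branch with a single extraction-model pass.
    """
    facts = []
    if "bangalore" in utterance.lower():
        facts.append(MemoryFact(
            subject="location",
            content=("Based in Bangalore: startup ecosystem context, "
                     "infrastructure challenges, dense tech talent pool"),
            source_quote=utterance,
            salience=0.8,
            session_id=session_id,
        ))
    return facts

facts = extract_facts("I'm from Bangalore", session_id=1)
print(facts[0].content)
```

The point of the sketch: the expensive "what does this mean?" work happens once, at write time, so the read path three sessions later is cheap.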
Questions We're Pursuing:
On Memory Architecture: What causes multi-session degradation? Model capacity? Context limits? Orchestration patterns? What gets stored after a session ends? How do you decide what's relevant to retain vs discard? Where does intelligence live: storage, retrieval, or inference?
On Latency That Compounds: You built 100ms TTS. Why does turn 10 feel slower than turn 1? Where does latency hide across cascaded components? How do you debug "the agent feels slow" complaints?
On Retrieval Strategy: How do you know what's relevant before retrieving? Semantic search? Structured queries? Hybrid? How do you avoid overwhelming the context with everything? At what point does retrieval become your latency bottleneck? How do you handle token limits with multi-session history?
On Context Synthesis: How do you merge multi-session context without hitting token limits? What's the balance between verbatim recall and summarized understanding? How do you handle conflicts (user says X in Session 1, Y in Session 3)?
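One minimal way to sketch conflict handling and token budgeting: keep one fact per subject, let the latest session win, and trim the merged context to a budget. The `synthesize` function and its `(subject, content, session_id)` tuple format are hypothetical, and a word count stands in for a real tokenizer.

```python
def synthesize(sessions: list[list[tuple[str, str, int]]],
               token_budget: int = 200) -> str:
    """Merge facts from multiple sessions into one bounded context string."""
    latest: dict[str, tuple[str, int]] = {}
    for session in sessions:
        for subject, content, session_id in session:
            # Conflict rule: a later session overrides an earlier one.
            if subject not in latest or session_id >= latest[subject][1]:
                latest[subject] = (content, session_id)

    lines, used = [], 0
    for subject, (content, _) in latest.items():
        cost = len(content.split())          # crude token proxy
        if used + cost > token_budget:
            break                            # trim rather than overflow
        lines.append(f"{subject}: {content}")
        used += cost
    return "\n".join(lines)

session1 = [("employer", "works at an early-stage fintech", 1)]
session3 = [("employer", "moved to a mid-size bank", 3)]
print(synthesize([session1, session3]))   # → employer: moved to a mid-size bank
```

Recency-wins is only one policy; a production system might instead keep both facts and record the change ("switched jobs between Sessions 1 and 3") as its own piece of understanding.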
Format
Part 1. Investigation (30 mins): Akshat walks through what actually breaks in production. We whiteboard latency breakdowns and run live comparisons to expose where models, orchestration, and memory management interact.
Part 2. Memory Deep Dive (30 mins): What actually breaks when agents need to remember across sessions? What should be stored after a session ends? We whiteboard storage strategies, retrieval approaches, and synthesis patterns (full transcripts, structured extractions, vector embeddings, hybrid). A live demo probes "what makes memory intelligent?" by showing an agent with Session 1 context, then Session 2 referencing it.
Part 3. Build Session (2 hours)
Challenge: Build a conversational agent that maintains context across multiple sessions.
You'll use Smallest's model as the base layer. Your job is to figure out the orchestration and persistent memory architecture that maintains coherence:
Build retrieval strategy (what from Session 1 matters in Session 2?)
Manage context synthesis (merge multiple sessions without token overflow)
Keep it fast (<100 ms TTFB response time, including retrieval)
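A sketch of what a budget-conscious hot path might look like: facts are indexed by subject at storage time, so turn-time retrieval is a dictionary lookup plus a timing guard, with no embedding call or vector scan on the critical path. `MemoryStore` is a hypothetical illustration, not Smallest's SDK.

```python
import time

class MemoryStore:
    """Keyed memory: pay indexing cost at write time, keep reads O(1)."""

    def __init__(self, budget_ms: float = 100.0):
        self.by_subject: dict[str, str] = {}
        self.budget_ms = budget_ms

    def put(self, subject: str, content: str) -> None:
        self.by_subject[subject] = content   # paid once, at storage time

    def retrieve(self, subjects: list[str]) -> list[str]:
        start = time.perf_counter()
        hits = [self.by_subject[s] for s in subjects if s in self.by_subject]
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms > self.budget_ms:
            # In production you'd degrade gracefully (return fewer facts),
            # not raise -- raising here just makes the budget visible.
            raise RuntimeError(f"retrieval blew the budget: {elapsed_ms:.1f} ms")
        return hits

store = MemoryStore()
store.put("location", "Based in Bangalore; startup ecosystem context")
print(store.retrieve(["location", "employer"]))
```

The design choice this illustrates: you decide which subjects matter before the turn starts, so the retrieval question at inference time is "fetch these keys," not "search everything."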
Part 4. Retro (30 mins): Honest synthesis. What worked? What felt intelligent vs mechanical? Where did <100 ms force interesting decisions? What does Smallest solve? What still needs solving? What would you need to ship this in production?
Who This Is For
Ideally for builders who
Have shipped in production
Have hit real constraints
Think architecturally about tradeoffs (you know you can't optimize everything simultaneously)
Question surface mechanisms ("just use RAG" isn't satisfying without understanding what makes memory intelligent)
Care about production reality over benchmarks
You don't need Smallest or voice AI experience; production experience and curiosity about how multi-session memory actually works will suffice.
If you've never wondered why your voice agent doesn't remember the user in the next session, this probably isn't for you.
Or if "just store everything in a vector database" feels like enough, this isn't for you.
What to bring:
Laptop with development environment
Curiosity about constraints
Production war stories
Pre-work (if accepted):
Set up Smallest API access
Review documentation we'll share
Prepare your questions
Top questions submitted will be shared with Akshat beforehand and form part of the investigation.
What You'll Gain
Mental models, not tool training: a working prototype you built in 2 hours, direct feedback from the Smallest and BaseThesis teams on your approach, and a network of other builders solving similar problems.
Architectural clarity: understanding where intelligence lives in memory systems (at storage time, retrieval time, or inference time), and what's lazy vs optimal, and when.
Practical outcomes: understanding where and why multi-session conversations degrade, latency debugging strategies across cascaded components, and hands-on experience with real constraints.
About the speaker
Akshat Mandloi is the Co-Founder and CTO at smallest.ai, a voice-AI startup focused on ultra-low-latency multilingual speech models. He previously worked as a senior data scientist at Bosch developing deep-learning systems for autonomous vehicles, perception systems, and electric powertrains.
At Smallest, he has built and led Lightning TTS (100 ms generation for 10 seconds of audio), Electron SLM (a 4B-parameter model optimized for voice, with 53 ms TTFB), and production deployments serving millions of call minutes per month in industries like banking and finance.
Plus: Attendees get priority access to apply for BaseThesis's February hackathon including cloud credits, TTS model access, and compute credits so you can immediately apply what you learn to build production systems.
Note: Tea, Coffee, Snacks and Light Lunch Provided
About BaseThesis
We're an AI infrastructure lab investigating what becomes possible when constraints change.
What we're building
ConnectionOS, a voice based system that captures intent through conversation. Understanding someone's actual goals (not just stated preferences) requires persistent memory across sessions separated by days and months.
Why we run these teardowns
To understand production constraints from teams building at the edge.
We did this with DJ Biswas (Groq) in December. We investigate rather than learn APIs, because we're building systems ourselves.
Our community: 300+ AI builders, mostly focused on voice, real-time systems, and human-AI interaction.
Space is intentionally limited to ensure focus. If you're building production AI and multi session coherence matters to your work, apply now.