Hands-on Workshop: Give your AI Agents Eyes and Ears
LLMs gave us reasoning. RAG gave us retrieval. Tool calling gave us action. What’s missing in the modern agent stack is perception: the ability to see, hear, and remember the world as it happens.
This workshop is a practical walkthrough of building a perception layer for agents using VideoDB. You’ll learn how to convert continuous media (screen, mic, camera, RTSP, files) into a structured context your agent can use:
Indexes (searchable understanding)
Events (real-time triggers)
Memory (episodic recall with playable evidence)
We’ll implement the core loop:
Continuous Media → Perception Layer (VideoDB) → Agent (reasoning + action) → Output grounded in evidence
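To make the loop concrete, here is a minimal sketch in plain Python. All names (`Moment`, `PerceptionLayer`, `agent_answer`) are hypothetical stand-ins, not VideoDB's API: the perception layer ingests timestamped slices of media, the index makes them searchable, and the agent answers by citing a specific, playable moment rather than an ungrounded claim.

```python
from dataclasses import dataclass, field

# Hypothetical minimal types for the perception loop; a real system
# (e.g. VideoDB) provides richer multimodal indexing and search.

@dataclass
class Moment:
    """A timestamped slice of media the agent can cite as evidence."""
    start: float     # seconds into the stream
    end: float
    transcript: str  # what was seen/heard, as text

@dataclass
class PerceptionLayer:
    """Turns continuous media into searchable, timestamped context."""
    index: list[Moment] = field(default_factory=list)

    def ingest(self, moment: Moment) -> None:
        # "Indexes": accumulate searchable understanding as media arrives.
        self.index.append(moment)

    def search(self, query: str) -> list[Moment]:
        # Naive keyword match stands in for semantic/multimodal search.
        q = query.lower()
        return [m for m in self.index if q in m.transcript.lower()]

def agent_answer(perception: PerceptionLayer, question: str) -> str:
    """Agent step: retrieve moments and ground the answer in evidence."""
    hits = perception.search(question)
    if not hits:
        return "No matching moment found."
    m = hits[0]
    return f'Found it at {m.start:.0f}s-{m.end:.0f}s: "{m.transcript}"'

# Usage: feed the layer continuously, then ask a grounded question.
p = PerceptionLayer()
p.ingest(Moment(12.0, 18.0, "the deploy failed with a timeout"))
p.ingest(Moment(95.0, 101.0, "restarting the worker fixed it"))
print(agent_answer(p, "deploy"))
```

The key design point is that every answer carries timestamps, so “show me the moment” is always one seek away.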
Who Should Attend:
Engineers building agents that need continuous and temporal awareness (not one-shot screenshots).
Research teams building in physical AI, desktop robots, and wearables.
Product teams building meeting bots, desktop copilots, monitoring/ops, and QA/compliance.
Founders building multimodal apps where “show me the moment” matters.
What You’ll Discover:
What “perception” actually means for agents: continuous, temporal, multi-source, searchable, actionable.
How to support three input modes with one mental model: files, live streams, desktop capture.
How to build searchable memory so your agent can retrieve results with playable evidence, not vibes.
How to move from batch video AI to real-time event streams your agent can react to immediately.
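The last point above can be sketched as a subscription pattern. The names here (`Event`, `EventBus`) are illustrative assumptions, not a real API: instead of polling finished files, the perception layer emits typed events as media arrives, and the agent registers handlers that fire the moment a trigger occurs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Event:
    kind: str         # e.g. "person_entered", "keyword_spoken" (illustrative)
    timestamp: float  # seconds into the live stream
    detail: str

class EventBus:
    """Routes real-time perception events to agent handlers."""
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[Event], None]]] = {}

    def on(self, kind: str, handler: Callable[[Event], None]) -> None:
        # Agent subscribes to the triggers it cares about.
        self._handlers.setdefault(kind, []).append(handler)

    def emit(self, event: Event) -> None:
        # Perception layer pushes events as the stream is processed.
        for handler in self._handlers.get(event.kind, []):
            handler(event)

# Usage: the agent reacts immediately, not after a batch job completes.
bus = EventBus()
alerts: list[str] = []
bus.on("keyword_spoken", lambda e: alerts.append(f"{e.timestamp:.0f}s: {e.detail}"))
bus.emit(Event("keyword_spoken", 42.0, "action item: ship the demo"))
print(alerts)
```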
Plus:
A starter template you can reuse: “Index + Events + Memory” as the default perception stack.
Networking with builders working on agents + multimodal infra.