

Mini-Hackathon: Build a Perception-First Agent
About Event
LLMs gave us reasoning. RAG gave us retrieval. Tool calling gave us action. What’s missing in the modern agent stack is perception: the ability to see, hear, and remember the world as it happens.
This workshop is a practical walkthrough of building a perception layer for agents using VideoDB. You’ll learn how to convert continuous media (screen, mic, camera, RTSP, files) into a structured context your agent can use:
Indexes (searchable understanding)
Events (real-time triggers)
Memory (episodic recall with playable evidence)
We’ll implement the core loop:
Continuous Media → Perception Layer (VideoDB) → Agent (reasoning + action) → Output grounded in evidence
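The loop above can be sketched with stub stand-ins for each stage. None of the names below are VideoDB APIs; in a real build the perception step is what VideoDB would do for you:

```python
from dataclasses import dataclass

@dataclass
class PerceptionEvent:
    start: float   # seconds into the stream
    end: float
    label: str     # what the perception layer saw or heard

def perceive(chunks):
    """Stub perception layer: turn raw media chunks into timestamped events.
    In a real build, this is where indexing would run."""
    return [PerceptionEvent(t, t + 1.0, label) for t, label in chunks]

def agent_answer(events, query):
    """Agent step: answer a question, grounded in playable evidence
    (each hit carries the timestamps needed to replay the moment)."""
    return [(e.start, e.end, e.label) for e in events
            if query.lower() in e.label.lower()]

# Pre-labelled chunks stand in for continuous media in this sketch.
stream = [(0.0, "error dialog on screen"),
          (4.0, "user opens terminal"),
          (9.0, "error dialog on screen")]
events = perceive(stream)
```

The point of the shape: the agent never touches raw pixels or audio, only structured, timestamped events, so every answer can point back to a playable moment.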
Who should attend:
Engineers building agents that need continuous and temporal awareness (not one-shot screenshots)
Research teams building in physical AI, desktop robots, and wearables
Product teams building meeting bots, desktop copilots, monitoring/ops, and QA/compliance
Founders building multimodal apps where “show me the moment” matters
What You’ll Discover:
What “perception” actually means for agents: continuous, temporal, multi-source, searchable, actionable.
How to support three input modes with one mental model: files, live streams, and desktop capture.
This is a build-first mini-hackathon: ship a working prototype where an agent is no longer blind. You’ll use VideoDB as the perception layer that sits between the transport layer and the agent logic, converting real-time streams into structured context. Video is no longer a file; it’s multimodal context.
Your prototype must do at least one of these well:
Real-time ingestion: a continuous stream of desktop screen, mic, and system audio.
Real-time events and alerts: events arriving as the world unfolds, not after processing finishes.
Episodic recall: the agent can answer “what happened” across time, with timestamps and playable moments.
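Episodic recall in particular reduces to querying a time-indexed memory. A minimal sketch, assuming memory is a list of (start, end, description) spans as a perception layer might emit them over a session (the data and helper below are hypothetical, not a VideoDB API):

```python
# Hypothetical episodic-memory store: (start_s, end_s, description) spans.
memory = [
    (12.0, 15.5, "build failed in terminal"),
    (40.0, 46.0, "teammate shared a dashboard on screen"),
    (61.0, 63.0, "build failed in terminal"),
]

def recall(memory, t0, t1):
    """Answer 'what happened between t0 and t1?' with playable spans:
    every remembered moment that overlaps the asked window."""
    return [m for m in memory if m[0] < t1 and m[1] > t0]
```

Because every memory carries its own timestamps, each answer doubles as evidence: the agent can hand back the exact spans to replay.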
Who should attend:
Individuals building monitoring agents, meeting/desktop agents, or multimodal copilots
Engineers who want a shippable demo in a few hours
Builders who care about outputs grounded in observable evidence
What You Can Build:
Real-Time Watcher Agent: Stream continuously, emit structured events, trigger Slack/webhooks when a condition hits.
Desktop Copilot with Awareness: Capture screen + mic, detect key moments, and generate actions grounded in what was seen and said.
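The watcher pattern can be sketched with the standard library alone: a condition checks each structured event, and a hit builds and posts a webhook payload. The payload shape and the clip_url field below are hypothetical, not a Slack or VideoDB contract:

```python
import json
from urllib import request

def make_alert(event):
    """Build a webhook payload for a detected moment. 'clip_url' is a
    hypothetical field; in a real build it would be a playable moment link."""
    return {"text": f"[{event['start']:.1f}s] {event['label']}",
            "clip_url": event.get("clip_url", "")}

def send_alert(webhook_url, event):
    """POST the alert as JSON to any webhook endpoint (network call)."""
    data = json.dumps(make_alert(event)).encode()
    req = request.Request(webhook_url, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return resp.status

def watch(events, condition, fire):
    """Watcher loop: fire only for events that satisfy the condition."""
    return [fire(e) for e in events if condition(e)]
```

In a live pipeline, `watch` would consume events as the stream produces them and call `send_alert` as the fire action; here the stages are split so the trigger logic is testable without a network.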
Refer to the docs below to check what’s possible with VideoDB:
Docs link: docs.videodb.io
VideoDB Skills: https://github.com/video-db/skills
Format:
Kickoff: perception stack + demo (15–20 min)
Build sprint: teams/solo (3–4 hrs)
Demos: 3 minutes each (30–45 min)
Winners + networking (30–45 min)
What we provide:
Starter kit + example pipelines (files/streams/desktop capture)
Quick patterns for Indexes, Events, Memory
On-site support to unblock teams
Winning Prize:
Up to INR 1L + $500 in VideoDB credits.