

CSC Workshop: Build Your Own Data Factory: AI Agents That Generate and Validate Data
Most of us use ChatGPT to generate text. But large language models can also produce structured, typed outputs—such as JSON with defined fields and constraints—making them far more powerful for building real systems.
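As a rough illustration of what "structured, typed output" can mean in practice, the sketch below parses a JSON string into a typed record and checks its field constraints. The `PatientRecord` schema, its fields and limits, and the `parse_record` helper are hypothetical and not taken from the workshop materials; the JSON string simply stands in for a model's reply.

```python
import json
from dataclasses import dataclass

# Hypothetical schema, for illustration only.
@dataclass
class PatientRecord:
    age: int
    blood_type: str

VALID_BLOOD_TYPES = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}

def parse_record(raw: str) -> PatientRecord:
    """Parse a JSON string (standing in for a model reply) and enforce constraints."""
    data = json.loads(raw)
    record = PatientRecord(age=data["age"], blood_type=data["blood_type"])
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(record.age, bool) or not isinstance(record.age, int):
        raise ValueError("age must be an integer")
    if not 0 <= record.age <= 120:
        raise ValueError("age must be between 0 and 120")
    if record.blood_type not in VALID_BLOOD_TYPES:
        raise ValueError("unknown blood type")
    return record

record = parse_record('{"age": 42, "blood_type": "O+"}')
print(record)
```

A reply that is valid JSON but breaks a constraint, such as an age of 400, raises `ValueError` instead of silently entering the dataset.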
In this workshop, participants build a simple two-agent pipeline: one agent generates synthetic data records, and another reviews them for quality. Along the way, we explore structured LLM outputs, generator–validator loops, and multi-agent design patterns that are quickly becoming core building blocks of production AI.
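One way to picture the generator–validator loop is the minimal sketch below. It makes no real LLM calls: `generator_agent` is a stand-in that returns random records, some deliberately invalid, and `validator_agent` applies simple rule checks where a real pipeline might use a second model. The `AidRequest` schema and all function names are assumptions made for illustration, not the workshop's actual code.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical schema, for illustration only.
@dataclass
class AidRequest:
    region: str
    people_affected: int
    need: str

REGIONS = ["north", "south", "coastal"]
NEEDS = ["water", "shelter", "medicine"]

def generator_agent(rng: random.Random) -> AidRequest:
    """Stand-in for an LLM call: returns a random record, sometimes an invalid one."""
    return AidRequest(
        region=rng.choice(REGIONS),
        people_affected=rng.randint(-50, 500),  # negative values are deliberate flaws
        need=rng.choice(NEEDS + [""]),          # an empty need is a deliberate flaw
    )

def validator_agent(record: AidRequest) -> List[str]:
    """Stand-in for a reviewing agent: returns a list of problems (empty means accept)."""
    problems = []
    if record.people_affected <= 0:
        problems.append("people_affected must be positive")
    if not record.need:
        problems.append("need must not be empty")
    return problems

def build_dataset(n: int, seed: int = 0, max_attempts: int = 1000) -> List[AidRequest]:
    """Keep generating until n records pass review or the attempt budget runs out."""
    rng = random.Random(seed)
    accepted: List[AidRequest] = []
    attempts = 0
    while len(accepted) < n and attempts < max_attempts:
        attempts += 1
        record = generator_agent(rng)
        if not validator_agent(record):
            accepted.append(record)
    return accepted

dataset = build_dataset(5)
print(len(dataset), "records accepted")
```

Keeping the two roles as separate functions means either one could later be swapped for a model call without changing the loop, and the attempt budget keeps a weak generator from looping forever.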
Why synthetic data? Realistic datasets are often paywalled, privacy-restricted, expensive to annotate, or unavailable in emerging domains. In fields like clinical AI, synthetic data offers a practical alternative, and in some cases models trained on it can even outperform models trained on real data.
We begin with templates from healthcare, civic tech, and humanitarian aid, then invite you to design your own schema for any domain.
Open to all; basic familiarity with Python and Jupyter or Colab is recommended.
About Shayan:
Shayan Chowdhury (Columbia '26, CS & Policy) has worked across medical AI research at Harvard Med, disaster relief coordination in 38+ countries through his nonprofit Reach4Help in partnership with the UN and Google, and COVID-19 data infrastructure for the Bangladesh government. The common thread is using data and AI to make systems work for people who usually don't get a seat at the table. He'll kick off with a 30-minute talk walking through his journey and how synthetic data and multi-agent systems show up in real research and production, before we get into hands-on coding. In his free time, he plays guitar and sings jazz-pop mashups of Frank Sinatra and The Weeknd that absolutely no one asked for.