Exosphere

Shipped an AI agent used by non-devs? Now how do we tell if it is actually working? 

The only people who can really judge the outputs are domain experts (lawyers, doctors, underwriters), who are often slow and always expensive.

So we hack something together to make it work: an LLM grading another LLM, or a spreadsheet of test cases that an in-house expert updates when they have time.

Let's find some answers. A small group of founding engineers, working through this together. Chatham House Rule.

Fresh Context Chai: The Domain Expert Eval Problem