Třetí Evals.cz meetup

Name: Třetí Evals.cz meetup
Start: 2026-06-17T17:30:00.000+02:00
End: 2026-06-17T21:30:00.000+02:00
Location: Make

Hosted by Veronika Peskova & 3 others

Make

Hlavní město Praha, Czechia

Past Event

About Event

Model benchmarks are everywhere. But when you're shipping AI in a product, you care about something different: does it work for you, with your data, for your users?

evals.cz is a Prague meetup for people building AI products who need to measure them.

What to expect

3 short talks + discussion
Practical, real-world experience
No sales pitches
Mostly in English

Topics

RAG quality · using public benchmarks · human-in-the-loop · metrics and methodology

Who it's for

ML/AI engineers, backend developers integrating LLMs, data scientists, product managers, AI researchers — and curious beginners too.

Talks and Speakers

🚀 Evaluating AI Agents in Production: Lessons from a Million Runs

Duvo.ai runs autonomous agents that handle back-office operations work — over a million production runs to date, most of them with no human watching the outcome. To measure quality, we evaluate every completed run with an LLM-as-judge against a structured catalog of failure rubrics. Martin Kostelansky walks through how that evaluation works, what the resulting failure data shows, how it differs from common assumptions, and the limits of the judge itself.

🎧 Multi-modal evals: let's talk about audio

Most evals practitioners evaluate textual outputs, but in a world where you can talk to your laptop and it talks right back at you, why would you stop there? (And, indeed, can you?) @Veronika Kral talks you through the setup and the travails it brings.

👾 Choose Your Ground Truth: A Field Guide to Synthetic Data in Evals

Sometimes it's useful to evaluate against ground truth that is known by construction. But how can you make the synthetic data... not suck? Or, put more operationally, make it complex enough that you can rely on synthetic testing? @Simon Podhajsky (me!) will talk you through the actual process and show off a few examples.

Location

Make

Menclova 2538/2, 180 00 Praha 8-Palmovka, Czechia

109 Went
