

MLflow Community Meetup
Join us for the next MLflow Community Meetup on October 8 at 4PM PT! Ben Wilson, MLflow Maintainer, will dive deep into:
Building Smarter Evals with Trace-Aware, Feedback-Aligned Judges: Most evals only look at the final answer. MLflow's LLM judges can also consider retrieval hits, tool calls, and the other spans in a trace, so they can check whether an answer is actually backed by the steps that produced it. And rather than staying static, the judges themselves can be optimized with human-labeled traces, so over time they come to reflect your team's real standards.
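To give a flavor of what "trace-aware" looks like in code, here is a minimal sketch of a custom scorer that inspects the trace rather than just the final answer. It follows the mlflow.genai scorer and tracing APIs in recent MLflow 3 releases as we understand them (exact names and signatures may differ by version), and the predict_fn, sample data, and grounded_in_retrieval scorer are illustrative placeholders rather than the judges Ben will demo:

```python
import mlflow
from mlflow.entities import Feedback, SpanType
from mlflow.genai.scorers import scorer


@mlflow.trace
def predict_fn(question: str) -> str:
    # Stand-in for a real RAG app: emit a retriever span so the trace has
    # something for the scorer to inspect.
    with mlflow.start_span(name="retrieve", span_type=SpanType.RETRIEVER) as span:
        docs = ["MLflow Tracing records spans for retrieval and tool calls."]
        span.set_inputs({"query": question})
        span.set_outputs(docs)
    return docs[0]


@scorer
def grounded_in_retrieval(outputs, trace):
    """Trace-aware check: did the app retrieve anything, and does the
    final answer overlap with what was retrieved?"""
    retriever_spans = trace.search_spans(span_type=SpanType.RETRIEVER)
    if not retriever_spans:
        return Feedback(value=False, rationale="No retrieval step found in the trace.")

    retrieved_text = " ".join(str(s.outputs) for s in retriever_spans)
    # Naive string overlap as a placeholder; a real judge would ask an LLM
    # whether each claim in the answer is supported by the retrieved context.
    supported = any(
        token in retrieved_text for token in str(outputs).split() if len(token) > 6
    )
    rationale = (
        "Answer terms appear in the retrieved context."
        if supported
        else "Answer does not overlap with any retrieved context."
    )
    return Feedback(value=supported, rationale=rationale)


results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What does MLflow Tracing capture?"}}],
    predict_fn=predict_fn,
    scorers=[grounded_in_retrieval],
)
```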
Keeping Eval Datasets Relevant as Your App Changes: Evaluation datasets don't tend to age well: they get stale, scattered, or disconnected from the app you're actually shipping. MLflow's new evaluation datasets are built to grow and change with your app: you can add new records, merge in feedback, tag and search across versions, and always know which dataset was used for which eval run. Just as importantly, you can plug those datasets straight into your evaluation harness, so every build, model version, or prompt tweak is benchmarked against the same versioned source of truth, making comparisons fair and regressions obvious.
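As a rough preview of that workflow (the dataset helpers and parameter names follow recent MLflow 3 releases and may vary with your MLflow version or tracking backend; the dataset name and records below are made up), creating a dataset, merging in new records, and wiring it into an eval run looks something like this:

```python
import mlflow
from mlflow.genai.datasets import create_dataset

# Create a named evaluation dataset in the tracking server. (Parameter names
# here follow recent MLflow 3 releases and may differ in your setup.)
dataset = create_dataset(name="support-bot-eval", tags={"domain": "billing"})

# Grow the dataset over time: merge_records upserts new cases, e.g. examples
# distilled from production traces or human feedback.
dataset.merge_records(
    [
        {
            "inputs": {"question": "How do I update my payment method?"},
            "expectations": {"expected_facts": ["Settings > Billing"]},
        },
    ]
)

# Benchmark every build against the same versioned source of truth; the eval
# run records which dataset it was evaluated against.
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=predict_fn,            # the app under test (see sketch above)
    scorers=[grounded_in_retrieval],  # reuse the trace-aware scorer above
)
```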
Bring your questions about dataset management, evaluation workflows, or how best to contribute to MLflow OSS development!