

Testing Self-Evaluation Bias of LLMs
Past Event
About Event
When building and testing AI agents, one practical question that arises is whether to use the same model for both the agent’s reasoning and the evaluation of its outputs. Keeping the model consistent may simplify the setup and reduce costs, but it also raises concerns about bias, over-familiarity, and inflated scores.
To better understand these trade-offs, we ran an experiment comparing how scores differ when the agent's own model also acts as the judge versus when a separate model handles the evaluation.
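As a minimal sketch of the kind of setup involved (not the exact protocol used in the experiment), the snippet below generates an answer with an "agent" model and then asks both the same model and a different model to score it. The model names, prompt wording, and 1-10 scale are illustrative assumptions; it uses the OpenAI Python client and expects an API key in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(model: str, task: str) -> str:
    """Ask the 'agent' model to produce an answer for the task."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content

def judge_answer(judge_model: str, task: str, answer: str) -> str:
    """Ask a judge model to rate the answer on a 1-10 scale."""
    prompt = (
        f"Task: {task}\n\nAnswer: {answer}\n\n"
        "Rate the answer's quality from 1 to 10 and briefly justify the score."
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

task = "Summarize the trade-offs of using one LLM as both agent and evaluator."
agent_model = "gpt-4o-mini"  # hypothetical agent model

answer = generate_answer(agent_model, task)

# Same-model ("self") evaluation vs. evaluation by a different judge model.
self_eval = judge_answer(agent_model, task, answer)
cross_eval = judge_answer("gpt-4o", task, answer)  # hypothetical second model

print("Self-evaluation:", self_eval)
print("Cross-evaluation:", cross_eval)
```

Running a comparison like this over many tasks is one way to surface whether the self-judging setup systematically inflates scores relative to an independent judge.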
Join us to see the results and our take on the implications.