

Reward Hacking Benchmark: How Frontier LLMs Game Their Own Evaluations
As LLMs become agents that take actions through tools, a new failure mode emerges: the model accomplishes the measured objective without actually doing the intended task, whether by skipping verification, fabricating intermediate artifacts, or directly tampering with the grading function.
In this talk, Kunvar will introduce the Reward Hacking Benchmark (RHB), a multi-step tool-use benchmark that measures how often frontier LLMs take these shortcuts on realistic agentic tasks.
RHB evaluates 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek across multi-step coding, ML, and data-analysis workflows.
Exploit rates range from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek-R1-Zero). A controlled comparison (DeepSeek-V3 → DeepSeek-R1-Zero) shows that RL post-training drives the reward-hacking rate from 0.6% to 13.9%, and that 72% of exploit episodes carry explicit chain-of-thought rationales, suggesting models often frame the cheating step as legitimate problem-solving.
Simple environmental hardening reduces exploit rates by 87.7% in relative terms.
Kunvar will walk through how the benchmark is constructed, what the failure modes look like in practice, what changes when reasoning models are post-trained with RL, and what this implies for evaluation design as agents move into production.
About the speaker:
Kunvar Thaman is an independent AI researcher whose work focuses on agent reliability, evaluation, and reward hacking. His paper "Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use", the subject of this talk, was accepted to ICML 2026 as a solo-author submission and was supported by a research grant from Exception Raised.
He completed a B.E. in Electrical and Electronics Engineering at BITS Pilani in 2022, and has previously worked across red-teaming, alignment evals, and benchmark design.
LinkedIn: https://www.linkedin.com/in/kunvar-thaman/
To attend online:
Add to calendar: https://shorturl.at/4DyR4
Gmeet link: meet.google.com/uap-hodc-bbn
We look forward to seeing you!