
Assessing Harm in AI Agents: What Questionnaires Miss (Max Hellrigel-Holderbaum, FAU Erlangen-Nürnberg)

Past Event
About Event

This is the second event in the UK AI Forum's Artificial Agency speaker series. It will include a talk, followed by a discussion and Q&A, with time for networking afterwards!

Abstract: As AI systems advance in their capabilities, it is quickly becoming paramount to measure their safety and alignment with human values. A fast-growing field of AI research is devoted to developing such evaluation methods. However, most current advances in this domain are of doubtful quality. Standard methods typically prompt large language models (LLMs) in a questionnaire style to describe their values or how they would behave in hypothetical scenarios. Because current assessments focus on unaugmented LLMs, they fall short in evaluating AI agents, which are expected to pose the greatest risks. The space of an LLM's responses to questionnaire-style prompts is extremely narrow compared to the space of inputs, possible actions, and continuous interactions of AI agents with their environment in realistic scenarios, and hence unlikely to be representative of it. We further contend that such assessments make strong, unfounded assumptions about the ability and tendency of LLMs to report accurately on their counterfactual behavior. This makes them inadequate for assessing risks from AI systems in real-world contexts, as they lack the necessary ecological validity. We then argue that a structurally identical issue holds for current approaches to AI alignment: these are similarly applied only to LLMs rather than AI agents, and hence neglect the most critical risk scenarios. Lastly, we discuss how to improve both safety assessments and alignment training by taking these shortcomings to heart while satisfying practical constraints.

Location
Old Divinity School, St John's College
University of Cambridge, St John's St, Cambridge CB2 1TP, UK
To get to the Lightfoot Room, head up the stairs to the right of the main entrance to the Old Divinity School. There will be signage to help you on your way!