
Challenges in AI Safety

Hosted by NICE AI Talk
Past Event
About Event

Welcome to NICE Talk 123! (This talk is in Chinese.)

This talk is about safety in AI agents.

YouTube livestream: https://www.youtube.com/watch?v=TBuk1qXsrPM

Abstract:

To address and mitigate the potential risks of frontier AI models and agents, the current industry practice is to first conduct safety evaluations of models and then perform "safety alignment" on human preference data, for example via RLHF. However, these methods have not truly resolved AI risks: on one hand, the underlying mechanisms of safety alignment methods remain insufficiently understood; on the other, methods like RLHF merely teach the model to refuse sensitive questions without actually removing the sensitive knowledge from the model itself.

This presentation will introduce the latest explorations and results in "intrinsic safety," tracing the path from safety evaluation to intrinsic safety, including dynamically discovering vulnerabilities in frontier AI models and agents, AI internal alignment, interpretability of useful mechanisms, and more.

Speaker: Dongrui Liu

Young Scientist at the Trustworthy and Safe AI Center, Shanghai AI Laboratory. He received his PhD from Shanghai Jiao Tong University and has long been engaged in trustworthy and safe AI research, including large-model interpretability, adversarial robustness, alignment, and evaluation. He has published over 20 papers at conferences and in journals including NeurIPS, ICLR, CVPR, AAAI, ACL, EMNLP, T-ITS, and TCSVT, and his honors include a CVPR 2024 Best Paper nomination, an ACL 2025 Outstanding Paper Award, and Outstanding PhD Graduate of Shanghai Jiao Tong University.

Host: Boyang Xue

Fourth-year PhD student at The Chinese University of Hong Kong, supervised by Professor Wong Kam-Fai. His research interests include trustworthy large language models/agents, dialogue systems, Bayesian learning, and speech recognition. He has conducted academic exchanges at ETH Zurich and University College London, and has published multiple first-author papers in top AI conferences and journals including ACL, EMNLP, and TASLP. He also serves as a reviewer for related conferences and journals. His current research focuses primarily on improving the reliability of large language models.
