

Lang Gao & Jinghui Zhang - The Cylindrical Representation Hypothesis for Language Model Steering
Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions.
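As a rough illustration of the geometry described above, the following minimal sketch splits one sample's difference vector into a component along a concept axis and a residual in the plane normal to that axis. It rests on assumptions not stated in the abstract: the axis is estimated as the normalized difference of class means, the hidden states are made-up three-dimensional toy values, and the helper names (`concept_axis`, `split`) are hypothetical. It is not the authors' implementation, and it does not attempt to locate sensitive sectors, which the abstract says cannot be identified from difference vectors.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def concept_axis(absent, present):
    """Unit vector along the difference of class means.

    Assumption: difference-of-means is used here as a stand-in axis estimate.
    """
    diff = [p - a for a, p in zip(mean(absent), mean(present))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]

def split(vec, axis):
    """Return (axial coordinate, residual vector orthogonal to the axis)."""
    t = dot(vec, axis)
    normal = [v - t * a for v, a in zip(vec, axis)]
    return t, normal

# Toy hidden states (illustrative values only).
absent = [[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
present = [[2.0, 1.0, 1.0], [2.0, -1.0, -1.0]]

axis = concept_axis(absent, present)

# Difference vector for the first absent/present pair.
sample_diff = [p - a for a, p in zip(absent[0], present[0])]
t, normal = split(sample_diff, axis)

print("axis:", axis)
print("axial coordinate:", t)
print("normal-plane component:", normal)
```

In this toy setup the axial coordinate measures movement along the shared concept direction, while the normal-plane component is the sample-specific part that the difference of means averages away.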
Lang Gao is a first-year PhD student in Natural Language Processing at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), supervised by Prof. Xiuying Chen and Prof. Preslav Nakov, and concurrently an NLP Algorithm Intern at ByteDance. His research sits at the intersection of mechanistic interpretability and LLM trustworthiness. On the interpretability side, he investigates the geometric structure of latent spaces in foundation models to understand how models encode and process information. On the trustworthiness side, he uses interpretability techniques to study jailbreak attacks, social biases, and failure modes in machine-generated text detection. Prior to his PhD, he gained research experience at institutions including Cambridge, UC Santa Cruz, the University of Notre Dame, and MBZUAI. His work has appeared at venues including ACL, ICML, ICLR, and EMNLP.
Jinghui Zhang is a first-year PhD student in Natural Language Processing at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), supervised by Prof. Xiuying Chen and Prof. Kentaro Inui. He received his Bachelor’s degree from Shandong University, where he was advised by Prof. Pengwei Wang. His research interests lie in personalization and interpretability in NLP. On the personalization side, he studies user-centered language modeling and personalized text generation, aiming to build NLP systems that better capture individual preferences and behavioral patterns. On the interpretability side, he investigates the internal mechanisms of large language models to better understand how models encode, process, and generate information. More broadly, he is also interested in trustworthiness and efficiency. His work has appeared at venues including ACM MM, ICML, CVPR, and other leading conferences and journals.