TY - GEN
T1 - SafeSwitch
T2 - 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
AU - Han, Peixuan
AU - Qian, Cheng
AU - Chen, Xiusi
AU - Zhang, Yuji
AU - Ji, Heng
AU - Zhang, Denghui
N1 - Publisher Copyright:
©2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. Existing safety mechanisms, while improving model safety, often lead to overly cautious behavior and fail to fully leverage LLMs’ internal cognitive processes. Inspired by humans’ reflective thinking capability, we first show that LLMs can similarly perform internal assessments about safety in their internal states. Building on this insight, we propose SafeSwitch, a dynamic framework that regulates unsafe outputs by utilizing the prober-based internal state monitor that actively detects harmful intentions, and activates a safety head that leads to safer and more conservative responses only when necessary. SafeSwitch reduces harmful outputs by approximately 80% on harmful queries while maintaining strong utility, reaching a Pareto optimal among several methods. Our method is also advantageous over traditional methods in offering more informative, context-aware refusals, and achieves these benefits while only tuning less than 6% of the original parameters. SafeSwitch demonstrates large language models’ capacity for self-awareness and reflection regarding safety, offering a promising approach to more nuanced and effective safety controls. Codes for this work are available at https://github.com/Hanpx20/SafeSwitch.
AB - Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. Existing safety mechanisms, while improving model safety, often lead to overly cautious behavior and fail to fully leverage LLMs’ internal cognitive processes. Inspired by humans’ reflective thinking capability, we first show that LLMs can similarly perform internal assessments about safety in their internal states. Building on this insight, we propose SafeSwitch, a dynamic framework that regulates unsafe outputs by utilizing the prober-based internal state monitor that actively detects harmful intentions, and activates a safety head that leads to safer and more conservative responses only when necessary. SafeSwitch reduces harmful outputs by approximately 80% on harmful queries while maintaining strong utility, reaching a Pareto optimal among several methods. Our method is also advantageous over traditional methods in offering more informative, context-aware refusals, and achieves these benefits while only tuning less than 6% of the original parameters. SafeSwitch demonstrates large language models’ capacity for self-awareness and reflection regarding safety, offering a promising approach to more nuanced and effective safety controls. Codes for this work are available at https://github.com/Hanpx20/SafeSwitch.
UR - https://www.scopus.com/pages/publications/105028956915
UR - https://www.scopus.com/pages/publications/105028956915#tab=citedBy
U2 - 10.18653/v1/2025.findings-emnlp.366
DO - 10.18653/v1/2025.findings-emnlp.366
M3 - Conference contribution
AN - SCOPUS:105028956915
T3 - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
SP - 6936
EP - 6955
BT - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
A2 - Christodoulopoulos, Christos
A2 - Chakraborty, Tanmoy
A2 - Rose, Carolyn
A2 - Peng, Violet
Y2 - 4 November 2025 through 9 November 2025
ER -