Detection and Defense Against Backdoor Attacks in Large Language Models Based on Repeated-Word Analysis
-
Abstract
Backdoor attacks pose significant threats to the security and reliability of large language models (LLMs). Existing detection approaches often struggle to accurately identify complex and stealthy triggers, especially in large and diverse datasets, leaving gaps in defense effectiveness. This paper proposes a novel approach that detects and defends against such attacks by analyzing repeated words in input data. By identifying words that recur frequently across malicious inputs, the proposed approach locates backdoor triggers and mitigates their impact on LLMs. The method combines semantic clustering with recursive optimization to improve detection precision while minimizing disruption to benign outputs. Experimental results on a real-world movie review dataset demonstrate the accuracy, robustness, and efficiency of the approach in detecting backdoor attacks and strengthening model security.
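To make the detection idea concrete, the following is a minimal Python sketch of a repeated-word screening step, based only on the high-level description above: it flags words whose frequency in a set of suspect inputs is disproportionately high relative to a clean reference corpus and strips them before inference. The function names, thresholds, and example corpora are hypothetical illustrations, not the paper's implementation, which additionally involves semantic clustering and recursive optimization.

```python
# Illustrative sketch of repeated-word trigger screening (hypothetical names
# and thresholds; not the paper's exact method).
from collections import Counter
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def candidate_triggers(suspect_inputs, reference_inputs,
                       ratio_threshold=5.0, min_count=3):
    """Return words whose relative frequency in suspect inputs greatly
    exceeds their frequency in a clean reference corpus."""
    suspect_counts = Counter(w for t in suspect_inputs for w in tokenize(t))
    reference_counts = Counter(w for t in reference_inputs for w in tokenize(t))
    suspect_total = max(sum(suspect_counts.values()), 1)
    reference_total = max(sum(reference_counts.values()), 1)

    flagged = []
    for word, count in suspect_counts.items():
        if count < min_count:
            continue
        suspect_freq = count / suspect_total
        # Add-one smoothing so words absent from the reference corpus
        # do not cause division by zero.
        reference_freq = (reference_counts.get(word, 0) + 1) / (reference_total + 1)
        if suspect_freq / reference_freq >= ratio_threshold:
            flagged.append(word)
    return flagged


def sanitize(text: str, triggers) -> str:
    """Drop flagged trigger words from an input before passing it to the model."""
    trigger_set = set(triggers)
    return " ".join(w for w in text.split()
                    if w.lower().strip(".,!?") not in trigger_set)


if __name__ == "__main__":
    suspect = ["this movie cf was cf terrible cf", "cf the plot cf made no sense"]
    clean = ["a moving and well acted drama", "the plot drags but the cast is strong"]
    triggers = candidate_triggers(suspect, clean)
    print(triggers)                        # e.g. ['cf']
    print(sanitize(suspect[0], triggers))  # trigger token removed before inference
```

The frequency-ratio heuristic with add-one smoothing is one simple way to surface abnormally repeated tokens; any comparable anomaly score over word counts could serve the same screening role.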