Detection and Defense Against Backdoor Attacks in Large Language Models Based on Repeated-Word Analysis
-
Abstract
Backdoor attacks pose significant threats to the security and reliability of large language models (LLMs). Existing detection approaches often struggle to accurately identify complex and stealthy triggers, especially in large and diverse datasets, leaving gaps in defense effectiveness. This paper proposes a novel approach that detects and defends against such attacks by analyzing repeated words in input data. By identifying words that recur frequently across malicious inputs, the proposed approach locates backdoor triggers and mitigates their impact on LLMs. The method combines semantic clustering with recursive optimization to improve detection precision while minimizing disruption to benign outputs. Experimental results on a real-world movie review dataset demonstrate the accuracy, robustness, and efficiency of the approach in detecting backdoor attacks and strengthening model security.
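To make the detection idea concrete, the following is a minimal Python sketch of a repeated-word screening step, based only on the high-level description above: it flags words whose frequency in a set of suspect inputs is disproportionately high relative to a clean reference corpus and strips them before inference. The function names, thresholds, and example corpora are hypothetical illustrations, not the paper's implementation, which additionally involves semantic clustering and recursive optimization.

```python
# Illustrative sketch of repeated-word trigger screening (hypothetical names
# and thresholds; not the paper's exact method).
from collections import Counter
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def candidate_triggers(suspect_inputs, reference_inputs,
                       ratio_threshold=5.0, min_count=3):
    """Return words whose relative frequency in suspect inputs greatly
    exceeds their frequency in a clean reference corpus."""
    suspect_counts = Counter(w for t in suspect_inputs for w in tokenize(t))
    reference_counts = Counter(w for t in reference_inputs for w in tokenize(t))
    suspect_total = max(sum(suspect_counts.values()), 1)
    reference_total = max(sum(reference_counts.values()), 1)

    flagged = []
    for word, count in suspect_counts.items():
        if count < min_count:
            continue
        suspect_freq = count / suspect_total
        # Add-one smoothing so words absent from the reference corpus
        # do not cause division by zero.
        reference_freq = (reference_counts.get(word, 0) + 1) / (reference_total + 1)
        if suspect_freq / reference_freq >= ratio_threshold:
            flagged.append(word)
    return flagged


def sanitize(text: str, triggers) -> str:
    """Drop flagged trigger words from an input before passing it to the model."""
    trigger_set = set(triggers)
    return " ".join(w for w in text.split()
                    if w.lower().strip(".,!?") not in trigger_set)


if __name__ == "__main__":
    suspect = ["this movie cf was cf terrible cf", "cf the plot cf made no sense"]
    clean = ["a moving and well acted drama", "the plot drags but the cast is strong"]
    triggers = candidate_triggers(suspect, clean)
    print(triggers)                        # e.g. ['cf']
    print(sanitize(suspect[0], triggers))  # trigger token removed before inference
```

The frequency-ratio heuristic with add-one smoothing is one simple way to surface abnormally repeated tokens; any comparable anomaly score over word counts could serve the same screening role.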