RLMI: A Black-box Model Inversion Attack Method Against Large Language Models for Text Classification
Abstract
The combination of pre-training and fine-tuning has become a standard paradigm for training large language models (LLMs). However, recent studies have shown that LLMs may unintentionally memorize fine-tuning data, which is typically domain-specific and sensitive, potentially leading to severe privacy risks. To investigate this privacy risk, this paper proposes a reinforcement learning-based model inversion attack that extracts the text data used to fine-tune LLMs on text classification tasks. First, we formulate the inversion of text data as the token-by-token generation of text sequences. Then, a policy-based reinforcement learning algorithm optimizes the parameters of a text generation model, which serves as the policy network. Finally, we use the optimized policy network to generate text sequences that are highly likely to belong to the fine-tuning dataset. By integrating a reinforcement learning framework, the method enables efficient exploration of the text space, addressing the challenges posed by the discreteness of text data and the unavailability of the LLM's internal structure and parameters. Extensive experiments demonstrate the effectiveness of the method in generating text sequences that are consistent with target labels and similar to the fine-tuning data. These findings reveal potential privacy vulnerabilities in fine-tuning LLMs and underscore the need for further research into privacy-preserving technologies.