Paper Title
You Don't Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers' Private Personas
Paper Authors
Paper Abstract
Social chatbots, also known as chit-chat chatbots, are evolving rapidly with large pretrained language models. Despite this huge progress, privacy concerns have arisen recently: the training data of large language models can be extracted via model inversion attacks. Meanwhile, the datasets used to train chatbots contain many private conversations between two individuals. In this work, we further investigate the privacy leakage of the hidden states of chatbots trained by language modeling, which has not yet been well studied. We show that speakers' personas can be inferred with high accuracy by a simple neural network. To mitigate this, we propose effective defense objectives that prevent persona leakage from the hidden states. Extensive experiments demonstrate that our proposed defense objectives greatly reduce the attack accuracy, from 37.6% to 0.5%, while preserving the language model's powerful generation ability.
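The "simple neural network" attack described above can be sketched as a probe classifier trained on a chatbot's hidden states. This is a minimal illustrative version, not the paper's actual setup: the hidden-state dimension, the number of persona labels, and the synthetic clustered data are all assumptions made here for demonstration.

```python
# Hypothetical sketch of a persona-inference probe on hidden states.
# The synthetic data stands in for hidden states that leak speaker
# information; all dimensions and labels are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

HIDDEN_DIM = 64    # dimensionality of the chatbot's hidden states (assumed)
NUM_PERSONAS = 4   # number of candidate persona labels (assumed)
N_SAMPLES = 512

# Synthetic stand-in: each persona forms a noisy cluster in hidden space,
# mimicking representations that encode the speaker's private persona.
centers = torch.randn(NUM_PERSONAS, HIDDEN_DIM)
labels = torch.randint(0, NUM_PERSONAS, (N_SAMPLES,))
states = centers[labels] + 0.1 * torch.randn(N_SAMPLES, HIDDEN_DIM)

# The attacker: a one-hidden-layer MLP mapping hidden states to personas.
probe = nn.Sequential(
    nn.Linear(HIDDEN_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, NUM_PERSONAS),
)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(states), labels)
    loss.backward()
    opt.step()

# High probe accuracy indicates the hidden states reveal personas.
accuracy = (probe(states).argmax(dim=1) == labels).float().mean().item()
print(f"probe accuracy: {accuracy:.2f}")
```

A defense objective in the spirit of the abstract would add a training term that drives such a probe's accuracy toward chance, so that hidden states remain useful for generation but uninformative about the speaker's persona.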