Paper Title

Pchatbot: A Large-Scale Dataset for Personalized Chatbot

Authors

Qian, Hongjin; Li, Xiaohe; Zhong, Hanxun; Guo, Yu; Ma, Yueyuan; Zhu, Yutao; Liu, Zhanliang; Dou, Zhicheng; Wen, Ji-Rong

Abstract

Natural language dialogue systems have attracted great attention recently. Because many dialogue models are data-driven, high-quality datasets are essential to these systems. In this paper, we introduce Pchatbot, a large-scale dialogue dataset that contains two subsets, collected from Weibo and from judicial forums respectively. To adapt the raw data to dialogue systems, we carefully normalize it through anonymization, deduplication, segmentation, and filtering. Pchatbot is significantly larger than existing Chinese dialogue datasets, which may benefit data-driven models. In addition, current dialogue datasets for personalized chatbots usually contain only a few persona sentences or attributes. Unlike these datasets, Pchatbot provides anonymized user IDs and timestamps for both posts and responses. This enables the development of personalized dialogue models that learn implicit user personality directly from a user's dialogue history. Our preliminary experimental study benchmarks several state-of-the-art dialogue models to provide a comparison for future work. The dataset is publicly accessible on GitHub.
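As a rough illustration of how anonymized user IDs and timestamps could be used to recover a user's dialogue history, the sketch below groups responses by user and orders them by time. The record layout (`Turn`, `responder_id`, `timestamp`) and both helper functions are hypothetical names chosen for this example. They are not the actual Pchatbot file format or the authors' method.

```python
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    # Hypothetical record layout; the real Pchatbot format may differ.
    post: str
    response: str
    responder_id: str  # anonymized user ID of the person who responded
    timestamp: int     # response time, e.g. seconds since the epoch


def build_user_histories(turns):
    """Group turns by responder and sort each group chronologically."""
    histories = defaultdict(list)
    for turn in turns:
        histories[turn.responder_id].append(turn)
    for user_turns in histories.values():
        user_turns.sort(key=lambda t: t.timestamp)
    return dict(histories)


def history_before(histories, user_id, timestamp):
    """Return the user's turns strictly earlier than `timestamp`."""
    return [t for t in histories.get(user_id, []) if t.timestamp < timestamp]
```

Restricting the history to turns earlier than the current response keeps a model from seeing future responses of the same user when it is trained or evaluated on a given post.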
