A Simple and Efficient Ensemble Classifier Combining Multiple Neural Network Models on Social Media Datasets in Vietnamese

Paper title

A Simple and Efficient Ensemble Classifier Combining Multiple Neural Network Models on Social Media Datasets in Vietnamese

Authors

Huynh, Huy Duc; Do, Hang Thi-Thuy; Van Nguyen, Kiet; Nguyen, Ngan Luu-Thuy

Abstract

Text classification is a popular topic in natural language processing that has attracted numerous research efforts worldwide. The significant growth of social media data calls for close attention from researchers to analyze it. Many studies address this field in many languages, but work on Vietnamese remains limited. Therefore, this study aims to classify Vietnamese social media texts from three Vietnamese benchmark datasets. We use and optimize advanced deep learning models, including CNN, LSTM, and their variants. We also implement BERT, which has never been applied to these datasets. Our experiments identify a suitable model for the classification task on each dataset. To take advantage of the single models, we propose an ensemble model that combines the highest-performing models. Our single models reach positive results on each dataset, and our ensemble model achieves the best performance on all three datasets. We reach an F1-score of 86.96% on the HSD-VLSP dataset, 65.79% on the UIT-VSMEC dataset, and 92.79% and 89.70% for sentiments and topics, respectively, on the UIT-VSFC dataset. Our models therefore outperform previous studies on these datasets.
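The abstract does not say how the ensemble combines its member models. As a minimal sketch only, assuming a simple majority vote over the labels predicted by each single model (the function name `majority_vote` and the tie-breaking rule are illustrative assumptions, not the authors' method), the idea could look like this:

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label predictions by majority vote.

    predictions: list of lists, one inner list of labels per model,
    each holding one label per input text.
    Assumption: ties go to the earliest model in the list that voted
    for one of the tied labels.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    # zip(*predictions) yields, for each sample, the tuple of votes
    # cast by the models in their original order.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote (in model order) that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined
```

For example, with hypothetical label predictions from three models such as a CNN, an LSTM, and BERT, `majority_vote([[0, 1, 2], [0, 2, 2], [1, 1, 0]])` keeps the label that at least two models agree on for each text. Averaging class probabilities (soft voting) would be an equally plausible reading of "combining the highest-performing models".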
