Paper Title
Multi-class Classification from Multiple Unlabeled Datasets with Partial Risk Regularization
Paper Authors
Paper Abstract
Recent years have witnessed great success of supervised deep learning, in which predictive models are trained from a large amount of fully labeled data. In practice, however, labeling such big data can be very costly and may not even be possible for privacy reasons. Therefore, in this paper, we aim to learn an accurate classifier without any class labels. More specifically, we consider the case where multiple sets of unlabeled data are available, together with only their class priors, i.e., the proportion of each class within each set. Under this problem setup, we first derive an unbiased estimator of the classification risk that can be computed from the given unlabeled sets, and we theoretically analyze the generalization error of the learned classifier. We then find that the classifier obtained in this way tends to overfit, as its empirical risk goes negative during training. To prevent such overfitting, we further propose a partial risk regularization that keeps the partial risks, taken with respect to the individual unlabeled datasets and classes, at certain levels. Experiments demonstrate that our method effectively mitigates overfitting and outperforms state-of-the-art methods for learning from multiple unlabeled sets.
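To make the construction concrete: if each unlabeled set U_j is drawn from the mixture p_j(x) = Σ_k θ_{jk} p(x | y = k) with known priors θ_{jk}, and the prior matrix Θ = (θ_{jk}) is invertible, then the class-conditional expected losses, and hence the classification risk, can be written as a linear combination of expected losses over the unlabeled sets. Below is a minimal PyTorch sketch of this rewritten risk, together with a regularizer in the spirit of the paper's partial risk regularization. It is not the authors' implementation: it assumes the number of unlabeled sets equals the number of classes (so Θ is square and invertible), uses cross-entropy as the surrogate loss, and the names unlabeled_sets_risk and partial_risk_regularizer are hypothetical. The paper's exact regularizer, which holds per-dataset, per-class partial risks at certain levels, may take a different form than the simple class-wise floor used here.

```python
# Illustrative sketch (not the authors' code) of learning from multiple
# unlabeled sets with known class priors.
import torch
import torch.nn.functional as F


def unlabeled_sets_risk(logits_per_set, Theta, test_priors):
    """Classification risk rewritten in terms of unlabeled sets U_1..U_m.

    logits_per_set: list of m tensors, each of shape (n_j, K) -- f(x) on U_j
    Theta:          (m, K) matrix of class priors theta_{jk} = p_j(y = k)
    test_priors:    (K,) class priors pi_k of the test distribution
    """
    m, K = Theta.shape
    assert m == K, "this sketch assumes a square, invertible prior matrix"
    W = torch.linalg.inv(Theta)  # (K, m); maps set-wise losses back to classes

    # partial[j, k]: mean surrogate loss on set U_j with every sample labeled k
    partial = torch.stack([
        torch.stack([
            F.cross_entropy(
                logits,
                torch.full((logits.shape[0],), k,
                           dtype=torch.long, device=logits.device),
            )
            for k in range(K)
        ])
        for logits in logits_per_set
    ])  # shape (m, K)

    # Estimated class-conditional risks; negative entries of W let these
    # (and the total risk) go negative on finite samples.
    class_risks = (W * partial.T).sum(dim=1)   # (K,)
    total_risk = (test_priors * class_risks).sum()
    return total_risk, class_risks


def partial_risk_regularizer(class_risks, floor=0.0):
    """Stand-in regularizer: penalize class-conditional risk estimates that
    fall below a target level, in the spirit of keeping partial risks at
    certain levels (the paper's exact form may differ)."""
    return F.relu(floor - class_risks).sum()
```

In a training loop one would then minimize risk + lam * partial_risk_regularizer(class_risks), where lam is a regularization weight. Because the inverse prior matrix W has negative entries, the first term alone can be driven arbitrarily negative by a flexible model, which is exactly the overfitting behavior the regularization is meant to guard against.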