BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced Datasets

Paper Title

BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced Datasets

Authors

Suraj Kothawade, Pavan Kumar Reddy, Ganesh Ramakrishnan, Rishabh Iyer

Abstract

Current semi-supervised learning (SSL) methods assume a balance between the number of data points available for each class in both the labeled and the unlabeled datasets. However, class imbalance occurs naturally in most real-world datasets. Training on such imbalanced datasets is known to produce biased models, whose predictions in turn skew towards the more frequent classes. This issue is more pronounced in SSL methods, because they use the biased model to obtain pseudo-labels (on the unlabeled data) during training. In this paper, we tackle this problem by attempting to select a balanced labeled dataset for SSL that would result in an unbiased model. Unfortunately, acquiring a balanced labeled dataset from a class-imbalanced distribution in one shot is challenging. We propose BASIL (Balanced Active Semi-supervIsed Learning), a novel algorithm that optimizes the submodular mutual information (SMI) functions in a per-class fashion to gradually select a balanced dataset in an active learning loop. Importantly, our technique can be efficiently used to improve the performance of any SSL method. Our experiments on the Path-MNIST and Organ-MNIST medical datasets for a wide array of SSL methods show the effectiveness of BASIL. Furthermore, we observe that BASIL outperforms state-of-the-art diversity-based and uncertainty-based active learning methods, since the SMI functions select a more balanced dataset.
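
The per-class selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which SMI function BASIL uses, how features are extracted, or how the per-class budget is set. The sketch assumes cosine similarity over precomputed feature vectors, and it substitutes a simple facility-location-style relevance score (the sum, over the labeled examples of one class, of their best similarity to the selected points) for an SMI function. The names `cosine_similarity`, `greedy_query_coverage`, and `select_balanced_batch` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def greedy_query_coverage(sim, budget, available):
    """Greedily pick up to `budget` pool indices.

    Stand-in objective (an assumption, not the paper's SMI function):
    f(A) = sum over query points q of max(0, max_{a in A} sim[q, a]).
    `sim` has shape (n_query, n_pool); `available` is a boolean pool mask.
    """
    best = np.zeros(sim.shape[0])  # best similarity of each query point so far
    available = available.copy()   # do not mutate the caller's mask
    chosen = []
    for _ in range(budget):
        if not available.any():
            break
        # Marginal gain of adding each pool point to the current selection.
        gains = np.maximum(sim - best[:, None], 0.0).sum(axis=0)
        gains[~available] = -np.inf
        pick = int(np.argmax(gains))
        chosen.append(pick)
        available[pick] = False
        best = np.maximum(best, sim[:, pick])
    return chosen


def select_balanced_batch(labeled_x, labeled_y, pool_x, per_class_budget):
    """One active learning round: the same budget is spent on every class,
    using that class's labeled examples as the query set."""
    available = np.ones(len(pool_x), dtype=bool)
    batch = {}
    for cls in np.unique(labeled_y):
        query = labeled_x[labeled_y == cls]
        sim = cosine_similarity(query, pool_x)
        picks = greedy_query_coverage(sim, per_class_budget, available)
        available[picks] = False  # a pool point is selected for one class only
        batch[int(cls)] = picks
    return batch
```

In an active learning loop, the indices returned for each class would be sent for labeling, added to the labeled set, and the SSL model retrained before the next round. Spending an equal budget per class is what pushes the labeled set towards balance in this sketch.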
