Paper Title
Fairness in Semi-supervised Learning: Unlabeled Data Help to Reduce Discrimination
Paper Authors
Paper Abstract
A growing specter in the rise of machine learning is whether the decisions made by machine learning models are fair. While research is already underway to formalize notions of fairness in machine learning and to design frameworks for building fair models at some cost to accuracy, most of this work is geared toward either supervised or unsupervised learning. Yet two observations inspired us to wonder whether semi-supervised learning might be useful for solving discrimination problems. First, previous studies showed that increasing the size of the training set may lead to a better trade-off between fairness and accuracy. Second, the most powerful models today require an enormous amount of data to train, which, in practical terms, is likely only obtainable from a combination of labeled and unlabeled data. Hence, in this paper, we present a framework for fair semi-supervised learning in the pre-processing phase, comprising pseudo-labeling to predict labels for unlabeled data, a re-sampling method to obtain multiple fair datasets, and lastly, ensemble learning to improve accuracy and decrease discrimination. A theoretical decomposition analysis of bias, variance, and noise highlights the different sources of discrimination and the impact they have on fairness in semi-supervised learning. A set of experiments on real-world and synthetic datasets shows that our method is able to use unlabeled data to achieve a better trade-off between accuracy and discrimination.
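The three-stage pipeline named in the abstract (pseudo-labeling, fair re-sampling, ensemble learning) can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic data, the group-balanced down-sampling rule, the choice of logistic regression as the base learner, and the demographic-parity metric are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Illustrative synthetic data: feature x, protected attribute a in {0, 1},
# and a label y correlated with x (and hence indirectly with a).
n = 2000
a = rng.integers(0, 2, n)
x = rng.normal(a * 0.5, 1.0, n)
y = (x + rng.normal(0, 0.5, n) > 0.5).astype(int)
X = np.column_stack([x, a])

# Small labeled set, large unlabeled pool (semi-supervised setting).
X_lab, y_lab = X[:200], y[:200]
X_unl = X[200:]

# Stage 1: pseudo-labeling -- fit on labeled data, label the pool.
base = LogisticRegression().fit(X_lab, y_lab)
y_pseudo = base.predict(X_unl)
X_all = np.vstack([X_lab, X_unl])
y_all = np.concatenate([y_lab, y_pseudo])

def fair_resample(X, y, rng):
    """Down-sample so each (group, label) cell has equal size,
    equalizing base rates across the protected groups (an assumed
    fairness criterion, not necessarily the paper's)."""
    g = X[:, 1].astype(int)
    cells = [np.flatnonzero((g == gi) & (y == yi))
             for gi in (0, 1) for yi in (0, 1)]
    m = min(len(c) for c in cells)
    idx = np.concatenate([rng.choice(c, m, replace=False) for c in cells])
    return X[idx], y[idx]

# Stages 2-3: several fair resamples, one model per resample,
# combined by majority vote.
models = []
for _ in range(5):
    Xr, yr = fair_resample(X_all, y_all, rng)
    models.append(LogisticRegression().fit(Xr, yr))

def ensemble_predict(X):
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)

# Discrimination measured here as the demographic parity gap:
# |P(yhat = 1 | a = 0) - P(yhat = 1 | a = 1)|.
pred = ensemble_predict(X)
gap = abs(pred[a == 0].mean() - pred[a == 1].mean())
print(f"demographic parity gap: {gap:.3f}")
```

The resampling step deliberately discards examples so that both protected groups contribute equally to each label class; training each ensemble member on an independent resample recovers some of the accuracy lost to down-sampling, which mirrors the accuracy/discrimination trade-off the abstract describes.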