Paper Title

Learning sparse features can lead to overfitting in neural networks

Authors

Leonardo Petrini, Francesco Cagnetta, Eric Vanden-Eijnden, Matthieu Wyart

Abstract

It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via a random feature kernel or the NTK) because the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark image datasets. For (i), we compute the scaling of the generalization error with the number of training points and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse, and thereby less smooth, representations of the image predictors. This fact is plausibly responsible for the deterioration in performance, which is known to be correlated with smoothness along diffeomorphisms.
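To make setting (i) concrete, the following is a minimal sketch (not the paper's code) of the lazy-vs-feature comparison: kernel ridge regression with the arc-cosine kernel of order 1 (the ReLU random-feature kernel) stands in for the lazy regime, and a small ReLU network trained by full-batch gradient descent in a mean-field-style parameterization stands in for the feature-learning regime. The target is a wide random ReLU network, used here as a proxy for a Gaussian random function on the sphere; all dimensions, widths, learning rates, and sample sizes are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # input dimension; inputs lie on the unit sphere S^{d-1}

def sphere(n):
    """Sample n points uniformly on the unit sphere in R^d."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Target: a wide random ReLU network, whose infinite-width limit
# is a Gaussian random function (Gaussian process) on the sphere.
W_t = rng.standard_normal((4096, d))
a_t = rng.standard_normal(4096) / np.sqrt(4096)

def target(x):
    return np.maximum(x @ W_t.T, 0.0) @ a_t

def arccos_kernel(X, Y):
    """Arc-cosine kernel of order 1: the ReLU random-feature kernel on the sphere."""
    c = np.clip(X @ Y.T, -1.0, 1.0)  # cos(angle) between unit vectors
    th = np.arccos(c)
    return (np.sin(th) + (np.pi - th) * c) / (2.0 * np.pi)

def lazy_error(Xtr, ytr, Xte, yte, ridge=1e-6):
    """Kernel ridge regression with the random-feature kernel (lazy regime)."""
    K = arccos_kernel(Xtr, Xtr) + ridge * np.eye(len(Xtr))
    alpha = np.linalg.solve(K, ytr)
    return np.mean((arccos_kernel(Xte, Xtr) @ alpha - yte) ** 2)

def feature_error(Xtr, ytr, Xte, yte, h=256, lr=10.0, steps=3000):
    """Small ReLU net with mean-field-style scaling f(x) = (1/h) a . relu(Wx),
    trained by full-batch gradient descent (weights move: feature regime)."""
    W = rng.standard_normal((h, d))
    a = rng.standard_normal(h)
    n = len(Xtr)
    for _ in range(steps):
        pre = Xtr @ W.T                 # (n, h) pre-activations
        act = np.maximum(pre, 0.0)
        err = act @ a / h - ytr         # (n,) residuals
        ga = act.T @ err / (n * h)      # grad of 0.5*mean(err^2) w.r.t. a
        gW = ((err[:, None] * (pre > 0)) * a).T @ Xtr / (n * h)  # w.r.t. W
        a -= lr * ga
        W -= lr * gW
    return np.mean((np.maximum(Xte @ W.T, 0.0) @ a / h - yte) ** 2)

Xte = sphere(2000)
yte = target(Xte)
for n in [64, 128, 256, 512]:
    Xtr = sphere(n)
    ytr = target(Xtr)
    print(f"n={n:4d}  lazy MSE={lazy_error(Xtr, ytr, Xte, yte):.4f}"
          f"  feature MSE={feature_error(Xtr, ytr, Xte, yte):.4f}")
```

If the paper's claim carries over to this toy setup, the lazy column should shrink faster as n grows for such an isotropic, relatively smooth target; the scaling exponent of the generalization error can be estimated by fitting log error against log n.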
