Paper Title

Improving the repeatability of deep learning models with Monte Carlo dropout

Authors

Lemay, Andreanne, Hoebel, Katharina, Bridge, Christopher P., Befano, Brian, De Sanjosé, Silvia, Egemen, Diden, Rodriguez, Ana Cecilia, Schiffman, Mark, Campbell, John Peter, Kalpathy-Cramer, Jayashree

Abstract

The integration of artificial intelligence into clinical workflows requires reliable and robust models. Repeatability is a key attribute of model robustness. Repeatable models output predictions with low variation during independent tests carried out under similar conditions. During model development and evaluation, much attention is given to classification performance while model repeatability is rarely assessed, leading to the development of models that are unusable in clinical practice. In this work, we evaluate the repeatability of four model types (binary classification, multi-class classification, ordinal classification, and regression) on images that were acquired from the same patient during the same visit. We study the performance of binary, multi-class, ordinal, and regression models on four medical image classification tasks from public and private datasets: knee osteoarthritis, cervical cancer screening, breast density estimation, and retinopathy of prematurity. Repeatability is measured and compared on ResNet and DenseNet architectures. Moreover, we assess the impact of sampling Monte Carlo dropout predictions at test time on classification performance and repeatability. Leveraging Monte Carlo predictions significantly increased repeatability for all tasks on the binary, multi-class, and ordinal models, leading to an average reduction of the 95% limits of agreement by 16 percentage points and of the disagreement rate by 7 percentage points. Classification accuracy improved in most settings along with repeatability. Our results suggest that beyond about 20 Monte Carlo iterations, there is no further gain in repeatability. In addition to the higher test-retest agreement, Monte Carlo predictions were better calibrated, so that output probabilities more accurately reflect the true likelihood of correct classification.
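The core technique the abstract describes, sampling Monte Carlo dropout predictions at test time and averaging them, can be sketched as below. This is a minimal illustration with a toy one-hidden-layer classifier and random weights standing in for the paper's ResNet/DenseNet models; all names and parameters here are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy classifier weights (illustrative stand-in for a trained network).
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 3))

def forward(x, drop_p=0.5, dropout_active=True):
    h = np.maximum(x @ W1, 0.0)            # ReLU hidden layer
    if dropout_active:                     # keep dropout ON at test time
        mask = rng.random(h.shape) > drop_p
        h = h * mask / (1.0 - drop_p)      # inverted-dropout scaling
    return softmax(h @ W2)

def mc_dropout_predict(x, n_iter=20):
    """Average softmax outputs over n_iter stochastic forward passes.

    The abstract reports no further repeatability gain beyond ~20 iterations,
    hence the default here.
    """
    probs = np.stack([forward(x, dropout_active=True) for _ in range(n_iter)])
    return probs.mean(axis=0)

x = rng.normal(size=(1, 8))
p = mc_dropout_predict(x, n_iter=20)
print(p.shape, float(p.sum()))  # (1, 3); averaged probabilities sum to 1
```

Because each forward pass samples a different dropout mask, the averaged output is smoother than a single deterministic pass, which is the mechanism behind the improved repeatability and calibration the abstract reports.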
