Deepfake Detection via Joint Unsupervised Reconstruction and Supervised Classification

Paper Title

Deepfake Detection via Joint Unsupervised Reconstruction and Supervised Classification

Authors

Bosheng Yan, Chang-Tsun Li, Xuequan Lu

Abstract

Deep learning has enabled realistic face manipulation (i.e., deepfakes), which raises significant concerns about the integrity of media in circulation. Most existing deep learning techniques for deepfake detection achieve promising performance in the intra-dataset evaluation setting (i.e., training and testing on the same dataset), but fail to perform satisfactorily in the inter-dataset evaluation setting (i.e., training on one dataset and testing on another). Most previous methods use a backbone network to extract global features for making predictions and employ only binary supervision (i.e., indicating whether the training instances are fake or authentic) to train the network. Classification based merely on learned global features often leads to weak generalizability to unseen manipulation methods. In addition, the reconstruction task can improve the learned representations. In this paper, we introduce a novel approach for deepfake detection that considers the reconstruction and classification tasks simultaneously to address these problems. This method shares the information learned by one task with the other, an aspect that existing works rarely consider, and hence boosts the overall performance. In particular, we design a two-branch Convolutional AutoEncoder (CAE), in which the Convolutional Encoder used to compress the feature map into the latent representation is shared by both branches. The latent representation of the input data is then fed simultaneously to a simple classifier and to the unsupervised reconstruction component. Our network is trained end-to-end. Experiments demonstrate that our method achieves state-of-the-art performance on three commonly used datasets, particularly in the cross-dataset evaluation setting.
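The data flow described in the abstract (one shared encoder whose latent representation feeds both a classifier and a reconstruction branch) can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions, not the authors' implementation: small dense layers stand in for the convolutional encoder and decoder, and the combined objective (binary cross-entropy plus a weighted mean squared reconstruction error, with a hypothetical weight `lam`) is an assumption, since the abstract does not state the loss. All function and parameter names are illustrative.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init(rows, cols, rng):
    """Small random weight matrix plus zero bias (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return weights, [0.0] * rows

def forward(x, params):
    """The shared encoder output feeds both the classifier and the decoder."""
    latent = relu(linear(x, *params["encoder"]))
    p_fake = sigmoid(linear(latent, *params["classifier"])[0])
    recon = linear(latent, *params["decoder"])
    return p_fake, recon

def joint_loss(x, label, params, lam=1.0):
    """Assumed objective: binary cross-entropy + lam * mean squared error."""
    p_fake, recon = forward(x, params)
    eps = 1e-12  # guards against log(0)
    bce = -(label * math.log(p_fake + eps) + (1 - label) * math.log(1 - p_fake + eps))
    mse = sum((r - v) ** 2 for r, v in zip(recon, x)) / len(x)
    return bce + lam * mse, bce, mse

rng = random.Random(0)
input_dim, latent_dim = 8, 3  # toy sizes, not taken from the paper
params = {
    "encoder": init(latent_dim, input_dim, rng),
    "classifier": init(1, latent_dim, rng),
    "decoder": init(input_dim, latent_dim, rng),
}
x = [rng.random() for _ in range(input_dim)]  # stand-in for a flattened face crop
total, bce, mse = joint_loss(x, label=1, params=params, lam=0.5)
print(f"total={total:.4f} bce={bce:.4f} mse={mse:.4f}")
```

The sketch computes the loss only. In end-to-end training, the gradients of both terms would flow back into the shared encoder, which is how information learned for one task could reach the other.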
