Paper Title

Diet deep generative audio models with structured lottery

Authors

Philippe Esling, Ninon Devis, Adrien Bitton, Antoine Caillon, Axel Chemla-Romeu-Santos, Constance Douwes

Abstract

Deep learning models have provided extremely successful solutions in most audio application fields. However, the high accuracy of these models comes at the expense of a tremendous computation cost. This aspect is almost always overlooked in evaluating the quality of proposed models. However, models should not be evaluated without taking into account their complexity. This aspect is especially critical in audio applications, which heavily rely on specialized embedded hardware with real-time constraints. In this paper, we build on recent observations that deep models are highly overparameterized by studying the lottery ticket hypothesis on deep generative audio models. This hypothesis states that extremely efficient small sub-networks exist in deep models and would provide higher accuracy than larger models if trained in isolation. However, lottery tickets are found by relying on unstructured masking, which means that the resulting models do not provide any gain in either disk size or inference time. Instead, we develop here a method aimed at performing structured trimming. We show that this requires relying on global selection, and we introduce a specific criterion based on mutual information. First, we confirm the surprising result that smaller models provide higher accuracy than their large counterparts. We further show that we can remove up to 95% of the model weights without significant degradation in accuracy. Hence, we can obtain very light models for generative audio across popular methods such as WaveNet, SING, or DDSP that are up to 100 times smaller with commensurate accuracy. We study the theoretical bounds for embedding these models on Raspberry Pi and Arduino, and show that we can obtain generative models on CPU with quality equivalent to that of large GPU models. Finally, we discuss the possibility of implementing deep generative audio models on embedded platforms.
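The abstract contrasts unstructured masking (which zeroes individual weights but leaves the tensor shapes, and hence disk size and inference time, unchanged) with structured trimming driven by a mutual-information criterion. As a rough illustration only, not the paper's actual method, the sketch below scores each hidden unit with a simple histogram-based mutual-information estimate against a reference signal and then drops entire low-scoring units, which genuinely shrinks the weight matrix. The function names, the binned estimator, and the `keep_ratio` parameter are all illustrative assumptions.

```python
import numpy as np

def binned_mutual_information(x, y, bins=16):
    # Plug-in MI estimate between two 1-D signals from a 2-D histogram.
    # This is KL(p(x,y) || p(x)p(y)) of the empirical distributions.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over x (column vector)
    py = pxy.sum(axis=0, keepdims=True)   # marginal over y (row vector)
    nz = pxy > 0                          # avoid log(0) on empty bins
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def structured_prune(weight, activations, target, keep_ratio=0.5):
    # Score each hidden unit by the MI between its activation trace and a
    # reference target signal, then drop whole rows (units) with the lowest
    # scores. Unlike a binary weight mask, this yields a smaller dense layer.
    #   weight:      (n_units, in_dim) layer weights
    #   activations: (n_samples, n_units) recorded unit activations
    #   target:      (n_samples,) reference signal used for the MI criterion
    scores = np.array(
        [binned_mutual_information(a, target) for a in activations.T]
    )
    n_keep = max(1, int(keep_ratio * weight.shape[0]))
    keep = np.sort(np.argsort(scores)[-n_keep:])  # indices of surviving units
    return weight[keep], keep
```

In a real pipeline the downstream layer's input dimension would be shrunk by the same index set, and (following the lottery-ticket procedure) the surviving sub-network would be retrained; global selection would pool scores across all layers rather than trimming each layer by a fixed ratio.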
