Paper Title

Doubly-Stochastic Normalization of the Gaussian Kernel is Robust to Heteroskedastic Noise

Paper Authors

Boris Landa, Ronald R. Coifman, Yuval Kluger

Paper Abstract

A fundamental step in many data-analysis techniques is the construction of an affinity matrix describing similarities between data points. When the data points reside in Euclidean space, a widespread approach is to form an affinity matrix by applying the Gaussian kernel to pairwise distances, and to follow it with a certain normalization (e.g., the row-stochastic normalization or its symmetric variant). We demonstrate that the doubly-stochastic normalization of the Gaussian kernel with zero main diagonal (i.e., no self-loops) is robust to heteroskedastic noise. That is, the doubly-stochastic normalization is advantageous in that it automatically accounts for observations with different noise variances. Specifically, we prove that in a suitable high-dimensional setting where heteroskedastic noise does not concentrate too much in any particular direction in space, the resulting (doubly-stochastic) noisy affinity matrix converges to its clean counterpart with rate $m^{-1/2}$, where $m$ is the ambient dimension. We demonstrate this result numerically, and show that, in contrast, the popular row-stochastic and symmetric normalizations behave unfavorably under heteroskedastic noise. Furthermore, we provide examples of simulated and experimental single-cell RNA sequencing data with intrinsic heteroskedasticity, where the advantage of the doubly-stochastic normalization for exploratory analysis is evident.
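The normalization discussed in the abstract can be computed as a symmetric Sinkhorn-type scaling of the zero-diagonal Gaussian kernel, i.e., finding a positive vector $d$ such that $W = \mathrm{diag}(d)\,K\,\mathrm{diag}(d)$ has unit row and column sums. Below is a minimal NumPy sketch of this idea; the function name, the bandwidth parameter `sigma`, and the convergence settings are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def doubly_stochastic_gaussian_affinity(X, sigma, n_iter=1000, tol=1e-10):
    """Sketch: zero-diagonal Gaussian kernel scaled to be doubly stochastic.

    X     : (n, m) array of n data points in m-dimensional Euclidean space.
    sigma : Gaussian kernel bandwidth (user-chosen; illustrative parameter).
    Returns W = diag(d) K diag(d) with (approximately) unit row/column sums.
    """
    # Gaussian kernel on pairwise squared Euclidean distances.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(K, 0.0)  # zero main diagonal: no self-loops

    # Symmetric Sinkhorn-style iteration for the fixed point d_i * (K d)_i = 1,
    # using a damped (geometric-mean) update to avoid oscillation.
    d = np.ones(K.shape[0])
    for _ in range(n_iter):
        d_new = np.sqrt(d / (K @ d))
        if np.max(np.abs(d_new - d)) < tol:
            d = d_new
            break
        d = d_new
    return K * np.outer(d, d)
```

Unlike the row-stochastic normalization $D^{-1}K$, this scaling keeps the affinity matrix symmetric, and, per the abstract, its noisy version converges to the clean counterpart at rate $m^{-1/2}$ under the heteroskedastic-noise setting considered in the paper.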
