Paper title
Towards Continuous Consistency Axiom
Paper authors
Paper abstract
Developing new machine learning algorithms, especially clustering algorithms, comparing such algorithms, and testing them according to software engineering principles all require labeled data sets. Although standard benchmarks are available, a broader range of such data sets is needed to avoid overfitting. In this context, theoretical work on the axiomatization of clustering algorithms, especially axioms on clustering-preserving transformations, offers quite a cheap way to produce labeled data sets from existing ones. However, as we show in this paper, the frequently cited axiomatic system of Kleinberg (2002) is not applicable to finite-dimensional Euclidean spaces, in which many algorithms, such as $k$-means, operate. In particular, the so-called outer-consistency axiom fails under small changes in data point positions, and the inner-consistency axiom holds only for the identity transformation in general settings. Hence we propose an alternative axiomatic system in which Kleinberg's inner-consistency axiom is replaced by a centric consistency axiom and the outer-consistency axiom is replaced by a motion consistency axiom. We demonstrate that the new system is satisfiable for a hierarchical version of $k$-means with auto-adjusted $k$, hence it is not contradictory. Additionally, as $k$-means creates convex clusters only, we demonstrate that it is possible to create a version that detects concave clusters and still satisfies the axiomatic system. A practical application area of such an axiomatic system may be the generation of new labeled test data from existing data for testing clustering algorithms.
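To illustrate how a clustering-preserving transformation could yield a new labeled data set from an existing one, here is a minimal sketch in Python with NumPy. The abstract does not define the centric consistency axiom, so the sketch assumes a transformation of that general kind: one cluster is contracted toward its own centroid by a factor `lam` in (0, 1], and the labels are carried over unchanged. The names `shrink_cluster` and `nearest_centroid_labels`, the parameter `lam`, and the synthetic two-blob data are hypothetical choices for this example, not taken from the paper.

```python
import numpy as np


def shrink_cluster(points, labels, cluster_id, lam):
    """Return a copy of `points` with one cluster contracted toward its centroid.

    Assumption of this sketch: each point x of the chosen cluster is moved to
    centroid + lam * (x - centroid), with 0 < lam <= 1.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    mask = labels == cluster_id
    if not mask.any():
        raise ValueError("cluster_id not present in labels")
    centroid = points[mask].mean(axis=0)
    out = points.copy()
    out[mask] = centroid + lam * (points[mask] - centroid)
    return out


def nearest_centroid_labels(points, labels):
    """Reassign every point to the nearest centroid of the given labeling."""
    ids = np.unique(labels)
    centroids = np.stack([points[labels == c].mean(axis=0) for c in ids])
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return ids[dists.argmin(axis=1)]


# Hypothetical labeled data set: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=(6.0, 0.0), scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])
y = np.repeat([0, 1], 50)

# New labeled data set: same labels, cluster 0 contracted by half.
X_new = shrink_cluster(X, y, cluster_id=0, lam=0.5)

# Sanity check only: the old labeling is still a nearest-centroid fixed point.
still_consistent = np.array_equal(nearest_centroid_labels(X_new, y), y)
```

The final check only confirms that the original labeling remains stable under nearest-centroid reassignment for this particular data. It is a sanity check, not a proof that any clustering algorithm returns the same partition on the transformed data.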