Paper title
Machine Learning Against Cancer: Accurate Diagnosis of Cancer by Machine Learning Classification of the Whole Genome Sequencing Data
Paper authors
Paper abstract
Machine learning can precisely identify different cancer tumors at any stage by classifying cancerous and healthy samples based on their genomic profiles. We have developed novel MLAC (Machine Learning Against Cancer) methods that achieve perfect precision, sensitivity, and specificity. We used the whole genome sequencing data acquired by next-generation RNA sequencing techniques in The Cancer Genome Atlas and Genotype-Tissue Expression projects for cancerous and healthy tissues, respectively. Moreover, we show that unsupervised machine learning clustering has great potential for cancer diagnosis. Indeed, a creative way of working with the data and general algorithms resulted in perfect classification, i.e., precision, sensitivity, and specificity all equal to 1, for most of the different tumor types, even with a modest amount of data. The same method works well on a series of cancers and also yields a clear clustering of cancerous and healthy samples. Our system can be used in practice because, once the classifier is trained, it can classify any new sample from a potential patient. One advantage of our work is that the aforementioned perfect precision and recall are obtained on samples from all stages, including very early stages of cancer; it is therefore a promising tool for early-stage cancer diagnosis. Another advantage of our model is that it works with normalized values of RNA sequencing data, so people's private, sensitive medical data remain hidden, protected, and safe. This type of analysis will become widespread and economical in the future, and people may even learn to obtain their own RNA sequencing data and carry out preliminary cancer studies themselves, which could help healthcare systems. It is a great step toward good health, which is the foundation of sustainable societies.
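For readers less familiar with the three metrics the abstract reports, the following minimal Python sketch shows how precision, sensitivity (recall), and specificity are conventionally computed for a binary cancerous-versus-healthy classification. It is not the authors' MLAC pipeline, which the abstract does not describe; the function name `binary_metrics`, the label encoding (1 = cancerous, 0 = healthy), and the toy labels are assumptions made purely for illustration.

```python
import math
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return precision, sensitivity, and specificity.

    Assumed encoding (hypothetical, for illustration): 1 = cancerous, 0 = healthy.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cancerous, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cancerous, missed

    def ratio(numerator: int, denominator: int) -> float:
        # A metric is undefined (NaN) when its denominator is zero.
        return numerator / denominator if denominator else math.nan

    return {
        "precision": ratio(tp, tp + fp),    # flagged samples that are cancerous
        "sensitivity": ratio(tp, tp + fn),  # cancerous samples that are flagged
        "specificity": ratio(tn, tn + fp),  # healthy samples that are cleared
    }


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    truth = [1, 1, 1, 0, 0, 0]
    perfect = [1, 1, 1, 0, 0, 0]
    print(binary_metrics(truth, perfect))
```

All three metrics equal 1 only when there are no false positives and no false negatives, which is the sense in which the abstract speaks of perfect classification.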