Paper Title

JigSaw: A tool for discovering explanatory high-order interactions from random forests

Author

DiMucci, Demetrius

Abstract

Machine learning is revolutionizing biology by facilitating the prediction of outcomes from complex patterns found in massive data sets. Large biological data sets, like those generated by transcriptome or microbiome studies, measure many relevant components that interact in vivo with one another in modular ways. Identifying the high-order interactions that machine learning models use to make predictions would facilitate the development of hypotheses linking combinations of measured components to outcome. By using the structure of random forests, a new algorithmic approach, termed JigSaw, was developed to aid in the discovery of patterns that could explain predictions made by the forest. By examining the patterns of individual decision trees, JigSaw identifies high-order interactions between measured features that are strongly associated with a particular outcome and identifies the relevant decision thresholds. JigSaw's effectiveness was tested in simulation studies, where it was able to recover multiple ground truth patterns even in the presence of significant noise. It was then used to find patterns associated with outcomes in two real-world data sets. It was first used to identify patterns of clinical measurements associated with heart disease. It was then used to find patterns associated with breast cancer using metabolites measured in the blood. In heart disease, JigSaw identified several three-way interactions that combine to explain most of the heart disease records (66%) with high precision (93%). In breast cancer, three two-way interactions were recovered that can be combined to explain almost all records (92%) with good precision (79%). JigSaw is an efficient method for exploring high-dimensional feature spaces for rules that explain statistical associations with a given outcome and can inspire the generation of testable hypotheses.
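The abstract reports each set of interactions with two figures: the share of outcome-positive records the combined rules explain, and the precision of those rules. The sketch below illustrates one plausible way to compute such figures for a set of threshold rules. It is not the JigSaw implementation. The feature names `a`, `b`, `c`, the toy records, the rule format, and the reading of "explained" as "matched by at least one rule" are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, float]
# Hypothetical rule format: a conjunction of (feature, operator, threshold).
Condition = Tuple[str, str, float]
Rule = List[Condition]


def matches(rule: Rule, record: Record) -> bool:
    """Return True if the record satisfies every condition in the rule."""
    for feature, op, threshold in rule:
        value = record[feature]
        if op == "<=":
            if not value <= threshold:
                return False
        elif op == ">":
            if not value > threshold:
                return False
        else:
            raise ValueError(f"unsupported operator: {op}")
    return True


def coverage_and_precision(
    rules: List[Rule], records: List[Record], labels: List[int]
) -> Tuple[float, float]:
    """Score the rules as a union: a record is flagged if any rule matches.

    coverage  = flagged positives / all positives
    precision = flagged positives / all flagged records
    """
    flagged = [any(matches(r, rec) for r in rules) for rec in records]
    true_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    n_flagged = sum(flagged)
    n_pos = sum(labels)
    coverage = true_pos / n_pos if n_pos else 0.0
    precision = true_pos / n_flagged if n_flagged else 0.0
    return coverage, precision


# Toy data (invented for this sketch): label 1 marks the outcome of interest.
records = [
    {"a": 1.0, "b": 5.0, "c": 0.2},
    {"a": 2.0, "b": 6.0, "c": 0.1},
    {"a": 9.0, "b": 1.0, "c": 0.9},
    {"a": 1.0, "b": 7.0, "c": 0.3},
    {"a": 8.0, "b": 2.0, "c": 0.8},
    {"a": 7.0, "b": 1.0, "c": 0.1},
]
labels = [1, 1, 1, 0, 0, 0]

# One three-way interaction and one two-way interaction, each with thresholds.
three_way: Rule = [("a", "<=", 3.0), ("b", ">", 4.0), ("c", "<=", 0.25)]
two_way: Rule = [("a", ">", 7.5), ("c", ">", 0.7)]

single = coverage_and_precision([three_way], records, labels)
combined = coverage_and_precision([three_way, two_way], records, labels)
```

On this toy data, adding the second rule raises coverage but lowers precision, which is the trade-off the two reported percentages summarize for each real data set.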
