Paper title
TensAIR: Real-Time Training of Neural Networks from Data-streams
Paper authors
Paper abstract
Online learning (OL) from data streams is an emerging area of research that encompasses numerous challenges from stream processing, machine learning, and networking. Stream-processing platforms, such as Apache Kafka and Flink, have basic extensions for training Artificial Neural Networks (ANNs) in a stream-processing pipeline. However, these extensions were not designed to train ANNs in real time, and they suffer from performance and scalability issues when doing so. This paper presents TensAIR, the first OL system for training ANNs in real time. TensAIR achieves remarkable performance and scalability by using a decentralized and asynchronous architecture to train ANN models (either freshly initialized or pre-trained) via DASGD (decentralized and asynchronous stochastic gradient descent). We empirically demonstrate that TensAIR achieves nearly linear scale-out performance in terms of (1) the number of worker nodes deployed in the network, and (2) the throughput at which the data batches arrive at the dataflow operators. We demonstrate the versatility of TensAIR by investigating both sparse (word embedding) and dense (image classification) use cases, for which TensAIR achieves 6 to 116 times higher sustainable throughput rates than state-of-the-art systems for training ANNs in a stream-processing pipeline.