Paper title
Cluster Resource Management for Dynamic Workloads by Online Optimization
Paper authors
Paper abstract
Over the past ten years, many different approaches have been proposed for different aspects of the problem of resource management for long-running, dynamic, and diverse workloads such as processing query streams or distributed deep learning. Particularly for applications consisting of containerized microservices, researchers have attempted to address the dynamic selection of, for example, the types and quantities of virtualized services (e.g., IaaS/VMs), the vertical and horizontal scaling of different microservices, the assignment of microservices to VMs, task scheduling, or some combination thereof. In this context, we argue that frameworks like simulated annealing are highly suitable for online navigation of trade-offs between performance (SLO) and cost, particularly when the complex workloads and cloud-service offerings vary over time. Based on a macroscopic objective that combines both performance and cost terms, annealing facilitates lightweight and coherent policies of exploration and exploitation. In this paper, we first give some background on simulated annealing and then experimentally demonstrate its usefulness for different case studies, including service selection for both a single type of workload (e.g., distributed deep learning) and a mixture of workload types (exploring a partially categorical set of options), and container sizing for microservice benchmarks. We conclude with a discussion of how the basic annealing platform can be applied to other resource-management problems, hybridized with other methods, and adapted to accommodate user-specified rules of thumb.
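To make the idea of annealing over a combined performance-and-cost objective concrete, the following is a minimal sketch and not the paper's implementation. The VM catalog, the toy latency model, the penalty weight, and the names `objective`, `neighbor`, and `anneal` are all assumptions made for illustration. The sketch also runs offline against a fixed load, whereas the abstract describes online use with workloads that change over time.

```python
import math
import random

# Hypothetical VM catalog: (name, price per hour, relative speed). Not real offerings.
VM_TYPES = [("small", 0.10, 1.0), ("medium", 0.20, 2.2), ("large", 0.40, 4.0)]
MAX_VMS = 16


def objective(state, load=20.0, slo=1.0, penalty=5.0):
    """Assumed objective: hourly cost plus a penalty on latency above the SLO."""
    vm_type, count = state
    _, price, speed = VM_TYPES[vm_type]
    latency = load / (speed * count)  # toy performance model (assumption)
    return price * count + penalty * max(0.0, latency - slo)


def neighbor(state, rng):
    """Propose a nearby configuration: re-draw the VM type or step the VM count by one."""
    vm_type, count = state
    if rng.random() < 0.5:
        vm_type = rng.randrange(len(VM_TYPES))  # categorical move
    else:
        count = min(MAX_VMS, max(1, count + rng.choice([-1, 1])))
    return (vm_type, count)


def anneal(start, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing with a geometric cooling schedule."""
    rng = random.Random(seed)
    current, current_cost = start, objective(start)
    best, best_cost = current, current_cost
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_cost = objective(candidate)
        delta = candidate_cost - current_cost
        # Metropolis rule: always accept improvements (exploitation); accept a
        # worse candidate with probability exp(-delta / T) (exploration).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= cooling
    return best, best_cost


if __name__ == "__main__":
    best, best_cost = anneal(start=(0, 1))
    print(VM_TYPES[best[0]][0], best[1], round(best_cost, 3))
```

The temperature controls the balance the abstract refers to: early on, worse configurations are accepted often, so the search explores; as the temperature falls, the search settles into exploiting the best configurations found. In an online setting one would presumably replace the toy `objective` with measured cost and SLO metrics, which is outside the scope of this sketch.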