Paper Title
Irregular Accesses Reorder Unit: Improving GPGPU Memory Coalescing for Graph-Based Workloads
Paper Authors
Abstract
GPGPU architectures have become established as the dominant parallelization and performance platform, achieving exceptional popularity and empowering domains such as regular algebra, machine learning, image detection, and self-driving cars. However, irregular applications struggle to fully realize GPGPU performance because of control flow divergence and the memory divergence caused by irregular memory access patterns. To ameliorate these issues, programmers are obligated to carefully consider architecture features and devote significant effort to modifying their algorithms with complex optimization techniques, which shift programmers' priorities yet struggle to quell the shortcomings. We show that these inefficiencies prevail in graph-based GPGPU irregular applications, yet we find that it is possible to relax the strict relationship between a thread and the data it processes to enable new optimizations. Based on this key idea, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension tightly integrated in the GPGPU pipeline. The IRU reorders the data processed by the threads on irregular accesses, which significantly improves memory coalescing and increases performance and energy efficiency. Additionally, the IRU is capable of filtering and merging duplicated irregular accesses, which further improves graph-based irregular applications. Programmers can easily utilize the IRU through a simple API, or through compiler-optimized code generated with the provided extended ISA instructions. We evaluate our proposal on state-of-the-art graph-based algorithms and a wide selection of applications. Results show that the IRU achieves a memory coalescing improvement of 1.32x and a 46% reduction in overall memory hierarchy traffic, which results in a 1.33x performance improvement and 13% energy savings, respectively, while incurring a small 5.6% area overhead.
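The coalescing effect described in the abstract can be illustrated with a small software model. The sketch below is not the IRU hardware or its API, which the abstract does not specify. It only counts how many distinct memory lines each warp touches, before and after the irregular indices are redistributed among threads. The warp size, line size, and element size are assumed values, the function names `transactions` and `reorder` are hypothetical, and plain sorting is used as a stand-in for whatever reordering the unit actually performs.

```python
WARP_SIZE = 32    # threads per warp (assumed)
LINE_BYTES = 128  # bytes served by one memory transaction (assumed)
ELEM_BYTES = 4    # bytes per accessed element (assumed)


def transactions(indices):
    """Sum, over all warps, the number of distinct memory lines touched."""
    elems_per_line = LINE_BYTES // ELEM_BYTES
    total = 0
    for start in range(0, len(indices), WARP_SIZE):
        warp = indices[start:start + WARP_SIZE]
        total += len({i // elems_per_line for i in warp})
    return total


def reorder(indices, merge_duplicates=False):
    """Toy stand-in for reordering: hand indices to threads in sorted order.

    With merge_duplicates=True, repeated indices are collapsed first,
    loosely mirroring the filtering and merging the abstract mentions.
    """
    if merge_duplicates:
        indices = set(indices)
    return sorted(indices)


# A scattered access pattern: a permutation of 0..1023 with stride 37.
irregular = [(i * 37) % 1024 for i in range(1024)]

before = transactions(irregular)
after = transactions(reorder(irregular))
print("lines touched before reordering:", before)
print("lines touched after reordering: ", after)
```

In this model, each warp ends up reading one contiguous line after reordering, whereas the scattered pattern spreads every warp over many lines. The model works only because any thread may process any index, which is the relaxed thread-to-data relationship the abstract relies on.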