Paper Title


SIMD$^2$: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

Authors

Yunan Zhang, Po-An Tsai, Hung-Wei Tseng

Abstract


Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix-multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ in only the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have potential to be accelerated by a general-purpose matrix operation architecture, instead of common MXUs. In this paper, we propose SIMD$^2$, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD$^2$ instructions accelerate eight more types of matrix operations, in addition to matrix multiplications. Since SIMD$^2$ instructions resemble a matrix-multiplication instruction, we are able to build SIMD$^2$ architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD$^2$ using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD$^2$ provides up to 38.59$\times$ speedup and more than 10.63$\times$ on average over optimized CUDA programs, with only 5% of full-chip area overhead.
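As a rough illustration of the shared structure the abstract refers to, the sketch below (plain C++; names such as `semiring_matmul` are illustrative and it is not the SIMD$^2$ ISA or the paper's implementation) keeps the triple loop of an ordinary matrix product and only swaps the inner multiply-add for add-minimum, which turns the same loop nest into one relaxation step of an all-pairs shortest-path computation over the min-plus (tropical) semiring.

```cpp
// Minimal sketch, under the assumption that the semiring-like operations the paper
// targets can be expressed as C[i][j] = reduce_k combine(A[i][k], B[k][j]).
// The loop structure (and hence the tiling opportunity) is identical for every
// instantiation; only the combine/reduce pair and the identity element change.
#include <cstdio>
#include <limits>
#include <vector>

template <typename T, typename Combine, typename Reduce>
void semiring_matmul(const std::vector<T>& A, const std::vector<T>& B,
                     std::vector<T>& C, int n,
                     Combine combine, Reduce reduce, T identity) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            T acc = identity;
            for (int k = 0; k < n; ++k)
                acc = reduce(acc, combine(A[i * n + k], B[k * n + j]));
            C[i * n + j] = acc;
        }
}

int main() {
    const int n = 3;
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<float> C(n * n);

    // Ordinary GEMM: combine = multiply, reduce = add, identity = 0.
    std::vector<float> M = {1, 2, 3,
                            4, 5, 6,
                            7, 8, 9};
    semiring_matmul(M, M, C, n,
                    [](float a, float b) { return a * b; },
                    [](float a, float b) { return a + b; }, 0.0f);

    // Min-plus product on a weighted adjacency matrix (INF = no edge):
    // combine = add, reduce = min, identity = +inf. Repeated squaring of A
    // under this semiring yields all-pairs shortest paths.
    std::vector<float> A = {0,   2,   INF,
                            INF, 0,   3,
                            1,   INF, 0};
    semiring_matmul(A, A, C, n,
                    [](float a, float b) { return a + b; },
                    [](float a, float b) { return a < b ? a : b; }, INF);

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) printf("%6.1f ", C[i * n + j]);
        printf("\n");
    }
    return 0;
}
```

The point of the sketch mirrors the abstract's claim: because only the per-element operation differs between the two instantiations, a tiled MXU-style datapath that exposes the operation as a parameter could, in principle, serve both with minimal modification.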
