Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes | ChEESE - Centre of Excellence for Exascale Supercomputing in the area of the Solid Earth

Type of publication

Publication in Conference Proceedings/Workshop

Year of publication

2020

Publisher

2020 IEEE International Parallel and Distributed Processing Symposium Workshops

Link to the publication

https://ieeexplore.ieee.org/document/9150350

Link to the repository

https://arxiv.org/abs/2003.12787

Authors

Jean-Matthieu Gallard, Leonhard Rannabauer, Anne Reinarz and Michael Bader

Citation

Gallard, J., Rannabauer, L., Reinarz, A., & Bader, M. (2020). Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 711-720.

Short summary

We present a sequence of optimizations to the performance-critical compute kernels of the high-order discontinuous Galerkin solver of the hyperbolic PDE engine ExaHyPE, successively tackling bottlenecks due to SIMD operations, cache hierarchies and restrictions in the software design. Starting from a generic scalar implementation of the numerical scheme, our first optimized variant applies state-of-the-art optimization techniques by vectorizing loops, improving the data layout and using Loop-over-GEMM to perform tensor contractions via highly optimized matrix multiplication functions provided by the LIBXSMM library. We show that memory stalls due to a memory footprint exceeding our L2 cache size hindered the vectorization gains. We therefore introduce a new kernel that applies a sum factorization approach to reduce the kernel’s memory footprint and improve its cache locality. With the L2 cache bottleneck removed, we were able to exploit additional vectorization opportunities by introducing a hybrid Array-of-Structure-of-Array data layout that resolves the data layout conflict between the matrix multiplication kernels and the point-wise functions that implement PDE-specific terms. With this last kernel, evaluated in a benchmark simulation at high polynomial order, only 2% of the floating point operations are still performed using scalar instructions and 22.5% of the available performance is achieved.
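
The hybrid Array-of-Structure-of-Array layout mentioned in the summary can be illustrated with a minimal sketch. It is not taken from ExaHyPE: the function names, the block size `VECTOR_LENGTH` and the exact index ordering are assumptions chosen for illustration, and the paper's actual layout may differ. The idea shown is only that values of one variable for a small group of nodes are stored contiguously (so a point-wise PDE function can process them with one SIMD instruction), while the groups themselves stay in an array-of-structures order.

```python
# Illustrative sketch only; names and index order are assumptions, not ExaHyPE code.
VECTOR_LENGTH = 4  # assumed SIMD width in doubles (e.g. a 256-bit register)


def aos_index(node, var, n_vars):
    """Array-of-Structures: all variables of one node are contiguous."""
    return node * n_vars + var


def aosoa_index(node, var, n_vars, vec_len=VECTOR_LENGTH):
    """Hybrid layout: nodes are grouped in blocks of vec_len; inside a block,
    the vec_len values of one variable are contiguous."""
    block, lane = divmod(node, vec_len)
    return (block * n_vars + var) * vec_len + lane


def aos_to_aosoa(aos, n_nodes, n_vars, vec_len=VECTOR_LENGTH):
    """Copy an AoS array into the hybrid layout, zero-padding the last block."""
    n_blocks = -(-n_nodes // vec_len)  # ceiling division
    out = [0.0] * (n_blocks * n_vars * vec_len)
    for node in range(n_nodes):
        for var in range(n_vars):
            out[aosoa_index(node, var, n_vars, vec_len)] = aos[aos_index(node, var, n_vars)]
    return out


# Six nodes with two variables each; the value encodes (node, var) as node*10 + var.
n_nodes, n_vars = 6, 2
aos = [float(node * 10 + var) for node in range(n_nodes) for var in range(n_vars)]
aosoa = aos_to_aosoa(aos, n_nodes, n_vars)
```

In this sketch the first four entries of `aosoa` hold variable 0 for nodes 0 to 3, the next four hold variable 1 for the same nodes, and the second block is padded with zeros because six nodes do not fill two full blocks.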