Type of publication
Publication in Conference Proceedings/Workshop
Year of publication
2021
Publisher
ACM - HPC Asia 2021
Link to the publication
Citation
Ravil Dorozhinskii and Michael Bader. 2021. SeisSol on Distributed Multi-GPU Systems: CUDA Code Generation for the Modal Discontinuous Galerkin Method. In The International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2021). Association for Computing Machinery, New York, NY, USA, 69–82. DOI:https://doi.org/10.1145/3432261.3436753
Short summary
We present a GPU implementation of the high-order Discontinuous Galerkin (DG) scheme in SeisSol, a software package for simulating seismic waves and earthquake dynamics. Our particular focus is on providing a performance-portable solution for heterogeneous distributed multi-GPU systems. We therefore redesigned SeisSol’s code generation cascade for GPU programming models. This includes CUDA source code generation for the performance-critical small batched matrix multiplication kernels. The parallelisation extends the existing MPI+X scheme and supports SeisSol’s cluster-wise Local Time Stepping (LTS) algorithm for ADER time integration.
We performed a Roofline model analysis to verify that the generated batched matrix operations reach the performance limit imposed by the memory-bandwidth roofline. Our results also demonstrate that the generated GPU kernels outperform the corresponding cuBLAS subroutines by a factor of 2.5 on average. We present strong and weak scaling studies of our implementation on the Marconi100 supercomputer (with 4 Nvidia Volta V100 GPUs per node) on up to 256 GPUs, which revealed good parallel performance and efficiency for time integration with global time stepping (GTS). However, we show that directly mapping the LTS method from CPUs to distributed GPU environments results in lower hardware utilization. Nevertheless, due to the algorithmic advantages of local time stepping, the method still reduces time-to-solution by a factor of 1.3 on average compared with the GTS scheme.
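The memory-bandwidth roofline mentioned in the summary can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the matrix sizes (56 x 56 times 56 x 9), and the hardware figures (nominal single-precision peak and memory bandwidth of a V100) are assumptions chosen for illustration, and the traffic model simply assumes that every operand passes through global memory once per multiplication.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_value=4, accumulate=True):
    """FLOP per byte for C (+)= A * B with A: m x k, B: k x n, C: m x n.

    Assumption: each operand is moved through global memory exactly once,
    with no reuse across the batch.
    """
    flops = 2 * m * n * k                      # one multiply and one add per term
    c_traffic = 2 if accumulate else 1         # read and write C, or write only
    values_moved = m * k + k * n + c_traffic * m * n
    return flops / (values_moved * bytes_per_value)


def roofline_bound(intensity, peak_flops, peak_bandwidth):
    """Attainable FLOP/s under the Roofline model: min(peak, intensity * bandwidth)."""
    return min(peak_flops, intensity * peak_bandwidth)


# Hypothetical inputs, not measurements from the paper.
ASSUMED_PEAK_FLOPS = 15.7e12      # FLOP/s, nominal single precision
ASSUMED_PEAK_BANDWIDTH = 900e9    # bytes/s, nominal device memory bandwidth

intensity = gemm_arithmetic_intensity(m=56, n=9, k=56)
bound = roofline_bound(intensity, ASSUMED_PEAK_FLOPS, ASSUMED_PEAK_BANDWIDTH)
memory_bound = bound < ASSUMED_PEAK_FLOPS

print(f"arithmetic intensity: {intensity:.2f} FLOP/byte")
print(f"roofline bound: {bound / 1e12:.2f} TFLOP/s")
print("memory-bound" if memory_bound else "compute-bound")
```

Under these assumptions the small multiplication falls on the bandwidth-limited part of the roofline, which is why a kernel of this kind is judged against the memory-bandwidth limit rather than against peak floating-point throughput.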