Ravil Dorozhinskii and Michael Bader. 2021. SeisSol on Distributed Multi-GPU Systems: CUDA Code Generation for the Modal Discontinuous Galerkin Method. In The International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2021). Association for Computing Machinery, New York, NY, USA, 69–82. DOI:https://doi.org/10.1145/3432261.3436753
We performed a Roofline model analysis to ensure that the generated batched matrix operations achieve the performance limits posed by the memory-bandwidth roofline. Our results also demonstrate that the generated GPU kernels outperform the corresponding cuBLAS subroutines by 2.5 times on average. We present strong and weak scaling studies of our implementation on the Marconi100 supercomputer (with 4 Nvidia Volta V100 GPUs per node) on up to 256 GPUs , which revealed good parallel performance and efficiency in case of time integration using global time stepping. However, we show that directly mapping the LTS method from CPUs to distributed GPU environments results in lower hardware utilization. Nevertheless, due to the algorithmic advantages of local time stepping, the method still reduces time-to-solution by a factor of 1.3 on average in contrast to the GTS scheme.