SeisSol on Distributed Multi-GPU Systems: CUDA Code Generation for the Modal Discontinuous Galerkin Method | ChEESE - Centre of Excellence for Exascale Supercomputing in the area of the Solid Earth

Type of publication

Publication in Conference Proceedings/Workshop

Year of publication

2021

Publisher

ACM - HPC Asia 2021

Link to the publication

https://dl.acm.org/doi/pdf/10.1145/3432261.3436753

Authors

Ravil Dorozhinskii and Michael Bader

Citation

Ravil Dorozhinskii and Michael Bader. 2021. SeisSol on Distributed Multi-GPU Systems: CUDA Code Generation for the Modal Discontinuous Galerkin Method. In The International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2021). Association for Computing Machinery, New York, NY, USA, 69–82. DOI:https://doi.org/10.1145/3432261.3436753

Short summary

We present a GPU implementation of the high-order Discontinuous Galerkin (DG) scheme in SeisSol, a software package for simulating seismic waves and earthquake dynamics. Our particular focus is on providing a performance-portable solution for heterogeneous distributed multi-GPU systems. We therefore redesigned SeisSol’s code generation cascade for GPU programming models. This includes CUDA source code generation for the performance-critical small batched matrix multiplication kernels. The parallelisation extends the existing MPI+X scheme and supports SeisSol’s cluster-wise Local Time Stepping (LTS) algorithm for ADER time integration.
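
To make the term "small batched matrix multiplication" concrete, the following is a minimal reference sketch in plain Python. It applies the same small dense product independently to every entry of a batch, which is the access pattern a batched GPU kernel parallelises. The function names, the 2x2 shapes, and the choice of one operand shared across the batch are assumptions made for illustration; this is not SeisSol's generated CUDA code.

```python
def matmul(a, b):
    """Multiply two small dense matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def batched_matmul(batch_a, shared_b):
    """Compute C[e] = A[e] * B independently for every batch entry e.

    Each entry is independent of the others, so on a GPU every entry
    could be handled by its own thread block (assumption for illustration).
    """
    return [matmul(a, shared_b) for a in batch_a]


# Hypothetical batch of two 2x2 matrices and one shared 2x2 operand.
batch_a = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
shared_b = [[1.0, 0.0], [0.0, 2.0]]

result = batched_matmul(batch_a, shared_b)
```

Because the batch entries never depend on each other, the outer loop in `batched_matmul` is the natural place to parallelise.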

We performed a Roofline model analysis to ensure that the generated batched matrix operations achieve the performance limits posed by the memory-bandwidth roofline. Our results also demonstrate that the generated GPU kernels outperform the corresponding cuBLAS subroutines by 2.5 times on average. We present strong and weak scaling studies of our implementation on the Marconi100 supercomputer (with 4 Nvidia Volta V100 GPUs per node) on up to 256 GPUs, which revealed good parallel performance and efficiency in the case of time integration using global time stepping (GTS). However, we show that directly mapping the LTS method from CPUs to distributed GPU environments results in lower hardware utilization. Nevertheless, due to the algorithmic advantages of local time stepping, the method still reduces time-to-solution by a factor of 1.3 on average compared with the GTS scheme.
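
The Roofline bound mentioned above can be sketched as follows: attainable performance is the smaller of the peak compute rate and the product of arithmetic intensity and memory bandwidth. The matrix size, the simple traffic model (read A and B once, write C once, 8 bytes per value), and the peak and bandwidth figures are placeholder assumptions for illustration, not values taken from the paper.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_value=8):
    """FLOP per byte for C = A * B with A (m x k) and B (k x n).

    Assumed traffic model: A and B are read once and C is written once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_value * (m * k + k * n + m * n)
    return flops / bytes_moved


def roofline_bound(intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s under the Roofline model."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Placeholder hardware figures (assumptions, not measurements).
PEAK_GFLOPS = 7000.0
BANDWIDTH_GBS = 900.0

intensity = gemm_arithmetic_intensity(8, 8, 8)
bound = roofline_bound(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
memory_bound = bound < PEAK_GFLOPS
```

With these placeholder numbers the small product sits under the sloped part of the roofline, which is why such kernels are judged against the memory-bandwidth limit rather than against peak compute.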