This paper presents and evaluates a graph-based, cache-aware multi-core approach for polynomial-based time integration of linear autonomous partial differential equations.
Linear autonomous PDEs occur in various applications such as acoustic wave propagation for full waveform inversion problems or in splitting methods used in climate, weather and ocean simulations. To reduce the resource consumption for evaluating such PDEs on HPC systems, integration methods have to exploit the full potential of modern multi-core architectures with steep memory hierarchies.
Very often the heavy memory bandwidth requirements of the underlying kernel are a limiting factor on the performance of polynomial-based time integration.
To this end, we extend our prior work on a sequential temporal cache-blocking algorithm for matrix-polynomials with an explicit graph representation of the arising data dependencies. Combining static graph partitioning and dynamic scheduling the cache-aware execution of the sequential kernel can be mapped to various parallel setups including multiple caches and NUMA domains.
We demonstrate the performance of our approach for a 2-nd, 4-th and 6-th order time integration of the linear advection equation on three different architectures with widely varying memory systems and achieve an up to 60% reduction of wall clock time compared to a conventional, state-of-the-art non-cache-aware approach.
We conclude that matrix-polynomial-based time integration methods can strongly profit from our parallel cache-aware approach which opens the door for an efficient usage of modern HPC hardware.