The mining of multi-dimensional time series is a crucial step in gaining insights into data obtained from physical systems and from monitoring infrastructures. A widely accepted approach to this challenge is the matrix profile, which is, however, computationally very expensive. It relies on calculating large correlation matrices coupled with sort operations across all dimensions of the data, as well as on performing inclusive scans. All of these steps are inherently data parallel and can, therefore, benefit from execution on GPUs, and even more so from horizontal scaling across multiple GPUs. In addition, the nature of the matrix profile calculation allows the exploitation of reduced precision on GPUs. This offers further performance improvements, enabling the analysis of ever-growing data sets in real-world scenarios.
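To make the steps named above concrete, the following is a minimal, unoptimized NumPy sketch assuming an mSTAMP-style formulation of the multi-dimensional matrix profile: per-dimension distance profiles are sorted across dimensions, combined by an inclusive scan (cumulative sum), and reduced to one profile per number of included dimensions. The function names, the naive distance loop, and the exclusion-zone width are illustrative choices, not the paper's optimized GPU kernels.

```python
import numpy as np

def z_norm_dist_profiles(T, m, i):
    """Per-dimension z-normalized Euclidean distance profiles for the query
    subsequence starting at index i (naive reference computation)."""
    d, n = T.shape
    p = n - m + 1
    D = np.empty((d, p))
    for dim in range(d):
        q = T[dim, i:i + m]
        q = (q - q.mean()) / (q.std() + 1e-12)
        for j in range(p):
            s = T[dim, j:j + m]
            s = (s - s.mean()) / (s.std() + 1e-12)
            D[dim, j] = np.sqrt(np.sum((q - s) ** 2))
    return D

def multidim_matrix_profile(T, m):
    """k-dimensional matrix profile of a d-dimensional series T (shape d x n):
    sort distances across dimensions, run an inclusive scan, average, then keep
    the minimum over all candidate subsequences for every k."""
    d, n = T.shape
    p = n - m + 1
    excl = m // 2                                   # trivial-match exclusion half-width
    P = np.empty((d, p))
    for i in range(p):
        D = z_norm_dist_profiles(T, m, i)
        D[:, max(0, i - excl):i + excl + 1] = np.inf   # mask the exclusion zone
        D_sorted = np.sort(D, axis=0)                  # sort across dimensions
        scanned = np.cumsum(D_sorted, axis=0)          # inclusive scan over dimensions
        averaged = scanned / np.arange(1, d + 1)[:, None]
        P[:, i] = averaged.min(axis=1)                 # best match per k
    return P

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = rng.standard_normal((3, 500))      # 3-dimensional series with 500 points
    P = multidim_matrix_profile(T, m=50)
    print(P.shape)                         # (3, 451): one profile row per k
```

The column-wise sort and the cumulative sum in this sketch correspond to the sort and inclusive-scan operations the abstract identifies as the data-parallel, GPU-friendly parts of the computation.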
Based on these motivations, we introduce the first parallel algorithm for the multi-dimensional matrix profile on multiple GPUs that exploits reduced-precision modes, and provide a highly optimized implementation using novel optimization techniques. On one NVIDIA A100 GPU, our implementation achieves a 54x performance improvement compared to an optimized single-node execution of a state-of-the-art CPU-based implementation relying on double-precision computation, and an additional factor of 1.4x when switching to reduced precision while maintaining sufficient accuracy. We study the accuracy and performance trade-offs of our proposed algorithm in detail and present synthetic and real-world case studies to demonstrate how reduced precision improves performance while achieving sufficiently accurate results.
Authors: Yi Ju, Amir Raoofy, Dai Yang, Erwin Laure, Martin Schulz