HPC systems are getting ever more powerful, but this comes at the price of increasing system complexity: node architectures are deeply hierarchical and in many cases heterogeneous, and components can interact with each other in unpredictable ways. Further, current and future systems exhibit increasingly dynamic behavior, making static knowledge of their configuration alone insufficient. To use such systems efficiently, users as well as runtime systems have to be aware of the exact hardware structure at any time, i.e., the system's topology, its configuration parameters, any side effects a component can have on the rest of the system, and how all of this changes over time.
Current approaches to providing such information usually focus on a single aspect and do not consider dynamic behavior. For example, the widely used hwloc library, the current de facto standard for retrieving hardware topology information, provides a static hierarchical view of all node hardware, but covers neither other aspects of the system configuration nor dynamic behavior; other tools have similar limitations.
In this paper, we propose sys-sage, a novel approach that overcomes these limitations and goes beyond the functionality of existing tools, including hwloc. It tracks dynamic changes while unifying access to all system topology and configuration data. With that, it provides, at any point in time, a complete and up-to-date view of the HPC system on which an application or runtime system is executing. The novelty of our approach lies in combining static hardware topology information with other relevant system data in a single API, while enabling a dynamic view and exposing system updates and reconfigurations on the fly. We present the design of sys-sage and demonstrate its applicability with three separate use cases, as well as with further scenarios that are not easily solvable with currently available tools.
The paper is available here:
www.ce.cit.tum.de/fileadmin/w00cgn/caps/vanecek/sys-sage.pdf