The network hardware used in High-Performance Computing (HPC) systems is a core differentiator from regular computer clusters and has seen substantial improvements over the years. However, fully utilizing its capabilities requires careful application tuning, which in turn is only possible if insights into application behavior can be obtained. With the rise of increasingly complex HPC application software stacks, this calls for new kinds of fine-grained network monitoring capabilities. While regular TCP/IP-based traffic can be monitored with a large variety of existing tools, only a few tools are available for Remote Direct Memory Access (RDMA)-based communication, which is prevalent in HPC networks. Furthermore, these few tools either depend heavily on features of specific network hardware, offer only node-granular monitoring, or are tied to specific programming models.
In this paper, we introduce a novel network monitoring tool that layers directly on a portable network abstraction library, namely libfabric, while enabling application-specific monitoring. Our tool thereby provides network-hardware- and programming-model-agnostic, per-process monitoring of RDMA-based network utilization and is capable of always-on, system-wide monitoring of production environments. We conduct a detailed overhead analysis on several state-of-the-art HPC systems with a variety of network hardware and show that our monitoring tool induces only low overhead in the monitored application. In addition, we apply our monitoring tool to a real-world HPC application running in a production environment.
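To make the layering idea concrete, the sketch below shows one plausible way a monitor could sit between an application and libfabric: after an endpoint is created, its message-operation table is copied and the send entry is replaced with a counting wrapper, yielding a per-process byte counter without touching the provider or the network hardware. This is an illustrative assumption about the mechanism, not necessarily the paper's actual implementation; the names monitor_hook_endpoint, monitor_report, and bytes_sent are hypothetical.

```c
/* monitor_hook.c — illustrative sketch only, NOT the tool from the paper.
 * Wraps a libfabric endpoint's message-operation table so that every
 * send operation is counted per process. */
#include <stdatomic.h>
#include <stdio.h>
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

static atomic_ullong bytes_sent;          /* per-process byte counter       */
static struct fi_ops_msg hooked_msg_ops;  /* private copy of provider ops   */
static struct fi_ops_msg *orig_msg_ops;   /* original ops for pass-through  */

/* Counting wrapper with the same signature as fi_ops_msg::send. */
static ssize_t counting_send(struct fid_ep *ep, const void *buf, size_t len,
                             void *desc, fi_addr_t dest_addr, void *context)
{
    atomic_fetch_add(&bytes_sent, len);   /* record payload size            */
    return orig_msg_ops->send(ep, buf, len, desc, dest_addr, context);
}

/* Hypothetical hook installer: call once after the endpoint has been
 * created with fi_endpoint(). Copies the ops table instead of patching
 * the provider's shared table in place. */
void monitor_hook_endpoint(struct fid_ep *ep)
{
    orig_msg_ops = ep->msg;
    hooked_msg_ops = *orig_msg_ops;
    hooked_msg_ops.send = counting_send;
    ep->msg = &hooked_msg_ops;
}

/* Hypothetical reporting hook, e.g. invoked at application exit. */
void monitor_report(void)
{
    fprintf(stderr, "[monitor] bytes sent via fi_send: %llu\n",
            (unsigned long long)atomic_load(&bytes_sent));
}
```

A production-grade tool would presumably wrap all data-transfer paths (sendv, sendmsg, inject, as well as the RMA and tagged-message tables) and install itself transparently, for instance via library preloading or libfabric's own hooking-provider infrastructure, rather than requiring an explicit call from the application.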