Because modern HPC systems are typically composed of fat, resource-rich compute nodes, it is usually difficult for a single application to fully utilize all node resources. Co-scheduling, i.e., co-executing multiple complementary applications (or jobs) on the same node in a space-sharing manner, is a promising solution and has been widely studied over the past decade.
One major drawback of co-scheduling is interference among co-located applications, caused by contention for shared resources. In response, industry has started to support several resource/traffic partitioning features, e.g., in shared caches or memory controllers, on modern commercial processors. Recent studies have proposed effective approaches to exploit these advanced features; however, the interactions between these features and (1) job scheduling decisions and (2) NUMA (Non-Uniform Memory Access) effects have generally been overlooked.
This paper explicitly targets these two missing pieces and uses reinforcement learning to comprehensively harmonize the following decisions: (a) selecting jobs for co-execution from a given job queue; and (b) assigning diverse resources to the co-executed jobs by leveraging emerging hardware partitioning features while taking NUMA effects into account. Our evaluation results demonstrate that our approach improves total system throughput by up to 78.1% over naive time-sharing-based scheduling.