Slurm plugin for resource co-allocation
Slurm is a popular resource management system in HPC environments. Traditionally, compute nodes are the only resources considered in scheduling decisions, but to improve the performance of (future) exascale HPC systems running large data-intensive applications, scheduling decisions should also take the ever-increasing performance of fast storage technologies into account. Since the overall capacity of these fast storage technologies is limited, HPC centres have started to adopt multi-tiered storage systems.
To address this, we developed a plugin for Slurm that co-allocates compute and high-performance storage resources in a multi-tiered storage HPC system. In contrast to the native approach, which assigns high-performance storage tiers to an application upon user request, our plugin estimates waiting times for the different storage tiers and schedules high-performance storage only if turnaround times can be decreased. To estimate waiting times, we expect the user to specify I/O requirements rather than storage targets, which increases utilization and throughput (a simplified sketch of the decision logic is shown below).
The code for this plugin is available on GitHub: https://github.com/HumanBrainProject/coallocation-slurm-plugin
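The following Python sketch illustrates the basic idea behind the decision: a high-performance tier is selected only if the estimated turnaround time (waiting time plus I/O transfer time plus compute time) falls below that of the baseline file system. It is a simplified illustration with an assumed cost model and hypothetical names and numbers, not the plugin's actual implementation.

    from dataclasses import dataclass

    @dataclass
    class TierEstimate:
        name: str         # e.g. "pfs" (baseline parallel file system) or "burst_buffer"
        wait_time: float  # estimated waiting time until the tier can be allocated [s]
        bandwidth: float  # sustained I/O bandwidth available to the job [B/s]

    def estimated_turnaround(io_volume: float, compute_time: float,
                             tier: TierEstimate) -> float:
        """Turnaround estimate: wait for the tier, transfer the I/O volume, compute."""
        return tier.wait_time + io_volume / tier.bandwidth + compute_time

    def choose_tier(io_volume: float, compute_time: float, baseline: TierEstimate,
                    fast_tiers: list[TierEstimate]) -> TierEstimate:
        """Schedule a high-performance tier only if it shortens the estimated turnaround."""
        best, best_time = baseline, estimated_turnaround(io_volume, compute_time, baseline)
        for tier in fast_tiers:
            t = estimated_turnaround(io_volume, compute_time, tier)
            if t < best_time:
                best, best_time = tier, t
        return best

    # Hypothetical numbers: 20 TB of I/O and one hour of compute time.
    pfs = TierEstimate("pfs", wait_time=0.0, bandwidth=5e9)
    bb = TierEstimate("burst_buffer", wait_time=600.0, bandwidth=80e9)
    print(choose_tier(2e13, 3600.0, baseline=pfs, fast_tiers=[bb]).name)  # -> "burst_buffer"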
Scope of usage
I/O-intensive HPC batch jobs submitted to Slurm
Development level and deployment
The plugin has been developed to TRL6, is progressing towards TRL7, and is ready to be deployed in a production environment.
Tests have been conducted on CINECA’s MARCONI100, where a wide-striped file system emulated node-local burst buffers and thus the targeted storage tier architecture. Performance tests are pending and will be conducted as soon as CINECA can provide a suitable environment (GALILEO100).
Technologies used in the background
Slurm (a UNICORE interface is currently under evaluation for the future), remote-shared burst buffer systems such as Cray’s DataWarp or DDN’s IME
Dependencies and requirements
Remote-shared burst buffer system, Slurm
Evaluation method
Simulations in a Vagrant-provisioned virtual environment and correctness tests conducted on a production system with node-local burst buffers validated that I/O requests are performed on the targeted storage system. Performance will be evaluated by comparing mean turnaround times of HBP batch jobs on GALILEO100. Further evaluations can be found in the related publication.
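As a minimal sketch of this metric, assuming job accounting records with submission and completion timestamps (as reported, for example, by Slurm’s sacct), mean turnaround times of the native and the plugin-based configuration could be compared as follows; the records shown are hypothetical.

    def mean_turnaround(jobs):
        """Mean turnaround time: average of (completion - submission) over all jobs [s]."""
        return sum(job["end"] - job["submit"] for job in jobs) / len(jobs)

    # Hypothetical accounting records with timestamps in seconds.
    native_jobs = [{"submit": 0, "end": 5200}, {"submit": 100, "end": 6100}]
    plugin_jobs = [{"submit": 0, "end": 4300}, {"submit": 100, "end": 5000}]
    print(mean_turnaround(native_jobs), mean_turnaround(plugin_jobs))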
Related publications
L. E. Lackner, H. M. Fard and F. Wolf, "Efficient Job Scheduling for Clusters with Shared Tiered Storage," 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2019, pp. 321-330, doi: 10.1109/CCGRID.2019.00046.
Advantages
Reduces turnaround times of batch jobs by lowering I/O transfer times, which is achieved by taking over the decision of which storage tier to target
Shortcomings and limitations
Needs fine-grained I/O information from the user (collectible by profiling tools)
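For illustration only, the fine-grained I/O information meant here could take a form like the following; the field names are assumptions and do not correspond to the plugin’s actual interface.

    # Hypothetical per-job I/O requirements supplied by the user or a profiling tool.
    io_requirements = {
        "read_volume_bytes": 5 * 10**12,   # total input data read during the job
        "write_volume_bytes": 2 * 10**12,  # total output data written by the job
        "checkpoint_interval_s": 1800,     # how often intermediate results are flushed
    }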
Use cases
Sub-workflows of HBP workflows targeting a single site
Expectations in an operational environment
Reduced turnaround times if the user is able to provide detailed I/O information (e.g. for jobs that are submitted periodically and whose I/O behaviour is therefore known)
Next steps
Deployment on GALILEO100 and modification of the plugin to be compatible with DDN’s IME. Furthermore, an interface to UNICORE is being evaluated to allow querying information on HBP (sub-)workflows in order to extend the plugin to scientific workflows.