Slurm plugin for resource co-allocation
Slurm is a popular resource management system in HPC environments. Traditionally, compute nodes are the only resources considered in scheduling decisions, but to improve the performance of (future) exascale HPC systems running large data-intensive applications, scheduling decisions should also take the ever-increasing performance of fast storage technologies into account. Since the overall capacity of these fast storage technologies is limited, HPC centres have started to adopt multi-tiered storage systems.
To address this, we developed a plugin for Slurm that co-allocates compute and high-performance storage resources in a multi-tiered storage HPC system. In contrast to the native approach, which assigns high-performance storage tiers to an application upon user request, our plugin estimates waiting times for the different storage tiers and schedules high-performance storage only if turnaround times can be decreased. To estimate waiting times, we expect the user to specify I/O requirements rather than storage targets, which increases utilization and throughput (a simplified sketch of the decision logic is shown below).
The code for this plugin is available on GitHub: https://github.com/HumanBrainProject/coallocation-slurm-plugin
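The following Python sketch illustrates the basic idea behind the decision: a high-performance tier is selected only if the estimated turnaround time (waiting time plus I/O transfer time plus compute time) falls below that of the baseline file system. It is a simplified illustration with an assumed cost model and hypothetical names and numbers, not the plugin's actual implementation.

    from dataclasses import dataclass

    @dataclass
    class TierEstimate:
        name: str         # e.g. "pfs" (baseline parallel file system) or "burst_buffer"
        wait_time: float  # estimated waiting time until the tier can be allocated [s]
        bandwidth: float  # sustained I/O bandwidth available to the job [B/s]

    def estimated_turnaround(io_volume: float, compute_time: float,
                             tier: TierEstimate) -> float:
        """Turnaround estimate: wait for the tier, transfer the I/O volume, compute."""
        return tier.wait_time + io_volume / tier.bandwidth + compute_time

    def choose_tier(io_volume: float, compute_time: float, baseline: TierEstimate,
                    fast_tiers: list[TierEstimate]) -> TierEstimate:
        """Schedule a high-performance tier only if it shortens the estimated turnaround."""
        best, best_time = baseline, estimated_turnaround(io_volume, compute_time, baseline)
        for tier in fast_tiers:
            t = estimated_turnaround(io_volume, compute_time, tier)
            if t < best_time:
                best, best_time = tier, t
        return best

    # Hypothetical numbers: 20 TB of I/O and one hour of compute time.
    pfs = TierEstimate("pfs", wait_time=0.0, bandwidth=5e9)
    bb = TierEstimate("burst_buffer", wait_time=600.0, bandwidth=80e9)
    print(choose_tier(2e13, 3600.0, baseline=pfs, fast_tiers=[bb]).name)  # -> "burst_buffer"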
Scope of usage
I/O-intensive HPC batch jobs submitted to Slurm
Development level and deployment
The plugin has been developed to TRL6, is progressing towards TRL7, and is ready to be deployed in a production environment.
Tests have been conducted on CINECA’s MARCONI100, where a wide-striped file system emulated node-local burst buffers and thus the targeted storage tier architecture. Performance tests are pending and will be conducted as soon as CINECA can provide a suitable environment (GALILEO100).
Technologies used in the background
Slurm (a UNICORE interface is currently under evaluation for the future), remote-shared burst buffer systems such as Cray’s DataWarp or DDN’s IME
Dependencies and requirements
Remote-shared burst buffer system, Slurm
Evaluation method
Simulations in a Vagrant-provisioned virtual environment and correctness tests conducted on a production system with node-local burst buffers validated that I/O requests are performed on the targeted storage system. Performance will be evaluated by comparing mean turnaround times of HBP batch jobs on GALILEO100. Further evaluations can be found in the related publication.
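As a minimal sketch of this metric, assuming job accounting records with submission and completion timestamps (as reported, for example, by Slurm’s sacct), mean turnaround times of the native and the plugin-based configuration could be compared as follows; the records shown are hypothetical.

    def mean_turnaround(jobs):
        """Mean turnaround time: average of (completion - submission) over all jobs [s]."""
        return sum(job["end"] - job["submit"] for job in jobs) / len(jobs)

    # Hypothetical accounting records with timestamps in seconds.
    native_jobs = [{"submit": 0, "end": 5200}, {"submit": 100, "end": 6100}]
    plugin_jobs = [{"submit": 0, "end": 4300}, {"submit": 100, "end": 5000}]
    print(mean_turnaround(native_jobs), mean_turnaround(plugin_jobs))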
Related publications
L. E. Lackner, H. M. Fard and F. Wolf, "Efficient Job Scheduling for Clusters with Shared Tiered Storage," 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2019, pp. 321-330, doi: 10.1109/CCGRID.2019.00046.
Advantages
Reduces turnaround times of batch jobs by lowering I/O transfer times, which is achieved by taking over the decision of which storage tier to target
Shortcomings and limitations
Needs fine-grained I/O information from the user (collectible by profiling tools)
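For illustration only, the fine-grained I/O information meant here could take a form like the following; the field names are assumptions and do not correspond to the plugin’s actual interface.

    # Hypothetical per-job I/O requirements supplied by the user or a profiling tool.
    io_requirements = {
        "read_volume_bytes": 5 * 10**12,   # total input data read during the job
        "write_volume_bytes": 2 * 10**12,  # total output data written by the job
        "checkpoint_interval_s": 1800,     # how often intermediate results are flushed
    }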
Use cases
Sub-workflows of HBP workflows targeting a single site
Expectations in an operational environment
Reduced turnaround times if the user is able to provide detailed I/O information (e.g. for jobs that are submitted periodically and whose I/O behaviour is therefore known)
Next steps
Deployment on GALILEO100 and modification of the plugin to be compatible with DDN’s IME. Furthermore, an interface to UNICORE is being evaluated to allow querying information on HBP (sub-)workflows in order to extend the plugin to scientific workflows.