SLURM Workload Manager

Overview

The AUHPCS cluster uses SLURM (Simple Linux Utility for Resource Management) as its workload manager. SLURM is responsible for managing and scheduling cluster resources and jobs. Cluster resource assignments are referred to as allocations, and job queues are referred to as partitions in SLURM terminology.

The SLURM utilities are available on job submission nodes and the software build node.

Official SLURM documentation: SchedMD SLURM (https://slurm.schedmd.com/)

SLURM Partition Assignments

Each compute node profile is assigned to a SLURM partition, which determines how jobs are scheduled onto those nodes. See the Cluster Overview for more information.

SLURM Partition Mapping:

Partition Name    | Compute Node Assignment                  | Notes                                           | Memory Per Node | Total Memory
------------------+------------------------------------------+-------------------------------------------------+-----------------+-------------
interactive_q     | (1) general compute node, (1) GPU node   | Used for interactive workloads                  | 96 GB / 768 GB  | Variable
cpu_normal_q      | (18) general compute nodes               | Default partition for standard CPU jobs         | 96 GB           | 1.73 TB
cpu_middle_mem_q  | (8) middle-memory compute nodes          | For memory-intensive CPU workloads              | 768 GB          | 6 TB
cpu_high_mem_q    | (2) high-memory compute nodes            | For high-memory workloads                       | 1.53 TB         | 3.06 TB
gpu_normal_q      | (2) RTX6000 GPU compute nodes            | For GPU workloads requiring moderate power      | 768 GB          | 1.54 TB
gpu_middle_ai_q   | (2) T4 GPU compute nodes                 | Suitable for AI, ML, and inference workloads    | 768 GB          | 1.54 TB
gpu_high_ai_q     | (1) DGX A100 GPU compute node            | Optimized for large-scale AI and deep learning  | 1 TB            | 1 TB
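
The live partition layout can be confirmed from a submission node with the standard SLURM sinfo utility; the format string below is only a suggested sketch, and the exact output depends on the current cluster configuration.

$ sinfo                                 # one line per partition and node-state group
$ sinfo -o "%P %D %c %m %G"             # partition, node count, CPUs/node, memory/node (MB), GRES (e.g., GPUs)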

Partition Reservation Policies

Some partitions in AUHPCS require special reservation procedures:

  • cpu_high_mem_q and gpu_high_ai_q are high-capacity partitions and must be reserved in advance.

  • To reserve these partitions, users must email auhpcs_support@augusta.edu with project details and expected runtime.

  • The maximum reservation period is 10 days.

  • Reservations are granted based on resource availability and priority.

Please plan ahead and request access early if your work requires these high-capacity nodes.
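
Assuming the support team implements these requests as named SLURM reservations, a granted reservation can be inspected and targeted with standard commands; the reservation name below is a placeholder supplied by support.

$ scontrol show reservation                           # list active reservations and their time windows
$ sbatch --reservation=<reservation_name> my_job.sh   # submit a job into a granted reservation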

Job Submission

Job submission is the process of requesting resources from the scheduler. It is the gateway to all the computational horsepower in the cluster. Users submit jobs to tell the scheduler what resources are needed and for how long. The scheduler then evaluates the request according to resource availability and cluster policy to determine when the job will run and which resources to use.

Batch Job Submission

Batch jobs are submitted using SLURM job scripts. SLURM directives can be supplied in the job script as header lines (#SBATCH), as command-line options to the sbatch command, or as a combination of both. When the same option appears in both places, the command-line value takes precedence.

The general form of the sbatch command:

$ sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]

Example:

$ sbatch -N1 -t 1:00:00 my_job.sh
$ cat my_job.sh
#!/bin/bash
#SBATCH --job-name=my_job            # job name
#SBATCH --ntasks=10                  # number of tasks across all nodes
#SBATCH --partition=cpu_normal_q     # name of partition to submit job
#SBATCH --time=01:00:00              # walltime (HH:MM:SS or D-HH:MM:SS)
#SBATCH --output=job_output.txt      # output file
#SBATCH --error=job_error.txt        # error file
#SBATCH --mail-type=ALL              # email on job state changes (begin, end, fail, etc.)
#SBATCH --mail-user=user@augusta.edu

srun ./my_application

This batch job submission requests one node (-N1) with a walltime of 1 hour (-t 1:00:00) on the sbatch command line, while the job script requests a total of ten tasks (--ntasks=10) and the cpu_normal_q partition via #SBATCH directives; where the command line and the script overlap, the command-line options take precedence. Output and error logs are written to job_output.txt and job_error.txt, respectively, and the application is launched with srun.
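
As a sketch of the precedence rule, the same script can be redirected to a different partition at submission time without editing it; cpu_middle_mem_q below is chosen only as an illustration, and squeue is the standard way to check the state of the submitted job.

$ sbatch --partition=cpu_middle_mem_q my_job.sh   # command-line option overrides #SBATCH --partition in the script
$ squeue -u $USER                                 # list your queued and running jobs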

Note

Users can find pre-configured SLURM job script templates at: $ ls -l /home/<username>/scripts/templates/

Best Practices

  • Always select the appropriate partition based on job requirements.

  • Do not over-subscribe small partitions or leave large nodes under-utilized; request only the resources your job actually needs.

  • Use interactive jobs for debugging and development, then submit batch jobs for full-scale runs (see the example after this list).

  • Optimize scripts for efficient resource usage and avoid holding allocated nodes idle.
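
For the interactive workflow mentioned above, a session on the interactive_q partition can typically be started with standard SLURM commands; the resource values below are placeholders to adjust for your work, assuming interactive shells are permitted on that partition.

$ salloc --partition=interactive_q --ntasks=1 --time=00:30:00            # request an interactive allocation
$ srun --partition=interactive_q --ntasks=1 --time=00:30:00 --pty bash   # or start an interactive shell directly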


For additional SLURM support, contact auhpcs_support@augusta.edu.