SLURM Workload Manager

Overview

The AUHPCS cluster uses SLURM (Simple Linux Utility for Resource Management) as its workload manager. SLURM is responsible for managing and scheduling cluster resources and jobs. Cluster resource assignments are referred to as allocations, and job queues are referred to as partitions in SLURM terminology.

The SLURM utilities are available on job submission nodes and the software build node.

Official SLURM documentation: SchedMD SLURM (https://slurm.schedmd.com/)

SLURM Partition Assignments

Each compute node profile is assigned to a SLURM partition, which determines how jobs are scheduled onto those nodes. See the Cluster Overview for more information.

SLURM Partition Mapping:

Partition Name    | Compute Node Assignment                  | Notes                                           | Memory Per Node | Total Memory
------------------+------------------------------------------+-------------------------------------------------+-----------------+-------------
interactive_q     | (1) general compute node, (1) GPU node   | Used for interactive workloads                  | 96 GB / 768 GB  | Variable
cpu_normal_q      | (18) general compute nodes               | Default partition for standard CPU jobs         | 96 GB           | 1.73 TB
cpu_middle_mem_q  | (8) middle-memory compute nodes          | For memory-intensive CPU workloads              | 768 GB          | 6 TB
cpu_high_mem_q    | (2) high-memory compute nodes            | For high-memory workloads                       | 1.53 TB         | 3.06 TB
gpu_normal_q      | (2) RTX6000 GPU compute nodes            | For GPU workloads requiring moderate power      | 768 GB          | 1.54 TB
gpu_middle_ai_q   | (2) T4 GPU compute nodes                 | Suitable for AI, ML, and inference workloads    | 768 GB          | 1.54 TB
gpu_high_ai_q     | (1) DGX A100 GPU compute node            | Optimized for large-scale AI and deep learning  | 1 TB            | 1 TB
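
The live partition layout can be confirmed from a submission node with the standard SLURM sinfo utility; the format string below is only a suggested sketch, and the exact output depends on the current cluster configuration.

$ sinfo                                 # one line per partition and node-state group
$ sinfo -o "%P %D %c %m %G"             # partition, node count, CPUs/node, memory/node (MB), GRES (e.g., GPUs)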

Partition Reservation Policies

Some partitions in AUHPCS require special reservation procedures:

  • cpu_high_mem_q and gpu_high_ai_q are high-capacity partitions and must be reserved in advance.

  • To reserve these partitions, users must email auhpcs_support@augusta.edu with project details and expected runtime.

  • The maximum reservation period is 10 days.

  • Reservations are granted based on resource availability and priority.

Please plan ahead and request access early if your work requires these high-capacity nodes.
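
Assuming the support team implements these requests as named SLURM reservations, a granted reservation can be inspected and targeted with standard commands; the reservation name below is a placeholder supplied by support.

$ scontrol show reservation                           # list active reservations and their time windows
$ sbatch --reservation=<reservation_name> my_job.sh   # submit a job into a granted reservation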

Job Submission

Job submission is the process of requesting resources from the scheduler. It is the gateway to all the computational horsepower in the cluster. Users submit jobs to tell the scheduler what resources are needed and for how long. The scheduler then evaluates the request according to resource availability and cluster policy to determine when the job will run and which resources to use.

Batch Job Submission

Batch jobs are submitted using SLURM job scripts. SLURM directives can be supplied in the job script as header lines (#SBATCH), as command-line options to the sbatch command, or as a combination of both. When the same option appears in both places, the command-line value takes precedence.

The general form of the sbatch command:

$ sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]

Example:

$ sbatch -N1 -t 1:00:00 my_job.sh
$ cat my_job.sh
#!/bin/bash
#SBATCH --job-name=my_job            # job name
#SBATCH --ntasks=10                  # number of tasks across all nodes
#SBATCH --partition=cpu_normal_q     # name of partition to submit job
#SBATCH --time=01:00:00              # walltime (HH:MM:SS or D-HH:MM:SS)
#SBATCH --output=job_output.txt      # output file
#SBATCH --error=job_error.txt        # error file
#SBATCH --mail-type=ALL              # email on job state changes (begin, end, fail, etc.)
#SBATCH --mail-user=user@augusta.edu

srun ./my_application

This batch job submission requests one node (-N1) with a walltime of 1 hour (-t 1:00:00) on the sbatch command line, while the job script requests a total of ten tasks (--ntasks=10) and the cpu_normal_q partition via #SBATCH directives; where the command line and the script overlap, the command-line options take precedence. Output and error logs are written to job_output.txt and job_error.txt, respectively, and the application is launched with srun.
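
As a sketch of the precedence rule, the same script can be redirected to a different partition at submission time without editing it; cpu_middle_mem_q below is chosen only as an illustration, and squeue is the standard way to check the state of the submitted job.

$ sbatch --partition=cpu_middle_mem_q my_job.sh   # command-line option overrides #SBATCH --partition in the script
$ squeue -u $USER                                 # list your queued and running jobs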

Note

Users can find pre-configured SLURM job script templates at: $ ls -l /home/<username>/scripts/templates/

Best Practices

  • Always select the appropriate partition based on job requirements.

  • Do not over-subscribe small partitions or leave large nodes under-utilized; request only the resources your job actually needs.

  • Use interactive jobs for debugging and development, then submit batch jobs for full-scale runs (see the example after this list).

  • Optimize scripts for efficient resource usage and avoid holding allocated nodes idle.
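
For the interactive workflow mentioned above, a session on the interactive_q partition can typically be started with standard SLURM commands; the resource values below are placeholders to adjust for your work, assuming interactive shells are permitted on that partition.

$ salloc --partition=interactive_q --ntasks=1 --time=00:30:00            # request an interactive allocation
$ srun --partition=interactive_q --ntasks=1 --time=00:30:00 --pty bash   # or start an interactive shell directly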


For additional SLURM support, contact auhpcs_support@augusta.edu.