Loki User Guide
This guide provides essential information for accessing and using the Loki cluster, including job submission, queue monitoring, resource allocation, and module management.
Batch Submission
To submit a batch job, follow these steps:
1. Log in to the cluster:
> ssh lskywalker@hpc-sub.augusta.edu
Example output:
[lskywalker@Loki_batch_submission ~]$
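Optionally, set up SSH key authentication so future logins skip the password prompt. This is a standard OpenSSH sketch; adjust the key type and path to your local setup:
> ssh-keygen -t ed25519
> ssh-copy-id lskywalker@hpc-sub.augusta.edu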
2. Check system status:
[lskywalker@Loki_batch_submission ~]$ sinfo
[lskywalker@Loki_batch_submission ~]$ squeue
Example output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
121277 cpu_middl cosmx-RN hsolo R 17:11:00 1 cnode019
121294 cpu_norma 1DLN_GEX lorgana R 4:21:30 1 cnode001
To check your jobs specifically:
[lskywalker@Loki_batch_submission ~]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
121322 cpu_norma bash lskywalker R 0:58 1 cnode016
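squeue's columns can be customized with -o. For example, the following uses standard SLURM format specifiers (%i job ID, %P partition, %j name, %T state, %M elapsed time, %D node count, %R node list or reason) to widen the job-name column:
[lskywalker@Loki_batch_submission ~]$ squeue -u $USER -o "%.10i %.12P %.20j %.8T %.12M %.6D %R"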
To view job details:
[lskywalker@Loki_batch_submission ~]$ scontrol show job 121322
To check job resource consumption:
While job is running:
[lskywalker@Loki_batch_submission ~]$ sstat -j 121322.batch --format=JobID,MaxRSS,AveCPU
After the job is done:
[lskywalker@Loki_batch_submission ~]$ sacct -j 121322 --format=JobID,JobName,MaxRSS,Elapsed,State
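If SLURM's contributed seff tool is installed on the cluster (an assumption; verify with which seff), it prints a one-line CPU and memory efficiency summary for a completed job:
[lskywalker@Loki_batch_submission ~]$ seff 121322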
3. Create a Python script (batch_scikit.py):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate a small dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print accuracy
print(f"Batch Model Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
4. Create a batch submission script (submit_quick_job.sh):
#!/bin/bash
#SBATCH --job-name=scikit_job # Job name
#SBATCH --output=scikit_output_%j.txt # Output log file
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --cpus-per-task=1 # CPU cores per task
#SBATCH --mem=1G # Minimal memory
#SBATCH --time=00:01:00 # 1-minute runtime
#SBATCH --partition=cpu_normal_q # SLURM partition
# Load Python and scikit-learn (matching 2022b toolchain versions)
module load Python/3.10.8-GCCcore-12.2.0
module load scikit-learn/1.2.1-gfbf-2022b
# Run the Python script
python batch_scikit.py
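As a sketch of common extensions, the same job can be turned into a small job array with email notification. This assumes standard SLURM options; the --mail-user address is a placeholder, and %A/%a expand to the array job ID and task index:
#!/bin/bash
#SBATCH --job-name=scikit_array # Job name
#SBATCH --output=scikit_output_%A_%a.txt # One log per array task
#SBATCH --array=0-4 # Five independent tasks
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --cpus-per-task=1 # CPU cores per task
#SBATCH --mem=1G # Minimal memory
#SBATCH --time=00:05:00 # 5-minute runtime
#SBATCH --partition=cpu_normal_q # SLURM partition
#SBATCH --mail-type=END,FAIL # Email on completion or failure
#SBATCH --mail-user=lskywalker@augusta.edu # Placeholder address
# Load Python and scikit-learn
module load Python/3.10.8-GCCcore-12.2.0
module load scikit-learn/1.2.1-gfbf-2022b
# Run the Python script
python batch_scikit.py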
5. Submit the job:
[lskywalker@Loki_batch_submission ~]$ sbatch submit_quick_job.sh
[lskywalker@Loki_batch_submission ~]$ squeue -u $USER
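To cancel a job before it finishes, pass scancel the job ID that sbatch reported (121394 in this example):
[lskywalker@Loki_batch_submission ~]$ scancel 121394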
6. Check files generated:
[lskywalker@Loki_batch_submission ~]$ ll
Example output:
-rw-r--r-- 1 lskywalker auhpcs_jedi_g 683 Mar 13 13:28 batch_scikit.py
-rw-r--r-- 1 lskywalker auhpcs_jedi_g 0 Mar 13 09:16 scikit_output_121394.txt
-rw-r--r-- 1 lskywalker auhpcs_jedi_g 835 Mar 12 15:07 submit_quick_job.sh
[lskywalker@Loki_batch_submission ~]$ cat scikit_output_121394.txt
Example output:
Batch Model Accuracy: 83.00%
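For longer jobs, you can follow the log file while it is still being written (Ctrl-C stops tail, not the job):
[lskywalker@Loki_batch_submission ~]$ tail -f scikit_output_121394.txt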
Interactive Job Execution
1. Log in to the interactive node:
> ssh hsolo@hpc-inter-sub.augusta.edu
2. Request interactive resources:
[hsolo@Loki_inter_submission ~]$ salloc --nodes=1 --ntasks=1 --cpus-per-task=2 --time=02:00:00 --partition=interactive_q
Example output:
salloc: Granted job allocation 121322
[hsolo@Loki_inter_submission ~]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
121322 interactive bash hsolo R 2:00 1 cnode016
3. Start an interactive shell on the allocated compute node:
[hsolo@Loki_inter_submission ~]$ srun --jobid 121322 --pty /bin/bash -i
Now you're inside your allocated compute node session.
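If the cluster permits direct srun from the submission node (an assumption; some sites restrict this), steps 2 and 3 can be combined into a single command that allocates resources and opens the shell at once:
[hsolo@Loki_inter_submission ~]$ srun --nodes=1 --ntasks=1 --cpus-per-task=2 --time=02:00:00 --partition=interactive_q --pty /bin/bash -i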
4. Load necessary modules:
[hsolo@cnode016 ~]$ module list
[hsolo@cnode016 ~]$ module avail scikit-learn
[hsolo@cnode016 ~]$ module load scikit-learn/1.2.1-gfbf-2022b
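If the module system is Lmod (likely given the EasyBuild-style module names, but an assumption), you can save the loaded set under a name and restore it in later sessions:
[hsolo@cnode016 ~]$ module save scikit_env
[hsolo@cnode016 ~]$ module restore scikit_env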
5. Create an interactive Python script (interactive_scikit.py). The same code as batch_scikit.py works here, with the print label changed from "Batch" to "Interactive":
[hsolo@cnode016 ~]$ vi interactive_scikit.py
6. Run the Python script:
[hsolo@cnode016 ~]$ python interactive_scikit.py
Example output:
Interactive Model Accuracy: 83.00%
7. Handle job expiry:
If a job exceeds its time limit, SLURM revokes the allocation and terminates the session, printing a message like:
salloc: Job 121322 has exceeded its time limit and its allocation has been revoked.
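To avoid being cut off mid-session, check the time remaining on the current allocation from inside it; $SLURM_JOB_ID is set automatically by SLURM, and %L is squeue's TimeLeft field:
[hsolo@cnode016 ~]$ squeue -h -j $SLURM_JOB_ID -o %L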
8. To log out of the interactive session:
[hsolo@cnode016 ~]$ exit
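This exit only leaves the compute-node shell and returns you to the salloc session on the submission node; exit once more there (or run scancel 121322) to release the allocation:
[hsolo@Loki_inter_submission ~]$ exit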
Note
For graphical MATLAB sessions, see Run MATLAB from the Loki cluster over an SSH tunnel.
For further assistance, visit our Support Page or contact our team at auhpcs_support@augusta.edu.