Getting Started with Slurm

An essential guide to using the Slurm Workload Manager to submit and manage jobs on HPC clusters.

Understanding Slurm

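Slurm (originally an acronym for Simple Linux Utility for Resource Management) is an open-source workload manager for Linux clusters of all sizes. It allocates compute resources to users, starts and tracks their jobs, and manages the queue of pending work.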

Basic Commands

Submitting Jobs

# Submit a job script
sbatch job_script.sh

# Start an interactive shell on a compute node
srun --pty bash
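
When a script needs the job ID for later monitoring or cancellation, the submission can be wrapped as in the minimal sketch below. The helper name submit_job is hypothetical; --parsable is a standard sbatch option that prints only the job ID (followed by ";cluster" on multi-cluster setups).

```shell
#!/bin/bash
# submit_job is a hypothetical helper name, not part of Slurm.
# It submits a batch script and prints only the numeric job ID.
submit_job() {
    local script="$1"
    shift
    local out
    # --parsable makes sbatch print "jobid" or "jobid;cluster"
    # instead of "Submitted batch job <jobid>".
    out=$(sbatch --parsable "$@" "$script") || return 1
    # Drop the optional ";cluster" suffix.
    printf '%s\n' "${out%%;*}"
}

# Usage (on a cluster with Slurm installed):
#   jobid=$(submit_job job_script.sh --time=00:30:00)
#   echo "Submitted job $jobid"
```

Options given on the sbatch command line, such as --time in the usage comment, take precedence over matching #SBATCH lines in the script.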

Monitoring Jobs

# View job queue
squeue -u $USER

# Check job details
scontrol show job <jobid>
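
A follow-up step often has to wait until a job has left the queue. The sketch below polls squeue for that purpose; the helper name wait_for_job and the 30-second default interval are assumptions, and a longer interval may be appropriate on busy clusters.

```shell
#!/bin/bash
# wait_for_job is a hypothetical helper name, not part of Slurm.
# It polls squeue until the job no longer appears in the queue.
wait_for_job() {
    local jobid="$1"
    local interval="${2:-30}"   # seconds between polls (assumed default)
    # -h suppresses the header, so empty output means the job has left the queue.
    # Errors (for example an already-purged job ID) are discarded and treated as "done".
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Usage (on a cluster with Slurm installed):
#   wait_for_job 123456        # poll every 30 seconds
#   wait_for_job 123456 120    # poll every 2 minutes
```

The function only reports that the job is no longer queued or running; it does not say whether the job succeeded.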

Managing Jobs

# Cancel a job
scancel <jobid>

# Hold a job
scontrol hold <jobid>

# Release a held job
scontrol release <jobid>
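
scancel can also select jobs by owner and state instead of by ID. As a small sketch, the hypothetical helper cancel_pending below cancels every pending job of one user and leaves running jobs untouched; --user and --state are standard scancel options.

```shell
#!/bin/bash
# cancel_pending is a hypothetical helper name, not part of Slurm.
# It cancels all PENDING jobs of the given user (default: the current user).
cancel_pending() {
    local user="${1:-$USER}"
    scancel --user="$user" --state=PENDING
}

# Usage (on a cluster with Slurm installed):
#   cancel_pending          # your own pending jobs
```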

Job Script Example

#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --output=test_%j.out
#SBATCH --error=test_%j.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=8G

# Load required modules
module load gcc/10.2.0

# Run the application (a single copy; use "srun ./my_application" to start one copy per task)
./my_application
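
Inside a running job, Slurm exports environment variables that describe the allocation. The sketch below logs a few of them at the start of a job script; the function name log_job_context is hypothetical, and the fallback values only apply when the script runs outside Slurm or a variable is not set for the job.

```shell
#!/bin/bash
# log_job_context is a hypothetical helper name, not part of Slurm.
# It prints the job ID, node list, and task count that Slurm exports
# into the job environment, with fallbacks for runs outside Slurm.
log_job_context() {
    echo "job=${SLURM_JOB_ID:-none} nodes=${SLURM_JOB_NODELIST:-none} tasks=${SLURM_NTASKS:-unknown}"
}

# Usage: call it near the top of the job script, after the #SBATCH lines:
#   log_job_context
```

The line ends up in the file named by --output, which makes it easier to match a log to the allocation it ran in.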

Best Practices

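  1. Always specify resource requirements (time, nodes, tasks, memory) instead of relying on cluster defaults
  2. Set realistic time limits: jobs that exceed their limit are terminated, while overly long limits can delay scheduling
  3. Monitor resource usage so that later requests match what the job actually needs
  4. Choose a partition whose limits and hardware fit the job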
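
One way to follow the monitoring and partition advice above is to compare what a finished job requested with what it used. The sketch below assumes that job accounting is enabled on the cluster, which sacct requires; the helper name job_usage is hypothetical, while the field names are standard sacct format fields.

```shell
#!/bin/bash
# job_usage is a hypothetical helper name, not part of Slurm.
# It shows requested versus used time and memory for a job.
# Assumes Slurm accounting is configured; otherwise sacct returns no data.
job_usage() {
    sacct -j "$1" --format=JobID,JobName,Elapsed,Timelimit,ReqMem,MaxRSS,State
}

# Usage (on a cluster with Slurm installed):
#   job_usage 123456
#   sinfo -s    # summary of partitions, their time limits, and node counts
```

If Elapsed is far below Timelimit, or MaxRSS far below ReqMem, the next submission can probably request less.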

Common Issues and Solutions

Job Not Starting

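  • Check resource availability: the requested nodes, CPUs, or memory may be in use or exceed what the partition offers
  • Verify partition access: your account may not be permitted to submit to the requested partition
  • Review job dependencies: a job waits until the jobs it depends on reach the required state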
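
Slurm records a reason for every pending job, which usually points to one of the causes listed above. The sketch below prints it; the helper name why_pending is hypothetical, and %T and %r are the squeue format specifiers for job state and reason.

```shell
#!/bin/bash
# why_pending is a hypothetical helper name, not part of Slurm.
# It prints the state of a job and the scheduler's reason for that state.
why_pending() {
    # -h drops the header; %T is the job state, %r the reason.
    squeue -h -j "$1" -o "%T %r"
}

# Usage (on a cluster with Slurm installed):
#   why_pending 123456
```

Typical reasons include Resources (waiting for nodes to free up), Priority (other jobs are ahead in the queue), and Dependency (waiting on another job).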

Job Failing

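  • Check error logs (the file set with --error, for example test_<jobid>.err)
  • Verify module dependencies: confirm that the job script loads every module the application needs
  • Monitor resource usage: jobs that exceed their memory or time limit are terminated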
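
A first look at a failed job can combine its final state with the end of its error log. The sketch below assumes the --error=test_%j.err naming from the job script example, that it is run from the submission directory, and that job accounting is enabled; the helper name show_failure is hypothetical.

```shell
#!/bin/bash
# show_failure is a hypothetical helper name, not part of Slurm.
# It prints the final state and exit code of a job, then the tail of its error log.
show_failure() {
    local jobid="$1"
    # File name follows --error=test_%j.err from the job script example.
    local errfile="test_${jobid}.err"
    # Requires Slurm accounting; State shows e.g. FAILED, TIMEOUT, or OUT_OF_MEMORY.
    sacct -j "$jobid" --format=JobID,State,ExitCode
    if [ -s "$errfile" ]; then
        tail -n 20 "$errfile"
    else
        echo "No error output in $errfile"
    fi
}

# Usage (on a cluster with Slurm installed):
#   show_failure 123456
```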

Additional Resources

  • Slurm Documentation: https://slurm.schedmd.com
  • Job Script Generator
  • Troubleshooting Guide