
Best Practice Guide to Running AI Workloads on Leonardo CINECA

Introduction

This guide walks you through setting up and running AI workloads on the Leonardo supercomputer at CINECA without using containers. We'll cover everything from filesystem usage and authentication to module loading, Python environment setup, job submission with SLURM, inference workflows, and interactive development with Jupyter Notebooks.

This is tailored for researchers or developers who want complete control over the environment and performance, bypassing container overheads.

Filesystems Overview

Understanding the available storage options is essential when working on a supercomputer like Leonardo. Each has specific use cases and limitations.

$HOME - Personal Storage

  • Purpose: Permanent storage for your personal files such as dotfiles, source code, and binaries.
  • Quota: 50GB.
  • Use Case: Ideal for configuration files, small personal utilities, and SSH keys.
  • Limitations: Do not store large datasets or any temporary job outputs here.

$WORK - Shared Project Storage

  • Purpose: Project workspace shared among collaborators.
  • Quota: 40TB.
  • Use Case: Store project-specific source code, datasets, models, and logs accessible to all team members.
  • Retention Policy: Data is removed 6 months after the project ends.
  • Important: This storage is not backed up.

$SCRATCH - Temporary, High-Capacity Storage

  • Purpose: High-performance, temporary workspace.
  • Use Case: Store temporary datasets, training checkpoints, or outputs during execution.
  • Retention Policy: Files are automatically deleted after ~40 days.

$FAST - High-Speed NVMe Storage

  • Purpose: Fast storage, optimized for I/O-heavy workloads.
  • Use Case: Store pre-trained models, deep learning checkpoints, and frequently accessed data.
  • Quota: 1TB per project.
  • Note: Not backed up. Same deletion policy as $WORK.

Accessing Leonardo from Linux/Mac

1. Install the SmallStep Client

Follow the guide: Install SmallStep

2. Bootstrap the CA

step ca bootstrap --ca-url=https://sshproxy.hpc.cineca.it \
  --fingerprint 2ae1543202304d3f434bdc1a2c92eff2cd2b02110206ef06317e70c1c1735ecd

3. Obtain SSH Certificate and Login

step ssh login "<YOUR EMAIL>" --provisioner cineca-hpc
ssh -o StrictHostKeyChecking=no <USERNAME>@login.leonardo.cineca.it

Optional: SSH Configuration for Automation

Edit your ~/.ssh/config:

Host leonardo
  HostName login.leonardo.cineca.it
  User <USERNAME>
  CertificateFile ~/.step/ssh/<EMAIL>-cert.pub
  IdentityFile ~/.step/ssh/<EMAIL>
  ProxyCommand bash -c 'step ssh login "<EMAIL>" --provisioner cineca-hpc >/dev/null 2>&1; nc %h %p'
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null

Then simply:

ssh leonardo

Module Setup for AI Workloads

Before you run any Python code, make sure the proper environment is loaded:

ml purge
ml profile/deeplrn
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load openmpi/4.1.6--gcc--12.2.0
module load python/3.11.6--gcc--8.5.0
module load cuda/12.1

Python Virtual Environment in $WORK

To ensure consistency across the team and access to optimized libraries, use a shared virtual environment in $WORK:

cd $WORK
python -m venv venv_project
source venv_project/bin/activate
pip cache purge
pip install -r requirements.txt
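
After installation, a quick sanity check confirms that the core packages import from the shared environment (a minimal example):

python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"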

Deactivate with:

deactivate

Example requirements.txt

This file should include all your dependencies, especially those related to training and inference:

torch>=2.2
accelerate
appdirs
loralib
bitsandbytes
black
black[jupyter]
datasets
fire
peft
transformers>=4.45.1
sentencepiece
py7zr
scipy
optimum
matplotlib
chardet
openai
typing-extensions>=4.8.0
tabulate
evaluate
rouge_score
pyyaml==6.0.1
faiss-gpu; python_version < '3.11'
unstructured[pdf]
sentence_transformers
codeshield
gradio
markupsafe==2.0.1

Submitting AI Workloads with SLURM

Storing Pretrained Models in $FAST

For best performance, store large models in $FAST. On Leonardo, create the target directory first:

cd $FAST
mkdir -p pretrained_models

Then copy the models from your local machine. Note that $FAST is only defined on Leonardo, so substitute the resolved path (print it with echo $FAST on Leonardo):

scp /local/path/to/models/* leonardo:<RESOLVED_FAST_PATH>/pretrained_models/

Make sure the files have group-readable permissions so others can use them.
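
For example, a recursive group read (and directory traversal) grant looks like this (a minimal sketch; adjust the path to your project layout):

chmod -R g+rX $FAST/pretrained_models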

Checkpointing Strategy

Use $SCRATCH during training for checkpointing:

  • Save a checkpoint every 10, 100, or 1,000 epochs, depending on model size and training time.
  • After training, copy checkpoints to $FAST for reuse or to $WORK for long-term sharing (see the sketch after this list).
  • Remember: $SCRATCH files are deleted automatically after 40 days.
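
A minimal sketch of the post-training copy step, assuming checkpoints were written to $SCRATCH/checkpoints/<run_name> (an illustrative path, not one defined elsewhere in this guide):

RUN_NAME=my_run
mkdir -p $FAST/checkpoints/$RUN_NAME
rsync -a $SCRATCH/checkpoints/$RUN_NAME/ $FAST/checkpoints/$RUN_NAME/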

Detailed Explanation of SLURM and Environment Variables

Common SLURM Environment Variables

  • $SLURM_JOB_NODELIST: Provides the list of allocated nodes for your job. Often used to determine which node acts as the master node in distributed setups.
  • $SLURM_NNODES: Indicates the total number of nodes allocated to your job.
  • $SLURM_PROCID: Represents the MPI rank (ranging from 0 to N-1) of the current executing process.
  • $SLURM_JOB_ID: Unique identifier for the SLURM job, frequently utilized for creating distinct log files.
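
A quick way to see what SLURM provides is to echo these variables from inside a job, for example (a minimal sketch to drop at the top of a batch script):

echo "Job $SLURM_JOB_ID: $SLURM_NNODES node(s) allocated -> $SLURM_JOB_NODELIST"
srun bash -c 'echo "Rank $SLURM_PROCID is running on $(hostname)"'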

Bash Variables in Job Scripts

  • MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1): Retrieves the hostname of the first allocated node, which acts as the master node that coordinates distributed training.
  • GPUS_PER_NODE=4: Sets the count of GPUs available per allocated node.
  • MASTER_PORT=29500: Defines an arbitrary, open port number utilized for initial communication (rendezvous) among distributed processes.
  • MACHINE_RANK=$SLURM_PROCID: Assigns the MPI rank from SLURM as the machine rank, crucial for correctly initializing distributed operations.
  • WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES)): Calculates the total number of distributed processes across all allocated nodes.

Exported Environment Variables

These variables ensure optimized GPU-to-GPU and node-to-node communications during training and inference:

  • NCCL_NET=IB: Forces NCCL (NVIDIA Collective Communication Library) to leverage the InfiniBand backend for high-speed interconnect performance.
  • NCCL_DEBUG=INFO: Activates verbose NCCL debugging logs to assist with troubleshooting communication issues.
  • MASTER_ADDR and MASTER_PORT: Required by PyTorch's distributed backend to correctly set up and synchronize processes.
  • NCCL_IB_DISABLE=0: Keeps the InfiniBand transport enabled in NCCL (setting it to 1 would disable it).
  • NCCL_IB_HCA=mlx5: Specifies the InfiniBand Host Channel Adapter (HCA), typically set to Mellanox mlx5 adapters.
  • NCCL_SOCKET_IFNAME=ib0: Determines the network interface card (NIC) used for NCCL's inter-node communication.
  • NCCL_NET_GDR_LEVEL=5: Optimizes GPU Direct RDMA communication by adjusting the NCCL network GPU direct level.

These environment settings collectively enhance performance by optimizing networking, GPU communication, and multi-node synchronization.
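
For convenience, here are the exports above gathered as they would appear in a job script (the same values used in the examples below; tune them for your setup):

export NCCL_NET=IB
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=ib0
export NCCL_NET_GDR_LEVEL=5
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500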

Example: Training Job SLURM Script

The following example illustrates a complete SLURM script (multinode70b.sh) for distributed fine-tuning of an AI model:

#!/bin/bash
#SBATCH --job-name=multi_node_gpu
#SBATCH --time=03:30:01
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --partition=boost_usr_prod
#SBATCH --cpus-per-task=32
#SBATCH --output=logs/output_%j.log
#SBATCH --error=logs/error_%j.log

module purge
ml profile/deeplrn
module load python/3.11.6--gcc--8.5.0
module load openmpi/4.1.6--gcc--12.2.0
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load cuda/12.1
source $WORK/venv_project/bin/activate

export NCCL_NET=IB
export NCCL_DEBUG=INFO
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
GPUS_PER_NODE=4
NNODES=$SLURM_NNODES
MACHINE_RANK=$SLURM_PROCID
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))

DATASET=my_dataset
PYTORCH_SCRIPT=$WORK/llama_cookbook/finetuning.py
LLM_MODEL=$FAST/Llama-3.3-70B-Instruct
OUTPUT_FOLDER=$WORK/output_models/peft/Llama-3.3-70B-Instruct

srun python -u -m torch.distributed.run --nproc_per_node $GPUS_PER_NODE --nnodes $NNODES \
 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
 $PYTORCH_SCRIPT \
 --enable_fsdp \
 --use_peft \
 --peft_method lora \
 --model_name $LLM_MODEL \
 --dataset $DATASET \
 --save_model \
 --num_epochs 2 \
 --context_length 8192 \
 --output_dir $OUTPUT_FOLDER \
 --dist_checkpoint_root_folder model_checkpoints \
 --dist_checkpoint_folder fine-tuned 
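
Before the first submission, create the logs/ directory referenced by the #SBATCH output directives, then submit the job:

mkdir -p logs
sbatch multinode70b.sh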

Example: Inference Job SLURM Script

For deploying a model for inference, use the following simplified SLURM job script (submit_slurm.sh):

#!/bin/bash
#SBATCH --job-name=ai_inference
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4
#SBATCH --time=01:00:00
#SBATCH --partition=boost_usr_prod
#SBATCH --output=logs/output_%j.log
#SBATCH --error=logs/error_%j.log
#SBATCH --exclusive

module purge
ml profile/deeplrn
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load openmpi/4.1.6--gcc--12.2.0
module load python/3.11.6--gcc--8.5.0
module load cuda/12.1
source $WORK/venv_project/bin/activate

export NCCL_NET=IB
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=ib0
export NCCL_NET_GDR_LEVEL=5
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

mpirun -np 4 --map-by ppr:4:node --bind-to none \
    -x MASTER_ADDR -x MASTER_PORT \
    -x NCCL_DEBUG -x NCCL_IB_DISABLE -x NCCL_IB_HCA \
    -x NCCL_SOCKET_IFNAME -x NCCL_NET_GDR_LEVEL \
    -x PATH -x LD_LIBRARY_PATH \
    python inference2.py

These comprehensive examples provide clear guidance for efficiently managing AI workloads on multi-node GPU clusters using SLURM.

Using Weights & Biases (W&B)

Weights & Biases (W&B) is widely used for experiment tracking and hyperparameter tuning. However, because of network restrictions on Leonardo's compute nodes (they cannot reach the internet), W&B must run in offline mode. This also rules out features that rely on live communication with the W&B servers, such as Sweeps for hyperparameter optimization.

To use W&B in offline mode, set the following environment variable in your SLURM job scripts:

export WANDB_MODE=offline

Results can later be synchronized to the W&B online platform using the CLI from a machine with internet access:

wandb sync path/to/offline-runs
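
By default, offline runs are written to a wandb/ directory next to your training script; assuming that default layout, all pending runs can be synced at once:

wandb sync wandb/offline-run-*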

Using Jupyter Notebooks on Leonardo Compute Nodes

Running Jupyter notebooks on Leonardo requires an interactive SLURM session and secure port forwarding. Here's how to do it efficiently for GPU-enabled development sessions.

1. Allocate a GPU Node

Request an interactive session on a compute node:

salloc --nodes=1 --ntasks=1 --cpus-per-task=8 --gres=gpu:1 \
  --time=02:00:00 --partition=boost_usr_prod

2. Connect to the Allocated Node

Once you receive your node assignment (e.g., r123c45):

ssh <USERNAME>@r123c45

Run echo $HOSTNAME to confirm that you are on the allocated compute node.

3. Load Modules and Activate Your Environment

ml profile/deeplrn
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load openmpi/4.1.6--gcc--12.2.0
module load python/3.11.6--gcc--8.5.0
module load cuda/12.1
source $WORK/venv_project/bin/activate

4. Start Jupyter Notebook Server

Start Jupyter in no-browser mode on a chosen port (e.g., 8888):

jupyter notebook --no-browser --ip=0.0.0.0 --port=8888

Copy the token from the output.

5. Set Up SSH Tunneling

Leonardo's compute nodes are on a private network for security reasons—they cannot be accessed directly from the internet or your local machine. Therefore, to access the Jupyter notebook running on a compute node, you must "tunnel" through the login node using SSH port forwarding.

This involves creating a two-step tunnel:

  • Tunnel 1 (Local → Login Node): Connect your local machine to the Leonardo login node, forwarding a local port (e.g., 8888) to a temporary port (e.g., 44444) on the login node.
  • Tunnel 2 (Login Node → Compute Node): From the login node, forward that temporary port (44444) to port 8888 on the actual compute node running Jupyter.

This layered forwarding ensures secure access to the notebook server running deep inside the HPC network.

First Tunnel: Local → Login Node

ssh -L 8888:localhost:44444 <USERNAME>@login.leonardo.cineca.it

Second Tunnel: Login Node → Compute Node

In that SSH session, run:

ssh -L 44444:localhost:8888 <USERNAME>@r123c45

After both tunnels are open, any request to localhost:8888 on your machine will be routed to the compute node's Jupyter Notebook server securely through the login node.
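
If your OpenSSH client supports ProxyJump (-J) and the compute nodes accept the same certificate-based login, the same forwarding can be set up with a single command from your local machine (an optional shortcut; the two-step method above always works):

ssh -J <USERNAME>@login.leonardo.cineca.it -L 8888:localhost:8888 <USERNAME>@r123c45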

6. Access Jupyter in Your Browser

Open the following in your local browser:

http://localhost:8888

Paste the token from step 4 when prompted.

7. Wrap Up

  • Stop the Jupyter server with Ctrl+C.
  • Exit both SSH sessions.
  • Close the browser tab when done.

💡 Tip: Use jupyter lab instead of jupyter notebook if you prefer a modern UI.

Final Notes

  • Always use $SCRATCH for temporary training outputs.
  • Use $FAST for heavy-read I/O files and pre-trained models.
  • Keep environments light and reproducible via requirements.txt.
  • Don't forget to create a logs/ folder for SLURM outputs.

You're now ready to run performant AI workloads on Leonardo HPC without containers!