Large-Scale AI Training on Leonardo: A Case Study
This case study examines the deployment and optimization of large-scale AI workloads on Leonardo at CINECA, focusing on best practices for storage management, job configuration, and performance optimization.
Project Overview
Objectives
- Train a large language model using distributed computing
- Optimize data pipeline for high-throughput training
- Maximize GPU utilization across multiple nodes
System Configuration
- Hardware: Leonardo Booster Module
- 3,456 compute nodes
- 4x NVIDIA A100 64GB GPUs per node
- 1x 32-core Intel Xeon CPU per node
Running AI Workloads on Leonardo @ CINECA - A Practical Guide
Supercomputing can feel intimidating, especially when your goal is simply to run AI workloads efficiently and reproducibly. In this blog post, we walk you through everything you need to know to get started with AI on Leonardo, the powerful HPC system at CINECA.
This guide assumes you're starting from scratch and takes you through:
1. Setting up your storage and environment.
2. Installing dependencies.
3. Organizing your data and models.
4. Submitting a training job with SLURM.
Step 1: Understand Your Storage Options
Before diving into code or datasets, it's critical to know where things should go. Leonardo provides four filesystems with different purposes:
$HOME – Personal Configuration and Small Files
- Ideal for source code, scripts, and settings.
- Persistent and private.
- Limit: 50GB.
$WORK – Shared Project Workspace
- Use it to store shared datasets, models, and outputs.
- Not backed up; data is removed 6 months after the project ends.
$SCRATCH – Temporary High-Speed Space
- Use for large I/O-heavy temporary files (e.g. intermediate training outputs, logs).
- Auto-deleted after 40 days.
$FAST – NVMe Flash Storage for Hot Data
- Ideal for frequently accessed data like pre-trained models.
- 1TB quota. Not backed up.
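To make these roles concrete, here is a minimal sketch of one possible project layout across the four areas, assuming the usual CINECA environment variables are defined at login; all directory names are illustrative:

mkdir -p "$HOME/ai-project/scripts"        # source code and job scripts: small, persistent, private
mkdir -p "$WORK/datasets" "$WORK/models"   # shared datasets and trained models for the whole project
mkdir -p "$SCRATCH/runs"                   # large, temporary training outputs (purged after 40 days)
mkdir -p "$FAST/models"                    # hot data such as pre-trained models (1TB quota, not backed up)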
Step 2: Prepare Your Environment
Before training or running inference, you must load the appropriate software environment. This ensures compatibility and performance on GPUs.
Load the Required Modules
module load profile/deeplrn
module load cuda/12.1
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load openmpi/4.1.6--gcc--12.2.0
module load python/3.11.6--gcc--8.5.0
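A quick, optional sanity check that the toolchain is on your path after loading (this assumes the cuda module provides nvcc):

module list                        # confirm the loaded module set
which python && python --version   # should point to the python/3.11.6 module
nvcc --version                     # CUDA toolkit version provided by the cuda module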
Set Up a Virtual Environment
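A minimal sketch of creating and activating the environment on top of the loaded Python module; the location under $WORK and the name vllm_env are chosen to match the path activated in the job script later in this guide, but any name works:

python -m venv "$WORK/vllm_env"            # create the environment in the shared project area
source "$WORK/vllm_env/bin/activate"       # activate it for this shell
pip install --upgrade pip                  # make sure pip is recent enough for the packages below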
Install dependencies with pip install -r requirements.txt. A sample requirements.txt:
torch>=2.2
accelerate
appdirs
loralib
bitsandbytes
black
black[jupyter]
datasets
fire
peft
transformers>=4.45.1
sentencepiece
py7zr
scipy
optimum
matplotlib
chardet
openai
typing-extensions>=4.8.0
tabulate
evaluate
rouge_score
pyyaml==6.0.1
faiss-gpu; python_version < '3.11'
unstructured[pdf]
sentence_transformers
codeshield
gradio
markupsafe==2.0.1
Step 3: Store Your Pre-trained Models
To minimize data load times and maximize reuse, keep pre-trained models on $FAST, the NVMe-backed area recommended above for frequently accessed data.
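A minimal sketch, assuming the model has already been downloaded to an illustrative models/ directory under $WORK; the destination matches the model path used in the job script below:

rsync -a "$WORK/models/Llama-3.3-70B-Instruct/" "$FAST/Llama-3.3-70B-Instruct/"   # copy the model to fast NVMe storage
chmod -R g+rw "$WORK/models/Llama-3.3-70B-Instruct"                               # grant project members read/write access if needed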
Ensure read/write access for your project members if needed.
Step 4: Use an Effective Checkpoint Strategy
During training, you don’t want to lose your model’s progress. Use checkpoints!
Best Practices:
- Save checkpoints to $SCRATCH — it’s fast and optimized for large, temporary files.
- After the job ends, move the best checkpoints to $WORK (safe) or $FAST (quick reuse); see the sketch after these lists.
Frequency:
- Save every few epochs (or every 10/100/1,000 steps, depending on run length), or after validation improvements.
- Avoid saving every step — this overwhelms the filesystem.
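For example, after a job finishes, the best checkpoint can be copied out of $SCRATCH before the 40-day cleanup removes it. A minimal sketch with illustrative run and checkpoint names:

BEST=checkpoint-epoch-2                                            # hypothetical name of the best checkpoint
mkdir -p "$WORK/AI_Model/checkpoints" "$FAST/checkpoints"          # make sure the destinations exist
rsync -a "$SCRATCH/run_001/$BEST" "$WORK/AI_Model/checkpoints/"    # safe, project-shared copy
rsync -a "$SCRATCH/run_001/$BEST" "$FAST/checkpoints/"             # optional fast copy for quick reuse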
Step 5: Submit Your Job with SLURM
SLURM is Leonardo’s job scheduler. Create a script (e.g., submit_slurm.sh) like this:
#!/bin/bash
#SBATCH --job-name=4n16gpu # Job name for monitoring
#SBATCH --time=03:30:00 # Max run time (hh:mm:ss)
#SBATCH --nodes=4 # Number of compute nodes
#SBATCH --ntasks-per-node=1 # One task (MPI rank) per node
#SBATCH --gres=gpu:4 # Request 4 GPUs per node
#SBATCH --partition=boost_usr_prod # Leonardo Booster production partition
#SBATCH --cpus-per-task=32 # CPU cores per task (for data loading etc.)
#SBATCH --output=logs/output_%j.log # Stdout log path
#SBATCH --error=logs/error_%j.log # Stderr log path
module purge
module load profile/deeplrn
module load python/3.11.6--gcc--8.5.0
module load openmpi/4.1.6--gcc--12.2.0
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load cuda/12.1
source /leonardo_work/PROJECT_ID/vllm_env/bin/activate # Activate the virtual environment from Step 2
export NCCL_NET=IB # Use InfiniBand for fast GPU comm
export NCCL_DEBUG=INFO # Enable debugging logs
GPUS_PER_NODE=4                                      # Must match --gres=gpu:4 above
MASTER_NAME=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)   # First allocated node hosts the rendezvous
MASTER_PORT=29500                                    # Free TCP port for the c10d rendezvous
NNODES=$SLURM_NNODES                                 # Number of nodes in the allocation
MACHINE_RANK=$SLURM_PROCID                           # Per-node rank (not needed by torchrun's c10d rendezvous)
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))               # Total number of GPU workers
srun python -u -m torch.distributed.run \
--nproc_per_node=$GPUS_PER_NODE \
--nnodes=$SLURM_JOB_NUM_NODES \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_NAME:$MASTER_PORT \
src/llama_cookbook/finetuning.py \
--enable_fsdp \
--use_peft \
--peft_method lora \
--model_name /leonardo_scratch/fast/PROJECT_ID/Llama-3.3-70B-Instruct \
--dataset monke_dataset \
--save_model \
--num_epochs 2 \
--context_length 8192 \
--output_dir /leonardo_work/PROJECT_ID/AI_Model/output_models/peft/Llama-3.3-70B-Instruct \
--dist_checkpoint_root_folder model_checkpoints \
--dist_checkpoint_folder fine-tuned
Run and Monitor
💡 Tip: Make sure the logs/ directory exists before you submit the job.
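Submission and monitoring use standard SLURM commands; the job ID printed by sbatch is the %j that appears in the log file names:

mkdir -p logs                      # the directory referenced by the #SBATCH output/error lines
sbatch submit_slurm.sh             # prints "Submitted batch job <JOBID>"
squeue -u $USER                    # check the state of your queued and running jobs
tail -f logs/output_<JOBID>.log    # follow the training output (replace <JOBID>)
scancel <JOBID>                    # cancel the job if something looks wrong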
Final Thoughts
By following this step-by-step process, you ensure:
- Efficient use of storage systems.
- A consistent and optimized environment.
- Scalable, fault-tolerant job execution.
Leonardo is a powerful tool for accelerating your research — the key is to use it wisely and collaboratively.
Happy HPC hacking! 🚀