Large-Scale AI Training on Leonardo: A Case Study
This case study examines the deployment and optimization of large-scale AI workloads on Leonardo at CINECA, focusing on best practices for storage management, job configuration, and performance optimization.
Project Overview
Objectives
- Train a large language model using distributed computing
- Optimize data pipeline for high-throughput training
- Maximize GPU utilization across multiple nodes
System Configuration
- Hardware: Leonardo Booster Module
- 3,456 compute nodes
- 4x NVIDIA A100 64GB GPUs per node
- 1x 32-core Intel Xeon CPU per node
Running AI Workloads on Leonardo @ CINECA - A Practical Guide
Supercomputing can feel intimidating, especially when your goal is simply to run AI workloads efficiently and reproducibly. In this blog post, we walk you through everything you need to know to get started with AI on Leonardo, the powerful HPC system at CINECA.
This guide assumes you're starting from scratch and takes you through:
1. Setting up your storage and environment.
2. Installing dependencies.
3. Organizing your data and models.
4. Submitting a training job with SLURM.
Step 1: Understand Your Storage Options
Before diving into code or datasets, it's critical to know where things should go. Leonardo provides four filesystems with different purposes:
$HOME – Personal Configuration and Small Files
- Ideal for source code, scripts, and settings.
- Persistent and private.
- Limit: 50GB.
$WORK – Shared Project Workspace
- Use it to store shared datasets, models, and outputs.
- Not backed up; data is removed 6 months after the project ends.
$SCRATCH – Temporary High-Speed Space
- Use for large I/O-heavy temporary files (e.g. intermediate training outputs, logs).
- Auto-deleted after 40 days.
$FAST – NVMe Flash Storage for Hot Data
- Ideal for frequently accessed data like pre-trained models.
- 1TB quota. Not backed up.
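To make these roles concrete, here is a minimal sketch of one possible project layout across the four areas, assuming the usual CINECA environment variables are defined at login; all directory names are illustrative:

mkdir -p "$HOME/ai-project/scripts"        # source code and job scripts: small, persistent, private
mkdir -p "$WORK/datasets" "$WORK/models"   # shared datasets and trained models for the whole project
mkdir -p "$SCRATCH/runs"                   # large, temporary training outputs (purged after 40 days)
mkdir -p "$FAST/models"                    # hot data such as pre-trained models (1TB quota, not backed up)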
Step 2: Prepare Your Environment
Before training or running inference, you must load the appropriate software environment. This ensures compatibility and performance on GPUs.
Load the Required Modules
module load profile/deeplrn
module load cuda/12.1
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load openmpi/4.1.6--gcc--12.2.0
module load python/3.11.6--gcc--8.5.0
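A quick, optional sanity check that the toolchain is on your path after loading (this assumes the cuda module provides nvcc):

module list                        # confirm the loaded module set
which python && python --version   # should point to the python/3.11.6 module
nvcc --version                     # CUDA toolkit version provided by the cuda module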
Set Up a Virtual Environment
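A minimal sketch of creating and activating the environment on top of the loaded Python module; the location under $WORK and the name vllm_env are chosen to match the path activated in the job script later in this guide, but any name works:

python -m venv "$WORK/vllm_env"            # create the environment in the shared project area
source "$WORK/vllm_env/bin/activate"       # activate it for this shell
pip install --upgrade pip                  # make sure pip is recent enough for the packages below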
Install dependencies with pip install -r requirements.txt. A sample requirements.txt:
torch>=2.2
accelerate
appdirs
loralib
bitsandbytes
black
black[jupyter]
datasets
fire
peft
transformers>=4.45.1
sentencepiece
py7zr
scipy
optimum
matplotlib
chardet
openai
typing-extensions>=4.8.0
tabulate
evaluate
rouge_score
pyyaml==6.0.1
faiss-gpu; python_version < '3.11'
unstructured[pdf]
sentence_transformers
codeshield
gradio
markupsafe==2.0.1
Step 3: Store Your Pre-trained Models
To minimize data load times and maximize reuse, keep pre-trained models on $FAST, the NVMe-backed area recommended above for frequently accessed data.
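A minimal sketch, assuming the model has already been downloaded to an illustrative models/ directory under $WORK; the destination matches the model path used in the job script below:

rsync -a "$WORK/models/Llama-3.3-70B-Instruct/" "$FAST/Llama-3.3-70B-Instruct/"   # copy the model to fast NVMe storage
chmod -R g+rw "$WORK/models/Llama-3.3-70B-Instruct"                               # grant project members read/write access if needed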
Ensure read/write access for your project members if needed.
Step 4: Use an Effective Checkpoint Strategy
During training, you don’t want to lose your model’s progress. Use checkpoints!
Best Practices:
- Save checkpoints to $SCRATCH — it’s fast and optimized for large, temporary files.
- After the job ends, move the best checkpoints to $WORK (safe) or $FAST (quick reuse); see the sketch after these lists.
Frequency:
- Save every few epochs (or every 10/100/1,000 steps, depending on run length), or after validation improvements.
- Avoid saving every step — this overwhelms the filesystem.
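For example, after a job finishes, the best checkpoint can be copied out of $SCRATCH before the 40-day cleanup removes it. A minimal sketch with illustrative run and checkpoint names:

BEST=checkpoint-epoch-2                                            # hypothetical name of the best checkpoint
mkdir -p "$WORK/AI_Model/checkpoints" "$FAST/checkpoints"          # make sure the destinations exist
rsync -a "$SCRATCH/run_001/$BEST" "$WORK/AI_Model/checkpoints/"    # safe, project-shared copy
rsync -a "$SCRATCH/run_001/$BEST" "$FAST/checkpoints/"             # optional fast copy for quick reuse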
Step 5: Submit Your Job with SLURM
SLURM is Leonardo’s job scheduler. Create a script (e.g., submit_slurm.sh) like this:
#!/bin/bash
#SBATCH --job-name=4n16gpu # Job name for monitoring
#SBATCH --time=03:30:00 # Max run time (hh:mm:ss)
#SBATCH --nodes=4 # Number of compute nodes
#SBATCH --ntasks-per-node=1 # One task (MPI rank) per node
#SBATCH --gres=gpu:4 # Request 4 GPUs per node
#SBATCH --partition=boost_usr_prod # Leonardo Booster production partition
#SBATCH --cpus-per-task=32 # CPU cores per task (for data loading etc.)
#SBATCH --output=logs/output_%j.log # Stdout log path
#SBATCH --error=logs/error_%j.log # Stderr log path
module purge
module load profile/deeplrn
module load python/3.11.6--gcc--8.5.0
module load openmpi/4.1.6--gcc--12.2.0
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load cuda/12.1
source /leonardo_work/PROJECT_ID/vllm_env/bin/activate # Activate the virtual environment from Step 2
export NCCL_NET=IB # Use InfiniBand for fast GPU comm
export NCCL_DEBUG=INFO # Enable debugging logs
GPUS_PER_NODE=4                                      # Must match --gres=gpu:4 above
MASTER_NAME=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)   # First allocated node hosts the rendezvous
MASTER_PORT=29500                                    # Free TCP port for the c10d rendezvous
NNODES=$SLURM_NNODES                                 # Number of nodes in the allocation
MACHINE_RANK=$SLURM_PROCID                           # Per-node rank (not needed by torchrun's c10d rendezvous)
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))               # Total number of GPU workers
srun python -u -m torch.distributed.run \
--nproc_per_node=$GPUS_PER_NODE \
--nnodes=$SLURM_JOB_NUM_NODES \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_NAME:$MASTER_PORT \
src/llama_cookbook/finetuning.py \
--enable_fsdp \
--use_peft \
--peft_method lora \
--model_name /leonardo_scratch/fast/PROJECT_ID/Llama-3.3-70B-Instruct \
--dataset monke_dataset \
--save_model \
--num_epochs 2 \
--context_length 8192 \
--output_dir /leonardo_work/PROJECT_ID/AI_Model/output_models/peft/Llama-3.3-70B-Instruct \
--dist_checkpoint_root_folder model_checkpoints \
--dist_checkpoint_folder fine-tuned
Run and Monitor
💡 Tip: Make sure the logs/ directory exists before you submit the job.
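Submission and monitoring use standard SLURM commands; the job ID printed by sbatch is the %j that appears in the log file names:

mkdir -p logs                      # the directory referenced by the #SBATCH output/error lines
sbatch submit_slurm.sh             # prints "Submitted batch job <JOBID>"
squeue -u $USER                    # check the state of your queued and running jobs
tail -f logs/output_<JOBID>.log    # follow the training output (replace <JOBID>)
scancel <JOBID>                    # cancel the job if something looks wrong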
Final Thoughts
By following this step-by-step process, you ensure:
- Efficient use of storage systems.
- A consistent and optimized environment.
- Scalable, fault-tolerant job execution.
Leonardo is a powerful tool for accelerating your research — the key is to use it wisely and collaboratively.
Happy HPC hacking! 🚀