Protein Folding Simulation at Scale

A deep dive into optimizing GROMACS for large-scale molecular dynamics simulations of the SARS-CoV-2 spike protein.

In this case study, we'll explore how we optimized a large-scale protein folding simulation for the SARS-CoV-2 spike protein using GROMACS on a multi-node GPU cluster.

Project Overview

Challenge: Simulate the folding dynamics of the SARS-CoV-2 spike protein (2.2M atoms) for 1 µs.

Hardware Available:

  • 8 nodes with 4× NVIDIA A100 GPUs each
  • 100 Gb/s InfiniBand interconnect
  • 1 TB RAM per node
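
To put the 1 µs target in perspective, here is a quick back-of-the-envelope calculation (a sketch assuming a typical 2 fs timestep, which the setup above does not state):

# Steps and wall-clock time for 1 µs of simulated time
# (the 2 fs timestep is an assumption, not taken from the setup)
timestep_fs = 2.0
target_ns = 1000.0                          # 1 µs
n_steps = target_ns * 1e6 / timestep_fs     # 1e6 fs per ns
print(f"MD steps required: {n_steps:.1e}")  # 5.0e+08

# Wall-clock time at the throughputs reported in Results below
for ns_per_day in (15.2, 48.9):
    print(f"{ns_per_day} ns/day -> {target_ns / ns_per_day:.0f} days")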

Initial Setup and Challenges

When we first attempted this simulation, we encountered several challenges:

  1. Memory Bottleneck (a grid-size estimate follows this list)

    # Initial GROMACS error
    Fatal error: Out of memory allocating 24GB for PME grid
    

  2. Poor GPU Utilization

    # GPU utilization from nvidia-smi
    |  GPU-Util  |
    |         25%|  # Much lower than expected
    
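The out-of-memory failure is easier to reason about once you estimate the PME grid. A minimal sketch, assuming a roughly cubic box and GROMACS's default fourier-spacing of 0.12 nm (the 30 nm edge is a hypothetical value for a system this size; the real allocation is larger because of complex FFT grids, work buffers, and per-rank copies):

import math

# Rough PME real-space grid estimate (a sketch, not the actual allocator)
box_nm = 30.0        # hypothetical edge length for a 2.2M-atom system
spacing_nm = 0.12    # GROMACS default fourier-spacing
n = math.ceil(box_nm / spacing_nm)   # grid points per dimension
grid_mb = n**3 * 4 / 1e6             # single-precision reals
print(f"{n}^3 grid, ~{grid_mb:.1f} MB per real-grid copy")

Multiplied across FFT intermediates and ranks, grids at this resolution make a tight box (see System Preparation below) and deliberate PME rank placement worthwhile.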

Solution Strategy

1. System Preparation

We tightened the periodic box: a 1.5 nm margin around the protein keeps periodic images from interacting while avoiding unnecessary solvent (and PME grid) volume:

# Using MDAnalysis to size the periodic box
import MDAnalysis as mda

# Load structure
u = mda.Universe('spike.pdb')

# Protein extent along each axis (MDAnalysis positions are in Angstrom)
extent = u.atoms.positions.max(axis=0) - u.atoms.positions.min(axis=0)

# 1.5 nm (15 A) margin on each side, converted to nm for GROMACS
box_margin = 15.0  # Angstrom
new_box = (extent + 2 * box_margin) / 10.0

print(f"Optimal box dimensions: {new_box} nm")
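
One way to hand the computed box to GROMACS is to shell out to gmx editconf, whose -box flag takes edge lengths in nm; a sketch (gmx editconf -d 1.5 would apply the same 1.5 nm margin in a single step):

import subprocess

# Apply the computed box; assumes new_box from above and gmx on PATH
subprocess.run(
    ['gmx', 'editconf', '-f', 'spike.pdb', '-o', 'spike_box.gro',
     '-box'] + [f'{x:.3f}' for x in new_box],
    check=True,
)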

2. GPU Task Distribution

We implemented a hybrid MPI/OpenMP parallelization strategy with dedicated PME ranks:

# GROMACS run command with optimized parameters
gmx mdrun -deffnm spike \
    -nb gpu \
    -pme gpu \
    -bonded gpu \
    -update gpu \
    -ntomp 8 \
    -npme 2 \
    -nstlist 100 \
    -dlb yes
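
The layout follows a simple rule of thumb: one MPI rank per GPU, with the node's remaining cores split as OpenMP threads. A sketch of that arithmetic (cores_per_node=32 is an assumption chosen to match the -ntomp 8 above; note that running multiple PME ranks with -pme gpu relies on the GPU PME decomposition support in GROMACS 2023+ builds):

# Heuristic sizing for the hybrid MPI x OpenMP launch
def mdrun_layout(nodes, gpus_per_node, cores_per_node=32, pme_ranks=2):
    ranks = nodes * gpus_per_node             # one MPI rank per GPU
    ntomp = cores_per_node // gpus_per_node   # OpenMP threads per rank
    return (f"mpirun -np {ranks} gmx_mpi mdrun -ntomp {ntomp} "
            f"-npme {pme_ranks} -nb gpu -pme gpu -bonded gpu -update gpu")

print(mdrun_layout(nodes=8, gpus_per_node=4))
# mpirun -np 32 gmx_mpi mdrun -ntomp 8 -npme 2 -nb gpu -pme gpu ...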

3. Performance Monitoring

We developed a custom monitoring script:

import matplotlib.pyplot as plt

def parse_log(logfile):
    """Collect ns/day values from GROMACS 'Performance:' log lines."""
    performance = []
    with open(logfile, 'r') as f:
        for line in f:
            if line.startswith('Performance:'):
                # Line format: 'Performance:  <ns/day>  <hour/ns>'
                performance.append(float(line.split()[-2]))
    return performance

# Plot performance over the course of the run
performance = parse_log('md.log')
plt.plot(performance)
plt.xlabel('Log entry')
plt.ylabel('Performance (ns/day)')
plt.show()
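
To watch the GPU side at the same time, one can poll nvidia-smi's query interface; a minimal sketch (the one-second, 30-sample loop is arbitrary):

import subprocess
import time

def gpu_utilization():
    """Per-GPU utilization (%) from nvidia-smi's CSV query output."""
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=utilization.gpu',
         '--format=csv,noheader,nounits'], text=True)
    return [int(v) for v in out.split()]

# Sample once per second for half a minute
for _ in range(30):
    print(gpu_utilization())
    time.sleep(1)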

Results

Performance Improvements

| Optimization Step | Performance (ns/day) | GPU Utilization |
|-------------------|----------------------|-----------------|
| Initial Setup     | 15.2                 | 25%             |
| Memory Optimized  | 22.7                 | 45%             |
| GPU Tasks Tuned   | 42.3                 | 85%             |
| Final Setup       | 48.9                 | 92%             |

Scaling Analysis

import matplotlib.pyplot as plt

# Scaling data visualization
nodes = [1, 2, 4, 8]
speedup = [1, 1.85, 3.42, 6.1]

plt.plot(nodes, speedup, 'bo-', label='Measured')
plt.plot(nodes, nodes, 'k--', label='Linear')
plt.xlabel('Number of Nodes')
plt.ylabel('Speedup')
plt.legend()
plt.grid(True)
plt.show()
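
Dividing speedup by node count gives the parallel efficiency, which is where the roughly 76% figure on 8 nodes in the Conclusions comes from:

# Parallel efficiency = speedup / node count (uses the data above)
for n, s in zip(nodes, speedup):
    print(f"{n} nodes: speedup {s:.2f}, efficiency {s / n:.1%}")
# 8 nodes: speedup 6.10, efficiency 76.2%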

Key Learnings

  1. Memory Management
     • Use PME decomposition carefully
     • Monitor memory usage per rank
     • Adjust box size based on protein dimensions

  2. GPU Optimization
     • Balance PME workload
     • Tune update groups
     • Optimize neighbor searching

  3. Network Considerations
     • Use GPUDirect RDMA
     • Optimize domain decomposition
     • Monitor communication patterns

Code Snippets

Production Run Script

#!/bin/bash
#SBATCH --job-name=spike_fold
#SBATCH --nodes=8
#SBATCH --gpus-per-node=4
#SBATCH --time=48:00:00

module load GROMACS/2023.2-CUDA

# Environment setup
export GMX_FORCE_UPDATE_DEFAULT_GPU=true
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Run simulation
mpirun -np 32 gmx_mpi mdrun \
    -deffnm spike \
    -maxh 47.5 \
    -nb gpu \
    -pme gpu \
    -bonded gpu \
    -update gpu \
    -ntomp 8 \
    -nstlist 100 \
    -dlb yes

Conclusions

Through careful optimization and monitoring, we achieved:

  • 3.2× performance improvement (15.2 → 48.9 ns/day)
  • 92% GPU utilization
  • 6.1× speedup on 8 nodes (≈76% parallel efficiency)

Resources

  • GROMACS Performance Analysis
  • GPU Optimization Guide
  • Raw Data and Scripts

Discussion

Share your experiences with large-scale protein simulations or ask questions below!