Mastering DeepSpeed ZeRO-3 GPU Memory Error Debugging: Advanced Memory Profiling and Distributed Training Optimization
DeepSpeed ZeRO-3 enables training of massive models, but GPU memory errors are a common issue. This article presents practical methods for diagnosing and resolving memory errors in a ZeRO-3 environment through advanced memory profiling techniques and distributed training optimization strategies. Shorten troubleshooting time and maximize GPU utilization to train larger models faster.
1. The Challenge / Context
Recent advances in Large Language Models (LLMs) have made distributed training technologies like DeepSpeed ZeRO more important than ever. ZeRO-3 overcomes single-GPU memory limits by sharding model parameters, optimizer states, and gradients across multiple GPUs. This added complexity, however, makes Out-Of-Memory (OOM) errors a frequent occurrence, and they slow the model development cycle and consume significant debugging time. Debugging is especially challenging when ZeRO-3 is misconfigured or memory leaks are present, so a deep understanding of GPU memory usage and effective optimization strategies is essential.
2. Deep Dive: DeepSpeed ZeRO-3 Memory Sharding Strategy
DeepSpeed ZeRO (Zero Redundancy Optimizer) is a type of data parallelism that helps reduce the memory footprint of models, enabling the training of larger models. ZeRO-3 shards memory in the following ways:
- Sharded Model Parameters: Instead of replicating model parameters on all GPUs, each GPU stores a portion of the parameters.
- Sharded Optimizer States: Optimizer states (e.g., the momentum and variance buffers maintained by Adam) are also partitioned, with each GPU storing only its own shard.
- Sharded Gradients: Gradients are also distributed and stored, not replicated on all GPUs.
- Data Parallelism: Each GPU processes a different batch of data.
ZeRO-3 is designed to balance model size and training efficiency. The core idea is to efficiently distribute data, model, and optimizer states across GPUs to reduce memory pressure on each GPU. However, without correct configuration and debugging techniques, OOM errors are difficult to avoid. Various factors, including data loading, activation function memory management, and batch size settings, particularly affect memory usage.
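Before launching a run, the per-GPU footprint of the model states can be estimated with a back-of-the-envelope calculation. A minimal sketch, assuming the common fp16 mixed-precision accounting with Adam (2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32 optimizer state per parameter); activations, buffers, and fragmentation come on top:

```python
def zero3_memory_per_gpu_gb(num_params, num_gpus):
    """Rough per-GPU memory (GiB) for model states under ZeRO-3.

    Assumes fp16 mixed precision with Adam: 2 bytes (fp16 weights)
    + 2 bytes (fp16 gradients) + 12 bytes (fp32 master weights,
    momentum, variance) per parameter. ZeRO-3 shards all three
    components across the data-parallel group.
    """
    bytes_per_param = 2 + 2 + 12
    return num_params * bytes_per_param / num_gpus / 1024**3

# A 7B-parameter model: ~104 GiB of model states unsharded,
# but roughly 1/8 of that per GPU when sharded across 8 GPUs.
print(round(zero3_memory_per_gpu_gb(7e9, 1), 1))
print(round(zero3_memory_per_gpu_gb(7e9, 8), 1))
```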
3. Step-by-Step Guide / Implementation
Now, let's look at a step-by-step guide to debugging and optimizing GPU memory errors in a ZeRO-3 environment.
Step 1: Initial Setup and Environment Preparation
First, ensure that DeepSpeed and the necessary libraries are correctly installed. It is recommended to check the installed DeepSpeed version using the deepspeed --version command and update to the latest version.
```bash
pip install deepspeed
pip install torch   # PyTorch is required
pip install psutil  # useful for memory profiling
pip install pandas  # useful for data analysis (optional)
```
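The installation can also be verified programmatically. A small, hypothetical helper (not part of DeepSpeed; the function name is my own) that reports installed versions without failing when a package is missing:

```python
import importlib

def check_environment(packages=("torch", "deepspeed", "psutil")):
    """Report installed versions of the training stack.

    Returns a dict mapping package name to its version string,
    or None when the package is not importable.
    """
    report = {}
    for name in packages:
        try:
            mod = importlib.import_module(name)
            report[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[name] = None  # missing: install with pip
    return report

print(check_environment())
```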
Step 2: Analyzing the DeepSpeed Configuration File (JSON)
The DeepSpeed configuration file (`ds_config.json`) defines various options used for training. Important memory-related settings must be checked.
```json
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e4,
    "sub_group_size": 1e9,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 32,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_clipping": 1.0
}
```
- `train_batch_size`: Total effective batch size across all GPUs per optimizer step.
- `train_micro_batch_size_per_gpu`: Batch size per GPU. Reducing this value can decrease GPU memory usage.
- `gradient_accumulation_steps`: Number of gradient accumulation steps. Increasing this value raises the effective batch size without increasing per-step activation memory, since each micro-batch is processed separately. `train_batch_size` must equal `train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size` (where world_size is the total number of GPUs).
- `zero_optimization.stage`: ZeRO stage (1, 2, or 3). Stage 3 is the most memory efficient.
- `offload_optimizer` & `offload_param`: Control whether optimizer states and model parameters are offloaded to the CPU. Setting `device` to `"cpu"` saves GPU memory at the cost of slower training due to host-device transfers; `pin_memory` speeds up those transfers.
- `fp16.enabled`: Whether to enable FP16 (half-precision floating-point) training. Setting this to true can roughly halve the memory needed for parameters, gradients, and activations.
- `reduce_bucket_size`, `stage3_prefetch_bucket_size`: These parameters control the communication bucket size. They can be adjusted to reduce communication overhead.
- `gradient_clipping`: Gradient clipping threshold. Prevents gradients from becoming too large, improving training stability.
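The batch-size relationship above is easy to get wrong, and DeepSpeed will refuse to start when it does not hold. A small validator sketch (the function name is my own) that checks the invariant against the configuration shown earlier:

```python
import json

def validate_batch_config(cfg, world_size):
    """Check the DeepSpeed batch-size invariant:

    train_batch_size == train_micro_batch_size_per_gpu
                        * gradient_accumulation_steps * world_size
    """
    micro = cfg["train_micro_batch_size_per_gpu"]
    accum = cfg["gradient_accumulation_steps"]
    expected = micro * accum * world_size
    actual = cfg["train_batch_size"]
    if actual != expected:
        raise ValueError(
            f"train_batch_size={actual}, but micro * accum * world_size "
            f"= {micro} * {accum} * {world_size} = {expected}")
    return expected

# The batch-size fields from the ds_config.json above:
ds_config = json.loads("""{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8
}""")
print(validate_batch_config(ds_config, world_size=1))  # 4 * 8 * 1 == 32
```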
Step 3: Memory Profiling
It is important to analyze GPU memory usage in detail. You can utilize PyTorch's `torch.cuda.memory_summary()` or DeepSpeed's own memory statistics.
```python
import torch
import deepspeed

def print_gpu_memory(rank=0):
    if rank == 0:
        print(torch.cuda.memory_summary())

# After initializing the DeepSpeed engine (example; `model` and
# `ds_config` are assumed to be defined earlier)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,
)
print_gpu_memory(rank=model_engine.global_rank)
```
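Beyond `torch.cuda.memory_summary()`, the lighter-weight counters `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` help pinpoint which training phase drives the peak. A sketch of a small helper (the function names are my own):

```python
def format_mem(num_bytes):
    """Human-readable GiB string for raw byte counters."""
    return f"{num_bytes / 1024**3:.2f} GiB"

def log_peak_memory(tag=""):
    """Print allocated/peak CUDA memory for the current device.

    Resets the peak counter afterwards so each call opens a fresh
    measurement window. No-op when no GPU is available.
    """
    import torch  # local import so the helper degrades gracefully
    if not torch.cuda.is_available():
        return None
    alloc = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()
    print(f"[{tag}] allocated={format_mem(alloc)} peak={format_mem(peak)}")
    torch.cuda.reset_peak_memory_stats()
    return alloc, peak
```

Calling `log_peak_memory("forward")` after the forward pass and `log_peak_memory("backward")` after backward shows which phase allocates the most memory, which is often the first clue when chasing an OOM.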