DeepSpeed Gradient Accumulation Memory Optimization Deep Dive: Practical Strategies for Training Ultra-large Models
Are you running out of GPU memory when training ultra-large models? DeepSpeed's Gradient Accumulation can optimize memory usage while effectively increasing the batch size. This post explains how Gradient Accumulation works, how to set it up in practice, and which memory optimization strategies to combine it with, showing how to maximize the training efficiency of ultra-large models.
1. The Challenge / Context
Training large-scale models such as large language models (LLMs) requires significant computing resources, and one of the tightest constraints is GPU memory capacity. As model size grows, the batch size that fits on a single GPU shrinks, which slows training and leaves the GPU underutilized. Data parallelism spreads data across multiple GPUs, but it adds communication overhead and does not relax the per-GPU memory limit. Gradient Accumulation is an effective technique for achieving the effect of a larger batch size without actually increasing the per-step batch size, but applying it on its own does not maximize memory efficiency. Especially for ultra-large models, it is crucial to avoid the memory blow-ups that can accompany Gradient Accumulation and to combine it with complementary optimizations to secure the best performance.
2. Deep Dive: DeepSpeed Gradient Accumulation
DeepSpeed's Gradient Accumulation accumulates the gradients of several mini-batches before performing a single weight update, which is equivalent to training with a proportionally larger batch size. For example, with 4 accumulation steps, the model's weights are updated only after the gradients of 4 mini-batches have been accumulated. This reduces the amount of data that must reside in GPU memory at any one time, making it possible to train larger models; DeepSpeed manages this process efficiently to keep memory usage low.
DeepSpeed's Gradient Accumulation works in close integration with PyTorch's autograd engine. After each mini-batch's forward pass, the gradients computed in the backward pass are accumulated rather than applied to the weights immediately. Once gradients have been accumulated for the configured number of accumulation steps, the optimizer uses them to update the model weights. Throughout this process, DeepSpeed applies various memory optimization techniques to reduce the memory footprint.
3. Step-by-Step Guide / Implementation
The following is a step-by-step guide to setting up Gradient Accumulation and optimizing memory using DeepSpeed.
Step 1: DeepSpeed Installation and Environment Setup
First, install DeepSpeed. You can easily install it using pip.
pip install deepspeed
To use DeepSpeed, you need to specify settings through the `ds_config.json` file. This file includes various settings such as data parallelism, optimizer configuration, and gradient accumulation steps.
Step 2: Creating the DeepSpeed Configuration File (ds_config.json)
Below is an example of a `ds_config.json` file for Gradient Accumulation. Other optimization options (e.g., ZeRO) are also included.
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.0001,
      "weight_decay": 0.01
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "reduce_scatter": true,
    "contiguous_gradients": true
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
Key parameters are as follows:
- `train_batch_size`: Total batch size.
- `train_micro_batch_size_per_gpu`: Mini-batch size per GPU.
- `gradient_accumulation_steps`: Number of gradient accumulation steps. `train_batch_size` must equal `train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs` (here, 32 = 4 × 8 on a single GPU).
- `zero_optimization`: Sets the ZeRO optimization level. Using stage 2 or stage 3 can significantly reduce memory usage.
- `offload_optimizer`, `offload_param`: Offload optimizer states and model parameters to the CPU to free up GPU memory. Note that `offload_param` only takes effect with ZeRO stage 3.
- `fp16`: Uses FP16 precision to reduce memory usage.
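The batch-size identity above is worth checking before launching a run, since DeepSpeed rejects inconsistent values at startup. A small helper (hypothetical, not part of the DeepSpeed API) makes the arithmetic explicit:

```python
# Sanity-check the identity DeepSpeed enforces at startup:
# train_batch_size == micro_batch_per_gpu * accumulation_steps * num_gpus
def effective_batch_size(micro_batch_per_gpu, accumulation_steps, num_gpus=1):
    return micro_batch_per_gpu * accumulation_steps * num_gpus

# The example ds_config.json above is consistent on a single GPU:
assert effective_batch_size(4, 8, num_gpus=1) == 32

# On 4 GPUs, the same micro-batch and accumulation settings would
# require train_batch_size to be 128 instead:
assert effective_batch_size(4, 8, num_gpus=4) == 128
```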
Step 3: DeepSpeed Engine Initialization and Model Training
Initialize the DeepSpeed engine and train the model in your PyTorch training script. Below is a basic code snippet.
import deepspeed
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

# Define a dummy dataset
class DummyDataset(Dataset):
    def __init__(self, size):
        self.size = size
        self.data = torch.randn(size, 10)
        self.labels = torch.randn(size, 1)

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Path to the DeepSpeed configuration file
config_path = 'ds_config.json'

# Initialize the model and data loader
model = SimpleModel()
dataset = DummyDataset(size=1000)
dataloader = DataLoader(dataset, batch_size=4)  # must match train_micro_batch_size_per_gpu

# Initialize the DeepSpeed engine. The optimizer is built from the
# "optimizer" section of ds_config.json, so we pass model_parameters
# instead of constructing an optimizer ourselves (passing both a client
# optimizer and a config optimizer is a conflict).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=config_path
)

# Training loop
for epoch in range(2):
    for step, (data, labels) in enumerate(dataloader):
        # Move data to the device managed by DeepSpeed
        data = data.to(model_engine.device)
        labels = labels.to(model_engine.device)
        if model_engine.fp16_enabled():
            # With fp16 enabled, inputs must match the model's precision
            data = data.half()
            labels = labels.half()

        outputs = model_engine(data)
        loss = nn.functional.mse_loss(outputs, labels)

        # Backward pass and weight update: gradients are accumulated
        # automatically, and the optimizer steps only once every
        # gradient_accumulation_steps micro-batches.
        model_engine.backward(loss)
        model_engine.step()

        if step % 10 == 0:
            print(f"Epoch: {epoch}, Step: {step}, Loss: {loss.item()}")

print("Training complete!")
This code initializes the DeepSpeed engine and performs a forward pass, backward pass, and weight update for each mini-batch. Weights are updated after gradients have been accumulated for the number of times specified in `gradient_accumulation_steps`.
Step 4: Memory Profiling and Optimization
Monitor memory usage during training to check whether further optimization is needed. You can track memory usage with PyTorch utilities such as `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` (the older `torch.cuda.memory_cached()` is deprecated). For more detailed measurements, DeepSpeed ships a Flops Profiler that is configured through `ds_config.json`.
{
  ...,
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
By analyzing profiling results, you can identify memory bottlenecks and further optimize memory usage through techniques such as ZeRO stage adjustment, offload option adjustment, and activation checkpointing.
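Before reaching for the full profiler, a quick way to watch memory from inside the training loop is PyTorch's built-in counters. This is a minimal sketch; the helper name is made up, and the counters are CUDA-only (the function returns `None` on a CPU-only machine):

```python
import torch

def gpu_memory_stats():
    # Snapshot of GPU memory counters in MiB; None when CUDA is unavailable.
    if not torch.cuda.is_available():
        return None
    mib = 2 ** 20
    return {
        "allocated": torch.cuda.memory_allocated() / mib,
        "reserved": torch.cuda.memory_reserved() / mib,
        "peak_allocated": torch.cuda.max_memory_allocated() / mib,
    }

stats = gpu_memory_stats()
print(stats if stats is not None else "CUDA not available")
```

Calling such a helper right after `model_engine.step()` makes it easy to see whether a config change (ZeRO stage, offload, micro-batch size) actually lowered peak memory.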
4. Real-world Use Case / Example
In a real-world LLM training scenario, an OOM (Out of Memory) error initially occurred at a per-GPU batch size of 16. Lowering the micro-batch size and accumulating gradients over 4 steps resolved the OOM while raising the effective batch size to 64, and training throughput improved by 15%. Activating ZeRO stage 2 and offloading the optimizer to the CPU further reduced GPU memory usage, and activation checkpointing shrank the memory footprint enough to train even larger models.
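For reference, activation checkpointing as mentioned above can be enabled through the `activation_checkpointing` section of `ds_config.json`. This is a sketch of commonly used options; note that the model code must also route layers through DeepSpeed's checkpointing API for these settings to take effect:

```json
"activation_checkpointing": {
  "partition_activations": true,
  "cpu_checkpointing": true,
  "contiguous_memory_optimization": false,
  "synchronize_checkpoint_boundary": false
}
```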
5. Pros & Cons / Critical Analysis
- Pros:
- Overcome GPU Memory Limitations: Achieve the effect of a large batch size even with small memory capacity.
- Improved Throughput: Fewer optimizer updates and gradient synchronizations per sample can improve overall training throughput.
- Minimal Code Changes: Integrate DeepSpeed without significant modifications to existing code.
- Cons:
- Requires Hyperparameter Tuning: The `gradient_accumulation_steps` value must be set appropriately to achieve optimal performance.
- Potential for Increased Training Time: Excessive Gradient Accumulation can reduce update frequency, slowing down convergence.
- Additional Memory Optimization Work Required: Simply using Gradient Accumulation alone may not be sufficient; additional memory optimization techniques such as ZeRO and Offload should be applied together.
6. FAQ
- Q: Does Gradient Accumulation always improve training speed?
A: Not always. If `gradient_accumulation_steps` is too large, the update frequency drops and convergence can slow down, so an appropriate value must be found.
- Q: Which ZeRO stage should I choose?
A: Stage 1 partitions optimizer states, stage 2 also partitions gradients, and stage 3 additionally partitions model parameters. If memory capacity is very limited, stage 3 is recommended, but it can increase communication overhead; stage 2 is generally a good compromise.
- Q: How do I use the DeepSpeed profiler?
A: Add a `flops_profiler` section to `ds_config.json` with `"enabled": true`; no code changes are required. The profile is printed during training and, if `output_file` is set, written to that file for later analysis.
7. Conclusion
DeepSpeed's Gradient Accumulation is a powerful technique for overcoming memory constraints and increasing training efficiency when training ultra-large models. However, simply applying it alone is not enough; it must be used in conjunction with various memory optimization techniques such as ZeRO, Offload, and Activation Checkpointing. Utilize the provided step-by-step guide and tips to optimize your model training pipeline and train larger, more powerful models. Install DeepSpeed now and adjust your `ds_config.json` file to find the optimal settings for your environment!


