DeepSpeed Activation Checkpointing OOM (Out-of-Memory) Debugging Master: Optimizing GPU Memory Usage and Training Strategies for Large-Scale Models
This is the ultimate guide to resolving frequent OOM errors during large-scale model training, especially those encountered when using DeepSpeed Activation Checkpointing. It introduces strategies to significantly reduce GPU memory usage, maximize training efficiency, and train larger models faster.
1. The Challenge / Context
Training large-scale models (LLMs) demands immense GPU memory. As parameter counts, batch sizes, and sequence lengths increase, the memory required to store the model's activation values grows rapidly and often dominates the memory budget. Activation Checkpointing is a key technique to address this issue, but incorrect settings or a lack of understanding can actually lead to OOM (Out-of-Memory) errors. With the current fierce competition in AI model development, training larger and more powerful models is crucial, and the ability to effectively utilize Activation Checkpointing is essential for gaining a competitive edge. Maximizing Tensor Core usage, efficient memory management, and establishing a stable training environment all depend on properly understanding and utilizing this technology.
2. Deep Dive: DeepSpeed Activation Checkpointing
DeepSpeed Activation Checkpointing (also known as Gradient Checkpointing) is a technique that reduces memory usage by not storing all activation values during the model's forward pass; instead, the needed activations are recomputed on the fly during the backward pass. It is a deliberate trade-off: lower memory usage in exchange for extra computation. DeepSpeed implements this technique efficiently and layers additional optimization options on top of it.
Working Principle: It divides the model's layers into several segments and stores only the input activation values for each segment. During the backward pass, the activation values for each segment are recomputed using the stored input values. This method uses significantly less memory than storing all activation values, but it slightly increases computation time as some forward pass calculations must be repeated.
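The segment-based behavior described above can be sketched with PyTorch's built-in `torch.utils.checkpoint.checkpoint_sequential`, which splits a `Sequential` model into segments and stores only each segment's input (a minimal CPU-runnable sketch for recent PyTorch versions; the layer sizes and segment count are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A toy stack of 8 blocks; only each segment's input is kept in memory,
# and intermediate activations inside a segment are recomputed on backward.
model = torch.nn.Sequential(
    *[torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()) for _ in range(8)]
)

x = torch.randn(4, 64, requires_grad=True)
# Split the 8 blocks into 2 segments -> only 2 segment inputs are stored.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # gradients still flow despite the recomputation
```

Even though most intermediate activations are discarded after the forward pass, the backward pass produces exactly the same gradients as an uncheckpointed run.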
Key Features:
- Selective Activation Checkpointing: Instead of applying Activation Checkpointing to all layers, it can be applied only to specific layers with high memory usage to minimize performance degradation.
- CPU Offloading: Activation values can be offloaded to CPU memory to alleviate GPU memory shortage issues.
- Distributed Checkpointing: Stored activation checkpoints can be partitioned across model-parallel GPUs, reducing the per-GPU memory footprint.
3. Step-by-Step Guide / Implementation
A step-by-step guide to effectively using DeepSpeed Activation Checkpointing and debugging OOM errors.
Step 1: Verify and Configure DeepSpeed Settings
DeepSpeed must be correctly installed and configured. Various options can be set via the deepspeed_config.json file.
{
  "train_batch_size": 16,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 4,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.0001,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0.00001,
      "warmup_max_lr": 0.0001,
      "warmup_num_steps": 1000
    }
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "reduce_scatter": true,
    "contiguous_gradients": true,
    "allgather_partitions": true
  }
}
Explanation:
- `partition_activations`: Partitions stored activation values across model-parallel GPUs instead of replicating them.
- `contiguous_memory_optimization`: Copies checkpointed activations into a contiguous memory buffer to reduce memory fragmentation.
- `cpu_checkpointing`: Offloads checkpointed activation values to CPU memory (useful for resolving OOM).
- `zero_optimization`: ZeRO further saves memory by partitioning optimizer states, gradients, and model parameters across GPUs. Stage 2 or higher is recommended; note that parameter offloading (`offload_param`) additionally requires ZeRO Stage 3.
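DeepSpeed enforces the invariant train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, and raises an error at initialization if the values disagree. A quick sanity check before launching (a sketch assuming a hypothetical single-GPU run; the embedded JSON mirrors the batch-size fields of the config above):

```python
import json

# Batch-size fields from the config (a single-GPU example).
config = json.loads("""
{
  "train_batch_size": 16,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 4
}
""")

num_gpus = 1  # hypothetical world size for this check
effective = (config["train_micro_batch_size_per_gpu"]
             * config["gradient_accumulation_steps"]
             * num_gpus)
# DeepSpeed raises at initialize() time if these disagree.
assert effective == config["train_batch_size"]
print("batch-size settings are consistent:", effective)
```

Catching this mismatch up front is cheaper than discovering it after the cluster job has been scheduled.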
Step 2: Apply Activation Checkpointing to the Model
There are two main ways to apply DeepSpeed Activation Checkpointing to a model. The first is to use the DeepSpeed engine, and the second is to implement it directly.
2.1. Using the DeepSpeed Engine
The simplest method is to wrap the model using the DeepSpeed engine.
import deepspeed
import torch

model = ...  # Your PyTorch model

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),  # required when the optimizer is defined in the config
    config="deepspeed_config.json"
)
# Now proceed with training using model_engine
The DeepSpeed engine automatically applies Activation Checkpointing according to the configuration file (deepspeed_config.json).
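Once initialized, the engine replaces the usual `loss.backward()` / `optimizer.step()` / `zero_grad()` sequence with `model_engine.backward(loss)` and `model_engine.step()`. The loop shape can be sketched with a tiny CPU-only stand-in (the `TinyEngine` class below is NOT DeepSpeed; it only mirrors the engine's calling convention so the sketch runs without DeepSpeed installed):

```python
import torch

# Stand-in mirroring the DeepSpeed engine's step API. In real code,
# `engine` would be the object returned by deepspeed.initialize(...).
class TinyEngine:
    def __init__(self, model, lr=1e-3):
        self.module = model
        self.opt = torch.optim.AdamW(model.parameters(), lr=lr)

    def __call__(self, x):
        return self.module(x)

    def backward(self, loss):   # DeepSpeed also handles loss scaling here
        loss.backward()

    def step(self):             # DeepSpeed also zeroes grads here
        self.opt.step()
        self.opt.zero_grad()

engine = TinyEngine(torch.nn.Linear(8, 1))
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(3)]
for x, y in data:
    loss = torch.nn.functional.mse_loss(engine(x), y)
    engine.backward(loss)  # replaces loss.backward()
    engine.step()          # replaces optimizer.step(); no manual zero_grad
print("ran", len(data), "steps")
```

With the real engine, `backward`/`step` additionally handle gradient accumulation, mixed-precision loss scaling, and ZeRO communication according to the config file.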
2.2. Direct Implementation
For more fine-grained control, you can implement Activation Checkpointing directly using torch.utils.checkpoint.
import torch
from torch.utils.checkpoint import checkpoint

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(10, 20)
        self.layer2 = torch.nn.Linear(20, 30)
        self.layer3 = torch.nn.Linear(30, 40)

    def _block2(self, x):
        return torch.relu(self.layer2(x))

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = checkpoint(self._block2, x, use_reentrant=False)  # apply Activation Checkpointing
        x = torch.relu(self.layer3(x))
        return x
Explanation: The checkpoint function stores only the inputs of the wrapped function and discards the intermediate activations produced inside it; those are recomputed during the backward pass. In the example above, Activation Checkpointing is applied to layer2. Note that the entire sub-computation, including the Linear layer itself, must be passed into checkpoint; writing checkpoint(torch.relu, self.layer2(x)) would run layer2 outside the checkpoint and store its activations as usual.
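A runnable check that gradients flow correctly through a checkpointed sub-block (a self-contained CPU-only sketch; the layer sizes are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.block = torch.nn.Sequential(torch.nn.Linear(10, 20), torch.nn.ReLU())
        self.head = torch.nn.Linear(20, 1)

    def forward(self, x):
        # Checkpoint the whole sub-block: its intermediate activations are
        # discarded after the forward pass and recomputed during backward.
        x = checkpoint(self.block, x, use_reentrant=False)
        return self.head(x)

net = Net()
out = net(torch.randn(4, 10))
out.sum().backward()
# Every parameter, including those inside the checkpointed block, gets a gradient.
assert all(p.grad is not None for p in net.parameters())
print("gradients OK")
```

This pattern (wrap an entire `nn.Sequential` or bound method, not a single activation function) is the one to replicate when checkpointing Transformer blocks in a larger model.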
Step 3: Debugging OOM Errors
Even after applying Activation Checkpointing, OOM errors can still occur. In such cases, you can resolve the issue by following these steps:
- Reduce Batch Size: The most basic solution is to reduce `train_batch_size` or `train_micro_batch_size_per_gpu`.
- Enable CPU Offloading: Set `cpu_checkpointing` to `true` in the `deepspeed_config.json` file. Combining this with the ZeRO Offload options is also recommended.
- Utilize Selective Activation Checkpointing: Instead of applying Activation Checkpointing to all layers, apply it only to layers with high memory usage. Profiling tools can help identify the memory-intensive layers.
- Adjust Gradient Accumulation Steps: Increasing `gradient_accumulation_steps` while lowering `train_micro_batch_size_per_gpu` keeps the effective batch size the same but reduces per-step activation memory. A proper balance between throughput and memory must be found.
- Mixed Precision Training (FP16): Enabling FP16 roughly halves activation and gradient memory. Check and enable the `fp16` setting in the `deepspeed_config.json` file.
- Garbage Collection: Explicitly call `torch.cuda.empty_cache()` to return cached blocks to the GPU allocator; this helps with fragmentation, though it cannot free memory that is genuinely in use.
- Utilize the PyTorch Profiler: Use the PyTorch Profiler to diagnose memory leaks and identify memory-intensive operations.
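The profiling step above can be sketched with the PyTorch profiler's memory tracking, which ranks ops by the memory they allocate (CPU activities shown so it runs anywhere; on GPU, add `ProfilerActivity.CUDA` and sort by `self_cuda_memory_usage`; the model here is an illustrative toy):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 512)
)

# profile_memory=True records per-op allocations during forward and backward.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    model(torch.randn(32, 512)).sum().backward()

# Top 5 ops ranked by the memory they allocated themselves.
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5))
```

The ops at the top of this table point to the layers worth wrapping in Activation Checkpointing first.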
4. Real-world Use Case / Example
I once faced persistent OOM errors while performing a text generation task using a 20B parameter model. Initially, I tried reducing the batch size and increasing the Gradient Accumulation Step, but these were not fundamental solutions. Analyzing memory usage with PyTorch Profiler, I identified that a specific Transformer block was consuming an abnormally large amount of memory. By applying Activation Checkpointing only to that block and enabling CPU Offloading, I was able to proceed with training without OOM errors. Furthermore, using FP16 allowed me to double the training speed. Through this experience, I realized the importance of Selective Activation Checkpointing and the usefulness of profiling tools.
5. Pros & Cons / Critical Analysis
- Pros:
- Reduced GPU memory usage
- Ability to train larger models
- Ability to use larger batch sizes
- Cons:
- Increased computation time (requires forward pass recomputation)
- Increased complexity in setup and debugging
- Not effective for all models (requires memory usage profiling)
6. FAQ
- Q: Does using Activation Checkpointing always reduce memory usage?
A: Not always. Activation Checkpointing trades computation time for memory; the recomputation overhead is only worthwhile when it is applied to memory-intensive parts of the model.
- Q: Does enabling CPU Offloading degrade performance?
A: Yes. CPU Offloading frees up GPU memory but incurs data-transfer overhead between the CPU and GPU. However, if it resolves OOM errors, the slowdown is often an acceptable trade-off.
- Q: What is the difference between DeepSpeed and PyTorch's built-in Activation Checkpointing (torch.utils.checkpoint)?
A: PyTorch's built-in checkpointing provides the core store-inputs-and-recompute mechanism. DeepSpeed builds on it with additional optimizations such as partitioning activations across GPUs, contiguous-memory optimization, and CPU checkpointing.
7. Conclusion
DeepSpeed Activation Checkpointing is a powerful technique for resolving OOM errors and optimizing GPU memory usage during large-scale model training. Through the step-by-step methods and debugging strategies presented in this guide, you will be able to train larger models more efficiently and gain a competitive edge in AI model development. Set up DeepSpeed now and apply Activation Checkpointing to maximize your model training efficiency! For more details, please refer to the official DeepSpeed documentation.


