Mastering Network Bottleneck Debugging for DeepSpeed Data Parallelism: InfiniBand & RoCE Optimization
This guide explains how to diagnose and resolve InfiniBand and RoCE network bottlenecks in a DeepSpeed data-parallel training environment, with the goal of shortening training time and maximizing GPU utilization. It walks through interface checks, RDMA verification, DeepSpeed and NCCL configuration, and network monitoring.
1. The Challenge / Context
Data parallelism is essential for training large-scale deep learning models. Frameworks like DeepSpeed support it well, but even on high-performance fabrics like InfiniBand or RoCE (RDMA over Converged Ethernet), network bottlenecks frequently cap overall training throughput. The problem is most pronounced with large models and vast datasets, where gradient-synchronization time prevents full utilization of GPU compute. Accurately diagnosing and removing these bottlenecks is therefore a critical part of large-scale deep learning research and development.
2. Deep Dive: The Core of InfiniBand & RoCE Optimization
InfiniBand and RoCE are high-bandwidth, low-latency network technologies well suited to deep learning training, but using them effectively requires understanding a few fundamentals. RDMA (Remote Direct Memory Access) lets one node read from or write to another node's memory directly, bypassing the remote CPU and the kernel network stack, which substantially reduces latency. RoCE is the protocol that carries RDMA traffic over standard Ethernet; native InfiniBand generally provides lower latency than RoCE but requires a more expensive and specialized network infrastructure. Crucially, Network Interface Card (NIC) settings, QoS (Quality of Service) configuration, and the DeepSpeed configuration itself all affect the network performance you actually observe.
3. Step-by-Step Guide / Implementation
Now, let's look at a detailed step-by-step guide to resolve InfiniBand/RoCE network bottlenecks in a DeepSpeed Data Parallelism environment.
Step 1: Verify Network Interface Settings
The first thing to check is the network interface configuration on each node: verify that the correct drivers are installed, the interfaces are up, and the MTU (Maximum Transmission Unit) is set appropriately. A larger MTU reduces per-packet overhead by moving more data per frame. Note the distinction between the two fabrics: jumbo frames (MTU 9000) apply to Ethernet interfaces carrying RoCE traffic, whereas IPoIB interfaces (e.g., ib0) default to an MTU of 2044 in datagram mode and support up to 65520 in connected mode.
# Check interface list
ip link show
# Check interface status (e.g., ib0)
ip link show ib0
# Check the current MTU setting
ip link show ib0 | grep mtu
# Raise the MTU (root privileges required); 9000 suits RoCE Ethernet ports,
# while IPoIB in connected mode supports up to 65520
sudo ip link set dev ib0 mtu 9000
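A mismatched MTU on even one node can silently degrade collective throughput, so it is worth scripting the check across the whole cluster. The helper below is a minimal sketch; the interface name ib0 and the hostfile.txt path are assumptions to adapt to your cluster.

```shell
# parse_mtu: reads `ip link show <dev>` output on stdin and prints the MTU value.
parse_mtu() {
    grep -o 'mtu [0-9]*' | awk '{print $2}'
}

# Usage on one node:
#   ip link show ib0 | parse_mtu
# Across nodes (hostfile.txt is a hypothetical one-host-per-line file):
#   while read -r host; do
#       printf '%s: ' "$host"
#       ssh "$host" ip link show ib0 | parse_mtu
#   done < hostfile.txt
```

Every host should print the same number; an outlier is a node to fix before the next training run.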
Step 2: Verify and Enable RDMA Settings
You need to verify that RDMA is properly set up. Use the ibv_devinfo command to inspect RDMA device information, confirm that the rdma_cm kernel module is loaded, and check that your distribution's RDMA service is running (the service name varies: rdma on RHEL-family systems, openibd with NVIDIA/Mellanox OFED; rdma_cm itself is a kernel module, not a systemd service). Since RDMA allows direct memory access between nodes, the related security settings must be managed carefully.
# Check RDMA device information
ibv_devinfo
# Check that the rdma_cm kernel module is loaded
lsmod | grep rdma_cm
# Check the RDMA service status (service name varies by distribution)
systemctl status rdma
# Restart the RDMA service (root privileges required)
sudo systemctl restart rdma
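Beyond the device listing, the port state in the ibv_devinfo output is the key signal: anything other than PORT_ACTIVE usually points to a cabling, driver, or subnet-manager problem. A small helper (sketch) makes this scriptable:

```shell
# active_ports: reads `ibv_devinfo` output on stdin and prints the number of
# ports currently in the PORT_ACTIVE state.
active_ports() {
    grep -c 'state:[[:space:]]*PORT_ACTIVE'
}

# Usage:
#   ibv_devinfo | active_ports   # should equal the number of cabled ports
```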
Step 3: Optimize DeepSpeed Configuration
You need to optimize network-related settings in the DeepSpeed configuration file (JSON). In particular, gradient_accumulation_steps, reduce_bucket_size, and the fp16 settings all affect communication cost. Increasing gradient_accumulation_steps reduces how often gradients are synchronized, though the larger effective batch may require retuning the learning rate. reduce_bucket_size controls the bucket size used for gradient aggregation; tuning it changes how communication is batched and can improve link utilization. Enabling fp16 halves gradient payloads as well as memory usage, allowing larger models or batch sizes. Note that parameter offload (offload_param) is only valid with ZeRO stage 3; with stage 2, only optimizer-state offload applies.
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "reduce_bucket_size": 5e8,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "allgather_partitions": true
  }
}
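DeepSpeed enforces the invariant train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × data-parallel world size, so it is worth sanity-checking a config before launching. A quick sketch using the sample values above:

```shell
# Sanity-check the DeepSpeed batch-size invariant for the config above.
TRAIN_BATCH=32
MICRO_BATCH=4
ACCUM_STEPS=8

# train_batch_size must divide evenly by micro_batch * accumulation_steps.
if [ $(( TRAIN_BATCH % (MICRO_BATCH * ACCUM_STEPS) )) -ne 0 ]; then
    echo "inconsistent DeepSpeed batch settings" >&2
    exit 1
fi
WORLD_SIZE=$(( TRAIN_BATCH / (MICRO_BATCH * ACCUM_STEPS) ))
echo "implied data-parallel world size: $WORLD_SIZE"
```

With the sample values the implied world size is 1 (a single GPU); for an N-GPU data-parallel run you would scale train_batch_size to 4 × 8 × N.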
Step 4: NCCL (NVIDIA Collective Communications Library) Configuration
DeepSpeed uses NCCL for inter-node communication. You can improve performance by setting NCCL-related environment variables. For example, you can explicitly specify the network interface to use with NCCL_SOCKET_IFNAME, and disable InfiniBand (useful for troubleshooting) with NCCL_IB_DISABLE. Additionally, setting the NCCL_DEBUG environment variable can output NCCL-related debugging information to assist with troubleshooting.
# Specify the network interface to use
export NCCL_SOCKET_IFNAME=ib0
# Disable InfiniBand (for troubleshooting)
export NCCL_IB_DISABLE=1
# Output debugging information
export NCCL_DEBUG=INFO
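In practice these exports belong in the launch script so that every rank sees them. A minimal sketch follows; the interface name ib0 and the NCCL_IB_HCA device mlx5_0 are assumptions you should match to your own `ip link` and `ibstat` output.

```shell
# Sketch of a launch environment for multi-node DeepSpeed training.
export NCCL_SOCKET_IFNAME=ib0   # interface for NCCL bootstrap/socket traffic
export NCCL_IB_HCA=mlx5_0       # restrict NCCL to one HCA (optional; assumption)
export NCCL_DEBUG=INFO          # log transport selection; look for "NET/IB" lines

# Print the resulting NCCL environment for verification before launching:
env | grep '^NCCL_' | sort
# then, e.g.:
#   deepspeed --hostfile hostfile.txt train.py --deepspeed_config ds_config.json
```

If the NCCL_DEBUG=INFO log shows NCCL falling back to plain sockets ("NET/Socket") instead of "NET/IB", the RDMA path is not being used and the earlier steps should be revisited.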
Step 5: Network Monitoring and Profiling
Monitor and profile network performance using tools like ibstat, perfquery, and tcpdump. ibstat reports the state, rate, and LID of each InfiniBand port, while perfquery reads the port performance counters (such as PortXmitData, PortRcvData, and the error counters), which help you spot link errors and congestion; for RoCE, tcpdump can capture the Ethernet side of the traffic. Compare the measured throughput against the link's rated bandwidth: if training saturates the link, revisit gradient_accumulation_steps and reduce_bucket_size; if it does not, the bottleneck likely lies elsewhere, such as data loading or the GPUs themselves.
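The perfquery counters can be turned into a rough throughput estimate. The helper below is a sketch: PortXmitData is maintained in 4-byte units per the InfiniBand specification, so it is converted to bytes; sampling it twice some seconds apart yields the average transmit bandwidth.

```shell
# xmit_bytes: reads `perfquery` output on stdin and prints the PortXmitData
# counter converted to bytes (the counter is kept in 4-byte units).
xmit_bytes() {
    awk -F'[.:]+' '/PortXmitData/ { gsub(/[^0-9]/, "", $NF); print $NF * 4 }'
}

# Rough transmit-bandwidth estimate (sketch): sample twice, 10 s apart.
#   b1=$(perfquery | xmit_bytes); sleep 10; b2=$(perfquery | xmit_bytes)
#   echo "avg TX: $(( (b2 - b1) / 10 )) bytes/s"
```

Run the estimate during an actual training step; a figure far below the link's rated bandwidth while GPUs sit idle is the classic signature of a configuration-level bottleneck rather than a hardware one.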


