Building a Kubernetes GPU Node Auto-Recovery System: Nvidia DCGM-based Fault Detection and Automatic Restart
Are you struggling to build an auto-recovery system for stable GPU workloads in a Kubernetes environment? By leveraging Nvidia DCGM (Data Center GPU Manager), you can monitor the status of GPU nodes in real-time and build a system that automatically restarts nodes upon failure, minimizing downtime and maximizing workload stability. This is especially crucial for applications that heavily utilize GPU resources, such as machine learning and deep learning model training.
1. The Challenge / Context
Workloads utilizing GPUs, such as machine learning, deep learning model training, and high-performance computing, have high computing resource requirements, and GPU node failures can lead to overall system performance degradation or outages. GPU nodes can become unstable due to various reasons, including hardware issues with the GPU itself, driver errors, memory leaks, or overheating. Manually intervening to restart a node when such a failure occurs is time-consuming and makes immediate response difficult. Especially in 24/7 operating environments, an automated fault detection and recovery system is essential. In a Kubernetes environment, an effective solution to these problems is required.
2. Deep Dive: Nvidia DCGM
Nvidia DCGM (Data Center GPU Manager) is a GPU monitoring and management tool provided by Nvidia. DCGM collects various metrics in real time, such as GPU temperature, memory usage, power consumption, and GPU utilization, and can diagnose the GPU's status based on these metrics. It can also detect GPU errors and respond automatically according to administrator-defined policies. DCGM offers several interfaces, including a CLI (`dcgmi`), programmatic APIs, and a Prometheus exporter, making it easy to integrate with other systems. In a Kubernetes environment, you can use the DCGM exporter to collect GPU metrics into Prometheus and receive alerts via Alertmanager when failures occur. Pairing DCGM with the node exporter lets you monitor overall system status (CPU, memory, disk I/O) alongside the GPU metrics, so GPU-related problems can be diagnosed in the context of system-wide issues.
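As a quick way to exercise the CLI interface mentioned above, the `dcgmi` tool can list GPUs and stream individual metric fields. A minimal sketch, assuming DCGM is installed and the host engine is running (field ID 150 is GPU temperature, 155 is power usage):

```shell
# List the GPUs DCGM can see on this node
dcgmi discovery -l

# Stream GPU temperature (field 150) and power usage (field 155)
# for 3 samples, then exit
dcgmi dmon -e 150,155 -c 3
```

These commands require an Nvidia GPU and a running DCGM host engine, so treat them as a local sanity check rather than something to script into CI.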
3. Step-by-Step Guide / Implementation
Now, let's look at how to build a Kubernetes GPU node auto-recovery system based on Nvidia DCGM, step by step. This system monitors the GPU's status using DCGM and automatically restarts the node if a failure exceeding a configured threshold is detected.
Step 1: DCGM Installation and Configuration
First, you need to install DCGM on each GPU node in your Kubernetes cluster. Download and install the appropriate DCGM package for your operating system according to the Nvidia official documentation.
# Ubuntu Example
sudo apt-get update
sudo apt-get install -y nvidia-dcgm
After DCGM installation, verify that the DCGM service is running correctly.
sudo systemctl status nvidia-dcgm
Through DCGM configuration, you can define the GPU metrics to monitor and the fault detection conditions. For example, you can set a GPU temperature threshold to trigger an automatic restart upon overheating.
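Before relying on threshold-based alerting, it can be useful to confirm the GPUs themselves are healthy. DCGM's built-in diagnostics (`dcgmi diag`) run at increasing levels of thoroughness; a short sketch, assuming DCGM is installed on the node:

```shell
# Quick sanity diagnostic: driver presence, permissions, basic environment
dcgmi diag -r 1

# Longer hardware diagnostic (memory, stress tests); can take several
# minutes per GPU, so run it during a maintenance window
dcgmi diag -r 3
```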
Step 2: DCGM Exporter Installation and Configuration
To collect DCGM metrics into Prometheus, install and configure the DCGM exporter, which exposes GPU metrics from DCGM in Prometheus format. In a Kubernetes cluster it is typically deployed on every GPU node as a DaemonSet (for example via the official Helm chart), but for a single node you can also run the container directly.
# Docker Run Example (image name and tag may differ; check Nvidia's registry and pin a version in production)
docker run -d --name dcgm-exporter \
  --gpus all --cap-add SYS_ADMIN \
  --restart=unless-stopped \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:latest
Verify that the DCGM exporter is running correctly and that GPU metrics are exposed at the `http://<node-ip>:9400/metrics` endpoint (9400 is the exporter's default port).
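The exporter output is plain Prometheus text, so a value can be pulled out with standard shell tools. A minimal sketch using a sample metric line (the values and UUID are illustrative; the live `curl` command is commented out because it needs a running exporter, and `<node-ip>` is a placeholder):

```shell
# A sample line in the format dcgm-exporter emits (values illustrative):
sample='DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaaa"} 47'

# The metric value is the last whitespace-separated field:
temp=$(printf '%s\n' "$sample" | awk '{print $NF}')
echo "GPU 0 temperature: ${temp}C"
# prints: GPU 0 temperature: 47C

# Against a live node, scrape the endpoint instead:
# curl -s http://<node-ip>:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP
```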
Step 3: Prometheus Configuration
Add the DCGM exporter as a scraping target in the Prometheus configuration file (`prometheus.yml`). This allows Prometheus to periodically collect GPU metrics from the DCGM exporter.
scrape_configs:
  - job_name: 'dcgm-exporter'
    static_configs:
      - targets: ['<gpu-node-ip>:9400']  # IP address and port of the GPU node
After changing the Prometheus configuration, restart Prometheus to apply the changes.
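Before restarting, it is worth validating the file. `promtool` ships with Prometheus, and if Prometheus was started with the `--web.enable-lifecycle` flag you can hot-reload instead of restarting; the path and port below are the usual defaults, so adjust them to your setup:

```shell
# Validate the configuration file
promtool check config /etc/prometheus/prometheus.yml

# Hot-reload the configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```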
Step 4: Alertmanager Configuration
Define fault detection rules based on GPU metrics collected by Prometheus and configure Alertmanager to receive alerts when failures occur. For example, you can define a rule to trigger an alert if the GPU temperature exceeds 80 degrees Celsius.
# Prometheus Alerting Rule Example (alert.yml); rules are evaluated by Prometheus and routed through Alertmanager
groups:
  - name: GPUAlerts
    rules:
      - alert: HighGpuTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High GPU Temperature ({{ $labels.instance }})"
          description: "GPU temperature has exceeded 80 degrees Celsius. Please check the node."
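For a rule file like this to take effect, Prometheus must be told to load it and where Alertmanager is listening. A sketch of the corresponding `prometheus.yml` additions (the file path and Alertmanager address are assumptions for illustration):

```yaml
# prometheus.yml (additions)
rule_files:
  - /etc/prometheus/alert.yml   # path where the rule file above is saved

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']  # default Alertmanager listen address
```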
Through Alertmanager settings, you can configure email, Slack channels, and other notification recipients.
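A minimal `alertmanager.yml` sketch routing the alert above to a Slack channel; the webhook URL and channel name are placeholders, and email or other receivers follow the same pattern:

```yaml
# alertmanager.yml (minimal example)
route:
  receiver: 'gpu-oncall'
  group_by: ['alertname', 'instance']

receivers:
  - name: 'gpu-oncall'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder webhook URL
        channel: '#gpu-alerts'
        send_resolved: true
```

For the automatic-restart half of the system, a common pattern is to add a `webhook_configs` receiver here that calls a small remediation service, which can then cordon, drain, and reboot the affected node.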