GPU Monitoring in Kubernetes: The Ultimate Guide for German AI/ML Companies 2025

GPU monitoring in Kubernetes is essential for German companies that want to run AI/ML workloads efficiently. This comprehensive guide shows you how to implement NVIDIA GPU monitoring, Prometheus integration, and Grafana dashboards for optimal GPU utilization and cost efficiency.

What Is GPU Monitoring in Kubernetes?

GPU monitoring in Kubernetes covers the observation of GPU resources in Kubernetes clusters, including utilization, memory consumption, temperature, and performance metrics for AI/ML workloads.

Why GPU monitoring in Kubernetes matters for German companies:

  • Cost optimization: GPU instances are expensive (€1-10 per hour)
  • Resource efficiency: maximum utilization of expensive GPU hardware
  • Performance optimization: tuning of ML training and inference
  • Capacity planning: forecasting future GPU requirements

NVIDIA GPU Operator for Kubernetes

Installing the NVIDIA GPU Operator

# Add the NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install the NVIDIA GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set nodeFeatureDiscovery.enabled=true \
  --set operator.cleanupCRD=true
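
After the Helm release has rolled out, it is worth confirming that the device plugin is actually advertising GPUs on your worker nodes. Below is a minimal sketch using the official kubernetes Python client; it assumes the package is installed and a working kubeconfig is available.

# List every node and the number of GPUs it currently advertises as allocatable.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")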

GPU Node Configuration

# GPU node with labels applied by Node/GPU Feature Discovery (capacity and allocatable are reported in status)
apiVersion: v1
kind: Node
metadata:
  name: gpu-worker-1
  labels:
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    nvidia.com/gpu.present: 'true'
    nvidia.com/gpu.count: '4'
    nvidia.com/gpu.product: 'Tesla-V100-SXM2-32GB'
status:
  capacity:
    nvidia.com/gpu: '4'
  allocatable:
    nvidia.com/gpu: '4'

Prometheus GPU Monitoring Setup

NVIDIA DCGM Exporter Installation

# DCGM Exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: 'true'
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
          securityContext:
            capabilities:
              add: ['SYS_ADMIN']
          ports:
            - name: metrics
              containerPort: 9400
              hostPort: 9400
          env:
            - name: DCGM_EXPORTER_LISTEN
              value: ':9400'
            - name: DCGM_EXPORTER_KUBERNETES
              value: 'true'
          volumeMounts:
            - name: proc
              mountPath: /hostproc
              readOnly: true
            - name: sys
              mountPath: /hostsys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
      hostNetwork: true
      hostPID: true
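
Before wiring Prometheus to the exporter, a quick way to confirm it is serving data is to hit its metrics endpoint directly, for example after kubectl -n gpu-monitoring port-forward ds/dcgm-exporter 9400:9400. The sketch below uses the requests library; the port and metric names come from the DaemonSet above, while the localhost URL is an assumption based on that port-forward.

# Print the GPU utilization samples exposed by the DCGM exporter.
import requests

resp = requests.get("http://localhost:9400/metrics", timeout=5)
resp.raise_for_status()

for line in resp.text.splitlines():
    if line.startswith("DCGM_FI_DEV_GPU_UTIL"):
        print(line)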

Prometheus ServiceMonitor Configuration

# ServiceMonitor for GPU metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      honorLabels: true
---
# Service for the DCGM exporter
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
    - name: metrics
      port: 9400
      targetPort: 9400
  type: ClusterIP

Prometheus Configuration for GPU Monitoring

# Prometheus config for GPU metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-gpu-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      evaluation_interval: 30s

    rule_files:
      - "/etc/prometheus/rules/*.yml"

    scrape_configs:
    - job_name: 'dcgm-exporter'
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - gpu-monitoring
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: dcgm-exporter
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
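
To verify that the scrape configuration works, query the up series for the dcgm-exporter job via the Prometheus HTTP API. A short sketch follows; the Prometheus URL is an assumption and should be adjusted to your setup.

# Check whether Prometheus is successfully scraping the dcgm-exporter targets.
import requests

resp = requests.get(
    "http://prometheus:9090/api/v1/query",
    params={"query": 'up{job="dcgm-exporter"}'},
    timeout=5,
)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    print(sample["metric"].get("instance"), "up =", sample["value"][1])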

Key GPU Monitoring Metrics

NVIDIA DCGM Metrics for Kubernetes

# GPU Utilization (0-100%)
DCGM_FI_DEV_GPU_UTIL

# GPU Memory Used (MiB)
DCGM_FI_DEV_FB_USED

# GPU Memory Total (MiB)
DCGM_FI_DEV_FB_TOTAL

# GPU Temperature (°C)
DCGM_FI_DEV_GPU_TEMP

# GPU Power Usage (W)
DCGM_FI_DEV_POWER_USAGE

# GPU SM Clock (MHz)
DCGM_FI_DEV_SM_CLOCK

# GPU Memory Clock (MHz)
DCGM_FI_DEV_MEM_CLOCK

# PCIe Throughput (KB/s)
DCGM_FI_DEV_PCIE_TX_THROUGHPUT
DCGM_FI_DEV_PCIE_RX_THROUGHPUT

Custom GPU Metrics for ML Workloads

# Python GPU monitoring exporter for ML jobs
import pynvml
from prometheus_client import Gauge, start_http_server
import time

# Define Prometheus metrics
gpu_utilization = Gauge('ml_gpu_utilization_percent', 'GPU Utilization', ['gpu_id', 'job_name'])
gpu_memory_used = Gauge('ml_gpu_memory_used_bytes', 'GPU Memory Used', ['gpu_id', 'job_name'])
gpu_memory_total = Gauge('ml_gpu_memory_total_bytes', 'GPU Memory Total', ['gpu_id', 'job_name'])
training_throughput = Gauge('ml_training_samples_per_second', 'Training Throughput', ['job_name'])

def collect_gpu_metrics(job_name="ml-training"):
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()

    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)

        # GPU Utilization
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        gpu_utilization.labels(gpu_id=str(i), job_name=job_name).set(util.gpu)

        # GPU Memory
        mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        gpu_memory_used.labels(gpu_id=str(i), job_name=job_name).set(mem_info.used)
        gpu_memory_total.labels(gpu_id=str(i), job_name=job_name).set(mem_info.total)

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        collect_gpu_metrics()
        time.sleep(30)

Grafana Dashboards for GPU Monitoring

GPU Cluster Overview Dashboard

{
  "dashboard": {
    "title": "Kubernetes GPU Monitoring - Cluster Overview",
    "tags": ["kubernetes", "gpu", "monitoring"],
    "templating": {
      "list": [
        {
          "name": "cluster",
          "type": "query",
          "query": "label_values(DCGM_FI_DEV_GPU_UTIL, cluster)"
        },
        {
          "name": "node",
          "type": "query",
          "query": "label_values(DCGM_FI_DEV_GPU_UTIL{cluster=\"$cluster\"}, exported_instance)"
        }
      ]
    },
    "panels": [
      {
        "title": "GPU Utilization by Node",
        "type": "stat",
        "targets": [
          {
            "expr": "avg by (exported_instance) (DCGM_FI_DEV_GPU_UTIL{cluster=\"$cluster\"})",
            "legendFormat": "{{exported_instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                { "color": "red", "value": 0 },
                { "color": "yellow", "value": 50 },
                { "color": "green", "value": 80 }
              ]
            }
          }
        }
      },
      {
        "title": "GPU Memory Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "(DCGM_FI_DEV_FB_USED{cluster=\"$cluster\"} / DCGM_FI_DEV_FB_TOTAL{cluster=\"$cluster\"}) * 100",
            "legendFormat": "{{exported_instance}} GPU {{gpu}}"
          }
        ]
      },
      {
        "title": "GPU Temperature",
        "type": "timeseries",
        "targets": [
          {
            "expr": "DCGM_FI_DEV_GPU_TEMP{cluster=\"$cluster\"}",
            "legendFormat": "{{exported_instance}} GPU {{gpu}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "celsius",
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "yellow", "value": 70 },
                { "color": "red", "value": 85 }
              ]
            }
          }
        }
      }
    ]
  }
}
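
Dashboards like this can be provisioned through the Grafana HTTP API instead of being clicked together by hand. The sketch below posts the JSON above, saved to a hypothetical file gpu-cluster-overview.json, to Grafana; the Grafana URL and the GRAFANA_TOKEN environment variable are assumptions.

# Import the dashboard JSON into Grafana via its HTTP API.
import json
import os

import requests

with open("gpu-cluster-overview.json") as f:
    payload = json.load(f)  # the {"dashboard": {...}} document shown above

resp = requests.post(
    "http://grafana:3000/api/dashboards/db",
    headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
    json={"dashboard": payload["dashboard"], "overwrite": True},
    timeout=10,
)
resp.raise_for_status()
print("Imported:", resp.json().get("url"))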

ML Workload GPU Dashboard

{
  "dashboard": {
    "title": "ML Workloads GPU Performance",
    "panels": [
      {
        "title": "Training Throughput",
        "type": "timeseries",
        "targets": [
          {
            "expr": "ml_training_samples_per_second",
            "legendFormat": "{{job_name}} - Samples/sec"
          }
        ]
      },
      {
        "title": "GPU Efficiency Score",
        "type": "stat",
        "targets": [
          {
            "expr": "(avg(DCGM_FI_DEV_GPU_UTIL) + avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100)) / 2",
            "legendFormat": "GPU Efficiency %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "color": "red", "value": 0 },
                { "color": "yellow", "value": 60 },
                { "color": "green", "value": 80 }
              ]
            }
          }
        }
      },
      {
        "title": "Cost per Hour",
        "type": "stat",
        "targets": [
          {
            "expr": "count(DCGM_FI_DEV_GPU_UTIL > 0) * 2.5",
            "legendFormat": "€ per hour"
          }
        ]
      }
    ]
  }
}

GPU Workload Monitoring for AI/ML

PyTorch GPU Monitoring Integration

# PyTorch training with GPU monitoring
import torch
import torch.nn as nn
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Prometheus metrics for ML training
training_iterations = Counter('ml_training_iterations_total', 'Training Iterations', ['model_name'])
training_loss = Gauge('ml_training_loss', 'Training Loss', ['model_name'])
training_duration = Histogram('ml_training_batch_duration_seconds', 'Batch Training Duration', ['model_name'])
gpu_memory_allocated = Gauge('ml_gpu_memory_allocated_bytes', 'PyTorch GPU Memory Allocated', ['gpu_id'])

class MonitoredModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.model_name = model_name
        self.linear = nn.Linear(784, 10)

    def forward(self, x):
        return self.linear(x)

    def train_step(self, batch_data, batch_labels):
        start_time = time.time()

        # Forward pass
        outputs = self(batch_data)
        loss = nn.CrossEntropyLoss()(outputs, batch_labels)

        # Backward pass
        loss.backward()

        # Update metrics
        training_iterations.labels(model_name=self.model_name).inc()
        training_loss.labels(model_name=self.model_name).set(loss.item())

        # GPU Memory Monitoring
        if torch.cuda.is_available():
            allocated = torch.cuda.memory_allocated()
            gpu_memory_allocated.labels(gpu_id="0").set(allocated)

        # Training Duration
        duration = time.time() - start_time
        training_duration.labels(model_name=self.model_name).observe(duration)

        return loss

# Start the Prometheus metrics server
start_http_server(8000)

# Model training with monitoring
model = MonitoredModel("mnist-classifier")
if torch.cuda.is_available():
    model = model.cuda()
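
train_step computes the loss and runs the backward pass, but the optimizer step is left to the caller. A minimal training loop around the class above could look like the following sketch; train_loader is an assumed DataLoader, and the optimizer choice and learning rate are placeholders.

# Minimal loop driving MonitoredModel.train_step; metrics are updated on every batch.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for batch_data, batch_labels in train_loader:  # train_loader: your torch DataLoader
    if torch.cuda.is_available():
        batch_data, batch_labels = batch_data.cuda(), batch_labels.cuda()
    optimizer.zero_grad()
    loss = model.train_step(batch_data, batch_labels)  # runs forward + backward, records metrics
    optimizer.step()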

TensorFlow GPU Monitoring

# TensorFlow training with GPU monitoring
import tensorflow as tf
from prometheus_client import Gauge, start_http_server
import GPUtil

# GPU metrics
tf_gpu_memory_usage = Gauge('tf_gpu_memory_usage_mb', 'TensorFlow GPU Memory Usage', ['gpu_id'])
tf_gpu_utilization = Gauge('tf_gpu_utilization_percent', 'TensorFlow GPU Utilization', ['gpu_id'])

class GPUMonitorCallback(tf.keras.callbacks.Callback):
    def on_batch_end(self, batch, logs=None):
        # Sample GPU utilization and memory with GPUtil
        gpus = GPUtil.getGPUs()
        for i, gpu in enumerate(gpus):
            tf_gpu_memory_usage.labels(gpu_id=str(i)).set(gpu.memoryUsed)
            tf_gpu_utilization.labels(gpu_id=str(i)).set(gpu.load * 100)

# Model with GPU monitoring
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Serve metrics on :8000 and train with GPU monitoring (train_data/train_labels are placeholders)
start_http_server(8000)
model.fit(
    train_data,
    train_labels,
    epochs=10,
    callbacks=[GPUMonitorCallback()]
)

Alert Rules for GPU Monitoring

Prometheus Alert Rules

# GPU Monitoring Alert Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-monitoring-alerts
  namespace: monitoring
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'GPU temperature too high'
            description: 'GPU {{$labels.gpu}} on node {{$labels.exported_instance}} has temperature {{$value}}°C'

        - alert: GPULowUtilization
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 20
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: 'GPU underutilized'
            description: 'GPU {{$labels.gpu}} on node {{$labels.exported_instance}} has low utilization {{$value}}%'

        - alert: GPUMemoryHigh
          expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU memory usage high'
            description: 'GPU {{$labels.gpu}} memory usage is {{$value}}%'

        - alert: GPUNotResponding
          expr: up{job="dcgm-exporter"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: 'GPU monitoring not responding'
            description: 'DCGM exporter on {{$labels.instance}} is not responding'
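
Once the rules are loaded, firing GPU alerts can also be inspected programmatically through the Prometheus HTTP API, which is useful for runbooks and chat-ops tooling. A short sketch; the Prometheus URL is an assumption.

# List currently active GPU alerts reported by Prometheus.
import requests

resp = requests.get("http://prometheus:9090/api/v1/alerts", timeout=5)
resp.raise_for_status()

for alert in resp.json()["data"]["alerts"]:
    name = alert["labels"].get("alertname", "")
    if name.startswith("GPU"):
        print(name, alert["state"], "gpu =", alert["labels"].get("gpu"))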

Alertmanager Configuration

# Alertmanager GPU Alerts
global:
  smtp_smarthost: 'smtp.company.de:587'
  smtp_from: 'gpu-monitoring@company.de'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'gpu-team'

receivers:
  - name: 'gpu-team'
    email_configs:
      - to: 'gpu-team@company.de'
        subject: '[GPU Alert] {{.GroupLabels.alertname}}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.exported_instance }}
          GPU: {{ .Labels.gpu }}
          {{ end }}

    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#gpu-monitoring'
        title: 'GPU Alert: {{.GroupLabels.alertname}}'
        text: |
          {{ range .Alerts }}
          🚨 {{ .Annotations.summary }}
          📍 Node: {{ .Labels.exported_instance }}
          🔧 GPU: {{ .Labels.gpu }}
          📊 Details: {{ .Annotations.description }}
          {{ end }}
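
Before relying on this routing in production, it is worth pushing a synthetic alert through Alertmanager to confirm that the e-mail and Slack receivers actually fire. The sketch below targets the Alertmanager v2 API; the URL and the label values are assumptions chosen to match the rules above.

# Send a synthetic GPU alert to Alertmanager to test routing and receivers.
from datetime import datetime, timedelta, timezone

import requests

now = datetime.now(timezone.utc)
test_alert = [{
    "labels": {
        "alertname": "GPUHighTemperature",
        "severity": "critical",
        "gpu": "0",
        "exported_instance": "gpu-worker-1",
    },
    "annotations": {
        "summary": "Synthetic test alert",
        "description": "Verifies Alertmanager routing for the gpu-team receiver",
    },
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=5)).isoformat(),
}]

resp = requests.post("http://alertmanager:9093/api/v2/alerts", json=test_alert, timeout=5)
resp.raise_for_status()
print("Alert accepted with status", resp.status_code)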

GPU Resource Management in Kubernetes

GPU Resource Quotas

# GPU Resource Quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-workloads
spec:
  hard:
    # For extended resources such as GPUs, quotas only support the requests. prefix
    requests.nvidia.com/gpu: '8'
    requests.memory: '64Gi'
    requests.cpu: '16'
    limits.memory: '128Gi'
    limits.cpu: '32'
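
Quota consumption can be watched alongside the DCGM metrics, for example to warn a team before it hits its GPU quota. A minimal sketch with the kubernetes Python client; a kubeconfig and the ml-workloads namespace from the example above are assumed.

# Report current GPU quota usage in the ml-workloads namespace.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for quota in v1.list_namespaced_resource_quota("ml-workloads").items:
    used = (quota.status.used or {}).get("requests.nvidia.com/gpu", "0")
    hard = (quota.status.hard or {}).get("requests.nvidia.com/gpu", "n/a")
    print(f"{quota.metadata.name}: {used}/{hard} GPUs requested")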

GPU Workload Scheduling

# ML training job with GPU monitoring
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training-monitored
  namespace: ml-workloads
spec:
  template:
    metadata:
      labels:
        app: ml-training
        monitoring: enabled
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8000'
    spec:
      nodeSelector:
        nvidia.com/gpu.product: 'Tesla-V100-SXM2-32GB'
      containers:
        - name: training
          image: pytorch/pytorch:latest
          resources:
            requests:
              nvidia.com/gpu: 2
              memory: '16Gi'
              cpu: '4'
            limits:
              nvidia.com/gpu: 2
              memory: '32Gi'
              cpu: '8'
          ports:
            - containerPort: 8000
              name: metrics
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: '0,1'
            - name: NVIDIA_VISIBLE_DEVICES
              value: '0,1'
          volumeMounts:
            - name: training-data
              mountPath: /data
            - name: model-output
              mountPath: /output
      volumes:
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: model-output
          persistentVolumeClaim:
            claimName: model-output-pvc
      restartPolicy: Never
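
To see where such jobs actually land and how many GPUs they consume, the pod specs can be inspected programmatically and correlated with the per-node DCGM metrics. A sketch using the kubernetes Python client; kubeconfig and namespace are assumptions.

# List pods in ml-workloads that request GPUs, together with their node placement.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("ml-workloads").items:
    gpus = sum(
        int(c.resources.requests.get("nvidia.com/gpu", 0))
        for c in pod.spec.containers
        if c.resources and c.resources.requests
    )
    if gpus:
        print(f"{pod.metadata.name}: {gpus} GPU(s) on node {pod.spec.node_name}")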

Performance Optimization Based on GPU Monitoring

GPU Utilization Optimization

# GPU utilization optimizer based on Prometheus metrics
from prometheus_api_client import PrometheusConnect

class GPUOptimizer:
    def __init__(self, prometheus_url):
        self.prom = PrometheusConnect(url=prometheus_url)

    def get_gpu_utilization(self, node=None):
        """GPU Auslastung abfragen"""
        if node:
            query = f'avg(DCGM_FI_DEV_GPU_UTIL{{exported_instance="{node}"}})'
        else:
            query = 'avg(DCGM_FI_DEV_GPU_UTIL)'

        result = self.prom.custom_query(query)
        if result:
            return float(result[0]['value'][1])
        return 0

    def get_gpu_memory_usage(self, node=None):
        """GPU Memory Auslastung"""
        if node:
            query = f'avg(DCGM_FI_DEV_FB_USED{{exported_instance="{node}"}} / DCGM_FI_DEV_FB_TOTAL{{exported_instance="{node}"}}) * 100'
        else:
            query = 'avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) * 100'

        result = self.prom.custom_query(query)
        if result:
            return float(result[0]['value'][1])
        return 0

    def recommend_scaling(self):
        """Scaling-Empfehlungen basierend auf GPU Metriken"""
        avg_util = self.get_gpu_utilization()
        avg_memory = self.get_gpu_memory_usage()

        if avg_util < 30 and avg_memory < 50:
            return {
                'action': 'scale_down',
                'reason': f'Low utilization: {avg_util:.1f}% GPU, {avg_memory:.1f}% Memory',
                'recommendation': 'Reduce GPU replicas or consolidate workloads'
            }
        elif avg_util > 85 or avg_memory > 90:
            return {
                'action': 'scale_up',
                'reason': f'High utilization: {avg_util:.1f}% GPU, {avg_memory:.1f}% Memory',
                'recommendation': 'Add more GPU nodes or scale workloads'
            }
        else:
            return {
                'action': 'maintain',
                'reason': f'Optimal utilization: {avg_util:.1f}% GPU, {avg_memory:.1f}% Memory',
                'recommendation': 'Current configuration is optimal'
            }

# Usage
optimizer = GPUOptimizer('http://prometheus:9090')
recommendation = optimizer.recommend_scaling()
print(f"Action: {recommendation['action']}")
print(f"Reason: {recommendation['reason']}")
print(f"Recommendation: {recommendation['recommendation']}")

Cost Optimization Dashboard

# GPU cost monitoring based on Prometheus metrics
from prometheus_api_client import PrometheusConnect

class GPUCostAnalyzer:
    def __init__(self, prometheus_url):
        self.prom = PrometheusConnect(url=prometheus_url)
        self.gpu_hourly_cost = {
            'Tesla-V100': 2.48,  # € per hour
            'Tesla-T4': 0.35,
            'A100': 3.20
        }

    def calculate_daily_costs(self):
        """Tägliche GPU Kosten berechnen"""
        # GPU Anzahl pro Typ abfragen
        gpu_query = 'count by (exported_gpu_model) (DCGM_FI_DEV_GPU_UTIL)'
        result = self.prom.custom_query(query=gpu_query)

        total_cost = 0
        cost_breakdown = {}

        for item in result:
            gpu_model = item['metric']['exported_gpu_model']
            gpu_count = int(item['value'][1])

            if gpu_model in self.gpu_hourly_cost:
                daily_cost = gpu_count * self.gpu_hourly_cost[gpu_model] * 24
                cost_breakdown[gpu_model] = {
                    'count': gpu_count,
                    'hourly_cost': self.gpu_hourly_cost[gpu_model],
                    'daily_cost': daily_cost
                }
                total_cost += daily_cost

        return {
            'total_daily_cost': total_cost,
            'breakdown': cost_breakdown,
            'monthly_estimate': total_cost * 30
        }

    def get_utilization_efficiency(self):
        """GPU Effizienz berechnen"""
        util_query = 'avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h])'
        result = self.prom.custom_query(query=util_query)

        if result:
            avg_utilization = float(result[0]['value'][1])
            costs = self.calculate_daily_costs()

            # Cost wasted due to low utilization
            efficiency = avg_utilization / 100
            wasted_cost = costs['total_daily_cost'] * (1 - efficiency)

            return {
                'average_utilization': avg_utilization,
                'efficiency_ratio': efficiency,
                'daily_wasted_cost': wasted_cost,
                'potential_monthly_savings': wasted_cost * 30
            }

        return None

# Cost Report Generator
analyzer = GPUCostAnalyzer('http://prometheus:9090')
costs = analyzer.calculate_daily_costs()
efficiency = analyzer.get_utilization_efficiency()

print("=== GPU Cost Report ===")
print(f"Daily Costs: €{costs['total_daily_cost']:.2f}")
print(f"Monthly Estimate: €{costs['monthly_estimate']:.2f}")
if efficiency:  # None if no utilization data was returned
    print(f"Average Utilization: {efficiency['average_utilization']:.1f}%")
    print(f"Daily Wasted Cost: €{efficiency['daily_wasted_cost']:.2f}")
    print(f"Potential Monthly Savings: €{efficiency['potential_monthly_savings']:.2f}")

Troubleshooting GPU Monitoring

Common Issues and Solutions

Problem: DCGM exporter does not start

# Debug the DCGM exporter
kubectl logs -n gpu-monitoring daemonset/dcgm-exporter

# Check common causes
kubectl describe node gpu-worker-1 | grep nvidia.com/gpu

# Check NVIDIA driver status
kubectl exec -it dcgm-exporter-pod -- nvidia-smi

# Check DCGM service status
kubectl exec -it dcgm-exporter-pod -- dcgmi discovery -l

Problem: Missing GPU metrics

# DCGM exporter custom counters configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-config
data:
  dcgm-exporter.csv: |
    # Format: DCGM field, Prometheus metric type, help message
    DCGM_FI_DEV_GPU_UTIL,  gauge, GPU utilization (in %)
    DCGM_FI_DEV_FB_USED,   gauge, GPU framebuffer memory used (in MiB)
    DCGM_FI_DEV_FB_TOTAL,  gauge, GPU framebuffer memory total (in MiB)
    DCGM_FI_DEV_GPU_TEMP,  gauge, GPU temperature (in C)

Problem: High metric cardinality

# Metric relabeling to reduce cardinality
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-gpu-relabel
data:
  relabel.yml: |
    # Copy DCGM metric names into a temporary label (note the capture group)
    - source_labels: [__name__]
      regex: '(DCGM_FI_DEV_.*)'
      target_label: __tmp_dcgm_metric
      replacement: '${1}'

    # Drop labels that are not needed
    - regex: 'exported_job|exported_instance'
      action: labeldrop

Best Practices for GPU Monitoring in Kubernetes

1. Resource Planning

# GPU resource planning
Resource planning checklist:
✅ Define GPU type and count
✅ Calculate memory requirements
✅ Account for network bandwidth
✅ Plan storage performance
✅ Verify cooling and power budget

2. Cost Optimization

# Cost optimization strategies
1. GPU sharing for inference workloads
2. Spot instances for training jobs
3. Automatic scaling based on queue length (see the sketch after this list)
4. Multi-tenancy with resource quotas
5. Preemptible workloads for batch jobs
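
Strategy 3 can be prototyped with very little code: count the GPUs requested by pending pods and translate that backlog into additional nodes. The sketch below uses the kubernetes Python client; the GPUs-per-node value and the scale-up policy are assumptions, not a production autoscaler.

# Derive a naive scale-up recommendation from the queue of pending GPU pods.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pending_gpus = 0
for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending").items:
    for c in pod.spec.containers:
        if c.resources and c.resources.requests:
            pending_gpus += int(c.resources.requests.get("nvidia.com/gpu", 0))

gpus_per_node = 4  # e.g. the 4x V100 worker from the node example above
extra_nodes = -(-pending_gpus // gpus_per_node)  # ceiling division
print(f"{pending_gpus} pending GPU request(s) -> add {extra_nodes} GPU node(s)")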

3. Performance Monitoring

# Key performance indicators (PromQL)
# GPU efficiency score
(avg(DCGM_FI_DEV_GPU_UTIL) + avg(DCGM_FI_DEV_FB_USED/DCGM_FI_DEV_FB_TOTAL*100)) / 2

# Cost per training iteration (gpu_hourly_cost supplied as a constant or recording rule)
(count(DCGM_FI_DEV_GPU_UTIL > 0) * gpu_hourly_cost) / increase(ml_training_iterations_total[1h])

# Throughput per active GPU
sum(ml_training_samples_per_second) / count(DCGM_FI_DEV_GPU_UTIL > 0)

Conclusion: GPU Monitoring in Kubernetes for German Companies

Why GPU monitoring in Kubernetes is essential:

For AI/ML startups:

  • Cost control: GPU costs often make up 60-80% of the cloud bill
  • Performance optimization: maximum use of expensive hardware
  • Rapid scaling: automatic scaling based on workload

For enterprises:

  • Compliance: audit trails for GPU usage
  • Multi-tenancy: secure resource sharing between teams
  • Capacity planning: forecasting future GPU requirements

Implementation roadmap (8 weeks):

Weeks 1-2: NVIDIA GPU Operator installation
Weeks 3-4: Prometheus + DCGM exporter setup
Weeks 5-6: Grafana dashboards and alerting
Weeks 7-8: Cost optimization and automation

ROI for German companies:

  • 25-40% cost savings through optimized GPU utilization
  • 50-70% reduction in manual monitoring effort
  • 15-25% performance gains through better resource allocation

Need support implementing GPU monitoring in Kubernetes? Our AI/ML infrastructure experts help German companies build the right GPU monitoring strategy. Contact us for a free GPU consultation.
