Kubernetes GPU Cluster: The Ultimate Guide for German AI/ML Infrastructures in 2025

A Kubernetes GPU cluster is the foundation for scalable AI/ML infrastructure in German companies. This comprehensive guide shows you how to build, operate, and optimize production-ready GPU clusters, from hardware planning to automatic scaling.

What Is a Kubernetes GPU Cluster?

A Kubernetes GPU cluster is an orchestrated collection of GPU-capable nodes that process AI/ML workloads together. It enables efficient resource utilization, automatic scaling, and centralized management of GPU infrastructure.

Why Kubernetes GPU Clusters for German Companies?

  • Scalability: From single GPUs to hundreds of nodes
  • Cost efficiency: Optimal utilization of expensive GPU hardware
  • Multi-tenancy: Secure resource sharing between teams
  • Compliance: GDPR-compliant (DSGVO) AI infrastructure

Kubernetes GPU Cluster Architecture

Multi-Node GPU Cluster Design

# Cluster Architecture Overview
Cluster Components:
├── Master Nodes (3x)
│   ├── API Server
│   ├── etcd Cluster
│   ├── Scheduler (GPU-aware)
│   └── Controller Manager
├── GPU Worker Nodes (N x)
│   ├── NVIDIA GPU Driver
│   ├── NVIDIA Container Runtime
│   ├── GPU Device Plugin
│   └── DCGM Exporter
├── Storage Nodes
│   ├── High-Performance SSD
│   ├── Network Attached Storage
│   └── Distributed File System
└── Network Infrastructure
    ├── High-Bandwidth Networking
    ├── InfiniBand (optional)
    └── Load Balancers

GPU Node Specifications

# GPU Node Hardware Specifications
GPU Node Types:
  Training Nodes:
    - CPUs: 32-64 cores
    - RAM: 256-512 GB
    - GPUs: 4-8x NVIDIA A100/H100
    - Network: 100 Gbps
    - Storage: NVMe SSD

  Inference Nodes:
    - CPUs: 16-32 cores
    - RAM: 128-256 GB
    - GPUs: 2-4x NVIDIA T4/L4
    - Network: 25-50 Gbps
    - Storage: SSD

  Development Nodes:
    - CPUs: 8-16 cores
    - RAM: 64-128 GB
    - GPUs: 1-2x NVIDIA RTX/Tesla
    - Network: 10-25 Gbps
    - Storage: Mixed
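
To turn these node profiles into an aggregate capacity plan, a quick sizing sketch can help; the node counts below are hypothetical placeholders, not recommendations.

# Rough cluster capacity estimate from the node profiles above
# (node counts are hypothetical placeholders - adjust them to your own plan)
node_profiles = {
    "training":    {"count": 4, "gpus": 8, "ram_gb": 512, "cpus": 64},
    "inference":   {"count": 6, "gpus": 4, "ram_gb": 256, "cpus": 32},
    "development": {"count": 2, "gpus": 2, "ram_gb": 128, "cpus": 16},
}

totals = {"gpus": 0, "ram_gb": 0, "cpus": 0}
for name, profile in node_profiles.items():
    totals["gpus"] += profile["count"] * profile["gpus"]
    totals["ram_gb"] += profile["count"] * profile["ram_gb"]
    totals["cpus"] += profile["count"] * profile["cpus"]
    print(f"{name}: {profile['count']} nodes, {profile['count'] * profile['gpus']} GPUs")

print(f"Cluster total: {totals['gpus']} GPUs, {totals['ram_gb']} GB RAM, {totals['cpus']} CPU cores")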

Kubernetes GPU Cluster Setup

1. Cluster Initialization with kubeadm

# Master Node Setup
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --kubernetes-version=v1.28.0 \
  --upload-certs \
  --control-plane-endpoint=gpu-cluster.company.de:6443

# Save cluster configuration
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# High-Availability Master Setup
sudo kubeadm join gpu-cluster.company.de:6443 \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:... \
  --control-plane \
  --certificate-key ...

2. Adding GPU Worker Nodes

# GPU Node Prerequisites
# NVIDIA Driver Installation
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit nvidia-driver-525

# Container Runtime Configuration
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# Join GPU Node to Cluster
sudo kubeadm join gpu-cluster.company.de:6443 \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:...

3. NVIDIA GPU Operator Installation

# Add Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install the GPU Operator on the cluster
# (driver.enabled=false because the NVIDIA driver was already installed on the nodes in step 2)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set nodeFeatureDiscovery.enabled=true \
  --set migManager.enabled=true \
  --set operator.defaultRuntime=containerd \
  --set validator.plugin.env[0].name=WITH_WORKLOAD \
  --set-string validator.plugin.env[0].value=true

4. GPU Cluster Validation

# Check GPU nodes
kubectl get nodes -l nvidia.com/gpu.present=true

# Show GPU resources
kubectl describe nodes | grep nvidia.com/gpu

# GPU Operator Status
kubectl get pods -n gpu-operator

# Test GPU Workload
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never
EOF

# Test Results
kubectl logs gpu-test
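
The same validation can be scripted with the official kubernetes Python client (pip install kubernetes); a small sketch that lists GPU nodes and their allocatable nvidia.com/gpu capacity, using the label the GPU Operator applies:

# List GPU nodes and their allocatable nvidia.com/gpu capacity
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

nodes = v1.list_node(label_selector="nvidia.com/gpu.present=true")
for node in nodes.items:
    allocatable = node.status.allocatable.get("nvidia.com/gpu", "0")
    capacity = node.status.capacity.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {allocatable}/{capacity} GPUs allocatable")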

Multi-Node GPU Cluster Networking

High-Performance Networking Setup

# Cluster Network Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-network-config
  namespace: kube-system
data:
  cni-config: |
    {
      "cniVersion": "0.4.0",
      "name": "gpu-cluster-network",
      "plugins": [
        {
          "type": "calico",
          "datastore_type": "kubernetes",
          "mtu": 9000,
          "nodename_file_optional": false,
          "ipam": {
            "type": "calico-ipam",
            "assign_ipv4": "true",
            "assign_ipv6": "false"
          },
          "container_settings": {
            "allow_ip_forwarding": true
          }
        },
        {
          "type": "bandwidth",
          "capabilities": {
            "bandwidth": true
          }
        }
      ]
    }

InfiniBand Integration (Enterprise)

# InfiniBand Device Plugin
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: infiniband-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: infiniband-device-plugin
  template:
    metadata:
      labels:
        app: infiniband-device-plugin
    spec:
      nodeSelector:
        infiniband.present: 'true'
      containers:
        - name: infiniband-device-plugin
          image: mellanox/ib-kubernetes:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: dev
              mountPath: /dev
            - name: sys
              mountPath: /sys
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
      hostNetwork: true

GPU Resource Management

GPU Sharing Strategies

# Multi-Instance GPU (MIG) Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            1g.5gb: 7
      all-2g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            2g.10gb: 3
      mixed:
        - devices: [0,1]
          mig-enabled: true
          mig-devices:
            1g.5gb: 4
            2g.10gb: 1
        - devices: [2,3]
          mig-enabled: false
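
The profile names encode the slice size: on an A100 40GB, 1g.5gb yields up to seven instances per GPU and 2g.10gb up to three. The following sketch (A100 40GB geometry assumed) sanity-checks how many MIG devices a mig-devices block produces per physical GPU:

# Sanity-check MIG partitioning per physical GPU (A100 40GB geometry assumed)
MIG_PROFILES_A100_40GB = {
    "1g.5gb":  {"max_instances": 7, "memory_gb": 5},
    "2g.10gb": {"max_instances": 3, "memory_gb": 10},
    "3g.20gb": {"max_instances": 2, "memory_gb": 20},
    "7g.40gb": {"max_instances": 1, "memory_gb": 40},
}

def mig_capacity(requested):
    """Print what a mig-devices block (profile -> count) yields per GPU."""
    for profile, count in requested.items():
        spec = MIG_PROFILES_A100_40GB[profile]
        status = "ok" if count <= spec["max_instances"] else f"exceeds max {spec['max_instances']}"
        print(f"{profile}: {count} x {spec['memory_gb']} GB ({status})")

mig_capacity({"1g.5gb": 7})                # the all-1g.5gb config above
mig_capacity({"1g.5gb": 4, "2g.10gb": 1})  # the mixed config above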

Time-Slicing for GPU Sharing

# GPU Time-Slicing Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # 4 pods per GPU
        - name: nvidia.com/mig-1g.5gb
          replicas: 2
        - name: nvidia.com/mig-2g.10gb
          replicas: 1
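
Time-slicing does not partition GPU memory or compute; it only multiplies the advertised resource count, so the pods sharing a GPU contend for its full memory. A minimal sketch of the resulting capacity math (physical device counts are placeholders):

# Effective schedulable slots under time-slicing (physical counts are placeholders)
physical = {"nvidia.com/gpu": 8, "nvidia.com/mig-1g.5gb": 14, "nvidia.com/mig-2g.10gb": 3}
replicas = {"nvidia.com/gpu": 4, "nvidia.com/mig-1g.5gb": 2, "nvidia.com/mig-2g.10gb": 1}

for resource, count in physical.items():
    print(f"{resource}: {count} physical -> {count * replicas[resource]} schedulable slots")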

GPU Resource Quotas

# Namespace GPU Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota-ml-team
  namespace: ml-team
spec:
  hard:
    nvidia.com/gpu: '16'
    nvidia.com/mig-1g.5gb: '32'
    nvidia.com/mig-2g.10gb: '8'
    requests.memory: '512Gi'
    requests.cpu: '128'
    limits.memory: '1024Gi'
    limits.cpu: '256'
    persistentvolumeclaims: '10'
    count/jobs.batch: '50'
---
# Priority Classes for GPU Workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000
globalDefault: false
description: 'High priority for critical GPU workloads'
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-low-priority
value: 100
globalDefault: false
description: 'Low priority for batch GPU workloads'
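
To check how much of the ml-team quota is actually consumed, the quota status can be read with the kubernetes Python client; a short sketch using the quota name and namespace defined above:

# Read GPU quota usage for the ml-team namespace
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

quota = v1.read_namespaced_resource_quota("gpu-quota-ml-team", "ml-team")
for resource, hard in quota.status.hard.items():
    used = quota.status.used.get(resource, "0")
    print(f"{resource}: {used} used of {hard}")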

GPU Cluster Autoscaling

Cluster Autoscaler for GPU Nodes

# Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws # or azure, gcp
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/gpu-cluster
            - --balance-similar-node-groups
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --max-node-provision-time=15m
          env:
            - name: AWS_REGION
              value: eu-central-1
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi

GPU-Aware Scheduling

# Extended Resource Scheduler
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: gpu-scheduler
      plugins:
        filter:
          enabled:
          - name: NodeResourcesFit
          - name: NodeAffinity
        score:
          enabled:
          - name: NodeResourcesFit
          - name: NodeAffinity
          - name: InterPodAffinity
      pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated
            resources:
            - name: nvidia.com/gpu
              weight: 100
            - name: cpu
              weight: 1
            - name: memory
              weight: 1
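
Workloads opt into this profile by setting schedulerName: gpu-scheduler in their pod spec. A sketch that creates such a pod with the kubernetes Python client; image and namespace are placeholders:

# Create a pod that is placed by the gpu-scheduler profile defined above
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-sched-demo"),
    spec=client.V1PodSpec(
        scheduler_name="gpu-scheduler",  # select the GPU-aware scheduling profile
        restart_policy="Never",
        containers=[client.V1Container(
            name="cuda",
            image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder image
            command=["nvidia-smi"],
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
        )],
    ),
)
v1.create_namespaced_pod(namespace="ml-workloads", body=pod)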

GPU Workload Orchestration

Distributed Training with PyTorch

# PyTorch Distributed Training Job
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
  namespace: ml-workloads
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        metadata:
          annotations:
            prometheus.io/scrape: 'true'
            prometheus.io/port: '9090'
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
              command:
                - python
                - -m
                - torch.distributed.launch
                - --nproc_per_node=4
                - --nnodes=4
                - --node_rank=0
                - --master_addr=distributed-training-master-0
                - --master_port=23456
                - train.py
              resources:
                requests:
                  nvidia.com/gpu: 4
                  memory: '64Gi'
                  cpu: '16'
                limits:
                  nvidia.com/gpu: 4
                  memory: '128Gi'
                  cpu: '32'
              env:
                - name: NCCL_DEBUG
                  value: 'INFO'
                - name: NCCL_IB_DISABLE
                  value: '0'
              volumeMounts:
                - name: training-data
                  mountPath: /data
                - name: model-output
                  mountPath: /output
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
              command:
                - python
                - -m
                - torch.distributed.launch
                - --nproc_per_node=4
                - --nnodes=4
                - --node_rank=$(WORKER_RANK) # assumes WORKER_RANK is provided as an env var on the worker pods
                - --master_addr=distributed-training-master-0
                - --master_port=23456
                - train.py
              resources:
                requests:
                  nvidia.com/gpu: 4
                  memory: '64Gi'
                  cpu: '16'
                limits:
                  nvidia.com/gpu: 4
                  memory: '128Gi'
                  cpu: '32'
              volumeMounts:
                - name: training-data
                  mountPath: /data
                - name: model-output
                  mountPath: /output
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: training-data-pvc
            - name: model-output
              persistentVolumeClaim:
                claimName: model-output-pvc
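
The train.py referenced in the job is not shown here; a minimal DistributedDataParallel skeleton that torch.distributed.launch can drive might look like the following (model and data are dummies, NCCL backend assumed):

# train.py - minimal DistributedDataParallel skeleton (dummy model and data)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    dist.init_process_group(backend="nccl")  # rank/world size come from the launcher env
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(100):  # placeholder training loop with random data
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if dist.get_rank() == 0 and step % 10 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()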

Model Serving with GPU Sharing

# GPU Model Serving Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving-gpu
  namespace: ml-serving
spec:
  replicas: 8
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
    spec:
      nodeSelector:
        node-type: gpu-inference
      containers:
        - name: model-server
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args: ['tritonserver', '--model-repository=/models']
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              nvidia.com/gpu: 1
              memory: '8Gi'
              cpu: '2'
            limits:
              nvidia.com/gpu: 1
              memory: '16Gi'
              cpu: '4'
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: '0'
            - name: TRITON_MODEL_REPOSITORY
              value: '/models'
          volumeMounts:
            - name: model-storage
              mountPath: /models
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 5
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage-pvc
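
Clients can then call the service over Triton's HTTP API. A sketch with the tritonclient Python package (pip install tritonclient[http]); the model name matches the repository config below, while the tensor names, shapes, and dtypes are assumptions that depend on how the model was exported:

# Query the Triton deployment over HTTP (tensor names and shapes are assumptions)
import numpy as np
import tritonclient.http as httpclient

triton = httpclient.InferenceServerClient(url="model-serving-gpu.ml-serving.svc:8000")
print("server ready:", triton.is_server_ready())

# Dummy input for a BERT-style model; adjust names, shape, and dtype to your export
input_ids = httpclient.InferInput("input_ids", [1, 128], "INT64")
input_ids.set_data_from_numpy(np.zeros((1, 128), dtype=np.int64))

result = triton.infer(model_name="bert-base-german", inputs=[input_ids])
print("output shape:", result.as_numpy("logits").shape)  # output name is an assumption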

GPU Cluster Storage Solutions

High-Performance Storage for GPU Workloads

# NVMe SSD StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-ssd-gpu
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: '16000'
  throughput: '1000'
  fsType: ext4
  encrypted: 'true'
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Distributed Storage for the Model Repository
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: distributed-gpu-storage
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: gpu-cluster
  pool: gpu-pool
  imageFormat: '2'
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
reclaimPolicy: Delete
volumeBindingMode: Immediate

Shared Model Storage

# ReadWriteMany PVC for Model Sharing
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-models-pvc
  namespace: ml-workloads
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: distributed-gpu-storage
---
# Model Repository ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-repository-config
  namespace: ml-workloads
data:
  models.json: |
    {
      "models": [
        {
          "name": "bert-base-german",
          "path": "/models/bert-base-german",
          "version": "1.0",
          "gpu_memory": "2Gi",
          "batch_size": 32
        },
        {
          "name": "gpt-german-large",
          "path": "/models/gpt-german-large", 
          "version": "2.1",
          "gpu_memory": "8Gi",
          "batch_size": 8
        }
      ]
    }

GPU Cluster Monitoring & Observability

Comprehensive GPU Monitoring Stack

# GPU Monitoring Stack
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-monitoring
---
# Prometheus for the GPU Cluster
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: gpu-cluster-prometheus
  namespace: gpu-monitoring
spec:
  replicas: 2
  retention: 30d
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ['ReadWriteOnce']
        resources:
          requests:
            storage: 500Gi
        storageClassName: nvme-ssd-gpu
  serviceMonitorSelector:
    matchLabels:
      monitoring: gpu-cluster
  ruleSelector:
    matchLabels:
      monitoring: gpu-cluster
  resources:
    requests:
      memory: '8Gi'
      cpu: '2'
    limits:
      memory: '16Gi'
      cpu: '4'
---
# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
  labels:
    monitoring: gpu-cluster
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      honorLabels: true
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace

GPU Cluster Health Monitoring

# GPU Cluster Health Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-cluster-health
  namespace: gpu-monitoring
  labels:
    monitoring: gpu-cluster
spec:
  groups:
    - name: gpu-cluster.health
      rules:
        - alert: GPUClusterNodeDown
          expr: up{job="dcgm-exporter"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: 'GPU node is down'
            description: 'GPU node {{$labels.instance}} has been down for more than 2 minutes'

        - alert: GPUClusterLowUtilization
          expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: 'GPU cluster underutilized'
            description: 'GPU cluster average utilization is {{$value}}% for 30 minutes'

        - alert: GPUClusterMemoryPressure
          expr: avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 'GPU cluster memory pressure'
            description: 'GPU cluster memory usage is {{$value | humanizePercentage}}'

        - alert: GPUClusterTemperatureHigh
          expr: max(DCGM_FI_DEV_GPU_TEMP) > 85
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'GPU cluster temperature critical'
            description: 'GPU temperature reached {{$value}}°C on node {{$labels.exported_instance}}'

Cost Optimization for GPU Clusters

GPU Cluster Cost Management

# GPU Cluster Cost Analyzer
import pandas as pd
from prometheus_api_client import PrometheusConnect
from datetime import datetime, timedelta

class GPUClusterCostAnalyzer:
    def __init__(self, prometheus_url):
        self.prom = PrometheusConnect(url=prometheus_url)
        self.gpu_costs = {
            'Tesla-V100': 2.48,   # € per hour
            'Tesla-T4': 0.35,
            'A100-40GB': 3.20,
            'A100-80GB': 4.50,
            'H100': 6.80,
            'RTX-4090': 1.20
        }

    def get_cluster_gpu_inventory(self):
        """GPU Cluster Inventar abrufen"""
        query = '''
        count by (exported_gpu_model, exported_instance)
        (DCGM_FI_DEV_GPU_UTIL)
        '''
        result = self.prom.custom_query(query)

        inventory = {}
        for item in result:
            gpu_model = item['metric']['exported_gpu_model']
            node = item['metric']['exported_instance']
            count = int(item['value'][1])

            if node not in inventory:
                inventory[node] = {}
            inventory[node][gpu_model] = count

        return inventory

    def calculate_cluster_costs(self, time_range='24h'):
        """Cluster-Kosten berechnen"""
        inventory = self.get_cluster_gpu_inventory()

        total_cost = 0
        node_costs = {}
        gpu_type_costs = {}

        for node, gpus in inventory.items():
            node_cost = 0
            for gpu_model, count in gpus.items():
                if gpu_model in self.gpu_costs:
                    hourly_cost = self.gpu_costs[gpu_model] * count
                    if time_range == '24h':
                        daily_cost = hourly_cost * 24
                    elif time_range == '30d':
                        daily_cost = hourly_cost * 24 * 30
                    else:
                        daily_cost = hourly_cost

                    node_cost += daily_cost
                    total_cost += daily_cost

                    if gpu_model not in gpu_type_costs:
                        gpu_type_costs[gpu_model] = 0
                    gpu_type_costs[gpu_model] += daily_cost

            node_costs[node] = node_cost

        return {
            'total_cost': total_cost,
            'node_costs': node_costs,
            'gpu_type_costs': gpu_type_costs,
            'inventory': inventory
        }

    def get_utilization_efficiency(self):
        """Cluster-Effizienz berechnen"""
        util_query = 'avg(DCGM_FI_DEV_GPU_UTIL)'
        memory_query = 'avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100)'

        util_result = self.prom.custom_query(util_query)
        memory_result = self.prom.custom_query(memory_query)

        if util_result and memory_result:
            avg_util = float(util_result[0]['value'][1])
            avg_memory = float(memory_result[0]['value'][1])

            efficiency_score = (avg_util + avg_memory) / 2
            costs = self.calculate_cluster_costs('24h')

            return {
                'gpu_utilization': avg_util,
                'memory_utilization': avg_memory,
                'efficiency_score': efficiency_score,
                'daily_cost': costs['total_cost'],
                'wasted_cost': costs['total_cost'] * (1 - efficiency_score / 100),
                'monthly_savings_potential': costs['total_cost'] * (1 - efficiency_score / 100) * 30
            }

        return None

    def recommend_optimizations(self):
        """Generate optimization recommendations"""
        efficiency = self.get_utilization_efficiency()
        costs = self.calculate_cluster_costs('24h')

        recommendations = []
        if efficiency is None:  # no DCGM metrics available yet
            return recommendations

        if efficiency['efficiency_score'] < 50:
            recommendations.append({
                'priority': 'high',
                'action': 'Enable GPU sharing/time-slicing',
                'potential_savings': efficiency['monthly_savings_potential'] * 0.6,
                'description': 'Low cluster efficiency detected'
            })

        if efficiency['gpu_utilization'] < 30:
            recommendations.append({
                'priority': 'high',
                'action': 'Implement workload consolidation',
                'potential_savings': efficiency['monthly_savings_potential'] * 0.4,
                'description': 'GPU utilization below optimal threshold'
            })

        # Analyze the GPU type mix
        gpu_costs = costs['gpu_type_costs']
        if 'A100-80GB' in gpu_costs and gpu_costs['A100-80GB'] > costs['total_cost'] * 0.7:
            recommendations.append({
                'priority': 'medium',
                'action': 'Consider mixed GPU types for different workloads',
                'potential_savings': gpu_costs['A100-80GB'] * 0.3,
                'description': 'High-end GPUs dominate cluster costs'
            })

        return recommendations

# Usage
analyzer = GPUClusterCostAnalyzer('http://prometheus:9090')
costs = analyzer.calculate_cluster_costs('30d')
efficiency = analyzer.get_utilization_efficiency()
recommendations = analyzer.recommend_optimizations()

print("=== GPU Cluster Cost Analysis ===")
print(f"Monthly Cluster Cost: €{costs['total_cost']:.2f}")
print(f"GPU Efficiency Score: {efficiency['efficiency_score']:.1f}%")
print(f"Potential Monthly Savings: €{efficiency['monthly_savings_potential']:.2f}")
print("\n=== Optimization Recommendations ===")
for rec in recommendations:
    print(f"[{rec['priority'].upper()}] {rec['action']}")
    print(f"  Potential Savings: €{rec['potential_savings']:.2f}/month")
    print(f"  Description: {rec['description']}\n")

Automated Cost Optimization

# GPU Cluster Autoscaler with Cost Awareness
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-cost-optimizer
  namespace: gpu-monitoring
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: cost-optimizer
          image: gpu-cost-optimizer:latest
          env:
            - name: PROMETHEUS_URL
              value: 'http://prometheus:9090'
            - name: OPTIMIZATION_INTERVAL
              value: '300' # 5 minutes
            - name: MIN_EFFICIENCY_THRESHOLD
              value: '60' # 60%
            - name: MAX_COST_PER_DAY
              value: '1000' # €1000
          command:
            - python
            - -c
            - |
              import time
              import os
              from kubernetes import client, config
              from cost_analyzer import GPUClusterCostAnalyzer

              config.load_incluster_config()
              v1 = client.CoreV1Api()
              apps_v1 = client.AppsV1Api()

              analyzer = GPUClusterCostAnalyzer(os.getenv('PROMETHEUS_URL'))

              while True:
                  try:
                      efficiency = analyzer.get_utilization_efficiency()
                      costs = analyzer.calculate_cluster_costs('24h')
                      
                      if efficiency['efficiency_score'] < int(os.getenv('MIN_EFFICIENCY_THRESHOLD')):
                          # Scale down underutilized workloads
                          print(f"Low efficiency detected: {efficiency['efficiency_score']:.1f}%")
                          # Implement scaling logic here
                      
                      if costs['total_cost'] > int(os.getenv('MAX_COST_PER_DAY')):
                          print(f"Cost threshold exceeded:{costs['total_cost']:.2f}")
                          # Implement cost reduction logic here
                      
                      time.sleep(int(os.getenv('OPTIMIZATION_INTERVAL')))
                  
                  except Exception as e:
                      print(f"Error in cost optimization: {e}")
                      time.sleep(60)

Security for GPU Clusters

GPU Cluster Security Hardening

# Pod Security Standards for GPU Workloads
apiVersion: v1
kind: Namespace
metadata:
  name: secure-gpu-workloads
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Network Policy for the GPU Namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gpu-workload-isolation
  namespace: secure-gpu-workloads
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ml-gateway
        - podSelector:
            matchLabels:
              role: gpu-client
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: model-registry
      ports:
        - protocol: TCP
          port: 443
    - to: []
      ports:
        - protocol: UDP
          port: 53
---
# RBAC for GPU Resources
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-user
  namespace: secure-gpu-workloads
rules:
  - apiGroups: ['']
    resources: ['pods', 'pods/log']
    verbs: ['get', 'list', 'create', 'delete']
  - apiGroups: ['batch']
    resources: ['jobs']
    verbs: ['get', 'list', 'create', 'delete']
  - apiGroups: ['apps']
    resources: ['deployments']
    verbs: ['get', 'list']

GPU Workload Encryption

# Encrypted GPU Workload
apiVersion: batch/v1
kind: Job
metadata:
  name: secure-ml-training
  namespace: secure-gpu-workloads
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: ml-training
          image: secure-ml-runtime:latest
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          resources:
            requests:
              nvidia.com/gpu: 1
              memory: '8Gi'
              cpu: '2'
            limits:
              nvidia.com/gpu: 1
              memory: '16Gi'
              cpu: '4'
          env:
            - name: MODEL_ENCRYPTION_KEY
              valueFrom:
                secretKeyRef:
                  name: model-encryption-secret
                  key: encryption-key
          volumeMounts:
            - name: encrypted-data
              mountPath: /data
              readOnly: true
            - name: tmp-volume
              mountPath: /tmp
            - name: output-volume
              mountPath: /output
      volumes:
        - name: encrypted-data
          secret:
            secretName: encrypted-training-data
        - name: tmp-volume
          emptyDir: {}
        - name: output-volume
          emptyDir: {}
      restartPolicy: Never

Best Practices for Kubernetes GPU Clusters

1. Hardware Planning

# GPU Cluster Hardware Checklist
Hardware Planning:
✅ GPU-to-CPU ratio: 1:4-8 (GPU:CPU cores) - see the sizing sketch below
✅ Memory ratio: 8-16 GB RAM per GPU
✅ Network: 25+ Gbps for training, 10+ Gbps for inference
✅ Storage: NVMe SSD for training data
✅ Cooling: Adequate cooling for GPU nodes
✅ Power: Redundant power supply
✅ InfiniBand: For large-scale training (optional)
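
The first two ratios translate directly into a node sizing rule of thumb; a short sketch (the per-GPU factors are mid-range defaults, not hard requirements):

# Node sizing sketch from the ratios above (per-GPU factors are rules of thumb)
def size_gpu_node(gpu_count, cpu_cores_per_gpu=6, ram_gb_per_gpu=12):
    """Derive a rough CPU/RAM budget for a node with gpu_count GPUs."""
    return {
        "gpus": gpu_count,
        "cpu_cores": gpu_count * cpu_cores_per_gpu,  # within the 1:4-8 GPU:CPU range
        "ram_gb": gpu_count * ram_gb_per_gpu,        # within 8-16 GB RAM per GPU
    }

for gpus in (2, 4, 8):
    print(size_gpu_node(gpus))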

2. Resource Allocation

# Resource Allocation Best Practices
Resource Strategy:
  - Training Nodes: Dedicated GPUs
  - Inference Nodes: GPU Sharing/MIG
  - Development: Time-slicing
  - Batch Jobs: Preemptible Resources
  - Interactive: Guaranteed Resources

3. Monitoring Strategy

# Essential GPU Cluster Metrics
# Cluster GPU Utilization
avg(DCGM_FI_DEV_GPU_UTIL)

# Cluster Memory Efficiency
avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100)

# Cost per Workload
sum(rate(container_gpu_allocation[1h])) * gpu_hourly_cost

# Queue Depth (Pending Pods)
count(kube_pod_status_phase{phase="Pending"} * on(namespace, pod) group_left() kube_pod_info{created_by_kind="Job"})

# Cluster Efficiency Score
(avg(DCGM_FI_DEV_GPU_UTIL) + avg(DCGM_FI_DEV_FB_USED/DCGM_FI_DEV_FB_TOTAL*100)) / 2
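
These queries can also be evaluated programmatically, for example with prometheus_api_client (the same library used in the cost analyzer above); the Prometheus URL is a placeholder:

# Evaluate the essential cluster metrics via the Prometheus HTTP API
from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://prometheus:9090")  # placeholder URL

queries = {
    "gpu_utilization_pct": "avg(DCGM_FI_DEV_GPU_UTIL)",
    "memory_efficiency_pct": "avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100)",
    "efficiency_score": "(avg(DCGM_FI_DEV_GPU_UTIL) + avg(DCGM_FI_DEV_FB_USED/DCGM_FI_DEV_FB_TOTAL*100)) / 2",
}

for name, query in queries.items():
    result = prom.custom_query(query)
    value = float(result[0]["value"][1]) if result else float("nan")
    print(f"{name}: {value:.1f}")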

Troubleshooting GPU Clusters

Common Issues and Solutions

GPU Node Not Ready

# Debug GPU Node Issues
kubectl describe node gpu-worker-1
kubectl get events --field-selector involvedObject.name=gpu-worker-1

# Check GPU Operator Status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-xxx

# Validate GPU driver on the node (a node cannot be exec'd into directly; debug it via a host chroot)
kubectl debug node/gpu-worker-1 -it --image=ubuntu -- chroot /host nvidia-smi
kubectl debug node/gpu-worker-1 -it --image=ubuntu -- chroot /host nvidia-smi -L

GPU Workload Scheduling Issues

# Debug Scheduling
kubectl describe pod gpu-workload-pod
kubectl get events --field-selector involvedObject.name=gpu-workload-pod

# Check Resource Availability
kubectl describe nodes | grep nvidia.com/gpu
kubectl top nodes | grep gpu

# Verify Resource Quotas
kubectl describe resourcequota -n ml-workloads

Performance Issues

# Performance Debugging Queries
# GPU Throttling Detection
DCGM_FI_DEV_GPU_UTIL < 80 and DCGM_FI_DEV_GPU_TEMP > 83

# Memory Bandwidth Utilization
DCGM_FI_DEV_MEM_UTIL

# PCIe Throughput Issues
DCGM_FI_DEV_PCIE_TX_THROUGHPUT + DCGM_FI_DEV_PCIE_RX_THROUGHPUT < expected_throughput

Conclusion: Kubernetes GPU Clusters for German Companies

ROI for German AI/ML companies:

Startups (2-10 GPUs):

  • Setup cost: €50,000-200,000
  • Monthly operating cost: €5,000-20,000
  • ROI break-even: 6-12 months
  • Efficiency gain: 40-60% vs. cloud-only

Enterprise (50-500 GPUs):

  • Setup cost: €500,000-2,000,000
  • Monthly operating cost: €50,000-200,000
  • ROI break-even: 12-18 months
  • Efficiency gain: 60-80% vs. multi-cloud

Implementation roadmap for German companies:

Phase 1 (Weeks 1-4): Hardware Procurement & Installation
Phase 2 (Weeks 5-8): Kubernetes Cluster Setup
Phase 3 (Weeks 9-12): GPU Operator & Monitoring
Phase 4 (Weeks 13-16): Workload Migration & Optimization
Phase 5 (Weeks 17-20): Security Hardening & Compliance
Phase 6 (Weeks 21-24): Automation & Cost Optimization

Compliance for German companies:

  • DSGVO (GDPR): Data residency and privacy controls
  • BSI: Security standards implementation
  • TISAX: Automotive industry compliance
  • ISO 27001: Information security management

Need support with your Kubernetes GPU cluster setup? Our AI infrastructure experts help German companies plan, implement, and optimize production-ready GPU clusters. Contact us for a free GPU cluster consultation.
