AKS GPU Workloads: The Ultimate Cost Calculator for German Companies

GPU-accelerated workloads are key to modern machine learning, AI, and high-performance computing applications. Azure Kubernetes Service (AKS) gives German companies a flexible, scalable platform for these compute-intensive tasks.

In this comprehensive guide, we show you how to plan, implement, and budget GPU workloads on AKS, including an interactive cost calculator for realistic project estimates.

What Are GPU Workloads on Kubernetes?

GPU workloads use the parallel processing power of graphics processing units (GPUs) for compute-intensive tasks that go far beyond traditional graphics work:

Typical GPU use cases:

🤖 Machine learning training: deep learning models with TensorFlow, PyTorch
🔬 Scientific computing: simulations, data analysis, research
📊 Data processing: GPU-accelerated analytics with RAPIDS, Apache Spark
🎨 Rendering & visualization: 3D rendering, CAD, video processing
💹 Financial modeling: risk simulations, algorithmic trading
🔍 AI inference: real-time model serving for production applications

Why Kubernetes for GPU Workloads?

# Kubernetes GPU Resource Management
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: tensorflow
      image: tensorflow/tensorflow:latest-gpu
      resources:
        limits:
          nvidia.com/gpu: 2 # request 2 GPUs
        requests:
          nvidia.com/gpu: 2
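The request above follows the rules Kubernetes documents for GPUs as extended resources: only whole GPUs can be requested, a limit is required, and a request, if present, must equal the limit. A minimal sketch of these checks, assuming the `resources` section has already been parsed from YAML into a dict with numeric values as in the manifest above; `validate_gpu_resources` is a hypothetical helper, not part of any Kubernetes library.

```python
from typing import Any, Dict

GPU_KEY = "nvidia.com/gpu"


def validate_gpu_resources(resources: Dict[str, Any], key: str = GPU_KEY) -> int:
    """Return the number of GPUs a container asks for, or raise ValueError
    if the resources section breaks the Kubernetes rules for GPUs."""
    limits = resources.get("limits") or {}
    requests = resources.get("requests") or {}

    # Kubernetes does not allow a GPU request without a GPU limit.
    if key in requests and key not in limits:
        raise ValueError(f"{key} requested without a limit")
    if key not in limits:
        return 0  # container does not use GPUs

    limit = limits[key]
    # GPUs are whole devices: fractional values such as 0.5 are invalid.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
        raise ValueError(f"{key} limit must be a non-negative integer, got {limit!r}")

    # If both are set, the request must equal the limit.
    if key in requests and requests[key] != limit:
        raise ValueError(f"{key} request {requests[key]!r} != limit {limit}")
    return limit


if __name__ == "__main__":
    # Same values as the gpu-pod manifest above
    pod_resources = {"limits": {GPU_KEY: 2}, "requests": {GPU_KEY: 2}}
    print(validate_gpu_resources(pod_resources))  # prints 2
```

Because the limit doubles as the request when only `limits` is given, specifying the GPU count once under `limits` is enough.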

Advantages of Kubernetes for GPU computing:

✅ Automatic orchestration: automated GPU allocation and scheduling
✅ Scalability: from single jobs to large-scale distributed training
✅ Resource sharing: efficient use of expensive GPU hardware
✅ Multi-tenancy: different teams can share GPU resources
✅ CI/CD integration: automated ML pipelines with GPU-accelerated training

Azure AKS GPU Options at a Glance

Azure offers several GPU-optimized VM series for different requirements:

Tesla V100 series (NCv3)

Ideal for: high-performance ML training, scientific computing

VM Size	GPUs	GPU Memory	vCPUs	RAM	Price/h (West Europe)
Standard_NC6s_v3	1x V100	16 GB	6	112 GB	€3.06
Standard_NC12s_v3	2x V100	32 GB	12	224 GB	€6.12
Standard_NC24s_v3	4x V100	64 GB	24	448 GB	€12.24

Tesla T4 series (NCasT4_v3)

Ideal for: ML inference, cost-efficient workloads

VM Size	GPUs	GPU Memory	vCPUs	RAM	Price/h (West Europe)
Standard_NC4as_T4_v3	1x T4	16 GB	4	28 GB	€0.53
Standard_NC16as_T4_v3	1x T4	16 GB	16	110 GB	€2.10

A100 series (NDv4) - enterprise grade

Ideal for: large-scale training, maximum performance

VM Size	GPUs	GPU Memory	vCPUs	RAM	Price/h (West Europe)
Standard_ND96asr_v4	8x A100	320 GB	96	900 GB	€27.76
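Because the VM sizes bundle different numbers of GPUs, the price per GPU-hour is often more telling than the VM price. A small sketch using the prices from the tables above (estimates that will change over time); `VM_PRICES` and `price_per_gpu_hour` are illustrative names, not an Azure API.

```python
# GPU count and hourly price in EUR (West Europe) per VM size, taken from the
# tables above. Prices are estimates and change over time.
VM_PRICES = {
    "Standard_NC6s_v3": {"gpus": 1, "eur_per_hour": 3.06},
    "Standard_NC12s_v3": {"gpus": 2, "eur_per_hour": 6.12},
    "Standard_NC24s_v3": {"gpus": 4, "eur_per_hour": 12.24},
    "Standard_NC4as_T4_v3": {"gpus": 1, "eur_per_hour": 0.53},
    "Standard_ND96asr_v4": {"gpus": 8, "eur_per_hour": 27.76},
}


def price_per_gpu_hour(vm_size: str) -> float:
    """Hourly VM price divided by the number of GPUs in the VM."""
    vm = VM_PRICES[vm_size]
    return vm["eur_per_hour"] / vm["gpus"]


if __name__ == "__main__":
    # Cheapest GPU-hour first
    for name in sorted(VM_PRICES, key=price_per_gpu_hour):
        print(f"{name:<22} {price_per_gpu_hour(name):6.2f} EUR per GPU-hour")
```

With these numbers all three V100 sizes come out at €3.06 per GPU-hour, so the larger sizes buy more GPUs (plus vCPUs and RAM) per node rather than cheaper GPUs; the A100 size works out to €3.47 per GPU-hour.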

🚀 AKS GPU Workload Cost Calculator

Calculate the costs of GPU-based Kubernetes workloads on Azure AKS. Prices are based on current Azure rates for customers in Germany.

Note: The prices are estimates based on current Azure rates (as of 2024) and may vary. For exact prices, use the official Azure pricing calculator. Additional costs for bandwidth, storage, and supplementary services are not included.
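The arithmetic behind the calculator (hourly price × number of nodes, then × 24 hours per day and × 30 days per month) can be reproduced in a few lines of Python. The prices are the example values from the tables above; `aks_overhead_per_hour` is a hypothetical flat surcharge for AKS-related extras that you have to fill in yourself, and `hours_per_day` lets you model pools that only run part of the day.

```python
from dataclasses import dataclass

# Example hourly prices in EUR (West Europe) from the tables above.
# They are estimates: check the official Azure pricing calculator before budgeting.
HOURLY_PRICE_EUR = {
    "Standard_NC4as_T4_v3": 0.53,
    "Standard_NC6s_v3": 3.06,
    "Standard_NC12s_v3": 6.12,
    "Standard_NC24s_v3": 12.24,
    "Standard_ND96asr_v4": 27.76,
}


@dataclass
class CostEstimate:
    hourly: float
    daily: float
    monthly: float


def estimate_gpu_cost(vm_size: str, node_count: int,
                      aks_overhead_per_hour: float = 0.0,
                      hours_per_day: float = 24.0,
                      days_per_month: int = 30) -> CostEstimate:
    """Estimate the cost of a GPU node pool in EUR.

    aks_overhead_per_hour is a hypothetical flat surcharge for AKS extras
    (e.g. a paid control-plane tier or monitoring); set it from your own data.
    Bandwidth and storage are not included, as in the calculator.
    """
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    hourly = HOURLY_PRICE_EUR[vm_size] * node_count + aks_overhead_per_hour
    daily = hourly * hours_per_day
    monthly = daily * days_per_month
    return CostEstimate(round(hourly, 2), round(daily, 2), round(monthly, 2))


if __name__ == "__main__":
    # Two Standard_NC6s_v3 nodes running around the clock
    print(estimate_gpu_cost("Standard_NC6s_v3", node_count=2))
```

For two Standard_NC6s_v3 nodes this gives €6.12 per hour, €146.88 per day, and €4,406.40 per month; a single T4 node used 8 hours a day comes to about €127.20 per month.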

Implementing GPU Workloads on AKS

1. Create an AKS cluster with GPU support

# Create the resource group
az group create --name aks-gpu-rg --location westeurope

# AKS cluster with a small system node pool (the GPU pool is added below)
az aks create \
  --resource-group aks-gpu-rg \
  --name aks-gpu-cluster \
  --node-count 1 \
  --node-vm-size Standard_D2s_v3 \
  --enable-addons monitoring \
  --generate-ssh-keys

# Add the GPU node pool
az aks nodepool add \
  --resource-group aks-gpu-rg \
  --cluster-name aks-gpu-cluster \
  --name gpunodepool \
  --node-count 2 \
  --node-vm-size Standard_NC6s_v3 \
  --node-taints nvidia.com/gpu=true:NoSchedule

2. Install the NVIDIA device plugin

# nvidia-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin-ctr
          image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:v0.14.1
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ['ALL']
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
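Once the device plugin is running, each GPU node should advertise `nvidia.com/gpu` in its allocatable resources. A sketch that checks this with the official `kubernetes` Python client, assuming it is installed and a kubeconfig for the cluster exists (e.g. from `az aks get-credentials`); the parsing sits in a plain function so it can be tried without a cluster.

```python
from typing import Dict, Iterable, Mapping, Tuple

GPU_KEY = "nvidia.com/gpu"


def gpus_per_node(nodes: Iterable[Tuple[str, Mapping[str, str]]]) -> Dict[str, int]:
    """Map node name -> allocatable GPUs; nodes without the resource get 0.

    Kubernetes reports allocatable quantities as strings, e.g. {"nvidia.com/gpu": "1"}.
    """
    return {name: int(allocatable.get(GPU_KEY, "0")) for name, allocatable in nodes}


def main() -> None:
    # Imported here so gpus_per_node() works without the client installed
    from kubernetes import client, config

    config.load_kube_config()  # uses the current kubectl context
    v1 = client.CoreV1Api()
    nodes = [(n.metadata.name, n.status.allocatable or {}) for n in v1.list_node().items]
    for name, count in sorted(gpus_per_node(nodes).items()):
        print(f"{name}: {count} GPU(s)")
```

Calling `main()` against the cluster from step 1 should list the two Standard_NC6s_v3 nodes with 1 GPU each; a 0 on a GPU node usually means the device plugin pod is not running there.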

3. Deploy a machine learning workload

# tensorflow-gpu-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-mnist-gpu
spec:
  template:
    spec:
      containers:
        - name: tensorflow
          image: tensorflow/tensorflow:latest-gpu
          command:
            - python
            - -c
            - |
              import tensorflow as tf
              print("TensorFlow version:", tf.__version__)
              print("GPU Available: ", tf.config.list_physical_devices('GPU'))

              # MNIST Training Example
              mnist = tf.keras.datasets.mnist
              (x_train, y_train), (x_test, y_test) = mnist.load_data()
              x_train, x_test = x_train / 255.0, x_test / 255.0

              model = tf.keras.models.Sequential([
                tf.keras.layers.Flatten(input_shape=(28, 28)),
                tf.keras.layers.Dense(128, activation='relu'),
                tf.keras.layers.Dropout(0.2),
                tf.keras.layers.Dense(10)
              ])

              model.compile(optimizer='adam',
                           loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                           metrics=['accuracy'])

              model.fit(x_train, y_train, epochs=5, batch_size=512)

          resources:
            limits:
              nvidia.com/gpu: 1
              memory: '8Gi'
              cpu: '4'
            requests:
              nvidia.com/gpu: 1
              memory: '4Gi'
              cpu: '2'
      restartPolicy: Never
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

Performance Optimization for GPU Workloads

1. GPU Memory Management

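# Enable TensorFlow GPU memory growth (allocate GPU memory on demand instead of
# reserving all of it at startup); must run before any GPU is initialized
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Raised if the GPUs were already initialized
        print(e)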

2. Batch Size Optimization

# Find the largest batch size that fits into GPU memory
import tensorflow as tf


def find_optimal_batch_size(model, initial_batch_size=32, max_batch_size=4096,
                            input_shape=(224, 224, 3), num_classes=1000):
    """Doubles the batch size until GPU memory runs out and returns the
    largest size that still fit (None if even initial_batch_size fails).
    input_shape/num_classes must match the model (ImageNet-style defaults)."""
    batch_size = initial_batch_size
    last_ok = None

    while batch_size <= max_batch_size:
        try:
            # Test training step on random data: forward AND backward pass,
            # because gradients also occupy GPU memory
            dummy_data = tf.random.normal((batch_size, *input_shape))
            dummy_labels = tf.random.uniform((batch_size,), maxval=num_classes, dtype=tf.int32)

            with tf.GradientTape() as tape:
                predictions = model(dummy_data, training=True)
                loss = tf.reduce_mean(
                    tf.keras.losses.sparse_categorical_crossentropy(dummy_labels, predictions))
            tape.gradient(loss, model.trainable_variables)

            print(f"✅ Batch size {batch_size} works")
            last_ok = batch_size
            batch_size *= 2  # success: try a larger batch size

        except tf.errors.ResourceExhaustedError:
            # GPU memory exhausted: keep the last batch size that fit
            break

    print(f"🎯 Optimal batch size: {last_ok}")
    return last_ok
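The doubling loop only tests the start value times powers of two. If the real limit lies in between, a binary search between the last size that fit and the first that failed narrows it down. A framework-independent sketch: `fits` is a hypothetical callback that runs one test step (for TensorFlow, the try/except body of `find_optimal_batch_size`) and returns False on out-of-memory; it assumes that if a size fits, every smaller size fits too.

```python
from typing import Callable, Optional


def max_batch_size(fits: Callable[[int], bool], start: int = 32,
                   cap: int = 4096) -> Optional[int]:
    """Largest batch size <= cap for which fits(batch_size) is True.

    Phase 1 doubles the size until it fails (like the loop above),
    phase 2 binary-searches between the last success and the first failure.
    """
    if start > cap:
        raise ValueError("start must not exceed cap")
    if not fits(start):
        return None
    low, high = start, None  # low always fits; high (once set) does not or exceeds cap
    while high is None:
        candidate = low * 2
        if candidate > cap:
            high = cap + 1  # everything above the cap counts as "does not fit"
        elif fits(candidate):
            low = candidate
        else:
            high = candidate
    # Binary search on the open interval (low, high)
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low
```

With a simulated limit, `max_batch_size(lambda b: b <= 700)` doubles from 32 to 512, fails at 1024, and then settles on 700. Each call to `fits` costs one training step, so the refinement adds roughly log2(last success) extra steps.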

3. Multi-GPU Training Setup

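# Distributed training strategy: replicates the model on all local GPUs
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")

# create_model(), train_dataset and val_dataset are placeholders for your own
# model factory and tf.data pipelines; the model must be built inside the scope
with strategy.scope():
    model = create_model()
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

# Training: each global batch is split automatically across the replicas
model.fit(train_dataset, epochs=10, validation_data=val_dataset)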
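With `MirroredStrategy` and `model.fit`, the batch size of the input dataset is the global batch, which is split across `strategy.num_replicas_in_sync` GPUs. A common approach is to fix the per-GPU batch (for example the result of `find_optimal_batch_size`) and scale the global batch with the number of GPUs. A sketch of that arithmetic; the helper names are illustrative.

```python
def global_batch_size(per_replica_batch: int, num_replicas: int) -> int:
    """Batch size to pass to dataset.batch() so each GPU gets per_replica_batch."""
    if per_replica_batch < 1 or num_replicas < 1:
        raise ValueError("batch size and replica count must be positive")
    return per_replica_batch * num_replicas


def per_replica_batch_size(global_batch: int, num_replicas: int) -> int:
    """Examples each GPU processes per step.

    Raises if the global batch does not divide evenly; this is a design choice
    of the sketch so that every replica gets the same share.
    """
    if global_batch % num_replicas:
        raise ValueError(f"{global_batch} is not divisible by {num_replicas} replicas")
    return global_batch // num_replicas


if __name__ == "__main__":
    # e.g. a Standard_NC24s_v3 node with 4 V100 GPUs
    print(global_batch_size(64, 4))        # 256
    print(per_replica_batch_size(256, 4))  # 64
```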

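Cost Optimization Strategies

1. Use Spot instances

# AKS node pool with Spot instances
# --spot-max-price is in US dollars per hour; -1 means "up to the on-demand price"
# AKS taints Spot nodes with kubernetes.azure.com/scalesetpriority=spot:NoSchedule,
# so pods scheduled there need a matching toleration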
az aks nodepool add \
  --resource-group aks-gpu-rg \
  --cluster-name aks-gpu-cluster \
  --name spotgpupool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price 0.5 \
  --node-vm-size Standard_NC6s_v3 \
  --node-count 3 \
  --min-count 0 \
  --max-count 10 \
  --enable-cluster-autoscaler

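Spot instance benefits:

✅ 60-90% cost savings compared with on-demand pricing
✅ Well suited for training jobs: they can tolerate interruptions, provided they checkpoint regularly
✅ Fallback to regular instances: a regular node pool alongside the Spot pool can take over pods after an eviction (this requires matching tolerations and node affinity and is not automatic)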
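Whether Spot pays off depends not only on the discount but also on how much work is lost to evictions (computation since the last checkpoint that has to be redone). A back-of-the-envelope sketch; the discount and the lost-work fraction are assumptions you have to estimate for your own workload, and €3.06 is the Standard_NC6s_v3 example price from the table above.

```python
def effective_spot_cost(on_demand_per_hour: float, discount: float,
                        lost_work_fraction: float) -> float:
    """Cost per hour of *useful* GPU work on Spot.

    discount: Spot discount vs. on-demand (0.6 = 60 % cheaper).
    lost_work_fraction: extra compute caused by evictions (0.2 = 20 % redone).
    """
    if not 0 <= discount < 1 or lost_work_fraction < 0:
        raise ValueError("invalid discount or lost-work fraction")
    spot_price = on_demand_per_hour * (1 - discount)
    return spot_price * (1 + lost_work_fraction)


def break_even_lost_work(discount: float) -> float:
    """Lost-work fraction at which Spot costs as much as on-demand.

    Solves (1 - discount) * (1 + f) = 1 for f.
    """
    return 1 / (1 - discount) - 1


if __name__ == "__main__":
    print(f"{effective_spot_cost(3.06, 0.6, 0.2):.4f} EUR per useful GPU-hour")
    print(f"break-even at {break_even_lost_work(0.6):.0%} redone work")
```

At a 60% discount and 20% redone work, the useful GPU-hour costs about €1.47 instead of €3.06; evictions would have to force 150% of the work to be redone before Spot became more expensive than on-demand. Frequent checkpointing keeps the lost-work fraction small.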

2. GPU Sharing

# NVIDIA MPS (Multi-Process Service) for GPU sharing
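apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-mps-config
data:
  start_mps.sh: |
    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=0
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
    export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
    mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
    # -f keeps the control daemon in the foreground; with -d the script would
    # exit at once and Kubernetes would restart the container in a loop
    exec nvidia-cuda-mps-control -f

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps
spec:
  selector:
    matchLabels:
      name: nvidia-mps
  template:
    metadata:
      labels:
        name: nvidia-mps # must match spec.selector
    spec:
      # MPS server and clients share IPC resources, so use the host IPC namespace
      hostIPC: true
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-mps
          image: nvidia/cuda:11.8.0-runtime-ubuntu20.04
          command: ['/bin/bash', '/scripts/start_mps.sh']
          securityContext:
            privileged: true
          volumeMounts:
            - name: nvidia-mps-config
              mountPath: /scripts
            # Pipe directory on the host, so client pods can mount it as well
            - name: mps-pipe
              mountPath: /tmp/nvidia-mps
      volumes:
        - name: nvidia-mps-config
          configMap:
            name: nvidia-mps-config
            defaultMode: 0755
        - name: mps-pipe
          hostPath:
            path: /tmp/nvidia-mps
            type: DirectoryOrCreate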

3. Intelligent Scheduling

# GPU workload scheduling with priorities
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-gpu
value: 1000
globalDefault: false
description: 'High priority for production ML inference'

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-gpu
value: 100
description: 'Low priority for training jobs'

---
# Production inference pod (high priority): may preempt low-priority pods
apiVersion: v1
kind: Pod
metadata:
  name: ml-inference-prod
spec:
  priorityClassName: high-priority-gpu
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: inference
      image: tensorflow/serving:latest-gpu
      resources:
        limits:
          nvidia.com/gpu: 1
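If no GPU is free when `ml-inference-prod` is scheduled, the scheduler may preempt (evict) pods with a lower priority, such as training jobs in `low-priority-gpu`. A heavily simplified sketch of that idea for a single node, assuming victims are chosen lowest priority first until enough GPUs are free; the real kube-scheduler preemption also weighs PodDisruptionBudgets, multiple candidate nodes, and graceful termination. `RunningPod` and `pick_victims` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningPod:
    name: str
    priority: int  # value of the pod's PriorityClass
    gpus: int


def pick_victims(pods: List[RunningPod], free_gpus: int, needed_gpus: int,
                 incoming_priority: int) -> Optional[List[RunningPod]]:
    """Return pods to evict so that needed_gpus become free, or None if impossible.

    Only GPU-holding pods with a strictly lower priority are candidates;
    the lowest priorities are evicted first. Simplified, not kube-scheduler.
    """
    if free_gpus >= needed_gpus:
        return []  # fits without preemption
    candidates = sorted(
        (p for p in pods if p.priority < incoming_priority and p.gpus > 0),
        key=lambda p: p.priority,
    )
    victims: List[RunningPod] = []
    for pod in candidates:
        victims.append(pod)
        free_gpus += pod.gpus
        if free_gpus >= needed_gpus:
            return victims
    return None  # even evicting all candidates would not free enough GPUs
```

For a node running two training pods (priority 100) and one inference pod (priority 1000) with one GPU each, a new priority-1000 pod needing one GPU evicts a single training pod; a request for three GPUs cannot be satisfied, because the equal-priority inference pod is never a candidate.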

Monitoring and Observability

1. Collect GPU metrics

# NVIDIA DCGM Exporter for Prometheus
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
spec:
  selector:
    matchLabels:
      name: nvidia-dcgm-exporter
  template:
    metadata:
      labels:
        name: nvidia-dcgm-exporter
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
          ports:
            - name: metrics
              containerPort: 9400
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
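Once Prometheus scrapes the exporter, `DCGM_FI_DEV_GPU_UTIL` (GPU utilization in percent) can be queried through the standard Prometheus HTTP API (`/api/v1/query`) to spot GPUs that cost money while sitting idle. A sketch using only the Python standard library; the Prometheus URL, the 10% threshold, and the 1-hour window are assumptions to adapt.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List, Tuple

# Hypothetical in-cluster address; adjust to your Prometheus service.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
QUERY = "avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h])"


def idle_gpus(response: Dict[str, Any], threshold: float = 10.0) -> List[Tuple[str, str, float]]:
    """Parse an instant-query response and return (instance, gpu, util) below threshold."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    idle = []
    for series in response["data"]["result"]:
        labels = series["metric"]
        value = float(series["value"][1])  # value is [timestamp, "number as string"]
        if value < threshold:
            idle.append((labels.get("instance", "?"), labels.get("gpu", "?"), value))
    return idle


def query_prometheus(query: str) -> Dict[str, Any]:
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def report_idle_gpus(threshold: float = 10.0) -> None:
    """Print GPUs whose average utilization over the last hour is below threshold."""
    for instance, gpu, util in idle_gpus(query_prometheus(QUERY), threshold):
        print(f"{instance} GPU {gpu}: {util:.1f} % average utilization")
```

Call `report_idle_gpus()` from a pod or machine that can reach Prometheus; idle GPU nodes are natural candidates for scaling the node pool down.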

2. Custom Dashboards

# Grafana Dashboard ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-dashboard
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "AKS GPU Monitoring",
        "panels": [
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_GPU_UTIL",
                "legendFormat": "GPU {{gpu}} - {{instance}}"
              }
            ]
          },
          {
            "title": "GPU Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_FB_USED",
                "legendFormat": "Memory used (MiB) {{gpu}} - {{instance}}"
              }
            ]
          }
        ]
      }
    }

Security Best Practices for GPU Workloads

1. Container Security

# Secure GPU Pod Configuration
apiVersion: v1
kind: Pod
metadata:
  name: secure-gpu-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: gpu-app
      image: tensorflow/tensorflow:latest-gpu
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: '4Gi'
          cpu: '2'
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: model-cache
          mountPath: /app/models
  volumes:
    - name: tmp
      emptyDir: {}
    - name: model-cache
      emptyDir: {}
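The settings in this manifest can also be checked automatically, for example in a CI step before deployment. A minimal sketch, assuming the pod manifest has been loaded into a Python dict (e.g. with a YAML parser); `check_pod_security` is a hypothetical helper that covers only the fields used above, not a full policy engine such as the Kubernetes Pod Security Standards.

```python
from typing import Any, Dict, List


def check_pod_security(pod: Dict[str, Any]) -> List[str]:
    """Return a list of findings; an empty list means all checks passed."""
    findings: List[str] = []
    spec = pod.get("spec", {})
    pod_ctx = spec.get("securityContext", {})
    for container in spec.get("containers", []):
        name = container.get("name", "?")
        ctx = container.get("securityContext", {})
        # The container-level value overrides the pod-level value.
        if not ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False)):
            findings.append(f"{name}: runAsNonRoot is not true")
        # Privilege escalation is allowed unless explicitly disabled.
        if ctx.get("allowPrivilegeEscalation", True):
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
        if ctx.get("privileged", False):
            findings.append(f"{name}: runs privileged")
        if not ctx.get("readOnlyRootFilesystem", False):
            findings.append(f"{name}: root filesystem is writable")
        if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
            findings.append(f"{name}: capabilities are not dropped (drop: [ALL])")
    return findings
```

The `nvidia-mps` DaemonSet in the GPU sharing section deliberately runs privileged, so a check like this would flag it; system components of that kind are usually exempted explicitly rather than silently.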

2. Network Policies

# GPU Workload Network Policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gpu-workload-netpol
spec:
  podSelector:
    matchLabels:
      workload-type: gpu
  policyTypes:
    - Ingress
    - Egress
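  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # label set automatically by Kubernetes on every namespace
              kubernetes.io/metadata.name: ml-platform
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: data-storage
      ports:
        - protocol: TCP
          port: 443
    # Allow DNS lookups; without this rule, name resolution is blocked as well
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53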

Real-World Use Cases and Success Stories

Case Study 1: E-Commerce Recommendation Engine

Challenge: A German e-commerce platform needed real-time product recommendations for 100,000+ concurrent users.

Solution:

4x Tesla T4 GPUs for ML inference
Kubernetes HPA for automatic scaling
NGINX Ingress for load balancing across the GPU inference pods

Results:

✅ 95% latency reduction: from 500 ms to 25 ms response time
✅ 60% cost savings compared with a CPU-only solution
✅ 10x better conversion through personalized recommendations

# Production Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: recommendation-engine
spec:
  replicas: 4
  selector:
    matchLabels:
      app: recommendation-engine
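  template:
    metadata:
      labels:
        app: recommendation-engine
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule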
      containers:
        - name: inference
          image: company/recommendation-model:v2.1
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: '8Gi'
            requests:
              nvidia.com/gpu: 1
              memory: '4Gi'
          env:
            - name: MODEL_PATH
              value: '/models/recommendation-v2.1'
            - name: BATCH_SIZE
              value: '64'
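The HPA mentioned in the solution computes its target with the formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), and does not scale while the ratio stays within a tolerance (0.1 by default). A GPU metric such as `DCGM_FI_DEV_GPU_UTIL` is not available to the HPA out of the box; it needs a custom-metrics adapter (for example prometheus-adapter). Because every replica here requests a whole GPU, the result translates directly into GPU nodes. A sketch; the metric values in the example are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """HPA scaling rule: ceil(current * current_metric / target_metric).

    No change if the ratio is within the tolerance (default 0.1, as in Kubernetes).
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def gpu_nodes_needed(replicas: int, gpus_per_replica: int = 1,
                     gpus_per_node: int = 1) -> int:
    """Nodes required when every replica needs whole GPUs on a single node."""
    replicas_per_node = gpus_per_node // gpus_per_replica
    if replicas_per_node == 0:
        raise ValueError("a replica needs more GPUs than one node provides")
    return math.ceil(replicas / replicas_per_node)
```

With the 4 replicas above running at 90% average GPU utilization against a 60% target, the HPA would scale to 6 replicas, which means 6 single-GPU nodes such as Standard_NC4as_T4_v3; the cluster autoscaler has to add those nodes before the new replicas can start.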

Case Study 2: Fintech Risk Modeling

Challenge: A German bank needed complex risk simulations for regulatory compliance.