Authors: Phillip Pham (@ddppham)
Kubernetes GPU Cluster: The Ultimate Guide for German AI/ML Infrastructure in 2025
A Kubernetes GPU cluster is the foundation for scalable AI/ML infrastructure in German companies. This comprehensive guide shows you how to build, manage, and optimize production-ready GPU clusters, from hardware planning to automatic scaling.
What Is a Kubernetes GPU Cluster?
A Kubernetes GPU cluster is an orchestrated collection of GPU-capable nodes that process AI/ML workloads together. It enables efficient resource utilization, automatic scaling, and centralized management of GPU infrastructure.
Why Kubernetes GPU Clusters for German Companies?
- Scalability: from single GPUs to hundreds of nodes
- Cost efficiency: optimal utilization of expensive GPU hardware
- Multi-tenancy: secure resource sharing between teams
- Compliance: GDPR-compliant AI infrastructure
Kubernetes GPU Cluster Architecture
Multi-Node GPU Cluster Design
# Cluster Architecture Overview
Cluster Components:
├── Master Nodes (3x)
│ ├── API Server
│ ├── etcd Cluster
│ ├── Scheduler (GPU-aware)
│ └── Controller Manager
├── GPU Worker Nodes (N x)
│ ├── NVIDIA GPU Driver
│ ├── NVIDIA Container Runtime
│ ├── GPU Device Plugin
│ └── DCGM Exporter
├── Storage Nodes
│ ├── High-Performance SSD
│ ├── Network Attached Storage
│ └── Distributed File System
└── Network Infrastructure
├── High-Bandwidth Networking
├── InfiniBand (optional)
└── Load Balancers
GPU Node Specifications
# GPU Node Hardware Specifications
GPU Node Types:
Training Nodes:
- CPUs: 32-64 cores
- RAM: 256-512 GB
- GPUs: 4-8x NVIDIA A100/H100
- Network: 100 Gbps
- Storage: NVMe SSD
Inference Nodes:
- CPUs: 16-32 cores
- RAM: 128-256 GB
- GPUs: 2-4x NVIDIA T4/L4
- Network: 25-50 Gbps
- Storage: SSD
Development Nodes:
- CPUs: 8-16 cores
- RAM: 64-128 GB
- GPUs: 1-2x NVIDIA RTX/Tesla
- Network: 10-25 Gbps
- Storage: Mixed
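To keep workloads on the pool they were sized for, the node types above are typically labeled and tainted. The following is a minimal sketch of a pod pinned to the training pool; the `node-type: gpu-training` label and the `nvidia.com/gpu` taint are assumed naming conventions (the serving example later in this guide uses the matching `node-type: gpu-inference` label):
# Example pod pinned to the training pool (label and taint names are assumed conventions)
apiVersion: v1
kind: Pod
metadata:
  name: training-pool-smoke-test
spec:
  nodeSelector:
    node-type: gpu-training          # label applied to training nodes
  tolerations:
    - key: nvidia.com/gpu            # taint that keeps non-GPU pods off GPU nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.0-runtime-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  restartPolicy: Never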
Kubernetes GPU Cluster Setup
1. Cluster Initialization with kubeadm
# Master Node Setup
sudo kubeadm init \
--pod-network-cidr=10.244.0.0/16 \
--service-cidr=10.96.0.0/12 \
--kubernetes-version=v1.28.0 \
--upload-certs \
--control-plane-endpoint=gpu-cluster.company.de:6443
# Save the cluster configuration
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# High-Availability Master Setup
sudo kubeadm join gpu-cluster.company.de:6443 \
--token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:... \
--control-plane \
--certificate-key ...
2. Adding GPU Worker Nodes
# GPU Node Prerequisites
# NVIDIA Driver Installation
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit nvidia-driver-525
# Container Runtime Configuration
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
# Join GPU Node to Cluster
sudo kubeadm join gpu-cluster.company.de:6443 \
--token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:...
3. NVIDIA GPU Operator Installation
# Add the Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
# Install the GPU Operator on the cluster
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set nodeFeatureDiscovery.enabled=true \
--set migManager.enabled=true \
--set operator.defaultRuntime=containerd \
--set validator.plugin.env[0].name=WITH_WORKLOAD \
--set-string validator.plugin.env[0].value=true # env values must be strings
4. GPU Cluster Validation
# Check GPU nodes
kubectl get nodes -l nvidia.com/gpu.present=true
# Show GPU resources
kubectl describe nodes | grep nvidia.com/gpu
# GPU Operator Status
kubectl get pods -n gpu-operator
# Test GPU Workload
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: gpu-test
spec:
containers:
- name: gpu-test
image: nvidia/cuda:12.0-runtime-ubuntu20.04
command: ["nvidia-smi"]
resources:
limits:
nvidia.com/gpu: 1
restartPolicy: Never
EOF
# Test Results
kubectl logs gpu-test
Multi-Node GPU Cluster Networking
High-Performance Networking Setup
# Cluster Network Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-network-config
namespace: kube-system
data:
cni-config: |
{
"cniVersion": "0.4.0",
"name": "gpu-cluster-network",
"plugins": [
{
"type": "calico",
"datastore_type": "kubernetes",
"mtu": 9000,
"nodename_file_optional": false,
"ipam": {
"type": "calico-ipam",
"assign_ipv4": "true",
"assign_ipv6": "false"
},
"container_settings": {
"allow_ip_forwarding": true
}
},
{
"type": "bandwidth",
"capabilities": {
"bandwidth": true
}
}
]
}
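The bandwidth plugin at the end of the chain only takes effect for pods that opt in via traffic-shaping annotations. A minimal sketch (the rate values are illustrative):
# Pod with traffic shaping via the bandwidth CNI plugin
apiVersion: v1
kind: Pod
metadata:
  name: bandwidth-limited-worker
  annotations:
    kubernetes.io/ingress-bandwidth: 10G   # cap incoming traffic
    kubernetes.io/egress-bandwidth: 10G    # cap outgoing traffic
spec:
  containers:
    - name: worker
      image: nvidia/cuda:12.0-runtime-ubuntu20.04
      command: ["sleep", "infinity"]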
InfiniBand Integration (Enterprise)
# InfiniBand Device Plugin
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: infiniband-device-plugin
namespace: kube-system
spec:
selector:
matchLabels:
app: infiniband-device-plugin
template:
metadata:
labels:
app: infiniband-device-plugin
spec:
nodeSelector:
infiniband.present: 'true'
containers:
- name: infiniband-device-plugin
image: mellanox/ib-kubernetes:latest
securityContext:
privileged: true
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
- name: dev
mountPath: /dev
- name: sys
mountPath: /sys
volumes:
- name: device-plugin
hostPath:
path: /var/lib/kubelet/device-plugins
- name: dev
hostPath:
path: /dev
- name: sys
hostPath:
path: /sys
hostNetwork: true
GPU Resource Management
GPU Sharing Strategies
# Multi-Instance GPU (MIG) Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: mig-config
namespace: gpu-operator
data:
config.yaml: |
version: v1
mig-configs:
all-1g.5gb:
- devices: all
mig-enabled: true
mig-devices:
1g.5gb: 7
all-2g.10gb:
- devices: all
mig-enabled: true
mig-devices:
2g.10gb: 3
mixed:
- devices: [0,1]
mig-enabled: true
mig-devices:
1g.5gb: 4
2g.10gb: 1
- devices: [2,3]
mig-enabled: false
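Which profile from this ConfigMap a node applies is usually selected per node, for example via the `nvidia.com/mig.config` label handled by the MIG manager. Once a profile is active, the slices appear as their own extended resources (the same `nvidia.com/mig-*` names used in the resource quotas below). A minimal sketch of a pod that consumes a single 1g.5gb slice instead of a full GPU:
# Pod requesting one MIG slice instead of a full GPU
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-test
spec:
  containers:
    - name: inference
      image: nvidia/cuda:12.0-runtime-ubuntu20.04
      command: ["nvidia-smi", "-L"]        # lists the MIG device visible in the container
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
  restartPolicy: Never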
Time-Slicing for GPU Sharing
# GPU Time-Slicing Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: gpu-sharing-config
namespace: gpu-operator
data:
config.yaml: |
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 4 # 4 pods per GPU
- name: nvidia.com/mig-1g.5gb
replicas: 2
- name: nvidia.com/mig-2g.10gb
replicas: 1
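With `replicas: 4`, every physical GPU is advertised as four schedulable `nvidia.com/gpu` resources, so four pods can share one card. A minimal sketch of a deployment relying on this (the image is a stand-in for a real inference server); note that time-slicing provides no memory isolation between the sharing pods, which is why MIG is the safer choice for untrusted tenants:
# Four replicas can land on a single time-sliced GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-gpu-inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: shared-gpu-inference
  template:
    metadata:
      labels:
        app: shared-gpu-inference
    spec:
      containers:
        - name: inference
          image: nvidia/cuda:12.0-runtime-ubuntu20.04   # stand-in for a real inference image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1                         # one time-sliced replica, not a dedicated GPU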
GPU Resource Quotas
# Namespace GPU Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
name: gpu-quota-ml-team
namespace: ml-team
spec:
hard:
nvidia.com/gpu: '16'
nvidia.com/mig-1g.5gb: '32'
nvidia.com/mig-2g.10gb: '8'
requests.memory: '512Gi'
requests.cpu: '128'
limits.memory: '1024Gi'
limits.cpu: '256'
persistentvolumeclaims: '10'
count/jobs.batch: '50'
---
# Priority classes for GPU workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: gpu-high-priority
value: 1000
globalDefault: false
description: 'High priority for critical GPU workloads'
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: gpu-low-priority
value: 100
globalDefault: false
description: 'Low priority for batch GPU workloads'
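Workloads opt into these classes via `priorityClassName`; under contention, low-priority batch jobs can then be preempted in favor of critical training runs. A minimal sketch of a preemptible batch job in the `ml-team` namespace:
# Batch job that yields to high-priority GPU workloads
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-gpu-experiment
  namespace: ml-team
spec:
  template:
    spec:
      priorityClassName: gpu-low-priority
      containers:
        - name: experiment
          image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
          command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never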
GPU Cluster Autoscaling
Cluster Autoscaler for GPU Nodes
# Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
spec:
template:
spec:
containers:
- image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.28.0
name: cluster-autoscaler
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws # or azure, gcp
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/gpu-cluster
- --balance-similar-node-groups
- --scale-down-enabled=true
- --scale-down-delay-after-add=10m
- --scale-down-unneeded-time=10m
- --max-node-provision-time=15m
env:
- name: AWS_REGION
value: eu-central-1
resources:
limits:
cpu: 100m
memory: 300Mi
requests:
cpu: 100m
memory: 300Mi
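For GPU node groups that scale from zero, the autoscaler cannot inspect a running node to learn which resources it would add, so the auto-discovered ASGs are usually annotated with node-template tags. The following sketch assumes an AWS ASG for the training pool with four GPUs per node and the `node-type` label convention used elsewhere in this guide; verify the exact tag keys against the Cluster Autoscaler documentation for your version:
# Illustrative ASG tags for a GPU node group (scale-from-zero hints)
k8s.io/cluster-autoscaler/enabled: 'true'
k8s.io/cluster-autoscaler/gpu-cluster: 'owned'
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: '4'
k8s.io/cluster-autoscaler/node-template/label/node-type: 'gpu-training'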
GPU-Aware Scheduling
# Extended Resource Scheduler
apiVersion: v1
kind: ConfigMap
metadata:
name: scheduler-config
namespace: kube-system
data:
config.yaml: |
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: gpu-scheduler
plugins:
filter:
enabled:
- name: NodeResourcesFit
- name: NodeAffinity
score:
enabled:
- name: NodeResourcesFit
- name: NodeAffinity
- name: InterPodAffinity
pluginConfig:
- name: NodeResourcesFit
args:
scoringStrategy:
type: LeastAllocated
resources:
- name: nvidia.com/gpu
weight: 100
- name: cpu
weight: 1
- name: memory
weight: 1
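Pods only benefit from this profile if they explicitly request it; everything else keeps using the default scheduler. A minimal sketch, assuming a scheduler instance is actually running with the `gpu-scheduler` profile above:
# Pod scheduled by the GPU-aware profile
apiVersion: v1
kind: Pod
metadata:
  name: gpu-scheduler-test
spec:
  schedulerName: gpu-scheduler
  containers:
    - name: cuda
      image: nvidia/cuda:12.0-runtime-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  restartPolicy: Never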
GPU Workload Orchestration
Distributed Training with PyTorch
# PyTorch Distributed Training Job
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
name: distributed-training
namespace: ml-workloads
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
template:
metadata:
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '9090'
spec:
containers:
- name: pytorch
image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
command:
- python
- -m
- torch.distributed.launch
- --nproc_per_node=4
- --nnodes=4
- --node_rank=0
- --master_addr=distributed-training-master-0
- --master_port=23456
- train.py
resources:
requests:
nvidia.com/gpu: 4
memory: '64Gi'
cpu: '16'
limits:
nvidia.com/gpu: 4
memory: '128Gi'
cpu: '32'
env:
- name: NCCL_DEBUG
value: 'INFO'
- name: NCCL_IB_DISABLE
value: '0'
volumeMounts:
- name: training-data
mountPath: /data
- name: model-output
mountPath: /output
Worker:
replicas: 3
template:
spec:
containers:
- name: pytorch
image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
command:
- python
- -m
- torch.distributed.launch
- --nproc_per_node=4
- --nnodes=4
- --node_rank=$WORKER_RANK
- --master_addr=distributed-training-master-0
- --master_port=23456
- train.py
resources:
requests:
nvidia.com/gpu: 4
memory: '64Gi'
cpu: '16'
limits:
nvidia.com/gpu: 4
memory: '128Gi'
cpu: '32'
volumeMounts:
- name: training-data
mountPath: /data
- name: model-output
mountPath: /output
volumes:
- name: training-data
persistentVolumeClaim:
claimName: training-data-pvc
- name: model-output
persistentVolumeClaim:
claimName: model-output-pvc
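The job above expects `training-data-pvc` and `model-output-pvc` to already exist in the `ml-workloads` namespace. A minimal sketch of the claims, using ReadWriteMany because the master and all workers mount the same volumes, and assuming the `distributed-gpu-storage` StorageClass from the storage section below; the sizes are illustrative:
# Claims referenced by the distributed training job (sizes are examples)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
  namespace: ml-workloads
spec:
  accessModes:
    - ReadWriteMany                  # mounted by the master and all workers
  storageClassName: distributed-gpu-storage
  resources:
    requests:
      storage: 2Ti
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-output-pvc
  namespace: ml-workloads
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: distributed-gpu-storage
  resources:
    requests:
      storage: 500Gi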
Model Serving with GPU Sharing
# GPU Model Serving Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: model-serving-gpu
namespace: ml-serving
spec:
replicas: 8
selector:
matchLabels:
app: model-serving
template:
metadata:
labels:
app: model-serving
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '8002' # Triton serves metrics on 8002
spec:
nodeSelector:
node-type: gpu-inference
containers:
- name: model-server
image: tritonserver:latest
ports:
- containerPort: 8000
name: http
- containerPort: 8001
name: grpc
- containerPort: 8002
name: metrics
resources:
requests:
nvidia.com/gpu: 1
memory: '8Gi'
cpu: '2'
limits:
nvidia.com/gpu: 1
memory: '16Gi'
cpu: '4'
env:
- name: CUDA_VISIBLE_DEVICES
value: '0'
- name: TRITON_MODEL_REPOSITORY
value: '/models'
volumeMounts:
- name: model-storage
mountPath: /models
livenessProbe:
httpGet:
path: /v2/health/live
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /v2/health/ready
port: 8000
initialDelaySeconds: 30
periodSeconds: 5
volumes:
- name: model-storage
persistentVolumeClaim:
claimName: model-storage-pvc
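To make the Triton endpoints reachable inside the cluster, the deployment is usually fronted by a Service. A minimal sketch matching the container ports above:
# Service exposing the Triton HTTP, gRPC, and metrics ports
apiVersion: v1
kind: Service
metadata:
  name: model-serving-gpu
  namespace: ml-serving
spec:
  selector:
    app: model-serving
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002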
GPU Cluster Storage Solutions
High-Performance Storage for GPU Workloads
# NVMe SSD StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: nvme-ssd-gpu
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: '16000'
throughput: '1000'
fsType: ext4
encrypted: 'true'
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Distributed storage for the model repository
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: distributed-gpu-storage
provisioner: ceph.rook.io/block
parameters:
clusterID: gpu-cluster
pool: gpu-pool
imageFormat: '2'
imageFeatures: layering
csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
reclaimPolicy: Delete
volumeBindingMode: Immediate
Shared Model Storage
# ReadWriteMany PVC for model sharing
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: shared-models-pvc
namespace: ml-workloads
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 1Ti
storageClassName: distributed-gpu-storage
---
# Model Repository ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: model-repository-config
namespace: ml-workloads
data:
models.json: |
{
"models": [
{
"name": "bert-base-german",
"path": "/models/bert-base-german",
"version": "1.0",
"gpu_memory": "2Gi",
"batch_size": 32
},
{
"name": "gpt-german-large",
"path": "/models/gpt-german-large",
"version": "2.1",
"gpu_memory": "8Gi",
"batch_size": 8
}
]
}
GPU Cluster Monitoring & Observability
Comprehensive GPU Monitoring Stack
# GPU Monitoring Stack
apiVersion: v1
kind: Namespace
metadata:
name: gpu-monitoring
---
# Prometheus for the GPU cluster
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: gpu-cluster-prometheus
namespace: gpu-monitoring
spec:
replicas: 2
retention: 30d
storage:
volumeClaimTemplate:
spec:
accessModes: ['ReadWriteOnce']
resources:
requests:
storage: 500Gi
storageClassName: nvme-ssd-gpu
serviceMonitorSelector:
matchLabels:
monitoring: gpu-cluster
ruleSelector:
matchLabels:
monitoring: gpu-cluster
resources:
requests:
memory: '8Gi'
cpu: '2'
limits:
memory: '16Gi'
cpu: '4'
---
# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: dcgm-exporter
namespace: gpu-monitoring
labels:
monitoring: gpu-cluster
spec:
selector:
matchLabels:
app: dcgm-exporter
endpoints:
- port: metrics
interval: 30s
path: /metrics
honorLabels: true
relabelings:
- sourceLabels: [__meta_kubernetes_pod_node_name]
targetLabel: node
- sourceLabels: [__meta_kubernetes_pod_name]
targetLabel: pod
- sourceLabels: [__meta_kubernetes_namespace]
targetLabel: namespace
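The ServiceMonitor above only discovers targets through a Service that carries the `app: dcgm-exporter` label and exposes a port named `metrics`. If your GPU Operator installation does not already provide one with matching labels, a minimal sketch could look like this; the `gpu-operator` namespace, the pod selector, and the default port 9400 are assumptions about the exporter deployment:
# Service fronting the DCGM exporter pods so the ServiceMonitor can select it
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: nvidia-dcgm-exporter        # assumed pod label of the exporter DaemonSet
  ports:
    - name: metrics
      port: 9400                     # dcgm-exporter default listen port
      targetPort: 9400
Because this Service lives outside the `gpu-monitoring` namespace, the ServiceMonitor additionally needs a `namespaceSelector` (for example `matchNames: [gpu-operator]`); without one, Prometheus Operator only looks in the ServiceMonitor's own namespace.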
GPU Cluster Health Monitoring
# GPU Cluster Health Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: gpu-cluster-health
namespace: gpu-monitoring
labels:
monitoring: gpu-cluster
spec:
groups:
- name: gpu-cluster.health
rules:
- alert: GPUClusterNodeDown
expr: up{job="dcgm-exporter"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: 'GPU node is down'
description: 'GPU node {{$labels.instance}} has been down for more than 2 minutes'
- alert: GPUClusterLowUtilization
expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
for: 30m
labels:
severity: warning
annotations:
summary: 'GPU cluster underutilized'
description: 'GPU cluster average utilization is {{$value}}% for 30 minutes'
- alert: GPUClusterMemoryPressure
expr: avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.85
for: 10m
labels:
severity: warning
annotations:
summary: 'GPU cluster memory pressure'
description: 'GPU cluster memory usage is {{$value | humanizePercentage}}'
- alert: GPUClusterTemperatureHigh
expr: max(DCGM_FI_DEV_GPU_TEMP) > 85
for: 5m
labels:
severity: critical
annotations:
summary: 'GPU cluster temperature critical'
description: 'GPU temperature reached {{$value}}°C on node {{$labels.exported_instance}}'
Cost Optimization for GPU Clusters
GPU Cluster Cost Management
# GPU Cluster Cost Analyzer
import pandas as pd
from prometheus_api_client import PrometheusConnect
from datetime import datetime, timedelta
class GPUClusterCostAnalyzer:
def __init__(self, prometheus_url):
self.prom = PrometheusConnect(url=prometheus_url)
self.gpu_costs = {
'Tesla-V100': 2.48, # € per hour
'Tesla-T4': 0.35,
'A100-40GB': 3.20,
'A100-80GB': 4.50,
'H100': 6.80,
'RTX-4090': 1.20
}
def get_cluster_gpu_inventory(self):
"""GPU Cluster Inventar abrufen"""
query = '''
count by (exported_gpu_model, exported_instance)
(DCGM_FI_DEV_GPU_UTIL)
'''
result = self.prom.custom_query(query)
inventory = {}
for item in result:
gpu_model = item['metric']['exported_gpu_model']
node = item['metric']['exported_instance']
count = int(item['value'][1])
if node not in inventory:
inventory[node] = {}
inventory[node][gpu_model] = count
return inventory
def calculate_cluster_costs(self, time_range='24h'):
"""Cluster-Kosten berechnen"""
inventory = self.get_cluster_gpu_inventory()
total_cost = 0
node_costs = {}
gpu_type_costs = {}
for node, gpus in inventory.items():
node_cost = 0
for gpu_model, count in gpus.items():
if gpu_model in self.gpu_costs:
hourly_cost = self.gpu_costs[gpu_model] * count
if time_range == '24h':
daily_cost = hourly_cost * 24
elif time_range == '30d':
daily_cost = hourly_cost * 24 * 30
else:
daily_cost = hourly_cost
node_cost += daily_cost
total_cost += daily_cost
if gpu_model not in gpu_type_costs:
gpu_type_costs[gpu_model] = 0
gpu_type_costs[gpu_model] += daily_cost
node_costs[node] = node_cost
return {
'total_cost': total_cost,
'node_costs': node_costs,
'gpu_type_costs': gpu_type_costs,
'inventory': inventory
}
def get_utilization_efficiency(self):
"""Cluster-Effizienz berechnen"""
util_query = 'avg(DCGM_FI_DEV_GPU_UTIL)'
memory_query = 'avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100)'
util_result = self.prom.custom_query(util_query)
memory_result = self.prom.custom_query(memory_query)
if util_result and memory_result:
avg_util = float(util_result[0]['value'][1])
avg_memory = float(memory_result[0]['value'][1])
efficiency_score = (avg_util + avg_memory) / 2
costs = self.calculate_cluster_costs('24h')
return {
'gpu_utilization': avg_util,
'memory_utilization': avg_memory,
'efficiency_score': efficiency_score,
'daily_cost': costs['total_cost'],
'wasted_cost': costs['total_cost'] * (1 - efficiency_score / 100),
'monthly_savings_potential': costs['total_cost'] * (1 - efficiency_score / 100) * 30
}
return None
def recommend_optimizations(self):
"""Optimierungsempfehlungen"""
efficiency = self.get_utilization_efficiency()
costs = self.calculate_cluster_costs('24h')
recommendations = []
if efficiency['efficiency_score'] < 50:
recommendations.append({
'priority': 'high',
'action': 'Enable GPU sharing/time-slicing',
'potential_savings': efficiency['monthly_savings_potential'] * 0.6,
'description': 'Low cluster efficiency detected'
})
if efficiency['gpu_utilization'] < 30:
recommendations.append({
'priority': 'high',
'action': 'Implement workload consolidation',
'potential_savings': efficiency['monthly_savings_potential'] * 0.4,
'description': 'GPU utilization below optimal threshold'
})
# Analyze the GPU type mix
gpu_costs = costs['gpu_type_costs']
if 'A100-80GB' in gpu_costs and gpu_costs['A100-80GB'] > costs['total_cost'] * 0.7:
recommendations.append({
'priority': 'medium',
'action': 'Consider mixed GPU types for different workloads',
'potential_savings': gpu_costs['A100-80GB'] * 0.3,
'description': 'High-end GPUs dominate cluster costs'
})
return recommendations
# Usage
analyzer = GPUClusterCostAnalyzer('http://prometheus:9090')
costs = analyzer.calculate_cluster_costs('30d')
efficiency = analyzer.get_utilization_efficiency()
recommendations = analyzer.recommend_optimizations()
print("=== GPU Cluster Cost Analysis ===")
print(f"Monthly Cluster Cost: €{costs['total_cost']:.2f}")
print(f"GPU Efficiency Score: {efficiency['efficiency_score']:.1f}%")
print(f"Potential Monthly Savings: €{efficiency['monthly_savings_potential']:.2f}")
print("\n=== Optimization Recommendations ===")
for rec in recommendations:
print(f"[{rec['priority'].upper()}] {rec['action']}")
print(f" Potential Savings: €{rec['potential_savings']:.2f}/month")
print(f" Description: {rec['description']}\n")
Automated Cost Optimization
# GPU Cluster Autoscaler mit Cost Awareness
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-cost-optimizer
namespace: gpu-monitoring
spec:
replicas: 1
template:
spec:
containers:
- name: cost-optimizer
image: gpu-cost-optimizer:latest
env:
- name: PROMETHEUS_URL
value: 'http://prometheus:9090'
- name: OPTIMIZATION_INTERVAL
value: '300' # 5 minutes
- name: MIN_EFFICIENCY_THRESHOLD
value: '60' # 60%
- name: MAX_COST_PER_DAY
value: '1000' # €1000
command:
- python
- -c
- |
import time
import os
from kubernetes import client, config
from cost_analyzer import GPUClusterCostAnalyzer
config.load_incluster_config()
v1 = client.CoreV1Api()
apps_v1 = client.AppsV1Api()
analyzer = GPUClusterCostAnalyzer(os.getenv('PROMETHEUS_URL'))
while True:
try:
efficiency = analyzer.get_utilization_efficiency()
costs = analyzer.calculate_cluster_costs('24h')
if efficiency['efficiency_score'] < int(os.getenv('MIN_EFFICIENCY_THRESHOLD')):
# Scale down underutilized workloads
print(f"Low efficiency detected: {efficiency['efficiency_score']:.1f}%")
# Implement scaling logic here
if costs['total_cost'] > int(os.getenv('MAX_COST_PER_DAY')):
print(f"Cost threshold exceeded: €{costs['total_cost']:.2f}")
# Implement cost reduction logic here
time.sleep(int(os.getenv('OPTIMIZATION_INTERVAL')))
except Exception as e:
print(f"Error in cost optimization: {e}")
time.sleep(60)
Security for GPU Clusters
GPU Cluster Security Hardening
# Pod Security Standards for GPU workloads
apiVersion: v1
kind: Namespace
metadata:
name: secure-gpu-workloads
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
---
# Network policy for the GPU namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: gpu-workload-isolation
namespace: secure-gpu-workloads
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ml-gateway
- podSelector:
matchLabels:
role: gpu-client
ports:
- protocol: TCP
port: 8080
egress:
- to:
- namespaceSelector:
matchLabels:
name: model-registry
ports:
- protocol: TCP
port: 443
- to: []
ports:
- protocol: UDP
port: 53
---
# RBAC for GPU resources
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: gpu-user
namespace: secure-gpu-workloads
rules:
- apiGroups: ['']
resources: ['pods', 'pods/log']
verbs: ['get', 'list', 'create', 'delete']
- apiGroups: ['batch']
resources: ['jobs']
verbs: ['get', 'list', 'create', 'delete']
- apiGroups: ['apps']
resources: ['deployments']
verbs: ['get', 'list']
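The Role above has no effect until it is bound to users or groups. A minimal sketch of a RoleBinding; the `ml-engineers` group name is an assumption standing in for whatever group your identity provider maps:
# Bind the gpu-user role to a team group (group name is illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-user-binding
  namespace: secure-gpu-workloads
subjects:
  - kind: Group
    name: ml-engineers               # assumed group from the identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: gpu-user
  apiGroup: rbac.authorization.k8s.io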
GPU Workload Encryption
# Encrypted GPU Workload
apiVersion: batch/v1
kind: Job
metadata:
name: secure-ml-training
namespace: secure-gpu-workloads
spec:
template:
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: ml-training
image: secure-ml-runtime:latest
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
resources:
requests:
nvidia.com/gpu: 1
memory: '8Gi'
cpu: '2'
limits:
nvidia.com/gpu: 1
memory: '16Gi'
cpu: '4'
env:
- name: MODEL_ENCRYPTION_KEY
valueFrom:
secretKeyRef:
name: model-encryption-secret
key: encryption-key
volumeMounts:
- name: encrypted-data
mountPath: /data
readOnly: true
- name: tmp-volume
mountPath: /tmp
- name: output-volume
mountPath: /output
volumes:
- name: encrypted-data
secret:
secretName: encrypted-training-data
- name: tmp-volume
emptyDir: {}
- name: output-volume
emptyDir: {}
restartPolicy: Never
Best Practices for Kubernetes GPU Clusters
1. Hardware Planning
# GPU Cluster Hardware Checklist
Hardware Planning:
✅ GPU-to-CPU Ratio: 1:4-8 (GPU:CPU cores)
✅ Memory Ratio: 8-16 GB RAM per GPU
✅ Network: 25+ Gbps for training, 10+ Gbps for inference
✅ Storage: NVMe SSD for training data
✅ Cooling: adequate cooling for GPU nodes
✅ Power: redundant power supply
✅ InfiniBand: for large-scale training (optional)
2. Resource Allocation
# Resource Allocation Best Practices
Resource Strategy:
- Training Nodes: dedicated GPUs
- Inference Nodes: GPU Sharing/MIG
- Development: Time-slicing
- Batch Jobs: Preemptible Resources
- Interactive: Guaranteed Resources
3. Monitoring Strategy
# Essential GPU Cluster Metrics
# Cluster GPU Utilization
avg(DCGM_FI_DEV_GPU_UTIL)
# Cluster Memory Efficiency
avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100)
# Cost per Workload
sum(rate(container_gpu_allocation[1h])) * gpu_hourly_cost
# Queue Depth (Pending Pods)
count(kube_pod_status_phase{phase="Pending"} * on(pod) kube_pod_info{created_by_kind="Job"})
# Cluster Efficiency Score
(avg(DCGM_FI_DEV_GPU_UTIL) + avg(DCGM_FI_DEV_FB_USED/DCGM_FI_DEV_FB_TOTAL*100)) / 2
Troubleshooting GPU Clusters
Common Issues and Solutions
GPU Node Not Ready
# Debug GPU Node Issues
kubectl describe node gpu-worker-1
kubectl get events --field-selector involvedObject.name=gpu-worker-1
# Check GPU Operator Status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-xxx
# Validate the GPU driver (kubectl exec targets pods, not nodes - run nvidia-smi in the driver pod on that node)
kubectl exec -n gpu-operator -it nvidia-driver-daemonset-xxx -- nvidia-smi
kubectl exec -n gpu-operator -it nvidia-driver-daemonset-xxx -- nvidia-smi -L
GPU Workload Scheduling Issues
# Debug Scheduling
kubectl describe pod gpu-workload-pod
kubectl get events --field-selector involvedObject.name=gpu-workload-pod
# Check Resource Availability
kubectl describe nodes | grep nvidia.com/gpu
kubectl top nodes | grep gpu
# Verify Resource Quotas
kubectl describe resourcequota -n ml-workloads
Performance Issues
# Performance Debugging Queries
# GPU Throttling Detection
DCGM_FI_DEV_GPU_UTIL < 80 and DCGM_FI_DEV_GPU_TEMP > 83
# Memory Bandwidth Utilization
DCGM_FI_DEV_MEM_UTIL
# PCIe Throughput Issues
DCGM_FI_DEV_PCIE_TX_THROUGHPUT + DCGM_FI_DEV_PCIE_RX_THROUGHPUT < expected_throughput
Conclusion: Kubernetes GPU Clusters for German Companies
ROI for German AI/ML companies:
Startups (2-10 GPUs):
- Setup cost: €50,000-200,000
- Monthly operating cost: €5,000-20,000
- ROI break-even: 6-12 months
- Efficiency gain: 40-60% vs. cloud-only
Enterprise (50-500 GPUs):
- Setup cost: €500,000-2,000,000
- Monthly operating cost: €50,000-200,000
- ROI break-even: 12-18 months
- Efficiency gain: 60-80% vs. multi-cloud
Implementation roadmap for German companies:
Phase 1 (weeks 1-4): Hardware Procurement & Installation
Phase 2 (weeks 5-8): Kubernetes Cluster Setup
Phase 3 (weeks 9-12): GPU Operator & Monitoring
Phase 4 (weeks 13-16): Workload Migration & Optimization
Phase 5 (weeks 17-20): Security Hardening & Compliance
Phase 6 (weeks 21-24): Automation & Cost Optimization
Compliance for German companies:
- GDPR (DSGVO): data residency and privacy controls
- BSI: Security standards implementation
- TISAX: Automotive industry compliance
- ISO 27001: Information security management
Need support with your Kubernetes GPU cluster setup? Our AI infrastructure experts help German companies plan, implement, and optimize production-ready GPU clusters. Contact us for a free GPU cluster consultation.
Further GPU cluster articles:
- AKS GPU Workloads Kostenrechner: Azure Kubernetes Service für ML/AI in Deutschland 2025
- Generative KI in DevOps: Revolution für deutsche Unternehmen
- CUDA Cores Vergleich: Kubernetes GPU für deutsche AI/ML Teams