Kubernetes AI | Machine Learning Platform

Why Kubernetes for AI Applications?

Kubernetes has established itself as the leading platform for machine learning and AI applications. For German companies, it offers a scalable, flexible environment for developing and operating AI systems:

  • GPU cluster management - efficient management of GPU resources
  • MLOps pipelines - automated ML workflows
  • Scalable training - model training that scales across nodes
  • Production deployment - production-ready AI deployments
  • Cost optimization - optimized resource usage

Kubernetes AI/ML Architecture

AI/ML Stack on Kubernetes

Kubernetes AI/ML Architecture
├── Infrastructure Layer
│   ├── GPU Nodes (NVIDIA)
│   ├── CPU Nodes
│   ├── Storage (NFS/Ceph)
│   └── Networking (Calico/Flannel)
├── Platform Layer
│   ├── Kubeflow
│   ├── MLflow
│   ├── TensorFlow Serving
│   └── PyTorch Serve
├── Application Layer
│   ├── Training Jobs
│   ├── Inference Services
│   ├── Data Pipelines
│   └── Model Registry
└── Operations Layer
    ├── Monitoring (Prometheus)
    ├── Logging (ELK Stack)
    ├── CI/CD (Jenkins/ArgoCD)
    └── Security (RBAC/Network Policies)

AI/ML Workflow

  • Data ingestion - collect and preprocess data
  • Model training - train models on GPU clusters
  • Model validation - validate and evaluate models
  • Model deployment - deploy to production
  • Model monitoring - monitor continuously
  • Model retraining - retrain automatically

GPU Cluster Management

NVIDIA GPU Support

Kubernetes natively supports NVIDIA GPUs through the NVIDIA Device Plugin, which runs as a DaemonSet on each GPU node and advertises the nvidia.com/gpu resource to the scheduler.

GPU Node Setup

# GPU node configuration - labels and taints are usually applied with
# kubectl label / kubectl taint; shown here in manifest form
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-tesla-v100
spec:
  taints:
    - key: nvidia.com/gpu
      effect: NoSchedule

GPU Resource Requests

# GPU Pod Configuration
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  containers:
    - name: training-container
      image: tensorflow/tensorflow:latest-gpu
      resources:
        # For extended resources such as nvidia.com/gpu,
        # requests and limits must be equal
        limits:
          nvidia.com/gpu: 2
        requests:
          nvidia.com/gpu: 2
      volumeMounts:
        - name: training-data
          mountPath: /data
        - name: model-storage
          mountPath: /models
  # Tolerate the NoSchedule taint set on the GPU nodes above
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: training-data-pvc
    - name: model-storage
      persistentVolumeClaim:
        claimName: model-storage-pvc

Multi-GPU Training

# Multi-GPU Training Job (single node, 4 GPUs)
apiVersion: batch/v1
kind: Job
metadata:
  name: multi-gpu-training
spec:
  parallelism: 1
  completions: 1
  template:
    spec:
      containers:
        - name: training
          image: pytorch/pytorch:latest
          # train.py must start one worker process per GPU itself
          # (e.g. via torch.multiprocessing or torchrun)
          command: ['python', 'train.py']
          resources:
            limits:
              nvidia.com/gpu: 4
          env:
            # torch.distributed rendezvous settings; 'localhost' works
            # because all worker processes share this one pod
            - name: MASTER_ADDR
              value: 'localhost'
            - name: MASTER_PORT
              value: '29500'
            - name: WORLD_SIZE
              value: '4'
            - name: RANK
              value: '0'
      restartPolicy: Never
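
With a plain python train.py entrypoint, the training script has to spawn one worker process per GPU and read the rendezvous settings from the environment. A minimal sketch of such a train.py, with the model as a placeholder:

# train.py - single-node multi-GPU training (sketch)
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(local_rank, world_size):
    # MASTER_ADDR/MASTER_PORT from the Job env are used for rendezvous
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = int(os.environ.get("WORLD_SIZE", torch.cuda.device_count()))
    mp.spawn(worker, args=(world_size,), nprocs=world_size)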

Kubeflow - ML Platform

What is Kubeflow?

Kubeflow is an open-source machine learning platform for Kubernetes that supports the entire ML workflow.

Kubeflow Components

  • Jupyter Notebooks - Interactive Development
  • TensorFlow Training - Distributed Training
  • PyTorch Training - PyTorch Support
  • Katib - Hyperparameter Tuning
  • Pipelines - ML Workflow Orchestration
  • Serving - Model Serving
  • Metadata - Experiment Tracking

Kubeflow Installation

# Kubeflow installation via kfctl (applies to Kubeflow 1.2;
# newer releases are installed from the kustomize manifests repo)
export KF_NAME=my-kubeflow
export BASE_DIR=/opt/kubeflow
export KF_DIR=${BASE_DIR}/${KF_NAME}

# Download Kubeflow
mkdir -p ${KF_DIR}
cd ${KF_DIR}
curl -L -o kfctl.tar.gz https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar -xvf kfctl.tar.gz

# Deploy Kubeflow
./kfctl apply -V -f https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml

Kubeflow Pipelines

# Kubeflow pipeline example (KFP SDK v1; ContainerOp was removed in the v2 SDK)
import kfp
from kfp import dsl

@dsl.pipeline(
    name='ML Training Pipeline',
    description='A pipeline for training and deploying ML models'
)
def ml_pipeline():
    # Data preprocessing
    preprocess = dsl.ContainerOp(
        name='preprocess',
        image='preprocess:latest',
        command=['python', 'preprocess.py'],
        file_outputs={'output': '/output/data'}
    )

    # Model training
    train = dsl.ContainerOp(
        name='train',
        image='train:latest',
        command=['python', 'train.py'],
        arguments=['--input', preprocess.outputs['output']],
        file_outputs={'model': '/output/model'}
    )

    # Model deployment
    deploy = dsl.ContainerOp(
        name='deploy',
        image='deploy:latest',
        command=['python', 'deploy.py'],
        arguments=['--model', train.outputs['model']]
    )
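
Defined this way, the pipeline can be compiled and submitted with the KFP v1 client; the in-cluster host below is an assumption and depends on how the Pipelines API is exposed:

# Compile and submit the pipeline (KFP SDK v1)
client = kfp.Client(host='http://ml-pipeline.kubeflow:8888')  # assumed in-cluster endpoint
client.create_run_from_pipeline_func(ml_pipeline, arguments={})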

MLflow - Experiment Tracking

MLflow Features

  • Experiment tracking - version control for ML experiments
  • Model registry - central model management
  • Model serving - straightforward model deployment
  • Reproducibility - reproducible experiments

MLflow on Kubernetes

# MLflow Server Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
          # Official MLflow image; a plain python:3.8 image does not
          # ship the mlflow CLI
          image: ghcr.io/mlflow/mlflow:latest
          command: ['mlflow', 'server']
          args:
            - '--host=0.0.0.0'
            - '--port=5000'
            # SQLite on the container filesystem does not persist across
            # restarts; use a PVC or an external database in production
            - '--backend-store-uri=sqlite:///mlflow.db'
            - '--default-artifact-root=s3://mlflow-artifacts'
          ports:
            - containerPort: 5000
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: mlflow-secrets
                  key: aws-access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: mlflow-secrets
                  key: aws-secret-key

MLflow Integration

# MLflow Integration Example
import mlflow
import mlflow.pytorch

# Point the client at the MLflow server deployed above
# (assumes a Service named mlflow-server exposing port 5000)
mlflow.set_tracking_uri("http://mlflow-server:5000")

# Start experiment
mlflow.set_experiment("image-classification")

with mlflow.start_run():
    # Log hyperparameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)

    # Train model (train_model() is a placeholder for your training code)
    model = train_model()

    # Log evaluation metrics
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("loss", 0.05)

    # Save the model as a run artifact
    mlflow.pytorch.log_model(model, "model")
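
A tracked model can then be promoted into the MLflow Model Registry; the run ID placeholder and registry name below are illustrative:

# Promote a logged model to the MLflow Model Registry (sketch)
run_id = "<run-id-from-the-tracking-server>"  # placeholder
mlflow.register_model(f"runs:/{run_id}/model", "image-classification")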

TensorFlow Serving

TensorFlow Serving Setup

# TensorFlow Serving Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      containers:
        - name: serving
          image: tensorflow/serving:latest
          ports:
            - containerPort: 8500 # gRPC API
            - containerPort: 8501 # REST API
          volumeMounts:
            # Expects the layout /models/my-model/<version>/saved_model.pb
            - name: model-storage
              mountPath: /models
          env:
            - name: MODEL_NAME
              value: 'my-model'
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc

TensorFlow Serving Service

# TensorFlow Serving Service
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-serving
spec:
  selector:
    app: tensorflow-serving
  ports:
    - name: grpc
      port: 8500
      targetPort: 8500
    - name: http
      port: 8501
      targetPort: 8501
  type: LoadBalancer

Model Prediction

# TensorFlow Serving REST client
import numpy as np
import requests

# Example input batch; the shape must match the model's serving signature
instances = np.array([[1.0, 2.0, 3.0]]).tolist()

# Port 8501 serves the REST API (8500 is gRPC)
response = requests.post(
    'http://tensorflow-serving:8501/v1/models/my-model:predict',
    json={'instances': instances}
)

prediction = response.json()['predictions']

PyTorch Serve

PyTorch Serve Setup

# PyTorch Serve Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-serve
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pytorch-serve
  template:
    metadata:
      labels:
        app: pytorch-serve
    spec:
      containers:
        - name: serve
          image: pytorch/torchserve:latest
          ports:
            - containerPort: 8080 # inference API
            - containerPort: 8081 # management API
            - containerPort: 8082 # metrics API
          volumeMounts:
            - name: model-store
              mountPath: /home/model-server/model-store
          command:
            - torchserve
            - --start
            # keep the server in the foreground so the container stays alive
            - --foreground
            - --model-store=/home/model-server/model-store
            - --models=my-model=my-model.mar
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc

PyTorch Model Archive

# Create a TorchServe model archive (.mar) in ./model-store
torch-model-archiver --model-name my-model \
  --version 1.0 \
  --model-file model.py \
  --serialized-file model.pth \
  --handler image_classifier \
  --export-path model-store
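
Once the .mar file is in the model store, predictions go through the inference API on port 8080. A sketch, assuming a Service named pytorch-serve in front of the deployment and the image_classifier handler from above:

# Query the TorchServe inference API (sketch)
import requests

with open('example.jpg', 'rb') as f:
    response = requests.post(
        'http://pytorch-serve:8080/predictions/my-model',
        data=f.read(),
    )

print(response.json())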

MLOps Pipeline

CI/CD for ML

# ML CI/CD Pipeline
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: ml-pipeline
spec:
  entrypoint: ml-workflow
  templates:
    - name: ml-workflow
      steps:
        - - name: data-validation
            template: validate-data
        - - name: model-training
            template: train-model
        - - name: model-evaluation
            template: evaluate-model
        - - name: model-deployment
            template: deploy-model
            # deploy only when the evaluation step prints 'pass'
            when: "{{steps.evaluate-model.outputs.result}} == pass"

    - name: validate-data
      container:
        image: data-validation:latest
        command: [python, validate.py]

    - name: train-model
      container:
        image: training:latest
        command: [python, train.py]
        resources:
          limits:
            nvidia.com/gpu: 2

    - name: evaluate-model
      container:
        image: evaluation:latest
        command: [python, evaluate.py]

    - name: deploy-model
      container:
        image: deployment:latest
        command: [python, deploy.py]
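
The conditional deployment step relies on Argo capturing the evaluation container's stdout as its result. A minimal sketch of what evaluate.py could print, with the metric computation and the threshold as placeholders:

# evaluate.py - emit 'pass' or 'fail' for the Argo when-condition (sketch)
def evaluate_model():
    # Placeholder: load the candidate model and compute validation metrics
    return {"accuracy": 0.93}

metrics = evaluate_model()

# Argo exposes this stdout as {{steps.evaluate-model.outputs.result}}
print("pass" if metrics["accuracy"] >= 0.9 else "fail")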

Automated Retraining

# Automated Retraining CronJob
apiVersion: batch/v1 # batch/v1beta1 was removed in Kubernetes 1.25
kind: CronJob
metadata:
  name: model-retraining
spec:
  schedule: '0 2 * * *' # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: retraining
              image: retraining:latest
              command: [python, retrain.py]
              env:
                - name: DATA_DRIFT_THRESHOLD
                  value: '0.1'
                - name: PERFORMANCE_THRESHOLD
                  value: '0.8'
          restartPolicy: OnFailure
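
A sketch of how retrain.py might act on the two thresholds from the environment; the drift and performance measurements are placeholders:

# retrain.py - threshold-driven retraining decision (sketch)
import os

drift_threshold = float(os.environ.get("DATA_DRIFT_THRESHOLD", "0.1"))
performance_threshold = float(os.environ.get("PERFORMANCE_THRESHOLD", "0.8"))

def measure_drift():
    # Placeholder: compare live feature distributions to a reference window
    return 0.15

def measure_performance():
    # Placeholder: score the production model on freshly labeled data
    return 0.75

if measure_drift() > drift_threshold or measure_performance() < performance_threshold:
    print("Thresholds breached - triggering retraining")
    # ... kick off the training pipeline here ...
else:
    print("Model healthy - skipping retraining")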

AI/ML Monitoring

Model Performance Monitoring

# Model Monitoring Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-monitoring
  template:
    metadata:
      labels:
        app: model-monitoring
    spec:
      containers:
        - name: monitoring
          image: model-monitoring:latest
          env:
            - name: MODEL_ENDPOINT
              value: 'http://tensorflow-serving:8501'
            - name: ALERT_THRESHOLD
              value: '0.8'
            - name: SLACK_WEBHOOK
              valueFrom:
                secretKeyRef:
                  name: monitoring-secrets
                  key: slack-webhook
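
In outline, such a monitoring container could poll the serving endpoint and push alerts to Slack; the accuracy source and the poll interval below are assumptions:

# Model monitoring loop (sketch)
import os
import time

import requests

endpoint = os.environ["MODEL_ENDPOINT"]
threshold = float(os.environ["ALERT_THRESHOLD"])
webhook = os.environ["SLACK_WEBHOOK"]

def alert(text):
    # Slack incoming-webhook payload
    requests.post(webhook, json={"text": text})

def fetch_live_accuracy():
    # Placeholder: compute accuracy from recently labeled production traffic
    return 0.75

while True:
    # TensorFlow Serving reports per-version model state via its REST API
    status = requests.get(f"{endpoint}/v1/models/my-model").json()
    state = status["model_version_status"][0]["state"]
    if state != "AVAILABLE":
        alert(f"Model state is {state}")

    accuracy = fetch_live_accuracy()
    if accuracy < threshold:
        alert(f"Model accuracy {accuracy:.2f} below threshold {threshold:.2f}")

    time.sleep(300)  # poll every five minutes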

Data Drift Detection

# Data drift detection using a two-sample Kolmogorov-Smirnov test
from scipy import stats

def send_alert(message):
    # Placeholder: wire this up to Slack, email, or another alerting channel
    print(f"ALERT: {message}")

def detect_data_drift(reference_data, current_data, alpha=0.05):
    # Compare the distributions of the reference and current samples
    ks_statistic, p_value = stats.ks_2samp(reference_data, current_data)

    # A small p-value means the distributions differ significantly
    if p_value < alpha:
        send_alert(f"Data drift detected: p-value={p_value:.4f}")

    return p_value
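
For example, a shifted current window against a stable reference (synthetic data for illustration):

# Illustrative usage with synthetic feature distributions
import numpy as np

reference = np.random.normal(loc=0.0, scale=1.0, size=1000)
current = np.random.normal(loc=0.5, scale=1.0, size=1000)

p = detect_data_drift(reference, current)  # almost certainly triggers the alert
print(f"KS-test p-value: {p:.6f}")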

AI/ML Security

Model Security

  • Model encryption - encrypt stored models
  • Access control - control access to models
  • Audit logging - record access and changes
  • Secure inference - protect inference traffic

Data Security

  • Data encryption - encrypt data at rest and in transit
  • Data masking - mask sensitive fields
  • Access logging - log data access
  • Compliance - GDPR compliance

Security Best Practices

# Container security context (fragment; belongs under
# spec.containers[].securityContext in a pod template)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false

Cost Optimization

GPU Resource Optimization

  • GPU sharing - share GPUs between workloads (e.g. NVIDIA time-slicing or MIG)
  • Spot instances - use cheaper spot or preemptible nodes
  • Auto-scaling - scale node pools automatically
  • Resource quotas - cap consumption per namespace

Cost Monitoring

# Cost Monitoring
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-monitoring
data:
  config.yaml: |
    gpu_cost_per_hour: 2.50
    cpu_cost_per_hour: 0.10
    memory_cost_per_gb: 0.05
    alert_threshold: 100.00
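
How a cost reporter might consume this ConfigMap, assuming it is mounted at /etc/cost/config.yaml and the rates are per hour and per GB-hour; the usage figures are placeholders:

# Cost reporting sketch - reads the mounted ConfigMap
import yaml

with open("/etc/cost/config.yaml") as f:
    rates = yaml.safe_load(f)

# Placeholder usage for one day: 2 GPUs and 30 CPUs around the clock
gpu_hours, cpu_hours, memory_gb_hours = 48, 720, 2000

cost = (
    gpu_hours * rates["gpu_cost_per_hour"]
    + cpu_hours * rates["cpu_cost_per_hour"]
    + memory_gb_hours * rates["memory_cost_per_gb"]
)

if cost > rates["alert_threshold"]:
    print(f"Daily cost {cost:.2f} exceeds threshold {rates['alert_threshold']:.2f}")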

Success Stories

Case Study: Manufacturing AI

Initial situation:

  • Manual quality control
  • High error rates
  • Slow inspections
  • High costs

Solution:

  • Kubernetes GPU cluster
  • Computer vision model
  • Real-time inference
  • Automated quality control

Results:

  • 95% accuracy in quality control
  • 80% faster inspections
  • 60% cost savings
  • Full automation

Case Study: Financial Services AI

Initial situation:

  • Manual fraud detection
  • High false-positive rates
  • Slow response times
  • Compliance risks

Solution:

  • Kubernetes ML platform
  • Real-time fraud detection
  • Automated model retraining
  • Continuous monitoring

Results:

  • 90% fewer false positives
  • Real-time fraud detection
  • Automated compliance reports
  • 70% cost savings

AI/ML Best Practices

Model Development

  • Version control - version models, data, and experiments
  • Reproducibility - reproducible experiments
  • Testing - comprehensive testing
  • Documentation - complete documentation

Production Deployment

  • Blue-green deployments - zero-downtime releases
  • Canary deployments - gradual rollouts
  • Rollback strategy - simple rollbacks
  • Monitoring - comprehensive observability

Team Collaboration

  • MLOps culture - establish an MLOps culture
  • Cross-functional teams - combine ML, software, and operations skills
  • Knowledge sharing - spread knowledge across teams
  • Training programs - ongoing training

The Future of AI/ML on Kubernetes

Emerging Technologies

  • Federated learning - training across distributed data sources
  • AutoML - automated machine learning
  • Edge AI - AI at the edge
  • Quantum ML - quantum machine learning
  • Explainable AI - interpretable models and decisions
  • Serverless ML - serverless machine learning
  • ML observability - deep insight into model behavior
  • Responsible AI - responsible use of AI
  • ML governance - governance for models and data
  • ML security - securing ML systems

Conclusion

Kubernetes for AI applications gives German companies a powerful, scalable platform:

  • GPU cluster management - efficient management of GPU resources
  • MLOps pipelines - automated ML workflows
  • Production deployment - production-ready AI deployments
  • Cost optimization - optimized resource usage
  • Security & compliance - secure, compliant AI systems

Key success factors:

  • Proper planning - a comprehensive ML strategy
  • Team skills - ML and Kubernetes expertise
  • Infrastructure - a robust foundation
  • Monitoring - comprehensive ML observability

Next steps:

  1. ML assessment - evaluate your current ML maturity
  2. Infrastructure setup - build the Kubernetes ML infrastructure
  3. Pilot project - start with an ML pilot project
  4. Team training - train the team in ML and Kubernetes
  5. Production rollout - move to production step by step

With Kubernetes for AI applications, German companies can build innovative AI solutions and gain a competitive edge.
