Deploying LLMs at Scale with Kubernetes¶
A complete guide to deploying and scaling large language models (LLMs) in production environments using Kubernetes, vLLM, and advanced optimization strategies.
🎯 Learning Objectives¶
After completing this guide, you will be able to:
- Deploy LLMs with vLLM on Kubernetes
- Configure auto-scaling based on GPU metrics
- Implement caching and optimization strategies
- Manage multiple models in production
- Monitor inference performance and costs
📋 Prerequisites¶
- Basic knowledge of Kubernetes
- Experience with Docker and Helm
- Familiarity with LLMs and vLLM
- A Kubernetes cluster with GPUs (optional but recommended)
🏗️ Deployment Architecture¶
Main Components¶
```mermaid
graph TB
    A[Ingress/Load Balancer] --> B[API Gateway]
    B --> C[vLLM Service 1]
    B --> D[vLLM Service 2]
    B --> E[vLLM Service N]
    C --> F[GPU Node Pool]
    D --> F
    E --> F
    G[Prometheus] --> H[Metrics Server]
    H --> I[HPA Controller]
    I --> C
    I --> D
    I --> E
    J[Model Registry] --> K[Init Container]
    K --> C
```
Deployment Strategies¶
- Single Model per Pod: complete isolation between models
- Multi-Model per Pod: better resource utilization
- Model Sharding: distributing large models across several GPUs or nodes
- Dynamic Loading: loading models on demand
🚀 Basic Deployment with vLLM¶
1. Cluster Preparation¶
```bash
# Check the GPUs available in the cluster
kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'

# Install the NVIDIA GPU Operator (if it is not already installed)
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --create-namespace \
  --namespace gpu-operator
```
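Once the chart is installed, it helps to confirm that the operator pods come up and that the nodes start advertising nvidia.com/gpu as an allocatable resource. A minimal check, assuming the gpu-operator namespace used above:

```bash
# Wait for the GPU Operator components to become ready
kubectl get pods -n gpu-operator

# Confirm that nodes now expose allocatable GPUs
kubectl get nodes -o json | jq '.items[].status.allocatable."nvidia.com/gpu"'
```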
2. Create the Namespace and ConfigMaps¶
```yaml
# vllm-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vllm-system
  labels:
    name: vllm-system
```

```yaml
# vllm-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
  namespace: vllm-system
data:
  MODEL_NAME: "microsoft/DialoGPT-medium"
  MODEL_REVISION: "main"
  DTYPE: "float16"
  MAX_MODEL_LEN: "2048"
  GPU_MEMORY_UTILIZATION: "0.9"
  MAX_NUM_SEQS: "256"
  TENSOR_PARALLEL_SIZE: "1"
```
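With both files saved under the names in the comments, a short sequence applies them and confirms that the configuration landed in the new namespace:

```bash
kubectl apply -f vllm-namespace.yaml
kubectl apply -f vllm-config.yaml

# Inspect the resulting ConfigMap
kubectl get configmap vllm-config -n vllm-system -o yaml
```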
3. vLLM Deployment¶
```yaml
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: vllm-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000
        envFrom:
        - configMapRef:
            name: vllm-config
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
```
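After applying the Deployment you can watch the rollout and exercise the OpenAI-compatible API served by the vllm/vllm-openai image. The request below is a sketch: it port-forwards straight to the Deployment and assumes the served model name matches the MODEL_NAME from the ConfigMap.

```bash
kubectl apply -f vllm-deployment.yaml
kubectl rollout status deployment/vllm-deployment -n vllm-system

# Forward the API port locally and send a test completion request
kubectl port-forward -n vllm-system deployment/vllm-deployment 8000:8000 &
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/DialoGPT-medium", "prompt": "Hello", "max_tokens": 16}'
```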
4. Service and Ingress¶
```yaml
# vllm-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm-system
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
```

```yaml
# vllm-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: vllm-system
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: vllm.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: vllm-service
            port:
              number: 80
```
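To wire everything together, apply both objects and test the route. The host vllm.yourdomain.com is the placeholder from the Ingress rule, so DNS (or a local hosts entry) has to point it at your ingress controller.

```bash
kubectl apply -f vllm-service.yaml
kubectl apply -f vllm-ingress.yaml

# Check that the Ingress received an address
kubectl get ingress vllm-ingress -n vllm-system

# Test through the Ingress (assumes DNS resolves the placeholder host)
curl -s http://vllm.yourdomain.com/v1/models
```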
📊 Auto-Scaling with HPA¶
Metrics Configuration¶
```yaml
# metrics-server.yaml (if it is not already installed)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: metrics-server
        image: k8s.gcr.io/metrics-server/metrics-server:v0.6.3
        args:
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
```
HPA for vLLM¶
```yaml
# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: vllm-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: External
    external:
      metric:
        name: nvidia_com_gpu_utilization
        selector:
          matchLabels:
            app: vllm
      target:
        type: AverageValue
        averageValue: 80
```
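The GPU entry is an External metric, so it only resolves if a metrics adapter (for example prometheus-adapter backed by DCGM data) actually publishes nvidia_com_gpu_utilization; without one, the HPA still scales on the CPU and memory targets. To apply it and observe its behavior:

```bash
kubectl apply -f vllm-hpa.yaml

# Watch current vs. target utilization and the replica count
kubectl get hpa vllm-hpa -n vllm-system --watch

# Inspect scaling events and any metric-resolution errors
kubectl describe hpa vllm-hpa -n vllm-system
```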
🔧 Advanced Optimizations¶
1. Model Caching and Warm-up¶
```yaml
# vllm-deployment-optimized.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment-optimized
  namespace: vllm-system
spec:
  template:
    spec:
      initContainers:
      - name: model-cache
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          python -c "
          from vllm import LLM
          llm = LLM(model='microsoft/DialoGPT-medium', download_dir='/tmp/models')
          print('Model cached successfully')
          "
        volumeMounts:
        - name: model-cache
          mountPath: /tmp/models
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        env:
        - name: VLLM_CACHE_DIR
          value: /tmp/models
        volumeMounts:
        - name: model-cache
          mountPath: /tmp/models
      volumes:
      - name: model-cache
        emptyDir: {}
```
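To confirm that the init container finished downloading before the serving container started, inspect its logs and the shared volume:

```bash
# Logs from the init container that pre-downloads the model
kubectl logs deployment/vllm-deployment-optimized -c model-cache -n vllm-system

# Verify that the weights are present on the shared volume
kubectl exec -n vllm-system deployment/vllm-deployment-optimized -- ls -lh /tmp/models
```

Because the volume is an emptyDir, the cache only lives as long as the pod; a PersistentVolumeClaim (or a read-only volume pre-populated with the weights) would survive restarts and avoid repeated downloads.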
2. Multi-Model Deployment¶
```yaml
# multi-model-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: multi-model-config
  namespace: vllm-system
data:
  models.json: |
    [
      {
        "name": "gpt2-medium",
        "model": "microsoft/DialoGPT-medium",
        "max_model_len": 1024
      },
      {
        "name": "gpt2-large",
        "model": "microsoft/DialoGPT-large",
        "max_model_len": 1024
      }
    ]
```
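Since a single vLLM server process serves one model, one way to consume models.json is to render one Deployment per entry. The loop below is only a sketch: it assumes a hypothetical template file vllm-deployment-template.yaml containing __NAME__ and __MODEL__ placeholders, which is not part of this guide.

```bash
# Render and apply one Deployment per model entry (hypothetical template with placeholders)
kubectl get configmap multi-model-config -n vllm-system -o jsonpath='{.data.models\.json}' |
  jq -c '.[]' | while read -r entry; do
    name=$(echo "$entry" | jq -r '.name')
    model=$(echo "$entry" | jq -r '.model')
    sed -e "s|__NAME__|$name|g" -e "s|__MODEL__|$model|g" vllm-deployment-template.yaml |
      kubectl apply -f -
  done
```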
3. GPU Memory Optimization¶
```yaml
# vllm-gpu-optimized.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gpu-optimized
  namespace: vllm-system
spec:
  template:
    spec:
      containers:
      - name: vllm
        env:
        - name: VLLM_GPU_MEMORY_UTILIZATION
          value: "0.95"
        - name: VLLM_MAX_NUM_SEQS
          value: "128"
        - name: VLLM_MAX_NUM_BATCHED_TOKENS
          value: "4096"
        - name: VLLM_ENABLE_CHUNKED_PREFILL
          value: "true"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 32Gi
          requests:
            nvidia.com/gpu: 1
            memory: 16Gi
```
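To see how much GPU memory these settings actually consume, you can run nvidia-smi inside the serving container (the NVIDIA utilities are normally injected by the container runtime when the GPU Operator manages the node) and cross-check the figures vLLM prints at startup:

```bash
# GPU memory usage as seen from inside the pod
kubectl exec -n vllm-system deployment/vllm-gpu-optimized -- nvidia-smi

# Memory and KV-cache figures reported by vLLM during startup
kubectl logs deployment/vllm-gpu-optimized -n vllm-system | grep -iE "gpu memory|kv cache"
```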
📈 Monitoring and Observability¶
vLLM Metrics¶
```yaml
# prometheus-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-monitor
  namespace: vllm-system
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
```
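Prometheus Operator discovers targets through the Service, so the Service defined earlier needs the app: vllm label and a port named metrics for this ServiceMonitor to select it. A minimal way to retrofit both and to sanity-check the endpoint:

```bash
# Label the Service and name its port "metrics" (the merge patch replaces the ports list)
kubectl label service vllm-service -n vllm-system app=vllm --overwrite
kubectl patch service vllm-service -n vllm-system --type=merge \
  -p '{"spec":{"ports":[{"name":"metrics","port":80,"targetPort":8000}]}}'

# Confirm that Prometheus-format metrics are exposed
kubectl port-forward -n vllm-system svc/vllm-service 8080:80 &
curl -s http://localhost:8080/metrics | head -n 20
```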
Grafana Dashboard¶
```json
{
  "dashboard": {
    "title": "vLLM Performance Dashboard",
    "panels": [
      {
        "title": "GPU Utilization",
        "type": "graph",
        "targets": [
          {
            "expr": "nvidia_gpu_utilization{namespace=\"vllm-system\"}",
            "legendFormat": "{{ pod }}"
          }
        ]
      },
      {
        "title": "Request Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          }
        ]
      }
    ]
  }
}
```
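Since the JSON is already wrapped in a dashboard object, it can be pushed through Grafana's dashboard API. The call below is a sketch: it assumes the JSON is saved as vllm-dashboard.json, a Grafana instance reachable at http://grafana.example.com, and a token in GRAFANA_TOKEN with editor rights; the query expressions also depend on which exporters you run (DCGM for GPU metrics, vLLM's own Prometheus metrics for latency).

```bash
# Import the dashboard via the Grafana HTTP API (hypothetical host and token)
curl -s -X POST http://grafana.example.com/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @vllm-dashboard.json
```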
🔄 Update Strategies¶
Rolling Updates¶
```yaml
# vllm-deployment-rolling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: vllm-system
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    # ... rest of the configuration
```
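With maxUnavailable: 0 and maxSurge: 1, a new pod (and therefore a free GPU) must be schedulable before the old one is removed. Rolling out a new image and watching or undoing it looks like this; the tag is illustrative:

```bash
# Roll out a new image version (illustrative tag)
kubectl set image deployment/vllm-deployment vllm=vllm/vllm-openai:v0.5.0 -n vllm-system
kubectl rollout status deployment/vllm-deployment -n vllm-system

# Roll back if the new revision misbehaves
kubectl rollout undo deployment/vllm-deployment -n vllm-system
```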
Blue-Green Deployment¶
```bash
#!/bin/bash
# blue-green-deployment.sh

# Create the new (green) version
kubectl apply -f vllm-deployment-green.yaml

# Wait until it is ready
kubectl wait --for=condition=available --timeout=300s deployment/vllm-deployment-green -n vllm-system

# Switch the service to green
kubectl patch service vllm-service -n vllm-system -p '{"spec":{"selector":{"version":"green"}}}'

# Verify that it works
# ... tests ...

# Remove blue
kubectl delete deployment vllm-deployment-blue -n vllm-system
```
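The verification step in the script is intentionally left open. A minimal smoke test, run after the selector switch and before blue is deleted, could simply hit the health endpoint through the Service:

```bash
# Minimal smoke test against the Service after the selector switch
kubectl port-forward -n vllm-system svc/vllm-service 8080:80 &
curl -sf http://localhost:8080/health && echo "green deployment healthy"
```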
🛡️ Security and Compliance¶
Network Policies¶
```yaml
# vllm-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-network-policy
  namespace: vllm-system
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8000
  egress:
  - to:
    # kube-dns runs in kube-system, so the pod selector must be combined
    # with a namespace selector; a bare podSelector only matches this namespace
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
```
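A quick way to verify the policy is to send a request from a namespace that is not allowed by the ingress rule; the test assumes the ingress controller namespace carries the name: ingress-nginx label used above.

```bash
# From a pod in an unlabeled namespace: this request should time out (blocked by the policy)
kubectl run np-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -m 5 http://vllm-service.vllm-system.svc.cluster.local/health

# Confirm which pods the policy applies to
kubectl describe networkpolicy vllm-network-policy -n vllm-system
```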
Secret Management¶
```yaml
# vllm-secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: vllm-secrets
  namespace: vllm-system
type: Opaque
data:
  huggingface-token: <base64-encoded-token>
  api-key: <base64-encoded-key>
```
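Rather than base64-encoding values by hand, the same Secret can be created directly from literals; the values are read here from the shell variables HF_TOKEN and API_KEY, which you set yourself. The Hugging Face token is typically injected into the vLLM container afterwards with an env valueFrom reference (for example as HUGGING_FACE_HUB_TOKEN) so gated models can be downloaded.

```bash
# Create the Secret from literal values (kubectl handles the base64 encoding)
kubectl create secret generic vllm-secrets -n vllm-system \
  --from-literal=huggingface-token="$HF_TOKEN" \
  --from-literal=api-key="$API_KEY" \
  --dry-run=client -o yaml | kubectl apply -f -
```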
📊 Cost Optimization¶
Spot and Preemptible Instances¶
```yaml
# spot-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-spot
  namespace: vllm-system
spec:
  template:
    spec:
      tolerations:
      - key: "cloud.google.com/gke-spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      # ... rest of the configuration
```
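Before relying on this Deployment, it is worth confirming that the cluster actually has spot-provisioned GPU nodes carrying the taint and label used above; the key shown is GKE-specific, and other providers use different ones.

```bash
# List spot nodes and their taints (GKE label shown; adjust for your provider)
kubectl get nodes -l cloud.google.com/gke-spot=true
kubectl describe nodes -l cloud.google.com/gke-spot=true | grep -A2 Taints
```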
Cost-Based Auto-Scaling¶
```yaml
# cost-based-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-cost-hpa
  namespace: vllm-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: cloud_provider_cost_per_hour
      target:
        type: AverageValue
        averageValue: 2.0
```
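A metric like cloud_provider_cost_per_hour is not built into Kubernetes; it only becomes available if an external-metrics adapter (for example prometheus-adapter fed by a cost exporter such as OpenCost) publishes it under that name. You can check whether anything serves the external metrics API before applying this HPA:

```bash
# List the external metrics currently exposed by an adapter (an error or empty list means none is installed)
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .
```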
🔍 Troubleshooting¶
Common Issues¶
- Out of Memory (OOM):

```bash
# Check the logs
kubectl logs -f deployment/vllm-deployment -n vllm-system

# Adjust the configuration
kubectl edit configmap vllm-config -n vllm-system
```

- GPU Not Available:

```bash
# Check GPU allocation
kubectl describe node <node-name>

# Check the GPU Operator
kubectl get pods -n gpu-operator
```

- Slow Inference:

```bash
# Check the metrics
kubectl exec -it deployment/vllm-deployment -n vllm-system -- curl http://localhost:8000/metrics

# Adjust the batch size
kubectl edit configmap vllm-config -n vllm-system
```
🎯 Best Practices¶
Performance¶
- Use A100/H100 GPUs for the best performance
- Set tensor_parallel_size when serving across multiple GPUs
- Implement model caching
- Monitor metrics continuously
Reliability¶
- Implement appropriate health checks
- Use rolling updates for zero-downtime releases
- Configure resource limits and requests
- Implement circuit breakers
Security¶
- Use Secrets for API tokens
- Implement network policies
- Audit access logs
- Keep models up to date
Cost Management¶
- Use spot instances where possible
- Implement intelligent auto-scaling
- Monitor costs in real time
- Optimize GPU utilization
📚 Additional Resources¶
🤝 Contributing¶
This guide is part of the Frikiteam Docs project. If you find errors or want to contribute improvements:
- Fork the repository
- Create a branch for your feature
- Submit a Pull Request
Thank you for contributing to shared knowledge!