Complete Observability Stack

A complete guide to implementing a modern observability system covering metrics, logs, traces, alerting, and advanced troubleshooting in Kubernetes and cloud-native environments.

📋 Overview

This guide covers the complete implementation of a modern observability stack: metrics with Prometheus, visualization with Grafana, centralized logs with Loki, distributed tracing with Jaeger/OpenTelemetry, and advanced alerting with Alertmanager. You will go from basic installation to production configurations, custom dashboards, and troubleshooting strategies.

🎯 Audience

  • DevOps/SRE engineers
  • Cloud-native systems administrators
  • Developers who need production monitoring
  • Operations teams that require advanced alerting and troubleshooting

📚 Prerequisites

  • Basic knowledge of Kubernetes
  • Familiarity with Docker and containers
  • Basic understanding of metrics, logs, and tracing
  • A working Kubernetes cluster (minikube, k3s, EKS, GKE, AKS, or self-hosted)

🏗️ Complete Stack Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Applications   │───▶│   OpenTelemetry │───▶│     Jaeger      │
│  (Instrumented) │    │   Collector     │    │  (Tracing)      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Prometheus    │    │      Loki       │    │    Grafana      │
│   (Metrics)     │    │    (Logs)       │    │ (Visualization) │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 ▼
                    ┌─────────────────┐
                    │  Alertmanager   │
                    │   (Alerting)    │
                    └─────────────────┘

Stack Components

  • Prometheus: time-series metrics collection and storage
  • Grafana: visualization, dashboards, and alerting
  • Loki: centralized logs and efficient search
  • Jaeger: distributed tracing and latency analysis
  • OpenTelemetry: standard for application instrumentation
  • Alertmanager: alert management and routing
  • Node Exporter: operating system metrics
  • cAdvisor/Kubelet: container and Kubernetes metrics

🚀 Installation on Kubernetes with Helm

Prepare the environment

# Add Helm repositories (open-telemetry is needed for the collector chart installed below)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

# Create a namespace for observability
kubectl create namespace observability

Install the complete Prometheus Stack

# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace observability \
  --set grafana.enabled=true \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.retentionSize="50GB" \
  --wait

Install Loki for centralized logs

# Install the Loki Stack (Loki + Promtail); Grafana and Prometheus already come from
# kube-prometheus-stack, so keep them disabled here and add Loki as a Grafana data source afterwards
helm install loki grafana/loki-stack \
  --namespace observability \
  --set grafana.enabled=false \
  --set prometheus.enabled=false \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi \
  --set loki.config.limits_config.retention_period=168h \
  --wait

Install Jaeger for distributed tracing

# Install a full Jaeger deployment backed by Cassandra
helm install jaeger jaegertracing/jaeger \
  --namespace observability \
  --set cassandra.config.max_heap_size=1024M \
  --set cassandra.config.heap_new_size=256M \
  --set storage.type=cassandra \
  --wait

Install the OpenTelemetry Collector

# otel-collector-values.yaml
mode: deployment
image:
  repository: otel/opentelemetry-collector-contrib

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    prometheus:
      config:
        scrape_configs:
          - job_name: 'otel-collector'
            static_configs:
              - targets: ['localhost:8888']

  processors:
    batch:
      timeout: 1s
      send_batch_size: 1024
    memory_limiter:
      limit_mib: 512
      spike_limit_mib: 128

  exporters:
    jaeger:
      # gRPC endpoint of the Jaeger collector (14250); 14268 is the HTTP/Thrift port
      endpoint: jaeger-collector:14250
      tls:
        insecure: true
    prometheus:
      # The prometheus exporter exposes a scrape endpoint; have Prometheus scrape it
      endpoint: "0.0.0.0:8889"
    loki:
      endpoint: "http://loki:3100/loki/api/v1/push"

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [jaeger]
      metrics:
        receivers: [prometheus]
        processors: [memory_limiter, batch]
        exporters: [prometheus]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [loki]

# Install the collector with the values file above
helm install opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace observability \
  -f otel-collector-values.yaml \
  --wait
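
To confirm that the whole pipeline works after the installation, you can emit a single test span towards the collector. A minimal sketch, assuming the OpenTelemetry Python SDK is installed (opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc) and that port 4317 of the collector is reachable locally, for example via a port-forward of the collector service (the exact service name depends on your Helm release):

# send_test_span.py - emit one test span to verify the collector pipeline
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# OTLP/gRPC exporter pointing at the (port-forwarded) collector
provider = TracerProvider(resource=Resource.create({"service.name": "pipeline-test"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

with trace.get_tracer("pipeline-test").start_as_current_span("smoke-test") as span:
    span.set_attribute("test", True)

provider.force_flush()  # ensure the span is exported before the script exits
print("Test span sent; it should show up in Jaeger under service 'pipeline-test'.")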

📊 Exposing the Services

# Expose Prometheus
kubectl port-forward -n observability svc/prometheus-kube-prometheus-prometheus 9090:9090

# Expose Grafana (the Grafana service listens on port 80)
kubectl port-forward -n observability svc/prometheus-grafana 3000:80

# Expose the Jaeger UI
kubectl port-forward -n observability svc/jaeger-query 16686:16686

# Expose Loki (optional)
kubectl port-forward -n observability svc/loki 3100:3100
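
With these port-forwards active, a quick smoke test is to hit each component's health endpoint. A minimal sketch using only the Python standard library; it assumes the local ports shown above, and the Jaeger UI simply answers with HTML on its root path:

# check_stack.py - smoke test the observability stack via local port-forwards
import urllib.request

# Health/readiness endpoints, assuming the port-forwards shown above
ENDPOINTS = {
    "Prometheus": "http://localhost:9090/-/healthy",
    "Grafana": "http://localhost:3000/api/health",
    "Jaeger UI": "http://localhost:16686/",
    "Loki": "http://localhost:3100/ready",
}

for name, url in ENDPOINTS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name:<12} OK (HTTP {resp.status})")
    except Exception as exc:  # connection refused, timeout, HTTP error...
        print(f"{name:<12} FAILED: {exc}")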

Initial access to Grafana

# Get the admin password
kubectl get secret -n observability prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

# User: admin
# URL: http://localhost:3000

🎨 Custom Dashboards in Grafana

Kubernetes Nodes Dashboard

Import dashboard ID 1860 (Node Exporter Full) or create a custom one:

{
  "dashboard": {
    "title": "Kubernetes Nodes - Custom",
    "tags": ["kubernetes", "nodes", "custom"],
    "panels": [
      {
        "title": "CPU Usage by Node",
        "type": "bargauge",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Memory Usage by Node",
        "type": "bargauge",
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Network I/O",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total[5m])",
            "legendFormat": "RX {{ instance }}"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "TX {{ instance }}"
          }
        ]
      },
      {
        "title": "Disk I/O",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_disk_reads_completed_total[5m])",
            "legendFormat": "Reads {{ instance }}"
          },
          {
            "expr": "rate(node_disk_writes_completed_total[5m])",
            "legendFormat": "Writes {{ instance }}"
          }
        ]
      }
    ]
  }
}
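
Dashboards like this one can also be provisioned through Grafana's HTTP API (POST /api/dashboards/db). A minimal sketch, assuming Grafana is port-forwarded on localhost:3000, the JSON above was saved as nodes-dashboard.json (a placeholder file name), and basic auth with the admin password retrieved earlier; a service account token is preferable in production:

# upload_dashboard.py - push a dashboard JSON to Grafana via its HTTP API
import base64
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"
CREDENTIALS = b"admin:YOUR_ADMIN_PASSWORD"   # placeholder; use the password retrieved above

with open("nodes-dashboard.json") as fh:     # the {"dashboard": {...}} document shown above
    payload = json.load(fh)
payload["overwrite"] = True                  # replace an existing dashboard with the same title/uid

request = urllib.request.Request(
    f"{GRAFANA_URL}/api/dashboards/db",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Basic " + base64.b64encode(CREDENTIALS).decode(),
    },
    method="POST",
)
with urllib.request.urlopen(request) as resp:
    print(json.loads(resp.read()))           # expect {"status": "success", ...}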

Pods and Containers Dashboard

{
  "dashboard": {
    "title": "Kubernetes Pods - Custom",
    "tags": ["kubernetes", "pods", "containers"],
    "panels": [
      {
        "title": "Pod CPU Usage",
        "type": "table",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (pod, namespace)",
            "legendFormat": "{{ pod }}"
          }
        ],
        "fieldConfig": {
          "overrides": [
            {
              "matcher": { "id": "byName", "options": "Time" },
              "properties": [{ "id": "custom.hidden", "value": true }]
            }
          ]
        }
      },
      {
        "title": "Pod Memory Usage",
        "type": "table",
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes{container!=\"\"}) by (pod, namespace) / 1024 / 1024",
            "legendFormat": "{{ pod }}"
          }
        ]
      },
      {
        "title": "Container Restarts",
        "type": "table",
        "targets": [
          {
            "expr": "kube_pod_container_status_restarts_total",
            "legendFormat": "{{ pod }}"
          }
        ]
      }
    ]
  }
}

Ingress and Services Dashboard

{
  "dashboard": {
    "title": "Kubernetes Ingress & Services",
    "tags": ["kubernetes", "ingress", "services"],
    "panels": [
      {
        "title": "HTTP Requests by Ingress",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(nginx_ingress_controller_requests{ingress!=\"\"}[5m])) by (ingress, status)",
            "legendFormat": "{{ ingress }} - {{ status }}"
          }
        ]
      },
      {
        "title": "Response Time by Ingress",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(nginx_ingress_controller_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95 {{ ingress }}"
          }
        ]
      },
      {
        "title": "Active Connections",
        "type": "singlestat",
        "targets": [
          {
            "expr": "nginx_ingress_controller_nginx_process_connections{state=\"active\"}",
            "legendFormat": "Active"
          }
        ]
      }
    ]
  }
}

🔍 OpenTelemetry and Distributed Tracing

Application Instrumentation

Python with OpenTelemetry

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure the OTLP exporter
otlp_exporter = OTLPSpanExporter(
    endpoint="otel-collector.observability.svc.cluster.local:4317",
    insecure=True
)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Auto-instrumentation for popular frameworks
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Manual instrumentation
@app.route('/api/users')
def get_users():
    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("operation.name", "database_query")
        span.set_attribute("db.table", "users")

        # Your business logic here
        users = db.query("SELECT * FROM users")

        span.set_attribute("db.rows_returned", len(users))
        return jsonify(users)

JavaScript/Node.js with OpenTelemetry

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { trace } = require('@opentelemetry/api');

const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({
  // gRPC endpoint of the collector (port 4317)
  url: 'http://otel-collector.observability.svc.cluster.local:4317'
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Auto-instrumentation for Express and the HTTP module
registerInstrumentations({
  instrumentations: [new ExpressInstrumentation(), new HttpInstrumentation()],
});

// Manual instrumentation
app.get('/api/users', (req, res) => {
  const span = trace.getTracer('my-service').startSpan('get_users');
  span.setAttribute('operation.name', 'database_query');

  // Your logic here
  db.query('SELECT * FROM users', (err, results) => {
    span.setAttribute('db.rows_returned', results.length);
    span.end();
    res.json(results);
  });
});

Java with OpenTelemetry

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Scope;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class MyService {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("my-service");

    public void processRequest() {
        Span span = tracer.spanBuilder("process_request").startSpan();

        try (Scope scope = span.makeCurrent()) {
            span.setAttribute("operation.name", "business_logic");

            // Your business logic
            doBusinessLogic();

            span.setAttribute("result", "success");
        } catch (Exception e) {
            span.setAttribute("error", true);
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}

OpenTelemetry Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Additional receivers for system metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          static_configs:
            - targets: ['localhost:8888']

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128

  # Processor to add common attributes
  resource:
    attributes:
      - key: service.namespace
        value: "production"
        action: insert
      - key: k8s.cluster.name
        value: "my-cluster"
        action: insert

exporters:
  jaeger:
    # gRPC endpoint of the Jaeger collector (14250); 14268 is the HTTP/Thrift port
    endpoint: jaeger-collector.observability.svc.cluster.local:14250
    tls:
      insecure: true
  prometheus:
    # The prometheus exporter exposes a scrape endpoint; have Prometheus scrape it
    endpoint: "0.0.0.0:8889"
  loki:
    endpoint: "http://loki.observability.svc.cluster.local:3100/loki/api/v1/push"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]

🔔 Advanced Alerting with Alertmanager

Alerting Rules for Kubernetes

# alert-rules.yaml
groups:
- name: kubernetes-apps
  rules:
  - alert: HighPodRestartRate
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
    for: 10m
    labels:
      severity: warning
      team: platform
    annotations:
      summary: "High pod restart rate"
      description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times in the last 15 minutes."
      runbook_url: "https://docs.frikiteam.es/doc/monitoring/troubleshooting#pod-restarts"

  - alert: PodNotReady
    expr: kube_pod_status_ready{condition="false"} == 1
    for: 5m
    labels:
      severity: critical
      team: platform
    annotations:
      summary: "Pod not ready"
      description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready for 5+ minutes."

  - alert: HighMemoryUsage
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
    for: 5m
    labels:
      severity: warning
      team: infrastructure
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is {{ $value }}% on {{ $labels.instance }}."

  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
    for: 5m
    labels:
      severity: warning
      team: infrastructure
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is {{ $value }}% on {{ $labels.instance }}."

- name: application-metrics
  rules:
  - alert: HighErrorRate
    expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
    for: 5m
    labels:
      severity: critical
      team: backend
    annotations:
      summary: "High error rate on {{ $labels.service }}"
      description: "Error rate is {{ $value }}% for service {{ $labels.service }}."

  - alert: SlowResponseTime
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
    for: 5m
    labels:
      severity: warning
      team: backend
    annotations:
      summary: "Slow response time on {{ $labels.service }}"
      description: "95th percentile response time is {{ $value }}s for service {{ $labels.service }}."
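
Before loading these rules, it helps to preview what an expression would match by running it as an instant query against the Prometheus API (/api/v1/query). A minimal sketch, assuming the port-forward on localhost:9090 and using the HighMemoryUsage expression from above:

# preview_rule.py - run an alert expression as an instant query against Prometheus
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"
EXPR = '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90'

url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": EXPR})
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.loads(resp.read())["data"]["result"]

if not result:
    print("Expression matches no series right now (the alert would not fire).")
for sample in result:
    instance = sample["metric"].get("instance", "<unknown>")
    value = float(sample["value"][1])
    print(f"{instance}: {value:.1f}% memory used")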

Advanced Alertmanager Configuration

# alertmanager-config.yaml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@frikiteam.es'
  smtp_auth_username: 'alerts@frikiteam.es'
  smtp_auth_password: 'your-app-password'

# Notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'team']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
    continue: true
  - match:
      team: platform
    receiver: 'platform-team'
  - match:
      team: infrastructure
    receiver: 'infra-team'
  - match:
      team: backend
    receiver: 'backend-team'

receivers:
- name: 'default'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    channel: '#alerts'
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ .CommonAnnotations.description }}'
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

- name: 'critical-alerts'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/CRITICAL/WEBHOOK'
    channel: '#critical-alerts'
    title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
    text: '{{ .CommonAnnotations.description }}'
    color: 'danger'
  pagerduty_configs:
  - service_key: 'your-pagerduty-integration-key'

- name: 'platform-team'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/PLATFORM/WEBHOOK'
    channel: '#platform-alerts'

- name: 'infra-team'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/INFRA/WEBHOOK'
    channel: '#infra-alerts'
  email_configs:
  - to: 'infra@frikiteam.es'
    subject: 'Infrastructure Alert: {{ .GroupLabels.alertname }}'
    body: '{{ .CommonAnnotations.description }}'

- name: 'backend-team'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/BACKEND/WEBHOOK'
    channel: '#backend-alerts'
  email_configs:
  - to: 'backend@frikiteam.es'

# Inhibition rules to avoid duplicate alerts
inhibit_rules:
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: 'PodNotReady|HighMemoryUsage|HighCPUUsage'
    equal: ['instance']
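
Once this configuration is loaded, currently firing alerts can also be inspected programmatically through the Alertmanager v2 API. A minimal sketch, assuming Alertmanager is port-forwarded on localhost:9093 (see the useful commands section later in this guide):

# list_alerts.py - list active alerts from the Alertmanager v2 API
import json
import urllib.request

ALERTMANAGER_URL = "http://localhost:9093"

with urllib.request.urlopen(f"{ALERTMANAGER_URL}/api/v2/alerts?active=true", timeout=10) as resp:
    alerts = json.loads(resp.read())

for alert in alerts:
    labels = alert["labels"]
    print(f'{labels.get("severity", "none"):<8} '
          f'{labels.get("alertname", "?"):<25} '
          f'{alert.get("startsAt", "")}  '
          f'{alert["annotations"].get("summary", "")}')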

💾 Storage Monitoring (Ceph, Pure Storage, NetApp)

Ceph Metrics

# ceph-exporter-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ceph-mgr
  namespace: observability
spec:
  selector:
    matchLabels:
      app: ceph-mgr
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s

Ceph Dashboard in Grafana

{
  "dashboard": {
    "title": "Ceph Cluster Overview",
    "tags": ["ceph", "storage"],
    "panels": [
      {
        "title": "Cluster Health",
        "type": "stat",
        "targets": [
          {
            "expr": "ceph_health_status",
            "legendFormat": "Health"
          }
        ]
      },
      {
        "title": "OSD Usage",
        "type": "bargauge",
        "targets": [
          {
            "expr": "ceph_osd_stat_bytes_used / ceph_osd_stat_bytes",
            "legendFormat": "{{ osd }}"
          }
        ]
      },
      {
        "title": "Pool Usage",
        "type": "table",
        "targets": [
          {
            "expr": "ceph_pool_bytes_used / ceph_pool_max_avail * 100",
            "legendFormat": "{{ pool }}"
          }
        ]
      },
      {
        "title": "IOPS by Pool",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(ceph_pool_rd[5m]) + rate(ceph_pool_wr[5m])",
            "legendFormat": "{{ pool }}"
          }
        ]
      }
    ]
  }
}

Pure Storage Metrics

# pure-storage-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pure-storage-exporter
  namespace: observability
spec:
  selector:
    matchLabels:
      app: pure-storage-exporter
  endpoints:
  - port: metrics
    interval: 60s

NetApp Metrics

# PromQL queries for NetApp
# Usage per volume
netapp_volume_size_used / netapp_volume_size_total * 100

# IOPS per LUN
rate(netapp_lun_read_ops[5m]) + rate(netapp_lun_write_ops[5m])

# Storage latency
netapp_lun_avg_read_latency + netapp_lun_avg_write_latency

🌐 Network Monitoring

Basic Network Metrics

# Bandwidth per interface (Mbit/s)
rate(node_network_receive_bytes_total[5m]) * 8 / 1000000

# Network latency (if you run the blackbox exporter)
probe_duration_seconds

# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

# TCP connections
node_netstat_Tcp_CurrEstab
node_netstat_Tcp_ActiveOpens

Networking Dashboard

{
  "dashboard": {
    "title": "Network Monitoring",
    "tags": ["networking", "infrastructure"],
    "panels": [
      {
        "title": "Network Traffic by Interface",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total[5m]) * 8 / 1000000",
            "legendFormat": "RX {{ device }}"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total[5m]) * 8 / 1000000",
            "legendFormat": "TX {{ device }}"
          }
        ]
      },
      {
        "title": "Network Errors",
        "type": "table",
        "targets": [
          {
            "expr": "rate(node_network_receive_errs_total[5m])",
            "legendFormat": "RX Errors {{ device }}"
          },
          {
            "expr": "rate(node_network_transmit_errs_total[5m])",
            "legendFormat": "TX Errors {{ device }}"
          }
        ]
      },
      {
        "title": "TCP Connections",
        "type": "singlestat",
        "targets": [
          {
            "expr": "node_netstat_Tcp_CurrEstab",
            "legendFormat": "Established"
          }
        ]
      }
    ]
  }
}

🔧 Troubleshooting with Jaeger

Jaeger Configuration for Production

# jaeger-production-values.yaml
collector:
  enabled: true
  service:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "14268"
  otlp:
    enabled: true
  # Sampling configuration
  sampling:
    strategies:
      probabilistic:
        samplingRate: 0.1  # 10% of traces
      rateLimiting:
        maxTracesPerSecond: 100

storage:
  type: cassandra
  cassandra:
    host: jaeger-cassandra
    keyspace: jaeger_v1_dc1
  # For production, consider Elasticsearch instead:
  # type: elasticsearch
  # elasticsearch:
  #   host: elasticsearch-master
  #   indexPrefix: jaeger

query:
  enabled: true
  service:
    type: ClusterIP

Trace Analysis in Jaeger

# Search for traces by service
curl -G "http://jaeger-query:16686/api/traces" \
  --data-urlencode "service=my-service" \
  --data-urlencode "limit=20"

# Search for traces with errors
curl -G "http://jaeger-query:16686/api/traces" \
  --data-urlencode "service=my-service" \
  --data-urlencode "tags={\"error\":\"true\"}"

# Search for traces by duration (at most 5s here; use minDuration to find slow traces)
curl -G "http://jaeger-query:16686/api/traces" \
  --data-urlencode "service=my-service" \
  --data-urlencode "maxDuration=5s"
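
The same API responses can be post-processed to spot the slowest operations. A minimal sketch, assuming jaeger-query is port-forwarded on localhost:16686, that "my-service" is a placeholder service name, and that span durations come back in microseconds as in the usual Jaeger JSON format:

# slow_spans.py - find the slowest operations in recent traces from the Jaeger API
import json
import urllib.parse
import urllib.request
from collections import defaultdict

JAEGER_URL = "http://localhost:16686"
SERVICE = "my-service"   # placeholder service name

params = urllib.parse.urlencode({"service": SERVICE, "limit": 50, "lookback": "1h"})
with urllib.request.urlopen(f"{JAEGER_URL}/api/traces?{params}", timeout=10) as resp:
    traces = json.loads(resp.read())["data"]

worst = defaultdict(float)   # operation name -> worst duration seen (ms)
for trace in traces:
    for span in trace["spans"]:
        duration_ms = span["duration"] / 1000.0   # Jaeger reports microseconds
        worst[span["operationName"]] = max(worst[span["operationName"]], duration_ms)

for operation, duration_ms in sorted(worst.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{duration_ms:8.1f} ms  {operation}")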

TraceQL Queries in Grafana

TraceQL is Tempo's query language; running these queries in Grafana requires a Tempo data source (the Jaeger data source uses its own search form).

# Slow traces
{ duration > 5s }

# Traces with errors
{ status = error }

# Traces by service and operation
{ resource.service.name = "my-service" && name = "http_request" }

# Traces with specific attributes
{ resource.service.name = "api-gateway" && span.http.status_code = 500 }

🚀 Production Configuration

High Availability

# prometheus-ha-values.yaml
prometheus:
  prometheusSpec:
    replicas: 2
    retention: 90d
    retentionSize: "200GB"
    ruleSelector:
      matchLabels:
        prometheus: kube-prometheus
    serviceMonitorSelector:
      matchLabels:
        prometheus: kube-prometheus
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    securityContext:
      runAsUser: 65534
      runAsNonRoot: true
      runAsGroup: 65534
      fsGroup: 65534

alertmanager:
  alertmanagerSpec:
    replicas: 2
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

Backup and Recovery

# Back up Prometheus
kubectl exec -n observability prometheus-prometheus-0 -- tar czf /tmp/prometheus-backup.tar.gz -C /prometheus .

# Back up Loki
kubectl exec -n observability loki-0 -- tar czf /tmp/loki-backup.tar.gz -C /loki .

# Back up Grafana
kubectl exec -n observability prometheus-grafana-0 -- tar czf /tmp/grafana-backup.tar.gz -C /var/lib/grafana .

# Copy the backups to safe storage
kubectl cp observability/prometheus-prometheus-0:/tmp/prometheus-backup.tar.gz ./backups/
kubectl cp observability/loki-0:/tmp/loki-backup.tar.gz ./backups/
kubectl cp observability/prometheus-grafana-0:/tmp/grafana-backup.tar.gz ./backups/
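
If you want to run these steps on a schedule, they can be wrapped in a small script. A minimal sketch that shells out to kubectl for the same three components; the pod names follow the ones used above and may differ in your cluster:

# backup_observability.py - wrap the kubectl backup commands shown above
import pathlib
import subprocess

NAMESPACE = "observability"
TARGETS = {   # pod name -> data directory inside the container
    "prometheus-prometheus-0": "/prometheus",
    "loki-0": "/loki",
    "prometheus-grafana-0": "/var/lib/grafana",
}

pathlib.Path("backups").mkdir(exist_ok=True)

for pod, data_dir in TARGETS.items():
    archive = f"/tmp/{pod}-backup.tar.gz"
    # Create the archive inside the pod, then copy it out of the cluster
    subprocess.run(
        ["kubectl", "exec", "-n", NAMESPACE, pod, "--",
         "tar", "czf", archive, "-C", data_dir, "."],
        check=True,
    )
    subprocess.run(
        ["kubectl", "cp", f"{NAMESPACE}/{pod}:{archive}", f"./backups/{pod}-backup.tar.gz"],
        check=True,
    )
    print(f"Backed up {pod} ({data_dir})")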

Cost Monitoring

# Estimated cost per namespace (AWS example)
sum(rate(container_cpu_usage_seconds_total[1h])) by (namespace) * 0.000024 * 730 +
sum(container_memory_usage_bytes / 1024 / 1024 / 1024) by (namespace) * 0.000018 * 730

# Storage cost
sum(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (namespace) / 1024 / 1024 / 1024 * 0.12

📊 SLOs and SLIs

SLO Definitions

# slos.yaml
- name: api-availability
  objective: 99.9
  sli:
    events:
      good:
        metric: http_requests_total{status!~"5.."}
      total:
        metric: http_requests_total

- name: api-latency
  objective: 99
  sli:
    events:
      good:
        metric: http_request_duration_seconds{quantile="0.95"} < 0.1
      total:
        metric: http_request_duration_seconds_count

- name: storage-availability
  objective: 99.99
  sli:
    events:
      good:
        metric: ceph_health_status == 0
      total:
        metric: up{job="ceph-mgr"}
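
A useful sanity check for these objectives is the error budget they imply: a 99.9% target over 30 days allows roughly 43 minutes of downtime. A minimal sketch of that arithmetic, with a hypothetical measured availability to show how much budget remains:

# error_budget.py - downtime allowed by an SLO and the budget left after a given period
WINDOW_DAYS = 30

def allowed_downtime_minutes(objective_pct: float, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of downtime permitted by the objective over the window."""
    return (1 - objective_pct / 100) * window_days * 24 * 60

def budget_remaining_pct(objective_pct: float, measured_pct: float) -> float:
    """Share of the error budget still unspent (negative means the SLO is blown)."""
    budget = 100 - objective_pct
    spent = 100 - measured_pct
    return (budget - spent) / budget * 100

for name, objective in [("api-availability", 99.9), ("api-latency", 99.0), ("storage-availability", 99.99)]:
    print(f"{name:<22} {allowed_downtime_minutes(objective):8.1f} min of budget per {WINDOW_DAYS} days")

# Hypothetical measurement: 99.95% availability against the 99.9% objective
print(f"budget remaining: {budget_remaining_pct(99.9, 99.95):.0f}%")
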

SLO Dashboard

{
  "dashboard": {
    "title": "Service Level Objectives",
    "tags": ["slo", "reliability"],
    "panels": [
      {
        "title": "API Availability",
        "type": "stat",
        "targets": [
          {
            "expr": "1 - (rate(http_requests_total{status=~\"5..\"}[30d]) / rate(http_requests_total[30d]))",
            "legendFormat": "Availability"
          }
        ]
      },
      {
        "title": "API Latency p95",
        "type": "stat",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[30d]))",
            "legendFormat": "p95 Latency"
          }
        ]
      }
    ]
  }
}

🛠️ Useful Commands

# Check component status
kubectl get pods -n observability
kubectl get svc -n observability

# Troubleshooting logs
kubectl logs -n observability -l app=prometheus --tail=100
kubectl logs -n observability -l app.kubernetes.io/name=grafana --tail=100

# Restart components
kubectl rollout restart deployment/prometheus-grafana -n observability
kubectl rollout restart statefulset/prometheus-prometheus -n observability

# View active alerts
kubectl port-forward -n observability svc/prometheus-kube-prometheus-alertmanager 9093:9093
# Then: http://localhost:9093

# Back up configurations
kubectl get configmap -n observability -o yaml > observability-configs.yaml

# View Prometheus metrics
kubectl port-forward -n observability svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then: http://localhost:9090

# Direct queries to Loki
kubectl port-forward -n observability svc/loki 3100:3100
curl "http://localhost:3100/loki/api/v1/query_range?query={namespace=\"default\"}&start=1640995200&end=1640998800&step=60"

# View traces in Jaeger
kubectl port-forward -n observability svc/jaeger-query 16686:16686
# Then: http://localhost:16686

# Resource monitoring
kubectl top nodes
kubectl top pods -n observability

# View Kubernetes events
kubectl get events -n observability --sort-by=.metadata.creationTimestamp

# Debug ServiceMonitors
kubectl get servicemonitor -n observability
kubectl describe servicemonitor prometheus-kube-prometheus -n observability
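
The Loki curl above can also be scripted. A minimal sketch that runs a LogQL range query over the last hour via /loki/api/v1/query_range, assuming the port-forward on localhost:3100 and using nanosecond epoch timestamps, which Loki accepts:

# loki_query.py - run a LogQL range query against the port-forwarded Loki
import json
import time
import urllib.parse
import urllib.request

LOKI_URL = "http://localhost:3100"
QUERY = '{namespace="default"}'                 # LogQL stream selector
end_ns = int(time.time() * 1e9)                 # Loki accepts nanosecond Unix epochs
start_ns = end_ns - 3600 * 10**9                # last hour

params = urllib.parse.urlencode({"query": QUERY, "start": start_ns, "end": end_ns, "limit": 100})
with urllib.request.urlopen(f"{LOKI_URL}/loki/api/v1/query_range?{params}", timeout=10) as resp:
    streams = json.loads(resp.read())["data"]["result"]

for stream in streams:
    labels = stream["stream"]                   # label set of the log stream
    for ts, line in stream["values"]:           # [timestamp_ns, log line] pairs
        print(labels.get("pod", "?"), line)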

✅ Complete Implementation Checklist

Base Installation

  • [x] observability namespace created
  • [x] Prometheus installed with Helm
  • [x] Grafana installed with Helm
  • [x] Loki installed with Helm
  • [x] Jaeger installed with Helm
  • [x] OpenTelemetry Collector configured

Configuration

  • [x] Services exposed correctly
  • [x] Data sources configured in Grafana
  • [x] Alerting rules defined
  • [x] Basic dashboards imported

Advanced Monitoring

  • [x] Custom dashboards for Kubernetes
  • [x] OpenTelemetry instrumentation in applications
  • [x] Alerting with per-team routing
  • [x] Storage monitoring (Ceph/Pure/NetApp)
  • [x] Network monitoring
  • [x] SLOs and SLIs defined

Production

  • [x] High availability configured
  • [x] Data persistence enabled
  • [x] Automatic backups configured
  • [x] Security (TLS, RBAC) implemented
  • [x] Data retention tuned
  • [x] Cost monitoring enabled

Troubleshooting

  • [x] Runbooks documented
  • [x] Alerts with sufficient context
  • [x] Distributed tracing operational
  • [x] Centralized logs working
  • [x] Performance metrics collected