Observability Without Datadog: A $20/Month Stack

By DevOps Ninja Editorial · Published 2026-05-09 · // cornerstone

A complete production observability setup (metrics, logs, traces, alerting) running on a single small VPS for about $20/month all in. Real config. Real retention. We've deployed this stack at multiple companies; it handles workloads that would typically cost $1,500-5,000/mo on Datadog.

The Stack

Prometheus — metrics
Loki — logs (Prometheus-style label model)
Tempo — distributed traces
Grafana — single pane for all three
Alertmanager — alert routing to PagerDuty / Slack / email
VictoriaMetrics (optional) — drop-in replacement for Prometheus that handles 10x the cardinality on the same hardware
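
As a rough sketch of what "single pane" means in practice, Grafana can be provisioned with all three backends as data sources. The hostnames and ports assume the service names from the docker-compose.yml in this article, and the data source names are arbitrary labels, not something the stack requires.

```yaml
# grafana/datasources.yml (sketch)
apiVersion: 1

datasources:
  - name: Prometheus          # display name; any label works
    type: prometheus
    access: proxy             # Grafana's backend makes the requests
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
```

Provisioning from a file rather than clicking through the UI means the data sources survive a rebuilt container and can live in version control next to the rest of the config.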

The Hardware

Hetzner CX22 (2 vCPU / 4GB / 40GB SSD) at $4.59/mo — or a CX32 (4 vCPU / 8GB / 80GB) at $7.95/mo if you're handling real volume. Add a CPX31 ($14/mo) if you outgrow that. Total for a real production stack: $15-30/month.

The docker-compose.yml

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --web.enable-remote-write-receiver
    ports: ["9090:9090"]

  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki/loki.yml:/etc/loki/local-config.yaml
      - loki_data:/loki
    ports: ["3100:3100"]

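  tempo:
    image: grafana/tempo:latest
    # Point Tempo at the mounted config; the mount alone is not enough.
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo/tempo.yml:/etc/tempo.yaml
      - tempo_data:/var/tempo
    ports: ["3200:3200", "4317:4317"]  # 3200 = HTTP API, 4317 = OTLP gRPC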

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    ports: ["3000:3000"]
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=false
      # Placeholder only: set a real secret before exposing port 3000.
      - GF_SECURITY_ADMIN_PASSWORD=changeme

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports: ["9093:9093"]

volumes:
  prom_data:
  loki_data:
  tempo_data:
  grafana_data:
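
The compose file expects a ./prometheus/prometheus.yml to exist. A minimal sketch of one follows, assuming the service names above. The node-exporter target and the rules directory are hypothetical additions: neither is part of the compose file, so each would need its own service or volume mount.

```yaml
# prometheus/prometheus.yml (sketch)
global:
  scrape_interval: 30s        # assumed value; lower it for finer resolution
  evaluation_interval: 30s

rule_files:
  # Hypothetical path: mount a ./prometheus/rules directory here.
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  # The observability stack monitoring itself.
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: loki
    static_configs:
      - targets: ["loki:3100"]
  - job_name: tempo
    static_configs:
      - targets: ["tempo:3200"]
  - job_name: grafana
    static_configs:
      - targets: ["grafana:3000"]

  # Hypothetical: host metrics from a node_exporter service you add yourself.
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
```

Application services would be added as further scrape jobs, or they can push via remote write, which the `--web.enable-remote-write-receiver` flag in the compose file enables.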

The Retention Strategy

The mistake most teams make: trying to keep raw metrics for a year. Don't. Use a tiered retention policy:

30 days — full resolution metrics in Prometheus. Covers 99% of debugging needs.
1 year — downsampled aggregates (5m / 1h buckets) in VictoriaMetrics or Mimir. Covers trend analysis.
7-14 days — full logs in Loki.
3 days — full traces in Tempo. Sample at 1-10% if traffic is high.
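
The Prometheus tier is already set by the `--storage.tsdb.retention.time=30d` flag in the compose file. For Loki and Tempo, retention lives in their own config files. The fragments below are a sketch rather than complete configs, and key names have shifted between releases, so check them against the version you actually run.

```yaml
# loki/loki.yml (fragment): keep 14 days of logs
limits_config:
  retention_period: 336h              # 14 days
compactor:
  working_directory: /loki/compactor  # inside the loki_data volume
  retention_enabled: true             # the compactor only deletes old chunks when this is true
  delete_request_store: filesystem    # expected by Loki 3.x with retention on; older releases use other keys
---
# tempo/tempo.yml (fragment): keep 3 days of traces
compactor:
  compaction:
    block_retention: 72h
```

For the 1-10% trace sampling, OpenTelemetry SDKs generally honor the environment variables `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and `OTEL_TRACES_SAMPLER_ARG=0.1` (10%), which keeps the sampling decision in the application rather than in Tempo.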

The Cardinality Discipline

This is the #1 reason self-hosted observability fails. Every label combination is a unique series. user_id as a label = one series per user = your Prometheus dies. Rules of thumb:

Max 100 unique values per label.
Never label by user_id, request_id, or any unbounded identifier.
Use recording rules to pre-aggregate hot queries.
Audit cardinality monthly with topk(20, count by (__name__)({__name__=~".+"})).
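
A recording rule file for the pre-aggregation rule of thumb might look like the sketch below. The metric `http_requests_total` and its `code` label are assumptions about your instrumentation, and the file only takes effect if it is listed under `rule_files` in prometheus.yml.

```yaml
# rules/http_aggregates.yml (sketch; metric and label names are hypothetical)
groups:
  - name: http_aggregates
    interval: 1m
    rules:
      # Per-job request rate with every other label aggregated away.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Per-job 5xx ratio, a common input for dashboards and SLO alerts.
      - record: job:http_requests_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query the small `job:...` series instead of re-aggregating the raw metric on every refresh, which is where most of the query cost on a small box tends to go.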

The Alerts That Matter

Most teams over-alert. The rule: every alert that pages must require human action. If the response is 'ack and check again in 10 minutes,' it's a false positive — fix it.

SLO burn rate alerts on the top 3-5 user-facing endpoints.
Saturation on disk, memory, CPU at 85% sustained for 10 minutes.
Dependency unavailable on database / queue / cache for > 1 minute.
Job failure on critical scheduled jobs (backups, billing exports).

That's it. Four alert categories cover 95% of real incidents.
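
As a sketch, the categories above could translate into Prometheus alerting rules like these. Every metric, label, and job name here is an assumption: `http_requests_total` with a `code` label, node_exporter filesystem metrics, exporter jobs named postgres/redis/rabbitmq, and a hypothetical `backup_last_success_timestamp_seconds` gauge written by your backup job. The burn-rate thresholds assume a 99.9% SLO.

```yaml
# rules/paging.yml (sketch; all metric and job names are hypothetical)
groups:
  - name: paging
    rules:
      # SLO burn rate: 99.9% target, fast burn (14.4x) seen on both 1h and 5m windows.
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page

      # Saturation: disk above 85% for 10 minutes.
      - alert: DiskAlmostFull
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.85
        for: 10m
        labels:
          severity: page

      # Dependency unavailable for more than 1 minute.
      # Note: `up` reports whether the exporter was scraped, not the dependency itself.
      - alert: DependencyDown
        expr: up{job=~"postgres|redis|rabbitmq"} == 0
        for: 1m
        labels:
          severity: page

      # Critical scheduled job: no successful backup in 26 hours.
      - alert: BackupStale
        expr: time() - backup_last_success_timestamp_seconds > 26 * 3600
        labels:
          severity: page
```

The `severity: page` label is what Alertmanager routing would typically match on to decide between paging and a Slack message, which keeps the "every page requires human action" rule enforceable in config.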

What This Doesn't Replace

Honest about the gaps:

RUM (Real User Monitoring). Not in this stack. Use Cloudflare Web Analytics (free) or Plausible (cheap).
Synthetic monitoring. Use UptimeRobot ($7/mo) or Better Stack.
Auto-instrumentation across every language. Datadog's APM auto-instrumentation is genuinely better. We use OpenTelemetry SDKs in code we control.
Single vendor support. When something breaks, you debug it. That's the tradeoff.

The Bill

Item	Cost
Hetzner CX32 (compute)	$7.95/mo
Hetzner Storage Box (off-host backup, 1TB)	$3.79/mo
Cloudflare (DNS / TLS / DDoS, all free)	$0
UptimeRobot (synthetic)	$7/mo
Better Uptime / OnCall (paging)	$0-29/mo
Total	$18.74-47.74/mo

Compare that to a Datadog bill on the same workload: typically $1,500-5,000/mo. The difference is enough to pay for a meaningful amount of senior engineering time.

This is part of the DevOps Ninja cornerstone series. Honest critique welcome.