Observability Without Datadog: A $20/Month Stack
A complete production observability setup (metrics, logs, traces, alerting) running on a single $20/month VPS. Real config. Real retention. We've deployed this stack at multiple companies; it handles workloads that would typically cost $1,500-5,000/mo on Datadog.
The Stack
- Prometheus — metrics
- Loki — logs (Prometheus-style label model)
- Tempo — distributed traces
- Grafana — single pane for all three
- Alertmanager — alert routing to PagerDuty / Slack / email
- VictoriaMetrics (optional) — drop-in replacement for Prometheus that handles 10x the cardinality on the same hardware
The Hardware
Hetzner CX22 (2 vCPU / 4GB / 40GB SSD) at $4.59/mo — or a CX32 (4 vCPU / 8GB / 80GB) at $7.95/mo if you're handling real volume. Add a CPX31 ($14/mo) if you outgrow that. Total for a real production stack: $15-30/month.
The docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --web.enable-remote-write-receiver
    ports: ["9090:9090"]
  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki/loki.yml:/etc/loki/local-config.yaml
      - loki_data:/loki
    ports: ["3100:3100"]
  tempo:
    image: grafana/tempo:latest
    # Point Tempo at the mounted config explicitly.
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo/tempo.yml:/etc/tempo.yaml
      - tempo_data:/var/tempo
    ports: ["3200:3200", "4317:4317"]  # 3200 = HTTP API, 4317 = OTLP gRPC
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    ports: ["3000:3000"]
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=false
      - GF_SECURITY_ADMIN_PASSWORD=changeme  # placeholder; replace before deploying
  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports: ["9093:9093"]
volumes:
  prom_data:
  loki_data:
  tempo_data:
  grafana_data:
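The compose file mounts ./grafana/datasources.yml but the file itself isn't shown. A minimal sketch of what it might contain, assuming the compose service names resolve as hostnames on the default Compose network and each backend listens on its default HTTP port. The data source names are arbitrary.

```yaml
# grafana/datasources.yml (sketch; data source names are illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy            # Grafana's backend makes the request, not the browser
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
```

With this in place Grafana should come up with all three data sources already wired, so nothing has to be clicked together by hand after a rebuild.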
The Retention Strategy
The mistake most teams make: trying to keep raw metrics for a year. Don't. Use a tiered retention policy:
- 30 days — full resolution metrics in Prometheus. Covers 99% of debugging needs.
- 1 year — downsampled aggregates (5m / 1h buckets) in VictoriaMetrics or Mimir. Covers trend analysis.
- 7-14 days — full logs in Loki.
- 3 days — full traces in Tempo. Sample at 1-10% if traffic is high.
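The 30-day metrics tier is already set by the --storage.tsdb.retention.time flag in the compose file. The log and trace tiers map onto a few config keys. The fragments below are a sketch, not complete files: they assume Loki 3.x with filesystem storage and compactor-based retention, and Tempo's compactor block retention. The compactor working directory is an assumed path. Merge these into your existing loki.yml and tempo.yml rather than using them as-is.

```yaml
# loki/loki.yml (fragment): keep logs for 14 days
compactor:
  working_directory: /loki/compactor   # assumed path inside the loki_data volume
  retention_enabled: true
  delete_request_store: filesystem     # Loki 3.x expects this when retention is enabled
limits_config:
  retention_period: 336h               # 14 days
---
# tempo/tempo.yml (fragment): keep trace blocks for 3 days
compactor:
  compaction:
    block_retention: 72h
```

Because the images are pinned to latest, key names can shift between releases; check them against the docs for the version you actually run.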
The Cardinality Discipline
This is the #1 reason self-hosted observability fails. Every label combination is a unique series. user_id as a label = one series per user = your Prometheus dies. Rules of thumb:
- Max 100 unique values per label.
- Never label by user_id, request_id, or any unbounded identifier.
- Use recording rules to pre-aggregate hot queries.
- Audit cardinality monthly with topk(20, count by (__name__)({__name__=~".+"})).
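A recording rule evaluates an expensive query on a schedule and stores the result as a new, low-cardinality series. A minimal sketch, assuming a counter named http_requests_total with job and code labels (hypothetical names; substitute your own) and a rule_files entry in prometheus.yml that points at this file:

```yaml
# prometheus/rules/recording.yml (sketch; metric and label names are assumptions)
groups:
  - name: http_aggregates
    interval: 1m
    rules:
      # Per-job request rate; instance, path, and other labels are summed away.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Per-job 5xx error ratio, cheap to reuse in dashboards and alerts.
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query job:http_requests:rate5m directly instead of re-aggregating raw series on every refresh. The compose file above mounts only prometheus.yml, so a rules directory would need its own volume entry.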
The Alerts That Matter
Most teams over-alert. The rule: every alert that pages must require human action. If the response is 'ack and check again in 10 minutes,' it's a false positive — fix it.
- SLO burn rate alerts on the top 3-5 user-facing endpoints.
- Saturation on disk, memory, CPU at 85% sustained for 10 minutes.
- Dependency unavailable on database / queue / cache for > 1 minute.
- Job failure on critical scheduled jobs (backups, billing exports).
That's it. Four alert categories cover 95% of real incidents.
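Three of these categories can be sketched as Prometheus alerting rules. Everything named here is an assumption: the disk metrics require node_exporter (not part of the compose file), the job names presume one exporter per dependency, http_requests_total is a hypothetical counter, and the burn-rate rule assumes a 99.9% SLO with the commonly used 14.4x fast-burn threshold. Treat it as a starting point, not a definitive rule set.

```yaml
# prometheus/rules/alerts.yml (sketch; metric names, job names, and SLO are assumptions)
groups:
  - name: paging
    rules:
      # Saturation: root filesystem above 85% for 10 minutes (needs node_exporter).
      - alert: DiskSaturation
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"}
             / node_filesystem_size_bytes{mountpoint="/"}) > 0.85
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Disk on {{ $labels.instance }} above 85% for 10 minutes"
      # Dependency unavailable: exporter scrape failing for more than 1 minute.
      # `up` reflects the scrape target, so this is a proxy for the dependency itself.
      - alert: DependencyDown
        expr: up{job=~"postgres|redis|rabbitmq"} == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is unreachable"
      # SLO burn rate: error ratio above 14.4x the 0.1% budget on both 1h and 5m windows.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
```

The severity: page label is what Alertmanager routing would match on to decide which alerts reach PagerDuty and which only post to Slack.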
What This Doesn't Replace
Honest about the gaps:
- RUM (Real User Monitoring). Not in this stack. Use Cloudflare Web Analytics (free) or Plausible (cheap).
- Synthetic monitoring. Use UptimeRobot ($7/mo) or Better Stack.
- Auto-instrumentation across every language. Datadog's APM auto-instrumentation is genuinely better. We use OpenTelemetry SDKs in code we control.
- Single vendor support. When something breaks, you debug it. That's the tradeoff.
The Bill
| Item | Cost |
|---|---|
| Hetzner CX32 (compute) | $7.95/mo |
| Hetzner Storage Box (off-host backup, 1TB) | $3.79/mo |
| Cloudflare (DNS / TLS / DDoS, all free) | $0 |
| UptimeRobot (synthetic) | $7/mo |
| Better Uptime / OnCall (paging) | $0-29/mo |
| Total | $18.74-47.74/mo |
Compare that to a Datadog bill on the same workload: typically $1,500-5,000/mo. The savings fund senior engineering time, by a wide margin.
This is part of the DevOps Ninja cornerstone series. Honest critique welcome.