Operations

Monitoring with Prometheus and Grafana

First dashboards to watch

  • queue wait distribution
  • quorum-ack distribution
  • batch size and lane depth
  • profile switches
  • watch lag and dispatch behavior
  • large-value upload and hydrate paths when enabled

Single-node production alerts

Use refs/sandbox/prometheus/astra-single-node-alerts.yml as the starting rule set for:

  • Astra restart and OOM loops
  • datastore probe failures
  • K3s readiness failures
  • sustained queue-wait and quorum-ack inflation
  • disk and archive-path health regressions
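
The word "sustained" in the list above is what separates a real regression from a one-off spike. The sketch below illustrates that idea only; it is not taken from astra-single-node-alerts.yml. The function name, the 100 ms threshold, and the three-sample window are hypothetical values chosen for illustration. It is similar in spirit to the `for` clause of a Prometheus alerting rule, which requires a condition to hold across consecutive evaluations before the alert fires.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if at least `min_consecutive` consecutive samples exceed `threshold`.

    Hypothetical helper: the threshold and window are illustrative, not Astra defaults.
    """
    run = 0
    for value in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Synthetic queue-wait samples in milliseconds (made up for this example).
single_spike = [10, 12, 400, 11, 9, 10]
inflated = [10, 120, 130, 125, 140, 150]

spike_fires = sustained_breach(single_spike, threshold=100, min_consecutive=3)
inflation_fires = sustained_breach(inflated, threshold=100, min_consecutive=3)
```

With these inputs the single spike does not trip the check, while the inflated series does, which is the behavior you generally want from a queue-wait or quorum-ack inflation alert.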

What matters more than aggregate p99

  • queue wait vs quorum ack split
  • leader stability under follower degradation
  • watch lag under fanout load
  • LIST behavior under large cardinality
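
To make the first point concrete, here is a minimal sketch of why the queue-wait and quorum-ack split says more than an aggregate p99. It uses synthetic latencies, not Astra measurements, and the helper names `percentile` and `split_report` are hypothetical. Two workloads show the same total p99 even though one is bound by queueing and the other by quorum acknowledgement.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def split_report(queue_wait_ms, quorum_ack_ms, q=99):
    """Report the percentile of the total latency and of each component."""
    total = [w + a for w, a in zip(queue_wait_ms, quorum_ack_ms)]
    return {
        "total": percentile(total, q),
        "queue_wait": percentile(queue_wait_ms, q),
        "quorum_ack": percentile(quorum_ack_ms, q),
    }


# Synthetic data: identical totals, opposite causes.
queue_bound = split_report([50.0] * 100, [5.0] * 100)
ack_bound = split_report([5.0] * 100, [50.0] * 100)

for name, report in (("queue_bound", queue_bound), ("ack_bound", ack_bound)):
    print(name, report)
```

Both reports have the same "total" value, so an aggregate p99 panel cannot tell them apart. The component values point to different remedies: the first case is dominated by time spent waiting in the queue, the second by time spent waiting for quorum acknowledgement.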