Monitoring with Prometheus and Grafana
First dashboards to watch
- queue wait distribution
- quorum-ack distribution
- batch size and lane depth
- profile switches
- watch lag and dispatch behavior
- large-value upload and hydrate paths when enabled
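If the queue-wait and quorum-ack panels are built from Prometheus histogram buckets (an assumption; this page does not say how the metrics are exported), the quantile lines on those panels are estimates interpolated inside a bucket, not exact values. The sketch below follows the same linear-interpolation idea as PromQL's histogram_quantile. The bucket bounds and counts in EXAMPLE_BUCKETS are invented for illustration and are not Astra defaults.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, with math.inf as the final upper bound.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the overflow bucket: report the highest
                # finite bound, since no upper limit is known.
                return prev_bound
            if count == prev_count:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical queue-wait buckets (seconds, cumulative count).
EXAMPLE_BUCKETS = [
    (0.005, 50),
    (0.01, 80),
    (0.025, 95),
    (0.05, 99),
    (math.inf, 100),
]

if __name__ == "__main__":
    for q in (0.5, 0.9, 0.99):
        print(q, histogram_quantile(q, EXAMPLE_BUCKETS))
```

Because the estimate can never be finer than the bucket layout, a tail that sits in the overflow bucket is reported as the highest finite bound, which is one reason to read the full distribution panel rather than a single quantile line.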
Single-node production alerts
Use refs/sandbox/prometheus/astra-single-node-alerts.yml as the starting rule set for:
- Astra restart and OOM loops
- datastore probe failures
- K3s readiness failures
- sustained queue-wait and quorum-ack inflation
- disk and archive-path health regressions
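The word "sustained" in the list above matters: an alert on queue-wait or quorum-ack inflation should fire only when the condition holds across several consecutive evaluations, not on a single spike. The sketch below illustrates that idea in isolation. It is not taken from astra-single-node-alerts.yml; the function name, threshold, and sample values are hypothetical.

```python
def first_firing_index(samples, threshold, hold):
    """Return the index at which a sustained breach would first fire.

    The condition "sample > threshold" must be true for `hold` consecutive
    samples. Any sample at or below the threshold resets the streak.
    Returns None if the breach is never sustained long enough.
    """
    if hold < 1:
        raise ValueError("hold must be at least 1")
    streak = 0
    for index, value in enumerate(samples):
        streak = streak + 1 if value > threshold else 0
        if streak >= hold:
            return index
    return None


# Hypothetical p99 queue-wait readings in milliseconds, one per evaluation.
SPIKY = [10, 80, 12, 90, 11]
INFLATED = [10, 80, 12, 70, 75, 90, 85]

if __name__ == "__main__":
    print(first_firing_index(SPIKY, threshold=50, hold=3))
    print(first_firing_index(INFLATED, threshold=50, hold=3))
```

In a Prometheus rule file the same behavior is normally expressed with the `for` field on an alerting rule, so the hold duration lives next to the expression instead of in application code.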
What matters more than aggregate p99
- queue wait vs quorum ack split
- leader stability under follower degradation
- watch lag under fanout load
- LIST behavior under large cardinality
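To see why the queue wait vs quorum ack split carries more information than an aggregate p99, consider the minimal sketch below. It builds two synthetic workloads with identical end-to-end latency per request, one dominated by quorum ack and one by queue wait. All numbers are invented for illustration and are not measurements from Astra; the percentile helper uses the simple nearest-rank method.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100 avoids floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(queue_wait_ms, quorum_ack_ms):
    """Return p99 of each component and of their per-request sum."""
    totals = [q + a for q, a in zip(queue_wait_ms, quorum_ack_ms)]
    return {
        "queue_wait_p99": percentile(queue_wait_ms, 99),
        "quorum_ack_p99": percentile(quorum_ack_ms, 99),
        "total_p99": percentile(totals, 99),
    }


# Synthetic workload A: replication-bound (quorum ack dominates).
ACK_BOUND = summarize(
    queue_wait_ms=[2] * 100,
    quorum_ack_ms=[8 + (i % 10) for i in range(100)],
)

# Synthetic workload B: admission-bound (queue wait dominates).
QUEUE_BOUND = summarize(
    queue_wait_ms=[8 + (i % 10) for i in range(100)],
    quorum_ack_ms=[2] * 100,
)

if __name__ == "__main__":
    print("ack-bound:  ", ACK_BOUND)
    print("queue-bound:", QUEUE_BOUND)
```

Both workloads report the same total p99, yet they point at different bottlenecks, which is why the split deserves its own panel.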