Monitoring with Prometheus and Grafana

First dashboards to watch

queue wait distribution
quorum-ack distribution
batch size and lane depth
profile switches
watch lag and dispatch behavior
large-value upload and hydrate paths when enabled
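
As a rough illustration of how the first two dashboards might be fed, the sketch below defines Prometheus recording rules that precompute quantiles for queue wait and quorum ack. The metric names (`astra_queue_wait_seconds_bucket`, `astra_quorum_ack_seconds_bucket`), the group name, and the recorded series names are hypothetical placeholders, not confirmed Astra metrics; substitute whatever histograms your deployment actually exports.

```yaml
# Sketch only: every metric and rule name here is a hypothetical placeholder.
groups:
  - name: astra-dashboard-sketch
    interval: 30s
    rules:
      # Median and tail of time spent waiting in the queue.
      - record: astra:queue_wait_seconds:p50_5m
        expr: histogram_quantile(0.50, sum by (le) (rate(astra_queue_wait_seconds_bucket[5m])))
      - record: astra:queue_wait_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (le) (rate(astra_queue_wait_seconds_bucket[5m])))
      # Median and tail of time spent waiting for quorum acknowledgement.
      - record: astra:quorum_ack_seconds:p50_5m
        expr: histogram_quantile(0.50, sum by (le) (rate(astra_quorum_ack_seconds_bucket[5m])))
      - record: astra:quorum_ack_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (le) (rate(astra_quorum_ack_seconds_bucket[5m])))
```

Plotting the p50 and p99 series for each phase on the same Grafana panel gives a quick view of the distribution's spread; a heatmap over the raw buckets shows the full shape if more detail is needed.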

Single-node production alerts

Use refs/sandbox/prometheus/astra-single-node-alerts.yml as the starting rule set for:

Astra restart and OOM loops
datastore probe failures
K3s readiness failures
sustained queue-wait and quorum-ack inflation
disk and archive-path health regressions
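
For orientation, rules of this kind typically look like the sketch below. It is not the content of the referenced file. The thresholds, the `job="astra"` label, and the `astra_queue_wait_seconds_bucket` histogram are assumptions made for illustration; `process_start_time_seconds` is the standard process metric exposed by Prometheus client libraries, but whether Astra exports it is also an assumption.

```yaml
# Sketch only: thresholds, labels, and astra_* metric names are hypothetical.
groups:
  - name: astra-single-node-sketch
    rules:
      # A changing process start time means the process restarted.
      # OOM kills show up here as restarts; this rule does not tell them apart.
      - alert: AstraRestartLoop
        expr: changes(process_start_time_seconds{job="astra"}[15m]) > 2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Astra process start time changed more than twice in 15 minutes"

      # "Sustained" is expressed through the for: clause, so short spikes do not fire.
      - alert: AstraQueueWaitInflated
        expr: histogram_quantile(0.99, sum by (le) (rate(astra_queue_wait_seconds_bucket[5m]))) > 0.5
        for: 15m
        labels:
          severity: warn
        annotations:
          summary: "Astra p99 queue wait has stayed above 500 ms for 15 minutes"
```

Tune the `for:` durations and thresholds against a baseline from your own node before paging on them.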

What matters more than aggregate p99

queue wait vs quorum ack split
leader stability under follower degradation
watch lag under fanout load
LIST behavior under large cardinality
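
One way to make the queue wait vs quorum ack split visible is to record what fraction of the combined time is spent queueing. The sketch below assumes two hypothetical histograms, `astra_queue_wait_seconds` and `astra_quorum_ack_seconds`, and uses their `_sum` series; the names are placeholders, not confirmed Astra metrics.

```yaml
# Sketch only: the astra_* metric names are hypothetical placeholders.
groups:
  - name: astra-latency-split-sketch
    rules:
      # Share of (queue wait + quorum ack) time that is queue wait, over 5 minutes.
      # Near 1.0: requests are mostly waiting to be admitted.
      # Near 0.0: requests are mostly waiting on replication acknowledgement.
      - record: astra:queue_wait_share:ratio_rate5m
        expr: |
          sum(rate(astra_queue_wait_seconds_sum[5m]))
          /
          (
            sum(rate(astra_queue_wait_seconds_sum[5m]))
            + sum(rate(astra_quorum_ack_seconds_sum[5m]))
          )
```

Two runs with the same aggregate p99 can have very different values of this ratio, which is why the split is more useful for diagnosis than the aggregate alone.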