Troubleshooting
High queue wait, normal quorum ack
Likely cause:
- batching or dispatch backlog before Raft commit.
Actions:
- inspect queue-depth and queue-wait metrics,
- reduce burst pressure or retune batch limits,
- compare queue wait against quorum-ack telemetry instead of using p99 alone.
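The comparison in the last action can be sketched as a small helper that takes per-request timings and reports which stage dominates. This is only an illustration: the keys queue_wait_ms and quorum_ack_ms and the 2x ratio are assumptions for this sketch, not Astra metric names or recommended thresholds.

```python
from statistics import median


def dominant_stage(samples, ratio=2.0):
    """Return 'queue', 'quorum', or 'mixed' for a list of sample dicts.

    Each sample is assumed to carry the hypothetical keys 'queue_wait_ms'
    and 'quorum_ack_ms'. Medians are compared so that a handful of outliers
    does not decide the result on its own.
    """
    if not samples:
        raise ValueError("need at least one sample")
    queue = median(s["queue_wait_ms"] for s in samples)
    quorum = median(s["quorum_ack_ms"] for s in samples)
    if queue > ratio * quorum:
        return "queue"   # backlog builds before Raft commit
    if quorum > ratio * queue:
        return "quorum"  # replication or commit path is the slow stage
    return "mixed"       # neither stage clearly dominates
```

A result of "queue" matches the case described in this entry; "quorum" points at the next entry instead.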
High quorum ack
Likely cause:
- follower health, network replication, or commit-path pressure.
Actions:
- inspect Raft stage telemetry,
- confirm peer health and follower lag,
- verify disk or network pressure on the current leader.
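One way to make the follower-lag check concrete is to compare the leader's commit index with the highest index replicated on each follower. The sketch below assumes you can obtain those numbers from your Raft telemetry; the input shape and the max_lag default are hypothetical and should be replaced with values that fit your cluster.

```python
def lagging_followers(leader_commit_index, follower_match_index, max_lag=1000):
    """Return {follower: lag} for followers more than max_lag entries behind.

    follower_match_index maps a follower name to the highest log index known
    to be replicated on it. Both the input shape and the max_lag default are
    assumptions made for this sketch.
    """
    lagging = {}
    for name, match_index in follower_match_index.items():
        # Clamp at zero in case a follower report is newer than the leader sample.
        lag = max(0, leader_commit_index - match_index)
        if lag > max_lag:
            lagging[name] = lag
    return lagging
```

An empty result suggests the followers are keeping up, which shifts attention to disk or network pressure on the leader.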
Astra containers restart after control-plane expansion
Symptoms:
- Astra containers restart with exit=137
- docker inspect shows OOMKilled=true
- K3s loses readiness after installing add-ons or increasing control-plane churn
Likely cause:
- using the low-memory validation profile as a production baseline.
Actions:
- use the production single-node deploy path and keep --astra-container-memory-limit 2048M unless you have measured a lower safe ceiling,
- recreate Astra one node at a time when recovering an existing cluster,
- confirm restart counts stay flat during a soak window before proceeding.
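The soak-window check in the last action can be sketched as follows. It assumes you have already sampled each container's restart count several times over the window (for example from the RestartCount field reported by docker inspect); the container names and the sampling method are up to you and are not prescribed by Astra.

```python
def unstable_containers(restart_counts):
    """Return the sorted names of containers whose restart count changed.

    restart_counts maps a container name to a list of counts sampled over
    the soak window, oldest first. At least two samples per container are
    required to say anything about the trend.
    """
    unstable = []
    for name, counts in restart_counts.items():
        if len(counts) < 2:
            raise ValueError(f"{name}: need at least two samples")
        # Any change counts as unstable, including a reset after recreation.
        if len(set(counts)) != 1:
            unstable.append(name)
    return sorted(unstable)
```

An empty list means the counts stayed flat for every sampled container, which is the condition to meet before moving on to the next node.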
kubectl get crd fails with decoded message length too large
Likely cause:
- an older Astra build where follower-forwarded internal clients still used the default 4 MiB gRPC decode limit.
Actions:
- roll out a build that applies the configured gRPC message limits to forwarded internal clients as well as the public server handlers,
- as an emergency workaround only, pin K3s temporarily to one stable Astra endpoint to avoid follower-forwarded large LIST responses.
Pods fail after switching to stargz
Likely cause:
- stale cached image metadata from the previous snapshotter.
Actions:
- remove the affected image with crictl rmi <image>,
- restart the pod so containerd re-pulls it,
- verify the node runtime override still sets the intended snapshotter.
MetalLB is planned but svclb-* pods still appear
Likely cause:
- bundled ServiceLB is still enabled.
Actions:
- disable servicelb in the K3s config,
- restart K3s,
- confirm bundled svclb-* pods disappear before introducing MetalLB or another LB controller.
Migration parity mismatch
Actions:
- verify tenant-to-source mapping in astra-forge converge,
- validate snapshot checksums and manifest upload paths,
- compare post-import key counts before cutting traffic over.
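The key-count comparison in the last action can be sketched as a per-tenant diff. The sketch assumes you have collected key counts per tenant from both the source and the imported cluster by whatever means you already use; the dictionary shape is an assumption, and astra-forge is not invoked here.

```python
def key_count_mismatches(source_counts, imported_counts):
    """Return {tenant: (source, imported)} for every tenant whose counts differ.

    Both arguments map a tenant name to its key count. A tenant missing on
    either side is reported with None on that side.
    """
    mismatches = {}
    for tenant in sorted(set(source_counts) | set(imported_counts)):
        src = source_counts.get(tenant)
        dst = imported_counts.get(tenant)
        if src != dst:
            mismatches[tenant] = (src, dst)
    return mismatches
```

An empty result is a necessary condition for cutover, not a sufficient one: matching counts do not replace the checksum validation listed above.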