Operations

Troubleshooting

High queue wait, normal quorum ack

Likely cause:

batching or dispatch backlog before Raft commit.

Actions:

inspect queue-depth and queue-wait metrics,
reduce burst pressure or retune batch limits,
compare queue wait against quorum-ack telemetry instead of relying on p99 latency alone.
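
The comparison in the last action can be sketched as a small shell helper. The function name, the 50 ms default threshold, and passing both values as integer milliseconds are assumptions for illustration only; substitute the values your queue-wait and quorum-ack metrics actually report.

```shell
# classify_latency QUEUE_WAIT_MS QUORUM_ACK_MS [THRESHOLD_MS]
# Hypothetical helper: decides which stage dominates write latency.
# The body runs in a subshell so the variables do not leak into the caller.
classify_latency() (
  queue_wait_ms=$1
  quorum_ack_ms=$2
  threshold_ms=${3:-50}   # assumed default, not a documented Astra value

  if [ "$quorum_ack_ms" -gt "$threshold_ms" ]; then
    echo "quorum-bound"   # replication or commit path is slow
  elif [ "$queue_wait_ms" -gt "$threshold_ms" ]; then
    echo "queue-bound"    # backlog builds before Raft commit
  else
    echo "healthy"
  fi
)
```

For example, `classify_latency 120 8` prints `queue-bound`, which matches the case described in this entry, while `classify_latency 10 90` prints `quorum-bound`.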

High quorum ack

Likely cause:

follower health, network replication, or commit-path pressure.

Actions:

inspect Raft stage telemetry,
confirm peer health and follower lag,
verify disk or network pressure on the current leader.

Astra containers restart after control-plane expansion

Symptoms:

Astra containers restart with `exit=137`
`docker inspect` shows `OOMKilled=true`
K3s loses readiness after installing add-ons or increasing control-plane churn

Likely cause:

using the low-memory validation profile as a production baseline.

Actions:

use the production single-node deploy path and keep `--astra-container-memory-limit 2048M` unless you have measured a lower safe ceiling,
recreate Astra one node at a time when recovering an existing cluster,
confirm restart counts stay flat during a soak window before proceeding.
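
One way to check the soak window is to snapshot restart counts before and after it and compare the two. The sketch below relies on the `.Name`, `.RestartCount`, and `.State.OOMKilled` fields of `docker inspect`; the helper names, the container names, and the 30-minute window in the usage comment are assumptions, not values from the Astra tooling.

```shell
# snapshot_restarts CONTAINER...
# Prints one line per container: "<name> <restart count> <OOMKilled>".
snapshot_restarts() {
  docker inspect --format '{{.Name}} {{.RestartCount}} {{.State.OOMKilled}}' "$@"
}

# restarts_flat BEFORE AFTER
# Succeeds only if the two snapshots are identical and no container
# reports OOMKilled=true in the later snapshot.
restarts_flat() {
  [ "$1" = "$2" ] || return 1
  case "$2" in
    *" true"*) return 1 ;;
  esac
  return 0
}

# Usage (hypothetical container names and soak duration):
#   before=$(snapshot_restarts astra-1 astra-2 astra-3)
#   sleep 1800
#   after=$(snapshot_restarts astra-1 astra-2 astra-3)
#   restarts_flat "$before" "$after" && echo "soak passed"
```

Recreating a container resets its restart count, so take the first snapshot only after the node has been recreated.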

`kubectl get crd` fails with `decoded message length too large`

Likely cause:

an older Astra build where follower-forwarded internal clients still used the default 4 MiB gRPC decode limit.

Actions:

roll out a build that applies the configured gRPC message limits to forwarded internal clients as well as the public server handlers,
as an emergency workaround only, temporarily pin K3s to a single stable Astra endpoint so that large LIST responses do not take the follower-forwarded path.

Pods fail after switching to `stargz`

Likely cause:

stale cached image metadata from the previous snapshotter.

Actions:

remove the affected image with `crictl rmi <image>`,
restart the pod so containerd re-pulls it,
verify the node runtime override still sets the intended snapshotter.
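
The last action can be checked from the node with `crictl info`, which on containerd-based runtimes typically includes a `"snapshotter"` field in its JSON output. The helper below is a sketch under that assumption; the function name is hypothetical, and the exact JSON layout can differ between containerd versions.

```shell
# configured_snapshotter
# Reads `crictl info` JSON on stdin and prints the first snapshotter name found.
configured_snapshotter() {
  sed -n 's/.*"snapshotter"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage on the affected node (placeholders left as-is):
#   crictl rmi <image>
#   kubectl delete pod <pod> -n <namespace>   # assumes a controller recreates the pod
#   crictl info | configured_snapshotter      # expect the snapshotter you intended
```

If the printed value is not the snapshotter you intended, fix the node runtime override before re-pulling images again.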

MetalLB is planned but `svclb-*` pods still appear

Likely cause:

bundled ServiceLB is still enabled.

Actions:

disable `servicelb` in the K3s config,
restart K3s,
confirm bundled `svclb-*` pods disappear before introducing MetalLB or another LB controller.
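
The steps above can be sketched as follows. The config path and the `disable` key are the standard K3s ones; the `systemctl` unit name assumes a systemd-based server install, and `count_svclb_pods` is a hypothetical helper that counts leftover ServiceLB pods in `kubectl get pods -A --no-headers` output.

```shell
# /etc/rancher/k3s/config.yaml should contain:
#   disable:
#     - servicelb
#
# Then restart K3s (assumes a systemd-managed server node):
#   sudo systemctl restart k3s

# count_svclb_pods
# Reads `kubectl get pods -A --no-headers` on stdin (namespace in column 1,
# pod name in column 2) and prints how many svclb-* pods remain.
count_svclb_pods() {
  awk '$2 ~ /^svclb-/ { n++ } END { print n + 0 }'
}

# Usage:
#   kubectl get pods -A --no-headers | count_svclb_pods
```

Introduce MetalLB only once the count is `0`, so two controllers never compete for the same LoadBalancer services.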

Migration parity mismatch

Actions:

verify tenant-to-source mapping in `astra-forge converge`,
validate snapshot checksums and manifest upload paths,
compare post-import key counts before cutting traffic over.
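
The key-count comparison can be scripted once per-tenant counts have been exported from both sides. The sketch below assumes two plain-text files with one `<tenant> <key count>` pair per line; that file format and the `parity_diff` name are hypothetical and not part of `astra-forge`.

```shell
# parity_diff SOURCE_COUNTS TARGET_COUNTS
# Prints every tenant whose key count differs or that is missing on either
# side. Returns non-zero if any mismatch is found.
# Note: assumes SOURCE_COUNTS is non-empty.
parity_diff() {
  awk '
    NR == FNR { src[$1] = $2; next }          # first file: source counts
    {
      seen[$1] = 1
      if (!($1 in src) || src[$1] != $2) {    # unknown tenant or count differs
        print $1
        bad = 1
      }
    }
    END {
      for (t in src)
        if (!(t in seen)) {                   # tenant never imported
          print t
          bad = 1
        }
      exit bad + 0
    }
  ' "$1" "$2"
}

# Usage (hypothetical file names):
#   parity_diff source-counts.txt target-counts.txt || echo "do not cut over yet"
```

An empty result with a zero exit status means the counts agree; it does not replace the checksum and manifest checks listed above.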