Operations

Troubleshooting

High queue wait, normal quorum ack

Likely cause:

  • batching or dispatch backlog before Raft commit.

Actions:

  • inspect queue-depth and queue-wait metrics,
  • reduce burst pressure or retune batch limits,
  • compare queue wait against quorum-ack telemetry instead of relying on overall p99 latency alone, since p99 by itself does not show which stage is slow.
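
The comparison in the last action can be sketched as below. This is a minimal illustration, not Astra tooling: the function names, the sample lists, and the 2x ratio threshold are all assumptions, and the samples stand in for whatever your metrics backend exports for queue wait and quorum ack.

```python
from statistics import quantiles

def p99(samples):
    """Return the 99th percentile of a list of latency samples (ms)."""
    # quantiles(n=100) returns 99 cut points; index 98 is the p99 cut.
    return quantiles(samples, n=100, method="inclusive")[98]

def dominant_stage(queue_wait_ms, quorum_ack_ms, ratio=2.0):
    """Name the stage whose p99 dominates, or "mixed" if neither does.

    ratio is a hypothetical threshold chosen for illustration,
    not an Astra default.
    """
    q = p99(queue_wait_ms)
    a = p99(quorum_ack_ms)
    if q >= ratio * a:
        return "queue-wait"
    if a >= ratio * q:
        return "quorum-ack"
    return "mixed"
```

A result of "queue-wait" points at the batching or dispatch backlog described in this section; "quorum-ack" points at the follower, network, or commit-path checks in the next one.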

High quorum ack

Likely cause:

  • follower health, network replication, or commit-path pressure.

Actions:

  • inspect Raft stage telemetry,
  • confirm peer health and follower lag,
  • check for disk or network pressure on the current leader.

Astra containers restart after control-plane expansion

Symptoms:

  • Astra containers restart with exit=137 (128 + 9, meaning the process was killed with SIGKILL)
  • docker inspect shows OOMKilled=true
  • K3s loses readiness after installing add-ons or increasing control-plane churn

Likely cause:

  • using the low-memory validation profile as a production baseline.

Actions:

  • use the production single-node deploy path and keep --astra-container-memory-limit 2048M unless you have measured a lower safe ceiling,
  • recreate Astra one node at a time when recovering an existing cluster,
  • confirm restart counts stay flat during a soak window before proceeding.
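
One way to check the symptoms and the soak-window condition is to parse the JSON that docker inspect prints for a container. The sketch below assumes that output is captured as a string; the helper names and the container name used in the usage note are hypothetical. The fields it reads (State.OOMKilled, State.ExitCode, RestartCount) are standard docker inspect fields.

```python
import json

def oom_report(inspect_json):
    """Summarize OOM state from the JSON array printed by docker inspect."""
    report = []
    for container in json.loads(inspect_json):
        state = container.get("State", {})
        report.append({
            # docker inspect reports names with a leading slash.
            "name": container.get("Name", "").lstrip("/"),
            "oom_killed": bool(state.get("OOMKilled")),
            "exit_code": state.get("ExitCode"),
            "restart_count": container.get("RestartCount", 0),
        })
    return report

def restarts_flat(before, after):
    """True if no container's restart count rose between two reports."""
    baseline = {r["name"]: r["restart_count"] for r in before}
    return all(
        r["restart_count"] <= baseline.get(r["name"], 0) for r in after
    )
```

Take one report at the start of the soak window and one at the end, for example from a container named astra-1 (a placeholder), and proceed only if restarts_flat returns True and no entry has oom_killed set.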

kubectl get crd fails with decoded message length too large

Likely cause:

  • an older Astra build where follower-forwarded internal clients still used the default 4 MiB gRPC decode limit.

Actions:

  • roll out a build that applies the configured gRPC message limits to forwarded internal clients as well as the public server handlers,
  • as an emergency workaround only, pin K3s temporarily to one stable Astra endpoint to avoid follower-forwarded large LIST responses.

Pods fail after switching to stargz

Likely cause:

  • stale cached image metadata from the previous snapshotter.

Actions:

  • remove the affected image with crictl rmi <image>,
  • restart the pod so containerd re-pulls it,
  • verify the node runtime override still sets the intended snapshotter.

MetalLB is planned but svclb-* pods still appear

Likely cause:

  • bundled ServiceLB is still enabled.

Actions:

  • disable servicelb in the K3s config,
  • restart K3s,
  • confirm bundled svclb-* pods disappear before introducing MetalLB or another LB controller.
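
The final confirmation step can be scripted against the JSON form of a pod listing. This is a small sketch that assumes the output of kubectl get pods -A -o json has been captured as a string; the function name is hypothetical.

```python
import json

def leftover_svclb_pods(pods_json):
    """Return namespace/name for every pod whose name starts with svclb-."""
    items = json.loads(pods_json).get("items", [])
    return sorted(
        f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}'
        for pod in items
        if pod["metadata"]["name"].startswith("svclb-")
    )
```

An empty list after the K3s restart indicates that bundled ServiceLB is no longer scheduling pods and another LB controller can be introduced.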

Migration parity mismatch

Actions:

  • verify tenant-to-source mapping in astra-forge converge,
  • validate snapshot checksums and manifest upload paths,
  • compare post-import key counts before cutting traffic over.
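
The checksum and key-count checks can be sketched as follows. This is an illustration under stated assumptions: it assumes snapshots are checksummed with SHA-256 and that per-tenant key counts are available as plain dictionaries, neither of which is confirmed here. All names are hypothetical and this is not the astra-forge interface.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a snapshot payload (algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()

def parity_mismatches(source_counts, target_counts):
    """Return {tenant: (source, target)} for tenants whose key counts differ.

    A tenant missing on one side is reported with None for that side.
    """
    tenants = set(source_counts) | set(target_counts)
    return {
        tenant: (source_counts.get(tenant), target_counts.get(tenant))
        for tenant in sorted(tenants)
        if source_counts.get(tenant) != target_counts.get(tenant)
    }
```

An empty result from parity_mismatches, together with matching checksums, is the condition to look for before cutting traffic over. A tenant that appears on only one side usually points back to the tenant-to-source mapping.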