Observability Runbook
Use this runbook when telemetry signals indicate runtime degradation.
Signals to watch
- GET /agents/v1/metrics
- Dashboard -> Readiness -> Telemetry Snapshot
- GET /agents/v1/readiness
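To pull both HTTP signals from a script rather than the dashboard, a minimal sketch follows. The two paths are the ones listed in this runbook. The base URL, the bearer-token header, and the JSON response format are assumptions, so adjust them to match your deployment.

```python
import json
import urllib.request

# Endpoint paths come from this runbook; everything else is an assumption.
SIGNAL_PATHS = ("/agents/v1/metrics", "/agents/v1/readiness")


def build_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Build a GET request for one signal endpoint."""
    url = base_url.rstrip("/") + path
    # Assumption: the API accepts a bearer token; swap in your auth scheme.
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_signals(base_url: str, token: str, timeout: float = 10.0) -> dict:
    """Fetch each signal endpoint and return the parsed body keyed by path.

    Assumption: both endpoints return JSON. urlopen raises
    urllib.error.HTTPError on 4xx/5xx responses, so a failing endpoint
    surfaces as an exception rather than as an entry in the result.
    """
    results = {}
    for path in SIGNAL_PATHS:
        request = build_request(base_url, path, token)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            results[path] = json.loads(response.read().decode("utf-8"))
    return results
```

Calling fetch_signals("https://example.test", token) with a real host and credential returns one parsed payload per path.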
Alert guidance
Auth failures > 5
- Inspect caller identity/scopes with GET /agents/v1/auth/whoami.
- Rotate/revoke affected credentials.
Rate-limit denials > 10
- Add backoff/jitter to callers.
- Tune PLUGIN_AGENTS_RATE_LIMIT_* only when sustained demand requires it.
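One common way to add backoff and jitter on the caller side is exponential backoff with full jitter, sketched below. The RateLimited exception is hypothetical and stands in for however your client reports a rate-limit denial (for example an HTTP 429, which is itself an assumption here). The base delay, cap, and attempt count are illustrative defaults, not recommended values.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by the caller's client on a rate-limit denial."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, sleeping with jittered backoff after each rate-limit denial."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts - 1):
        try:
            return fn()
        except RateLimited:
            sleep(backoff_delay(attempt))
    # Final attempt: let RateLimited propagate to the caller.
    return fn()
```

Randomizing the whole delay, rather than adding a small random offset, spreads retries from many callers so they do not hit the limiter again in lockstep.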
Repeated 5xx responses
- Check Craft logs and readiness diagnostics.
- Escalate with request IDs and sample payloads.
Queue depth > 50 or DLQ > 0
- Replay DLQ events.
- Verify queue workers and webhook receiver availability.
Sync-state lag > 300s
- Inspect lag rows in /agents/v1/sync-state/lag.
- Restore checkpoint writes in /agents/v1/sync-state/checkpoint.
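The numeric thresholds in the alert guidance above can be checked in one pass, as in the sketch below. The limits are the ones stated in this runbook. The snapshot key names are hypothetical, since the metrics payload format is not described here, so map them to whatever fields your telemetry exposes. Repeated 5xx responses are left out because the runbook gives no numeric limit for them.

```python
from typing import Dict, List

# Limits are taken from this runbook's alert guidance.
# Key names are hypothetical placeholders for the real metric fields.
THRESHOLDS = {
    "auth_failures": 5,
    "rate_limit_denials": 10,
    "queue_depth": 50,
    "dlq_count": 0,
    "sync_lag_seconds": 300,
}


def breached_alerts(snapshot: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose value exceeds its limit.

    Metrics missing from the snapshot are treated as 0, so they never
    breach; the result follows the order of THRESHOLDS.
    """
    return [name for name, limit in THRESHOLDS.items()
            if snapshot.get(name, 0) > limit]


if __name__ == "__main__":
    sample = {"auth_failures": 6, "queue_depth": 50, "dlq_count": 1}
    print(breached_alerts(sample))  # ['auth_failures', 'dlq_count']
```

A queue depth of exactly 50 does not trigger, because the guidance uses strict "greater than" comparisons.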