Skip to content

Observability Runbook

Use this runbook when telemetry signals indicate runtime degradation.

Signals to watch

  • GET /agents/v1/metrics
  • Dashboard -> Readiness -> Telemetry Snapshot
  • GET /agents/v1/readiness

Alert guidance

Auth failures > 5

  • Inspect caller identity/scopes with GET /agents/v1/auth/whoami.
  • Rotate/revoke affected credentials.

Rate-limit denials > 10

  • Add backoff/jitter to callers.
  • Tune PLUGIN_AGENTS_RATE_LIMIT_* only when sustained demand requires it.

Repeated 5xx responses

  • Check Craft logs and readiness diagnostics.
  • Escalate with request IDs and sample payloads.

Queue depth > 50 or DLQ > 0

  • Replay DLQ events.
  • Verify queue workers and webhook receiver availability.

Sync-state lag > 300s

  • Inspect lag rows in /agents/v1/sync-state/lag.
  • Restore checkpoint writes in /agents/v1/sync-state/checkpoint.

Agents plugin docs for Craft CMS