Incident Response Playbook
⚠️ Operators only — for the oncall responder. This is not a page for a first-time RP integrator.
Incident classification
| sev | Definition | Response time |
|---|---|---|
| sev1 | api.1pass.dev/up returns 5xx/404, all OAuth token issuance failing, JWKS unresponsive — every RP login is down | immediate |
| sev2 | Some endpoints 5xx, p95 latency sustained > 2s, webhook delivery failing in bulk | within 1 hour |
| sev3 | DLQ accumulation (a single RP), merge race condition, single-user incident | within 24 hours |
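As a rough illustration of how the table could be applied in tooling, the sketch below maps a few observed signals to a severity. The `Signals` struct and its field names are hypothetical and do not come from the codebase. It also assumes that any single sev1 condition is enough to classify as sev1, and that anything not matching sev1 or sev2 falls through to sev3. The "sustained" part of the latency rule is not modeled.

```ruby
# Hypothetical signal bundle; field names are illustrative only.
Signals = Struct.new(
  :health_status,          # HTTP status returned by /up
  :token_issuance_failing, # true when all OAuth token issuance fails
  :jwks_unresponsive,      # true when the JWKS endpoint does not answer
  :partial_5xx,            # true when only some endpoints return 5xx
  :p95_latency_seconds,    # p95 latency in seconds
  :bulk_webhook_failure,   # true when webhook delivery fails in bulk
  keyword_init: true
)

# Returns :sev1, :sev2, or :sev3 following the classification table.
# Assumption: any one sev1 signal is sufficient for sev1.
def classify_severity(signals)
  health_down = signals.health_status == 404 ||
                (500..599).cover?(signals.health_status)
  return :sev1 if health_down || signals.token_issuance_failing || signals.jwks_unresponsive

  slow = signals.p95_latency_seconds.to_f > 2.0
  return :sev2 if signals.partial_5xx || slow || signals.bulk_webhook_failure

  :sev3
end
```

A responder would still confirm the result by hand; the sketch only encodes the thresholds written in the table.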
First-line triage (common to all sev levels)
# 1. health
curl -i https://api.1pass.dev/up
# 2. Render service status
curl -s "https://api.render.com/v1/services/<LOGI_WEB_SERVICE_ID>" \
-H "Authorization: Bearer $RENDER_API_KEY" | jq '.serviceDetails.suspended, .status'
# 3. last 5 deploys
curl -s "https://api.render.com/v1/services/<LOGI_WEB_SERVICE_ID>/deploys?limit=5" \
-H "Authorization: Bearer $RENDER_API_KEY" | jq '.[].deploy | {id, status, commit: .commit.id, createdAt}'
If you suspect a regression, roll back to the last green deploy immediately, then investigate the cause. See Deploy Runbook §Rollback.
Webhook failures / DLQ
The DLQ is not a separate table but the webhook_outbox_entries.dlq_at state (see app/jobs/logi/webhooks/delivery_job.rb). 5xx/408/429 are retried; any other 4xx goes straight to the DLQ.
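The retry rule above can be sketched as a small pure function. This is not the code in delivery_job.rb; the function name and the returned symbols are hypothetical, and treating 2xx as delivered is an assumption that the text does not state explicitly.

```ruby
# Statuses outside the 5xx range that are still retried, per the rule above.
RETRYABLE_4XX = [408, 429].freeze

# Maps an HTTP status from the RP endpoint to a delivery outcome.
# Assumption: 2xx counts as delivered; 1xx/3xx are out of scope here.
def delivery_outcome(status)
  return :delivered if (200..299).cover?(status)
  return :retry if (500..599).cover?(status) || RETRYABLE_4XX.include?(status)
  return :dlq if (400..499).cover?(status)

  :unknown
end
```

Under this rule a 409 maps to :dlq, which is the case the idempotency item under Common causes describes.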
Monitoring
Query the DB directly (read-only) with the Render MCP query_render_postgres:
SELECT oauth_application_id, COUNT(*) AS dead_count
FROM webhook_outbox_entries
WHERE dlq_at IS NOT NULL AND delivered_at IS NULL
GROUP BY oauth_application_id
ORDER BY dead_count DESC;
Or use the admin API: GET /api/v1/admin/webhook_outbox?status=dead (requires an admin token).
Reprocessing
Single retry (admin UI recommended; requires step-up authentication):
POST /api/v1/admin/webhook_outbox/:id/retry
body: { "action_request_nonce": "..." }
This resets dlq_at, next_retry_at, and attempts, and sets enqueued_at = now. The next dispatcher pass picks the entry up immediately.
There is no bulk-retry rake task yet. A console one-liner:
# bin/rails runner -e production '...'
WebhookOutboxEntry.where(oauth_application_id: APP_ID).dead.find_each do |e|
e.update!(dlq_at: nil, next_retry_at: nil, attempts: 0, last_error: nil, enqueued_at: Time.current)
end
TODO: confirm with ops. Bulk reprocessing through the admin UI alone is inefficient. Consider adding a webhooks:replay_dlq[oauth_app_id] rake task.
Common causes
- RP webhook endpoint 5xx (an RP-side incident — escalate to the RP oncall).
- RP signature verification failure: the kid is not in the RP's cache. Check the grace period in webhook key rotation.
- Idempotency conflict: on a retry with the same idempotency_key, the RP returns 409. This is correct behavior, so it should be settled as delivered_at rather than ending up in the DLQ.
Merge race / data consistency
The identity_links table is the source of truth for user merges. RPs such as EB track links via LogiIdentityLink rows.
Force-invalidate the canonical resolution cache
When a merged user cannot see their own data, run one of the following in the Rails console:
# Full invalidation (the heavy option)
Rails.cache.delete_matched("user:canonical*")
# A specific user only (same pattern as merge_service)
Rails.cache.delete_matched("user:canonical*:#{user.id}*")
(Same key pattern as app/services/logi/identity/merge_service.rb:245.)
Caution when flipping ENFORCE_CANONICAL_RESOLUTION
⚠️ Before enabling this environment variable in production, confirm that every RP has received its LogiIdentityLink rows (run webhooks:backfill_existing_links_to_rp[app_id] first). If you flip it before the data is mirrored, merged users will fail lookups on the RP side.
Backfill rake task:
ssh -o StrictHostKeyChecking=no <LOGI_WEB_SERVICE_ID>@<RENDER_SSH_HOST> \
"cd /opt/render/project/src/server && \
/opt/render/project/.gems/bin/bundle exec rails 'webhooks:backfill_existing_links_to_rp[<APP_ID>]' RAILS_ENV=production"
The task is idempotent: rows already sent are skipped.
JWKS / key rotation incident
kid mismatch 401
When an RP suddenly cannot find a kid:
- Confirm the JWKS endpoint responds normally:
  curl -s https://api.1pass.dev/.well-known/jwks.json | jq '.keys[].kid'
- Check that the current active kid is included in the response.
- RP-side JWKS cache TTL: ask the RP to force a refresh.
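The second check in the list above can be scripted. The sketch below is a minimal example that takes the raw JWKS response body and reports whether a given kid is published. The helper names are hypothetical, and it assumes the standard JWKS shape (a top-level "keys" array whose entries carry a "kid" member).

```ruby
require "json"

# Returns the list of kid values found in a JWKS JSON document.
def jwks_kids(jwks_json)
  keys = JSON.parse(jwks_json).fetch("keys", [])
  keys.map { |key| key["kid"] }.compact
end

# True when the given kid appears in the JWKS document.
def kid_published?(jwks_json, kid)
  jwks_kids(jwks_json).include?(kid)
end

# Example with an inline document; in practice the body would come from
# the curl call shown in the checklist.
sample = '{"keys":[{"kty":"RSA","kid":"key-2024-01"},{"kty":"RSA","kid":"key-2024-02"}]}'
puts kid_published?(sample, "key-2024-02")
```

If the active kid is present here but the RP still returns 401, the RP-side cache is the more likely cause.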
Suspected webhook signing key exposure
Run the webhooks:compromise[app_id,kid] rake task immediately. Within a single atomic transaction it revokes the key, issues a new one, re-signs undelivered outbox entries, and sends the webhook_key.compromised event. For the full procedure, see webhook key rotation §Emergency rotation (compromise).
Escalation
| Tier | Target |
|---|---|
| 1st | oncall engineer (internal contacts in a separate doc) |
| 2nd | project owner (internal contacts in a separate doc) |
| External | status page update (sev1 only) |
For a sev1, escalate to the first-tier contact within 5 minutes; if there is no response, escalate to the second-tier contact after 15 minutes. Post external notices only when RP integrators are directly affected (for example, api.1pass.dev down for more than 30 minutes).
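The sev1 timeline can be expressed as a small decision helper, for example for a paging bot. The function and symbol names below are hypothetical. It assumes the 15 minutes are counted from incident start, which the playbook does not specify, and it leaves the judgment of whether RP integrators are directly affected to the caller.

```ruby
# Returns who should have been contacted by now for a sev1.
# Assumption: minutes_elapsed is measured from incident start.
def sev1_escalation_targets(minutes_elapsed:, first_tier_acked:, rp_integrators_affected: false)
  # The first tier is always contacted (within the first 5 minutes).
  targets = [:first_tier]

  # The second tier is added only if the first tier has not responded by 15 minutes.
  targets << :second_tier if minutes_elapsed >= 15 && !first_tier_acked

  # External notice only when RP integrators are directly affected.
  targets << :status_page if rp_integrators_affected

  targets
end

p sev1_escalation_targets(minutes_elapsed: 20, first_tier_acked: false)
```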
TODO: confirm with ops — the status page URL/tool is undecided. Need a decision on whether to adopt PagerDuty/Statuspage.io.