Incident Response Playbook

⚠️ Operators only. This page is for the oncall responder, not for first-time RP integrators.

Incident classification

sev	Definition	Response time
sev1	`api.1pass.dev/up` returns 5xx/404, all OAuth token issuance failing, JWKS unresponsive — every RP login is down	immediate
sev2	Some endpoints 5xx, p95 latency sustained > 2s, webhook delivery failing in bulk	within 1 hour
sev3	DLQ accumulation (a single RP), merge race condition, single-user incident	within 24 hours

First-line triage (common to all sev levels)

bash

# 1. health
curl -i https://api.1pass.dev/up

# 2. Render service status
curl -s "https://api.render.com/v1/services/<LOGI_WEB_SERVICE_ID>" \
  -H "Authorization: Bearer $RENDER_API_KEY" | jq '.serviceDetails.suspended, .status'

# 3. last 5 deploys
curl -s "https://api.render.com/v1/services/<LOGI_WEB_SERVICE_ID>/deploys?limit=5" \
  -H "Authorization: Bearer $RENDER_API_KEY" | jq '.[].deploy | {id, status, commit: .commit.id, createdAt}'

If you suspect a regression: roll back to the last green deploy immediately, then investigate the cause. See Deploy Runbook §Rollback.

Webhook failures / DLQ

The DLQ is not a separate table. An entry is in the DLQ when webhook_outbox_entries.dlq_at is set (see app/jobs/logi/webhooks/delivery_job.rb). Responses with 5xx, 408, or 429 are retried; any other 4xx sends the entry straight to the DLQ.
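The retry rule above can be sketched as a small classifier. This is a minimal illustration, not the code in delivery_job.rb: the method name and the returned symbols are hypothetical, and treating 2xx as delivered and leaving 1xx/3xx unspecified are assumptions, since the rule only covers 4xx and 5xx.

```ruby
# Sketch of the documented retry rule; not the actual delivery job.
RETRYABLE_4XX = [408, 429].freeze

# Maps an HTTP status returned by the RP endpoint to an outcome.
def classify_webhook_response(status)
  case status
  when 200..299 then :delivered            # assumption: any 2xx counts as delivered
  when 500..599, *RETRYABLE_4XX then :retry
  when 400..499 then :dlq                  # every other 4xx goes straight to the DLQ
  else :unspecified                        # 1xx/3xx are not covered by the documented rule
  end
end

[200, 404, 408, 429, 503].each do |status|
  puts "#{status} -> #{classify_webhook_response(status)}"
end
```

Under this rule a 409 is classified as :dlq, which is the case the idempotency item under Common causes calls out.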

Monitoring

Query the DB directly (read-only) with the Render MCP tool query_render_postgres:

sql

SELECT oauth_application_id, COUNT(*) AS dead_count
FROM webhook_outbox_entries
WHERE dlq_at IS NOT NULL AND delivered_at IS NULL
GROUP BY oauth_application_id
ORDER BY dead_count DESC;

Or the admin API: GET /api/v1/admin/webhook_outbox?status=dead (requires an admin token).

Reprocessing

Single retry (admin UI recommended; requires step-up authentication): POST /api/v1/admin/webhook_outbox/:id/retry with body { "action_request_nonce": "..." }. This resets dlq_at, next_retry_at, and attempts, and sets enqueued_at = now, so the next dispatcher pass picks the entry up immediately.
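If the admin UI is unavailable, the single-retry call can be built by hand. The sketch below only constructs the request and does not send it. The path and body field come from the description above; the api.1pass.dev host for the admin API, the bearer-style Authorization header, and the helper name are assumptions. How the nonce is obtained from the step-up flow is not covered here.

```ruby
require "json"
require "net/http"
require "uri"

ADMIN_BASE = "https://api.1pass.dev" # assumption: admin API shares the public API host

# Builds (but does not send) the single-retry request for one outbox entry.
def build_retry_request(entry_id, nonce:, admin_token:)
  uri = URI.join(ADMIN_BASE, "/api/v1/admin/webhook_outbox/#{entry_id}/retry")
  req = Net::HTTP::Post.new(uri)
  req["Authorization"] = "Bearer #{admin_token}" # assumption: bearer scheme
  req["Content-Type"] = "application/json"
  req.body = JSON.generate(action_request_nonce: nonce)
  req
end

req = build_retry_request(42, nonce: "example-nonce", admin_token: "example-token")
puts "#{req.method} #{req.path}"
```

To send it, pass the request to Net::HTTP.start(req.uri.hostname, req.uri.port, use_ssl: true) { |http| http.request(req) }.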

There is no bulk-retry rake task yet. Until one exists, use a console snippet:

ruby

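# Run with: bin/rails runner -e production '...'
# APP_ID is a placeholder for the affected RP's oauth_application_id.
WebhookOutboxEntry.where(oauth_application_id: APP_ID).dead.find_each do |e|
  # Same reset as the single-retry endpoint, plus clearing last_error.
  e.update!(dlq_at: nil, next_retry_at: nil, attempts: 0, last_error: nil, enqueued_at: Time.current)
end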

TODO: confirm with ops — bulk reprocessing through the admin UI alone is inefficient. Consider adding a webhooks:replay_dlq[oauth_app_id] rake.

Common causes

RP webhook endpoint 5xx (an RP-side incident — escalate to the RP oncall).
RP signature verification failure — the kid is not in the RP's cache. Check the grace period in webhook key rotation.
Idempotency conflict: on a retry with the same idempotency_key, the RP returns 409. This is correct behavior on the RP side, so the entry should be marked as delivered (delivered_at set) rather than sent to the DLQ.

Merge race / data consistency

The identity_links table is the source of truth for user merges. RPs such as EB track links via LogiIdentityLink rows.

Force-invalidate the canonical resolution cache

When a merged user cannot see their own data, run one of the following in the Rails console:

ruby

# Full invalidation (the heavy option)
Rails.cache.delete_matched("user:canonical*")

# A specific user only (same pattern as merge_service)
Rails.cache.delete_matched("user:canonical*:#{user.id}*")

(Same key pattern as app/services/logi/identity/merge_service.rb:245.)
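One thing to keep in mind with the per-user pattern: the trailing wildcard also matches longer ids that share the same prefix. The sketch below illustrates this with made-up cache keys (the real key layout lives in merge_service.rb and is not reproduced here). It assumes the cache store treats the string as a glob, and uses File.fnmatch as a stand-in for that matching.

```ruby
# Hypothetical cache keys, for illustration only.
SAMPLE_KEYS = [
  "user:canonical:v1:12",
  "user:canonical:v1:12:profile",
  "user:canonical:v1:123",
  "user:canonical:v1:7",
].freeze

# Returns the keys that a glob-style pattern would select.
def keys_matching(pattern, keys)
  keys.select { |key| File.fnmatch(pattern, key) }
end

user_id = 12
puts keys_matching("user:canonical*:#{user_id}*", SAMPLE_KEYS)
# prints:
# user:canonical:v1:12
# user:canonical:v1:12:profile
# user:canonical:v1:123
```

The extra match for user 123 only costs a cache miss, so the pattern is safe, just broader than the id suggests.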

Caution when flipping `ENFORCE_CANONICAL_RESOLUTION`

⚠️ Before enabling this environment variable in production, confirm that every RP has received its LogiIdentityLink rows (run webhooks:backfill_existing_links_to_rp[app_id] first). If you flip it before the data is mirrored, merged users will fail lookups on the RP side.

Backfill rake task:

bash

ssh -o StrictHostKeyChecking=no <LOGI_WEB_SERVICE_ID>@<RENDER_SSH_HOST> \
  "cd /opt/render/project/src/server && \
   /opt/render/project/.gems/bin/bundle exec rails 'webhooks:backfill_existing_links_to_rp[<APP_ID>]' RAILS_ENV=production"

The task is idempotent: rows already sent are skipped.

JWKS / key rotation incident

kid mismatch 401

When an RP suddenly cannot find a kid:

Confirm the JWKS endpoint responds normally: curl -s https://api.1pass.dev/.well-known/jwks.json | jq '.keys[].kid'
Check that the current active kid is included in the response.
Check the RP-side JWKS cache TTL and ask the RP to force a refresh.
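The second check above can be scripted. This is a minimal sketch: the helper name is hypothetical and the sample document uses made-up kids in place of the live response.

```ruby
require "json"

# Returns true when the JWKS document lists a key with the given kid.
def jwks_includes_kid?(jwks_json, kid)
  keys = JSON.parse(jwks_json).fetch("keys", [])
  keys.any? { |key| key["kid"] == kid }
end

# Sample JWKS with made-up kids, standing in for the live endpoint.
SAMPLE_JWKS = '{"keys":[{"kty":"RSA","kid":"2024-05-a"},{"kty":"RSA","kid":"2024-06-b"}]}'

puts jwks_includes_kid?(SAMPLE_JWKS, "2024-06-b") # prints true
puts jwks_includes_kid?(SAMPLE_JWKS, "2023-01-z") # prints false
```

Against production, the JSON string could come from Net::HTTP.get(URI("https://api.1pass.dev/.well-known/jwks.json")) after requiring net/http, with the kid taken from the header of the token the RP rejected.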

Suspected webhook signing key exposure

Run the webhooks:compromise[app_id,kid] rake task immediately. Within a single atomic transaction it revokes the key, issues a new one, re-signs undelivered outbox entries, and sends the webhook_key.compromised event. For the full procedure, see webhook key rotation §Emergency rotation (compromise).

Escalation

Tier	Target
1st	oncall engineer (internal contacts in a separate doc)
2nd	project owner (internal contacts in a separate doc)
External	status page update (sev1 only)

For a sev1, escalate to the first-tier contact within 5 minutes. If there is no response, escalate to the second-tier contact after 15 minutes. Post external notices only when RP integrators are directly affected (for example, api.1pass.dev down for more than 30 minutes).
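The sev1 timeline can be written down as a small decision helper, for example to drive a reminder bot. This is a sketch under stated assumptions: the method and symbol names are hypothetical, all timers are measured from incident detection (the policy does not say which clock the 15 minutes runs on), and the 30-minute threshold is taken from the example given for external notices.

```ruby
# Thresholds in minutes since the incident was detected (assumed clock).
SECOND_TIER_AFTER = 15
EXTERNAL_NOTICE_AFTER = 30

# Returns the escalation actions that are due for a sev1 at this point.
def sev1_actions_due(minutes_elapsed:, first_tier_responded:, rp_facing_outage:)
  actions = [:page_first_tier] # always due: first tier is contacted within 5 minutes
  if !first_tier_responded && minutes_elapsed >= SECOND_TIER_AFTER
    actions << :page_second_tier
  end
  if rp_facing_outage && minutes_elapsed > EXTERNAL_NOTICE_AFTER
    actions << :post_external_notice
  end
  actions
end

p sev1_actions_due(minutes_elapsed: 20, first_tier_responded: false, rp_facing_outage: false)
# prints [:page_first_tier, :page_second_tier]
```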

TODO: confirm with ops — the status page URL/tool is undecided. Need a decision on whether to adopt PagerDuty/Statuspage.io.
