Incident Mailbox
12 alerts
Last triage 20 min ago
Active Alerts
PagerDuty Alerts
alerts@pagerduty.com
Feb 18
Trigger: p95 latency crossed 2.8s for 7 minutes in us-east-1.
Current impact:
- 14% of requests timing out
- Retries up 3.2x
- Cart updates delayed for mobile users
Primary runbook and dashboards are attached. Please acknowledge within 5 minutes.
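As a rough illustration of the trigger condition (not PagerDuty's actual evaluation), a minimal Python sketch that fires when the p95 of a rolling 7-minute latency window exceeds 2.8s; the sample feed and the minimum-sample guard are assumptions:

```python
from collections import deque

THRESHOLD_S = 2.8   # p95 ceiling from the alert
WINDOW_S = 7 * 60   # 7-minute evaluation window

# (timestamp, latency_seconds) samples; the feed itself is hypothetical
window: deque = deque()

def observe(ts: float, latency_s: float) -> bool:
    """Record one request latency; return True if the alert should fire."""
    window.append((ts, latency_s))
    # Drop samples that have aged out of the evaluation window.
    while window and window[0][0] < ts - WINDOW_S:
        window.popleft()
    latencies = sorted(s for _, s in window)
    if len(latencies) < 20:   # too few samples to trust a p95 (assumed guard)
        return False
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return p95 > THRESHOLD_S
```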
Kim from SRE
kim.sre@opsbridge.io
Quick update from SRE:
- Increased edge cache TTL for product metadata
- Warmed top 5,000 keys
- Error rate dropped from 5.6% to 1.2%
Continuing to monitor for the next 30 minutes before full resolution call.
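For context on the mitigation, a minimal cache-warming sketch using redis-py; the endpoint, TTL value, and the `fetch` loader are placeholders, since the update doesn't specify them:

```python
import redis

r = redis.Redis(host="cache.internal", port=6379)  # placeholder endpoint

NEW_TTL_S = 3600  # illustrative TTL; the actual value isn't in the update

def warm(keys, fetch) -> int:
    """Re-populate the edge cache for the given keys with the longer TTL.

    `fetch` is a hypothetical callable that loads product metadata from
    the origin; `keys` would be the top-5,000 list by traffic.
    """
    warmed = 0
    for key in keys:
        value = fetch(key)
        if value is not None:
            r.set(key, value, ex=NEW_TTL_S)  # ex= sets expiry in seconds
            warmed += 1
    return warmed
```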
CloudWatch Monitor
monitor@aws.amazon.com
CloudWatch detected sustained CPU > 92% on payments-worker-04.
Timeline:
- Spike began at 08:41 UTC
- Queue depth currently 11,238 messages
- No packet loss observed
Suggested action: drain node and shift traffic to warm standby.
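One way the suggested drain could look with boto3, assuming payments-worker-04 runs in an Auto Scaling group; the instance ID and group name are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Move the hot node to standby so in-flight work can drain; with
# ShouldDecrementDesiredCapacity=False the group brings a replacement
# into service, approximating the "warm standby" handoff.
autoscaling.enter_standby(
    InstanceIds=["i-0abc123def456"],          # placeholder instance ID
    AutoScalingGroupName="payments-workers",  # assumed ASG name
    ShouldDecrementDesiredCapacity=False,
)
```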
Security Response Bot
soc-bot@opsbridge.io
SOC bot blocked 1,842 suspicious auth attempts in 12 minutes.
Notes:
- Source ASN matches previously flagged botnet
- No successful account takeover observed
- Temporary WAF rule auto-deployed
Please review the incident card for rule expiry timing.
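A toy per-ASN rate check in the spirit of the auto-deployed rule; the attempt ceiling and window handling are assumptions, not the SOC bot's actual WAF logic:

```python
import time
from collections import defaultdict, deque

WINDOW_S = 12 * 60   # the 12-minute burst window from the note
MAX_ATTEMPTS = 100   # illustrative per-ASN ceiling; the real rule lives in WAF

attempts = defaultdict(deque)

def should_block(asn: int, now: float = None) -> bool:
    """Count auth attempts per source ASN and flag bursts over the ceiling."""
    now = time.time() if now is None else now
    q = attempts[asn]
    q.append(now)
    # Evict attempts older than the window before counting.
    while q and q[0] < now - WINDOW_S:
        q.popleft()
    return len(q) > MAX_ATTEMPTS
```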
On-Call Scheduler
schedule@opsbridge.io
Good morning team,
Handoff summary:
- Active: checkout latency (SEV-1) with mitigation in progress
- Monitoring: auth burst pattern, no customer impact
- Planned: rotate Redis credentials at 14:00 UTC
Escalation tree and backup contacts attached.
Ava Patel
ava.patel@opsbridge.io
Kim,
Please confirm whether we can safely open a rollback window at 09:45 UTC for checkout-service. I need your go/no-go note before I notify leadership.
Thanks,
Ava
Feb 17
Incident RES-4821 closed.
Root cause:
- Memory leak in query expansion worker after dependency update.
Fix:
- Rolled back worker image to 2.9.14
- Added guardrail alert on heap delta > 12%
Postmortem scheduled for Friday.
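The guardrail from the fix notes reduces to a simple ratio check; a sketch, with the sampling cadence and metric source left as assumptions:

```python
def heap_delta_exceeded(prev_bytes: int, curr_bytes: int, limit: float = 0.12) -> bool:
    """Return True when heap growth between two samples exceeds the guardrail.

    limit=0.12 mirrors the 12% heap-delta threshold in the fix notes; how
    often samples are taken and where they come from are assumptions.
    """
    if prev_bytes <= 0:
        return False
    return (curr_bytes - prev_bytes) / prev_bytes > limit
```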
Reminder from Security Response Bot: Quarterly credential rotation is due this week.
Targets:
- staging/api-gateway
- prod/redis-cache
- prod/payments-webhook
Runbook includes zero-downtime sequence.
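If these targets live in AWS Secrets Manager with rotation functions already attached (an assumption; the runbook's zero-downtime sequence isn't quoted here), kicking off each rotation could be as simple as:

```python
import boto3

sm = boto3.client("secretsmanager", region_name="us-east-1")

# Start managed rotation for each target from the reminder; the attached
# rotation function handles the create/deploy/verify/revoke sequence.
for secret_id in ("staging/api-gateway", "prod/redis-cache", "prod/payments-webhook"):
    sm.rotate_secret(SecretId=secret_id)
    print(f"rotation started for {secret_id}")
```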
24h platform digest:
- API availability: 99.96%
- Mean incident acknowledgment: 4m 13s
- Mean incident resolution: 31m 42s
- Error budget burn: 12.4%
No unresolved critical alerts at report generation time.
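For readers unfamiliar with the burn metric, a sketch of one common time-based definition; the SLO and budget window below are placeholders, so the result will not match the digest's 12.4%, which depends on the platform's actual SLO and accounting:

```python
def burn_fraction(observed_availability: float, slo: float,
                  window_min: float, budget_window_min: float) -> float:
    """Fraction of the error budget consumed by one reporting window.

    Uses a simple time-based budget: downtime in the window divided by the
    total downtime the SLO allows over the budget window.
    """
    downtime = (1 - observed_availability) * window_min
    allowed = (1 - slo) * budget_window_min
    return downtime / allowed

# e.g. 99.96% over 24h against an assumed 99.9% SLO and 30-day budget:
print(burn_fraction(0.9996, 0.999, 24 * 60, 30 * 24 * 60))  # ~0.013
```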
Feb 16
During yesterday's incident we lost 6 minutes deciding between rollback and feature flag fallback.
Proposal:
1. Add explicit flag fallback step after cache purge (see the sketch after this list).
2. Include owner + confirmation checklist.
3. Add command snippets for both regions.
Can you review by end of day?
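A hypothetical shape for the fallback step in item 1; the flag name, regions, and `set_flag` stand-in are illustrative, not the runbook's real tooling:

```python
def set_flag(region: str, name: str, value: bool) -> None:
    # stand-in for the real feature-flag service call
    print(f"[{region}] {name} -> {value}")

REGIONS = ("us-east-1", "us-west-2")          # assumed region pair
FLAG = "checkout.serve_from_origin"           # illustrative flag name

# Flip the fallback flag in both regions after the cache purge,
# printing a confirmation line per region for the checklist.
for region in REGIONS:
    set_flag(region, FLAG, True)
```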
Coverage update:
- Saturday primary: Kim
- Saturday backup: Ava
- Sunday primary: Nora
- Sunday backup: Theo
Calendar and pager routing are now synced.
Approved action items for checkout latency incident:
1. Add queue depth saturation alert at 8k (one possible definition sketched after this list).
2. Add canary gate for cache dependency changes.
3. Document fallback query path in on-call handbook.
4. Run chaos replay in staging on Wednesday.
Please confirm owners in the tracker thread.
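For item 1, one plausible CloudWatch alarm definition via boto3, assuming the queue is SQS-backed; the queue name and SNS action ARN are placeholders, since the action item only specifies the 8k threshold:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Page on-call when visible queue depth stays above 8k for 5 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="payments-queue-depth-saturation",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "payments-worker"}],  # placeholder
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=8000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],  # placeholder
)
```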