
Webhook reliability at scale: the patterns that stopped 3am pages

Webhooks fail. The question is how loudly. Five patterns we apply to every client workflow so that no failure goes unnoticed.


Half of “the automation broke” incidents are webhooks that quietly failed weeks ago. By the time anyone notices, clearing the data backlog takes hours of recovery work. These five patterns are what we apply by default on every client workflow.

1. Idempotency keys on every receiver

Webhooks retry. If your downstream effect is “create CRM record”, every retry creates a duplicate unless the receiver guards against it. Use the webhook delivery ID as an idempotency key in your receiver, and acknowledge duplicates with a 200 without reprocessing them (so the sender stops retrying).
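A minimal sketch of that receiver logic, assuming the sender reuses the same delivery ID on every retry. The table name `seen_deliveries`, the `handle_delivery` function, and the SQLite store are our illustrative choices, not any particular provider's API; in production the key store would be whatever database the receiver already uses.

```python
import sqlite3


def make_store(path=":memory:"):
    """Open a store with one row per delivery ID we have already handled."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_deliveries (delivery_id TEXT PRIMARY KEY)"
    )
    return conn


def handle_delivery(conn, delivery_id, payload, side_effect):
    """Run side_effect(payload) at most once per delivery_id.

    Returns (http_status, outcome). Duplicates are acknowledged with 200 so
    the sender stops retrying; failures return 500 so it retries later.
    """
    try:
        with conn:  # commits on success, rolls back if anything below raises
            try:
                conn.execute(
                    "INSERT INTO seen_deliveries (delivery_id) VALUES (?)",
                    (delivery_id,),
                )
            except sqlite3.IntegrityError:
                # Primary-key clash: this delivery was already processed.
                return 200, "duplicate"
            side_effect(payload)
    except Exception:
        # The rollback released the key, so a retry can be processed.
        return 500, "failed"
    return 200, "processed"
```

Recording the key and running the effect in one transaction means a failed effect does not permanently "use up" the delivery ID. If the side effect lives in an external system such as a CRM, the rollback cannot undo it, so pass the same key to that system too when it supports one.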

2. Dead-letter queue for failed deliveries

Any delivery your receiver fails to process should land in a dead-letter queue (DLQ): SQS, a Redis list, even a Postgres table. Reprocess it on a schedule. Never drop a failure silently.
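The table-backed variant can be sketched as follows. SQLite stands in for Postgres here so the example runs anywhere, and the names `dead_letters`, `receive`, and `reprocess` are hypothetical. The scheduler that calls `reprocess` (cron, a worker loop) is left out.

```python
import json
import sqlite3


def open_dlq(path=":memory:"):
    """Create the dead-letter table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS dead_letters (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               error TEXT NOT NULL,
               attempts INTEGER NOT NULL DEFAULT 1
           )"""
    )
    return conn


def receive(conn, payload, handler):
    """Try to process a delivery; park it in the DLQ instead of dropping it."""
    try:
        handler(payload)
        return True
    except Exception as exc:
        with conn:
            conn.execute(
                "INSERT INTO dead_letters (payload, error) VALUES (?, ?)",
                (json.dumps(payload), repr(exc)),
            )
        return False


def reprocess(conn, handler):
    """Retry every parked delivery, oldest first; return how many recovered."""
    rows = conn.execute("SELECT id, payload FROM dead_letters ORDER BY id").fetchall()
    recovered = 0
    for row_id, raw in rows:
        try:
            handler(json.loads(raw))
        except Exception as exc:
            # Still failing: keep the row and record the latest error.
            with conn:
                conn.execute(
                    "UPDATE dead_letters SET attempts = attempts + 1, error = ? WHERE id = ?",
                    (repr(exc), row_id),
                )
        else:
            with conn:
                conn.execute("DELETE FROM dead_letters WHERE id = ?", (row_id,))
            recovered += 1
    return recovered
```

The `attempts` column is there so that a row which keeps failing can be escalated to a human rather than retried forever; the threshold is a judgment call per workflow.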

3. Signature verification before any work

Validate the HMAC signature first, before parsing or acting on the payload. Webhooks without auth get spoofed; we have seen this in the wild.
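As a rough illustration, the check can be this small. The sketch assumes the sender signs the raw request body with HMAC-SHA256 and sends the hex digest in a header; the header name, encoding, and any timestamp prefix differ between providers, so check the sender's documentation. `verify_signature` is our own helper name.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Return True only if signature_hex matches HMAC-SHA256(secret, raw_body).

    Assumption: the sender hex-encodes the digest. Verify against the raw
    bytes as received, not a re-serialized copy of the parsed JSON.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature were correct.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

A receiver would call this before anything else and answer with a 401 on `False`, without touching the payload.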

4. Health monitor that catches “no events”

An “is the webhook working?” check is not “did the last call succeed?” It is “have we received an event in the last expected window?” If your store gets an order every 10 minutes, alert when nothing arrives for 30 minutes.
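The core of such a monitor is a single comparison. In this sketch, `is_silent` is a hypothetical helper, the 30-minute default mirrors the example above, and where the last-event timestamp is stored and how the alert is delivered are left to the workflow.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_silent(
    last_event_at: Optional[datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=30),
) -> bool:
    """True when no event has arrived within the expected window.

    last_event_at is None when the receiver has never seen an event,
    which is treated as silence as well.
    """
    if last_event_at is None:
        return True
    return now - last_event_at > max_silence
```

A scheduled job would call `is_silent(last_seen, datetime.now(timezone.utc))` every few minutes and page when it returns `True`. For sources with quiet periods (nights, weekends), `max_silence` probably needs to vary by time of day.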

5. Replay endpoint, not a panic restore

Build an endpoint that re-emits webhook deliveries for a date range. When the worst happens (the whole receiver was down), you replay; you do not write a one-off script under pressure.
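The logic behind such an endpoint can be sketched as a function over a delivery log. This assumes every delivery is logged with its ID and receive time; the `Delivery` record and `replay` function are illustrative names, and the HTTP layer and its authentication are omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass(frozen=True)
class Delivery:
    delivery_id: str
    received_at: datetime
    payload: dict


def replay(
    log: Iterable[Delivery],
    start: datetime,
    end: datetime,
    emit: Callable[[Delivery], None],
) -> int:
    """Re-emit every logged delivery with start <= received_at < end.

    Deliveries go out oldest first; the return value is how many were sent.
    """
    selected = sorted(
        (d for d in log if start <= d.received_at < end),
        key=lambda d: d.received_at,
    )
    for delivery in selected:
        emit(delivery)
    return len(selected)
```

Because each replayed delivery keeps its original `delivery_id`, a receiver that uses idempotency keys (pattern 1) can accept an overlapping date range without creating duplicates.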

The 3am page

One client (e-commerce, 400 orders/day) silently lost 8% of its orders for a month. Patterns 2 and 4 above would have raised an alert on day one. Both are now in place.

We ship these patterns by default on every workflow automation engagement.
