Circuit breaker

The circuit breaker protects both Mambu and your systems by stopping repeated calls to slow or failing webhook endpoints.
When a destination keeps timing out or returning errors, the circuit breaker can temporarily pause delivery to that destination, then probe it again and resume when it has recovered.

This page describes the behavior from the Notifications service point of view, which is responsible for HTTP dispatch, timeouts, retries, and the circuit breaker.

What the circuit breaker does

  • Monitors webhook delivery outcomes on a per-destination basis (per tenant and template).
  • Counts failures such as:
    • Non-2xx HTTP responses
    • Request timeouts
    • Network/TLS/connect errors
  • When failures exceed configured thresholds, it opens the circuit for that destination and temporarily pauses delivery.
  • After a cool-down period, it probes the destination again with a limited number of trial requests.
  • If those trial requests succeed, normal delivery resumes; if failures continue, the circuit re-opens.

Exact numeric thresholds and timings (timeouts, failure ratios, delays) are configuration-driven and may differ across environments. Do not assume fixed values such as “500 ms”, “5 out of 10”, or “5 seconds”.
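
To make the failure counting concrete, the following is a minimal sketch of how delivery outcomes could be classified and tracked in a recent window. It is not Mambu's implementation: the names `is_failure` and `FailureWindow`, the window size, the ratio, and the rule of waiting for a full window are all assumptions chosen for illustration.

```python
from collections import deque

# Hypothetical values for illustration only; real thresholds are configuration-driven.
WINDOW_SIZE = 10
FAILURE_RATIO_TO_OPEN = 0.5


def is_failure(status_code=None, timed_out=False, network_error=False):
    """Return True for any outcome other than an HTTP 2xx response."""
    if timed_out or network_error or status_code is None:
        return True
    return not (200 <= status_code < 300)


class FailureWindow:
    """Keeps the most recent delivery outcomes for one destination."""

    def __init__(self, size=WINDOW_SIZE):
        self.outcomes = deque(maxlen=size)  # True means the delivery failed

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def failure_ratio(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_open(self, threshold=FAILURE_RATIO_TO_OPEN):
        # Assumption: decide only once the window is full, so that a single
        # early failure does not open the circuit.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.failure_ratio() >= threshold
```

In this sketch a 3xx or 4xx response counts exactly like a 5xx response or a timeout, which matches the rule that only 2xx is treated as success.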

How it works (states and behavior)

At a high level, the circuit breaker for webhooks behaves as follows:

  • Closed (normal operation)

    • All webhook deliveries are attempted.
    • A delivery is considered successful only if your endpoint returns an HTTP 2xx status.
    • Non-2xx responses, timeouts, and network/TLS errors are treated as failures.
    • The HTTP client applies a per-request timeout that is configured by Mambu (values may differ for initial attempts vs retry attempts).
  • Opening the circuit

    • Notifications tracks recent executions per destination.
    • When the failure rate or number of failures in a recent window crosses configured thresholds, the breaker opens for that destination.
    • While Open, webhook deliveries to that destination are not attempted; they are short-circuited by the breaker.
  • Open (cool-down / pause)

    • The circuit remains Open for a configured cool-down period.
    • During this period, new deliveries to that destination are blocked by the breaker and treated as failures from the retry policy’s point of view (no HTTP call is made).
  • Half-open (probe)

    • After the cool-down, the breaker moves to Half-open and allows a limited number of trial requests through.
    • If enough of these trial calls succeed (2xx within the normal timeout), the breaker transitions back to Closed.
    • If failures continue, the breaker returns to Open for another cool-down period.

The overall goal is to reduce pressure on unstable endpoints, while still allowing them to recover and rejoin normal traffic once they start responding successfully again.
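
The state transitions described above can be sketched as a small state machine. This is an illustrative model only, with several simplifying assumptions: it counts consecutive failures rather than a failure rate over a window, it requires every trial request to succeed before closing, and the class name, parameter names, and default values are hypothetical.

```python
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"


class CircuitBreaker:
    """Illustrative Closed / Open / Half-open breaker for one destination."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 trial_requests=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.cooldown_seconds = cooldown_seconds    # hypothetical default
        self.trial_requests = trial_requests        # hypothetical default
        self.clock = clock
        self.state = CLOSED
        self.failures = 0
        self.trials_started = 0
        self.trial_successes = 0
        self.opened_at = None

    def allow_request(self):
        """Return True if an HTTP call may be made right now."""
        if self.state == OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                return False  # still cooling down: short-circuit
            # Cool-down elapsed: start probing.
            self.state = HALF_OPEN
            self.trials_started = 0
            self.trial_successes = 0
        if self.state == HALF_OPEN:
            if self.trials_started >= self.trial_requests:
                return False  # trial budget already used
            self.trials_started += 1
        return True

    def record_success(self):
        if self.state == HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.trial_requests:
                self.state = CLOSED
                self.failures = 0
        elif self.state == CLOSED:
            self.failures = 0  # simplification: count consecutive failures

    def record_failure(self):
        if self.state == HALF_OPEN:
            self._open()  # a failed probe re-opens the circuit
        elif self.state == CLOSED:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def _open(self):
        self.state = OPEN
        self.opened_at = self.clock()
```

Injecting the clock keeps the sketch testable without real waiting; a caller would check `allow_request()` before each delivery and report the outcome with `record_success()` or `record_failure()`.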

Interaction with retries

Webhook delivery uses two main phases:

  1. Initial (fast-lane) attempt

    • A new notification is dispatched once with a per-request timeout.
    • If it succeeds (HTTP 2xx), the message is marked as sent.
    • If it fails in a way that is considered retryable (for example, timeout or transient network/TLS error), the message may be moved to a slow lane for retries.
  2. Slow-lane retries (under retry policy + circuit breaker)

    • On the slow lane, retries are executed under a retry policy (with backoff) and, when enabled, a circuit breaker.
    • These policies are combined, so the circuit breaker does influence retries:
      • If the breaker is Closed, retries will attempt HTTP calls according to the retry policy.
      • If the breaker is Open, attempts are short-circuited and no HTTP call is made until the cool-down expires.
      • In Half-open, only a limited number of trial attempts are allowed; their outcome determines whether the breaker closes again or returns to Open.
    • Retries are still at-least-once: under transient failures, you may see duplicate deliveries for the same logical event.

Because of this, your endpoint should always be prepared for:

  • At-least-once delivery (duplicates are possible).
  • Bursts of traffic when a previously failing endpoint recovers and traffic resumes.
  • Periods with no traffic after repeated failures, while the breaker is Open.
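
The way a retry policy and a circuit breaker combine on the slow lane can be sketched as follows. This is an assumption-laden illustration, not the Notifications service code: `deliver_with_retries`, `send`, and `breaker_allows` are hypothetical names, and exponential backoff with these defaults is only one possible backoff scheme.

```python
import time


def deliver_with_retries(send, breaker_allows, max_attempts=5,
                         base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Try to deliver one notification under a retry policy and a breaker.

    send: hypothetical callable that performs the HTTP call and returns
        True on a 2xx response, False on any other outcome.
    breaker_allows: hypothetical callable that returns False while the
        breaker is Open (or the Half-open trial budget is used up).
    Returns (delivered, http_calls_made).
    """
    http_calls = 0
    for attempt in range(max_attempts):
        if breaker_allows():
            http_calls += 1
            if send():
                return True, http_calls
        # A short-circuited attempt makes no HTTP call but still counts as
        # a failed attempt from the retry policy's point of view.
        if attempt < max_attempts - 1:
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return False, http_calls
```

For example, if the breaker blocks the first two attempts and allows the third, the sketch waits through two backoff delays and makes a single HTTP call. Recording each outcome on the breaker is left to the `send` callable here to keep the example short.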

For details on how failed notifications are retried and surfaced in the UI, see Managing notifications.

Recommendations

To work well with the retry policy and circuit breaker:

  • Acknowledge quickly with 2xx

    • Return an HTTP 2xx status as soon as you accept the event.
    • Process the payload asynchronously on your side (for example, queue it internally) instead of doing heavy work in the synchronous response.
    • Avoid redirects (3xx) and long-running synchronous processing: a redirect is a non-2xx response and counts as a failure, and slow processing risks exceeding the request timeout.
  • Design for at-least-once delivery

    • Make your handlers idempotent.
    • Use the x-notifications-idempotency-key header to detect and safely ignore duplicate deliveries when present.
    • If you accept an event but encounter a downstream error, consider returning 2xx and handling the error asynchronously, instead of returning 5xx on the webhook call.
  • Monitor failures and latency

    • Track:
      • 2xx vs non-2xx response rates (especially 4xx and 5xx)
      • Timeouts and request latency
      • Trends around incident times (spikes in errors followed by quiet periods)
    • Persistent non-2xx responses and timeouts increase failure counts and may open the circuit for your destination.
  • React quickly to 4xx errors

    • 4xx responses often indicate misconfiguration (wrong URL, invalid credentials, authorization failures).
    • Fix these quickly to restore delivery; otherwise, failures will persist and may keep the breaker from closing.
  • Capacity and rate limiting

    • Size your infrastructure to handle:
      • Normal traffic
      • Additional retry traffic during incidents
      • A potential burst when a destination recovers and the circuit closes
    • Apply your own rate limits and backpressure mechanisms where appropriate.
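
As an illustration of the first two recommendations, the sketch below acknowledges quickly and ignores duplicates using the x-notifications-idempotency-key header. It is framework-agnostic and deliberately simplified: `handle_webhook`, `work_queue`, and `seen_keys` are hypothetical names, and the in-memory set stands in for what would normally be a durable store with expiry.

```python
import queue

IDEMPOTENCY_HEADER = "x-notifications-idempotency-key"

# Assumptions for illustration: an in-process queue stands in for your
# internal queue, and an in-memory set stands in for a durable key store.
work_queue = queue.Queue()
seen_keys = set()


def handle_webhook(headers, body):
    """Return the HTTP status code to send back for one webhook call."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    key = normalized.get(IDEMPOTENCY_HEADER)

    if key is not None:
        if key in seen_keys:
            # Duplicate delivery: acknowledge again, but do no new work.
            return 200
        seen_keys.add(key)

    # Defer heavy work: enqueue the payload and acknowledge immediately.
    work_queue.put((key, body))
    return 200
```

A separate worker would drain `work_queue` and handle downstream errors on its own schedule, so that slow processing never delays the 2xx response. When the header is absent, this sketch cannot deduplicate and simply enqueues the event.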

For more guidance on endpoint design and observability, see Webhook best practices and Troubleshooting.