Implementing Resilience Patterns: Bulkheads, Circuit Breakers, and Retry Strategies

TL;DR: A comprehensive specification for HTTP retry with exponential backoff, bulkhead isolation, and circuit breaker patterns for distributed systems.

Building resilient distributed systems requires handling failures gracefully. This article provides a detailed specification for implementing three critical resilience patterns on top of a properly pooled HTTP client: bulkhead isolation, retry strategies, and circuit breakers.

Overview

When your service communicates with external APIs or partner systems, failures are inevitable. Network issues, timeouts, and service unavailability can cascade through your system. These patterns help contain and recover from such failures.

Connection Pooling

Before implementing resilience patterns, establish proper HTTP connection pooling.

http:
  connectionPool:
    maxPoolSize: 500
    staleCheckIntervalInMs: 60000
    maxSizePerRoute: 500
    validateIntervalInMs: 3000

These settings ensure:

  • Efficient connection reuse
  • Stale connection detection
  • Per-route limits to prevent one slow endpoint from exhausting the pool
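
These settings are library-agnostic; as one concrete interpretation, here is a minimal sketch assuming Apache HttpClient 4.5, where setValidateAfterInactivity and evictIdleConnections serve as rough analogues of validateIntervalInMs and staleCheckIntervalInMs:

import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
pool.setMaxTotal(500);                  // maxPoolSize
pool.setDefaultMaxPerRoute(500);        // maxSizePerRoute
pool.setValidateAfterInactivity(3000);  // validateIntervalInMs

CloseableHttpClient client = HttpClients.custom()
        .setConnectionManager(pool)
        .evictExpiredConnections()
        .evictIdleConnections(60, TimeUnit.SECONDS)  // roughly staleCheckIntervalInMs
        .build();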

Bulkhead Pattern

The bulkhead pattern isolates failures so issues with one partner don’t affect others. Think of it like compartments in a ship—if one floods, the others stay dry.

Per-Partner Configuration

bulkhead:
  # Limits concurrent executions
  maxConcurrentCalls: 10
  maxWaitDuration: 0  # Fail fast if bulkhead is full
  
  # Thread pool settings
  maxThreadPoolSize: 50
  queueCapacity: 50
  coreThreadPoolSize: 25
  keepAliveDuration: 3000
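
On the JVM these settings map closely onto Resilience4j's bulkhead support; a minimal sketch assuming that library (the partner names are placeholders):

import java.time.Duration;

import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.bulkhead.BulkheadRegistry;

BulkheadConfig config = BulkheadConfig.custom()
        .maxConcurrentCalls(10)
        .maxWaitDuration(Duration.ZERO)   // fail fast when the bulkhead is full
        .build();

// One registry, one named bulkhead per external partner
BulkheadRegistry registry = BulkheadRegistry.of(config);
Bulkhead paymentsBulkhead = registry.bulkhead("payment-provider");
Bulkhead shippingBulkhead = registry.bulkhead("shipping-provider");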

Key Principles

  1. Partner isolation: Each external partner gets its own bulkhead
  2. Fail fast: When the bulkhead is full, reject immediately rather than queue indefinitely
  3. Reset interface: Provide a way to reset bulkheads for recovery

Retry Strategy

Different HTTP status codes warrant different retry behaviors.

Retry Decision Matrix

Status Code      Retry?          Strategy
200, 201, 202    No              Success
301, 307, 308    Yes             Follow redirect + retry
400-407          No              Client error, fix request
408              Yes             Long retry (timeout)
409-428          No              Client error
429              Circuit break   Rate limited
500, 501, 502    Yes             Short retry
503, 504         Yes             Long retry (service issue)
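
The matrix translates naturally into a small dispatch helper. The sketch below is illustrative: the enum names and the handling of codes the matrix leaves unspecified (such as 430-499) are assumptions, not part of the spec.

enum RetryStrategy { NONE, FOLLOW_REDIRECT_AND_RETRY, SHORT_RETRY, LONG_RETRY, CIRCUIT_BREAK }

final class RetryPolicy {
    // Map an HTTP status code to the strategy from the decision matrix above
    static RetryStrategy forStatus(int status) {
        if (status >= 200 && status <= 299) return RetryStrategy.NONE;          // success
        if (status == 301 || status == 307 || status == 308)
            return RetryStrategy.FOLLOW_REDIRECT_AND_RETRY;
        if (status == 408) return RetryStrategy.LONG_RETRY;                     // request timeout
        if (status == 429) return RetryStrategy.CIRCUIT_BREAK;                  // rate limited
        if (status >= 400 && status <= 499) return RetryStrategy.NONE;          // other client errors
        if (status == 503 || status == 504) return RetryStrategy.LONG_RETRY;    // service issue
        if (status >= 500) return RetryStrategy.SHORT_RETRY;                    // 500, 501, 502
        return RetryStrategy.NONE;                                              // 1xx / other 3xx: not covered by the matrix
    }
}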

Short Retry Configuration (Server Errors)

For HTTP 500, 501, and 502 responses:

retry:
  shortRetry:
    maxAttempts: 25        # Maximum 10 minutes total
    waitDuration: 500      # Initial 500ms
    exponentialBackoffMultiplier: 2
    maxDuration: 600000    # 10 minutes cap

Long Retry Configuration (Timeouts & Service Unavailable)

For HTTP 408, 503, 504:

retry:
  longRetry:
    maxAttempts: 50        # Maximum 2 hours total
    waitDuration: 500      # Initial 500ms
    exponentialBackoffMultiplier: 2
    maxDuration: 7200000   # 2 hours cap
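
In Resilience4j terms (an assumption; the spec itself is library-agnostic) the two profiles might look like the sketch below. Note that Resilience4j's RetryConfig has no built-in overall time cap, so the 10-minute and 2-hour maxDuration limits would need to be enforced by the caller or a surrounding time limiter.

import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;

// Short retry: 500ms initial wait, doubling on each attempt
RetryConfig shortRetry = RetryConfig.custom()
        .maxAttempts(25)
        .intervalFunction(IntervalFunction.ofExponentialBackoff(500, 2))
        .build();

// Long retry: same backoff curve, more attempts for 408/503/504
RetryConfig longRetry = RetryConfig.custom()
        .maxAttempts(50)
        .intervalFunction(IntervalFunction.ofExponentialBackoff(500, 2))
        .build();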

Exponential Backoff Example

With 500ms initial wait and 2x multiplier:

Attempt   Wait Time   Cumulative
1         500ms       500ms
2         1s          1.5s
3         2s          3.5s
4         4s          7.5s
5         8s          15.5s
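
The schedule is easy to sanity-check with a few lines of plain Java (purely illustrative arithmetic, not part of the spec):

// Reproduce the backoff schedule: 500ms initial wait, 2x multiplier
long wait = 500, cumulative = 0;
for (int attempt = 1; attempt <= 12; attempt++) {
    cumulative += wait;
    System.out.printf("attempt %d: wait %dms, cumulative %dms%n", attempt, wait, cumulative);
    wait *= 2;
}
// The cumulative wait passes 600000ms (10 minutes) around attempt 11,
// which is where the short retry's maxDuration cap takes over.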

Predicate Handling

Configure success detection beyond just HTTP status:

// Result predicate - retry when the API returns HTTP 200
// but the response body carries a non-zero application error code
retryOnResultPredicate(response -> {
    return response.getBody().getErrorCode() != 0;
})

// Exception predicate - retry only on transient exceptions
retryExceptionPredicate(ex -> {
    return ex instanceof TimeoutException
        || ex instanceof ConnectionException;
})
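
Assuming a Resilience4j-style retry layer, these predicates correspond to retryOnResult and retryOnException on the config builder. In the sketch below, ApiResponse and its accessors are placeholder types carried over from the example above:

import java.net.ConnectException;
import java.util.concurrent.TimeoutException;

import io.github.resilience4j.retry.RetryConfig;

RetryConfig config = RetryConfig.<ApiResponse>custom()   // ApiResponse is a placeholder response type
        .maxAttempts(25)
        // Retry when the body signals an application-level error despite HTTP 200
        .retryOnResult(response -> response.getBody().getErrorCode() != 0)
        // Retry only on transient network exceptions
        .retryOnException(ex -> ex instanceof TimeoutException || ex instanceof ConnectException)
        .build();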

Handling 504 Gateway Timeout

HTTP 504 requires special handling to prevent duplicate processing: the gateway timed out, but the downstream service may still have completed the request.

// After max retries for a 504, reconcile before resubmitting: the
// downstream service may have completed the call even though the
// gateway timed out.
if (maxRetriesExceeded && statusCode == 504) {
    // Look up the transaction via the partner's retrieval endpoint
    existingRecord = retrieveAPI.get(transactionId);
    if (existingRecord != null) {
        return existingRecord; // Already processed - don't create a duplicate
    }
}

Circuit Breaker Pattern

Circuit breakers prevent cascading failures by stopping requests to failing services.

For HTTP 429 (Rate Limited)

circuitBreaker:
  rateLimited:
    waitDurationInOpenState: 300000     # 5 minutes
    maxWaitDurationInHalfOpenState: 600000  # 10 minutes
    action: STOP_CONSUMING_MESSAGES

For HTTP 408 (Request Timeout)

circuitBreaker:
  timeout:
    waitDurationInOpenState: 900000     # 15 minutes
    maxWaitDurationInHalfOpenState: 1800000  # 30 minutes
    action: STOP_CONSUMING_MESSAGES
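
STOP_CONSUMING_MESSAGES is not something a circuit breaker library provides out of the box. One way to approximate it, sketched here assuming Resilience4j and a pausable message consumer of your own, is to react to state transitions:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;

// `consumer` is a hypothetical wrapper around your queue listener
circuitBreaker.getEventPublisher().onStateTransition(event -> {
    switch (event.getStateTransition().getToState()) {
        case OPEN:
            consumer.pause();   // stop pulling messages while the partner is failing
            break;
        case HALF_OPEN:
        case CLOSED:
            consumer.resume();  // allow trial traffic / normal traffic again
            break;
        default:
            break;
    }
});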

Core Circuit Breaker Configuration

circuitBreaker:
  core:
    slidingWindowType: TIME_BASED
    slidingWindowSize: 60              # 60 second window
    failureRateThreshold: 50           # 50% failure opens circuit
    minimumNumberOfCalls: 10           # Minimum calls before evaluating
    automaticTransitionFromOpenToHalfOpenEnabled: true
    permittedNumberOfCallsInHalfOpenState: 5
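
These settings translate almost one-to-one into Resilience4j's CircuitBreakerConfig (an assumption; any equivalent library works). The per-status waitDurationInOpenState values from the 429 and 408 profiles would go into separate named instances:

import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

CircuitBreakerConfig core = CircuitBreakerConfig.custom()
        .slidingWindowType(SlidingWindowType.TIME_BASED)
        .slidingWindowSize(60)                           // 60 second window
        .failureRateThreshold(50)                        // 50% failure rate opens the circuit
        .minimumNumberOfCalls(10)
        .automaticTransitionFromOpenToHalfOpenEnabled(true)
        .permittedNumberOfCallsInHalfOpenState(5)
        .waitDurationInOpenState(Duration.ofMinutes(5))  // e.g. the 429 profile
        .build();

// Partner-specific breakers share the core configuration
CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(core);
CircuitBreaker paymentsBreaker = registry.circuitBreaker("payment-provider");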

Circuit Breaker States

CLOSED     --(failure rate exceeds threshold)-->  OPEN
OPEN       --(wait duration elapses)----------->  HALF_OPEN
HALF_OPEN  --(trial calls succeed)------------->  CLOSED
HALF_OPEN  --(trial calls fail)---------------->  OPEN

Notification and Monitoring

Admin Notifications

When retries are exhausted, notify administrators:

notifications:
  enabled: true
  recipients: ["ops-team@company.com"]
  includeInNotification:
    - partnerDetails
    - endpoint
    - payloadSnippet
    - retryStatistics
    - lastErrorMessage

Event Stream Recording

Log every retry attempt for debugging:

{
  "eventType": "RETRY_ATTEMPT",
  "timestamp": "2024-01-15T10:30:00Z",
  "partner": "payment-provider",
  "endpoint": "/api/transactions",
  "attempt": 3,
  "maxAttempts": 25,
  "statusCode": 503,
  "waitBeforeNextRetry": 4000
}
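
If the retry layer is Resilience4j (again an assumption), each attempt can be captured from the retry instance's event publisher and forwarded to your event stream, reusing the shortRetry config from the earlier sketch; eventStream.publish is a placeholder for whatever sink you use:

import java.util.Map;

import io.github.resilience4j.retry.Retry;

Retry retry = Retry.of("payment-provider", shortRetry);
retry.getEventPublisher().onRetry(event -> eventStream.publish(Map.of(   // eventStream is a placeholder sink
        "eventType", "RETRY_ATTEMPT",
        "partner", event.getName(),
        "attempt", event.getNumberOfRetryAttempts(),
        "waitBeforeNextRetry", event.getWaitInterval().toMillis(),
        "lastError", String.valueOf(event.getLastThrowable()))));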

Implementation Checklist

Connection Layer

  • Configure connection pool per partner
  • Set appropriate timeout values (start at 500ms)
  • Make timeouts configurable

Bulkhead Layer

  • Implement per-partner bulkheads
  • Add reset interface for operations
  • Monitor bulkhead saturation

Retry Layer

  • Implement short retry for server errors
  • Implement long retry for timeouts
  • Add predicate handling for application-level errors
  • Handle 504 reconciliation
  • Record all retry attempts in event stream

Circuit Breaker Layer

  • Implement circuit breaker for 429 (rate limiting)
  • Implement circuit breaker for 408 (timeout)
  • Add manual circuit breaker reset interface
  • Ensure circuit breakers are partner-specific

Monitoring

  • Configure admin notifications
  • Set up alerts for MaxRetriesExceededException
  • Monitor circuit breaker state changes

Evaluation Points

After implementation, evaluate the following under moderate, production-like load:

  1. Resource consumption: CPU, memory, thread pool utilization
  2. Rate limiter applicability: May need rate limiting for bursty traffic
  3. Time limiter needs: Maximum execution time per request
  4. Cache opportunities: Can some responses be cached?

Conclusion

Implementing these resilience patterns transforms your service from fragile to robust. The key is layering: connection pools provide efficient resource use, bulkheads provide isolation, retries handle transient failures, and circuit breakers prevent cascade failures.

Start with sensible defaults, monitor in production, and tune based on actual failure patterns you observe.