Implementing Resilience Patterns: Bulkheads, Circuit Breakers, and Retry Strategies

TL;DR: A comprehensive specification for HTTP retry with exponential backoff, bulkhead isolation, and circuit breaker patterns for distributed systems.

Building resilient distributed systems requires handling failures gracefully. This article provides a detailed specification for implementing three critical resilience patterns on top of a properly pooled HTTP client: bulkhead isolation, retry strategies, and circuit breakers.

Overview

When your service communicates with external APIs or partner systems, failures are inevitable. Network issues, timeouts, and service unavailability can cascade through your system. These patterns help contain and recover from such failures.

Connection Pooling

Before implementing resilience patterns, establish proper HTTP connection pooling.

http:
  connectionPool:
    maxPoolSize: 500
    staleCheckIntervalInMs: 60000
    maxSizePerRoute: 500
    validateIntervalInMs: 3000

These settings ensure:

  • Efficient connection reuse
  • Stale connection detection
  • Per-route limits to prevent one slow endpoint from exhausting the pool
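
These settings are library-agnostic; as one concrete interpretation, here is a minimal sketch assuming Apache HttpClient 4.5, where setValidateAfterInactivity and evictIdleConnections serve as rough analogues of validateIntervalInMs and staleCheckIntervalInMs:

import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
pool.setMaxTotal(500);                  // maxPoolSize
pool.setDefaultMaxPerRoute(500);        // maxSizePerRoute
pool.setValidateAfterInactivity(3000);  // validateIntervalInMs

CloseableHttpClient client = HttpClients.custom()
        .setConnectionManager(pool)
        .evictExpiredConnections()
        .evictIdleConnections(60, TimeUnit.SECONDS)  // roughly staleCheckIntervalInMs
        .build();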

Bulkhead Pattern

The bulkhead pattern isolates failures so issues with one partner don’t affect others. Think of it like compartments in a ship—if one floods, the others stay dry.

Per-Partner Configuration

bulkhead:
  # Limits concurrent executions
  maxConcurrentCalls: 10
  maxWaitDuration: 0  # Fail fast if bulkhead is full
  
  # Thread pool settings
  maxThreadPoolSize: 50
  queueCapacity: 50
  coreThreadPoolSize: 25
  keepAliveDuration: 3000
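
On the JVM these settings map closely onto Resilience4j's bulkhead support; a minimal sketch assuming that library (the partner names are placeholders):

import java.time.Duration;

import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.bulkhead.BulkheadRegistry;

BulkheadConfig config = BulkheadConfig.custom()
        .maxConcurrentCalls(10)
        .maxWaitDuration(Duration.ZERO)   // fail fast when the bulkhead is full
        .build();

// One registry, one named bulkhead per external partner
BulkheadRegistry registry = BulkheadRegistry.of(config);
Bulkhead paymentsBulkhead = registry.bulkhead("payment-provider");
Bulkhead shippingBulkhead = registry.bulkhead("shipping-provider");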

Key Principles

  1. Partner isolation: Each external partner gets its own bulkhead
  2. Fail fast: When the bulkhead is full, reject immediately rather than queue indefinitely
  3. Reset interface: Provide a way to reset bulkheads for recovery

Retry Strategy

Different HTTP status codes warrant different retry behaviors.

Retry Decision Matrix

Status Code      Retry?          Strategy
200, 201, 202    No              Success
301, 307, 308    Yes             Follow redirect + retry
400-407          No              Client error, fix request
408              Yes             Long retry (timeout)
409-428          No              Client error
429              Circuit break   Rate limited
500, 501, 502    Yes             Short retry
503, 504         Yes             Long retry (service issue)
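
The matrix translates naturally into a small dispatch helper. The sketch below is illustrative: the enum names and the handling of codes the matrix leaves unspecified (such as 430-499) are assumptions, not part of the spec.

enum RetryStrategy { NONE, FOLLOW_REDIRECT_AND_RETRY, SHORT_RETRY, LONG_RETRY, CIRCUIT_BREAK }

final class RetryPolicy {
    // Map an HTTP status code to the strategy from the decision matrix above
    static RetryStrategy forStatus(int status) {
        if (status >= 200 && status <= 299) return RetryStrategy.NONE;          // success
        if (status == 301 || status == 307 || status == 308)
            return RetryStrategy.FOLLOW_REDIRECT_AND_RETRY;
        if (status == 408) return RetryStrategy.LONG_RETRY;                     // request timeout
        if (status == 429) return RetryStrategy.CIRCUIT_BREAK;                  // rate limited
        if (status >= 400 && status <= 499) return RetryStrategy.NONE;          // other client errors
        if (status == 503 || status == 504) return RetryStrategy.LONG_RETRY;    // service issue
        if (status >= 500) return RetryStrategy.SHORT_RETRY;                    // 500, 501, 502
        return RetryStrategy.NONE;                                              // 1xx / other 3xx: not covered by the matrix
    }
}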

Short Retry Configuration (Server Errors)

For HTTP 500, 501, and 502 responses:

retry:
  shortRetry:
    maxAttempts: 25        # Maximum 10 minutes total
    waitDuration: 500      # Initial 500ms
    exponentialBackoffMultiplier: 2
    maxDuration: 600000    # 10 minutes cap

Long Retry Configuration (Timeouts & Service Unavailable)

For HTTP 408, 503, 504:

retry:
  longRetry:
    maxAttempts: 50        # Maximum 2 hours total
    waitDuration: 500      # Initial 500ms
    exponentialBackoffMultiplier: 2
    maxDuration: 7200000   # 2 hours cap
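
In Resilience4j terms (an assumption; the spec itself is library-agnostic) the two profiles might look like the sketch below. Note that Resilience4j's RetryConfig has no built-in overall time cap, so the 10-minute and 2-hour maxDuration limits would need to be enforced by the caller or a surrounding time limiter.

import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;

// Short retry: 500ms initial wait, doubling on each attempt
RetryConfig shortRetry = RetryConfig.custom()
        .maxAttempts(25)
        .intervalFunction(IntervalFunction.ofExponentialBackoff(500, 2))
        .build();

// Long retry: same backoff curve, more attempts for 408/503/504
RetryConfig longRetry = RetryConfig.custom()
        .maxAttempts(50)
        .intervalFunction(IntervalFunction.ofExponentialBackoff(500, 2))
        .build();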

Exponential Backoff Example

With 500ms initial wait and 2x multiplier:

Attempt   Wait Time   Cumulative
1         500ms       500ms
2         1s          1.5s
3         2s          3.5s
4         4s          7.5s
5         8s          15.5s
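
The schedule is easy to sanity-check with a few lines of plain Java (purely illustrative arithmetic, not part of the spec):

// Reproduce the backoff schedule: 500ms initial wait, 2x multiplier
long wait = 500, cumulative = 0;
for (int attempt = 1; attempt <= 12; attempt++) {
    cumulative += wait;
    System.out.printf("attempt %d: wait %dms, cumulative %dms%n", attempt, wait, cumulative);
    wait *= 2;
}
// The cumulative wait passes 600000ms (10 minutes) around attempt 11,
// which is where the short retry's maxDuration cap takes over.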

Predicate Handling

Configure success detection beyond just HTTP status:

// Result predicate - retry when the API returns HTTP 200
// but the response body carries a non-zero application error code
retryOnResultPredicate(response -> {
    return response.getBody().getErrorCode() != 0;
})

// Exception predicate - retry only on transient exceptions
retryExceptionPredicate(ex -> {
    return ex instanceof TimeoutException
        || ex instanceof ConnectionException;
})
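
Assuming a Resilience4j-style retry layer, these predicates correspond to retryOnResult and retryOnException on the config builder. In the sketch below, ApiResponse and its accessors are placeholder types carried over from the example above:

import java.net.ConnectException;
import java.util.concurrent.TimeoutException;

import io.github.resilience4j.retry.RetryConfig;

RetryConfig config = RetryConfig.<ApiResponse>custom()   // ApiResponse is a placeholder response type
        .maxAttempts(25)
        // Retry when the body signals an application-level error despite HTTP 200
        .retryOnResult(response -> response.getBody().getErrorCode() != 0)
        // Retry only on transient network exceptions
        .retryOnException(ex -> ex instanceof TimeoutException || ex instanceof ConnectException)
        .build();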

Handling 504 Gateway Timeout

HTTP 504 requires special handling to prevent duplicate processing: the gateway timed out, but the downstream service may still have completed the request.

// After max retries for a 504, reconcile before resubmitting: the
// downstream service may have completed the call even though the
// gateway timed out.
if (maxRetriesExceeded && statusCode == 504) {
    // Look up the transaction via the partner's retrieval endpoint
    existingRecord = retrieveAPI.get(transactionId);
    if (existingRecord != null) {
        return existingRecord; // Already processed - don't create a duplicate
    }
}

Circuit Breaker Pattern

Circuit breakers prevent cascading failures by stopping requests to failing services.

For HTTP 429 (Rate Limited)

circuitBreaker:
  rateLimited:
    waitDurationInOpenState: 300000     # 5 minutes
    maxWaitDurationInHalfOpenState: 600000  # 10 minutes
    action: STOP_CONSUMING_MESSAGES

For HTTP 408 (Request Timeout)

circuitBreaker:
  timeout:
    waitDurationInOpenState: 900000     # 15 minutes
    maxWaitDurationInHalfOpenState: 1800000  # 30 minutes
    action: STOP_CONSUMING_MESSAGES
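
STOP_CONSUMING_MESSAGES is not something a circuit breaker library provides out of the box. One way to approximate it, sketched here assuming Resilience4j and a pausable message consumer of your own, is to react to state transitions:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;

// `consumer` is a hypothetical wrapper around your queue listener
circuitBreaker.getEventPublisher().onStateTransition(event -> {
    switch (event.getStateTransition().getToState()) {
        case OPEN:
            consumer.pause();   // stop pulling messages while the partner is failing
            break;
        case HALF_OPEN:
        case CLOSED:
            consumer.resume();  // allow trial traffic / normal traffic again
            break;
        default:
            break;
    }
});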

Core Circuit Breaker Configuration

circuitBreaker:
  core:
    slidingWindowType: TIME_BASED
    slidingWindowSize: 60              # 60 second window
    failureRateThreshold: 50           # 50% failure opens circuit
    minimumNumberOfCalls: 10           # Minimum calls before evaluating
    automaticTransitionFromOpenToHalfOpenEnabled: true
    permittedNumberOfCallsInHalfOpenState: 5
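
These settings translate almost one-to-one into Resilience4j's CircuitBreakerConfig (an assumption; any equivalent library works). The per-status waitDurationInOpenState values from the 429 and 408 profiles would go into separate named instances:

import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

CircuitBreakerConfig core = CircuitBreakerConfig.custom()
        .slidingWindowType(SlidingWindowType.TIME_BASED)
        .slidingWindowSize(60)                           // 60 second window
        .failureRateThreshold(50)                        // 50% failure rate opens the circuit
        .minimumNumberOfCalls(10)
        .automaticTransitionFromOpenToHalfOpenEnabled(true)
        .permittedNumberOfCallsInHalfOpenState(5)
        .waitDurationInOpenState(Duration.ofMinutes(5))  // e.g. the 429 profile
        .build();

// Partner-specific breakers share the core configuration
CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(core);
CircuitBreaker paymentsBreaker = registry.circuitBreaker("payment-provider");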

Circuit Breaker States

CLOSED     --(failure rate exceeds threshold)-->  OPEN
OPEN       --(wait duration elapses)----------->  HALF_OPEN
HALF_OPEN  --(trial calls succeed)------------->  CLOSED
HALF_OPEN  --(trial calls fail)---------------->  OPEN

Notification and Monitoring

Admin Notifications

When retries are exhausted, notify administrators:

notifications:
  enabled: true
  recipients: ["ops-team@company.com"]
  includeInNotification:
    - partnerDetails
    - endpoint
    - payloadSnippet
    - retryStatistics
    - lastErrorMessage

Event Stream Recording

Log every retry attempt for debugging:

{
  "eventType": "RETRY_ATTEMPT",
  "timestamp": "2024-01-15T10:30:00Z",
  "partner": "payment-provider",
  "endpoint": "/api/transactions",
  "attempt": 3,
  "maxAttempts": 25,
  "statusCode": 503,
  "waitBeforeNextRetry": 4000
}
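
If the retry layer is Resilience4j (again an assumption), each attempt can be captured from the retry instance's event publisher and forwarded to your event stream, reusing the shortRetry config from the earlier sketch; eventStream.publish is a placeholder for whatever sink you use:

import java.util.Map;

import io.github.resilience4j.retry.Retry;

Retry retry = Retry.of("payment-provider", shortRetry);
retry.getEventPublisher().onRetry(event -> eventStream.publish(Map.of(   // eventStream is a placeholder sink
        "eventType", "RETRY_ATTEMPT",
        "partner", event.getName(),
        "attempt", event.getNumberOfRetryAttempts(),
        "waitBeforeNextRetry", event.getWaitInterval().toMillis(),
        "lastError", String.valueOf(event.getLastThrowable()))));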

Implementation Checklist

Connection Layer

  • Configure connection pool per partner
  • Set appropriate timeout values (start at 500ms)
  • Make timeouts configurable

Bulkhead Layer

  • Implement per-partner bulkheads
  • Add reset interface for operations
  • Monitor bulkhead saturation

Retry Layer

  • Implement short retry for server errors
  • Implement long retry for timeouts
  • Add predicate handling for application-level errors
  • Handle 504 reconciliation
  • Record all retry attempts in event stream

Circuit Breaker Layer

  • Implement circuit breaker for 429 (rate limiting)
  • Implement circuit breaker for 408 (timeout)
  • Add manual circuit breaker reset interface
  • Ensure circuit breakers are partner-specific

Monitoring

  • Configure admin notifications
  • Set up alerts for MaxRetriesExceededException
  • Monitor circuit breaker state changes

Evaluation Points

After implementation, evaluate the following under moderate, production-like load:

  1. Resource consumption: CPU, memory, thread pool utilization
  2. Rate limiter applicability: May need rate limiting for bursty traffic
  3. Time limiter needs: Maximum execution time per request
  4. Cache opportunities: Can some responses be cached?

Conclusion

Implementing these resilience patterns transforms your service from fragile to robust. The key is layering: connection pools provide efficient resource use, bulkheads provide isolation, retries handle transient failures, and circuit breakers prevent cascade failures.

Start with sensible defaults, monitor in production, and tune based on actual failure patterns you observe.