Building Resilient Services: Retry, Circuit Breaker, and Bulkhead Patterns

TL;DR: A comprehensive guide to implementing resilience patterns, including retry strategies, circuit breakers, and bulkheads, for fault-tolerant distributed systems.

Overview

In distributed systems, failures are inevitable. This guide shows how to implement resilience patterns so that services handle downstream failures gracefully.

Resilience Patterns Overview

┌───────────────────────────────────────────────────────────────────┐
│                    Resilience Patterns Stack                       │
├───────────────────────────────────────────────────────────────────┤
│                                                                    │
│    ┌─────────────┐   ┌─────────────────┐   ┌─────────────────┐   │
│    │             │   │                 │   │                 │   │
│    │   Retry     │──►│ Circuit Breaker │──►│    Bulkhead     │   │
│    │             │   │                 │   │                 │   │
│    └─────────────┘   └─────────────────┘   └─────────────────┘   │
│          │                    │                      │            │
│          ▼                    ▼                      ▼            │
│    Transient       Prevent cascade         Isolate failures      │
│    failure         failures                between partners       │
│    recovery                                                       │
└───────────────────────────────────────────────────────────────────┘

Connection Pooling

Before implementing resilience, configure proper connection pooling:

# application.yml
http:
  connection-pool:
    maxPoolSize: 500
    staleCheckIntervalInMs: 60000
    maxSizePerRoute: 500
    validateIntervalInMs: 3000

@Configuration
public class HttpClientConfig {
    
    @Bean
    public CloseableHttpClient httpClient() {
        PoolingHttpClientConnectionManager connectionManager = 
            new PoolingHttpClientConnectionManager();
        connectionManager.setMaxTotal(500);
        connectionManager.setDefaultMaxPerRoute(500);
        connectionManager.setValidateAfterInactivity(3000);
        
        return HttpClients.custom()
            .setConnectionManager(connectionManager)
            .evictIdleConnections(60, TimeUnit.SECONDS)
            .build();
    }
}

Bulkhead Pattern

Bulkheads isolate failures between partners, preventing issues with one from affecting others.

Configuration

# Per-partner bulkhead configuration
resilience4j:
  bulkhead:
    instances:
      partner-a:
        maxConcurrentCalls: 10
        maxWaitDuration: 0
      partner-b:
        maxConcurrentCalls: 10
        maxWaitDuration: 0
        
  thread-pool-bulkhead:
    instances:
      partner-a:
        maxThreadPoolSize: 50
        coreThreadPoolSize: 25
        queueCapacity: 50
        keepAliveDuration: 3000ms

Implementation

@Service
public class PartnerService {
    
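    // The THREADPOOL bulkhead already runs this method on partner-a's dedicated
    // pool, so the blocking call is made directly. Wrapping it in supplyAsync
    // would hand the work to the shared ForkJoinPool while a bulkhead thread
    // merely waits for the result.
    @Bulkhead(name = "partner-a", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<Response> callPartnerA(Request request) {
        return CompletableFuture.completedFuture(
            restTemplate.postForObject(partnerAUrl, request, Response.class));
    }
    
    // partner-b has no thread-pool-bulkhead entry in the configuration above,
    // so it uses the default thread pool settings unless one is added.
    @Bulkhead(name = "partner-b", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<Response> callPartnerB(Request request) {
        return CompletableFuture.completedFuture(
            restTemplate.postForObject(partnerBUrl, request, Response.class));
    }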
}
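
To make the isolation idea concrete, here is a minimal semaphore-style sketch in plain Java, independent of Resilience4j. The class name `SemaphoreBulkhead` and its methods are hypothetical and exist only for illustration. It rejects a call immediately when no permit is free, which roughly mirrors `maxWaitDuration: 0` in the configuration above.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical illustration class, not part of Resilience4j.
final class SemaphoreBulkhead {

    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if a permit is free; otherwise rejects without waiting.
    <T> T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("Bulkhead full");
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Because each partner gets its own instance, exhausting the permits of one bulkhead leaves the other partner's permits untouched.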

Retry Pattern

Strategy by HTTP Status Code

┌────────────────────────────────────────────────────────────────────┐
│                    Retry Strategy Matrix                            │
├─────────────────┬──────────────┬────────────────┬──────────────────┤
│  HTTP Status    │ Max Attempts │ Wait Duration  │ Backoff          │
├─────────────────┼──────────────┼────────────────┼──────────────────┤
│ 2xx (success)   │ No retry     │ -              │ -                │
│ 4xx (except     │ 25 attempts  │ 500ms initial  │ Exponential (2x) │
│   408,429)      │ (~10 min)    │                │                  │
│ 408, 503, 504   │ 50 attempts  │ 500ms initial  │ Exponential (2x) │
│                 │ (~2 hours)   │                │                  │
│ 5xx (except     │ 25 attempts  │ 500ms initial  │ Exponential (2x) │
│   503,504)      │ (~10 min)    │                │                  │
│ 301, 307, 308   │ Follow +     │ -              │ -                │
│ (redirects)     │ admin notify │                │                  │
└─────────────────┴──────────────┴────────────────┴──────────────────┘
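
The matrix can be read as a lookup from status code to strategy. The sketch below is one possible encoding in plain Java; `RetryStrategy`, `Kind`, and the label names are hypothetical. Note that 429 is excluded from the 4xx row but has no row of its own, so the sketch reports it as not covered rather than guessing a policy.

```java
// Hypothetical encoding of the retry strategy matrix.
final class RetryStrategy {

    enum Kind { NO_RETRY, FOLLOW_AND_NOTIFY, LONG_RETRY, DEFAULT_RETRY, NOT_IN_MATRIX }

    static Kind forStatus(int status) {
        if (status >= 200 && status < 300) {
            return Kind.NO_RETRY;
        }
        if (status == 301 || status == 307 || status == 308) {
            return Kind.FOLLOW_AND_NOTIFY;
        }
        if (status == 408 || status == 503 || status == 504) {
            return Kind.LONG_RETRY;
        }
        if (status == 429) {
            return Kind.NOT_IN_MATRIX; // excluded from the 4xx row, no row of its own
        }
        if (status >= 400 && status < 600) {
            return Kind.DEFAULT_RETRY;
        }
        return Kind.NOT_IN_MATRIX;     // 1xx and the remaining 3xx codes
    }

    // Attempt counts taken from the matrix; 1 means a single call with no retry.
    static int maxAttempts(Kind kind) {
        switch (kind) {
            case LONG_RETRY:
                return 50;
            case DEFAULT_RETRY:
                return 25;
            default:
                return 1;
        }
    }
}
```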

Configuration

resilience4j:
  retry:
    instances:
      default-retry:
        maxAttempts: 25
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        failAfterMaxAttempts: true
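        retryExceptions:
          - java.io.IOException
          - java.net.ConnectException
          - org.springframework.web.client.ResourceAccessException
          # RestTemplate raises this for 5xx responses, which the matrix retries
          - org.springframework.web.client.HttpServerErrorException
          # Thrown by callPartner on a failed response body (package name assumed)
          - com.example.RetryableException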
        ignoreExceptions:
          - com.example.BusinessException
          
      long-retry:
        maxAttempts: 50
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        resultPredicate: com.example.RetryPredicate
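
As a sanity check on these settings, the sketch below adds up the waits that exponential backoff produces. `BackoffSchedule` and the 30-second cap are assumptions made for illustration, not values from the configuration above. Doubling from 500ms with no cap makes the last of 25 attempts wait roughly 48 days, so the "~10 min" figure in the matrix only holds if something caps the per-attempt wait.

```java
import java.time.Duration;

// Hypothetical helper for estimating total retry time.
final class BackoffSchedule {

    // Wait before retry number retryIndex (0-based), clamped to the cap.
    static Duration waitBefore(int retryIndex, Duration initial,
                               double multiplier, Duration cap) {
        double millis = initial.toMillis() * Math.pow(multiplier, retryIndex);
        return Duration.ofMillis((long) Math.min(millis, cap.toMillis()));
    }

    // Total waiting time across maxAttempts calls, i.e. maxAttempts - 1 waits.
    static Duration totalWait(int maxAttempts, Duration initial,
                              double multiplier, Duration cap) {
        Duration total = Duration.ZERO;
        for (int i = 0; i < maxAttempts - 1; i++) {
            total = total.plus(waitBefore(i, initial, multiplier, cap));
        }
        return total;
    }
}
```

With 25 attempts, a 500ms initial wait, and a multiplier of 2, `totalWait` returns 571.5 seconds (about 9.5 minutes) under the assumed 30-second cap, and about 97 days when the cap is set high enough never to apply. Recent Resilience4j versions expose an `exponentialMaxWaitDuration` retry property for this kind of cap; whether it is available depends on the version in use.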

Implementation with Predicates

@Service
@Slf4j
public class PartnerCallService {
    
    @Retry(name = "default-retry", fallbackMethod = "fallback")
    public PartnerResponse callPartner(PartnerRequest request) {
        ResponseEntity<PartnerResponse> response = restTemplate.exchange(
            partnerUrl,
            HttpMethod.POST,
            new HttpEntity<>(request),
            PartnerResponse.class
        );
        
        // Custom success validation
        if (!isSuccessful(response.getBody())) {
            throw new RetryableException("Partner returned failure status");
        }
        
        return response.getBody();
    }
    
    // Result-based retry predicate
    public static class RetryPredicate implements Predicate<PartnerResponse> {
        @Override
        public boolean test(PartnerResponse response) {
            // Retry if response indicates failure
            return response != null && 
                   response.getErrorCode() != 0;
        }
    }
    
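    // Fallback after max retries. This signature matches only exhausted
    // result-based retries (with failAfterMaxAttempts enabled); exhausted
    // exception-based retries rethrow the last exception instead.
    public PartnerResponse fallback(PartnerRequest request, 
                                    MaxRetriesExceededException ex) {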
        log.error("Max retries exceeded for partner call", ex);
        
        // Notify admin
        notificationService.sendAdminAlert(
            "Partner call failed after max retries",
            request,
            ex
        );
        
        // Return fallback or throw
        throw new PartnerUnavailableException("Partner temporarily unavailable");
    }
}
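
The control flow that a retry decorator applies around `callPartner` can be approximated in plain Java. This is a simplified sketch, not Resilience4j's implementation: `RetryLoop` and `callWithRetry` are hypothetical names, and the backoff sleep between attempts is left out to keep the example short.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical, simplified retry loop with result and exception predicates.
final class RetryLoop {

    static <T> T callWithRetry(Supplier<T> call,
                               Predicate<T> retryOnResult,
                               Predicate<RuntimeException> retryOnException,
                               int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = call.get();
                if (!retryOnResult.test(result)) {
                    return result;              // acceptable result, stop retrying
                }
                lastError = new IllegalStateException(
                    "Result rejected on attempt " + attempt);
            } catch (RuntimeException ex) {
                if (!retryOnException.test(ex)) {
                    throw ex;                   // non-retryable, fail immediately
                }
                lastError = ex;
            }
            // A real implementation would sleep here according to the backoff policy.
        }
        throw lastError;                        // attempts exhausted
    }
}
```

The two predicates play the roles of the result predicate and the `retryExceptions` list in the configuration: one decides whether a returned value counts as a failure, the other whether a thrown exception is worth another attempt.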

Circuit Breaker Pattern

Circuit breakers prevent cascading failures by stopping calls to failing downstream services.

Configuration

resilience4j:
  circuitbreaker:
    instances:
      # For rate limiting errors (429)
      rate-limit-breaker:
        slidingWindowType: TIME_BASED
        slidingWindowSize: 60
        failureRateThreshold: 50
        waitDurationInOpenState: 300000  # 5 minutes
        permittedNumberOfCallsInHalfOpenState: 10
        automaticTransitionFromOpenToHalfOpenEnabled: true
        recordExceptions:
          - com.example.RateLimitedException
          
      # For timeout errors (408)
      timeout-breaker:
        slidingWindowType: TIME_BASED
        slidingWindowSize: 120
        failureRateThreshold: 50
        waitDurationInOpenState: 900000   # 15 minutes
        permittedNumberOfCallsInHalfOpenState: 5
        automaticTransitionFromOpenToHalfOpenEnabled: true
        recordExceptions:
          - java.net.SocketTimeoutException
          - org.apache.http.conn.ConnectTimeoutException

Circuit Breaker States

                    ┌─────────────────┐
       (recovery)   │                 │   (failure threshold)
     ┌──────────────│    CLOSED       │──────────────────┐
     │              │  (Normal ops)   │                  │
     │              └────────┬────────┘                  │
     │                       │                           │
     │                       │ Failure rate >= 50%       │
     │                       ▼                           │
     │              ┌─────────────────┐                  │
     │              │                 │                  │
     │              │     OPEN        │◄─────────────────┘
     │              │  (Reject all)   │
     │              └────────┬────────┘
     │                       │
     │                       │ Wait duration elapsed
     │                       ▼
     │              ┌─────────────────┐
     │              │                 │
     └──────────────│   HALF-OPEN     │
      Success rate  │  (Test calls)   │────────────────┐
      > threshold   └─────────────────┘  Still failing │

                                       Back to OPEN ◄──┘
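
The transitions in the diagram can be expressed as a small state machine. The sketch below is a deliberately simplified, plain-Java illustration rather than Resilience4j's implementation: `SimpleCircuitBreaker` is a hypothetical class, it evaluates a fixed batch of calls instead of a time-based sliding window, and it does not limit how many calls pass in the half-open state. The clock is injected so the wait duration can be tested without sleeping.

```java
import java.util.function.LongSupplier;

// Hypothetical, simplified circuit breaker state machine.
final class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // calls evaluated while CLOSED
    private final double failureRateThreshold; // percent, e.g. 50
    private final long waitInOpenMillis;
    private final int halfOpenCalls;           // calls evaluated while HALF_OPEN
    private final LongSupplier clock;          // current time in milliseconds

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         long waitInOpenMillis, int halfOpenCalls, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpenMillis = waitInOpenMillis;
        this.halfOpenCalls = halfOpenCalls;
        this.clock = clock;
    }

    // Returns the current state, moving OPEN -> HALF_OPEN once the wait has elapsed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitInOpenMillis) {
            transitionTo(State.HALF_OPEN);
        }
        return state;
    }

    boolean allowsCall() {
        return state() != State.OPEN;
    }

    // Records one call outcome and re-evaluates the state once enough calls are seen.
    synchronized void record(boolean success) {
        State current = state();
        if (current == State.OPEN) {
            return; // rejected calls are not counted
        }
        calls++;
        if (!success) {
            failures++;
        }
        int sampleSize = current == State.HALF_OPEN ? halfOpenCalls : windowSize;
        if (calls >= sampleSize) {
            // Trips when the rate is equal to or above the threshold
            boolean failing = 100.0 * failures / calls >= failureRateThreshold;
            transitionTo(failing ? State.OPEN : State.CLOSED);
        }
    }

    private void transitionTo(State next) {
        state = next;
        calls = 0;
        failures = 0;
        if (next == State.OPEN) {
            openedAt = clock.getAsLong();
        }
    }
}
```

A caller would check `allowsCall()` before invoking the partner and report the outcome with `record(...)` afterwards; a failed batch in the half-open state sends the breaker straight back to open.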

Implementation

@Service
@Slf4j
public class ResilientPartnerService {
    
    @CircuitBreaker(name = "rate-limit-breaker", 
                    fallbackMethod = "handleRateLimited")
    @Retry(name = "default-retry")
    @Bulkhead(name = "partner-bulkhead")
    public Response callPartner(Request request) {
        ResponseEntity<Response> response = restTemplate.exchange(
            partnerUrl,
            HttpMethod.POST,
            new HttpEntity<>(request),
            Response.class
        );
        
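        // Reached only if the RestTemplate uses a ResponseErrorHandler that does
        // not throw on 4xx; the default handler raises HttpClientErrorException
        // for a 429 before this check runs.
        if (response.getStatusCodeValue() == 429) {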
            throw new RateLimitedException("Rate limited by partner");
        }
        
        return response.getBody();
    }
    
    public Response handleRateLimited(Request request, 
                                      CallNotPermittedException ex) {
        log.warn("Circuit breaker open for partner, request queued");
        
        // Queue for later processing
        messageQueue.send(new DeferredRequest(request));
        
        return Response.builder()
            .status("QUEUED")
            .message("Request will be processed when partner recovers")
            .build();
    }
}

Admin Notifications

@Component
@Slf4j
public class ResilienceNotificationService {
    
    @Autowired
    private EmailService emailService;
    
    @Value("${admin.notification.emails}")
    private List<String> adminEmails;
    
    public void notifyMaxRetriesExceeded(String partner, 
                                         Object request, 
                                         Exception ex) {
        String subject = String.format(
            "Alert: Max retries exceeded for %s", partner);
        
        String body = String.format("""
            Partner: %s
            Endpoint: %s
            Payload: %s
            Error: %s
            Time: %s
            """,
            partner,
            getEndpoint(request),
            truncatePayload(request),
            ex.getMessage(),
            Instant.now()
        );
        
        emailService.sendToMultiple(adminEmails, subject, body);
    }
    
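    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;
    
    // Resilience4j publishes transitions via each breaker's EventPublisher,
    // not as Spring application events, so subscribe explicitly at startup
    @PostConstruct
    public void subscribeToStateTransitions() {
        circuitBreakerRegistry.getAllCircuitBreakers().forEach(breaker ->
            breaker.getEventPublisher().onStateTransition(this::onCircuitBreakerStateChange));
    }
    
    public void onCircuitBreakerStateChange(CircuitBreakerOnStateTransitionEvent event) {
        if (event.getStateTransition().getToState() == CircuitBreaker.State.OPEN) {
            notifyCircuitBreakerOpen(event.getCircuitBreakerName());
        }
    }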
}

Testing Resilience

Load Testing Results

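┌───────────┬──────────┬───────────┬────────┬─────────────────────────────────────────────┐
│ Iteration │ Messages │ CPU Usage │ Memory │ Result                                      │
├───────────┼──────────┼───────────┼────────┼─────────────────────────────────────────────┤
│ 1         │ 20,000   │ 45-60%    │ 35%    │ Handled with retries (callback server down) │
│ 2         │ 120,000  │ 70-85%    │ 55%    │ Processed in ~45 mins                       │
└───────────┴──────────┴───────────┴────────┴─────────────────────────────────────────────┘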

Verification Script

@SpringBootTest
class ResilienceTest {
    
    @Test
    void testRetryOnTransientFailure() {
        // Simulate transient failure
        mockServer.stubFor(post("/api")
            .inScenario("retry-test")
            .whenScenarioStateIs(STARTED)
            .willReturn(serverError())
            .willSetStateTo("second-attempt"));
            
        mockServer.stubFor(post("/api")
            .inScenario("retry-test")
            .whenScenarioStateIs("second-attempt")
            .willReturn(ok()));
        
        // Should succeed after retry
        Response response = partnerService.callPartner(request);
        assertThat(response).isNotNull();
    }
    
    @Test
    void testCircuitBreakerOpens() {
        // Simulate repeated failures
        mockServer.stubFor(post("/api")
            .willReturn(status(429)));
        
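        // Make enough calls to open circuit. This requires minimumNumberOfCalls
        // to be 10 or lower for rate-limit-breaker (the Resilience4j default is 100).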
        for (int i = 0; i < 10; i++) {
            assertThrows(RateLimitedException.class, 
                () -> partnerService.callPartner(request));
        }
        
        // Next call should fail fast
        assertThrows(CallNotPermittedException.class,
            () -> partnerService.callPartner(request));
    }
}

Best Practices

  1. Configure per-partner - Different partners have different SLAs and failure modes
  2. Use thread pool bulkheads - Prevent one slow partner from blocking all threads
  3. Log all retry attempts - Essential for debugging and monitoring
  4. Set reasonable timeouts - Don’t let connections hang indefinitely
  5. Implement fallbacks - Always have a graceful degradation path
  6. Monitor circuit breaker states - Alert when circuits open
  7. Test failure scenarios - Use chaos engineering to verify resilience