Spring WebFlux Performance Optimization

TL;DR: Diagnosing and fixing performance issues in reactive Spring WebFlux applications, including event loop tuning, memory leaks, and thread pool configuration

Problem Statement

Identify the reason for instance health check failure, high response time, and instability in a reactive service.

Understanding Spring WebFlux vs MVC

Spring MVC (Blocking)

Each request takes a thread and blocks it until a response is returned. This works well for traditional synchronous applications.

Spring WebFlux (Non-blocking)

Requests are handled without parking a thread while it waits on IO, so a small number of threads stays free to accept other requests. WebFlux follows a publisher/subscriber model: Mono and Flux are publishers, nothing runs until a subscriber subscribes to them, and results are emitted as soon as processing completes.
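
To make the publisher/subscriber relationship concrete, here is a minimal sketch using the JDK's built-in java.util.concurrent.Flow API instead of Reactor, so it runs without extra dependencies. The class and method names (PublisherSubscriberSketch, CollectingSubscriber, publishAndCollect) are hypothetical and only for illustration. Mono and Flux follow the same contract, with a far richer operator set.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo class: a subscriber collects items pushed by a publisher. */
final class PublisherSubscriberSketch {

    static final class CollectingSubscriber implements Flow.Subscriber<String> {
        final List<String> received = new CopyOnWriteArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1);                // signal demand for the first item
        }
        @Override public void onNext(String item) {
            received.add(item);
            subscription.request(1);     // ask for the next item only when ready
        }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete() { done.countDown(); }
    }

    /** Publishes the given items and returns what the subscriber received. */
    static List<String> publishAndCollect(List<String> items) {
        CollectingSubscriber subscriber = new CollectingSubscriber();
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);    // the subscriber subscribes to the publisher
            items.forEach(publisher::submit);   // items are delivered asynchronously
        }                                       // close() signals onComplete
        try {
            if (!subscriber.done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("subscriber did not complete in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return subscriber.received;
    }

    public static void main(String[] args) {
        System.out.println(publishAndCollect(List.of("a", "b", "c")));
    }
}
```

The thread that calls publishAndCollect never processes an item itself: delivery happens on the publisher's executor, and the subscriber controls the pace through request(n).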

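Critical: Each event loop runs on a single thread that serves many connections, so it must never be blocked. One blocking call stalls every request assigned to that loop and can destabilize the whole instance.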
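
The effect of a blocked loop can be reproduced with plain JDK classes. In this sketch a single-thread executor stands in for one event loop thread (an assumption for illustration, not Netty's actual implementation), and BlockedEventLoopSketch is a hypothetical name. A trivial task queued behind a blocking call cannot start until the blocking call returns.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical demo: a single-thread executor stands in for one event loop thread. */
final class BlockedEventLoopSketch {

    /** Returns how many ms a trivial task waited behind a task that blocks for blockMillis. */
    static long queuedTaskLatencyMillis(long blockMillis) {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        try {
            long start = System.nanoTime();
            Runnable blockingCall = () -> sleepQuietly(blockMillis); // e.g. a blocking driver call
            Callable<Long> quickTask = System::nanoTime;             // would normally finish at once
            eventLoop.submit(blockingCall);
            Future<Long> finishedAt = eventLoop.submit(quickTask);   // queued behind the blocking call
            return (finishedAt.get() - start) / 1_000_000;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            eventLoop.shutdownNow();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("quick task latency (ms): " + queuedTaskLatencyMillis(200));
    }
}
```

The quick task's latency is roughly the full duration of the blocking call, which is what every other request on the same loop experiences.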

Observations

Thread Profiling

EpollEventLoop was taking excessive time, likely because there weren’t enough threads for execution. In this setup the event loop group defaulted to 2 threads (Netty sizes it as twice the processor count the JVM detects), which left the instance's cores underused.

APM Tracing

WebClient appeared to take too long to process responses, even though the downstream identity service responded in 10 ms (tp99). That gap suggests the time was spent waiting for a free worker thread rather than on the network. The solution: run blocking IO operations on the boundedElastic scheduler (a separate thread pool) so they no longer tie up worker threads.
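
The offloading idea can be sketched with JDK executors. This is an analogy, not Reactor code: a fixed pool named "blocking-pool" plays the role of boundedElastic, a single thread named "event-loop" plays the worker thread, and OffloadBlockingSketch is a hypothetical name. In Reactor the same hand-off is usually expressed with subscribeOn or publishOn and Schedulers.boundedElastic().

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical demo: run a blocking call on a separate pool, then return to the loop thread. */
final class OffloadBlockingSketch {

    /** Returns {thread that ran the blocking call, thread that handled the result}. */
    static String[] runOffloaded() {
        ExecutorService eventLoop =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "event-loop"));
        ExecutorService blockingPool =
                Executors.newFixedThreadPool(4, r -> new Thread(r, "blocking-pool"));
        try {
            String[] threads = new String[2];
            CompletableFuture
                    .supplyAsync(() -> {
                        threads[0] = Thread.currentThread().getName();
                        sleepQuietly(100);           // simulated blocking IO
                        return "response";
                    }, blockingPool)                 // blocking work stays off the loop thread
                    .thenAcceptAsync(
                            body -> threads[1] = Thread.currentThread().getName(),
                            eventLoop)               // cheap continuation runs on the loop thread
                    .join();
            return threads;
        } finally {
            eventLoop.shutdownNow();
            blockingPool.shutdownNow();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        String[] threads = runOffloaded();
        System.out.println("blocking call ran on: " + threads[0]);
        System.out.println("result handled on:    " + threads[1]);
    }
}
```

While the blocking call sleeps, the loop thread is idle and free to serve other work, which is the point of keeping a separate pool for blocking operations.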

Instance Choking

One of the earliest-started instances consistently choked. The load balancing strategy was changed to LOR (Least Outstanding Requests), which routes each new request to the instance with the fewest pending requests.
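
The selection rule behind LOR is simple enough to sketch. The code below is an illustration of the idea only, not the load balancer's actual implementation, and all names (LeastOutstandingRequestsSketch, Instance, route, complete) are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of Least Outstanding Requests routing. */
final class LeastOutstandingRequestsSketch {

    static final class Instance {
        final String name;
        final AtomicInteger outstanding;   // requests sent but not yet answered

        Instance(String name, int initialOutstanding) {
            this.name = name;
            this.outstanding = new AtomicInteger(initialOutstanding);
        }
    }

    /** Picks the instance with the fewest pending requests and counts the new one against it. */
    static Instance route(List<Instance> instances) {
        Instance target = instances.stream()
                .min(Comparator.comparingInt((Instance i) -> i.outstanding.get()))
                .orElseThrow(() -> new IllegalArgumentException("no instances"));
        target.outstanding.incrementAndGet();
        return target;
    }

    /** Called when a response arrives, freeing capacity on that instance. */
    static void complete(Instance instance) {
        instance.outstanding.decrementAndGet();
    }

    public static void main(String[] args) {
        List<Instance> instances = List.of(
                new Instance("old-and-slow", 5),
                new Instance("fresh", 1),
                new Instance("busy", 3));
        System.out.println("routed to: " + route(instances).name);
    }
}
```

A slow instance accumulates outstanding requests, so it naturally receives less new traffic until it catches up. Round-robin has no such feedback, which is one plausible reason a struggling instance kept choking.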

Memory Leaks

  1. URI builder in WebClient caused memory leaks:

    // Bad - caused a memory leak
    // (.get() is illustrative; use the HTTP method your call needs)
    webClient.get().uri(uriBuilder -> uriBuilder.path("/api").build())

    // Good - prevents the memory leak
    webClient.get().uri("/api")
    
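  2. WebClient error handling - the response body must be consumed or released:

    .onStatus(HttpStatus::isError, clientResponse ->
        // bodyToMono consumes the body, which also releases its buffers.
        // releaseBody() returns a lazy Mono<Void>: calling it without
        // subscribing does nothing, so it is not combined with bodyToMono.
        clientResponse
            .bodyToMono(String.class)
            .map(CustomException::new))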
    
  3. Netty ByteBuf leaks - Upgraded netty version:

    <netty.version>4.1.94.Final</netty.version>
    
  4. ObjectMapper soft references - Disable buffer recycling:

    @Bean(name = "customObjectMapper")
    public ObjectMapper objectMapper() {
        JsonFactory jsonFactory = new JsonFactory()
            .configure(JsonFactory.Feature.USE_THREAD_LOCAL_FOR_BUFFER_RECYCLING, false);
        return new ObjectMapper(jsonFactory);
    }
    

Optimizations Applied

Thread Configuration

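# Increased worker thread count (default here: 4; Reactor Netty uses max(available processors, 4))
-Dreactor.netty.ioWorkerCount=32

# Increased event loop thread count (default here: 2; Netty uses 2 x available processors)
-Dio.netty.eventLoopThreads=4

# Parallel scheduler pool size (default: available processors)
-Dreactor.schedulers.defaultPoolSize=64

# BoundedElastic thread pool: max threads and per-thread task queue size
-Dreactor.schedulers.defaultBoundedElasticSize=100
-Dreactor.schedulers.defaultBoundedElasticQueueSize=50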
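
Since most of these defaults depend on the processor count the JVM detects, it helps to log the effective values at startup. The sketch below reads the same system properties and falls back to the defaults as I understand them from the Netty and Reactor sources. Treat the fallback formulas as assumptions and verify them against the versions in use. ReactorThreadDefaultsSketch is a hypothetical name.

```java
/** Hypothetical helper: resolves the effective thread settings for a given core count. */
final class ReactorThreadDefaultsSketch {

    // Fallbacks below mirror the library defaults as commonly documented (assumption).
    static int ioWorkerCount(int cores) {
        return Integer.getInteger("reactor.netty.ioWorkerCount", Math.max(cores, 4));
    }

    static int eventLoopThreads(int cores) {
        return Math.max(1, Integer.getInteger("io.netty.eventLoopThreads", cores * 2));
    }

    static int parallelPoolSize(int cores) {
        return Integer.getInteger("reactor.schedulers.defaultPoolSize", cores);
    }

    static int boundedElasticSize(int cores) {
        return Integer.getInteger("reactor.schedulers.defaultBoundedElasticSize", 10 * cores);
    }

    static int boundedElasticQueueSize() {
        return Integer.getInteger("reactor.schedulers.defaultBoundedElasticQueueSize", 100_000);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("available processors:       " + cores);
        System.out.println("ioWorkerCount:              " + ioWorkerCount(cores));
        System.out.println("eventLoopThreads:           " + eventLoopThreads(cores));
        System.out.println("parallel pool size:         " + parallelPoolSize(cores));
        System.out.println("boundedElastic size:        " + boundedElasticSize(cores));
        System.out.println("boundedElastic queue size:  " + boundedElasticQueueSize());
    }
}
```

With one detected processor the fallbacks resolve to 4 worker threads and 2 event loop threads, which matches the defaults noted above. A container CPU limit lower than the instance's core count is one possible explanation, though that is a guess rather than something the profiling confirmed.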

Additional Changes

  • Added Redis connection pooling with command timeout
  • Added HTTP connection timeout
  • Load balancing strategy → LOR (not released in initial iteration)

Test Configuration

  • Instance type: m5d.xlarge
  • Auto-scaling enabled: min 2, max 10

Results

Before Optimization

  • Max RPS: 60
  • Frequent spikes
  • Instance health check failures
  • Max instances reached: 10

After Optimization

  • Max RPS: 155 (158% improvement)
  • Very few spikes
  • No unhealthy instances
  • Achieved 155 RPS with only 8 instances
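
The headline percentage and the per-instance picture can be checked with a few lines of arithmetic. The sketch assumes the 60 RPS peak coincided with all 10 instances running, which the numbers above imply but do not state outright. ThroughputGainSketch is a hypothetical name.

```java
/** Hypothetical helper: arithmetic behind the before/after comparison. */
final class ThroughputGainSketch {

    /** Relative improvement in percent, e.g. 60 -> 155 RPS. */
    static double percentImprovement(double before, double after) {
        return (after - before) / before * 100.0;
    }

    /** Average requests per second handled by each instance. */
    static double perInstance(double rps, int instances) {
        return rps / instances;
    }

    public static void main(String[] args) {
        double before = perInstance(60, 10);   // assumes 60 RPS was reached on 10 instances
        double after = perInstance(155, 8);
        System.out.printf("total improvement: %.0f%%%n", percentImprovement(60, 155));
        System.out.printf("per instance: %.1f -> %.1f RPS (%.1fx)%n", before, after, after / before);
    }
}
```

Under that assumption each instance went from about 6 RPS to about 19 RPS, so the per-instance gain is larger than the 158% total suggests.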

Key Takeaways

  1. Don’t block the event loop - Keep event loop processing time minimal
  2. More threads ≠ better performance - WebFlux is designed to use threads in a non-blocking manner
  3. Write non-blocking code - Use BlockHound to detect blocking calls from non-blocking threads
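  4. Async and reactive are different - Async only means the caller does not wait for the result; reactive adds composable streams with backpressure on top of that
  5. Task separation - CPU-intensive tasks should use worker threads; blocking IO tasks should use boundedElastic threads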

Acknowledgements
  • Prem — Detailed profiling and analysis