RabbitMQ Cluster Migration Guide

TL;DR: Move NT services from a single-node RabbitMQ broker to a multi-node cluster with mirrored queues, then point every client at the cluster addresses.

Overview

Migrating RabbitMQ from single-node to clustered mode improves availability and throughput. This guide covers the migration process for NT services.

Pre-Migration Architecture

Single Node Configuration:

  • Single RabbitMQ instance
  • Limited memory capacity
  • Single point of failure

Target Architecture

Clustered Configuration:

  • Multiple RabbitMQ nodes
  • Higher RAM capacity on primary node
  • Automatic failover capability

Queue Migrations

Queues to Migrate

Queue Name         Purpose                       Message Volume
nt_delay           Delayed message processing    High
event_logs_queue   Event logging and analytics   Medium

Migration Process

Phase 1: Cluster Setup

  1. Add New Node

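    # On the new node: stop the RabbitMQ application (the Erlang VM keeps running)
    rabbitmqctl stop_app
    # Wipe local state; this deletes any data already on the new node
    rabbitmqctl reset
    # Join the existing cluster (both nodes must share the same Erlang cookie)
    rabbitmqctl join_cluster rabbit@primary-node
    rabbitmqctl start_app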
    
  2. Verify Cluster Status

    rabbitmqctl cluster_status
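The same check can be scripted against the management HTTP API. The sketch below assumes the management plugin is enabled and that GET /api/nodes returns one object per node with name and running fields. The base URL and credentials passed to fetch_nodes are placeholders, and the sample data is illustrative only.

```python
import base64
import json
import urllib.request


def fetch_nodes(base_url, user, password):
    """Return the parsed JSON list from GET /api/nodes (management plugin)."""
    request = urllib.request.Request(f"{base_url}/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def stopped_nodes(nodes):
    """Names of nodes the API reports as not running."""
    return [node["name"] for node in nodes if not node.get("running", False)]


# Illustrative response shape; real responses carry many more fields.
sample = [
    {"name": "rabbit@primary-node", "running": True},
    {"name": "rabbit@node2", "running": False},
]
print(stopped_nodes(sample))
```

Against a live cluster, stopped_nodes(fetch_nodes("http://primary-node:15672", user, password)) should return an empty list once every node has joined and started.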
    

Phase 2: Queue Replication

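Configure queue mirroring for high availability. The ha-mode policies below apply to classic mirrored queues in RabbitMQ 3.x; RabbitMQ 4.0 removed classic mirroring in favor of quorum queues.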

# Set policy for all queues
rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all","ha-sync-mode":"automatic"}'

Or for specific queues:

# Mirror only queues whose names start with "nt" (covers nt_delay, not event_logs_queue)
rabbitmqctl set_policy ha-nt "^nt" \
  '{"ha-mode":"all","ha-sync-mode":"automatic"}'
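Policy patterns are regular expressions matched against queue names, so it is worth checking which of the queues being migrated each pattern actually covers. The sketch below uses Python's re module as a stand-in for the broker's own regex engine, which behaves the same way for simple anchored patterns like these.

```python
import re

QUEUES = ["nt_delay", "event_logs_queue"]


def matched_queues(pattern, names):
    """Queue names that a policy pattern (a regular expression) applies to."""
    return [name for name in names if re.search(pattern, name)]


print(matched_queues("^", QUEUES))    # ha-all pattern: matches every queue
print(matched_queues("^nt", QUEUES))  # ha-nt pattern: matches nt_delay only
```

With the ha-nt policy alone, event_logs_queue would stay unmirrored, so it needs either the ha-all policy or a pattern of its own.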

Phase 3: Service Updates

Update connection strings in consuming services:

Application Configuration

# Before
spring:
  rabbitmq:
    host: rabbitmq-single.internal
    port: 5672

# After - Clustered
spring:
  rabbitmq:
    addresses: rabbitmq-node1:5672,rabbitmq-node2:5672,rabbitmq-node3:5672
    connection-timeout: 5000

Services to Update

  1. NT Center

    • Update connection string
    • Verify queue bindings
    • Test failover behavior
  2. Event Processor

    • Update connection string
    • Validate message ordering (if required)

Connection String Format

Java/Spring Applications

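import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RabbitConfig {

    // Resolved by Spring from the rabbitmq.password property;
    // a "${...}" string literal passed to setPassword would not be resolved
    @Value("${rabbitmq.password}")
    private String password;

    @Bean
    public ConnectionFactory connectionFactory() {
        CachingConnectionFactory factory = new CachingConnectionFactory();
        // Listed nodes are tried until one accepts the connection
        factory.setAddresses("node1:5672,node2:5672,node3:5672");
        factory.setUsername("app_user");
        factory.setPassword(password);
        factory.setVirtualHost("/");

        // Connection timeout (milliseconds) and heartbeat interval (seconds)
        factory.setConnectionTimeout(5000);
        factory.setRequestedHeartBeat(30);

        return factory;
    }
}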

Python Applications

import pika

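# One ConnectionParameters entry per cluster node
parameters = [
    pika.ConnectionParameters(host='node1', port=5672),
    pika.ConnectionParameters(host='node2', port=5672),
    pika.ConnectionParameters(host='node3', port=5672),
]

# Given a sequence, pika tries each entry in turn until one connects
connection = pika.BlockingConnection(parameters)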
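Before rolling out the new connection strings, it can help to confirm that each node's AMQP port is reachable from the application host. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and it performs a plain TCP check rather than an AMQP handshake.

```python
import socket


def parse_addresses(addresses, default_port=5672):
    """Split 'host:port,host:port' into a list of (host, port) tuples."""
    result = []
    for entry in addresses.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        result.append((host, int(port) if port else default_port))
    return result


def first_reachable(endpoints, timeout=2.0):
    """Return the first (host, port) that accepts a TCP connection, else None."""
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue
    return None


# Entries without an explicit port fall back to the default AMQP port.
endpoints = parse_addresses("node1:5672,node2:5672,node3")
print(endpoints)
```

Calling first_reachable(endpoints) from an application host returns None when no node can be reached, which usually points at DNS or firewall problems rather than at RabbitMQ itself.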

Verification Steps

1. Cluster Health

# Check all nodes are running
rabbitmqctl cluster_status

# Verify queue mirroring
rabbitmqctl list_queues name policy slave_pids
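The mirroring check can also be run over the JSON returned by the management API. This sketch assumes GET /api/queues on a 3.x cluster reports a policy field and a slave_nodes list for each queue. The sample data is illustrative, not captured output.

```python
def queues_without_mirrors(queues):
    """Names of queues with no policy applied or no mirror nodes listed."""
    missing = []
    for queue in queues:
        has_policy = bool(queue.get("policy"))
        has_mirrors = bool(queue.get("slave_nodes"))
        if not (has_policy and has_mirrors):
            missing.append(queue["name"])
    return missing


# Illustrative data shaped like GET /api/queues output on a mirrored 3.x setup.
sample = [
    {"name": "nt_delay", "policy": "ha-nt", "slave_nodes": ["rabbit@node2"]},
    {"name": "event_logs_queue", "policy": None, "slave_nodes": []},
]
print(queues_without_mirrors(sample))
```

An empty result means every queue has a policy and at least one mirror.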

2. Message Flow Test

# Publish test message
rabbitmqadmin publish exchange=amq.default \
  routing_key=nt_delay \
  payload="test message"

# Verify receipt
rabbitmqadmin get queue=nt_delay

3. Failover Test

  1. Stop one node: rabbitmqctl stop_app
  2. Verify messages continue processing
  3. Restart node: rabbitmqctl start_app
  4. Verify synchronization
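For step 4, one way to confirm synchronization is to compare each queue's mirror list with its synchronised mirror list. The sketch assumes the 3.x management API fields slave_nodes and synchronised_slave_nodes. The sample data is illustrative.

```python
def unsynchronised_queues(queues):
    """Map queue name -> mirror nodes that have not finished synchronising."""
    lagging = {}
    for queue in queues:
        mirrors = set(queue.get("slave_nodes", []))
        synced = set(queue.get("synchronised_slave_nodes", []))
        behind = sorted(mirrors - synced)
        if behind:
            lagging[queue["name"]] = behind
    return lagging


# Illustrative state shortly after a node restart.
sample = [
    {
        "name": "nt_delay",
        "slave_nodes": ["rabbit@node2", "rabbit@node3"],
        "synchronised_slave_nodes": ["rabbit@node3"],
    },
    {
        "name": "event_logs_queue",
        "slave_nodes": ["rabbit@node2"],
        "synchronised_slave_nodes": ["rabbit@node2"],
    },
]
print(unsynchronised_queues(sample))
```

The failover test can be treated as complete once this returns an empty dict for all queues.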

Monitoring

Key Metrics

Metric                    Threshold   Alert
Queue depth               > 10,000    Warning
Message rate              < 100/s     Warning
Node memory               > 80%       Critical
Unacknowledged messages   > 5,000     Warning
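The thresholds above can be expressed as a small rule table. This is a sketch only: the metric key names are hypothetical, and in practice the rules would more likely live in the alerting system than in application code.

```python
import operator

# (metric key, comparison, threshold, severity), taken from the table above
RULES = [
    ("queue_depth", operator.gt, 10_000, "warning"),
    ("message_rate_per_s", operator.lt, 100, "warning"),
    ("node_memory_pct", operator.gt, 80, "critical"),
    ("unacked_messages", operator.gt, 5_000, "warning"),
]


def evaluate(metrics):
    """Return (metric, severity) pairs for every threshold that is breached."""
    return [
        (key, severity)
        for key, compare, limit, severity in RULES
        if key in metrics and compare(metrics[key], limit)
    ]


sample = {
    "queue_depth": 12_500,
    "message_rate_per_s": 340,
    "node_memory_pct": 85,
    "unacked_messages": 120,
}
print(evaluate(sample))
```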

Prometheus Metrics

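# rabbitmq.conf
# Requires the rabbitmq_prometheus plugin: rabbitmq-plugins enable rabbitmq_prometheus
# Expose per-queue (per-object) metrics instead of aggregated ones
prometheus.return_per_object_metrics = true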
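With per-object metrics enabled, each queue shows up as its own labelled sample in the plugin's text output. The sketch below parses that text format with a simplified regular expression that does not handle timestamps or escaped quotes. The metric name and labels in the sample are illustrative assumptions, not captured output.

```python
import re

# name{labels} value  (labels optional); comment lines start with '#'
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL.findall(labels or "")), float(value)))
    return samples


# Illustrative scrape excerpt.
sample = '''
# TYPE rabbitmq_queue_messages_ready gauge
rabbitmq_queue_messages_ready{vhost="/",queue="nt_delay"} 42
rabbitmq_queue_messages_ready{vhost="/",queue="event_logs_queue"} 7
'''

ready_by_queue = {
    labels["queue"]: value
    for name, labels, value in parse_metrics(sample)
    if name == "rabbitmq_queue_messages_ready"
}
print(ready_by_queue)
```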

Rollback Plan

If issues arise:

  1. Revert Connection Strings

    • Point services back to single node
    • Deploy configuration changes
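  2. Drain Cluster Queues

    • Let consumers drain the cluster queues first; the export below carries definitions only, not messages

    # Export definitions (queues, exchanges, bindings, policies) from the cluster
    rabbitmqadmin export cluster-backup.json
    
    # Import on the single node (run there, or target it with --host)
    rabbitmqadmin import cluster-backup.json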
    
  3. Remove Cluster Nodes

    # Stop node2 first, then run this on a node that remains in the cluster
    rabbitmqctl forget_cluster_node rabbit@node2
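The file written by rabbitmqadmin export in step 2 holds broker definitions as JSON rather than queued messages, so it is worth inspecting before importing it. The sketch below assumes the usual top-level keys (queues, exchanges, bindings, policies). The embedded sample is an illustrative excerpt, not a real export.

```python
import json


def summarize_definitions(definitions):
    """Count the object types found in a definitions export."""
    keys = ("queues", "exchanges", "bindings", "policies")
    return {key: len(definitions.get(key, [])) for key in keys}


def queue_names(definitions):
    """Sorted names of the queues declared in the export."""
    return sorted(queue["name"] for queue in definitions.get("queues", []))


# Illustrative excerpt; a real file would be read with json.load().
sample = json.loads('''
{
  "queues": [
    {"name": "nt_delay", "vhost": "/", "durable": true},
    {"name": "event_logs_queue", "vhost": "/", "durable": true}
  ],
  "policies": [
    {"name": "ha-all", "vhost": "/", "pattern": "^", "definition": {"ha-mode": "all"}}
  ]
}
''')
print(summarize_definitions(sample))
print(queue_names(sample))
```

If the summary lists the expected queues and policies, the import should recreate the topology on the single node. Any messages still sitting in cluster queues have to be consumed or republished separately.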
    

Post-Migration Checklist

  • All services updated with cluster addresses
  • Queue mirroring policies verified
  • Monitoring dashboards updated
  • Alerting thresholds adjusted
  • Documentation updated
  • Runbook for cluster operations created