Creating a Debug Documentation Template for Microservices

TL;DR: A standardized template for documenting service debugging procedures, including health checks, component tolerance thresholds, and troubleshooting runbooks

When an alert fires at 3 AM, you need a clear debugging guide—not scattered Slack threads. Here’s a template for creating standardized debug documentation for each microservice.

Template Structure

┌────────────────────────────────────────────────────────────────────┐
│                    DEBUG DOCUMENTATION TEMPLATE                    │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │  1. SERVICE DETAILS                                          │ │
│  │     Cluster, metrics URLs, latency dashboards                │ │
│  └──────────────────────────────────────────────────────────────┘ │
│                          │                                         │
│                          ▼                                         │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │  2. COMPONENT TOLERANCE THRESHOLDS                           │ │
│  │     What's "normal" vs "investigate"                         │ │
│  └──────────────────────────────────────────────────────────────┘ │
│                          │                                         │
│                          ▼                                         │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │  3. HYGIENE CHECKS                                           │ │
│  │     Component → Observation → Resolution Steps               │ │
│  └──────────────────────────────────────────────────────────────┘ │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

Section 1: Service Details

Provide quick links to everything needed during an incident:

Property          | Value
------------------|----------------------------------
Cluster           | production-east-1
ECS Service       | ct-ranker-service
Metrics Dashboard | [CloudWatch Dashboard URL]
Latency Dashboard | [New Relic APM URL]
Logs              | [Graylog/CloudWatch Logs URL]
Config Server     | [Spring Cloud Config URL]
Runbook           | [Confluence/Wiki Link]

Sample Validation Endpoint

curl -X GET https://api.example.com/health \
  -H "Authorization: Bearer $TOKEN"

# Expected response:
{
  "status": "UP",
  "components": {
    "db": {"status": "UP"},
    "redis": {"status": "UP"},
    "rabbit": {"status": "UP"}
  }
}
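
A scripted version of the same check can save a minute when several services are paging at once. A minimal shell sketch, assuming the jq CLI is available on the host you debug from:

# Exit non-zero and warn if the overall status is not UP
curl -s -H "Authorization: Bearer $TOKEN" https://api.example.com/health \
  | jq -e '.status == "UP"' > /dev/null \
  || echo "WARNING: service reports a non-UP status"

# List any component that is not UP (empty output means all components are healthy)
curl -s -H "Authorization: Bearer $TOKEN" https://api.example.com/health \
  | jq -r '.components | to_entries[] | select(.value.status != "UP") | .key'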

Section 2: Component Tolerance Thresholds

Define what’s normal for each component:

Component      | Dimension     | Threshold      | Alert Level
---------------|---------------|----------------|------------
RabbitMQ Queue | Queue Size    | ≤300 in 5 min  | Normal
RabbitMQ Queue | Queue Size    | >300 in 5 min  | Warning
RabbitMQ Queue | Queue Size    | >1000 in 5 min | Critical
Redis          | Response Time | <10ms p99      | Normal
Redis          | Response Time | 10-50ms p99    | Warning
Redis          | Response Time | >50ms p99      | Critical
MongoDB        | Query Time    | <100ms p95     | Normal
MongoDB        | Query Time    | >500ms p95     | Critical
Service        | CPU           | <70%           | Normal
Service        | Memory        | <80%           | Normal
Service        | Error Rate    | <1%            | Normal
Service        | Error Rate    | >5%            | Critical
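
The table is easier to act on when a quick check tells you which band you are in. A shell sketch for the queue-size row, reusing the management API call and queue name from Section 3 and assuming jq is installed (adjust vhost and queue name for your service):

# Compare the current queue depth against the thresholds above
DEPTH=$(curl -s -u admin:$RABBIT_PASS \
  "http://rabbitmq.example.com:15672/api/queues/%2f/content-queue" | jq '.messages')

if   [ "$DEPTH" -gt 1000 ]; then echo "CRITICAL: queue depth $DEPTH"
elif [ "$DEPTH" -gt 300  ]; then echo "WARNING: queue depth $DEPTH"
else echo "OK: queue depth $DEPTH"
fi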

Section 3: Hygiene Check Runbook

For each common issue, provide step-by-step debugging:

RabbitMQ Queue Backup

Observation: Queue size has grown beyond the acceptable threshold

Resolution Steps:

  1. Check incoming load on the service

    # View request rate in last 15 minutes
    aws cloudwatch get-metric-statistics \
      --namespace Custom/API \
      --metric-name RequestCount \
      --start-time $(date -d '15 minutes ago' -u +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Sum
    
  2. Check queue consumer list

    # RabbitMQ Management API
    curl -u admin:$RABBIT_PASS \
      "http://rabbitmq.example.com:15672/api/queues/%2f/content-queue"
    
  3. Check service status

    curl -X GET "https://api.example.com/actuator/health" \
      -H "Authorization: Bearer $TOKEN"
    
  4. Check for ACK timeouts

    # Search logs for unacknowledged messages
    grep "ack timeout" /var/log/service/*.log | tail -20
    
  5. Potential fixes:

    • Restart consumers if stuck
    • Scale up consumer instances
    • Check for blocking downstream calls
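
If the consumers are healthy but simply cannot keep up, scaling out the ECS service is usually the fastest mitigation. A sketch using the cluster and service names from Section 1; the desired count of 6 is only an example:

# Scale the consumer service out, then scale back once the queue has drained
aws ecs update-service \
  --cluster production-east-1 \
  --service ct-ranker-service \
  --desired-count 6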

Redis Latency Issues

Observation: Redis response times have risen above acceptable thresholds

Resolution Steps:

  1. Check Redis cluster load

    # Ops throughput and memory usage (used_memory lives in the memory section, so query full INFO)
    redis-cli -h redis.example.com INFO | grep -E "instantaneous_ops|used_memory"
    
  2. Check for expensive operations

    # Look for KEYS, SCAN with large counts
    redis-cli -h redis.example.com SLOWLOG GET 10
    
  3. Check application load

    # Connection pool status
    curl "https://api.example.com/actuator/metrics/redis.pool.active"
    
  4. Check freeable memory

    aws cloudwatch get-metric-statistics \
      --namespace AWS/ElastiCache \
      --metric-name FreeableMemory \
      --dimensions Name=CacheClusterId,Value=prod-redis \
      --period 300 --statistics Average
    
  5. Verify cache hit ratio

    redis-cli -h redis.example.com INFO stats | grep -E "keyspace_hits|keyspace_misses"
    # Hit ratio should be >90%
    
  6. Potential fixes:

    • Identify and optimize slow queries
    • Increase cache TTL for frequently accessed data
    • Scale Redis cluster
    • Add memory to existing nodes
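
The hit ratio in step 5 can be computed directly from the INFO counters. A small shell sketch, assuming awk is available:

# hit ratio = hits / (hits + misses); much below ~90% means the cache is doing little work
redis-cli -h redis.example.com INFO stats | tr -d '\r' | \
  awk -F: '/^keyspace_hits/   {h=$2}
           /^keyspace_misses/ {m=$2}
           END { if (h+m > 0) printf "cache hit ratio: %.2f%%\n", 100*h/(h+m);
                 else print "no keyspace traffic recorded yet" }'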

Database Slow Queries

Observation: MongoDB query times are exceeding acceptable thresholds

Resolution Steps:

  1. Check current operations

    db.currentOp({"active": true, "secs_running": {"$gt": 5}})
    
  2. Review slow query log

    grep "COMMAND" /var/log/mongodb/mongod.log | \
      jq 'select(.attr.durationMillis > 1000)'
    
  3. Check index usage

    db.collection.aggregate([
      { $indexStats: {} }
    ])
    
  4. Check connection pool

    db.serverStatus().connections
    // current should be << available
    
  5. Potential fixes:

    • Add missing indexes
    • Optimize query patterns
    • Increase connection pool size
    • Scale MongoDB resources
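
For the "add missing indexes" fix, confirm the query plan before creating anything. A sketch run through mongosh; the connection string, the content collection, and the rankScore/updatedAt fields are placeholders, not the service's real schema:

# 1. Check whether the suspect query falls back to a collection scan (look for COLLSCAN)
mongosh "mongodb://mongo.example.com/appdb" --eval '
  db.content.find({ rankScore: { $gt: 0.8 } })
            .sort({ updatedAt: -1 })
            .explain("executionStats")
'

# 2. If it does, add a matching compound index and re-run the explain
mongosh "mongodb://mongo.example.com/appdb" --eval '
  db.content.createIndex({ rankScore: 1, updatedAt: -1 })
'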

Template Usage Tips

  1. Keep it updated - Review after every incident
  2. Add examples - Include actual commands, not placeholders
  3. Link to dashboards - One click should get you there
  4. Version control - Track changes in git
  5. On-call training - Walk through docs during onboarding

Automation Opportunity

Convert threshold checks into monitoring alerts:

# Example CloudWatch Alarm
AlarmName: ContentRanker-QueueBackup
MetricName: ApproximateNumberOfMessagesVisible
Namespace: AWS/SQS
Threshold: 300
ComparisonOperator: GreaterThanThreshold
EvaluationPeriods: 1
Period: 300
AlarmActions:
  - arn:aws:sns:us-east-1:123456789:oncall-alerts
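
The same alarm can be created from the CLI, which keeps it in version control next to this doc. A sketch of the equivalent call; it assumes the queue depth is reported by the SQS ApproximateNumberOfMessagesVisible metric and reuses the content-queue name as the QueueName dimension, so if your queue lives in RabbitMQ point the alarm at the corresponding broker metric instead:

aws cloudwatch put-metric-alarm \
  --alarm-name ContentRanker-QueueBackup \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=content-queue \
  --statistic Average \
  --threshold 300 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --period 300 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:oncall-alerts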

A well-maintained debug doc turns a stressful 3 AM incident into a methodical troubleshooting process.