Creating a Debug Documentation Template for Microservices

TL;DR: A standardized template for documenting service debugging procedures, including health checks, component tolerance thresholds, and troubleshooting runbooks

When an alert fires at 3 AM, you need a clear debugging guide—not scattered Slack threads. Here’s a template for creating standardized debug documentation for each microservice.

Template Structure

┌────────────────────────────────────────────────────────────────────┐
│                    DEBUG DOCUMENTATION TEMPLATE                    │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │  1. SERVICE DETAILS                                          │ │
│  │     Cluster, metrics URLs, latency dashboards                │ │
│  └──────────────────────────────────────────────────────────────┘ │
│                          │                                         │
│                          ▼                                         │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │  2. COMPONENT TOLERANCE THRESHOLDS                           │ │
│  │     What's "normal" vs "investigate"                         │ │
│  └──────────────────────────────────────────────────────────────┘ │
│                          │                                         │
│                          ▼                                         │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │  3. HYGIENE CHECKS                                           │ │
│  │     Component → Observation → Resolution Steps               │ │
│  └──────────────────────────────────────────────────────────────┘ │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

Section 1: Service Details

Provide quick links to everything needed during an incident:

Property          | Value
------------------|----------------------------------
Cluster           | production-east-1
ECS Service       | ct-ranker-service
Metrics Dashboard | [CloudWatch Dashboard URL]
Latency Dashboard | [New Relic APM URL]
Logs              | [Graylog/CloudWatch Logs URL]
Config Server     | [Spring Cloud Config URL]
Runbook           | [Confluence/Wiki Link]

Sample Validation Endpoint

curl -X GET https://api.example.com/health \
  -H "Authorization: Bearer $TOKEN"

# Expected response:
{
  "status": "UP",
  "components": {
    "db": {"status": "UP"},
    "redis": {"status": "UP"},
    "rabbit": {"status": "UP"}
  }
}
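
A scripted version of the same check can save a minute when several services are paging at once. A minimal shell sketch, assuming the jq CLI is available on the host you debug from:

# Exit non-zero and warn if the overall status is not UP
curl -s -H "Authorization: Bearer $TOKEN" https://api.example.com/health \
  | jq -e '.status == "UP"' > /dev/null \
  || echo "WARNING: service reports a non-UP status"

# List any component that is not UP (empty output means all components are healthy)
curl -s -H "Authorization: Bearer $TOKEN" https://api.example.com/health \
  | jq -r '.components | to_entries[] | select(.value.status != "UP") | .key'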

Section 2: Component Tolerance Thresholds

Define what’s normal for each component:

Component      | Dimension     | Threshold      | Alert Level
---------------|---------------|----------------|------------
RabbitMQ Queue | Queue Size    | ≤300 in 5 min  | Normal
RabbitMQ Queue | Queue Size    | >300 in 5 min  | Warning
RabbitMQ Queue | Queue Size    | >1000 in 5 min | Critical
Redis          | Response Time | <10ms p99      | Normal
Redis          | Response Time | 10-50ms p99    | Warning
Redis          | Response Time | >50ms p99      | Critical
MongoDB        | Query Time    | <100ms p95     | Normal
MongoDB        | Query Time    | >500ms p95     | Critical
Service        | CPU           | <70%           | Normal
Service        | Memory        | <80%           | Normal
Service        | Error Rate    | <1%            | Normal
Service        | Error Rate    | >5%            | Critical
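
The table is easier to act on when a quick check tells you which band you are in. A shell sketch for the queue-size row, reusing the management API call and queue name from Section 3 and assuming jq is installed (adjust vhost and queue name for your service):

# Compare the current queue depth against the thresholds above
DEPTH=$(curl -s -u admin:$RABBIT_PASS \
  "http://rabbitmq.example.com:15672/api/queues/%2f/content-queue" | jq '.messages')

if   [ "$DEPTH" -gt 1000 ]; then echo "CRITICAL: queue depth $DEPTH"
elif [ "$DEPTH" -gt 300  ]; then echo "WARNING: queue depth $DEPTH"
else echo "OK: queue depth $DEPTH"
fi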

Section 3: Hygiene Check Runbook

For each common issue, provide step-by-step debugging:

RabbitMQ Queue Backup

Observation: Queue size has grown beyond the acceptable threshold

Resolution Steps:

  1. Check incoming load on the service

    # View request rate in last 15 minutes
    aws cloudwatch get-metric-statistics \
      --namespace Custom/API \
      --metric-name RequestCount \
      --start-time $(date -d '15 minutes ago' -u +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Sum
    
  2. Check queue consumer list

    # RabbitMQ Management API
    curl -u admin:$RABBIT_PASS \
      "http://rabbitmq.example.com:15672/api/queues/%2f/content-queue"
    
  3. Check service status

    curl -X GET "https://api.example.com/actuator/health" \
      -H "Authorization: Bearer $TOKEN"
    
  4. Check for ACK timeouts

    # Search logs for unacknowledged messages
    grep "ack timeout" /var/log/service/*.log | tail -20
    
  5. Potential fixes:

    • Restart consumers if stuck
    • Scale up consumer instances
    • Check for blocking downstream calls
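
If the consumers are healthy but simply cannot keep up, scaling out the ECS service is usually the fastest mitigation. A sketch using the cluster and service names from Section 1; the desired count of 6 is only an example:

# Scale the consumer service out, then scale back once the queue has drained
aws ecs update-service \
  --cluster production-east-1 \
  --service ct-ranker-service \
  --desired-count 6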

Redis Latency Issues

Observation: Redis response times have risen above acceptable thresholds

Resolution Steps:

  1. Check Redis cluster load

    # Ops throughput and memory usage (used_memory lives in the memory section, so query full INFO)
    redis-cli -h redis.example.com INFO | grep -E "instantaneous_ops|used_memory"
    
  2. Check for expensive operations

    # Look for KEYS, SCAN with large counts
    redis-cli -h redis.example.com SLOWLOG GET 10
    
  3. Check application load

    # Connection pool status
    curl "https://api.example.com/actuator/metrics/redis.pool.active"
    
  4. Check freeable memory

    aws cloudwatch get-metric-statistics \
      --namespace AWS/ElastiCache \
      --metric-name FreeableMemory \
      --dimensions Name=CacheClusterId,Value=prod-redis \
      --period 300 --statistics Average
    
  5. Verify cache hit ratio

    redis-cli -h redis.example.com INFO stats | grep -E "keyspace_hits|keyspace_misses"
    # Hit ratio should be >90%
    
  6. Potential fixes:

    • Identify and optimize slow queries
    • Increase cache TTL for frequently accessed data
    • Scale Redis cluster
    • Add memory to existing nodes
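
The hit ratio in step 5 can be computed directly from the INFO counters. A small shell sketch, assuming awk is available:

# hit ratio = hits / (hits + misses); much below ~90% means the cache is doing little work
redis-cli -h redis.example.com INFO stats | tr -d '\r' | \
  awk -F: '/^keyspace_hits/   {h=$2}
           /^keyspace_misses/ {m=$2}
           END { if (h+m > 0) printf "cache hit ratio: %.2f%%\n", 100*h/(h+m);
                 else print "no keyspace traffic recorded yet" }'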

Database Slow Queries

Observation: MongoDB query times are exceeding acceptable thresholds

Resolution Steps:

  1. Check current operations

    db.currentOp({"active": true, "secs_running": {"$gt": 5}})
    
  2. Review slow query log

    grep "COMMAND" /var/log/mongodb/mongod.log | \
      jq 'select(.attr.durationMillis > 1000)'
    
  3. Check index usage

    db.collection.aggregate([
      { $indexStats: {} }
    ])
    
  4. Check connection pool

    db.serverStatus().connections
    // current should be << available
    
  5. Potential fixes:

    • Add missing indexes
    • Optimize query patterns
    • Increase connection pool size
    • Scale MongoDB resources
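
For the "add missing indexes" fix, confirm the query plan before creating anything. A sketch run through mongosh; the connection string, the content collection, and the rankScore/updatedAt fields are placeholders, not the service's real schema:

# 1. Check whether the suspect query falls back to a collection scan (look for COLLSCAN)
mongosh "mongodb://mongo.example.com/appdb" --eval '
  db.content.find({ rankScore: { $gt: 0.8 } })
            .sort({ updatedAt: -1 })
            .explain("executionStats")
'

# 2. If it does, add a matching compound index and re-run the explain
mongosh "mongodb://mongo.example.com/appdb" --eval '
  db.content.createIndex({ rankScore: 1, updatedAt: -1 })
'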

Template Usage Tips

  1. Keep it updated - Review after every incident
  2. Add examples - Include actual commands, not placeholders
  3. Link to dashboards - One click should get you there
  4. Version control - Track changes in git
  5. On-call training - Walk through docs during onboarding

Automation Opportunity

Convert threshold checks into monitoring alerts:

# Example CloudWatch Alarm
AlarmName: ContentRanker-QueueBackup
MetricName: ApproximateNumberOfMessagesVisible
Namespace: AWS/SQS
Threshold: 300
ComparisonOperator: GreaterThanThreshold
EvaluationPeriods: 1
Period: 300
AlarmActions:
  - arn:aws:sns:us-east-1:123456789:oncall-alerts
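
The same alarm can be created from the CLI, which keeps it in version control next to this doc. A sketch of the equivalent call; it assumes the queue depth is reported by the SQS ApproximateNumberOfMessagesVisible metric and reuses the content-queue name as the QueueName dimension, so if your queue lives in RabbitMQ point the alarm at the corresponding broker metric instead:

aws cloudwatch put-metric-alarm \
  --alarm-name ContentRanker-QueueBackup \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=content-queue \
  --statistic Average \
  --threshold 300 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --period 300 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:oncall-alerts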

A well-maintained debug doc turns a stressful 3 AM incident into a methodical troubleshooting process.