Service Debug Documentation Template

Overview

This template provides a standardized structure for documenting debugging procedures for microservices. Having consistent debug documentation accelerates incident response and onboarding.

Service Details

Property	Value
Cluster	`<cluster-name>`
Metrics Dashboard	`<cloudwatch-url>`
APM/Latency	`<newrelic/datadog-url>`
Log Aggregation	`<graylog/elk-url>`
Health Endpoint	`/actuator/health`

Component Tolerance Thresholds

Define acceptable operating ranges for each dependency:

Component	Dimension	Threshold
RabbitMQ Queue	Queue Size	≤300 messages over the last 5 minutes
Redis	Response Time	<10ms p99
Database	Connection Pool	<80% utilized
API Gateway	Error Rate	<1%
ECS	CPU Utilization	<70% sustained
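
Once a current value has been read from the dashboard or a CLI, the comparison against the table can be scripted. The following is a minimal sketch, not part of any tool listed here: `within_threshold` and `report_threshold` are hypothetical helper names, and the comparison is strict (`<`), so the inclusive RabbitMQ limit (≤300) would need a limit of 301.

```shell
#!/bin/sh
# Hypothetical helper: succeed when VALUE ($1) is strictly below LIMIT ($2).
# awk is used because plain shell arithmetic cannot compare decimals.
within_threshold() {
    awk -v v="$1" -v l="$2" 'BEGIN { exit !(v + 0 < l + 0) }'
}

# Hypothetical helper: print a one-line verdict for a component.
# $1 = label, $2 = current value, $3 = limit from the table above
report_threshold() {
    if within_threshold "$2" "$3"; then
        echo "OK $1 $2 < $3"
    else
        echo "BREACH $1 $2 >= $3"
    fi
}
```

For example, `report_threshold redis_p99_ms 7.5 10` reports the Redis row as within tolerance; the label is free text chosen by the caller.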

Hygiene Checks

RabbitMQ Queue Issues

Observation: Queue size exceeded acceptable threshold

Resolution Steps:

Check incoming load on the service
Verify queue consumers are connected:
```
rabbitmqctl list_consumers -p /
```

Validate service health:
```
curl http://172.xx.xx.xx:8080/actuator/health
```

Check for consumer ack timeouts in logs
Review consumer prefetch settings
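
The queue-depth and consumer checks above can be combined into one pass over `rabbitmqctl list_queues` output. This is a sketch under assumptions: `flag_queues` is a hypothetical helper, the default limit of 300 comes from the threshold table, and the input is expected to be tab-separated `name`, `messages`, `consumers` columns.

```shell
#!/bin/sh
# Reads "name<TAB>messages<TAB>consumers" rows on stdin and prints queues
# that exceed the depth limit or have no consumers attached.
# Typical input (assumed invocation):
#   rabbitmqctl -q list_queues -p / name messages consumers | flag_queues 300
flag_queues() {
    limit="${1:-300}"
    awk -F '\t' -v limit="$limit" '
        $2 !~ /^[0-9]+$/      { next }   # skip header or banner lines
        $2 + 0 > limit + 0    { print "DEPTH " $1 " messages=" $2 }
        $3 ~ /^0$/            { print "NO_CONSUMERS " $1 }
    '
}
```

A queue that is both deep and unconsumed produces two lines, which usually points at a disconnected consumer rather than a load spike.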

Redis Performance Issues

Observation: Redis response times exceed thresholds

Resolution Steps:

Check Redis cluster load:
```
redis-cli INFO stats
```
Identify slow commands, such as `KEYS` with wildcard patterns or large scans:
```
redis-cli SLOWLOG GET 10
```
Verify application connection pool status
Check memory headroom on Redis (compare `used_memory` with `maxmemory`):
```
redis-cli INFO memory
```
Review cache hit/miss ratio
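
The hit/miss ratio is not printed directly, but it can be derived from the `keyspace_hits` and `keyspace_misses` counters in `INFO stats`. A minimal sketch, where `redis_hit_ratio` is a hypothetical helper name:

```shell
#!/bin/sh
# Reads `redis-cli INFO stats` output on stdin and prints the hit ratio
# as a percentage. INFO lines end in CRLF, so carriage returns are stripped.
# Typical input: redis-cli INFO stats | redis_hit_ratio
redis_hit_ratio() {
    tr -d '\r' | awk -F ':' '
        $1 == "keyspace_hits"   { hits = $2 }
        $1 == "keyspace_misses" { misses = $2 }
        END {
            total = hits + misses
            if (total == 0) { print "n/a"; exit }
            printf "%.2f\n", 100 * hits / total
        }
    '
}
```

Both counters are cumulative since the last restart or `CONFIG RESETSTAT`, so the result is a long-run average rather than a reading for the incident window.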

Database Connection Issues

Observation: Connection pool exhaustion

Resolution Steps:

Check active connections:
```
SELECT count(*) FROM pg_stat_activity;
```

Identify long-running queries:
```
SELECT pid, query, state, query_start
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
```

Review connection pool configuration
Check for connection leaks in application logs
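
A server-side approximation of pool pressure can be computed from per-state counts in `pg_stat_activity`. This sketch rests on several assumptions: `pool_utilization` is a hypothetical helper, the pool maximum is passed in by hand because it lives in the application's configuration, and every non-idle backend is treated as a borrowed connection. Sessions in `idle in transaction` are reported separately, since they are a common sign of a leaked connection.

```shell
#!/bin/sh
# Reads "state|count" rows (psql -At output) on stdin; $1 is the pool maximum.
# Typical input (assumed invocation):
#   psql -At -c "SELECT state, count(*) FROM pg_stat_activity
#                WHERE datname = current_database() GROUP BY state" \
#     | pool_utilization 20
pool_utilization() {
    awk -F '|' -v max="$1" '
        $1 != "" && $1 != "idle"    { busy += $2 }   # rows with an empty state are ignored
        $1 == "idle in transaction" { stuck = $2 }
        END {
            if (max + 0 <= 0) { print "invalid pool maximum"; exit 1 }
            printf "busy=%d max=%d utilization=%.0f%%\n", busy, max, 100 * busy / max
            if (stuck > 0) print "idle_in_transaction=" stuck
        }
    '
}
```

The figure only counts what the database sees; the pool's own metrics remain the authoritative source for borrowed versus idle connections.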

Common Debug Commands

Service Health

```
# Basic health check
curl -s http://172.xx.xx.xx:8080/actuator/health | jq

# Liveness and readiness probes (available when health probes are enabled)
curl -s http://172.xx.xx.xx:8080/actuator/health/liveness
curl -s http://172.xx.xx.xx:8080/actuator/health/readiness
```
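
During a restart or deploy it is often more useful to poll the health endpoint than to call it once. A minimal sketch, assuming the endpoint answers with a non-2xx status (Spring Boot returns 503 by default) while the service is down; `wait_healthy` is a hypothetical helper, and the attempt count and delay defaults are arbitrary.

```shell
#!/bin/sh
# Polls a health URL until it answers 2xx or the attempts run out.
# $1 = URL, $2 = attempts (default 5), $3 = delay in seconds (default 2)
# curl -f turns HTTP error statuses into a non-zero exit code.
wait_healthy() {
    url="$1"; attempts="${2:-5}"; delay="${3:-2}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        if curl -fsS -o /dev/null --max-time 3 "$url"; then
            echo "healthy after $i attempt(s)"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "unhealthy after $attempts attempt(s)" >&2
    return 1
}
```

Usage would look like `wait_healthy http://172.xx.xx.xx:8080/actuator/health 10 3`, with the exit code usable in a deploy script.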

Log Analysis

```
# Recent errors
grep -i "error\|exception" /var/log/app/service.log | tail -50

# Request tracing
grep "<trace-id>" /var/log/app/service.log
```
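
When the error grep returns too many lines to read, grouping by exception class shows what dominates. A sketch under the assumption that the log contains Java-style class names ending in `Exception` or `Error`; `top_exceptions` is a hypothetical helper name.

```shell
#!/bin/sh
# Reads log lines on stdin and prints "count class" for the most frequent
# exception class names. $1 = how many classes to show (default 10).
# Typical input: top_exceptions 5 < /var/log/app/service.log
top_exceptions() {
    grep -oE '[A-Za-z_][A-Za-z0-9_.]*(Exception|Error)' \
        | sort | uniq -c | sort -rn | head -n "${1:-10}" \
        | awk '{ print $1, $2 }'
}
```

The match is case-sensitive, so an upper-case `ERROR` log level is not counted as a class name.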

Resource Monitoring

```
# Container stats
docker stats <container-id>

# JVM memory (if applicable)
curl -s http://172.xx.xx.xx:8080/actuator/metrics/jvm.memory.used | jq
```
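
For a one-shot view across containers, `docker stats` can print a single sample in a custom format that is easy to filter. A minimal sketch: `hot_containers` is a hypothetical helper, and the default limit of 70 mirrors the ECS row in the threshold table. Docker reports CPU relative to a single core, so values can exceed 100% on multi-core hosts and are not directly comparable with the ECS metric.

```shell
#!/bin/sh
# Reads "name cpu% mem%" rows on stdin and prints containers whose CPU
# percentage is above the limit. $1 = limit (default 70).
# Typical input (assumed invocation):
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}' \
#     | hot_containers 70
hot_containers() {
    limit="${1:-70}"
    awk -v limit="$limit" '{
        cpu = $2
        sub(/%$/, "", cpu)                 # drop the trailing percent sign
        if (cpu + 0 > limit + 0) print $1, $2
    }'
}
```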

Escalation Matrix

Severity	Response Time	Escalation Path
P1 (Critical)	15 minutes	On-call → Team Lead → Engineering Manager
P2 (High)	1 hour	On-call → Team Lead
P3 (Medium)	4 hours	On-call
P4 (Low)	Next business day	Ticket queue

Runbook Links
