Service Debug Documentation Template

TL;DR: A standardized template for documenting service debugging procedures, including component thresholds and hygiene checks.

Overview

This template provides a standardized structure for documenting debugging procedures for microservices. Having consistent debug documentation accelerates incident response and onboarding.

Service Details

PropertyValue
Cluster<cluster-name>
Metrics Dashboard<cloudwatch-url>
APM/Latency<newrelic/datadog-url>
Log Aggregation<graylog/elk-url>
Health Endpoint/actuator/health

Component Tolerance Thresholds

Define acceptable operating ranges for each dependency:

ComponentDimensionThreshold
RabbitMQ QueueQueue Size≤300 in last 5 minutes
RedisResponse Time<10ms p99
DatabaseConnection Pool<80% utilized
API GatewayError Rate<1%
ECSCPU Utilization<70% sustained

Hygiene Checks

RabbitMQ Queue Issues

Observation: Queue size exceeded acceptable threshold

Resolution Steps:

  1. Check incoming load on the service
  2. Verify queue consumers are connected:
    rabbitmqctl list_consumers -p /
    
  3. Validate service health:
    curl http://172.xx.xx.xx:8080/actuator/health
    
  4. Check for consumer ack timeouts in logs
  5. Review consumer prefetch settings

Redis Performance Issues

Observation: Redis response times exceed thresholds

Resolution Steps:

  1. Check Redis cluster load:
    redis-cli INFO stats
    
  2. Identify scanning/wildcard queries:
    redis-cli SLOWLOG GET 10
    
  3. Verify application connection pool status
  4. Check freeable memory on Redis:
    redis-cli INFO memory
    
  5. Review cache hit/miss ratio

Database Connection Issues

Observation: Connection pool exhaustion

Resolution Steps:

  1. Check active connections:
    SELECT count(*) FROM pg_stat_activity;
    
  2. Identify long-running queries:
    SELECT pid, query, state, query_start 
    FROM pg_stat_activity 
    WHERE state != 'idle' 
    ORDER BY query_start;
    
  3. Review connection pool configuration
  4. Check for connection leaks in application logs

Common Debug Commands

Service Health

# Basic health check
curl -s http://172.xx.xx.xx:8080/actuator/health | jq

# Detailed health with components
curl -s http://172.xx.xx.xx:8080/actuator/health/liveness
curl -s http://172.xx.xx.xx:8080/actuator/health/readiness

Log Analysis

# Recent errors
grep -i "error\|exception" /var/log/app/service.log | tail -50

# Request tracing
grep "<trace-id>" /var/log/app/service.log

Resource Monitoring

# Container stats
docker stats <container-id>

# JVM memory (if applicable)
curl -s http://172.xx.xx.xx:8080/actuator/metrics/jvm.memory.used | jq

Escalation Matrix

SeverityResponse TimeEscalation Path
P1 (Critical)15 minutesOn-call → Team Lead → Engineering Manager
P2 (High)1 hourOn-call → Team Lead
P3 (Medium)4 hoursOn-call
P4 (Low)Next business dayTicket queue
  • Deployment Rollback Procedure
  • Database Failover Process
  • Cache Invalidation Steps
  • Scaling Playbook
  • Disaster Recovery Plan