RCA: SSL Certificate Expiry Incident

TL;DR: Root cause analysis of a service outage caused by expired SSL certificates, with preventive measures for certificate lifecycle management.

Incident Synopsis

FieldValue
RCA StatusComplete
PriorityHighest
Date of IncidentCan occur anytime
DurationCan vary depending on detection and response
Impacted ServicesCan impact all services

Timeline

MetricDuration
Time to Detect5 minutes
Time to Engage5 minutes
Time to Repair45 minutes
Time to Resolve1 hour

Executive Summary

Services became unavailable due to an expired SSL certificate on the domain. Multiple customers reported failures accessing functionality.

Business Impact

Service access failures affecting multiple customers. Issue reproduced by engineering team confirming complete service unavailability.

Root Cause Analysis: 5 Whys

Why-1: Why did this incident occur?

The SSL certificate assigned to the service domain expired.

Why-2: Why did this happen suddenly?

No proactive alert indicated the approaching expiry date.

Why-3: Why was there no alert?

A Lambda function designed to periodically check and report certificate status had not been running for 4 months.

Why-4: Why did the certificate not auto-renew?

Auto-renewal required email authorization. The email was configured for a mailbox that was not actively monitored.

Key Learnings

  1. Auto-renewal verification: Setup auto-renewal and verify the process works end-to-end
  2. Monitoring Lambda health: Ensure certificate monitoring Lambda functions are actively running and alerting
  3. Email routing: Setup dedicated postmaster@ or ssl-alerts@ email accounts to receive certificate notifications

Corrective Actions

ActionOwnerStatus
Setup Lambda to monitor certificate expiry datesPlatform Team✅ Done
Configure email alerts to monitored inboxDevOps✅ Done
Document certificate renewal runbookSRE✅ Done

Prevention Checklist

For future certificate management:

  • Use AWS Certificate Manager (ACM) for automatic renewal where possible
  • For non-ACM certificates, implement 30/14/7 day expiry alerts
  • Maintain a certificate inventory with expiry dates
  • Route certificate emails to team distribution list, not individual mailboxes
  • Schedule quarterly certificate audit reviews
  • Test renewal process in non-production environments

Lambda Monitoring Script

Simple certificate expiry check pattern:

import ssl
import socket
from datetime import datetime

def check_cert_expiry(hostname, port=443):
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port)) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as ssock:
            cert = ssock.getpeercert()
            expiry = datetime.strptime(cert['notAfter'], '%b %d %H:%M:%S %Y %Z')
            days_remaining = (expiry - datetime.utcnow()).days
            return days_remaining

# Alert thresholds
WARN_DAYS = 30
CRITICAL_DAYS = 14

References

  • AWS ACM Documentation for managed certificate renewal
  • Let’s Encrypt for free automated certificates
  • certbot for certificate automation