πŸ“ Draft - This article is not yet published

Building Edge-to-Cloud Event Pipelines: Architecture and Patterns

TL;DR: A comprehensive architecture for IoT event flow from edge devices through cloud processing, including MQTT, SQS routing, retry patterns, and partner integration.

Building reliable event pipelines from edge devices to the cloud is challenging. Network connectivity is unreliable, events must never be lost, and different consumers need different views of the same data. This article presents a battle-tested architecture for edge-to-cloud event flow.

Architecture Overview

The pipeline consists of three layers:

  1. Remote Edge Layer - Where physical events originate
  2. Cloud Layer - Event routing and processing
  3. External Partner Layer - Whitelisted event delivery

1. Remote Edge Layer

This is where all physical-world events originateβ€”from devices installed at sites.

1.1 MQTT Broker (Mosquitto)

The local MQTT broker acts as a message bus for all edge devices:

  • All event-generating devices publish events here
  • Provides buffering and decoupling between devices and the Edge Service
  • Handles intermittent connectivity gracefully
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Ajax     β”‚    β”‚   ZigBee    β”‚    β”‚     NVR     β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚                  β”‚                  β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
                          β–Ό
                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                 β”‚   MQTT Broker   β”‚
                 β”‚   (Mosquitto)   β”‚
                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
                          β–Ό
                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                 β”‚  Edge Service   β”‚
                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
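
To make the broker's message-bus role concrete, here is a minimal consumer-side sketch. It assumes the paho-mqtt library and a hypothetical events/# topic hierarchy; the article does not prescribe either, so treat the names as placeholders.
# Sketch: reading the full edge event stream via a wildcard subscription
import paho.mqtt.subscribe as subscribe

def on_message(client, userdata, message):
    # message.topic is e.g. "events/zigbee"; message.payload holds the raw bytes
    print(message.topic, message.payload.decode())

# Blocks and calls on_message for every event published under events/
subscribe.callback(on_message, topics="events/#", qos=1, hostname="localhost")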

1.2 Events Forwarder

A lightweight component that:

  • Collects events from various devices (Ajax, ZigBee, NVR, IDP, etc.)
  • Normalizes events into a standard format
  • Publishes ALL events to the local MQTT broker
  • Applies no filtering or business logic beyond normalization; it only forwards (see the sketch below)
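
As a rough sketch of the normalization step (the envelope fields and topic names below are illustrative assumptions, not the project's actual schema), the forwarder simply wraps each device payload in a common envelope before publishing:
# Sketch: normalize a device payload and forward it to the local broker
import json
import uuid
from datetime import datetime, timezone

import paho.mqtt.publish as publish

def normalize(raw: dict, source: str) -> dict:
    """Wrap a device-specific payload in a common envelope."""
    return {
        "event_id": str(uuid.uuid4()),
        "source": source,                                    # ajax, zigbee, nvr, idp, ...
        "event_type": raw.get("type", "unknown"),
        "timestamp": raw.get("ts") or datetime.now(timezone.utc).isoformat(),
        "payload": raw,                                      # original payload kept untouched
    }

def forward(raw: dict, source: str) -> None:
    event = normalize(raw, source)
    publish.single(
        topic=f"events/{source}",
        payload=json.dumps(event),
        qos=1,
        hostname="localhost",
    )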

1.3 Edge Service

The intelligent component that generates and processes events:

Responsibilities:

  • Generates internal transactional/user events (check-ins, jobs, robot actions)
  • Subscribes to MQTT topics to receive all forwarder events
  • Manages reliable delivery to the cloud

Event Processing Flow:

For every event (internal or forwarder):
1. Publish a real-time job (in-memory, not stored)
2. Attempt to publish to SQS
   
   If SUCCESS:
   └── Close the job
   
   If FAILURE:
   β”œβ”€β”€ Create persistent job (in DB)
   β”œβ”€β”€ Job Poller picks up the job
   β”œβ”€β”€ Retry up to 3 times
   └── After 3 failures β†’ push to local DLQ
   
DLQ jobs are processed daily at 2am

This ensures no event is lost at the edge, even if internet/cloud is down.
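
The same flow in code, as a compact sketch. Every helper here (publish_realtime, push_to_sqs, and friends) is a placeholder for whatever the Edge Service actually uses, not a real API:
# Sketch: publish-first, persist-on-failure delivery at the edge
def handle_event(event: dict) -> None:
    publish_realtime(event)               # in-memory fan-out, never stored
    try:
        push_to_sqs(event)                # attempt direct cloud delivery
    except Exception:
        create_persistent_job(event)      # saved in the local DB for the Job Poller

def job_poller_tick() -> None:
    for job in load_pending_jobs():       # jobs created by failed pushes
        try:
            push_to_sqs(job["event"])
            close_job(job)
        except Exception:
            job["attempts"] += 1
            if job["attempts"] >= 3:
                move_to_local_dlq(job)    # retried again by the daily 2am run
            else:
                save_job(job)             # picked up on the next poll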

2. Cloud Layer

2.1 SQS Queue (Event Sink)

The central collector of events from ALL sites:

  • Durability - Events persist until processed
  • Ordering - FIFO if required
  • Scalability - Handles millions of events
# SQS Configuration
queue:
  type: FIFO  # Or Standard for higher throughput
  visibilityTimeout: 300
  messageRetentionPeriod: 1209600  # 14 days
  deadLetterQueue:
    maxReceiveCount: 3
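
For concreteness, the edge-side push into this queue (the push_to_sqs placeholder from the previous section) might look roughly like this with boto3. The queue URL is a placeholder, and site_id/event_id are the envelope fields used elsewhere in this article:
# Sketch: sending one event to the FIFO queue
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/events.fifo"  # placeholder

def push_to_sqs(event: dict) -> None:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(event),
        MessageGroupId=event["site_id"],           # preserves per-site ordering
        MessageDeduplicationId=event["event_id"],  # FIFO dedup window of 5 minutes
    )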

2.2 Event Consumer

Reads events from SQS and routes them to appropriate destinations:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Event Consumer β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚         β”‚           β”‚             β”‚
    β–Ό         β–Ό           β–Ό             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Partnerβ”‚ β”‚Grafanaβ”‚ β”‚Internal SQSβ”‚ β”‚S3 Parquetβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ Queues     β”‚ β”‚ Archival β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
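
The consumer itself can be a plain long-polling loop. The sketch below shows the shape only; the queue URL is a placeholder and the four router functions stand in for the destinations described next:
# Sketch: long-polling consumer with fan-out to the four destinations
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/events.fifo"  # placeholder

def consume_forever() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,           # long polling
        )
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])
            send_to_partner(event)        # a) only if whitelisted
            send_to_grafana(event)        # b) observability
            route_internally(event)       # c) priority queues + DB
            archive_to_s3(event)          # d) Parquet archival
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )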

a) Send to Partner

Only whitelisted events go to partners:

  • Payload goes through Payload Transformer
  • Only agreed fields are sent (privacy/security)
  • Output matches Partner Schema
# Example transformation
def transform_for_partner(event, partner_config):
    whitelist = partner_config['allowed_fields']
    return {
        field: event.get(field)
        for field in whitelist
        if field in event
    }

b) Send to Grafana

Operational/monitoring events pushed to Grafana (via Loki or Prometheus):

  • Used for dashboards
  • Real-time observability
  • Alerting
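
If the Loki route is used, the consumer can push each event as a structured log line over Loki's HTTP push API. This is only a sketch; the URL and labels are assumptions:
# Sketch: send_to_grafana via Loki's push endpoint
import json
import time
import requests

LOKI_URL = "http://loki.internal:3100/loki/api/v1/push"  # placeholder

def send_to_grafana(event: dict) -> None:
    payload = {
        "streams": [{
            "stream": {"job": "event-pipeline", "event_type": event["event_type"]},
            "values": [[str(time.time_ns()), json.dumps(event)]],
        }]
    }
    requests.post(LOKI_URL, json=payload, timeout=5)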

c) Internal Processing + Database

Events routed to priority queues based on urgency:

Queue    Priority        Processing Time
-------  --------------  ---------------
High     User-facing     < 1 second
Medium   Business logic  < 10 seconds
Low      Analytics       Best effort
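
One hedged way to implement this routing is a static map from event type to priority; the event types and queue URLs below are illustrative only:
# Sketch: route_internally picks a queue by event type
import json
import boto3

sqs = boto3.client("sqs")

PRIORITY_BY_TYPE = {
    "intrusion_alert": "high",      # user-facing
    "door_access": "high",
    "job_completed": "medium",      # business logic (hypothetical type)
    "sensor_reading": "low",        # analytics (hypothetical type)
}

QUEUE_URLS = {                      # placeholder standard queues
    "high": "https://sqs.eu-west-1.amazonaws.com/123456789012/events-high",
    "medium": "https://sqs.eu-west-1.amazonaws.com/123456789012/events-medium",
    "low": "https://sqs.eu-west-1.amazonaws.com/123456789012/events-low",
}

def route_internally(event: dict) -> None:
    priority = PRIORITY_BY_TYPE.get(event["event_type"], "low")
    sqs.send_message(QueueUrl=QUEUE_URLS[priority], MessageBody=json.dumps(event))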

Consumers handle:

  • Database insertion
  • Triggering automation
  • Status updates
  • Internal workflow initiation

d) S3 Parquet Archival

Every event is permanently stored in S3 as Parquet:

  • Long-term analytics
  • Compliance and audits
  • Machine learning pipelines
  • Historical reporting
# Parquet archival structure
s3://events-archive/
β”œβ”€β”€ year=2024/
β”‚   β”œβ”€β”€ month=01/
β”‚   β”‚   β”œβ”€β”€ day=15/
β”‚   β”‚   β”‚   β”œβ”€β”€ hour=00/
β”‚   β”‚   β”‚   β”‚   └── events_0001.parquet
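
A hedged sketch of the writer, using pyarrow's partitioned dataset API. It assumes events arrive in small batches with the envelope fields used earlier, that AWS credentials are already configured, and it keeps the nested payload as JSON text to avoid schema drift:
# Sketch: archive a batch of events as hour-partitioned Parquet
import json
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

def archive_batch(events: list) -> None:
    rows = []
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        rows.append({
            "event_id": event["event_id"],
            "event_type": event["event_type"],
            "timestamp": event["timestamp"],
            "payload": json.dumps(event.get("payload", {})),
            "year": f"{ts.year:04d}", "month": f"{ts.month:02d}",
            "day": f"{ts.day:02d}", "hour": f"{ts.hour:02d}",
        })
    table = pa.Table.from_pylist(rows)
    pq.write_to_dataset(
        table,
        root_path="s3://events-archive/",                 # bucket from the layout above
        partition_cols=["year", "month", "day", "hour"],  # yields year=2024/month=01/...
    )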

3. External Partner Layer

Partners receive only:

  • Whitelisted events they’ve subscribed to
  • Sanitized and transformed payloads
  • Delivery via webhook or API
# Partner configuration
partner:
  name: security-monitoring
  webhook: https://partner.example.com/events
  subscriptions:
    - intrusion_alert
    - door_access
  transform:
    strip_fields: [internal_id, raw_payload]
    rename:
      timestamp: event_time
      site_id: location_id
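
Applied at delivery time, a config like that boils down to a filter, a rename, and a POST. The sketch below mirrors the YAML shape above; retries are covered later under Implementation Tips:
# Sketch: apply the partner config and deliver via webhook
import requests

def deliver_webhook(event: dict, partner: dict) -> bool:
    if event["event_type"] not in partner["subscriptions"]:
        return True                                    # not whitelisted: nothing to send
    payload = {k: v for k, v in event.items()
               if k not in partner["transform"]["strip_fields"]}
    for old, new in partner["transform"]["rename"].items():
        if old in payload:
            payload[new] = payload.pop(old)
    resp = requests.post(partner["webhook"], json=payload, timeout=10)
    return resp.ok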

Full Flow Summary

1. Physical devices β†’ Forwarder β†’ MQTT
2. Edge Service reads events from MQTT + adds its own events
3. Edge Service pushes to SQS (on failure: persistent job β†’ Job Poller β†’ retry)
4. SQS stores all events
5. Event Consumer reads from SQS
6. Event Consumer routes:
   β†’ Partner (only selected events with transformed payload)
   β†’ Grafana (for monitoring/analytics)
   β†’ Internal processing + DB
   β†’ S3 Archival (all events)

Reliability Guarantees

At the Edge

Scenario          Handling
----------------  -------------------------
Network down      Events queued in local DB
Service restart   Job poller resumes
Extended outage   DLQ with daily retry

In the Cloud

Scenario            Handling
------------------  ----------------------------------
Consumer crash      SQS visibility timeout, auto-retry
Processing failure  Dead-letter queue
Partner down        Separate retry queue per partner

Monitoring Points

Critical metrics to track:

metrics:
  edge:
    - jobs_pending_count
    - dlq_size
    - mqtt_message_rate
    
  cloud:
    - sqs_queue_depth
    - consumer_lag
    - processing_error_rate
    - partner_delivery_success_rate
    
  archival:
    - parquet_files_written
    - archival_lag_minutes
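
If Prometheus is the collector (the article leaves the tooling open), the edge-side gauges can be exposed with a few lines of prometheus_client; the port and help strings here are arbitrary choices:
# Sketch: expose edge metrics for scraping
from prometheus_client import Gauge, start_http_server

jobs_pending = Gauge("jobs_pending_count", "Edge jobs waiting for delivery to SQS")
dlq_size = Gauge("dlq_size", "Events parked in the local dead-letter queue")

def report_edge_metrics(pending: int, dlq: int) -> None:
    jobs_pending.set(pending)
    dlq_size.set(dlq)

start_http_server(9100)  # scrape endpoint; the port is an arbitrary choice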

Implementation Tips

1. Event Schema Versioning

Always include schema version in events:

{
  "schema_version": "2.1",
  "event_type": "door_access",
  "timestamp": "2024-01-15T10:30:00Z",
  "payload": { ... }
}
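
The version field pays off in the consumer, where a small dispatch table lets old and new producers coexist; the 2.0 layout below is purely hypothetical:
# Sketch: version-aware parsing
def parse_v2_1(event: dict) -> dict:
    return event["payload"]

def parse_v2_0(event: dict) -> dict:
    # hypothetical older layout where payload fields sat at the top level
    return {k: v for k, v in event.items()
            if k not in ("schema_version", "event_type", "timestamp")}

PARSERS = {"2.1": parse_v2_1, "2.0": parse_v2_0}

def parse_event(event: dict) -> dict:
    version = event.get("schema_version", "2.0")
    if version not in PARSERS:
        raise ValueError(f"unsupported schema_version: {version}")
    return PARSERS[version](event)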

2. Idempotent Processing

Include event IDs for deduplication:

{
  "event_id": "uuid-v4",
  "idempotency_key": "site-123-door-456-2024-01-15T10:30:00Z"
}
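
A minimal dedup sketch; Redis is just one possible store for the "already seen" keys and is an assumption here, not something the architecture mandates:
# Sketch: skip duplicate deliveries using SET NX
import redis

r = redis.Redis(host="localhost", port=6379)

def is_first_delivery(event: dict) -> bool:
    # set() with nx=True succeeds only if the key does not exist yet;
    # ex=86400 lets keys expire after a day
    return bool(r.set(f"seen:{event['idempotency_key']}", "1", nx=True, ex=86400))

def process_once(event: dict) -> None:
    if not is_first_delivery(event):
        return              # duplicate delivery (SQS is at-least-once): skip
    process_event(event)    # placeholder for the real processing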

3. Partner Retry Strategy

Use exponential backoff for partner delivery:

# Retry partner delivery with increasing backoff between attempts
from time import sleep

delays = [1, 5, 15, 60, 300]  # seconds to wait after each failed attempt
for attempt, delay in enumerate(delays, start=1):
    if deliver_to_partner(event):
        break                      # delivered successfully
    if attempt < len(delays):
        sleep(delay)               # back off before retrying
else:
    move_to_partner_dlq(event)     # all attempts failed

Conclusion

This architecture provides:

  • Reliability - No event loss, even with network issues
  • Scalability - SQS and parallel consumers handle growth
  • Flexibility - Easy to add new consumers or partners
  • Auditability - Complete event history in S3

The key principles:

  1. Buffer locally, deliver remotely
  2. Persist before acknowledging
  3. Transform at the boundary
  4. Archive everything

Start simple with the core flow, then add partners and analytics consumers as needed.

Acknowledgements
  • Vinayak β€” Organising the literature