πŸ“ Draft - This article is not yet published

Building Edge-to-Cloud Event Pipelines: Architecture and Patterns

TL;DR: A comprehensive architecture for IoT event flow from edge devices through cloud processing, including MQTT, SQS routing, retry patterns, and partner integration.

Building reliable event pipelines from edge devices to the cloud is challenging. Network connectivity is unreliable, events must never be lost, and different consumers need different views of the same data. This article presents a battle-tested architecture for edge-to-cloud event flow.

Architecture Overview

The pipeline consists of three layers:

  1. Remote Edge Layer - Where physical events originate
  2. Cloud Layer - Event routing and processing
  3. External Partner Layer - Whitelisted event delivery

1. Remote Edge Layer

This is where all physical-world events originateβ€”from devices installed at sites.

1.1 MQTT Broker (Mosquitto)

The local MQTT broker acts as a message bus for all edge devices:

  • All event-generating devices publish events here
  • Provides buffering and decoupling between devices and the Edge Service
  • Handles intermittent connectivity gracefully
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Ajax     β”‚    β”‚   ZigBee    β”‚    β”‚     NVR     β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚                  β”‚                  β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
                          β–Ό
                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                 β”‚   MQTT Broker   β”‚
                 β”‚   (Mosquitto)   β”‚
                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
                          β–Ό
                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                 β”‚  Edge Service   β”‚
                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
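
To make the broker's message-bus role concrete, here is a minimal consumer-side sketch. It assumes the paho-mqtt library and a hypothetical events/# topic hierarchy; the article does not prescribe either, so treat the names as placeholders.
# Sketch: reading the full edge event stream via a wildcard subscription
import paho.mqtt.subscribe as subscribe

def on_message(client, userdata, message):
    # message.topic is e.g. "events/zigbee"; message.payload holds the raw bytes
    print(message.topic, message.payload.decode())

# Blocks and calls on_message for every event published under events/
subscribe.callback(on_message, topics="events/#", qos=1, hostname="localhost")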

1.2 Events Forwarder

A lightweight component that:

  • Collects events from various devices (Ajax, ZigBee, NVR, IDP, etc.)
  • Normalizes events into a standard format
  • Publishes ALL events to the local MQTT broker
  • Applies no filtering or business logic beyond normalization; it only forwards (see the sketch below)
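
As a rough sketch of the normalization step (the envelope fields and topic names below are illustrative assumptions, not the project's actual schema), the forwarder simply wraps each device payload in a common envelope before publishing:
# Sketch: normalize a device payload and forward it to the local broker
import json
import uuid
from datetime import datetime, timezone

import paho.mqtt.publish as publish

def normalize(raw: dict, source: str) -> dict:
    """Wrap a device-specific payload in a common envelope."""
    return {
        "event_id": str(uuid.uuid4()),
        "source": source,                                    # ajax, zigbee, nvr, idp, ...
        "event_type": raw.get("type", "unknown"),
        "timestamp": raw.get("ts") or datetime.now(timezone.utc).isoformat(),
        "payload": raw,                                      # original payload kept untouched
    }

def forward(raw: dict, source: str) -> None:
    event = normalize(raw, source)
    publish.single(
        topic=f"events/{source}",
        payload=json.dumps(event),
        qos=1,
        hostname="localhost",
    )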

1.3 Edge Service

The intelligent component that generates and processes events:

Responsibilities:

  • Generates internal transactional/user events (check-ins, jobs, robot actions)
  • Subscribes to MQTT topics to receive all forwarder events
  • Manages reliable delivery to the cloud

Event Processing Flow:

For every event (internal or forwarder):
1. Publish a real-time job (in-memory, not stored)
2. Attempt to publish to SQS
   
   If SUCCESS:
   └── Close the job
   
   If FAILURE:
   β”œβ”€β”€ Create persistent job (in DB)
   β”œβ”€β”€ Job Poller picks up the job
   β”œβ”€β”€ Retry up to 3 times
   └── After 3 failures β†’ push to local DLQ
   
DLQ jobs are processed daily at 2am

This ensures no event is lost at the edge, even if internet/cloud is down.
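
The same flow in code, as a compact sketch. Every helper here (publish_realtime, push_to_sqs, and friends) is a placeholder for whatever the Edge Service actually uses, not a real API:
# Sketch: publish-first, persist-on-failure delivery at the edge
def handle_event(event: dict) -> None:
    publish_realtime(event)               # in-memory fan-out, never stored
    try:
        push_to_sqs(event)                # attempt direct cloud delivery
    except Exception:
        create_persistent_job(event)      # saved in the local DB for the Job Poller

def job_poller_tick() -> None:
    for job in load_pending_jobs():       # jobs created by failed pushes
        try:
            push_to_sqs(job["event"])
            close_job(job)
        except Exception:
            job["attempts"] += 1
            if job["attempts"] >= 3:
                move_to_local_dlq(job)    # retried again by the daily 2am run
            else:
                save_job(job)             # picked up on the next poll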

2. Cloud Layer

2.1 SQS Queue (Event Sink)

The central collector of events from ALL sites:

  • Durability - Events persist until processed
  • Ordering - FIFO if required
  • Scalability - Handles millions of events
# SQS Configuration
queue:
  type: FIFO  # Or Standard for higher throughput
  visibilityTimeout: 300
  messageRetentionPeriod: 1209600  # 14 days
  deadLetterQueue:
    maxReceiveCount: 3
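
For concreteness, the edge-side push into this queue (the push_to_sqs placeholder from the previous section) might look roughly like this with boto3. The queue URL is a placeholder, and site_id/event_id are the envelope fields used elsewhere in this article:
# Sketch: sending one event to the FIFO queue
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/events.fifo"  # placeholder

def push_to_sqs(event: dict) -> None:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(event),
        MessageGroupId=event["site_id"],           # preserves per-site ordering
        MessageDeduplicationId=event["event_id"],  # FIFO dedup window of 5 minutes
    )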

2.2 Event Consumer

Reads events from SQS and routes them to appropriate destinations:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Event Consumer β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚         β”‚           β”‚             β”‚
    β–Ό         β–Ό           β–Ό             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Partnerβ”‚ β”‚Grafanaβ”‚ β”‚Internal SQSβ”‚ β”‚S3 Parquetβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ Queues     β”‚ β”‚ Archival β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
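
The consumer itself can be a plain long-polling loop. The sketch below shows the shape only; the queue URL is a placeholder and the four router functions stand in for the destinations described next:
# Sketch: long-polling consumer with fan-out to the four destinations
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/events.fifo"  # placeholder

def consume_forever() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,           # long polling
        )
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])
            send_to_partner(event)        # a) only if whitelisted
            send_to_grafana(event)        # b) observability
            route_internally(event)       # c) priority queues + DB
            archive_to_s3(event)          # d) Parquet archival
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )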

a) Send to Partner

Only whitelisted events go to partners:

  • Payload goes through Payload Transformer
  • Only agreed fields are sent (privacy/security)
  • Output matches Partner Schema
# Example transformation
def transform_for_partner(event, partner_config):
    whitelist = partner_config['allowed_fields']
    return {
        field: event.get(field)
        for field in whitelist
        if field in event
    }

b) Send to Grafana

Operational/monitoring events pushed to Grafana (via Loki or Prometheus):

  • Used for dashboards
  • Real-time observability
  • Alerting
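
If the Loki route is used, the consumer can push each event as a structured log line over Loki's HTTP push API. This is only a sketch; the URL and labels are assumptions:
# Sketch: send_to_grafana via Loki's push endpoint
import json
import time
import requests

LOKI_URL = "http://loki.internal:3100/loki/api/v1/push"  # placeholder

def send_to_grafana(event: dict) -> None:
    payload = {
        "streams": [{
            "stream": {"job": "event-pipeline", "event_type": event["event_type"]},
            "values": [[str(time.time_ns()), json.dumps(event)]],
        }]
    }
    requests.post(LOKI_URL, json=payload, timeout=5)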

c) Internal Processing + Database

Events routed to priority queues based on urgency:

Queue    Priority        Processing Time
-------  --------------  ---------------
High     User-facing     < 1 second
Medium   Business logic  < 10 seconds
Low      Analytics       Best effort
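
One hedged way to implement this routing is a static map from event type to priority; the event types and queue URLs below are illustrative only:
# Sketch: route_internally picks a queue by event type
import json
import boto3

sqs = boto3.client("sqs")

PRIORITY_BY_TYPE = {
    "intrusion_alert": "high",      # user-facing
    "door_access": "high",
    "job_completed": "medium",      # business logic (hypothetical type)
    "sensor_reading": "low",        # analytics (hypothetical type)
}

QUEUE_URLS = {                      # placeholder standard queues
    "high": "https://sqs.eu-west-1.amazonaws.com/123456789012/events-high",
    "medium": "https://sqs.eu-west-1.amazonaws.com/123456789012/events-medium",
    "low": "https://sqs.eu-west-1.amazonaws.com/123456789012/events-low",
}

def route_internally(event: dict) -> None:
    priority = PRIORITY_BY_TYPE.get(event["event_type"], "low")
    sqs.send_message(QueueUrl=QUEUE_URLS[priority], MessageBody=json.dumps(event))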

Consumers handle:

  • Database insertion
  • Triggering automation
  • Status updates
  • Internal workflow initiation

d) S3 Parquet Archival

Every event is permanently stored in S3 as Parquet:

  • Long-term analytics
  • Compliance and audits
  • Machine learning pipelines
  • Historical reporting
# Parquet archival structure
s3://events-archive/
β”œβ”€β”€ year=2024/
β”‚   β”œβ”€β”€ month=01/
β”‚   β”‚   β”œβ”€β”€ day=15/
β”‚   β”‚   β”‚   β”œβ”€β”€ hour=00/
β”‚   β”‚   β”‚   β”‚   └── events_0001.parquet
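
A hedged sketch of the writer, using pyarrow's partitioned dataset API. It assumes events arrive in small batches with the envelope fields used earlier, that AWS credentials are already configured, and it keeps the nested payload as JSON text to avoid schema drift:
# Sketch: archive a batch of events as hour-partitioned Parquet
import json
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

def archive_batch(events: list) -> None:
    rows = []
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        rows.append({
            "event_id": event["event_id"],
            "event_type": event["event_type"],
            "timestamp": event["timestamp"],
            "payload": json.dumps(event.get("payload", {})),
            "year": f"{ts.year:04d}", "month": f"{ts.month:02d}",
            "day": f"{ts.day:02d}", "hour": f"{ts.hour:02d}",
        })
    table = pa.Table.from_pylist(rows)
    pq.write_to_dataset(
        table,
        root_path="s3://events-archive/",                 # bucket from the layout above
        partition_cols=["year", "month", "day", "hour"],  # yields year=2024/month=01/...
    )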

3. External Partner Layer

Partners receive only:

  • Whitelisted events they’ve subscribed to
  • Sanitized and transformed payloads
  • Delivery via webhook or API
# Partner configuration
partner:
  name: security-monitoring
  webhook: https://partner.example.com/events
  subscriptions:
    - intrusion_alert
    - door_access
  transform:
    strip_fields: [internal_id, raw_payload]
    rename:
      timestamp: event_time
      site_id: location_id
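
Applied at delivery time, a config like that boils down to a filter, a rename, and a POST. The sketch below mirrors the YAML shape above; retries are covered later under Implementation Tips:
# Sketch: apply the partner config and deliver via webhook
import requests

def deliver_webhook(event: dict, partner: dict) -> bool:
    if event["event_type"] not in partner["subscriptions"]:
        return True                                    # not whitelisted: nothing to send
    payload = {k: v for k, v in event.items()
               if k not in partner["transform"]["strip_fields"]}
    for old, new in partner["transform"]["rename"].items():
        if old in payload:
            payload[new] = payload.pop(old)
    resp = requests.post(partner["webhook"], json=payload, timeout=10)
    return resp.ok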

Full Flow Summary

1. Physical devices β†’ Forwarder β†’ MQTT
2. Edge Service reads events from MQTT + adds its own events
3. Edge Service pushes to SQS (on failure: persistent job β†’ Job Poller β†’ retry)
4. SQS stores all events
5. Event Consumer reads from SQS
6. Event Consumer routes:
   β†’ Partner (only selected events with transformed payload)
   β†’ Grafana (for monitoring/analytics)
   β†’ Internal processing + DB
   β†’ S3 Archival (all events)

Reliability Guarantees

At the Edge

Scenario          Handling
----------------  -------------------------
Network down      Events queued in local DB
Service restart   Job poller resumes
Extended outage   DLQ with daily retry

In the Cloud

Scenario            Handling
------------------  ----------------------------------
Consumer crash      SQS visibility timeout, auto-retry
Processing failure  Dead-letter queue
Partner down        Separate retry queue per partner

Monitoring Points

Critical metrics to track:

metrics:
  edge:
    - jobs_pending_count
    - dlq_size
    - mqtt_message_rate
    
  cloud:
    - sqs_queue_depth
    - consumer_lag
    - processing_error_rate
    - partner_delivery_success_rate
    
  archival:
    - parquet_files_written
    - archival_lag_minutes
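
If Prometheus is the collector (the article leaves the tooling open), the edge-side gauges can be exposed with a few lines of prometheus_client; the port and help strings here are arbitrary choices:
# Sketch: expose edge metrics for scraping
from prometheus_client import Gauge, start_http_server

jobs_pending = Gauge("jobs_pending_count", "Edge jobs waiting for delivery to SQS")
dlq_size = Gauge("dlq_size", "Events parked in the local dead-letter queue")

def report_edge_metrics(pending: int, dlq: int) -> None:
    jobs_pending.set(pending)
    dlq_size.set(dlq)

start_http_server(9100)  # scrape endpoint; the port is an arbitrary choice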

Implementation Tips

1. Event Schema Versioning

Always include schema version in events:

{
  "schema_version": "2.1",
  "event_type": "door_access",
  "timestamp": "2024-01-15T10:30:00Z",
  "payload": { ... }
}
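
The version field pays off in the consumer, where a small dispatch table lets old and new producers coexist; the 2.0 layout below is purely hypothetical:
# Sketch: version-aware parsing
def parse_v2_1(event: dict) -> dict:
    return event["payload"]

def parse_v2_0(event: dict) -> dict:
    # hypothetical older layout where payload fields sat at the top level
    return {k: v for k, v in event.items()
            if k not in ("schema_version", "event_type", "timestamp")}

PARSERS = {"2.1": parse_v2_1, "2.0": parse_v2_0}

def parse_event(event: dict) -> dict:
    version = event.get("schema_version", "2.0")
    if version not in PARSERS:
        raise ValueError(f"unsupported schema_version: {version}")
    return PARSERS[version](event)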

2. Idempotent Processing

Include event IDs for deduplication:

{
  "event_id": "uuid-v4",
  "idempotency_key": "site-123-door-456-2024-01-15T10:30:00Z"
}
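
A minimal dedup sketch; Redis is just one possible store for the "already seen" keys and is an assumption here, not something the architecture mandates:
# Sketch: skip duplicate deliveries using SET NX
import redis

r = redis.Redis(host="localhost", port=6379)

def is_first_delivery(event: dict) -> bool:
    # set() with nx=True succeeds only if the key does not exist yet;
    # ex=86400 lets keys expire after a day
    return bool(r.set(f"seen:{event['idempotency_key']}", "1", nx=True, ex=86400))

def process_once(event: dict) -> None:
    if not is_first_delivery(event):
        return              # duplicate delivery (SQS is at-least-once): skip
    process_event(event)    # placeholder for the real processing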

3. Partner Retry Strategy

Use exponential backoff for partner delivery:

# Retry partner delivery with increasing backoff between attempts
from time import sleep

delays = [1, 5, 15, 60, 300]  # seconds to wait after each failed attempt
for attempt, delay in enumerate(delays, start=1):
    if deliver_to_partner(event):
        break                      # delivered successfully
    if attempt < len(delays):
        sleep(delay)               # back off before retrying
else:
    move_to_partner_dlq(event)     # all attempts failed

Conclusion

This architecture provides:

  • Reliability - No event loss, even with network issues
  • Scalability - SQS and parallel consumers handle growth
  • Flexibility - Easy to add new consumers or partners
  • Auditability - Complete event history in S3

The key principles:

  1. Buffer locally, deliver remotely
  2. Persist before acknowledging
  3. Transform at the boundary
  4. Archive everything

Start simple with the core flow, then add partners and analytics consumers as needed.

Acknowledgements
  • Vinayak β€” Organising the literature