ML Pipeline Monitoring

TL;DR: A comprehensive guide for monitoring machine learning pipeline components using Zabbix, including Hadoop, Spark, HBase, Elasticsearch, and PredictionIO.

Zabbix monitoring configuration for ML infrastructure

Overview

This document provides detailed configuration for monitoring machine learning pipeline components using Zabbix. The monitoring covers:

  • Hadoop (NameNode, DataNode, SecondaryNameNode)
  • Spark (Master, Worker)
  • HBase (Master)
  • Elasticsearch
  • PredictionIO Event Server
  • ML Pipelines

Landscape Preparation

Each component requires JVM options to enable process monitoring through Zabbix.
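The technique is identical for every component: a dummy `-D` system property is ignored by the JVM but appears in the process's command line, so Zabbix's `proc.*` items can match on it. A minimal local sketch of the idea, using `sleep` as a stand-in for a JVM (the marker name and command are illustrative only):

```shell
# Tag a process with a unique string in its command line ('sleep 30' stands
# in for a JVM here; with Java the tag would be e.g. -Dmy.component.mon=true).
sleep 30 &
TAG_PID=$!

# Zabbix's proc.num[,<user>,,<marker>] matches full command lines much like
# this local check; the [s] bracket keeps grep from matching itself.
ps -fu "$(whoami)" | grep -c '[s]leep 30'   # counts the tagged process

kill "$TAG_PID"
```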

Hadoop Startup Augmentation

# Stop HDFS
/usr/local/hadoop/sbin/stop-dfs.sh

# Edit hadoop-env.sh
vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Add monitoring flags:

export HADOOP_NAMENODE_OPTS="-Dhadoop.namenode.mon=true $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.datanode.mon=true $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.secondarynamenode.mon=true $HADOOP_SECONDARYNAMENODE_OPTS"
# Start HDFS
/usr/local/hadoop/sbin/start-dfs.sh

# Verify processes
ps -fu organisation

# Test Zabbix keys
zabbix_get -s 192.168.136.90 -k 'proc.num[,organisation,,hadoop.namenode.mon]'
zabbix_get -s 192.168.136.90 -k 'proc.num[,organisation,,hadoop.datanode.mon]'
zabbix_get -s 192.168.136.90 -k 'proc.num[,organisation,,hadoop.secondarynamenode.mon]'

Spark Startup Augmentation

# Stop Spark
/usr/local/spark/sbin/stop-all.sh

# Edit spark-env.sh
vi /usr/local/spark/conf/spark-env.sh

Add monitoring flags:

SPARK_MASTER_OPTS="-Dspark.master.mon=true $SPARK_MASTER_OPTS"
SPARK_WORKER_OPTS="-Dspark.worker.mon=true $SPARK_WORKER_OPTS"
# Start Spark
/usr/local/spark/sbin/start-all.sh

# Verify processes
ps -fu organisation

# Test Zabbix keys (process count, CPU, memory)
zabbix_get -s 192.168.136.90 -k 'proc.num[,organisation,,spark.master.mon]'
zabbix_get -s 192.168.136.90 -k 'proc.num[,organisation,,spark.worker.mon]'
zabbix_get -s 192.168.136.90 -k 'proc.cpu.util[,organisation,,spark.master.mon]'
zabbix_get -s 192.168.136.90 -k 'proc.cpu.util[,organisation,,spark.worker.mon]'
zabbix_get -s 192.168.136.90 -k 'proc.mem[,organisation,,spark.master.mon,rss]'
zabbix_get -s 192.168.136.90 -k 'proc.mem[,organisation,,spark.worker.mon,rss]'
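The empty positions between the commas are positional parameters. Per the Zabbix agent item documentation, the layouts are roughly as follows (worth double-checking against the documentation for your Zabbix version):

```shell
# Positional parameters of the process items used above (empty = match any):
#   proc.num[<name>,<user>,<state>,<cmdline>]
#   proc.cpu.util[<name>,<user>,<type>,<cmdline>,<mode>]
#   proc.mem[<name>,<user>,<mode>,<cmdline>,<memtype>]
# All keys in this guide leave <name> empty and match on <user> = organisation
# plus the -D marker as a <cmdline> substring; 'rss' selects resident set size.
```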

HBase Startup Augmentation

# Stop HBase
/usr/local/hbase/bin/stop-hbase.sh

# Edit hbase-env.sh
vi /usr/local/hbase/conf/hbase-env.sh

Add monitoring flag:

export HBASE_MASTER_OPTS="-Dhbase.master.mon=true $HBASE_MASTER_OPTS"
# Start HBase
/usr/local/hbase/bin/start-hbase.sh

# Test Zabbix key
zabbix_get -s 192.168.136.90 -k 'proc.num[,organisation,,hbase.master.mon]'

Elasticsearch Startup Augmentation

# Find and stop Elasticsearch (the [e] bracket keeps grep from matching itself)
ps aux | grep '[e]lasticsearch'
kill <elasticsearch-pid>

# Edit jvm.options
vi /usr/local/elasticsearch/config/jvm.options

Add monitoring flag:

-Delastic.master.mon=true
# Start Elasticsearch (-d already daemonizes, so no trailing & is needed)
/usr/local/elasticsearch/bin/elasticsearch -d

# Test Zabbix key
zabbix_get -s 192.168.136.90 -k 'proc.num[,organisation,,elastic.master.mon]'


Event Server Startup Augmentation

# Find and stop the event server (the [e] bracket keeps grep from matching itself)
ps aux | grep '[e]ventserver'
kill <eventserver-pid>

# Edit profile
vi ~/.profile

Add monitoring flag:

export JAVA_OPTS="-Dpio.eventserver.mon=true $JAVA_OPTS"
# Apply changes
source ~/.profile

# Start event server
nohup pio eventserver --ip 192.168.136.90 --port 7070 &

# Test Zabbix key
zabbix_get -s 192.168.136.90 -k 'proc.num[,organisation,,pio.eventserver.mon]'

Pipeline Startup Augmentation

# Find the PID listening on the deploy port, then stop it
/usr/bin/lsof -t -i:<port>
kill <deploy-pid>

Add to deploy script:

-Dpio.pipeline.$PORT.mon=true
# Start pipeline
./deploy

# Test Zabbix key
zabbix_get -s 192.168.136.90 -k 'proc.num[,organisation,,pio.pipeline.17071.mon]'
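The deploy script itself is not reproduced in this guide. As an illustration only (the `pio deploy` invocation, the `PORT` variable, and the log redirection below are assumptions, not the actual script), the marker can be injected through `JAVA_OPTS` before deploying:

```shell
# Hypothetical deploy script sketch. Only the -Dpio.pipeline.$PORT.mon=true
# marker comes from this document; everything else is an assumption.
PORT=17071
export JAVA_OPTS="-Dpio.pipeline.$PORT.mon=true $JAVA_OPTS"
nohup pio deploy --port "$PORT" > "deploy_$PORT.log" 2>&1 &
```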

Item Key Reference

Hadoop Items

Application          Item Key
NameNode             proc.num[,organisation,,hadoop.namenode.mon]
DataNode             proc.num[,organisation,,hadoop.datanode.mon]
SecondaryNameNode    proc.num[,organisation,,hadoop.secondarynamenode.mon]

Spark Items

Metric           Item Key
Master Process   proc.num[,organisation,,spark.master.mon]
Worker Process   proc.num[,organisation,,spark.worker.mon]
Master CPU       proc.cpu.util[,organisation,,spark.master.mon]
Worker CPU       proc.cpu.util[,organisation,,spark.worker.mon]
Master Memory    proc.mem[,organisation,,spark.master.mon,rss]
Worker Memory    proc.mem[,organisation,,spark.worker.mon,rss]

Other Components

Component               Item Key
HBase Master            proc.num[,organisation,,hbase.master.mon]
Elasticsearch           proc.num[,organisation,,elastic.master.mon]
PIO Event Server        proc.num[,organisation,,pio.eventserver.mon]
Pipeline (port 17071)   proc.num[,organisation,,pio.pipeline.17071.mon]

Log Monitoring

Note that log[] monitors a single fixed file, while logrt[] accepts a filename regular expression; the latter is needed here because each run writes a timestamped log, and it is what the trigger section below uses.

Log Type        Item Key
Train Errors    logrt[/var/log/organisation/pio/content-similarity/logs/train_.*.log,"ERROR"]
Deploy Errors   logrt[/var/log/organisation/pio/content-similarity/logs/deploy_.*.log,"ERROR"]
Train OOM       logrt[/var/log/organisation/pio/content-similarity/logs/train_.*.log,"java.lang.OutOfMemoryError"]
Deploy OOM      logrt[/var/log/organisation/pio/content-similarity/logs/deploy_.*.log,"java.lang.OutOfMemoryError"]
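Before wiring these items into Zabbix, it is worth confirming locally that the patterns match what the JVM actually writes. A throwaway-log sketch (the sample log lines are invented for illustration):

```shell
# Write two representative lines to a scratch log and check both patterns.
LOG=$(mktemp)
printf '%s\n' \
  '[ERROR] [TaskSetManager] Task 3 in stage 0.0 failed' \
  'Exception in thread "main" java.lang.OutOfMemoryError: Java heap space' \
  > "$LOG"

grep -c 'ERROR' "$LOG"                        # matches the first line only
grep -c 'java.lang.OutOfMemoryError' "$LOG"   # matches the OOM line
rm -f "$LOG"
```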

Trigger Configuration

Process Count Triggers

Ensure exactly one process is running:

{HMLP components:proc.num[,organisation,,hadoop.namenode.mon].last()}<>1
{HMLP components:proc.num[,organisation,,hadoop.datanode.mon].last()}<>1
{HMLP components:proc.num[,organisation,,spark.master.mon].last()}<>1
{HMLP components:proc.num[,organisation,,hbase.master.mon].last()}<>1
{HMLP components:proc.num[,organisation,,elastic.master.mon].last()}<>1
{HMLP components:proc.num[,organisation,,pio.eventserver.mon].last()}<>1

For workers (at least one):

{HMLP components:proc.num[,organisation,,spark.worker.mon].last()}<1

CPU Utilization Triggers

Alert on sustained high CPU, averaged over one hour. proc.cpu.util can exceed 100% because each busy core contributes up to 100%, hence the 140-160% thresholds:

{HMLP components:proc.cpu.util[,organisation,,spark.master.mon].avg(3600)}>=140
{HMLP components:proc.cpu.util[,organisation,,spark.worker.mon].avg(3600)}>=140
{HML content similarity:proc.cpu.util[,organisation,,pio.pipeline.17071.mon].avg(3600)}>=160

Memory Triggers

Alert when memory exceeds 2GB (2147483648 bytes):

{HMLP components:proc.mem[,organisation,,spark.master.mon,rss].last(#10)}>2147483648
{HMLP components:proc.mem[,organisation,,spark.worker.mon,rss].last(#10)}>2147483648
{HML content similarity:proc.mem[,organisation,,pio.pipeline.17071.mon,rss].last(#10)}>2147483648
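proc.mem[...,rss] reports bytes, so the threshold must be written out in bytes; the constant is simply:

```shell
# 2 GiB in bytes, as used in the memory triggers above
echo $((2 * 1024 * 1024 * 1024))   # prints 2147483648
```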

Log Triggers

Alert on error patterns in logs:

{HML content similarity:logrt[/var/log/organisation/pio/content-similarity/logs/train_.*.log,"ERROR"].strlen()}>0
{HML content similarity:logrt[/var/log/organisation/pio/content-similarity/logs/deploy_.*.log,"ERROR"].strlen()}>0
{HML content similarity:logrt[/var/log/organisation/pio/content-similarity/logs/train_.*.log,"java.lang.OutOfMemoryError"].strlen()}>0
{HML content similarity:logrt[/var/log/organisation/pio/content-similarity/logs/deploy_.*.log,"java.lang.OutOfMemoryError"].strlen()}>0

Useful Commands

# Test Zabbix agent locally
zabbix_get -s localhost -k 'proc.num[zabbix_agentd,zabbix]'

# List user processes
ps -fu organisation

# Monitor user processes
top -u organisation

Configure alerts and dashboards in Zabbix UI based on these templates for comprehensive ML pipeline monitoring.