ML Constituent Setup

TL;DR: A comprehensive guide for setting up a PredictionIO machine learning pipeline with Hadoop, Spark, HBase, and Elasticsearch.

Complete setup guide for PredictionIO and its dependencies

Overview

This document provides step-by-step instructions for setting up a PredictionIO (PIO) machine learning pipeline. The stack includes:

  • Hadoop 2.8.2 - Distributed storage (HDFS)
  • Spark 2.1.2 - Distributed computing
  • HBase 1.3.2 - Event data storage
  • Elasticsearch 5.6.4 - Metadata storage
  • PredictionIO 0.13.0 - ML server

Prerequisites

Prepare Machine

# Create organisation user
sudo adduser organisation
sudo usermod -a -G sudo organisation
sudo su - organisation

Configure Hosts

Edit /etc/hosts:

192.168.136.90 quest-master

Setup SSH Keys

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh quest-master  # Test passwordless SSH

Download Components

mkdir ~/tmp && cd ~/tmp

wget http://archive.apache.org/dist/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
wget http://archive.apache.org/dist/spark/spark-2.1.2/spark-2.1.2-bin-hadoop2.7.tgz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.4.tar.gz
wget http://archive.apache.org/dist/hbase/1.3.2/hbase-1.3.2-bin.tar.gz
wget http://archive.apache.org/dist/predictionio/0.13.0/apache-predictionio-0.13.0.tar.gz

Java Setup

sudo apt-get update
sudo apt-get install openjdk-8-jdk
sudo update-alternatives --config java

Set JAVA_HOME in /etc/environment:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
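
A quick sanity check after logging back in (or after sourcing the file):

# Verify the JDK and JAVA_HOME
java -version
echo $JAVA_HOME   # expected: /usr/lib/jvm/java-8-openjdk-amd64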

Directory Structure

# Create directories
sudo mkdir -p /opt/{hadoop,spark,elasticsearch,hbase,pio}
sudo chown organisation:organisation /opt/{hadoop,spark,elasticsearch,hbase,pio}

# Extract and move
cd ~/tmp
tar zxvf hadoop-2.8.2.tar.gz && sudo mv hadoop-2.8.2/ /opt/hadoop/
tar zxvf spark-2.1.2-bin-hadoop2.7.tgz && sudo mv spark-2.1.2-bin-hadoop2.7/ /opt/spark/
tar zxvf elasticsearch-5.6.4.tar.gz && sudo mv elasticsearch-5.6.4/ /opt/elasticsearch/
tar zxvf hbase-1.3.2-bin.tar.gz && sudo mv hbase-1.3.2/ /opt/hbase/

# Create symlinks
sudo ln -s /opt/hadoop/hadoop-2.8.2 /usr/local/hadoop
sudo ln -s /opt/spark/spark-2.1.2-bin-hadoop2.7 /usr/local/spark
sudo ln -s /opt/elasticsearch/elasticsearch-5.6.4 /usr/local/elasticsearch
sudo ln -s /opt/hbase/hbase-1.3.2 /usr/local/hbase

Hadoop Configuration

Environment Variables

Add to ~/.bashrc:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
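
To confirm the Hadoop binaries and environment are picked up, a quick check:

source ~/.bashrc
hadoop version    # should report Hadoop 2.8.2
which hdfs        # should resolve to /usr/local/hadoop/bin/hdfs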

Core Configuration

Edit $HADOOP_HOME/etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://quest-master:9000</value>
    </property>
</configuration>
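
If the HDFS daemons do not inherit JAVA_HOME from the login shell, it can also be set explicitly in hadoop-env.sh (same JDK path as above):

# $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64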

HDFS Configuration

Edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.data.dir</name>
        <value>file:///usr/local/hadoop/dfs/name/data</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>file:///usr/local/hadoop/dfs/name</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>192.168.136.90:50070</value>
    </property>
</configuration>

Initialize HDFS

bin/hdfs namenode -format   # only the NameNode needs formatting
sbin/start-dfs.sh

# Create required directories
hdfs dfs -mkdir /hbase
hdfs dfs -mkdir /zookeeper
hdfs dfs -mkdir /models
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/organisation

Web UI: http://192.168.136.90:50070/
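
A command-line check that HDFS is healthy (assumes the directories above were created):

hdfs dfsadmin -report    # the DataNode should be listed as live
hdfs dfs -ls /           # should show /hbase, /zookeeper, /models and /user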

Spark Configuration

Master/Slave Files

Edit conf/masters:

quest-master

Edit conf/slaves:

quest-master

Spark Environment

Edit conf/spark-env.sh:

SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=3600 -Dspark.worker.cleanup.appDataTtl=172800"
SPARK_LOCAL_IP=192.168.136.90

Start Spark:

sbin/start-all.sh

Web UI: http://192.168.136.90:8080/
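
As an optional smoke test, the bundled SparkPi example can be submitted against the master (run-example forwards options to spark-submit):

/usr/local/spark/bin/run-example --master spark://quest-master:7077 SparkPi 10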

Elasticsearch Configuration

Edit /usr/local/elasticsearch/config/elasticsearch.yml:

cluster.name: quest-cluster
http.host: 192.168.136.90
discovery.zen.ping.unicast.hosts: ["quest-master"]

Memory settings in config/jvm.options:

-Xms512m
-Xmx512m

Start Elasticsearch:

source ~/.profile
bin/elasticsearch -d

Verify: http://192.168.136.90:9200/
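
Cluster health can also be checked through the standard Elasticsearch API:

curl 'http://192.168.136.90:9200/_cluster/health?pretty'   # status should be green or yellow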

HBase Configuration

Edit /usr/local/hbase/conf/hbase-site.xml:

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://quest-master:9000/hbase</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>hdfs://quest-master:9000/zookeeper</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>localhost</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>
</configuration>

Edit conf/hbase-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HBASE_MANAGES_ZK=true

Start HBase:

bin/start-hbase.sh

Web UI: http://192.168.136.90:16010
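
A short smoke test through the HBase shell; the table name below is only an example and is dropped again at the end:

/usr/local/hbase/bin/hbase shell <<'EOF'
status
create 'smoke_test', 'cf'
list
disable 'smoke_test'
drop 'smoke_test'
EOF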

PredictionIO Setup

Build PredictionIO

cd ~/tmp
mkdir pio_staging && cd pio_staging
tar zxvf ../apache-predictionio-0.13.0.tar.gz
cd apache-predictionio-0.13.0
export TERM=xterm-color  # Fix terminal issues on Ubuntu 18.04
./make-distribution.sh
tar zxvf PredictionIO-0.13.0.tar.gz
mv PredictionIO-0.13.0/ /opt/pio
sudo ln -s /opt/pio/PredictionIO-0.13.0 /usr/local/pio

Environment Configuration

Add to ~/.profile:

# Java
export JAVA_OPTS="-Xmx1g"
export ES_JAVA_OPTS="-Xmx512m"

# Spark
export MASTER=spark://quest-master:7077
export SPARK_HOME=/usr/local/spark

# PIO
export PATH=$PATH:/usr/local/pio/bin:/usr/local/pio

PIO Environment

Edit /usr/local/pio/conf/pio-env.sh:

SPARK_HOME=/usr/local/spark
ES_CONF_DIR=/usr/local/elasticsearch/config
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
HBASE_CONF_DIR=/usr/local/hbase/conf

# Storage paths
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# Repository configuration
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

# Elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=quest-cluster
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=quest-master

# HDFS
PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://quest-master:9000/models

# HBase
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=/usr/local/hbase
PIO_STORAGE_SOURCES_HBASE_HOSTS=quest-master

Start PredictionIO

source ~/.profile
pio-start-all
pio status

Start Event Server

nohup pio eventserver --ip 192.168.136.90 --port 7070 &
curl -i -X GET http://192.168.136.90:7070
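
Once the event server responds, a test event can be posted. This assumes an application has been created first with the pio CLI; the app name, JSON fields, and <ACCESS_KEY> placeholder below are only examples (use the key printed by pio app new):

pio app new TestApp   # prints the application's access key

curl -i -X POST "http://192.168.136.90:7070/events.json?accessKey=<ACCESS_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "event": "rate",
    "entityType": "user",
    "entityId": "u0",
    "targetEntityType": "item",
    "targetEntityId": "i0",
    "properties": { "rating": 5 },
    "eventTime": "2018-01-01T00:00:00.000Z"
  }'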

Verification

Use jps -l to check that all Java daemons are running, then verify each service on its port:

Service             Port
HDFS NameNode       50070
Spark Master        8080
Elasticsearch       9200
HBase Master        16010
PIO Event Server    7070
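
A rough one-liner to spot-check the expected Java daemons (the grep patterns are deliberately loose; exact class names vary by component):

jps -l | egrep -i 'namenode|datanode|spark|hbase|elasticsearch'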

Cluster Configuration (HA Setup)

For a small HA cluster, configure multiple nodes: add every node to /etc/hosts on all machines and extend the masters/slaves files. A sketch for syncing configuration to the slaves follows the listings below.

# /etc/hosts on all nodes
10.0.0.1 some-master
10.0.0.2 some-slave-1
10.0.0.3 some-slave-2

Update masters/slaves files:

# masters
some-master

# slaves
some-master
some-slave-1
some-slave-2
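
A minimal sketch for pushing the Hadoop, Spark, and HBase configuration from the master to the slave nodes (hostnames and paths follow the examples above; rsync over the passwordless SSH set up earlier is assumed):

for host in some-slave-1 some-slave-2; do
  rsync -az /usr/local/hadoop/etc/hadoop/ "$host":/usr/local/hadoop/etc/hadoop/
  rsync -az /usr/local/spark/conf/        "$host":/usr/local/spark/conf/
  rsync -az /usr/local/hbase/conf/        "$host":/usr/local/hbase/conf/
done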

This guide covers single-machine and basic cluster setups. For production deployments, consider additional security, monitoring, and backup configurations.