ML Constituent Setup

TL;DR: A comprehensive guide for setting up a PredictionIO machine learning pipeline with Hadoop, Spark, HBase, and Elasticsearch.

Complete setup guide for PredictionIO and its dependencies

Overview

This document provides step-by-step instructions for setting up a PredictionIO (PIO) machine learning pipeline. The stack includes:

  • Hadoop 2.8.2 - Distributed storage (HDFS)
  • Spark 2.1.2 - Distributed computing
  • HBase 1.3.2 - Event data storage
  • Elasticsearch 5.6.4 - Metadata storage
  • PredictionIO 0.13.0 - ML server

Prerequisites

Prepare Machine

# Create organisation user
sudo adduser organisation
sudo usermod -a -G sudo organisation
sudo su - organisation

Configure Hosts

Edit /etc/hosts:

192.168.136.90 quest-master

Setup SSH Keys

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh quest-master  # Test passwordless SSH

Download Components

mkdir ~/tmp && cd ~/tmp

wget http://archive.apache.org/dist/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
wget http://archive.apache.org/dist/spark/spark-2.1.2/spark-2.1.2-bin-hadoop2.7.tgz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.4.tar.gz
wget http://archive.apache.org/dist/hbase/1.3.2/hbase-1.3.2-bin.tar.gz
wget http://archive.apache.org/dist/predictionio/0.13.0/apache-predictionio-0.13.0.tar.gz

Java Setup

sudo apt-get update
sudo apt-get install openjdk-8-jdk
sudo update-alternatives --config java

Set JAVA_HOME in /etc/environment:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
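
A quick sanity check after logging back in (or after sourcing the file):

# Verify the JDK and JAVA_HOME
java -version
echo $JAVA_HOME   # expected: /usr/lib/jvm/java-8-openjdk-amd64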

Directory Structure

# Create directories
sudo mkdir -p /opt/{hadoop,spark,elasticsearch,hbase,pio}
sudo chown organisation:organisation /opt/{hadoop,spark,elasticsearch,hbase,pio}

# Extract and move
cd ~/tmp
tar zxvf hadoop-2.8.2.tar.gz && sudo mv hadoop-2.8.2/ /opt/hadoop/
tar zxvf spark-2.1.2-bin-hadoop2.7.tgz && sudo mv spark-2.1.2-bin-hadoop2.7/ /opt/spark/
tar zxvf elasticsearch-5.6.4.tar.gz && sudo mv elasticsearch-5.6.4/ /opt/elasticsearch/
tar zxvf hbase-1.3.2-bin.tar.gz && sudo mv hbase-1.3.2/ /opt/hbase/

# Create symlinks
sudo ln -s /opt/hadoop/hadoop-2.8.2 /usr/local/hadoop
sudo ln -s /opt/spark/spark-2.1.2-bin-hadoop2.7 /usr/local/spark
sudo ln -s /opt/elasticsearch/elasticsearch-5.6.4 /usr/local/elasticsearch
sudo ln -s /opt/hbase/hbase-1.3.2 /usr/local/hbase

Hadoop Configuration

Environment Variables

Add to ~/.bashrc:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
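
To confirm the Hadoop binaries and environment are picked up, a quick check:

source ~/.bashrc
hadoop version    # should report Hadoop 2.8.2
which hdfs        # should resolve to /usr/local/hadoop/bin/hdfs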

Core Configuration

Edit $HADOOP_HOME/etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://quest-master:9000</value>
    </property>
</configuration>
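
If the HDFS daemons do not inherit JAVA_HOME from the login shell, it can also be set explicitly in hadoop-env.sh (same JDK path as above):

# $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64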

HDFS Configuration

Edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.data.dir</name>
        <value>file:///usr/local/hadoop/dfs/name/data</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>file:///usr/local/hadoop/dfs/name</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>192.168.136.90:50070</value>
    </property>
</configuration>

Initialize HDFS

bin/hdfs namenode -format   # only the NameNode needs formatting
sbin/start-dfs.sh

# Create required directories
hdfs dfs -mkdir /hbase
hdfs dfs -mkdir /zookeeper
hdfs dfs -mkdir /models
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/organisation

Web UI: http://192.168.136.90:50070/
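
A command-line check that HDFS is healthy (assumes the directories above were created):

hdfs dfsadmin -report    # the DataNode should be listed as live
hdfs dfs -ls /           # should show /hbase, /zookeeper, /models and /user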

Spark Configuration

Master/Slave Files

Edit conf/masters:

quest-master

Edit conf/slaves:

quest-master

Spark Environment

Edit conf/spark-env.sh:

SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=3600 -Dspark.worker.cleanup.appDataTtl=172800"
SPARK_LOCAL_IP=192.168.136.90

Start Spark:

sbin/start-all.sh

Web UI: http://192.168.136.90:8080/
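
As an optional smoke test, the bundled SparkPi example can be submitted against the master (run-example forwards options to spark-submit):

/usr/local/spark/bin/run-example --master spark://quest-master:7077 SparkPi 10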

Elasticsearch Configuration

Edit /usr/local/elasticsearch/config/elasticsearch.yml:

cluster.name: quest-cluster
http.host: 192.168.136.90
discovery.zen.ping.unicast.hosts: ["quest-master"]

Memory settings in config/jvm.options:

-Xms512m
-Xmx512m

Start Elasticsearch:

source ~/.profile
bin/elasticsearch -d

Verify: http://192.168.136.90:9200/
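
Cluster health can also be checked through the standard Elasticsearch API:

curl 'http://192.168.136.90:9200/_cluster/health?pretty'   # status should be green or yellow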

HBase Configuration

Edit /usr/local/hbase/conf/hbase-site.xml:

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://quest-master:9000/hbase</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>hdfs://quest-master:9000/zookeeper</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>localhost</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>
</configuration>

Edit conf/hbase-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HBASE_MANAGES_ZK=true

Start HBase:

bin/start-hbase.sh

Web UI: http://192.168.136.90:16010
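
A short smoke test through the HBase shell; the table name below is only an example and is dropped again at the end:

/usr/local/hbase/bin/hbase shell <<'EOF'
status
create 'smoke_test', 'cf'
list
disable 'smoke_test'
drop 'smoke_test'
EOF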

PredictionIO Setup

Build PredictionIO

cd ~/tmp
mkdir pio_staging && cd pio_staging
tar zxvf ../apache-predictionio-0.13.0.tar.gz
cd apache-predictionio-0.13.0
export TERM=xterm-color  # Fix terminal issues on Ubuntu 18.04
./make-distribution.sh
tar zxvf PredictionIO-0.13.0.tar.gz
mv PredictionIO-0.13.0/ /opt/pio
sudo ln -s /opt/pio/PredictionIO-0.13.0 /usr/local/pio

Environment Configuration

Add to ~/.profile:

# Java
export JAVA_OPTS="-Xmx1g"
export ES_JAVA_OPTS="-Xmx512m"

# Spark
export MASTER=spark://quest-master:7077
export SPARK_HOME=/usr/local/spark

# PIO
export PATH=$PATH:/usr/local/pio/bin:/usr/local/pio

PIO Environment

Edit /usr/local/pio/conf/pio-env.sh:

SPARK_HOME=/usr/local/spark
ES_CONF_DIR=/usr/local/elasticsearch/config
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
HBASE_CONF_DIR=/usr/local/hbase/conf

# Storage paths
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# Repository configuration
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

# Elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=quest-cluster
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=quest-master

# HDFS
PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://quest-master:9000/models

# HBase
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=/usr/local/hbase
PIO_STORAGE_SOURCES_HBASE_HOSTS=quest-master

Start PredictionIO

source ~/.profile
pio-start-all
pio status

Start Event Server

nohup pio eventserver --ip 192.168.136.90 --port 7070 &
curl -i -X GET http://192.168.136.90:7070
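
Once the event server responds, a test event can be posted. This assumes an application has been created first with the pio CLI; the app name, JSON fields, and <ACCESS_KEY> placeholder below are only examples (use the key printed by pio app new):

pio app new TestApp   # prints the application's access key

curl -i -X POST "http://192.168.136.90:7070/events.json?accessKey=<ACCESS_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "event": "rate",
    "entityType": "user",
    "entityId": "u0",
    "targetEntityType": "item",
    "targetEntityId": "i0",
    "properties": { "rating": 5 },
    "eventTime": "2018-01-01T00:00:00.000Z"
  }'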

Verification

Use jps -l to check that all Java daemons are running, then verify each service on its port:

Service             Port
HDFS NameNode       50070
Spark Master        8080
Elasticsearch       9200
HBase Master        16010
PIO Event Server    7070
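
A rough one-liner to spot-check the expected Java daemons (the grep patterns are deliberately loose; exact class names vary by component):

jps -l | egrep -i 'namenode|datanode|spark|hbase|elasticsearch'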

Cluster Configuration (HA Setup)

For a small HA cluster, configure multiple nodes: add every node to /etc/hosts on all machines and extend the masters/slaves files. A sketch for syncing configuration to the slaves follows the listings below.

# /etc/hosts on all nodes
10.0.0.1 some-master
10.0.0.2 some-slave-1
10.0.0.3 some-slave-2

Update masters/slaves files:

# masters
some-master

# slaves
some-master
some-slave-1
some-slave-2
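
A minimal sketch for pushing the Hadoop, Spark, and HBase configuration from the master to the slave nodes (hostnames and paths follow the examples above; rsync over the passwordless SSH set up earlier is assumed):

for host in some-slave-1 some-slave-2; do
  rsync -az /usr/local/hadoop/etc/hadoop/ "$host":/usr/local/hadoop/etc/hadoop/
  rsync -az /usr/local/spark/conf/        "$host":/usr/local/spark/conf/
  rsync -az /usr/local/hbase/conf/        "$host":/usr/local/hbase/conf/
done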

This guide covers single-machine and basic cluster setups. For production deployments, consider additional security, monitoring, and backup configurations.