How to Install and Use Apache PredictionIO for Machine Learning on CentOS 7
Traditional approaches to data analysis become impractical once datasets reach a certain size. A modern alternative for analyzing huge datasets is machine learning, which can produce accurate results given a fast and efficient algorithm.
Apache PredictionIO is an open source machine learning server used to create predictive engines for any machine learning task. It shortens the time it takes to move a machine learning application from lab to production by providing customizable engine templates which can be built and deployed quickly. It supplies the data collection and serving components, and abstracts the underlying technology to expose an API that lets developers focus on transformation components. Once the engine server of PredictionIO is deployed as a web service, it can respond to dynamic queries in real time.
Apache PredictionIO consists of the following components:
- PredictionIO Platform: An open source machine learning stack built on top of state-of-the-art open source applications such as Apache Spark, Apache Hadoop, Apache HBase and Elasticsearch.
- Event Server: This continuously gathers data from your web server or mobile application server in real-time mode or batch mode. The gathered data can be used to train the engine or to provide a unified view for data analysis. The event server uses Apache HBase to store the data.
- Engine Server: The engine server is responsible for making the actual prediction. It reads the training data from the data store and uses one or more machine learning algorithms to build the predictive models. Once deployed as a web service, an engine responds to the queries made by a web or mobile app through a REST API or SDK (a minimal REST sketch follows this list).
- Template Gallery: This gallery offers various types of pre-built engine templates. You can choose a template which is similar to your use case and modify it according to your requirements.
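To make the event server's role concrete, here is a minimal sketch in Python of how an application could record a user action over the event server's REST API once everything is running later in this tutorial. The access key, user ID and item ID are placeholders, and the requests library is assumed to be installed (pip install requests).

import requests

# Placeholder access key; "pio app new" prints the real one later in this tutorial.
ACCESS_KEY = 'YOUR_ACCESS_KEY'

# A hypothetical "view" event: user u1 viewed item i1.
event = {
    'event': 'view',
    'entityType': 'user',
    'entityId': 'u1',
    'targetEntityType': 'item',
    'targetEntityId': 'i1',
}

# POST the event to the event server, which listens on port 7070 by default.
response = requests.post(
    'http://localhost:7070/events.json',
    params={'accessKey': ACCESS_KEY},
    json=event,
)
print(response.status_code, response.json())  # 201 and an eventId on success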
Prerequisites
- A Vultr CentOS 7 server instance with at least 8GB RAM. For testing and development purposes, you can choose an instance with 4GB RAM and another 4GB of swap memory.
- A sudo user.
In this tutorial, we will use 192.0.2.1 as the public IP address of the server. Replace all occurrences of 192.0.2.1 with your Vultr public IP address.
Update your base system using the guide How to Update CentOS 7. Once your system has been updated, proceed to install Java.
Install Java
Many of the components of PredictionIO require JDK (Java Development Kit) version 8 to work. PredictionIO supports both OpenJDK and Oracle Java. In this tutorial, we will install OpenJDK version 8.
OpenJDK can be easily installed, as the package is available in the default YUM repository.
sudo yum -y install java-1.8.0-openjdk-devel
Verify Java's version to ensure it was installed correctly.
java -version
You will see similar output.
[user@vultr ~]$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
Before we can proceed further, we will need to set up the JAVA_HOME and JRE_HOME environment variables. Find the absolute path of the Java executable on your system.
readlink -f $(which java)
You will see a similar output.
[user@vultr ~]$ readlink -f $(which java)
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64/jre/bin/java
Now, set the JAVA_HOME and JRE_HOME environment variables according to the path of the Java directory.
echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64" >> ~/.bash_profile
echo "export JRE_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64/jre" >> ~/.bash_profile
Source the .bash_profile file so that the changes take effect.
source ~/.bash_profile
Now you can run the echo $JAVA_HOME command to check whether the environment variable is set.
[user@vultr ~]$ echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64
Install PredictionIO
Apache provides PredictionIO source files which can be downloaded and compiled locally. Create a new temporary directory to download and compile the source file.
mkdir /tmp/pio_sourcefiles && cd /tmp/pio_sourcefiles
Download the PredictionIO source file archive. The 0.12.0-incubating release is preserved in the Apache release archive; any Apache mirror that still hosts the release will work as well.
wget https://archive.apache.org/dist/incubator/predictionio/0.12.0-incubating/apache-predictionio-0.12.0-incubating.tar.gz
Extract the archive and compile the source to create a distribution of PredictionIO.
tar xf apache-predictionio-0.12.0-incubating.tar.gz
./make-distribution.sh
The distribution will be built against the default versions of the dependencies: Scala 2.11.8, Spark 2.1.1, Hadoop 2.7.3 and Elasticsearch 5.5.2. Wait for the build to finish; it will take around ten minutes, depending on your system's performance.
Note: You are free to use the latest supported versions of the dependencies, but you may see some warnings during the build as some functions might be deprecated. Run ./make-distribution.sh -Dscala.version=2.11.11 -Dspark.version=2.1.2 -Dhadoop.version=2.7.4 -Delasticsearch.version=5.5.3, replacing the version numbers according to your choice.
Once the build successfully finishes, you will see the following message at the end.
...
PredictionIO-0.12.0-incubating/python/pypio/__init__.py
PredictionIO-0.12.0-incubating/python/pypio/utils.py
PredictionIO-0.12.0-incubating/python/pypio/shell.py
PredictionIO binary distribution created at PredictionIO-0.12.0-incubating.tar.gz
The PredictionIO binary files will be saved in the PredictionIO-0.12.0-incubating.tar.gz archive. Extract the archive into the /opt directory and give ownership to the current user.
sudo tar xf PredictionIO-0.12.0-incubating.tar.gz -C /opt/
sudo chown -R $USER:$USER /opt/PredictionIO-0.12.0-incubating
Set the PIO_HOME environment variable.
echo "export PIO_HOME=/opt/PredictionIO-0.12.0-incubating" >> ~/.bash_profile
source ~/.bash_profile
Install Required Dependencies
Create a new directory to install PredictionIO dependencies such as HBase, Spark and Elasticsearch.
mkdir /opt/PredictionIO-0.12.0-incubating/vendors
Download Scala version 2.11.8 and extract it into the vendors directory.
wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
tar xf scala-2.11.8.tgz -C /opt/PredictionIO-0.12.0-incubating/vendors
Download Apache Hadoop version 2.7.3 and extract it into the vendors directory.
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xf hadoop-2.7.3.tar.gz -C /opt/PredictionIO-0.12.0-incubating/vendors
Apache Spark is the default processing engine for PredictionIO. Download Spark version 2.1.1 and extract it into the vendors directory.
wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
tar xf spark-2.1.1-bin-hadoop2.7.tgz -C /opt/PredictionIO-0.12.0-incubating/vendors
Download Elasticsearch version 5.5.2 and extract it into the vendors directory.
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.2.tar.gz
tar xf elasticsearch-5.5.2.tar.gz -C /opt/PredictionIO-0.12.0-incubating/vendors
Finally, download HBase version 1.2.6 and extract it into the vendors directory. Note that the Apache archive stores the release under its version directory rather than under stable.
wget https://archive.apache.org/dist/hbase/1.2.6/hbase-1.2.6-bin.tar.gz
tar xf hbase-1.2.6-bin.tar.gz -C /opt/PredictionIO-0.12.0-incubating/vendors
Open the hbase-site.xml configuration file to configure HBase to work in a standalone environment.
nano /opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/conf/hbase-site.xml
Find the empty configuration block and replace it with the following configuration. Note that the data paths point inside the /opt/PredictionIO-0.12.0-incubating installation directory, which the current user already owns.
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/zookeeper</value>
  </property>
</configuration>
The data directory will be created automatically by HBase. Edit the HBase environment file to set the JAVA_HOME path.
nano /opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/conf/hbase-env.sh
Uncomment line number 27 and set JAVA_HOME to the jre directory of your Java installation. You can find the path to the Java executable using the readlink -f $(which java) command, then drop the trailing /bin/java.
# The java implementation to use. Java 1.7+ required.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64/jre
Also, comment out line numbers 46 and 47, as they are not required for Java 8.
# Configure PermSize. Only needed in JDK7. You can safely remove it for JDK8+
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
Configure the PredictionIO Environment
The default configuration in the PredictionIO environment file pio-env.sh assumes that we are using PostgreSQL or MySQL. As we are using HBase and Elasticsearch, we will need to modify nearly every configuration in the file. It's best to take a backup of the existing file and create a new PredictionIO environment file.
mv /opt/PredictionIO-0.12.0-incubating/conf/pio-env.sh /opt/PredictionIO-0.12.0-incubating/conf/pio-env.sh.bak
Now create a new file for PredictionIO environment configuration.
nano /opt/PredictionIO-0.12.0-incubating/conf/pio-env.sh
Populate the file with the following configuration.
# PredictionIO Main Configuration
#
# This section controls core behavior of PredictionIO. It is very likely that
# you need to change these to fit your site.
# SPARK_HOME: Apache Spark is a hard dependency and must be configured.
SPARK_HOME=$PIO_HOME/vendors/spark-2.1.1-bin-hadoop2.7
# POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar
# MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar
# ES_CONF_DIR: You must configure this if you have advanced configuration for
# your Elasticsearch setup.
ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch-5.5.2/config
# HADOOP_CONF_DIR: You must configure this if you intend to run PredictionIO
# with Hadoop 2.
HADOOP_CONF_DIR=$PIO_HOME/vendors/spark-2.1.1-bin-hadoop2.7/conf
# HBASE_CONF_DIR: You must configure this if you intend to run PredictionIO
# with HBase on a remote cluster.
HBASE_CONF_DIR=$PIO_HOME/vendors/hbase-1.2.6/conf
# Filesystem paths that PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp
# PredictionIO Storage Configuration
#
# This section controls programs that make use of PredictionIO's built-in
# storage facilities. Default values are shown below.
#
# For more information on storage configuration please refer to
# http://predictionio.incubator.apache.org/system/anotherdatastore/
# Storage Repositories
# Default is to use PostgreSQL
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
# Storage Data Sources
# PostgreSQL Default Settings
# Please change "pio" to your database name in PIO_STORAGE_SOURCES_PGSQL_URL
# Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly
# PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
# PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio
# MySQL Example
# PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost/pio
# PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_MYSQL_PASSWORD=pio
# Elasticsearch Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=pio
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-5.5.2
# Optional basic HTTP auth
# PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
# Elasticsearch 1.x Example
# PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
# PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=<elasticsearch_cluster_name>
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-1.7.6
# Local File System Example
PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models
# HBase Example
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=$PIO_HOME/vendors/hbase-1.2.6
# AWS S3 Example
# PIO_STORAGE_SOURCES_S3_TYPE=s3
# PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio_bucket
# PIO_STORAGE_SOURCES_S3_BASE_PATH=pio_model
Save the file and exit from the editor.
Open the Elasticsearch configuration file.
nano /opt/PredictionIO-0.12.0-incubating/vendors/elasticsearch-5.5.2/config/elasticsearch.yml
Uncomment the cluster.name line and set the cluster name to exactly the same value as the one provided in the PredictionIO environment file, which is pio in the configuration above.
# Use a descriptive name for your cluster:
#
cluster.name: pio
Now add the $PIO_HOME/bin directory to the PATH variable so that the PredictionIO executables can be run directly.
echo "export PATH=$PATH:$PIO_HOME/bin" >> ~/.bash_profile
source ~/.bash_profile
At this point, PredictionIO is successfully installed on your server.
Starting PredictionIO
You can start all of the PredictionIO services, such as Elasticsearch, HBase and the Event Server, using a single command.
pio-start-all
You will see the following output.
[user@vultr ~]$ pio-start-all
Starting Elasticsearch...
Starting HBase...
starting master, logging to /opt/PredictionIO-0.12.0-incubating/vendors/hbase-1.2.6/bin/../logs/hbase-user-master-vultr.guest.out
Waiting 10 seconds for Storage Repositories to fully initialize...
Starting PredictionIO Event Server...
Use the following command to check the status of the PredictionIO server.
pio status
You will see the following output.
[user@vultr ~]$ pio status
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.0-incubating is installed at /opt/PredictionIO-0.12.0-incubating
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at /opt/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.7
[INFO] [Management$] Apache Spark 2.1.1 detected (meets minimum requirement of 1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: LOCALFS)...
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
[INFO] [Storage$] Test writing to Event Store (App Id 0)...
[INFO] [HBLEvents] The namespace pio_event doesn't exist yet. Creating now...
[INFO] [HBLEvents] The table pio_event:events_0 doesn't exist yet. Creating now...
[INFO] [HBLEvents] Removing table pio_event:events_0...
[INFO] [Management$] Your system is all ready to go.
As the messages above show, our system is ready to implement an engine template and predict data.
Implementing an Engine Template
Several ready-to-use engine templates are available in the PredictionIO Template Gallery and can be easily installed on the PredictionIO server. You are free to browse the list of engine templates to find one that is close to your requirements, or you can write your own engine.
In this tutorial, we will implement the E-Commerce Recommendation engine template to demonstrate the functionality of the PredictionIO server using some sample data. This engine template provides personalized recommendations to users of an e-commerce website. By default, it has features such as excluding out-of-stock items and providing recommendations to a user who signs up after the model is trained. Also by default, the engine template takes a user's view and buy events, items with categories and properties, and a list of unavailable items. Once the engine has been trained and deployed, you can send a query with a user ID and the number of items to recommend. The generated output will be a ranked list of recommended item IDs.
Install Git, as it will be used to clone the repository.
cd ~
sudo yum -y install git
Clone the E-Commerce Recommender engine template on your system.
git clone https://github.com/apache/incubator-predictionio-template-ecom-recommender.git MyEComRecomm
Create a new application for the E-Commerce Recommendation template engine. Each application in PredictionIO stores the data for a separate website. If you have multiple websites, you can create multiple apps to store each website's data in a different application. You are free to choose any name for your application.
cd MyEComRecomm/
pio app new myecom
You will see the following output.
[user@vultr MyEComRecomm]$ pio app new myecom
[INFO] [HBLEvents] The table pio_event:events_1 doesn't exist yet. Creating now...
[INFO] [App$] Initialized Event Store for this app ID: 1.
[INFO] [Pio$] Created a new app:
[INFO] [Pio$] Name: myecom
[INFO] [Pio$] ID: 1
[INFO] [Pio$] Access Key: a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t
The output above also contains the access key, which will be used to authenticate when sending input data to the event server.
You can always find the access key, along with the list of available applications, by running the following command.
pio app list
You will see the following output containing a list of applications and the access key.
[user@vultr MyEComRecomm]$ pio app list
[INFO] [Pio$] Name | ID | Access Key | Allowed Event(s)
[INFO] [Pio$] myecom | 1 | a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t | (all)
[INFO] [Pio$] Finished listing 1 app(s).
Now that we have created a new application, we will add some data to it. In a production environment, you would want to send data to the event server automatically by integrating the event server API into your application. To learn how PredictionIO works, we will import some sample data instead. The template engine provides a Python script which can be easily used to import the sample data into the event server.
Install Python pip.
sudo yum -y install python-pip
sudo pip install --upgrade pip
Install PredictionIO Python SDK using pip.
sudo pip install predictionio
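The same SDK that the import script relies on can also send events from your own application code, which is how you would integrate the event server in production. Below is a minimal, hypothetical sketch; the access key, user ID and item ID are placeholders to replace with your own values.

import predictionio

# Connect to the event server; replace the access key with the one
# printed by "pio app new".
client = predictionio.EventClient(
    access_key='YOUR_ACCESS_KEY',
    url='http://localhost:7070',
)

# Record that a hypothetical user "u1" bought item "i1".
client.create_event(
    event='buy',
    entity_type='user',
    entity_id='u1',
    target_entity_type='item',
    target_entity_id='i1',
)

client.close()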
Run the Python script to add the sample data to the event server.
python data/import_eventserver.py --access_key a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t
Make sure to replace the access key with your actual access key. You will see a similar output.
[user@vultr MyEComRecomm]$ python data/import_eventserver.py --access_key a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t
Namespace(access_key='a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t', url='http://localhost:7070')
{u'status': u'alive'}
Importing data...
('Set user', 'u1')
('Set user', 'u2')
...
('User', 'u10', 'buys item', 'i30')
('User', 'u10', 'views item', 'i40')
('User', 'u10', 'buys item', 'i40')
204 events are imported.
The above script imports 10 users, 50 items in 6 categories, and some random view and buy events. To verify that the events were imported, you can run the following query.
curl -i -X GET "http://localhost:7070/events.json?accessKey=a_DnDr4uyvjsKRldPoJAFMuPvb-QBz-BhUFyGehXoTKbm89r00Gx4ygnqspTJx4t"
The output will show you the list of all the imported events in JSON format.
Now, open the engine.json file in your editor. This file contains the configuration of the engine.
nano engine.json
Find both occurrences of appName and replace the value with the actual name of the app you created earlier.
{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.example.ecommercerecommendation.ECommerceRecommendationEngine",
  "datasource": {
    "params": {
      "appName": "myecom"
    }
  },
  "algorithms": [
    {
      "name": "ecomm",
      "params": {
        "appName": "myecom",
        "unseenOnly": true,
        "seenEvents": ["buy", "view"],
        "similarEvents": ["view"],
        "rank": 10,
        "numIterations": 20,
        "lambda": 0.01,
        "seed": 3
      }
    }
  ]
}
Build the application.
pio build --verbose
If you do not want to see the log messages, remove the --verbose option. Building the engine template for the first time will take a few minutes. You will see similar output when the build finishes successfully.
[user@vultr MyEComRecomm]$ pio build --verbose
[INFO] [Engine$] Using command '/opt/PredictionIO-0.12.0-incubating/sbt/sbt' at /home/user/MyEComRecomm to build.
...
[INFO] [Engine$] Build finished successfully.
[INFO] [Pio$] Your engine is ready for training.
Now train the engine. During training, the engine analyzes the data set and trains itself according to the provided algorithm.
pio train
Before we deploy the application, we will need to open port 8000 so that the status of the application can be viewed on the web GUI. Also, the websites and applications using the engine will send and receive their queries through this port.
sudo firewall-cmd --zone=public --permanent --add-port=8000/tcp
sudo firewall-cmd --reload
Now you can deploy the PredictionIO engine.
pio deploy
The above command will deploy the engine and the built-in web server on port 8000 to respond to queries from e-commerce websites and applications. You will see the following output at the end once the engine is successfully deployed.
[INFO] [HttpListener] Bound to /0.0.0.0:8000
[INFO] [MasterActor] Engine is deployed and running. Engine API is live at http://0.0.0.0:8000.
You can verify the status of the engine by going to http://192.0.2.1:8000 using any modern browser. Make sure that you replace 192.0.2.1 with your actual Vultr IP address.
This signifies that the engine template for e-commerce recommendation is deployed and running successfully. You can query the engine to fetch five recommendations for user u5 by running the following query in a new terminal session.
curl -H "Content-Type: application/json" \
-d '{ "user": "u5", "num": 5 }' \
http://localhost:8000/queries.json
You will see the generated recommendations for user u5.
[user@vultr ~]$ curl -H "Content-Type: application/json" \
> -d '{ "user": "u5", "num": 5 }' \
> http://localhost:8000/queries.json
{"itemScores":[{"item":"i25","score":0.9985169366745619},{"item":"i10","score":0.996613946803819},{"item":"i27","score":0.996613946803819},{"item":"i17","score":0.9962796867639341},{"item":"i8","score":0.9955868705972656}]}
Wrapping Up
Congratulations, Apache PredictionIO has been successfully deployed on your server. You can now use the event server's API to import data into the engine and predict recommendations for users. If you want, you can try other templates from the template gallery. Be sure to check out the Universal Recommender engine template, which can be used in almost all use cases, including e-commerce, news and video.