How to Build a Machine Learning Classifier in Python with Scikit Learn

Updated on October 26, 2022

Introduction

Machine learning (ML) is a subset of Artificial Intelligence (AI) focused on building systems that improve at a task by learning from data. These systems run on machine learning algorithms that detect patterns in data and use them to make predictions. Examples of machine learning systems include speech recognition, email filtering, and computer vision.

Machine learning has two major approaches:

  • Supervised Learning: uses labeled datasets to train algorithms that can accurately classify data or predict outcomes. The model is fed input data, and its weights are adjusted iteratively until it fits the data appropriately. Use cases include spam classification in emails.
  • Unsupervised Learning: utilizes unlabeled datasets to discover patterns for association or clustering problems.

Scikit-Learn is a Python module for machine learning built on top of SciPy, NumPy, and Matplotlib. It provides efficient tools for predictive data analysis, including classification algorithms for applications like spam detection and image recognition, regression algorithms for applications like stock price analysis, and clustering algorithms for grouping applications.

This guide explains how to build a machine learning classifier in Python using Scikit-Learn.

Prerequisites

  • Working knowledge of Python.
  • Properly installed and configured Python toolchain, including pip (Python version >= 3.3).

Setting Up The Project Virtual Environment

To create an isolated virtual environment for your application:

  1. Install the virtualenv python package:

     $ pip install virtualenv
  2. Create the project directory:

     $ mkdir ml_classifier
  3. Navigate into the new directory:

     $ cd ml_classifier
  4. Create the virtual environment:

     $ python3 -m venv env

    This creates a new folder named env containing the scripts and libraries used to control the virtual environment.

  5. Activate the virtual environment:

     $ source env/bin/activate
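
    To confirm the environment is active, print the interpreter prefix; the path should point inside the new env folder:

     $ python -c "import sys; print(sys.prefix)"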

Installing Scikit-Learn

To install Scikit-Learn, enter the following command:

    $ pip install -U scikit-learn
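
To verify the installation, print the installed version number:

    $ python -c "import sklearn; print(sklearn.__version__)"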

Installing Pandas

Pandas is a fast, flexible, and powerful tool for data analysis and manipulation. It makes working with labeled data straightforward and intuitive through its expressive data structures: Series, a one-dimensional labeled array capable of holding data of any type, and DataFrame, a two-dimensional mutable tabular data structure with labeled axes (rows and columns). Pandas features include:

  • Size mutability: DataFrames support insertion or deletion of columns.
  • Effortless handling of missing data.
  • Intuitive merging and joining of datasets.
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.

To install pandas, enter the following command:

$ pip install pandas

This guide uses pandas to load and manipulate the dataset.
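
As a quick, self-contained illustration of the two structures described above (the values here are arbitrary):

import pandas as pd

# A Series is a one-dimensional labeled array
s = pd.Series([3.6, 4.5, -1.3], index=['a', 'b', 'c'])

# A DataFrame is a two-dimensional table with labeled rows and columns
df = pd.DataFrame({'variance': [3.62, 4.55], 'forged': [0, 1]})

print(s)
print(df)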

Choosing The Dataset

The dataset used in this guide is the Banknote Authentication Dataset. This dataset contains 1372 items representing images of bank notes with four predictor variables (image variance, skewness, kurtosis, entropy), and an encoded variable to predict the note's authenticity (0 - authentic, 1 - forged).

3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
…
-1.3971,3.3191,-1.3927,-1.9948,1
0.39012,-0.14279,-0.031994,0.35084,1
…

This guide uses this dataset to build a machine-learning classification model to predict the authenticity of a note.

Importing The Dataset

Download the dataset specified above into the project directory. The directory should look like this:

.
├── env
└── data_banknote_authentication.txt

The data_banknote_authentication.txt file is a CSV (Comma-Separated Values) file containing the dataset.

Create the main.py file inside the project directory:

$ touch main.py

Open main.py, and add the following line:

import pandas as pd

This imports the pandas library into scope to parse the downloaded data set. Load the dataset:

dataset = pd.read_csv(
        "data_banknote_authentication.txt",
        header=None,
        names=['image_variance', 'skewness', 'kurtosis', 'entropy', 'forged'])

This uses the read_csv function from the pandas library to load the downloaded dataset. read_csv takes the path to the dataset as its first argument; because the dataset is in the same directory as main.py, the filename alone is enough.

header=None is an optional argument telling pandas that the file contains no header row, so the first line is read as data rather than as column names. (Passing header=0 here would consume the first data row as column names and silently drop one of the 1372 records.)

names=[...] is another optional argument specifying user-defined column names. Because the file has no header row, these names label the five columns image_variance, skewness, kurtosis, entropy, and forged respectively.

Printing the dataset displays the loaded DataFrame with the specified column names as follows:

print(dataset)

Output:
          image_variance        skewness        kurtosis        entropy         forged
0                3.62160         8.66610         -2.8073       -0.44699              0
1                4.54590         8.16740         -2.4586       -1.46210              0
2                3.86600        -2.63830          1.9242        0.10645              0
3                3.45660         9.52280         -4.0112       -3.59440              0
4                0.32924        -4.45520          4.5718       -0.98880              0
...                  ...             ...             ...            ...            ...
1367             0.40614         1.34920         -1.4501       -0.55949              1
1368            -1.38870        -4.87730          6.4774        0.34179              1
1369            -3.75030       -13.45860         17.5932       -2.77710              1
1370            -3.56370        -8.38270         12.3930       -1.28230              1
1371            -2.54190        -0.65804          2.6842        1.19520              1

[1372 rows x 5 columns]

Next, add the following lines to split the loaded dataset into its features and binary target:

features = dataset[dataset.columns[0:4]]
forged = dataset['forged']

Pandas DataFrames let you index columns like you would a dictionary. The first line:

features = dataset[dataset.columns[0:4]]

Creates a new DataFrame called features containing the first four columns (image_variance, skewness, kurtosis, entropy) of the loaded dataset. These four columns are the features from which the model derives a binary classification, with 0 representing an authentic note and 1 representing a forged note.

Next, create a new variable called forged containing the last column specifying the classification result:

forged = dataset['forged']

Because the dataset pairs each sample with a known label, this guide uses the supervised learning approach.
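
Optionally, inspect the class balance of the target column before training (the exact counts come from the dataset itself):

# Count how many samples fall into each class (0 - authentic, 1 - forged)
print(forged.value_counts())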

Splitting The Dataset

During the evaluation of a classifier model, always test the model on unseen data to determine its performance. A split of 70% for training data and 30% for test data is a good ratio. Train the model using the training data, and evaluate it using the test data. This approach measures the model's performance and robustness.

Scikit-Learn provides a train_test_split function that splits a given dataset into separate sets. To use this function to split the data, add the following lines:

…
from sklearn.model_selection import train_test_split

…

# Split the data
train, test, train_labels, test_labels = train_test_split(features,
                                                          forged,
                                                          test_size=0.30,
                                                          random_state=42)

The first line imports the train_test_split function. This function takes a variable number of array arguments (a sequence of indexables with the same length or shape) and five optional arguments for different tuning effects. Pass the features set containing the dataset attributes and the forged set containing the classification as arguments.

The test_size optional argument takes a float between 0.0 and 1.0 representing the proportion of the dataset to include in the test split. A value of 0.30 translates to using 30% of the dataset as test data. test_size defaults to 0.25.

The random_state optional argument controls the shuffling applied to the data before the split. It takes an integer in the range [0, 2^32 - 1] or a numpy.random.RandomState instance. Passing any fixed seed (42 is a common convention) makes the shuffle deterministic, so the split is reproducible across runs.

Randomization of data split between training and test data is important to remove selection and accidental bias.

The train_test_split function returns 2 * num_of_arrays_passed lists containing the train-test split of the inputs. Passing two arguments yields 2 * 2 = 4 lists, described below, with a quick shape check after the list:

  • train: training part of the features set.
  • test: test part of the features set.
  • train_labels: training part of the forged set containing the classification result.
  • test_labels: test part of the forged set containing the classification result.
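
As a sanity check, print the shapes of the resulting splits (the exact numbers below assume the 1372-row dataset with a 0.30 test split):

# 30% of 1372 rows rounds up to 412 test samples, leaving 960 for training
print(train.shape, test.shape)                    # (960, 4) (412, 4)
print(train_labels.shape, test_labels.shape)      # (960,) (412,)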

Building and Evaluating The Model

There are many machine learning models, each with its strengths and weaknesses. A Naive Bayes model applies Bayes' theorem under the assumption of class-conditional independence. This means that, given the outcome, the presence of one feature does not affect the probability of another, so each predictor contributes independently to the result.
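
In symbols, for features x1, ..., x4 and class y, the model assumes P(y | x1, ..., x4) is proportional to P(y) * P(x1 | y) * ... * P(x4 | y). The Gaussian variant used below additionally models each P(xi | y) as a normal distribution estimated from the training data.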

Naive Bayes usually works well for binary classification tasks like this. To initialize the model, add the following lines:

…
from sklearn.naive_bayes import GaussianNB
…

# Initialize the classifier
gnb = GaussianNB()

# Train the classifier
model = gnb.fit(train, train_labels)

# Make predictions
preds = gnb.predict(test)

First, import the GaussianNB class. Then, initialize the classifier:

# Initialize the classifier
gnb = GaussianNB()

After initializing the classifier, the next step is to train the model by fitting it to the data using the fit method:

# Train the classifier
model = gnb.fit(train, train_labels)

The fit method takes two arguments: a two-dimensional array-like object with n samples and m features, and a one-dimensional array-like object with the n target values for the classification.

This trains the model using the training split part of the dataset with its targeted classification.

After training the model, use the trained model to make predictions on the split test dataset by adding the following lines:

…

# Make predictions
preds = gnb.predict(test)

This uses the predict method to return a one-dimensional array of 0s and 1s representing the predicted values for the authenticity class (0 - authentic, 1 - forged).
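
To eyeball the predictions against the true labels, you can print the first few of each:

# Compare the first five predictions with the corresponding true labels
print(preds[:5])
print(test_labels[:5].values)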

Evaluating Model Accuracy

To evaluate the model's accuracy, compare the test_labels and preds arrays using Scikit-Learn's accuracy_score function. Add the following lines:

…
from sklearn.metrics import accuracy_score

…

# Evaluate accuracy
accuracy = accuracy_score(test_labels, preds)

print("Naive Bayes accuracy -> ", accuracy)

First, import the accuracy_score function from sklearn.metrics. The function takes two one-dimensional arrays of the same size as arguments and computes the fraction of positions where they agree. The highest value is 1.0, which signifies perfect performance.
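
For intuition, accuracy is simply the fraction of matching labels. A minimal manual equivalent, using NumPy (already installed as a Scikit-Learn dependency), is:

import numpy as np

# Fraction of predictions that match the true labels
manual_accuracy = np.mean(preds == test_labels.values)
print(manual_accuracy)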

Adding Another Classification Model

While Naive Bayes is a good model for this banknote problem, real-world problems often require tests across different classification models. Another classification model is the Support Vector Machine (SVM), which maps data to a high-dimensional feature space so that categories can be separated even when the data is not linearly separable in its original space.

To add this classification model, add the following lines:

…
from sklearn.svm import SVC
…

# Initialize the SVC classifier
sv = SVC()

# Train the classifier
sv_model = sv.fit(train, train_labels)

# Make predictions
sv_preds = sv.predict(test)

# Evaluate accuracy
sv_accuracy = accuracy_score(test_labels, sv_preds)

print("SVM accuracy -> ", sv_accuracy)

Adding multiple models lets you compare the accuracy across the different classifiers to find the best-performing one.
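
If you need to compare more than two models, a small loop keeps the process tidy. The following sketch reuses the same split and adds a k-nearest-neighbors classifier purely as an illustration:

from sklearn.neighbors import KNeighborsClassifier

# Train and score several classifiers on the same train/test split
classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "k-NN": KNeighborsClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(train, train_labels)
    score = accuracy_score(test_labels, clf.predict(test))
    print(name, "->", score)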

Final Code

For reference, the final code in the main.py file:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

dataset = pd.read_csv(
        "data_banknote_authentication.txt",
        header=None,
        names=['image_variance', 'skewness', 'kurtosis', 'entropy', 'forged'])

features = dataset[dataset.columns[0:4]]
forged = dataset['forged']

# Split the data
train, test, train_labels, test_labels = train_test_split(features,
                                                          forged,
                                                          test_size=0.30,
                                                          random_state=42)

# Initialize the classifier
gnb = GaussianNB()

# Train the classifier
model = gnb.fit(train, train_labels)

# Make predictions
preds = gnb.predict(test)

# Evaluate accuracy
accuracy = accuracy_score(test_labels, preds)

print("Naive Bayes accuracy -> ", accuracy)

# Initialize the SVC classifier
sv = SVC()

# Train the classifier
sv_model = sv.fit(train, train_labels)

# Make predictions
sv_preds = sv.predict(test)

# Evaluate accuracy
sv_accuracy = accuracy_score(test_labels, sv_preds)

print("SVM accuracy -> ", sv_accuracy)

Running The Code

Open a terminal with the virtual environment activated, then enter:

$ python3 main.py

This outputs something similar to:

Naive Bayes accuracy ->  0.8422330097087378
SVM accuracy ->  0.9951456310679612

This shows that the Naive Bayes classifier has an accuracy score of about 0.842, meaning it classifies roughly 84.2% of the test notes correctly, while the SVM classifier scores about 0.995, or 99.5% correct.

Here the SVM classifier outperforms the Naive Bayes classifier for this banknote authentication problem.
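
Accuracy is only one lens on performance. For a per-class breakdown of precision and recall, Scikit-Learn's classification_report is useful (shown here for the SVM predictions):

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 score for the SVM predictions
print(classification_report(test_labels, sv_preds))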

Conclusion

This guide covered how to build a machine learning classifier model in Python using Scikit-Learn and how to evaluate model accuracy. For more information, check out the Scikit-Learn website.