
Deploy a Machine Learning Model to Production using TorchServe

Updated on February 1, 2023

Introduction

TorchServe is an open-source tool for deploying PyTorch models in production. It allows you to serve PyTorch models as RESTful services with minimal configuration, and it supports multi-model serving, model versioning, and monitoring. This lets you focus on model development and training without worrying about the underlying infrastructure and deployment.

Deploying PyTorch models using TorchServe on Vultr Cloud GPU servers allows for efficient, decoupled serving of models that takes advantage of the high-performance capabilities of the underlying hardware. This approach lets you scale the models dynamically as traffic and usage increase. Additionally, deploying the models on a remote server allows you to access and use them from any internet-connected location and device without needing local computational resources.

This article demonstrates the steps to package a PyTorch model into a model archive file, deploy the model archive files using TorchServe, run inference using the REST API, and manage the models using the management API.

Prerequisites

Before you begin, you should:

  • Deploy a Cloud GPU server with the NVIDIA NGC marketplace application.
  • Point 2 subdomains to the server using an A record. This article uses inference.torchserve.example.com and management.torchserve.example.com for demonstration.

Understanding TorchServe APIs

TorchServe features 3 different APIs, each designed to offer specific functionality.

  • Management API
  • Inference API
  • Metrics API

The Management API allows you to manage and organize your models, such as registering, unregistering, and listing models. It also enables you to specify configurations for the models, such as the number of workers and the batch size. It listens on port 8081. Refer to Management API to learn more.

The Inference API provides an interface for making predictions and inferences using your models. It allows you to send input data and receive output predictions in a standard format. It listens on port 8080. Refer to Inference API to learn more.

Metrics API allows you to monitor the performance of your models. It provides real-time metrics such as request rate, response time, and error rate. It listens on port 8082. Refer to TorchServe Metrics to learn more.
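
After TorchServe is running, you can verify that all three APIs respond on their default ports with a quick Python check. The following is a minimal sketch, assuming TorchServe runs locally with the default configuration; the requests library is the only dependency.

# api_check.py - a minimal sketch, assuming TorchServe runs locally with
# its default ports: 8080 (inference), 8081 (management), 8082 (metrics).
import requests

# Inference API health check, returns {"status": "Healthy"} when ready.
print(requests.get("http://localhost:8080/ping").json())

# Management API, returns the list of registered models as JSON.
print(requests.get("http://localhost:8081/models").json())

# Metrics API, returns metrics in Prometheus text format.
print(requests.get("http://localhost:8082/metrics").text)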

Create a Model Archive File

TorchServe uses model archive files to load and serve PyTorch models. A Model Archive (MAR) file is a format used to package a model. Creating a model archive file is a convenient way to store and distribute models, as it includes the dependencies, architecture, and pre-trained weights in a single file.

Before you create a model archive file, you need to export your model into a serialized file.

>>> torch.save(model, 'path/to/model.pth')

The .pt or .pth file is the serialized version of a PyTorch model that consists of the weights, parameters, and architecture. You can use this file to load the model and use it for inference or even train it further. This article does not cover the steps to train or save a model. You can refer to the Demo Notebook that trains a model on a subset of the Food101 dataset and the ResNet18 pre-trained weights.
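
Before packaging the model, you can load the serialized file back and run a quick test prediction. The following is a minimal sketch, assuming the entire model object was saved with torch.save(model, ...) as shown above and that its class definitions are importable in the current environment; the input shape matches a standard ResNet18.

# A minimal sketch of loading the serialized model for a quick test,
# assuming the full model object was saved as shown above.
import torch

model = torch.load('path/to/model.pth', map_location='cpu')
model.eval()

# ResNet18 expects a batch of 3-channel, 224x224 images.
dummy_input = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    output = model(dummy_input)

print(output.shape)  # one row of class scores per input image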

Install the torch-model-archiver package using pip.

# pip install torch-model-archiver

You can use the torch-model-archiver command to create a model archive file using the following parameters.

  • --model-name: Set the name of the model.
  • --version: Set the version of the model.
  • --model-file: The path to the Python file that declares the model architecture.
  • --serialized-file: The path to the serialized file.
  • --extra-files: Any other dependencies, separated with a comma.
  • --handler: Choose from default handlers or the path to the file that declares custom handler logic.

The following are the default handler options that you can use for creating a model archive file.

  • image_classifier
  • text_classifier
  • object_detector
  • image_segmenter

Refer to Default Handler Documentation or Custom Handler Documentation for more information.
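
The handler.py file passed to the --handler parameter in the example command below contains the custom handler logic. The following is a minimal sketch of what such a file might look like for an image classification model; the class name and the topk override are illustrative, and if you only need default behavior you can pass the built-in image_classifier handler instead of a file.

# handler.py - a minimal sketch of a custom handler for an image
# classification model. It inherits the preprocessing, inference, and
# top-k postprocessing logic from the built-in image_classifier handler,
# so a custom file is only required when you change that behavior.
from ts.torch_handler.image_classifier import ImageClassifier


class DessertsHandler(ImageClassifier):
    """Returns the top 3 predicted classes instead of the default."""

    # Illustrative override; remove it to keep the handler's default top-k.
    topk = 3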

The following is an example of the command used for creating a model archive file.

# torch-model-archiver --model-name desserts \
                       --version 1.0 \
                       --model-file model.py \
                       --serialized-file desserts_resnet18.pth \
                       --handler handler.py \
                       --extra-files index_to_name.json

The above command creates a new file named desserts.mar in the working directory. As TorchServe supports serving multiple models, you can store multiple model archive files in a directory, which you use in the later steps to serve the models.

You can download the desserts.mar and subset.mar files into the /root/model-store directory to test the workflow as shown in the later steps.

Build the Container Image

The TorchServe container image available on Docker Hub may not be compatible with your hardware due to constantly changing drivers. The best approach is to build the container image using the build-image.sh script, providing the options that match your specifications. This section demonstrates the steps to clone the GitHub repository, fetch the CUDA version, and build the container image.

Clone the TorchServe GitHub repository.

# git clone https://github.com/pytorch/serve

The above command clones the TorchServe repository into the serve directory.

Enter the docker directory.

# cd serve/docker

The above command enters the docker directory, which contains all the files related to the TorchServe container image.

Fetch the CUDA version.

# nvidia-smi

The above command outputs information about the GPUs connected to the host machine. Note the CUDA version in the top-right corner of the output. You use this value in the next command to specify the version.

Build the container image.

# ./build-image.sh -g -cu 116

The above command uses the build-image.sh script to build the container image named pytorch/torchserve:latest-gpu. This process may take up to 10 to 15 minutes. It uses the -g option to specify that you want to build a GPU-accelerated container image. It also uses the -cu 116 option to specify the CUDA version, formed by removing the . symbol from the CUDA version fetched in the previous command. Refer to the Create TorchServe docker image section to learn more.

Deploy TorchServe using Docker

You built the TorchServe container image in the previous section using the build-image.sh script. This section demonstrates how to deploy TorchServe using Docker, set up all the options, and bind the model store directory using Docker volumes.

Run a temporary container using the pytorch/torchserve:latest-gpu image.

# docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -v /root/model-store:/model-store pytorch/torchserve:latest-gpu torchserve --model-store /model-store --models desserts=desserts.mar subset=subset.mar

The above command creates a new container using the defined image. It overrides the entry command to define the model names and paths, which instructs TorchServe to serve the defined models. However, pre-defining the models is not the best approach. In the next step, you run the container without editing the entry command.

The following is the explanation for each parameter used in the above command.

  • --rm: Remove the container from the disk when stopped.
  • -it: Interactive session. Allow keyboard interrupt.
  • --gpus all: GPU access inside the container.
  • -p 8080:8080: Bind port 8080 to the host machine, inference API.
  • -p 8081:8081: Bind port 8081 to the host machine, management API.
  • -v /root/model-store:/model-store: Bind the /root/model-store directory with the /model-store directory inside the container.

Exit the container using Ctrl + C.

Run a temporary container using the pytorch/torchserve:latest-gpu image without editing the command.

# docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -v /root/model-store:/home/model-server/model-store pytorch/torchserve:latest-gpu

The above command creates a new container using the defined image without editing the entry command. It uses the same options as the previous command, but for binding the model store, you now use the /home/model-server/model-store directory inside the container matching the default directory defined in the config.properties file inside the serve/docker directory.

When you create the container without editing the entry command, TorchServe does not serve any models until you register them using the management API, as demonstrated in the next section.

Register a New Model using the Management API

This section demonstrates the steps to register a new model on TorchServe using the management API.

Register a new model.

# curl -X POST "http://localhost:8080/models?url=desserts.mar&initial_workers=1"
# curl -X POST "http://localhost:8080/models?url=subset.mar&initial_workers=1"

The above commands send a POST request to the /models endpoint of the management API to register a new model. They use the url parameter to define the path to the model archive file and the initial_workers parameter to set the number of workers to 1, as the default value is 0.

Fetch the list of models.

# curl http://localhost:8081/models

The above command requests the list of available models from TorchServe.

Fetch individual model details.

# curl http://localhost:8081/models/desserts
# curl http://localhost:8081/models/subset_resnet18

The above commands send a GET request to fetch the details of the individual models. The output contains the metadata and the list of workers running for the specified model.

Set the number of minimum workers.

# curl -X PUT "http://localhost:8081/models/desserts?min_workers=3"

The above command sends a PUT request to the specified model to set the minimum number of workers to 3. The minimum worker value defines the number of workers that are always up. TorchServe spawns a new worker if an existing worker crashes due to unexpected behavior.
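
You can also call the same management endpoints from application code with any HTTP client. The following is a minimal sketch using the Python requests library, assuming TorchServe listens on the default management port and the model archive files shown earlier are available in the model store.

# manage_models.py - a minimal sketch of the management calls above using
# the Python requests library, assuming the default management port 8081.
import requests

MANAGEMENT_API = "http://localhost:8081"

# Register the model archive file and start one worker.
response = requests.post(f"{MANAGEMENT_API}/models",
                         params={"url": "desserts.mar", "initial_workers": 1})
print(response.json())

# Fetch the list of registered models.
print(requests.get(f"{MANAGEMENT_API}/models").json())

# Set the minimum number of workers for the model to 3.
print(requests.put(f"{MANAGEMENT_API}/models/desserts",
                   params={"min_workers": 3}).json())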

Fetch Model Predictions using the Inference API

This section demonstrates the steps to run inference on a model served by TorchServe using the inference API.

Run inference on the model.

# curl -T dessert-example.jpg http://localhost:8080/predictions/desserts
# curl -T subset-example.jpg http://localhost:8080/predictions/subset_resnet18

The above commands upload the example dessert and subset images to the demo models and return the predictions. The curl -T option sends each image file as the request body.
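
Besides curl, you can fetch predictions directly from application code. The following is a minimal sketch using the Python requests library, assuming the default inference port and an example image in the working directory.

# predict.py - a minimal sketch of calling the inference endpoint from
# Python, assuming the default inference port 8080 and a local example image.
import requests

with open("dessert-example.jpg", "rb") as image_file:
    response = requests.post("http://localhost:8080/predictions/desserts",
                             data=image_file.read())

# The image_classifier handler returns the top classes with their scores.
print(response.json())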

Deploy TorchServe using Docker Compose

Deploying TorchServe using Docker Compose allows you to serve the PyTorch models using TorchServe, protected with a Let's Encrypt SSL certificate and run all the containers in the background. This section demonstrates the steps to deploy TorchServe using Docker Compose and set up the reverse proxy server using Nginx.

Create a new file named docker-compose.yaml.

# nano docker-compose.yaml

The docker-compose.yaml file allows you to run multi-container Docker applications using the docker-compose command.

Add the following contents to the file.

services:
  torchserve:
    image: pytorch/torchserve:latest-gpu
    restart: unless-stopped
    volumes:
      - "/root/model-store:/home/model-server/model-store"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  nginx:
    image: nginx
    restart: unless-stopped
    ports:
      - 80:80
      - 443:443
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
      - ./certbot/conf:/etc/letsencrypt
      - ./certbot/www:/var/www/certbot


  certbot:
    image: certbot/certbot
    container_name: certbot
    volumes:
      - ./certbot/conf:/etc/letsencrypt
      - ./certbot/www:/var/www/certbot
    command: certonly --webroot -w /var/www/certbot --force-renewal --email YOUR_EMAIL -d inference.torchserve.example.com -d management.torchserve.example.com --agree-tos

The above configuration defines three services. The torchserve service runs the GPU-accelerated TorchServe container and uses the volumes attribute to pass in the model archive files from the /root/model-store directory. The nginx service runs a container using the official Nginx image that acts as a reverse proxy server between clients and the torchserve service. The certbot service runs a container using the official Certbot image that issues a Let's Encrypt SSL certificate for the specified subdomains. Replace YOUR_EMAIL with your email address.

Create a new directory named nginx.

# mkdir nginx

Create a new file named nginx/nginx.conf inside the directory.

# nano nginx/nginx.conf

Add the following contents to the file.

events {}

http {
    server_tokens off;
    charset utf-8;

    server {
        listen 80 default_server;
        server_name _;

        location ~ /.well-known/acme-challenge/ {
            root /var/www/certbot;
        }
    }
}

The above configuration instructs the Nginx server to serve the ACME challenge generated by Certbot. You must perform this step for the Certbot container to verify the ownership of the subdomains and issue an SSL certificate for them. You swap this configuration in the later steps to set up the reverse proxy server.

Start the services.

# docker-compose up -d

The above command starts the services defined in the docker-compose.yaml file in detached mode. This means that the services will start in the background, and you can use your terminal for other commands.

Verify the SSL issuance.

# ls certbot/conf/live/inference.torchserve.example.com

The above command outputs the list of contents inside the directory created by Certbot for your domain name. The output should contain the fullchain.pem and the privkey.pem files. It may take up to five minutes to generate the SSL certificate. If it takes longer than that, you can troubleshoot by viewing the logs using the docker-compose logs certbot command.

Update the nginx.conf file.

# nano nginx/nginx.conf

Add the following contents to the file.

events {}

http {
    server_tokens off;
    charset utf-8;

    # Map required by the $connection_upgrade variable used below.
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    server {
        listen 80 default_server;
        server_name _;

        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl http2;

        server_name inference.torchserve.example.com;

        ssl_certificate     /etc/letsencrypt/live/inference.torchserve.example.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/inference.torchserve.example.com/privkey.pem;

        location / {
            proxy_pass http://torchserve:8080;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }

        location ~ /.well-known/acme-challenge/ {
            root /var/www/certbot;
        }
    }

    server {
        listen 443 ssl http2;

        server_name management.torchserve.example.com;

        ssl_certificate     /etc/letsencrypt/live/inference.torchserve.example.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/inference.torchserve.example.com/privkey.pem;

        location / {
            proxy_pass http://torchserve:8081;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header X-Scheme $scheme;

            proxy_buffering off;
        }

        location ~ /.well-known/acme-challenge/ {
            root /var/www/certbot;
        }
    }
}

The above configuration declares three server blocks. The first server block redirects all incoming HTTP traffic to HTTPS. The second server block listens for incoming traffic on inference.torchserve.example.com and channels it to the TorchServe container on port 8080. The third server block listens for incoming traffic on management.torchserve.example.com and channels it to the TorchServe container on port 8081. Both HTTPS server blocks use the SSL certificate generated by Certbot and contain a location block to serve the ACME challenge files for SSL renewals using Cron.

Restart the Nginx service.

# docker-compose restart nginx

The above command restarts the Nginx container to enable the updated configuration. You can confirm the deployment by opening https://inference.torchserve.example.com/ping in your web browser. After confirming the deployment, you can register the models using the management API as demonstrated in the previous sections. You can additionally restrict access to the management API to a specific IP address using the allow and deny keywords in the Nginx configuration. Refer to the Module ngx_http_access_module for more information.

Set Up Automatic SSL Renewal

Cron is a built-in job scheduler in the Linux operating system to run the specified commands at a scheduled time. Refer to How to Use the Cron Task Scheduler to learn more.

Edit the Cron table.

# crontab -e

The above command opens the Cron table editor.

Add the following entries to the table.

0 5 1 */2 *  /usr/local/bin/docker-compose -f /root/torchserve/docker-compose.yaml start certbot
5 5 1 */2 *  /usr/local/bin/docker-compose -f /root/torchserve/docker-compose.yaml restart nginx

The above statements define two tasks that run on the first day of every second month: the first starts the Certbot container to renew the SSL certificate, and the second restarts the Nginx container five minutes later to reload the configuration using the renewed certificate.

To exit the editor, press Esc, type :wq, and press Enter.

Configure the Firewall Rules

Add the firewall rules.

# ufw allow 22
# ufw allow 80
# ufw allow 443

The above commands allow incoming connections on port 22 for SSH, port 80 for HTTP traffic, and port 443 for HTTPS traffic.

Enable the firewall.

# ufw enable

Conclusion

This article demonstrated the steps to package a PyTorch model into a model archive file, deploy the model archive files using TorchServe, run inference using the REST API, and manage the models using the management API. You can also refer to the Performance Guide to optimize the PyTorch models for efficient serving using TorchServe.

More Information