Deploy a PyTorch Workspace on a Vultr Cloud GPU Server

Updated on February 1, 2023

Introduction

PyTorch is an open-source deep learning framework for natural language processing and computer vision applications. It offers ease of use and flexibility, allowing for fast and seamless integration of deep learning models into a wide range of applications.

Deploying a PyTorch workspace on Vultr enables you to leverage the power of the Cloud GPU servers that feature the NVIDIA A100 and the A40 GPUs to perform resource-hungry tasks using the torch module. Combining JupyterLab and the PyTorch container image provides an efficient remote development environment, allowing you to work with others on a machine-learning project.

This article demonstrates the steps to inherit the PyTorch container image and install JupyterLab, creating a new container image. It also walks you through the deployment using Docker and Docker Compose on Vultr Cloud GPU servers using the NVIDIA Docker Toolkit.

Prerequisites

Before you begin, you should:

  • Deploy a Cloud GPU server with the NVIDIA NGC marketplace application.
  • Point a subdomain to the server using an A record. This article uses pytorch.example.com for demonstration.

Verify the GPU Availability

The Vultr Cloud GPU servers feature NVIDIA GPUs for machine learning, artificial intelligence, and so on. They come with licensed NVIDIA drivers and the CUDA Toolkit, which are essential for the proper functioning of the GPUs. This section provides an overview of the PyTorch container image and demonstrates the steps to verify the GPU availability on the server and inside a container.

Execute the nvidia-smi command on the server.

# nvidia-smi

The above command outputs the information about the connected GPU. It includes information such as the driver version, CUDA version, GPU model, available memory, GPU usage, and so on.

Run a temporary container using the pytorch/pytorch:latest image.

# docker run --rm -it --gpus all pytorch/pytorch:latest

The above command uses the PyTorch container image to verify GPU access inside a container. The NVIDIA Docker Toolkit enables GPU access inside containers through the --gpus option, and the all value grants the container access to every GPU attached to the host. The -it option attaches an interactive terminal session, and the --rm option removes the container from the disk when it exits.

Enter the Python console.

root@f28fee5c54e5:/workspace# python

The above command enters a new Python console inside the container.

Test the GPU availability using the torch module.

Python 3.10.8 (main, Nov  4 2022, 13:48:29) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()

The above commands import the torch module and use the torch.cuda.is_available() function to verify that the CUDA features are available to PyTorch. If the output is False, verify that you used the --gpus all option when creating the container.
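
You can optionally run a few more checks in the same console to confirm that PyTorch can place tensors on the GPU. The following is a minimal sketch using standard torch functions.

>>> torch.cuda.device_count()
>>> torch.cuda.get_device_name(0)
>>> torch.rand(3, 3).to('cuda').device

The first command reports the number of available GPUs, the second prints the GPU model, such as NVIDIA A40, depending on your server, and the third confirms that a new tensor lands on device(type='cuda', index=0).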

Quit the Python console with the quit() command and exit the container terminal using the exit command.

Build the Docker Image

The PyTorch container image consists of the runtime and the dependencies to use the PyTorch module. This section demonstrates the steps to create a configuration for the JupyterLab server and create a new container image that combines PyTorch and JupyterLab.

The NVIDIA NGC marketplace application installs the JupyterLab server alongside the NVIDIA Docker Toolkit on the server. You can use the pre-installed JupyterLab server to generate a configuration file to set up default options in the container image.
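
If you want to review every available option before writing the configuration, you can optionally generate a fully commented reference file with the pre-installed server. This step is not required for the rest of the article.

# jupyter lab --generate-config

The above command writes a jupyter_lab_config.py file to the ~/.jupyter directory that lists each setting along with its default value.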

Create a new file named config.py.

# nano config.py

Add the following contents to the file.

c.ServerApp.ip = '0.0.0.0'
c.ServerApp.allow_root = True
c.ServerApp.allow_remote_access = True
c.ServerApp.password = ''

The above configuration instructs the JupyterLab server to allow remote connections and listen on the 0.0.0.0 IP address so that you can access the workspace using the server's public IP address. You can add any other option in this file to pre-configure the container image.

Create the password hash using the passwd function.

# python3 -c "from jupyter_server.auth import passwd; print(passwd('YOUR_PASSWORD'))"

The above command uses the pre-installed Jupyter module and the passwd() function to create a new password hash to protect the workspace. Replace the c.ServerApp.password value in the config.py file with the output.
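
After the replacement, the password line in config.py looks similar to the following. The hash shown here is a truncated placeholder, not a working value, and depending on your Jupyter version the output may begin with argon2: or sha1:.

c.ServerApp.password = 'argon2:$argon2id$...'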

Create a new file named Dockerfile.

# nano Dockerfile

The Dockerfile declares the steps to build the container image.

Add the following contents to the file.

FROM pytorch/pytorch:latest

RUN pip install jupyterlab
RUN pip install -U ipywidgets ipykernel

COPY config.py /root/.jupyter/jupyter_lab_config.py
EXPOSE 8888

CMD ["bash", "-c", "jupyter lab"]

The above instructions inherit the official PyTorch container image as the base image, install JupyterLab and the notebook widget libraries using pip, copy the configuration file you created in the previous steps, expose port 8888, and use the bash -c "jupyter lab" command to spawn the JupyterLab server.
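
Note that the latest tag moves as new PyTorch releases are published. If you need reproducible builds, you can pin the FROM line to a specific release tag instead; check the pytorch/pytorch repository on Docker Hub for the tags available to you. For example, a tag of the following form pins both the PyTorch and CUDA versions:

FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime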

Build the Docker image.

# docker build -t pytorch-jupyter .

The above command builds a new container image named pytorch-jupyter. You can also push this image to a private repository on your Docker Hub account so that it is ready to use whenever you need to deploy a temporary PyTorch workspace for resource-hungry tasks.
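
For example, the following commands tag the image with your Docker Hub username and upload it after you authenticate with docker login. Replace YOUR_DOCKERHUB_USERNAME with your account name.

# docker tag pytorch-jupyter YOUR_DOCKERHUB_USERNAME/pytorch-jupyter:latest
# docker push YOUR_DOCKERHUB_USERNAME/pytorch-jupyter:latest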

Deploy the Workspace using Docker

You created the pytorch-jupyter image in the previous section by combining JupyterLab and the PyTorch container image. This section demonstrates the steps to deploy the workspace using Docker for temporary tasks or for testing the functionality.

Disable the firewall.

# ufw disable

The above command disables the firewall to allow incoming connections on port 8888.

Run a temporary container using the pytorch-jupyter image.

# docker run --rm -it --gpus all -p 8888:8888 pytorch-jupyter

The above command creates a new container from the pytorch-jupyter image and publishes port 8888 to the host with the -p option. The --gpus all option provides access to all the GPUs connected to the host machine inside the container. The -it option provides access to an interactive session of the container, and the --rm option removes the container from the disk when it exits.

You can confirm the deployment by opening http://PUBLIC_IP:8888 in your web browser. To log in to the JupyterLab interface, use the password you used for creating the password hash in the previous sections. You can use the torch.cuda.is_available() function in a new notebook to verify the GPU availability.

Exit the container using Ctrl + C.

Deploy the Workspace using Docker Compose

Deploying the PyTorch Workspace on a Vultr Cloud GPU server provides more than just access to high-end GPUs. The JupyterLab interface allows you to work with others on a machine-learning project, offering more flexibility and scalability than a local setup. It also allows you to access and manage your machine learning resources from anywhere with an internet connection. This section demonstrates the steps to deploy a persistent PyTorch workspace on a Vultr Cloud GPU server using Docker Compose.

Create and enter a new directory named pytorch-environment.

# mkdir ~/pytorch-environment
# cd ~/pytorch-environment

The above commands create and enter a new directory named pytorch-environment in the /root directory. You use this directory to store all the configuration files related to the PyTorch Workspace, such as the Nginx configuration, SSL certificate, and so on.

Create a new file named docker-compose.yaml.

# nano docker-compose.yaml

The docker-compose.yaml file allows you to run multi-container Docker applications using the docker-compose command.

Add the following contents to the file.

services:
  jupyter:
    image: pytorch-jupyter
    restart: unless-stopped
    volumes:
      - "/root/workspace:/workspace"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  nginx:
    image: nginx
    restart: unless-stopped
    ports:
      - 80:80
      - 443:443
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
      - ./certbot/conf:/etc/letsencrypt
      - ./certbot/www:/var/www/certbot

  certbot:
    image: certbot/certbot
    container_name: certbot
    volumes:
      - ./certbot/conf:/etc/letsencrypt
      - ./certbot/www:/var/www/certbot
    command: certonly --webroot -w /var/www/certbot --force-renewal --email YOUR_EMAIL -d pytorch.example.com --agree-tos

The above configuration defines three services. The jupyter service runs the container that contains the GPU-accelerated PyTorch workspace, and it uses the volumes attribute to store all the workspace files in the /root/workspace directory. The nginx service runs a container using the official Nginx image that acts as a reverse proxy server between clients and the jupyter service. The certbot service runs a container using the official Certbot image that issues a Let's Encrypt SSL certificate for the specified domain name. Replace YOUR_EMAIL with your email address.
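
You can optionally verify that the file parses correctly before starting the services. The following command prints the resolved configuration if the YAML is valid and reports the offending line otherwise.

# docker-compose config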

Create a new directory named nginx.

# mkdir nginx

Create a new file named nginx/nginx.conf inside the directory.

# nano nginx/nginx.conf

Add the following contents to the file.

events {}

http {
    server_tokens off;
    charset utf-8;

    server {
        listen 80 default_server;
        server_name _;

        location ~ /.well-known/acme-challenge/ {
            root /var/www/certbot;
        }
    }
}

The above configuration instructs the Nginx server to serve the ACME challenge generated by Certbot. You must perform this step for the Certbot container to verify the ownership of the domain name and issue an SSL certificate for it. You swap this configuration in the later steps to set up the reverse proxy server.

Start the PyTorch workspace.

# docker-compose up -d

The above command starts the services defined in the docker-compose.yaml file in detached mode. This means that the services will start in the background, and you can use your terminal for other commands.
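
You can check the state of the services and follow the workspace logs using the standard docker-compose commands.

# docker-compose ps
# docker-compose logs -f jupyter

The first command lists each service along with its status, and the second streams the logs of the jupyter service. Press Ctrl + C to stop following the logs.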

Verify the SSL issuance.

# ls certbot/conf/live/pytorch.example.com

The above command outputs the list of contents inside the directory created by Certbot for your domain name. The output should contain the fullchain.pem and the privkey.pem files. It may take up to five minutes to generate the SSL certificate. If it takes longer than that, you can troubleshoot by viewing the logs using the docker-compose logs certbot command.

Update the nginx.conf file.

# nano nginx/nginx.conf

Replace the contents of the file with the following.

events {}

http {
    server_tokens off;
    charset utf-8;

    map $http_upgrade $connection_upgrade {
        default upgrade;
        '' close;
    }

    server {
        listen 80 default_server;
        server_name _;

        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl http2;

        server_name pytorch.example.com;

        ssl_certificate     /etc/letsencrypt/live/pytorch.example.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/pytorch.example.com/privkey.pem;

        location / {
            proxy_pass http://jupyter:8888;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header X-Scheme $scheme;

            proxy_buffering off;
        }

        location ~ /.well-known/acme-challenge/ {
            root /var/www/certbot;
        }
    }
}

The above configuration uses the SSL certificate generated by Certbot. It configures a reverse proxy server that channels the incoming traffic to the container on port 8888. It also defines a location block to serve ACME challenge files for SSL renewals using Cron.
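
Before restarting the service, you can optionally validate the updated configuration inside the running Nginx container. The command reports the offending line if the syntax is invalid.

# docker-compose exec nginx nginx -t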

Restart the Nginx service.

# docker-compose restart nginx

The above command restarts the Nginx container to enable the updated configuration. You can confirm the deployment of the workspace by opening https://pytorch.example.com in your web browser.
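
You can also verify the deployment from the terminal. The following command should return an HTTP success or redirect status along with the response headers served over HTTPS.

# curl -I https://pytorch.example.com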

Set Up Automatic SSL Renewal

Cron is a built-in job scheduler in the Linux operating system to run the specified commands at a scheduled time. Refer to How to Use the Cron Task Scheduler to learn more.

Edit the Cron table.

# crontab -e

The above command opens the Cron table editor.

Add the following entries to the table.

0 5 1 */2 *  /usr/local/bin/docker-compose -f /root/pytorch-environment/docker-compose.yaml start certbot
5 5 1 */2 *  /usr/local/bin/docker-compose -f /root/pytorch-environment/docker-compose.yaml restart nginx

The above statements define two tasks that run at 5:00 AM and 5:05 AM on the first day of every second month. The first task starts the Certbot container to regenerate the SSL certificate, and the second restarts the Nginx container to reload the configuration with the renewed certificate.

To exit the editor, press Esc, type :wq, and press Enter.

Configure the Firewall Rules

Add the firewall rules.

# ufw allow 22
# ufw allow 80
# ufw allow 443

The above commands allow incoming connections on port 22 for SSH, port 80 for HTTP traffic, and port 443 for HTTPS traffic.

Enable the firewall.

# ufw enable
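
Verify the firewall status.

# ufw status

The above command lists the active rules to confirm that ports 22, 80, and 443 accept incoming connections.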

Conclusion

This article demonstrated the steps to inherit the PyTorch container image and install JupyterLab to create a new container image. It also walked you through the deployment using Docker and Docker Compose on Vultr Cloud GPU servers. You can also push the container image you built to your Docker Hub account to deploy temporary PyTorch workspaces in the future.

More Information