Exploring Vultr GPU Stack | Generative AI Series

Updated on December 6, 2023

Introduction

Llama 2 is an openly available large language model from Meta, distributed through Hugging Face. You can use the model in your application to perform natural language processing (NLP) tasks.

The Vultr GPU Stack is a preconfigured compute instance with all the essential components for developing and deploying AI and ML applications. In this tutorial, you'll explore the Vultr GPU Stack environment and run a Llama 2 model in a Docker container.

Prerequisites

Before you begin:

  • Deploy a Vultr GPU Stack compute instance.
  • Access the instance over SSH as a non-root user with sudo privileges.

Explore the Vultr GPU Stack environment

The Vultr GPU Stack ships with the packages you need to develop AI models. Follow the steps below to verify that your environment is up and running:

  1. Check the configuration of the NVIDIA GPU server by running the nvidia-smi command.

    console
    $ nvidia-smi
    
  2. Inspect the Docker runtime environment by running the following commands:

    • Check the Docker version.

      console
      $ sudo docker version
      

      Output.

      Client: Docker Engine - Community
      Version:           24.0.7
      API version:       1.43
      Go version:        go1.20.10
      Git commit:        afdd53b
      Built:             Thu Oct 26 09:07:41 2023
      OS/Arch:           linux/amd64
      Context:           default
      
      Server: Docker Engine - Community
      Engine:
       Version:          24.0.7
       API version:      1.43 (minimum version 1.12)
       Go version:       go1.20.10
       Git commit:       311b9ff
       Built:            Thu Oct 26 09:07:41 2023
       OS/Arch:          linux/amd64
       Experimental:     false
      containerd:
       Version:          1.6.24
       GitCommit:        61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
      runc:
       Version:          1.1.9
       GitCommit:        v1.1.9-0-gccaecfc
      docker-init:
       Version:          0.19.0
       GitCommit:        de40ad0
    • Display the Docker system information.

      console
      $ sudo docker info
      

      Output.

      Client: Docker Engine - Community
       Version:    24.0.7
       Context:    default
       Debug Mode: false
       Plugins:
        buildx: Docker Buildx (Docker Inc.)
          Version:  v0.11.2
          Path:     /usr/libexec/docker/cli-plugins/docker-buildx
        compose: Docker Compose (Docker Inc.)
          Version:  v2.21.0
          Path:     /usr/libexec/docker/cli-plugins/docker-compose
      
      Server:
      ...
       Runtimes: runc io.containerd.runc.v2 nvidia
       Default Runtime: runc
      ...
       Docker Root Dir: /var/lib/docker
       Debug Mode: false
       Experimental: false
       Insecure Registries:
        127.0.0.0/8
       Live Restore Enabled: false

    The above output omits some settings for brevity. The important detail is the nvidia entry in the Runtimes list, which shows the NVIDIA container runtime is installed and lets Docker containers access the underlying GPU.

  3. Run an Ubuntu image and execute the nvidia-smi command within the container to confirm that containers can access the GPU. A Python-based check that you can run directly on the host follows this list.

    console
    $ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
    

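As an additional check, you can confirm that the GPU is visible from Python on the host. The snippet below is a minimal sketch that assumes PyTorch is installed on the instance (if it is not part of your image, install it with pip install torch); it only lists the CUDA devices the framework can see.

python
import torch

# List every CUDA device PyTorch can see, with its name and total memory.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f'GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB')
else:
    print('No CUDA-capable GPU detected')
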
Run the Llama 2 Model on the Vultr GPU Stack

In this section, you'll launch the Hugging Face Text Generation Inference container to expose the Llama-2-7b-chat-hf model (the 7-billion-parameter chat variant) through an API. Follow the steps below:

  1. Fill out the Llama 2 model request form.

  2. Use the same email address to sign up for a Hugging Face account and create an access token.

  3. Request access to the Llama-2-7b-chat-hf repository.

  4. Run the following commands in your SSH session to initialize some shell variables. Replace YOUR_HF_TOKEN with your Hugging Face access token.

    console
    $ model=meta-llama/Llama-2-7b-chat-hf
    $ volume=$PWD/data
    $ token=YOUR_HF_TOKEN
    
  5. Create a data directory in your home directory to store the model artifacts.

    console
    $ mkdir data
    
  6. Run the command below to launch the ghcr.io/huggingface/text-generation-inference:1.1.0 Docker container and initialize the Llama 2 model.

    console
    $ sudo docker run -d \
        --name hf-tgi \
        --runtime=nvidia \
        --gpus all \
        -e HUGGING_FACE_HUB_TOKEN=$token \
        -p 8080:80 \
        -v $volume:/data \
        ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id $model
    

    Output.

    Digest: sha256:55...45871608f903f7f71d7d
    Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:1.1.0
    78a39...f3e1dca928e00f859
    
  7. Wait for the container to start and check the logs.

    console
    $ sudo docker logs -f hf-tgi
    

    The last few lines, similar to the output below, indicate that the host is listening for incoming HTTP connections and the API is ready.

     ...
     ...Connected
     ...Invalid hostname, defaulting to 0.0.0.0
    
  8. Run the following curl command to query the API.

    console
    $ curl 127.0.0.1:8080/generate \
        -X POST \
        -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":128}}' \
        -H 'Content-Type: application/json'
    

    Output.

    json
    {"generated_text":"\n\nDeep learning (also known as deep structured learning) is part of a broader family of machine learning techniques based on artificial neural networks—specifically, on the representation and processing of data using multiple layers of neural networks. Learning can be supervised, semi-supervised, or unsupervised.\n\nDeep-learning architectures such as Deep Neural Networks, Deep Belief Networks, and Deep Reinforcement Learning have been applied to fields including visual recognition, natural language processing, speech recognition, and expert system.\n\nDeep learning has been described as a \"paradigm shift\""}
    

    The output confirms that the LLM is running.
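
You can call the same endpoint from any HTTP client. The snippet below is a minimal Python equivalent of the curl request above, using the requests library (install it with pip install requests if it is not already available); the endpoint, payload, and port match the container started earlier.

python
import requests

# Same endpoint and payload as the curl example above.
response = requests.post(
    'http://127.0.0.1:8080/generate',
    json={
        'inputs': 'What is Deep Learning?',
        'parameters': {'max_new_tokens': 128},
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()['generated_text'])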

Query the Llama 2 Model Using a Jupyter Notebook

Use the Python client to invoke the model from a Jupyter Notebook by following the steps below:

  1. Install the HF Text Generation client by running the command below.

    console
    $ pip install text-generation
    
  2. Run Jupyter Lab and retrieve the access token.

    console
    $ jupyter lab --ip 0.0.0.0 --port 8890

    Output.

    http://YOUR_SERVER_HOST_NAME:8890/lab?token=b7ab2bdscb366edsddssfsff0faeb5fa68b6b0cf
    
  3. Allow port 8890 through the firewall.

    console
    $ sudo ufw allow 8890
    $ sudo ufw reload
    
  4. Access Jupyter Lab in a browser using the URL below. Replace YOUR_SERVER_IP with the public IP address of the GPU instance and YOUR_JUPYTER_LAB_TOKEN with the token from the previous output.

    http://YOUR_SERVER_IP:8890/lab?token=YOUR_JUPYTER_LAB_TOKEN
  5. Click Python 3 (ipykernel) under Notebook and paste the following Python code into a new cell.

    python
    from text_generation import Client
    
    # Point the client at the TGI container started earlier.
    URI = 'http://localhost:8080'
    
    tgi_client = Client(URI)
    prompt = 'What is the most important tourist attraction in Paris?'
    
    # Generate up to 100 new tokens and print the trimmed response.
    print(tgi_client.generate(prompt, max_new_tokens=100).generated_text.strip())
    
  6. Run the code. The LLM responds to your query with a response similar to the following.

    Paris, the City of Light, is known for its iconic landmarks, cultural institutions, and historical significance. As one of the most popular tourist destinations in the world, Paris has a plethora of attractions that draw visitors from all over the globe. While opinions may vary, some of the most important tourist attractions in Paris include:
    
    1. The Eiffel Tower: The most iconic symbol of Paris, the Eiffel Tower  
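
The text_generation client can also stream tokens as they are generated instead of waiting for the complete response. The snippet below is a minimal sketch that assumes the same container and client version as above; if your client version differs, check its documentation for the streaming interface.

python
from text_generation import Client

# Reuse the TGI endpoint from the notebook example above.
tgi_client = Client('http://localhost:8080')
prompt = 'What is the most important tourist attraction in Paris?'

# Print tokens as they arrive, skipping special tokens such as end-of-sequence.
for response in tgi_client.generate_stream(prompt, max_new_tokens=100):
    if not response.token.special:
        print(response.token.text, end='', flush=True)
print()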

Conclusion

This tutorial walked you through running the Llama 2 model in a container on the Vultr GPU Stack. In the next section, you'll explore other advanced LLMs.