
How to Use Hugging Face Transformer Models on Vultr Cloud GPU

Updated on June 28, 2023

Introduction

Transformers are a type of neural network architecture used for deep learning. Large transformer models are often referred to as foundation models. Foundation models are trained using self-supervised or unsupervised learning on large amounts of data, so they can be adapted (fine-tuned) to perform a range of different tasks.

In particular, transformers have proved useful in Natural Language Processing (NLP) applications. Transformers that are trained on large volumes of raw text and are capable of various language-related tasks, such as sentence completion and summarization, are called Large Language Models (LLMs). This article explains how to use transformer models, including recent and popular LLMs.

Prerequisites

To follow the examples in this article, make sure you:

    • Deploy a Vultr Cloud GPU instance.
    • Connect to the instance over SSH.

Using Transformers

Transformer models are trained on large datasets. The training process can take days and requires a large amount of GPU resources. Hence, this article does not cover how to train a transformer model; it uses pre-trained models instead. To train your own models, you can deploy a Vultr Cloud GPU instance and train them based on your interests.

Pre-trained models are good general-purpose models: they perform well on many tasks but do not excel at any specific one. Pre-trained models can be further trained on specific tasks in a process called fine-tuning.

Hugging Face

Hugging Face is a platform for studying and downloading data science and machine learning tools and models. It's also a community of data science professionals with a vast collection of learning materials. The "model card" page of each model describes the scope, parameters, and specifications of the model and also contains code samples for using it.

The examples used in this article are based on downloading models and model components from Hugging Face.

Set up the Server

  1. Download the Conda installer.

     $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  2. Run the installer.

     $ bash Miniconda3-latest-Linux-x86_64.sh

    Press Enter to continue, use the space bar to scroll through the Terms and Conditions pager, then enter yes to accept them and start the installation.

  3. When the installation is successful, end your SSH session.

     $ exit
  4. Log in to the server again to activate Conda.

     $ ssh example-user@your-server-ip

    Notice that your shell prompt now includes a (base) prefix, as shown below.

     (base) example-user@server:~$
  5. Upgrade Conda.

     $ conda upgrade -y conda
  6. Create a new Conda environment env1 with Python 3.9.

     $ conda create -y --name env1 python=3.9
  7. Activate the new Conda environment.

     $ conda activate env1
  8. Upgrade pip.

     $ pip install --upgrade pip
  9. Using Conda, install the CUDA GPU packages:

     $ conda install -y -c conda-forge cudatoolkit=11.8 cudnn=8.2
  10. Install PyTorch and all required GPU dependencies.

    $ conda install -y -c pytorch -c nvidia pytorch pytorch-cuda=11.8
  11. Add the Conda environment's library directory to the library path:

    $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
  12. Create the activation directory:

    $ mkdir -p $CONDA_PREFIX/etc/conda/activate.d
  13. Write the cuDNN and library paths to the environment activation script, as shown below.

    $ echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
    $ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
  14. Apply the environment variables by sourcing the activation script.

    $ source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
  15. Using pip, install TensorFlow, Transformers, huggingface-hub, the NVIDIA cuDNN package, and dependencies such as einops, accelerate, and xformers.

    $ pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.* transformers huggingface-hub einops accelerate xformers
  16. Verify that TensorFlow is installed with GPU support.

    $ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

    Output:

    [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

    The above output shows that the Python and Conda environments have access to your server's GPU.
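
Optionally, you can run a similar check for PyTorch, which the examples below use, to confirm that it can also access the GPU.

$ python3 -c "import torch; print(torch.cuda.is_available())"

If the command prints True, PyTorch can use your server's GPU.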

Using the Models

Pipelines are high-level tools that package components required to perform different predefined tasks such as text-generation, question-answering, sentiment-analysis, among others. You can run pipelines by specifying a task and letting it use the default settings (for that task) for everything else. It's also possible to custom-build a pipeline by specifying the model, tokenizer, and other parameters.

The examples in this section cover both pipeline approaches and are based on text generation models/tasks. Each example also mentions the amount of GPU RAM used to run the model. More complex tasks (such as generating longer text) require more memory. Before loading the models, it's recommended to close and reopen the Python shell. This clears out the old models from memory and frees up space for new models.
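
If you want to confirm how much GPU memory a loaded model is consuming without leaving the Python shell, one option is to query PyTorch. This is an optional sketch; the value reflects only memory allocated by PyTorch in the current process.

>>> import torch
>>> # GPU memory allocated by PyTorch in this process, in GB
>>> torch.cuda.memory_allocated(0) / 1024**3

Alternatively, monitor GPU usage from another terminal with nvidia-smi, as described in the Hardware Considerations section.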

Pipelines with Default Models

Pipelines can automatically choose a default model for a task. Default models often have low memory and storage requirements. This makes them good as a learning tool.

  1. Enter the Python shell.

     $ python
  2. Import the pipeline module.

     >>> from transformers import pipeline
  3. Below is the syntax for creating a pipeline that uses the default model for a task.

     >>> my_pipeline = pipeline(task="task_name")

    task_name can be any of the following values:

    • audio-classification
    • automatic-speech-recognition
    • image-classification
    • object-detection
    • image-segmentation
    • depth-estimation
    • sentiment-analysis
    • ner (Named Entity Recognition)
    • question-answering
    • summarization
    • translation
    • text-generation
    • fill-mask

    For a full description of the different tasks, visit the Hugging Face Pipelines documentation.

  4. For example, to use a pipeline for text-generation, run the following command.

     >>> my_text_generator = pipeline("text-generation", device=0)

    By default, this task downloads OpenAI's GPT-2 model and uses 500 MB of storage. Loading the model into memory uses 1.5 GB of GPU RAM. Running the pipeline below to generate text uses 2 GB of GPU RAM.

     >>> my_text = "Vultr is a cloud service provider"
     >>> my_text_generator(my_text, max_length=200)
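
The same pattern works for the other tasks in the list above. For example, the following minimal sketch builds a default sentiment-analysis pipeline; the task downloads a small DistilBERT model fine-tuned for sentiment classification (it appears in the model cache listing later in this article), and the input sentence is only illustrative.

>>> my_classifier = pipeline("sentiment-analysis", device=0)
>>> my_classifier("Vultr makes it easy to deploy a cloud GPU server")

The pipeline returns a list of dictionaries, each containing a label (POSITIVE or NEGATIVE) and a confidence score.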

Custom-built Pipelines

So far, you have specified a task and let the pipeline pick the default model for that task. It's also possible to specify a particular model to use for the task. Before using a model, verify that it's suitable for that task. The model card page of most models describes what the model can be used for.

To declare a pipeline for a given task, follow the syntax below.

>>> my_pipeline = pipeline(task="task_name", model="model_name")
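
For instance, the following minimal sketch explicitly selects the small EleutherAI/gpt-neo-125M model (which also appears in the model cache listing later in this article) for the text-generation task; the prompt is illustrative.

>>> from transformers import pipeline
>>> gen_neo = pipeline(task="text-generation", model="EleutherAI/gpt-neo-125M", device=0)
>>> gen_neo("Vultr is a cloud service provider", max_length=50)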

It's also possible to specify other parameters to construct the pipeline. The following examples describe how you can do this.

Falcon-7B

The Falcon model has two variants: Falcon-7B and Falcon-40B. Both are pre-trained models that perform reasonably well on a broad range of text-based tasks. Falcon-7B is based on 7 billion parameters and uses about 14 GB of storage.

Loading it into memory (by creating a pipeline) with 16-bit weights uses 14 GB of GPU RAM. Therefore, it's advisable to run this model on a system with over 16 GB of GPU RAM. Loading the model using the default 32-bit floats takes over 22 GB of GPU RAM, and running it can take more than 25 GB of GPU RAM.

  1. To use the model, import the following packages.

     >>> from transformers import AutoTokenizer, AutoModelForCausalLM
     >>> import transformers, torch
  2. Declare the model name with a variable.

     >>> model = "tiiuae/falcon-7b"
  3. Initialize the tokenizer corresponding to the model.

     >>> tokenizer = AutoTokenizer.from_pretrained(model)
  4. Declare the pipeline with 16-bit weights.

     >>> pipeline = transformers.pipeline(
       "text-generation",
       model=model,
       tokenizer=tokenizer,
       torch_dtype=torch.bfloat16,
       trust_remote_code=True,
       device_map="auto",
     )

    > The above pipeline loads the model weights as 16-bit floats. To load the weights as 32-bit floats, omit the torch_dtype=torch.bfloat16 line. Using 32-bit weights offers better results at the cost of using more memory.

  5. Generate text based on an input prompt.

     >>> sequences = pipeline("Vultr is a cloud service provider", do_sample=True)
  6. View the contents of the generated text.

     >>> for seq in sequences:
             print(f"Result: {seq['generated_text']}")

    Output:

     Result: Vultr is a cloud service provider with global nodes available in 35 countries. It's
  7. By default, the generated text is short. To generate longer text, specify a few additional parameters in the pipeline call.

     >>> sequences = pipeline(
          "Vultr is a cloud service provider",
           max_length=200,
           do_sample=True,
           num_return_sequences=1,
       )
  8. View the generated text.

     >>> for seq in sequences:
             print(f"Result: {seq['generated_text']}")

    Output:

     Result: Vultr is a cloud service provider that offers web hosting services, Vultr is a part of the Cloud services. You can manage all the servers VULTR from the account's "Servers" option. Each VPS in the Vultr platform has one master Public IP address. If you just buy a VPS account and connect to it now with a router's private ip, you will get access to the Vultr vps without the command vultr console login.

Falcon-40B

Falcon-40B is based on 40 billion parameters and uses about 76 GB of storage. Loading it into memory uses about 76 GB of GPU RAM, so a system with a single A100 80 GB GPU is insufficient. To run this model, use a system with at least two A100 or A40 Vultr Cloud GPUs.

The Falcon-40B steps are similar to those for Falcon-7B. In the previous code examples, replace 7b with 40b as shown below.

>>> model = "tiiuae/falcon-40b"

GPT-J

GPT-Neo is an open-source transformer model based on OpenAI's GPT architecture. GPT-J is the successor class of models to GPT-Neo, and GPT-J-6B is a specific GPT-J model with 6 billion parameters. It's a large model: the default 32-bit variant uses around 24 GB of GPU RAM, while the 16-bit variant (float16) has a smaller memory footprint of around 12 GB of GPU RAM.

  1. To use GPT-J in a pipeline, import the required packages.

     >>> from transformers import GPTJForCausalLM, pipeline, AutoTokenizer
     >>> import torch
  2. Specify the model parameters with 16-bit weights.

     >>> model = GPTJForCausalLM.from_pretrained(
           "EleutherAI/gpt-j-6B",
           revision="float16",
           torch_dtype=torch.float16,
           low_cpu_mem_usage=True
       )

    > To load the default 32-bit variant of GPT-J-6B, omit the lines revision="float16" and torch_dtype=torch.float16.

  3. Initialize the tokenizer.

     >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B") 
  4. Create the pipeline based on the model and tokenizer declared earlier.

     >>> gen_gptj = pipeline(task="text-generation",
           model=model,
           tokenizer=tokenizer,
           device=0,
           max_length=200
       )
  5. Use the pipeline.

     >>> gen_gptj("Vultr is a cloud service provider")

    This outputs the generated text based on the input prompt.
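
As with the Falcon example, you can capture the returned sequences and print only the generated text. A minimal sketch:

>>> sequences = gen_gptj("Vultr is a cloud service provider")
>>> for seq in sequences:
        print(seq['generated_text'])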

Hardware Considerations

By default, the model files are stored in the ~/.cache/huggingface/hub directory. To view the disk space consumed by downloaded models, run the following command.

$ du -h -d 1 ~/.cache/huggingface/hub/

Your output should look like the one below.

256M    /root/.cache/huggingface/hub/models--distilbert-base-uncased-finetuned-sst-2-english
460M    /root/.cache/huggingface/hub/models--openai-gpt
503M    /root/.cache/huggingface/hub/models--EleutherAI--gpt-neo-125M
2.0G    /root/.cache/huggingface/hub/models--bigscience--bloomz-1b1
5.0G    /root/.cache/huggingface/hub/models--EleutherAI--gpt-neo-1.3B
12G     /root/.cache/huggingface/hub/models--EleutherAI--gpt-j-6B
14G     /root/.cache/huggingface/hub/models--tiiuae--falcon-7b
78G     /root/.cache/huggingface/hub/models--tiiuae--falcon-40b

Before a model is used, it is unpacked into memory, so the amount of memory (RAM) required to run a model is greater than the storage space it consumes. Choose your hardware resources based on the size of the model you want to run. The amount of GPU memory also depends on the work the model has to do: for example, a text-generation pipeline with the max_length parameter set to 200 uses less memory than one with max_length set to 2000.

If the server has insufficient memory to handle the model, the process terminates with an error.

To monitor CPU and memory usage, use a tool like top:
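
$ top

Press Q to exit top.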

To monitor GPU usage, use watch to fetch the output of nvidia-smi every 1 second:

$ watch -n 1 nvidia-smi

Conclusion

In this guide, you used transformer models from Hugging Face, including some recent and popular LLMs. To use a new model, read its model card page to learn what it can do and how to use it, and study its configuration options in the model documentation before applying it in practice.

LLMs are undoubtedly powerful. However, they are not perfect and cannot be used blindly; it's critical to verify their output. In particular, language models have no notion of facts. The prompt passed to the text-generation models in the examples above is Vultr is a cloud service provider. From the model's perspective, this is equivalent to FooBar is a cloud service provider. Observe the output text generated by the model: you will notice coherent, on-topic sentences with factual errors. This is called hallucination, and it's expected that future models will address this significant shortcoming.