OSS Large Language Models on Vultr

How to Use Meta Llama 2 Large Language Model on Vultr Cloud GPU

Updated on August 10, 2023

Introduction

Llama 2 Large Language Model (LLM) is the successor to the Llama 1 model released by Meta. Llama 2 models are available in three sizes based on their parameter count, ranging from 7 billion to 70 billion parameters: Llama-2-7b, Llama-2-13b, and Llama-2-70b. Llama 2 models are released under a license that permits both research and commercial use.

This article explains how to use the Meta Llama 2 large language model (LLM) on a Vultr Cloud GPU server. You initialize the Llama-2-70b-hf and Llama-2-70b-chat-hf models with quantization, then compare the available model weights in the Llama 2 LLM family.

Prerequisites

Before you begin:

Access the Llama 2 LLM Model

In this section, configure your Hugging Face account to access and download the Llama 2 family of models.

  1. Request access to Llama 2 through the official Meta downloads page.

    Meta Llama2 access page

When prompted, enter the email address linked to your Hugging Face account, and wait for a confirmation email from Meta.

  2. Scroll down the page, check the terms and conditions box, then click Accept and Continue.

  3. Log in to your Hugging Face account, and navigate to your account Settings.

  4. On the left navigation menu, click Access Tokens.

    Hugging Face Access Tokens menu option

  5. Click the New token button to set up a new access token.

  6. Give the token a name, for example meta-llama, set the role to read, and click the Generate a token button to save it.

    Generate a new Hugging Face Access Token

  7. Click the Show option to reveal your token in plain text. Copy the token to your clipboard.

  8. In your Hugging Face interface, enter Llama-2-7b in the search bar to open the model page.

  9. Click the checkbox to share your information with Meta, and click Submit to request access to the model repository.

    Access the Meta Llama 2 Hugging Face repository

When successful, you should receive a confirmation email from Hugging Face approving your request to access the model. This confirms that you can use the model files as permitted by the Meta terms and conditions.
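
Optionally, once you install the Python dependencies later in this guide, you can confirm that your access token can reach the gated repository before downloading any weights. The following is a minimal sketch that assumes the huggingface_hub package (installed alongside transformers) and the access token generated above; it is not part of the official access workflow.

     from huggingface_hub import model_info

     token = 'YOUR_ACCESS_TOKEN'            # the Hugging Face access token generated above
     model_id = 'meta-llama/Llama-2-7b-hf'

     try:
         # Querying repository metadata succeeds only after Meta approves your request
         model_info(model_id, token=token)
         print(f'Access to {model_id} confirmed')
     except Exception as error:
         print(f'Access not granted yet: {error}')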

Install the CUDA Toolkit

To run Llama 2 models with lower precision settings, the CUDA Toolkit is required. Install the toolkit to add the libraries needed to write and compile GPU-accelerated applications with CUDA, as described in the steps below.

  1. Download the CUDA Toolkit version 11.8 installer.

     $ wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
  2. Initialize the CUDA toolkit installation.

     $ sudo sh cuda_11.8.0_520.61.05_linux.run

    When prompted, read the CUDA terms and conditions, and enter accept to agree to the toolkit license. Then, in the installation prompt, press Space to deselect all other provided options and keep only the CUDA Toolkit selected. Using the arrow keys, scroll to the Install option and press Enter to start the installation process.

  3. Using echo, append the following configurations at the end of the ~/.bashrc file.

     $ echo " export PATH=$PATH:/usr/local/cuda-11.8/bin
              export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64 " >> ~/.bashrc

    The above configuration lines declare the environment variable configurations that allow your system to use the CUDA toolkit and its libraries.

  4. Using a text editor such as Vim, create and edit the /etc/ld.so.conf.d/cuda-11-8.conf file.

     $ sudo vim /etc/ld.so.conf.d/cuda-11-8.conf
  5. Add the following configuration at the beginning of the file.

     /usr/local/cuda-11.8/lib64

    Save and close the file.

  6. To apply the configuration changes, end your SSH session.

     $ exit
  7. Start a new SSH session.

     $ ssh example-user@SERVER-IP
  8. Run the following ldconfig command to update the linker cache, and refresh information about shared libraries on your server.

     $ sudo ldconfig
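
To confirm that the toolkit is reachable from your refreshed session, you can run a quick check. The following is a minimal Python sketch, assuming python3 is available on the server; it only inspects the environment and is not part of the original installation steps.

     import os
     import shutil
     import subprocess

     # nvcc should resolve to the CUDA 11.8 installation added to PATH
     nvcc = shutil.which('nvcc')
     print(f'nvcc found at: {nvcc}')

     if nvcc:
         # Prints the CUDA release string, for example "release 11.8"
         subprocess.run([nvcc, '--version'], check=True)

     # The linker path should include /usr/local/cuda-11.8/lib64
     print('LD_LIBRARY_PATH:', os.environ.get('LD_LIBRARY_PATH', '(not set)'))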

Install Model Dependencies

To run the model and its tools, install Jupyter Notebook to execute commands, then install the required libraries as described in the steps below.

  1. Install PyTorch.

     $ pip3 install torch --index-url https://download.pytorch.org/whl/cu118

    The above command installs the PyTorch library that offers efficient tensor computations and supports GPU acceleration for training operations.

    To install a PyTorch version that matches your CUDA version, visit the PyTorch documentation page, set your preferences, and run the generated install command. A short verification sketch follows at the end of this section.

  2. Install dependency packages.

     $ pip3 install bitsandbytes scipy transformers accelerate einops xformers

    Below is what each package represents:

    • transformers: Used for Natural Language Processing (NLP) tasks; key functionalities include model loading, tokenization, and fine-tuning.
    • accelerate: Improves the training and inference of machine learning models.
    • einops: Reshapes and reduces the dimensions of multi-dimensional arrays.
    • xformers: Provides multiple building blocks for making transformer-based models.
    • bitsandbytes: Provides the 8-bit and 4-bit quantization functions that reduce the memory needed for operations such as matrix multiplication.
    • scipy: Provides the scientific and technical computing routines required by bitsandbytes.
  3. Install the Jupyter notebook package.

     $ pip3 install notebook
  4. Allow incoming connections to the Jupyter Notebook port 8888.

     $ sudo ufw allow 8888
  5. Start Jupyter Notebook.

     $ jupyter notebook --ip=0.0.0.0

    If you receive the following error:

     Command 'jupyter' not found, but can be installed with:

    End your SSH session, and reconnect to the server so that your shell reloads its PATH and finds the newly installed jupyter executable.

    When successful, Jupyter Notebook should start with the following output:

     [I 2023-07-31 00:29:42.997 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
     [W 2023-07-31 00:29:42.999 ServerApp] No web browser found: Error('could not locate runnable browser').
     [C 2023-07-31 00:29:42.999 ServerApp] 
    
         To access the server, open this file in a browser:
             file:///home/example-user/.local/share/jupyter/runtime/jpserver-69912-open.html
         Or copy and paste one of these URLs:
             http://HOSTNAME:8888/tree?token=e536707fcc573e0f19be40d90902825ec6e04181bed85be9
             http://127.0.0.1:8888/tree?token=e536707fcc573e0f19be40d90902825ec6e04181bed85be9

    As displayed in the above output, copy the generated token URL to securely access Jupyter Notebook in your browser.

  6. In a web browser such as Chrome, access Jupyter Notebook using your generated access token.

     http://SERVER-IP:8888/tree?token=YOUR_TOKEN
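
Before loading any model, you can optionally run a quick check in a new notebook cell to confirm that PyTorch detects the GPU and the CUDA runtime. This is a minimal sketch based on standard PyTorch calls, not part of the original steps.

     import torch

     print(torch.__version__)                   # PyTorch build, for example 2.x+cu118
     print(torch.version.cuda)                  # CUDA runtime version PyTorch was built against
     print(torch.cuda.is_available())           # True when the GPU is usable

     if torch.cuda.is_available():
         print(torch.cuda.get_device_name(0))   # for example, NVIDIA A100 80GB PCIe
         print(torch.cuda.is_bf16_supported())  # bfloat16 support used by bnb_4bit_compute_dtype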

Run Llama 2 70B Model

In this section, initialize the Llama-2-70b-hf model with 4-bit quantization and 16-bit compute precision, and use your Hugging Face access token to initialize the model pipeline and tokenizer as described in the steps below.

  1. Access the Jupyter Notebook web interface.

  2. On the top right bar, click New to reveal a dropdown list.

    Create a new Jupyter Notebook

  3. Click Notebook, and select Python 3 (ipykernel) to open a new file.

  4. In the new Kernel file, click the filename, which is set to Untitled by default.

  5. Rename the file to Llama-2-70b, and press Enter to save the new filename.

    Rename Jupyter Notebook file

  6. In a new code cell, initialize the Llama-2-70b-hf model.

     from torch import cuda, bfloat16
     import transformers
    
     model_id = 'meta-llama/Llama-2-70b-hf'
    
     device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
    
     quant_config = transformers.BitsAndBytesConfig(
         load_in_4bit=True,
         bnb_4bit_quant_type='nf4',
         bnb_4bit_use_double_quant=True,
         bnb_4bit_compute_dtype=bfloat16
     )
    
     auth_token = 'YOUR_AUTHORIZATION_TOKEN'
    
     model_config = transformers.AutoConfig.from_pretrained(
         model_id,
         use_auth_token=auth_token
     )
    
     model = transformers.AutoModelForCausalLM.from_pretrained(
         model_id,
         trust_remote_code=True,
         config=model_config,
         quantization_config=quant_config,
         use_auth_token=auth_token
     )
    
     model.eval()
     print(f"Model loaded on {device}")

    Paste your Hugging Face access token into the auth_token = directive to replace YOUR_AUTHORIZATION_TOKEN.

    The above code sets the model_id and enables 4-bit quantization with bitsandbytes. The model weights load in the 4-bit NF4 format, while computations run in 16-bit bfloat16 (bnb_4bit_compute_dtype). Keeping the compute precision at 16-bit limits output degradation while greatly reducing memory usage.

  7. Click the play button on the top menu bar, or press Ctrl + Enter to run the code and initialize the model.

    When successful, the code prints the device it runs on, and shows the model is successfully downloaded. The download process may take about 30 minutes to complete.

  8. In a new code cell, initialize the tokenizer.

     tokenizer = transformers.AutoTokenizer.from_pretrained(
         model_id,
         use_auth_token=auth_token
     )

    The above code loads the tokenizer that matches model_id. Every LLM has its own tokenizer that converts input text into smaller units (tokens) that the language model can understand and interpret.

  9. Initialize the pipeline.

     pipe = transformers.pipeline(
         model=model, 
         tokenizer=tokenizer,
         task='text-generation',
         temperature=0.0, 
         max_new_tokens=50,  
         repetition_penalty=1.1 
     )

    The above code initializes a text-generation pipeline that controls how the model generates a response. The pipeline accepts additional parameters that tune the output, as described in the Common Declarations section.

  10. Run the following code to add a text prompt to the pipeline. Replace Hello World with your desired prompt.

     result = pipe('Hello World')[0]['generated_text']
     print(result)

    The above code block generates output based on the input prompt. Generating a response can take up to 5 minutes.

  11. Verify the GPU usage statistics.

     !nvidia-smi

    Output:

     +-----------------------------------------------------------------------------+
     |  Processes:                                                                 |
     |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
     |        ID   ID                                                   Usage      |
     |=============================================================================|
     |    0    0    0      35554      C   /usr/bin/python3                37666MiB |
     +-----------------------------------------------------------------------------+

    As displayed in the above output, the Llama-2-70b-hf model uses about 37.6 GB of GPU memory when executed with 4-bit quantization. In full precision, the model's VRAM consumption is much higher. The sketch below shows roughly where the 4-bit figure comes from.
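
    The observed memory usage is consistent with a back-of-the-envelope estimate for 4-bit weights. The following arithmetic is illustrative only; the exact figure also depends on quantization constants, activations, the KV cache, and CUDA overhead.

     # 4-bit quantization stores each weight in roughly half a byte
     params = 70e9                    # Llama-2-70b parameter count
     bytes_per_param = 0.5            # 4 bits = 0.5 bytes with NF4
     weights_gib = params * bytes_per_param / 1024**3
     print(f'Estimated weight memory: {weights_gib:.1f} GiB')   # about 32.6 GiB

    The remaining few gigabytes reported by nvidia-smi come from quantization metadata, activations, and the CUDA context.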

Run the Llama 2 70B Chat Model

In this section, initialize the fine-tuned Llama-2-70b-chat-hf model with 4-bit quantization and 16-bit compute precision as described in the following steps.

  1. On the main menu bar, click Kernel, and select Restart and Clear Outputs of All Cells to free up the GPU memory.

    Free GPU Memory in Jupyter Notebook

  2. Click File, select the New dropdown, and create a new Notebook.

  3. Rename the notebook to Llama-2-70b-chat-hf.

  4. Initialize the Llama-2-70b-chat-hf model. Replace YOUR_AUTHORIZATION_TOKEN with your Hugging Face access token in the auth_token = directive.

     from torch import cuda, bfloat16
     import transformers
    
     model_id = 'meta-llama/Llama-2-70b-chat-hf'
    
     device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
    
     quant_config = transformers.BitsAndBytesConfig(
         load_in_4bit=True,
         bnb_4bit_quant_type='nf4',
         bnb_4bit_use_double_quant=True,
         bnb_4bit_compute_dtype=bfloat16
     )
    
     auth_token = 'YOUR_AUTHORIZATION_TOKEN'
    
     model_config = transformers.AutoConfig.from_pretrained(
         model_id,
         use_auth_token=auth_token
     )
    
     model = transformers.AutoModelForCausalLM.from_pretrained(
         model_id,
         trust_remote_code=True,
         config=model_config,
         quantization_config=quant_config,
         use_auth_token=auth_token
     )
    
     model.eval()
     print(f"Model loaded on {device}")

    The above code uses the fine-tuned chat model Llama-2-70b-chat-hf, and your access token to access the model repository.

  5. Click the play button, or press Ctrl + Enter to execute the code.

  6. Initialize the tokenizer.

     tokenizer = transformers.AutoTokenizer.from_pretrained(
         model_id,
         use_auth_token=auth_token
     )
  7. Initialize the pipeline.

     pipe = transformers.pipeline(
         model=model, 
         tokenizer=tokenizer,
         task='text-generation',
         temperature=0.0, 
         max_new_tokens=50,  
         repetition_penalty=1.1
     )
  8. Add a text prompt to the pipeline. Replace Hello World with your desired prompt.

     result = pipe('Hello World')[0]['generated_text']
     print(result)

    The chat model is fine-tuned for dialogue, so enter your prompt in a conversational format to see how its responses differ from the base model. An example prompt format follows this list.

  9. Verify the GPU usage statistics.

     !nvidia-smi

    Output:

     +-----------------------------------------------------------------------------+
     | Processes:                                                                  |
     |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
     |        ID   ID                                                   Usage      |
     |=============================================================================|
     |    0    0    0      36099      C   /usr/bin/python3                37666MiB |
     +-----------------------------------------------------------------------------+

    As displayed in the above output, the Llama-2-70b-chat-hf model uses up to 37.6 GB of VRAM when executed with 4-bit quantization. The VRAM consumption of the base and fine-tuned models is similar because both share the same 70 billion parameter count.
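
    To prompt the chat model in a dialogue style, you can wrap your message in the instruction format commonly used with the Llama 2 chat models. The following is a minimal sketch of that convention; the system and user messages are placeholders you can adjust.

     # Illustrative Llama 2 chat prompt with an optional system message
     system_message = 'You are a helpful, concise assistant.'
     user_message = 'Explain what 4-bit quantization does to a large language model.'

     prompt = f'[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message} [/INST]'

     result = pipe(prompt)[0]['generated_text']
     print(result)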

Llama 2 Model Weights

Llama 2 parameter counts range from 7 billion to 70 billion, and each base model has a fine-tuned chat version. Models with fewer parameters consume less GPU memory and are suitable for testing inference with fewer resources, at the cost of output quality.

The following model options are available for Llama 2:

  • Llama-2-13b-hf: Has 13 billion parameters and uses about 8.9 GB of VRAM when run with 4-bit quantized precision.
  • Llama-2-13b-chat-hf: A fine-tuned version of the 13 billion parameter base model designed for chatbot-like functionality.
  • Llama-2-7b-hf: Has 7 billion parameters and uses about 5.5 GB of VRAM when executed with 4-bit quantized precision.
  • Llama-2-7b-chat-hf: A fine-tuned version of the 7 billion parameter base model. Its VRAM consumption matches the base model, and it works like a chatbot.

The above models are openly available and commercially licensed, so you can use them for both research and commercial purposes.
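
To experiment with a smaller variant, the only required change is the model_id; the quantization configuration, tokenizer, and pipeline code from the earlier sections stay the same. A minimal sketch:

     # Swap in a smaller Llama 2 variant; the rest of the workflow is unchanged
     model_id = 'meta-llama/Llama-2-13b-hf'   # or, for example, 'meta-llama/Llama-2-7b-chat-hf'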

Llama 2 improvements over Llama 1

Llama 2 has significant advantages over its predecessor Llama 1, with more variants available across both base and fine-tuned versions.

  1. Unlike Llama 1, Llama 2 is openly available and licensed for commercial use.
  2. Llama 2 has a parameter range of 7 to 70 billion, while Llama 1 has a parameter range of 7 to 65 billion.
  3. Llama 2 was trained on 2 trillion tokens, about 40% more than Llama 1, which improves the accuracy and knowledge of its outputs.
  4. Llama 2 has a context length of 4096 tokens, double the context length of Llama 1.
  5. Llama 2 offers better results than Llama 1 on standard benchmarks such as World Knowledge, Reading Comprehension, and Commonsense Reasoning.
  6. Llama 2 offers fine-tuned chat models together with base models, while Llama 1 only offers base models.

Common Declarations

  1. trust_remote_code: Allows the transformers library to download and run custom model code included in the model repository. Only enable it for repositories whose code you trust.
  2. task: Sets the pipeline task to text generation.
  3. temperature: Controls the output randomness. Higher values (closer to 1.0) lead to more varied output, while lower values make the output more deterministic. It only takes effect when sampling is enabled.
  4. max_new_tokens: Defines the maximum number of new tokens in the output. If not defined, the model stops at a short library default or an end-of-sequence token.
  5. repetition_penalty: Manages the likelihood of generating repeated tokens. Higher values reduce the occurrence of repeated tokens, and vice versa.
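
For example, temperature only influences the output when sampling is enabled. The following sketch reuses the model and tokenizer loaded earlier and is illustrative rather than part of the original steps.

     # Sampling pipeline: do_sample=True makes temperature take effect
     creative_pipe = transformers.pipeline(
         model=model,
         tokenizer=tokenizer,
         task='text-generation',
         do_sample=True,           # enable sampling so temperature applies
         temperature=0.8,          # higher values give more varied output
         max_new_tokens=50,
         repetition_penalty=1.1
     )

     print(creative_pipe('Hello World')[0]['generated_text'])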

Conclusion

In this article, you used Meta Llama 2 models on a Vultr Cloud GPU server, and ran the Llama 2 70B model together with its fine-tuned chat version in 4-bit mode. Below are the VRAM usage statistics for Llama 2 models with a 4-bit quantized configuration on an 80 GB NVIDIA A100 Vultr Cloud GPU.

GPU Stats

More Information

For more information on the Meta Llama 2 models, visit the following official documentation resources.

To implement more Cloud GPU solutions on your server, visit the following resources.