How to Use Meta Llama 2 Large Language Model on Vultr Cloud GPU
Introduction
Llama 2 is a Large Language Model (LLM) released by Meta as the successor to Llama 1. Llama 2 models are available in three sizes based on their parameter count: Llama-2-7b, Llama-2-13b, and Llama-2-70b, with 7 billion, 13 billion, and 70 billion parameters respectively. Llama 2 models are released under a license that allows both research and commercial use.
This article explains how to use the Meta Llama 2 large language model (LLM) on a Vultr Cloud GPU server. You initialize the Llama-2-70b-hf and Llama-2-70b-chat-hf models with quantization, then compare the model weights available in the Llama 2 family.
Prerequisites
Before you begin:
Deploy a new Ubuntu 22.04 A100 Vultr Cloud GPU Server with at least:
- 80 GB GPU RAM
- 12 vCPUs
- 120 GB Memory
Create a non-root user with sudo rights and switch to the account.
Access the Llama 2 LLM Model
In this section, configure your HuggingFace account to access and download the Llama 2 family of models.
Request access to Llama 2 through the official Meta downloads page.
When prompted, enter the same email address as your HuggingFace account, and wait for a Meta confirmation email.
Scroll down on the page, check the terms and conditions box, then click Accept and Continue to continue.
Log in to your HuggingFace account, and navigate to settings.
On the left navigation menu, click Access Tokens.
Click the New token button to set up a new access token.
Give the token a name, for example meta-llama, set the role to read, and click the Generate a Token button to save it.
Click the Show option to reveal your token in plain text, then copy the token to your clipboard.
In the Hugging Face interface, enter Llama-2-7b in the search bar to open the model page.
Click the checkbox to share your information with Meta, and click Submit to request access to the model repository.
When successful, you should receive a confirmation email from HuggingFace accepting your request to access the model. This confirms that you can use the model files as permitted by the Meta terms and conditions.
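Optionally, after you install the Python dependencies later in this guide, you can confirm from code that your token is valid. Below is a minimal sketch that assumes the huggingface_hub library is available (it installs as a dependency of transformers); the token value is a placeholder.

# Minimal sketch: confirm your Hugging Face token is valid.
# Requires the huggingface_hub package (installed as a dependency of
# transformers later in this guide). Replace the placeholder token.
from huggingface_hub import login, whoami

login(token='YOUR_AUTHORIZATION_TOKEN')
print(whoami())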
Install the CUDA Toolkit
The CUDA toolkit is required to run Llama 2 models with lower precision settings. Install the toolkit to get the libraries needed to compile and run GPU-accelerated applications, as described in the steps below.
Download the latest CUDA toolkit version.
$ wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
Initialize the CUDA toolkit installation.
$ sudo sh cuda_11.8.0_520.61.05_linux.run
When prompted, read the CUDA terms and conditions and enter accept to agree to the toolkit license. Then, in the installation prompt, press Space to deselect all other options and keep only the CUDA toolkit selected. Using the arrow keys, scroll to the Install option and press Enter to start the installation process.
Using echo, append the following configuration to the end of the ~/.bashrc file.
$ echo "
export PATH=$PATH:/usr/local/cuda-11.8/bin
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64
" >> ~/.bashrc
The above lines declare the environment variables that allow your system to find the CUDA toolkit binaries and libraries.
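To confirm the variables take effect after you start a new SSH session later in this section, you can run a short Python check. Below is an optional sketch that uses only the standard library; the expected paths match the CUDA 11.8 installation above.

# Optional check, run in Python after reloading ~/.bashrc in a new SSH
# session: confirm the CUDA binaries and libraries are visible.
import os
import shutil

print(shutil.which('nvcc'))               # expected: /usr/local/cuda-11.8/bin/nvcc
print(os.environ.get('LD_LIBRARY_PATH'))  # expected to include /usr/local/cuda-11.8/lib64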
Using a text editor such as Vim, edit the /etc/ld.so.conf.d/cuda-11-8.conf file.
$ sudo vim /etc/ld.so.conf.d/cuda-11-8.conf
Add the following configuration at the beginning of the file.
/usr/local/cuda-11.8/lib64
Save and close the file.
To apply the ~/.bashrc configuration, end your SSH session.
$ exit
Start a new SSH session.
$ ssh example-user@SERVER-IP
Run the following ldconfig command to update the linker cache and refresh information about shared libraries on your server.
$ sudo ldconfig
Install Model Dependencies
To use the model, install the required Python libraries, then install Jupyter Notebook to run the model commands, as described in the steps below.
Install PyTorch.
$ pip3 install torch --index-url https://download.pytorch.org/whl/cu118
The above command installs the PyTorch library that offers efficient tensor computations and supports GPU acceleration for training operations.
To install a PyTorch version that matches your CUDA version, visit the PyTorch documentation page to set your preferences and generate the install command.
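After the installation completes, you can verify that PyTorch detects the GPU. Below is a short sketch; the exact version strings depend on your installation.

# Confirm that PyTorch detects the GPU and was built against CUDA 11.8.
import torch

print(torch.__version__)              # PyTorch version, for example 2.x.x+cu118
print(torch.version.cuda)             # CUDA version PyTorch was built with
print(torch.cuda.is_available())      # True when the A100 GPU is usable
print(torch.cuda.get_device_name(0))  # GPU model name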
Install dependency packages.
$ pip3 install bitsandbytes scipy transformers accelerate einops xformers
Below is what each package provides; a quick import check follows this list:
- transformers: Used for Natural Language Processing (NLP) tasks; key functionalities include tokenization and fine-tuning.
- accelerate: Improves the training and inference of machine learning models.
- einops: Reshapes and reduces the dimensions of multi-dimensional arrays.
- xformers: Provides multiple building blocks for making transformer-based models.
- bitsandbytes: Focuses on functions that optimize operations involving 8-bit data, such as matrix multiplication.
- scipy: Enables access to the bitsandbytes functionalities for scientific and technical computing.
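As a quick sanity check, you can import the key libraries and print their versions. The exact version numbers depend on your installation.

# Confirm the key libraries import correctly and report their versions.
import transformers
import accelerate
import bitsandbytes

print(transformers.__version__)
print(accelerate.__version__)
print(bitsandbytes.__version__)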
Install the Jupyter notebook package.
$ pip3 install notebook
Allow incoming connections to the Jupyter Notebook port 8888.
$ sudo ufw allow 8888
Start Jupyter Notebook.
$ jupyter notebook --ip=0.0.0.0
If you receive the following error:
Command 'jupyter' not found, but can be installed with:
End your SSH connection and reconnect to the server so your shell picks up the new executable path.
When successful, Jupyter Notebook should start with the following output:
[I 2023-07-31 00:29:42.997 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 2023-07-31 00:29:42.999 ServerApp] No web browser found: Error('could not locate runnable browser').
[C 2023-07-31 00:29:42.999 ServerApp]

    To access the server, open this file in a browser:
        file:///home/example-user/.local/share/jupyter/runtime/jpserver-69912-open.html
    Or copy and paste one of these URLs:
        http://HOSTNAME:8888/tree?token=e536707fcc573e0f19be40d90902825ec6e04181bed85be9
        http://127.0.0.1:8888/tree?token=e536707fcc573e0f19be40d90902825ec6e04181bed85be9
As displayed in the above output, copy the generated token URL to securely access Jupyter Notebook in your browser.
In a web browser such as Chrome, access Jupyter Notebook using your generated access token.
http://SERVER-IP:8888/tree?token=YOUR_TOKEN
Run Llama 2 70B Model
In this section, initialize the Llama-2-70b-hf model with 4-bit quantization and 16-bit compute precision, and add your Hugging Face access token to initialize the model, tokenizer, and pipeline as described in the steps below.
Access the Jupyter Notebook web interface.
On the top right bar, click New to reveal a dropdown list.
Click Notebook, and select Python 3 (ipykernel) to open a new file.
In the new Kernel file, click the filename, which is set to Untitled by default.
Rename the file to Llama-2-70b, and press Enter to save the new filename.
In a new code cell, initialize the Llama-2-70b-hf model.

from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-70b-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

quant_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

auth_token = 'YOUR_AUTHORIZATION_TOKEN'

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=auth_token
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=quant_config,
    use_auth_token=auth_token
)

model.eval()
print(f"Model loaded on {device}")
Paste your Hugging Face token as the value of auth_token to replace YOUR_AUTHORIZATION_TOKEN.
The above code sets the model_id and enables 4-bit quantization with bitsandbytes. The model weights load in 4-bit NF4 precision while computation runs in 16-bit (bfloat16), which keeps the output close to full-precision quality while greatly reducing memory use.
Click the play button on the top menu bar, or press Ctrl + Enter to run the code and initialize the model.
When successful, the code prints the device it runs on, and shows the model is successfully downloaded. The download process may take about 30 minutes to complete.
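To see how much memory the quantized model occupies, you can query it directly. Below is an optional sketch; get_memory_footprint() is a method on Hugging Face transformers models and reports an approximate figure that may differ slightly from nvidia-smi.

# Optional: print the quantized model's approximate memory footprint in GB.
# The exact figure varies with the model revision and quantization settings.
print(f"{model.get_memory_footprint() / 1024**3:.1f} GB")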
In a new code cell, initialize the tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=auth_token
)
The above code loads the tokenizer for model_id. Every LLM has its own tokenizer, which converts a text stream into smaller units (tokens) that the language model can understand and interpret.
Initialize the pipeline.
pipe = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    temperature=0.0,
    max_new_tokens=50,
    repetition_penalty=1.1
)
The above code initializes a text-generation pipeline through which you control the kind of response the model generates. The pipeline accepts additional parameters that shape the output, as shown in the sketch below.
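For example, you can enable sampling to make responses less deterministic. The following sketch uses standard Hugging Face generation parameters (do_sample, top_k, top_p); the values shown are illustrative starting points, not tuned recommendations.

# A pipeline with sampling enabled. do_sample, top_k, and top_p are standard
# Hugging Face generation parameters; the values here are illustrative.
sampling_pipe = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    max_new_tokens=50,
    repetition_penalty=1.1
)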
Run the following code to add a text prompt to the pipeline. Replace Hello World with your desired prompt.

result = pipe('Hello World')[0]['generated_text']
print(result)
The above code generates output based on the input prompt. Generating a response can take up to 5 minutes.
Verify the GPU usage statistics.
!nvidia-smi
Output:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    0    0      35554      C   /usr/bin/python3                37666MiB |
+-----------------------------------------------------------------------------+
As displayed in the above output, the Llama-2-70b-hf model uses 37.6 GB of GPU memory when executed with 4-bit quantization. In full precision, the model's VRAM consumption is much higher.
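As an alternative to restarting the kernel in the next section, you can free the GPU memory in place. Below is an optional sketch; it assumes the model, tokenizer, and pipe objects created in the previous steps.

# Optional: release the 70B base model and clear cached GPU memory without
# restarting the Jupyter kernel. Assumes model, tokenizer, and pipe are the
# objects created in the previous steps.
import gc
import torch

del pipe, model, tokenizer
gc.collect()
torch.cuda.empty_cache()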
Run the Llama 2 70B Chat Model
In this section, initialize the Llama-2-70b-chat-hf fine-tuned model with 4-bit quantization and 16-bit compute precision as described in the following steps.
On the main menu bar, click Kernel, and select Restart and Clear Outputs of All Cells to free up the GPU memory.
Click File, select the New dropdown, and create a new Notebook.
Rename the notebook to Llama-2-70b-chat-hf.
Initialize the Llama-2-70b-chat-hf model. Replace YOUR_AUTHORIZATION_TOKEN with your Hugging Face access token in the auth_token = line.

from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-70b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

quant_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

auth_token = 'YOUR_AUTHORIZATION_TOKEN'

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=auth_token
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=quant_config,
    use_auth_token=auth_token
)

model.eval()
print(f"Model loaded on {device}")

The above code uses the fine-tuned chat model Llama-2-70b-chat-hf, and your access token to access the model.
Click the play button, or press Ctrl + Enter to execute the code.
Initialize the tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=auth_token
)
Initialize the pipeline.
pipe = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    temperature=0.0,
    max_new_tokens=50,
    repetition_penalty=1.1
)
Add a text prompt to the pipeline. Replace Hello World with your desired prompt.

result = pipe('Hello World')[0]['generated_text']
print(result)
With the chat model, phrase your prompt as a dialogue or instruction so you can see how its responses differ from the base model's; a sketch of an instruction-style prompt follows.
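Below is a sketch of an instruction-style prompt. The [INST] and <<SYS>> tags follow the Llama 2 chat prompt convention; the system message and question are illustrative examples only.

# Instruction-style prompt for the chat model. The system message and the
# question are illustrative; replace them with your own content.
prompt = (
    "[INST] <<SYS>>\n"
    "You are a helpful assistant that answers concisely.\n"
    "<</SYS>>\n\n"
    "Explain what a GPU is in one sentence. [/INST]"
)

result = pipe(prompt)[0]['generated_text']
print(result)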
Verify the GPU usage statistics.
!nvidia-smi
Output:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    0    0      36099      C   /usr/bin/python3                37666MiB |
+-----------------------------------------------------------------------------+
As displayed in the above output, the Llama-2-70b-chat-hf model uses up to 37.6 GB of VRAM when executed with 4-bit quantization. The VRAM consumption of the base and fine-tuned models is similar because both share the same 70 billion parameter count.
Llama 2 Model Weights
Llama 2 models range from 7 billion to 70 billion parameters, and each model has a fine-tuned chat version. Models with fewer parameters consume less GPU memory, which makes them useful for testing inference with fewer resources, at a tradeoff in output quality.
The following model options are available for Llama 2:
- Llama-2-13b-hf: Has 13 billion parameters and uses 8.9 GB of VRAM when run with 4-bit quantized precision.
- Llama-2-13b-chat-hf: A fine-tuned version of the 13 billion parameter base model designed for chatbot-like functionality.
- Llama-2-7b-hf: Has 7 billion parameters and uses 5.5 GB of VRAM when executed with 4-bit quantized precision.
- Llama-2-7b-chat-hf: A fine-tuned version of the 7 billion parameter base model. Its VRAM consumption matches the base model, and it works like a chatbot.
The above models are openly available and commercially licensed, so you can use them for both research and commercial purposes. To try a smaller model, reuse the earlier setup and change only the model ID, as shown in the sketch below.
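Below is a minimal sketch that loads the smaller Llama-2-7b-hf model with the same 4-bit quantization settings used earlier. It assumes transformers is imported, quant_config and auth_token are defined as in the previous sections, and you have requested access to the smaller model on Hugging Face.

# Minimal sketch: load Llama-2-7b-hf with the same 4-bit quantization
# settings used earlier. quant_config and auth_token are assumed to be
# defined as in the previous sections.
model_id = 'meta-llama/Llama-2-7b-hf'

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=quant_config,
    use_auth_token=auth_token
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=auth_token
)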
Llama 2 improvements over Llama 1
Llama 2 has significant advantages over its predecessor Llama 1, with more variants available in both base and fine-tuned versions.
- Unlike Llama 1, Llama 2 is open-sourced and commercially available to use.
- Llama 2 has a parameter range of 7 to 70 billion while Llama 1 has a parameter range of 7 to 65 billion.
- Llama 2 was trained on 2 trillion tokens, 40% more than Llama 1, which improves the accuracy and knowledge of its outputs.
- Llama 2 has a context length of 4096 tokens, double the context length of Llama 1.
- Llama 2 performs better than Llama 1 on standard benchmarks such as World Knowledge, Reading Comprehension, and Commonsense Reasoning.
- Llama 2 offers fine-tuned chat models alongside the base models, while Llama 1 offers only base models.
Common Declarations
- trust_remote_code: Controls whether custom code shipped with a model repository is allowed to run when fetching the model from external sources. Enable it only for repositories you trust.
- task: Sets the pipeline task to text generation.
- temperature: Controls the output randomness, typically set between 0.1 and 1.0. Higher values (closer to 1.0) lead to more randomness in the output.
- max_new_tokens: Defines the maximum number of new tokens in the output. If not defined, the output length is unpredictable.
- repetition_penalty: Manages the likelihood of generating repeated tokens. Higher values reduce the occurrence of repeated tokens, and vice versa.
Conclusion
In this article, you used Meta Llama 2 models on a Vultr Cloud GPU server, and ran the Llama 2 70B model together with its fine-tuned chat version in 4-bit mode. With a 4-bit quantized configuration on an 80 GB A100 Vultr GPU, the 70 billion parameter models use about 37.6 GB of VRAM, the 13 billion parameter model uses about 8.9 GB, and the 7 billion parameter model uses about 5.5 GB.
More Information
For more information on the Meta Llama 2 models, visit the following official documentation resources.
- Llama-2-70b-hf documentation
- Llama-2-70b-chat-hf documentation
- Llama-2-7b-chat-hf documentation
- Llama-2-7b-hf documentation
- Llama-2-13b-hf documentation
- Llama-2-13b-chat-hf documentation
To implement more Cloud GPU solutions on your server, visit the following resources.
- AI Generated Images with OpenJourney and Vultr Cloud GPU.
- How to Use OpenTelemetry with Streamlit Applications.
- Text-guided Image-to-image Generation with Stable Diffusion.
- Fine Tune a Hugging Face Transformer model on Vultr Cloud GPU.
- How to Use Hugging Face Transformer models on a Vultr Cloud GPU.