Fine Tune a Hugging Face Transformer Model on Vultr Cloud GPU

Introduction

Large Language Models (LLMs), such as OpenAI's GPT, TII's Falcon, and EleutherAI's GPT-Neo and GPT-J, attract public attention for their broad capabilities. LLMs are a type of transformer model. Transformer models are typically trained on large volumes of data over many hours or days, using many GPUs or TPUs. Most of the models mentioned are pre-trained and ready to use as-is. However, pre-trained models are general-purpose: they do not excel at any specific task, and their output reflects the dataset(s) they were trained on.

A model's output depends on its weights. An untrained model has random weights, so its outputs are random. The training process updates the weights until the model's output matches the training goals. You can update a pre-trained model's weights so that it performs better at a specific task. To do this, train the model on a dataset specific to the task it needs to perform well on.

For example, to generate text in a specific language, the model needs to be trained on a dataset of text in that language. Similarly, to make a model that mimics a subject matter expert on a specific topic, you need to train it with resources on that topic.

The two main approaches to train a model on a specific dataset are:

Train a model from scratch on the desired dataset.
Start with a pre-trained model and train it further on the desired dataset as described in this article.

This article explains how you can fine-tune a Hugging Face Transformer model on an A100 Vultr Cloud GPU instance. It explains the steps to fine-tune GPT-Neo (with 125 million parameters) using the Netflix dataset. By the end, the text generated by the fine-tuned model has the "tone" of a show or movie description.

Prerequisites

Before you start, make sure you:

Deploy a fresh Debian server with NVIDIA A100 or A40 Vultr Cloud GPU and at least:
- 10 GB GPU RAM
Use SSH to access the server.
Create a non-root user with sudo rights and switch to the account.
Have intermediate Python programming skills.

Fine-tuning Overview

To fine-tune a model, you need:

A system with the necessary hardware and software. For example, a Vultr Cloud GPU instance.
A pre-trained model.
A dataset on which to train (fine-tune) and evaluate the pre-trained model.
A tokenizer to convert (tokenize) the dataset into a format (for example, arrays and tensors of numbers) that the model can use. Models are unable to use raw data (for example, text) directly.
A metric to evaluate the model's performance during the training process.
A training function to train the model.
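
To make the tokenizer's role concrete, the following is a minimal sketch of a toy word-level tokenizer. The vocabulary and the `toy_tokenize` function are hypothetical and exist only for illustration. GPT-Neo's actual tokenizer splits text into subword units and uses a far larger vocabulary, but the principle of mapping text to integer IDs is the same.

```python
# Hypothetical toy vocabulary: maps each known word to an integer ID.
toy_vocab = {"<unk>": 0, "a": 1, "detective": 2, "solves": 3, "crimes": 4}

def toy_tokenize(text):
    """Convert text into a list of integer IDs, using 0 for unknown words."""
    unknown_id = toy_vocab["<unk>"]
    return [toy_vocab.get(word, unknown_id) for word in text.lower().split()]

print(toy_tokenize("A detective solves crimes"))     # [1, 2, 3, 4]
print(toy_tokenize("A detective solves mysteries"))  # [1, 2, 3, 0]
```

The model only ever sees lists of numbers like these, never the raw text.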

Hardware Considerations

Training models is a GPU-heavy task. Large models have a higher number of parameters (weights) and correspondingly high GPU requirements.

The relatively small GPT-Neo model with 125 million parameters requires 1 GB to load in memory. To fine-tune this model on the Netflix dataset, the system needs 8 GB of GPU RAM using a batch size of 1. The examples in this article are based on a batch size of 1.

For production use, it's recommended to fine-tune the model with a somewhat higher batch size, such as 4 or 8. This increases the memory requirements. With a batch size of 4, the GPU RAM needed goes up to 26 GB. A batch size of 8 takes over 50 GB of GPU RAM. A machine with a single 80 GB A100 Vultr GPU is enough for this.

The larger GPT-Neo-1.3B model with 1.3 billion parameters uses 35 GB of GPU RAM to fine-tune with a batch size of 1. It takes around 55 GB with a batch size of 2. With a batch size of 2 and accumulating gradients every 8 steps, it takes 60 GB.

GPT-Neo-2.7B with 2.7 billion parameters requires 65 GB of GPU RAM with a batch size of 1. On a machine with a single A100 80 GB GPU, it trains at a rate of around 0.4 samples per second.
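
As a rough cross-check of the figures above, the sketch below estimates a lower bound on memory from the parameter count alone. It assumes 32-bit (4-byte) values and an Adam-style optimizer that keeps two extra values per parameter. These are assumptions, not measurements from this article, and the function name `estimate_memory_gb` is hypothetical.

```python
def estimate_memory_gb(num_parameters, bytes_per_value=4):
    """Return (weights_gb, training_floor_gb) for 32-bit training with Adam."""
    weights = num_parameters * bytes_per_value
    # Weights + gradients + two optimizer states per parameter = 4x the weights.
    training_floor = weights * 4
    return weights / 1e9, training_floor / 1e9

models = [("GPT-Neo-125M", 125e6), ("GPT-Neo-1.3B", 1.3e9), ("GPT-Neo-2.7B", 2.7e9)]
for name, params in models:
    weights_gb, floor_gb = estimate_memory_gb(params)
    print(f"{name}: weights ~{weights_gb:.1f} GB, training floor ~{floor_gb:.1f} GB")
```

The estimate covers weights, gradients, and optimizer states only. Activations, which grow with batch size and sequence length, and framework overhead are not counted, so measured usage is expected to be higher than these floors.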

If the server has insufficient memory to handle the model, the process terminates with an out-of-memory (OOM) error. OOM errors commonly look like the one below.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 80.00 GiB total capacity; 73.26 GiB already allocated; 98.19 MiB free; 73.57 GiB reserved in total by PyTorch)

The nvidia-smi command displays the system's GPU usage. Run it under watch to monitor GPU usage in near real time.

$ watch nvidia-smi
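
To read the same numbers from a script, for example to log memory usage while experimenting with batch sizes, one option is to call nvidia-smi with its CSV query flags and parse the result. The following is a minimal sketch. The helper names are hypothetical, and it assumes that each GPU reports plain integer values in MiB.

```python
import shutil
import subprocess

def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (in MiB) into a list of (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(value) for value in line.split(","))
        rows.append((used, total))
    return rows

def query_gpu_memory():
    """Return per-GPU (used, total) memory in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)

for index, (used, total) in enumerate(query_gpu_memory()):
    print(f"GPU {index}: {used} MiB used of {total} MiB")
```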

Set up the Debian Server

In this section, set up the Vultr Debian instance with the dependency packages required to fine-tune Hugging Face transformer models using the GPU.

Download the Conda installation script.

 $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Run the script:
```
 $ sudo bash Miniconda3-latest-Linux-x86_64.sh
```
Follow the on-screen instructions to finish the Conda installation.
To activate Conda, end your SSH session and re-log in to the server.
```
 $ ssh user@SERVER-IP
```
Upgrade Conda.
```
 $ conda upgrade -y conda
```

Create a new Conda environment named env1.

 $ conda create -y --name env1 python=3.9

Activate the environment.
```
 $ conda activate env1
```
Upgrade pip.
```
 $ pip install --upgrade pip
```

Using Conda, install the CUDA GPU packages.

 $ conda install -y -c conda-forge cudatoolkit=11.8 cudnn=8.2

Install Pytorch and NVIDIA dependencies.

 $ conda install -y -c pytorch -c nvidia pytorch=2.0.1 pytorch-cuda=11.8

Export the Conda library paths.

 $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/

Create a new Conda activation scripts directory.
```
 $ mkdir -p $CONDA_PREFIX/etc/conda/activate.d
```

Append the paths of Nvidia tools to the Conda activation script.

$ echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

$ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

Load the environment variables from the activation script:

$ source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

Using Pip, install TensorFlow, transformers, huggingface-hub, einops, accelerate, xformers, scikit-learn, and evaluate, along with the NVIDIA cuDNN package.

$ pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.* transformers==4.30.* huggingface-hub einops==0.6.1 accelerate==0.20.3 xformers==0.0.20 scikit-learn==1.3.0 evaluate==0.4.0

Test a GPU-based command.

$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If TensorFlow is installed with GPU support, the output of the above command should look like the one below.
```
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```
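
The check above confirms that TensorFlow detects the GPU, while the fine-tuning steps later in this article run on PyTorch. The sketch below is an addition, not one of the original setup steps. It performs a comparable check for PyTorch, and the function name `describe_torch_gpu` is hypothetical.

```python
def describe_torch_gpu():
    """Return a short description of the first CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment.
    if not torch.cuda.is_available():
        return None  # PyTorch is installed but cannot see a CUDA device.
    properties = torch.cuda.get_device_properties(0)
    return f"{properties.name}: {properties.total_memory / 1024**3:.1f} GiB"

print(describe_torch_gpu())
```

If this prints None, the later call to .cuda() fails, so it's worth resolving before continuing.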
Start a Python shell before proceeding further with the examples.
```
$ python
```

Fine-tune GPT-Neo-125M

GPT-Neo-125M is a smaller model with low memory requirements, and thus, it's a more beginner-friendly learning tool. In this section, fine-tune the GPT-Neo-125M model as described below.

Import the necessary packages into the Python shell.

 >>> from transformers import Trainer, TrainingArguments, AutoTokenizer, DataCollatorForLanguageModeling
 >>> from transformers import GPTNeoForCausalLM
 >>> from datasets import load_dataset
 >>> import torch, evaluate, sklearn, numpy as np

Load the model and move it to the GPU memory.

 >>> model = GPTNeoForCausalLM.from_pretrained(
         "EleutherAI/gpt-neo-125m", 
         low_cpu_mem_usage=True,
     ).cuda()

Set the padding token to be equal to the end-of-sentence (EOS) token.
```
 >>> model.config.pad_token_id = model.config.eos_token_id
```
The above step is optional. If you don't set it manually, the system does it automatically, but displays a warning when running the model:
```
 # Warning message:
 # Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
```

Import the tokenizer corresponding to the model.

 >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")

As with the model, manually set the tokenizer's padding token equal to the EOS token.
```
 >>> tokenizer.pad_token = tokenizer.eos_token
```
Load the dataset.
```
 >>> dataset = load_dataset("hugginglearners/netflix-shows")
```
The above command loads the netflix-shows dataset from the HuggingLearners repository.
Fetch the keys of an arbitrary item and check the data structure.
```
 >>> dataset["train"][100].keys() 
```
Check the data itself.
```
 >>> dataset["train"][100]
```
The fine-tuning process uses text in the description field. This field consists of descriptions of Netflix shows.
Define a tokenizing function:
```
 >>> def tokenize_function(examples):
         tokenized_data = tokenizer(examples["description"], padding="max_length", truncation=True)
         return tokenized_data
```
This tokenizing function calls the tokenizer (imported earlier) and applies it to the data item's description field. Simultaneously, it pads and truncates the input. This is necessary because models need standardized inputs: each input in a batch must be of the same length, and the maximum length of each input is limited. With real-world data, however, the inputs are of different lengths, and they can also be longer than the maximum length the model accepts. Padding and truncating fix this.
To tokenize the dataset, use the dataset's map method to apply the tokenizing function to each item in the dataset:
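
The effect of padding and truncation can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation. The function name `pad_and_truncate` and the short `max_length` are hypothetical, and the padding ID 50256 is GPT-Neo's EOS token ID, reused as the padding token as configured earlier.

```python
def pad_and_truncate(token_ids, max_length, pad_token_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                         # truncate long inputs
    padding = [pad_token_id] * (max_length - len(kept))   # pad short inputs
    attention_mask = [1] * len(kept) + [0] * len(padding) # 0 marks padding
    return kept + padding, attention_mask

PAD = 50256  # EOS token ID, reused here as the padding token

print(pad_and_truncate([11, 22, 33], max_length=5, pad_token_id=PAD))
# ([11, 22, 33, 50256, 50256], [1, 1, 1, 0, 0])
print(pad_and_truncate([11, 22, 33, 44, 55, 66, 77], max_length=5, pad_token_id=PAD))
# ([11, 22, 33, 44, 55], [1, 1, 1, 1, 1])
```

The attention mask tells the model which positions hold real tokens and which hold padding.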
```
 >>> tokenized_dataset = dataset.map(tokenize_function, batched=True)
```
Create two subsets from the tokenized dataset: a training dataset and an evaluation dataset:
```
 >>> train_dataset = tokenized_dataset["train"].select(range(0, 8000))
 >>> eval_dataset = tokenized_dataset["train"].select(range(8000, 8800))
```
The original dataset has over 8800 items. The training dataset consists of 8000 items and the evaluation dataset has 800.

In practice, it's advisable to test the entire process with a smaller dataset. Create sample training and evaluation datasets (with, for example, around 100 items for training, and 20 for evaluating).
```
>>> # train_dataset = tokenized_dataset["train"].select(range(0, 100))
>>> # eval_dataset = tokenized_dataset["train"].select(range(100, 120))
```
A small dataset, as shown above, is not useful for training (fine-tuning). However, it helps to test the entire process before applying it to the complete dataset. The example code directly uses the full dataset, not the smaller samples.
Specify the output directory in which to store model checkpoints:
```
>>> output_dir =  "./fine_tuned_models/gpt-neo-125m"
```
Define a set of training arguments - these are the parameters passed to the training function.
```
>>> training_args = TrainingArguments(
        output_dir = output_dir,
        logging_dir='./logs',
        label_names=['input_ids', 'attention_mask'], 
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        num_train_epochs=1,
    )
```
Below are the parameters used in the training arguments:
- The output directory stores the model checkpoints.
- The logging directory stores the training logs.
- label_names is the list of keys in the input dictionary that the trainer treats as labels.
- per_device_train_batch_size specifies the batch size. In this case, the example uses a batch size of 1. If this parameter is not specified, the training process uses the default batch size of 8. Larger batch sizes need more GPU memory. The batch size is per GPU: if the machine has 4 GPUs and this parameter is set to 1, each GPU processes a batch of 1 item in parallel. Larger batch sizes generally lead to better training. However, a batch size that is too large leads to instability in the training process.
- gradient_accumulation_steps is the number of steps for which the gradient should be accumulated before the weights are updated. This allows you to apply gradient accumulation during the training process. Many GPUs cannot hold larger batches in memory. Sometimes, the GPU can only handle a batch size of 1. Gradient accumulation is a way to partially replicate the (positive) effects of larger batch sizes while still training with smaller batches. The default value is 1. Experiment with using a value of 4, 8, or 16.
- num_train_epochs specifies the number of epochs. The higher the number of epochs, the longer the processing time. If this parameter is not specified, it uses the default value of 3 epochs. Having too few epochs can result in an undertrained model. Too many epochs can result in overfitting. The examples in this article run the training for a single epoch. This is inappropriate in a production setting.
Performance of the training (fine-tuning) process depends on many parameters. Configuring these parameters with the right values is partly subjective and depends on experience. It also varies from model to model. This is called optimizing the training process.
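
The idea behind gradient_accumulation_steps can be illustrated with a toy model that has a single weight. The sketch below is not how the trainer is implemented, and the data and the function `grad_mse` are made up for illustration. It shows that averaging the gradients of four batches of size 1 gives the same gradient as one batch of size 4.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # made-up (x, y) pairs
w = 0.5

# One large batch of 4 items: needs memory for all 4 items at once.
full_batch_grad = grad_mse(w, data)

# Four micro-batches of 1 item each, accumulated before a single weight update.
accumulation_steps = 4
accumulated_grad = 0.0
for item in data:
    accumulated_grad += grad_mse(w, [item]) / accumulation_steps

# Effective batch size on a single GPU: per-device batch size * accumulation steps.
effective_batch_size = 1 * accumulation_steps

print(full_batch_grad, accumulated_grad, effective_batch_size)
```

The two gradients agree up to floating-point rounding, while the accumulated version only ever holds one item in memory at a time. The trade-off is that it takes four forward and backward passes per weight update instead of one.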

Define a metric:
```
>>> metric = evaluate.load("accuracy")
```
This metric is loaded using Hugging Face's Evaluate library. During evaluation, the model's performance is measured based on this metric. The metric can be based on different parameters. In this case, the parameter is "accuracy".
Define a function compute_metrics which uses the above metric.
```
>>> def compute_metrics(prediction_to_evaluate):
        logits, labels = prediction_to_evaluate
        prediction = np.argmax(logits, axis=-1)
        return metric.compute(predictions=prediction, references=labels)
```
This function is passed to the trainer as a parameter. The trainer uses it to measure (evaluate) the accuracy of the model's outputs (predictions) during evaluation. It does not affect the loss.
Data collators are special functions that tell the trainer how to form batches from a list of input elements. In this case, it is necessary to specify a collator specific to language models.
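
To show what the argmax-then-compare logic in compute_metrics does, here is a pure-Python sketch on made-up values. The helper names are hypothetical, and the tiny 4-token vocabulary stands in for the model's real vocabulary.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def token_accuracy(logits, labels):
    """Fraction of positions where the highest-scoring token matches the label."""
    predictions = [argmax(position_scores) for position_scores in logits]
    matches = sum(p == l for p, l in zip(predictions, labels))
    return matches / len(labels)

# Made-up scores for 3 positions over a 4-token vocabulary.
logits = [[0.1, 2.0, 0.3, 0.0],
          [1.5, 0.2, 0.1, 0.4],
          [0.0, 0.1, 0.2, 3.0]]
labels = [1, 2, 3]

print(token_accuracy(logits, labels))  # 2 of 3 predictions match
```

Each row of logits holds one score per vocabulary entry, and the prediction is the index with the highest score.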
```
>>> data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```
The collator accepts the tokenizer as an input parameter. The parameter mlm specifies whether masked language modeling is used. Masked language modeling trains a model to predict hidden (masked) tokens from the surrounding context, as in BERT-style models. In this case, to build a generative model, set the mlm parameter to False.
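
With mlm set to False, the collator prepares labels for next-token prediction: the labels are a copy of the input IDs, with padding positions replaced by -100 so that the loss ignores them. The following is a simplified pure-Python sketch of that behavior, not the library's code, and the function name `causal_lm_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, masking padding positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in input_ids]

PAD = 50256  # EOS token ID, reused as the padding token
input_ids = [11, 22, 33, PAD, PAD]

print(causal_lm_labels(input_ids, PAD))  # [11, 22, 33, -100, -100]
```

Because the padding token is the same as the EOS token in this setup, genuine EOS tokens are masked in the same way.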

Using the model, the training arguments, the datasets, the compute_metrics function, and the data collator, create the trainer.

>>> trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
        data_collator=data_collator,
    )

Run the training, and assign the output to a variable, result.
```
>>> result = trainer.train()
```
The training (fine-tuning) can take a while. On an instance with 2 vCPUs and 10 GB GPU, it takes around 3 hours for the Netflix dataset. After the training finishes, check the details of the training process.
```
>>> result.metrics
```
This shows the time taken for the training, the rate of training (samples per second), the loss, and other information. Monitor these results while trying to improve the fine-tuning by varying the training_args parameters (for example, gradient_accumulation_steps and per_device_train_batch_size).
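
One way to make the reported loss easier to interpret is to convert it to perplexity, which is the exponential of the mean cross-entropy loss. The sketch below uses a hypothetical metrics dictionary with made-up numbers. In a real session, pass the train_loss value from result.metrics instead.

```python
import math

def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(mean_loss)

# Hypothetical metrics; the numbers are made up for illustration only.
example_metrics = {"train_loss": 3.2, "train_samples_per_second": 0.74}

print(round(perplexity(example_metrics["train_loss"]), 2))
```

Lower is better. The training loss is averaged over the whole run, during which the weights keep changing, so treat this figure as a rough indicator rather than a measure of the final model.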

Save the trained model with an appropriate name.

>>> trainer.save_model('vultr/finetuned_gpt_neo_125_netflix')

Exit and restart the Python shell.
```
>>> exit()
```

Use the Fine-tuned model

In a new Python session, import the required modules.

 >>> from transformers import pipeline, AutoTokenizer
 >>> from transformers import GPTNeoForCausalLM

Import the tokenizer for the model:

 >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")

Both the pre-trained and fine-tuned models use the same tokenizer. Import the default (pre-trained) model.

 >>> model = GPTNeoForCausalLM.from_pretrained(
         "EleutherAI/gpt-neo-125m", 
         low_cpu_mem_usage=True,
     ).cuda()

Import the fine-tuned model you saved earlier.

 >>> model_finetuned = GPTNeoForCausalLM.from_pretrained(
         "vultr/finetuned_gpt_neo_125_netflix", 
         low_cpu_mem_usage=True,
     ).cuda()

Define a pipeline with the default model.

 >>> generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device=0)

Define a pipeline with the fine-tuned model.

 >>> generator_finetuned = pipeline(task="text-generation", model=model_finetuned, tokenizer=tokenizer, device=0)

Declare a few input prompts:

 >>> prompt = "One fine morning" 
 >>> # prompt = "A boy and a girl"
 >>> # prompt = "Vultr is a cloud service provider"

Generate text using the default model:

 >>> generator(prompt, max_length=200, do_sample=True, num_return_sequences=1)

Generate text using the fine-tuned model:

 >>> generator_finetuned(prompt, max_length=200, do_sample=True, num_return_sequences=1)

Try a few other prompts and run each prompt a few times.
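
Each run produces different text because do_sample=True draws every next token at random, weighted by the model's predicted probabilities, instead of always taking the most likely one. The toy sketch below contrasts the two strategies. The probabilities and function names are made up for illustration and do not come from either model.

```python
import random

# Made-up next-token probabilities; a real model produces one per vocabulary entry.
next_token_probs = {"detective": 0.5, "family": 0.3, "chef": 0.15, "robot": 0.05}

def greedy_pick(probs):
    """Always return the most likely token (greedy decoding)."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw a token at random, weighted by its probability (sampling)."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
print(greedy_pick(next_token_probs))
print([sample_pick(next_token_probs, rng) for _ in range(5)])
```

Greedy decoding returns the same token every time, while sampling can return any token with non-zero probability. This is why several runs per prompt give a better impression of a model's behavior than a single run.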

Compare the outputs of the pre-trained and the fine-tuned model. Observe, qualitatively, the difference in the outputs. Notice that the fine-tuned model's output reads more like a show or movie description.

Fine-tune Larger GPT-Neo Models

The process of fine-tuning and using larger GPT-Neo models is the same. In the above code samples, replace EleutherAI/gpt-neo-125m with EleutherAI/gpt-neo-1.3b to use the model with 1.3 billion parameters. Use EleutherAI/gpt-neo-2.7b for the model with 2.7 billion parameters. Save the fine-tuned models with an appropriate name. Both models can be fine-tuned on a single A100 Vultr instance with 80 GB GPU RAM.

Conclusion

Fine-tuning transformer models is a large and complex topic. This article introduces the principles of fine-tuning, in particular for LLMs, and demonstrates them with an example. The example fine-tunes a pre-trained LLM (GPT-Neo with 125 million parameters) on a Hugging Face dataset consisting of Netflix show descriptions. Given an input prompt, the resulting model is capable of generating text in the style of a Netflix show description.

It's critical to understand that an LLM, fine-tuned or otherwise, merely generates coherent text that sounds similar to the training corpus. Current LLMs are not concerned with the meaningfulness or correctness of their outputs. The training process optimizes for generating text, not for acquiring or understanding knowledge. While a fine-tuned LLM can be a valuable tool for summarizing text, sentence completion, answering questions, and the like, the outputs should be subject to human scrutiny.
