How to Build an Inference API Using Hugging Face Transformers and FastAPI

Updated on November 21, 2023

Introduction

Transformer machine learning models are versatile and can be adapted to perform a broad range of tasks. In particular, large language models (LLMs) are valuable for many use cases because of their language-related abilities. To get started, you can use a pre-trained model such as GPT-J or Falcon, or train (fine-tune) a pre-trained model for a specific task.

Inference is the process of applying a model to input data to produce a specific output. To serve a model to users over the internet and put it into production, build an inference API. To build an inference API, you need:

  • An AI model, such as an LLM (pre-trained or fine-tuned)
  • A web framework to build and serve APIs
  • A system that's configured to accept and serve requests over the internet

This article explains how to implement each of these building blocks and run a functional inference API on a Vultr Cloud Server.

Scope

In this article, you will build inference APIs for Hugging Face Transformer models, and examples are based on text generation using the pre-trained GPT-J model with 6 billion parameters. You will also use the smaller GPT Neo 125M model which can be run with 1 GB GPU RAM. However, the output quality is noticeably worse when using smaller pre-trained models.
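
As a rough check on these GPU requirements, the memory needed just to hold a model's weights can be estimated as the parameter count multiplied by the bytes per parameter. The sketch below is an approximation under stated assumptions: the helper name is hypothetical, the parameter counts are rounded (6 billion and 125 million), 2 bytes per parameter corresponds to half precision (float16), and activations, the CUDA context, and framework overhead are ignored, so real usage is higher.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory in GB (1 GB = 1e9 bytes) needed to hold the weights only."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Rounded parameter counts taken from the model names (assumption).
    for name, params in [("GPT-J 6B", 6e9), ("GPT Neo 125M", 125e6)]:
        size = estimate_weight_memory_gb(params)
        print(f"{name}: about {size:.2f} GB of weights in float16")
```

Loading the same model in full precision (4 bytes per parameter) doubles the estimate, which is one reason the examples in this article load the weights as float16.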

FastAPI is used to build the API interface, and Gunicorn with Uvicorn workers serves the API. Python-based frameworks such as Django are useful when building full-fledged web applications. For an API-only application such as the one in this article, FastAPI is a better fit because it is a lightweight framework designed for building APIs.

Prerequisites

Before you begin:

Set up the Server

In this section, set up the Debian server with the necessary packages required to run an inference API using Hugging Face transformer models. You will install tools to run the models and serve the API as described in the steps below.

  1. Install htop and Tmux:

     $ sudo apt install -y htop tmux
  2. Download the Conda installer.

     $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  3. Run the installer.

     $ bash Miniconda3-latest-Linux-x86_64.sh

    Reply to the Installation prompts as below:

     Do you accept the license terms? [yes|no]
     [no] >>>  yes
    
     Miniconda3 will now be installed into this location:
     /home/example-user/miniconda3
    
      - Press ENTER to confirm the location
      - Press CTRL-C to abort the installation
      - Or specify a different location below
    
     [/home/example-user/miniconda3] >>>
     PREFIX=/home/example-user/miniconda3
     Unpacking payload ...
    
     Do you wish the installer to initialize Miniconda3
     by running conda init? [yes|no]
     [no] >>> yes
  4. When the installation is successful, disconnect your SSH session.

     $ exit
  5. Start a new SSH session to activate Conda.

     $ ssh example-user@SERVER-IP

    When logged in, your prompt should look like the one below:

     (base) example-user@Test:~$
  6. Upgrade Conda.

     $ conda upgrade -y conda
  7. Create a new Conda environment env1 with the latest Python3 version 3.11.

     $ conda create -y --name env1 python=3.11

    Verify the latest Python3 version before installing 3.11.

  8. Activate the environment env1.

     $ conda activate env1
  9. Upgrade pip.

     $ pip install --upgrade pip
  10. Using Conda, install the CUDA GPU packages.

     $ conda install -y -c conda-forge cudatoolkit=11.8 cudnn=8.2
  11. Install Pytorch and related GPU dependencies.

     $ conda install -y -c pytorch -c nvidia pytorch=2.0.1 pytorch-cuda=11.8
  12. Add the Conda library directory to the library search path.

     $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
  13. Create the activation directory.

     $ mkdir -p $CONDA_PREFIX/etc/conda/activate.d
  14. Append paths to the Nvidia tools to the activation shell script.

     $ echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
     $ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
  15. Load the environment variables from the activation script.

     $ source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
  16. Install tensorflow, transformers, huggingface-hub, Nvidia tools, and dependencies like accelerate.

     $ pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.* transformers==4.30.* huggingface-hub accelerate==0.20.3 xformers==0.0.20
  17. To test GPU integration, run the following Python command.

     $ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

    If your output looks like the one below, Python and Conda environments have access to the machine's GPU.

     [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
  18. Install FastAPI together with Pydantic, Uvicorn, and Gunicorn.

     $ pip install fastapi==0.100.0 pydantic==1.10.4 "uvicorn[standard]"==0.22.0 gunicorn==20.1.0

    The above command installs the following packages. FastAPI is the Python framework used to build the API application. Pydantic is a Python data validation library used to implement custom data types. Uvicorn is a low-level Python web server for asynchronous applications based on the Asynchronous Server Gateway Interface (ASGI) standard. Gunicorn is a Python HTTP server based on the Web Server Gateway Interface (WSGI) standard and is used to serve the API.
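
To make the ASGI standard mentioned above more concrete, the following is a minimal sketch of a bare ASGI application written with the standard library only. It is not how FastAPI is implemented internally. It only illustrates the callable interface (scope, receive, send) that an ASGI server such as Uvicorn invokes for each request. The helper call_once is hypothetical and plays the role of the server for illustration.

```python
import asyncio
import json


async def app(scope, receive, send):
    """A bare ASGI application: an async callable taking scope, receive, and send."""
    if scope["type"] != "http":
        return  # ignore non-HTTP scopes in this sketch
    body = json.dumps({"path": scope["path"]}).encode()
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"application/json")],
    })
    await send({"type": "http.response.body", "body": body})


async def call_once(path):
    """Drive the app by hand, standing in for the server (illustration only)."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": path}, receive, send)
    return sent


if __name__ == "__main__":
    for message in asyncio.run(call_once("/generate")):
        print(message)
```

A FastAPI application object exposes this same callable interface, which is why an ASGI server such as Uvicorn can serve it.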

Inference API for Text Generation

In this section, set up a basic API to serve a text generation model, and configure it for production use as described below.

  1. Using a text editor such as Nano, create a new Python file app.py.

     $ nano app.py
  2. Add the following code to the file.

     # import this transformer to run GPT J 6B
     from transformers import GPTJForCausalLM
    
     # import this transformer to run GPT Neo 125M
     # from transformers import GPTNeoForCausalLM 
    
     from transformers import pipeline, AutoTokenizer 
     import torch  
     from fastapi import FastAPI 
     from pydantic import BaseModel 
     from uvicorn.workers import UvicornWorker
    
     # use this tokenizer to run GPT J 6B
     tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b") 
    
     # use this tokenizer to run GPT Neo 125M
     # tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")  
    
     # use this model for GPT J 6B
     model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6b", torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda() 
    
     # use this model for GPT Neo 125M
     # model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m", torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda() 
    
     generate_pipeline = pipeline(task="text-generation", model=model, device=0, max_length=500, do_sample=True, num_return_sequences=1, tokenizer=tokenizer)
    
     app = FastAPI() 
    
     class InputPrompt(BaseModel):     
        text: str  
    
     class GeneratedText(BaseModel):     
        text: str  
    
     @app.post("/generate", response_model=GeneratedText) 
     async def generate_func(prompt: InputPrompt):     
        output = generate_pipeline(prompt.text)          
        return {"text": output[0]["generated_text"]}

    Save and close the file.

    The above code imports the necessary packages, defines the tokenizer and model, and uses Hugging Face pipelines to declare a text-generation pipeline using the GPT-J model. app = FastAPI() creates the FastAPI application instance on which the API endpoint is registered.

    Two classes, InputPrompt and GeneratedText, are declared to specify the input and output data types. The POST endpoint /generate accepts user input as a JSON object in the body of the HTTP request. generate_func passes the prompt text to the pipeline and returns the generated text to the user.

  3. Using Gunicorn, run the App.

     $ gunicorn app:app --timeout 1000  --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8080

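    The above command starts a server that listens on port 8080 on all network interfaces (0.0.0.0), and serves app, the FastAPI app declared in the app.py file. In app:app, the first app refers to the app.py module and the second refers to the FastAPI object inside it.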

  4. Using curl, test the application with POST data:

     $ curl -X 'POST' 'http://localhost:8080/generate' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"text": "my name is "}'

    In the above command, curl sends data as a JSON object. -d specifies the request body, the accept header specifies the data type the client can accept and understand, and the Content-Type header specifies the request data type.
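
The same request can also be sent from Python. The following is a minimal sketch that uses only the standard library and mirrors the curl command above. The function names build_generate_request and generate are hypothetical, and the 1000-second timeout is an assumption chosen to match the Gunicorn --timeout value.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:8080"):
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"text": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=payload,
        method="POST",
        headers={"accept": "application/json", "Content-Type": "application/json"},
    )


def generate(prompt, base_url="http://localhost:8080", timeout=1000):
    """Send the prompt to the running API server and return the generated text."""
    request = build_generate_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["text"]


# Example, with the API server running:
#     print(generate("my name is "))
```

The response body follows the GeneratedText type from app.py, so the generated text is read from the "text" key.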

Expose the API Server to the Internet

To allow external connections to the API server and serve requests from Internet users, open the API server port through the firewall as described in this section.

  1. UFW is active by default on Vultr Debian servers. Verify the firewall status.

     $ sudo ufw status
  2. Allow connections to the API Server port 8080.

     $ sudo ufw allow 8080/tcp
  3. Reload Firewall rules to apply changes.

     $ sudo ufw reload
  4. In your local terminal session, connect to the inference API over the Internet.

     $ curl -X 'POST' 'http://remote.server.ip.address:8080/generate' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"text": "my name is "}'
  5. To stop the API server, find its background job ID.

     $ jobs

    Output:

     [1]+  Running 
  6. Kill the job using its ID.

     $ kill %1

In this section you implemented a basic inference API with type safety on the text generation pipeline. To run inference on other types of pipelines and models, modify the pipeline and type definitions as desired.

Serve Fine-tuned Models

The process for serving fine-tuned models is similar to serving pre-trained models. Before serving a fine-tuned model, train and save it. Alternatively, you can download and save a fine-tuned model for implementation.

For example, instead of:

# model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6b", torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda() 

Use the path to the saved fine-tuned model as below:

# model = GPTNeoForCausalLM.from_pretrained("vultr/fine_tuned_gpt_neo_125", torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda() 

The model path vultr/fine_tuned_gpt_neo_125 in the above code is based on the examples for Fine Tuning a Hugging Face Transformer Model on Vultr Cloud GPU.
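
One way to switch between a pre-trained and a fine-tuned model without editing app.py each time is to read the model path from the environment. The following is a small sketch of that idea. The MODEL_PATH variable and the resolve_model_path helper are hypothetical names introduced for illustration and are not part of the original application.

```python
import os

# Falls back to the pre-trained model used in this article.
DEFAULT_MODEL = "EleutherAI/gpt-j-6b"


def resolve_model_path(environ=os.environ):
    """Return the model name or path to load, preferring the MODEL_PATH variable."""
    return environ.get("MODEL_PATH", "").strip() or DEFAULT_MODEL


if __name__ == "__main__":
    print(resolve_model_path())
```

The returned value can then be passed to from_pretrained() in place of the hard-coded string. The model class must still match the model type, for example GPTNeoForCausalLM for a fine-tuned GPT Neo model.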

Multithreading

  1. To run the server with multiple workers, run Gunicorn with the --workers option:

     $ gunicorn app:app --workers 2 --timeout 1000  --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8080

    The above command forks the Gunicorn master process into N worker processes, where N is the value of --workers. When unspecified, N defaults to 1. Each worker loads its own instance of the model into the GPU. Hence, if the model needs X GB of GPU RAM to run, N workers need around X*N GB of GPU RAM.

  2. When Gunicorn is started with multiple workers, use ps to check the system processes to view the different forks.

     $ ps -ax | grep python

    Your output should look like the one below:

     21895 pts/1    Dl+    1:08 /root/miniconda3/envs/env1/bin/python /root/miniconda3/envs/env1/bin/gunicorn app:app --workers 2 --timeout 1000 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8080

    Each line displayed in the output is a fork of the main process. Run watch nvidia-smi at the Linux terminal to monitor the GPU usage in real-time. Verify that each of the forked processes occupies its own GPU memory space.

  3. To decide the approximate number of workers, the common rule below is used.

     N = number of threads + 1

    On regular cloud servers, 1 vCPU is equal to 1 thread, while dedicated servers with hyperthreading have 2 threads per CPU core. When running multiple workers, verify that the amount of GPU memory is enough to run N copies of the model. Otherwise, the system outputs an out-of-memory error.

    In some cases, even with enough GPU memory, loading a large model with multiple workers can display worker termination warnings as below:

     [20153] [WARNING] Worker with pid 20154 was terminated due to signal 9

    Run the dmesg utility to view a more detailed output.

     oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-3.scope,task=gunicorn,pid=20154,uid=0
     [ 4412.006670] Out of memory: Killed process 20154 (gunicorn) total-vm:38942992kB, anon-rss:30142812kB, file-rss:2304kB, shmem-rss:0kB, UID:0 pgtables:68700kB oom_score_adj:0

    As per the output, loading the worker's objects (such as its copy of the model) exhausted the system memory, and the kernel terminated the worker. In general, the system automatically spawns another process and recovers from this error. If the warnings persist, try increasing the timeout value.

    Python functions can be defined using either def or async def. Running the app with N Uvicorn workers creates N worker processes, and incoming requests are distributed among them. With async def, each worker runs the endpoint function on its event loop. Because the call to the pipeline inside the function is blocking, a worker processes the requests assigned to it sequentially, and requests that arrive while the slow task (generating ML output) is running wait until it finishes.

    Using def runs each incoming request in a separate thread taken from a thread pool. The threads run in parallel, each processing its own request. Generating output from a machine learning model is a resource-intensive operation. Therefore, having many concurrent threads leads to resource contention and slows down the system.
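
The worker-count rule and the GPU memory estimate from this section can be combined into a quick sizing check. The following is a rough sketch, not a tuning tool. The function name suggest_workers and the example figures are hypothetical, and the per-copy model memory must be measured on your own server, for example with nvidia-smi.

```python
import math
import os


def suggest_workers(cpu_threads, gpu_memory_gb, model_memory_gb):
    """Combine the N = threads + 1 rule with the X * N GPU memory requirement."""
    by_cpu = cpu_threads + 1
    by_gpu = math.floor(gpu_memory_gb / model_memory_gb)
    # A result of 0 means that not even one copy of the model fits in GPU memory.
    return min(by_cpu, by_gpu)


if __name__ == "__main__":
    threads = os.cpu_count() or 1
    # Hypothetical figures: a 24 GB GPU and a model needing about 13 GB per copy.
    print(suggest_workers(threads, gpu_memory_gb=24, model_memory_gb=13))
```

With these example figures, GPU memory rather than the thread count is the limiting factor, which is typical for large models.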

Performance Testing

To get a better understanding, test the system performance using both async def and def, and different numbers of workers. To study performance differences, use a smaller model, such as GPT Neo 125M. In this section, test how fast the API server responds to requests.
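
Before testing against the real model, the difference between the two function styles can be previewed with a stand-in task. The sketch below is a simulation under stated assumptions: time.sleep stands in for the blocking pipeline call, asyncio.to_thread imitates the way a def endpoint is run in a worker thread, and all function names are hypothetical. It does not use FastAPI or a GPU.

```python
import asyncio
import time

TASK_SECONDS = 0.2  # stand-in for slow model inference (assumption)


def slow_task():
    """Blocking work, similar to a synchronous pipeline call."""
    time.sleep(TASK_SECONDS)


async def blocking_async_handler():
    """Mimics an async def endpoint that calls blocking code directly."""
    slow_task()  # blocks the event loop until it returns


async def threaded_handler():
    """Mimics a def endpoint, which is run in a worker thread."""
    await asyncio.to_thread(slow_task)


async def time_concurrent(handler, requests=3):
    """Start several requests at once and return the total elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(requests)))
    return time.perf_counter() - start


if __name__ == "__main__":
    serial = asyncio.run(time_concurrent(blocking_async_handler))
    threaded = asyncio.run(time_concurrent(threaded_handler))
    print(f"async def with blocking call: {serial:.2f}s")
    print(f"def run in threads:           {threaded:.2f}s")
```

In this simulation, the blocking async handlers run one after another, while the threaded handlers overlap. With a real model, the threads compete for the same GPU, so the gain is smaller and contention can slow the system down.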

In your remote session, use curl, and verify how long the server takes to respond to a single HTTP POST request.

$ curl -o /dev/null -X 'POST' -w 'Total: %{time_total}s\n' 'http://localhost:8080/generate' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"text": "my name is "}'

The above command tests the total time taken to get a response from the server. This helps when testing the response time (without the effect of network latencies) of the server.

To test server responsiveness to concurrent requests, establish two more SSH sessions, and run the above curl command in each session in quick succession. While the system is processing the requests, monitor the htop utility output to view the number of threads and CPU in use, and take note of the time taken for each request to complete.
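
As an alternative to opening several SSH sessions, concurrent requests can be issued from a single Python script. The following is a minimal sketch using only the standard library. The function names are hypothetical, the URL and request body are taken from the curl command above, and the 1000-second timeout is an assumption matching the Gunicorn --timeout value.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/generate"  # endpoint used in this article


def post_prompt():
    """Send one POST request to the API and read the full response."""
    data = json.dumps({"text": "my name is "}).encode("utf-8")
    request = urllib.request.Request(
        URL, data=data, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=1000) as response:
        response.read()


def timed_call(send):
    """Run one request function and return how long it took, in seconds."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def time_concurrent_requests(send, concurrency=3):
    """Issue several requests at the same time and return each duration."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_call, send) for _ in range(concurrency)]
        return [future.result() for future in futures]


# Example, with the API server running:
#     for seconds in time_concurrent_requests(post_prompt, concurrency=3):
#         print(f"Total: {seconds:.2f}s")
```

Repeating the measurement with different --workers values, and with the endpoint declared as def instead of async def, shows how each setting affects the individual request times.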

Conclusion

In this article, you set up an API server to run inference on Hugging Face Transformer models, and built a text generation API from scratch using a Vultr Cloud GPU Server. You also made code changes to run any pre-trained or fine-tuned models on the server. When serving an API to a wide user base, make sure that you adequately address performance, security, and load balancing concerns.

For more implementations, visit the following resources: