AI Image Captioning With BLIP-2 on Vultr Cloud GPU

Updated on October 6, 2023


Bootstrapping Language-Image Pre-training (BLIP-2) is a pre-training framework that builds on existing pre-trained vision models and large language models (LLMs) for zero-shot image-to-text generation. It delivers strong results across a wide range of vision-language tasks. BLIP-2 combines three models: an image encoder, a Querying Transformer (Q-Former), and a large language model. Together, they allow it to perform tasks such as:

  • Image captioning
  • Visual question answering (VQA)
  • Chat-like conversations by retaining the previous conversation using prompts.

This article explains how to perform AI image captioning with BLIP-2 on a Vultr Cloud GPU server. You use the BLIP-2 model to perform zero-shot image-to-text generation tasks on an imported image.


Before you begin:

Set Up the Server

In this section, set up the server to run the BLIP-2 model with all necessary dependency packages as described in the steps below.

  1. Install PyTorch

     $ pip3 install torch torchvision --index-url

    The above command installs PyTorch with pre-built CUDA 11.8 libraries. To use the latest version, refer to the PyTorch Documentation.

  2. Install Jupyter Notebook

     $ pip3 install notebook
  3. By default, UFW is active on Vultr servers. Therefore, allow the Jupyter Notebook port 8888 through the firewall to accept connections

     $ sudo ufw allow 8888
  4. Reload the firewall to apply changes

     $ sudo ufw reload
  5. Start Jupyter Notebook

     $ jupyter notebook --ip= 

    The above command starts a Jupyter Notebook instance that listens for incoming connections on all server interfaces. If the command returns an error, exit your SSH session and start it again to activate the Jupyter library on your server.

  6. Using a web browser such as Chrome, access Jupyter Notebook with the token generated in your command output

  7. Within the Jupyter Notebook interface, click New, select Notebook and create a Python3 Kernel file to start working on the model


Set Up the Model

In this section, use Jupyter Notebook to import the required model libraries, load the pre-trained or fine-tuned BLIP-2 captioning model, and run it on the server as described in the steps below.

  1. Install the salesforce-lavis package

     !pip3 install salesforce-lavis

    LAVIS is a Python deep learning library used for Language-and-Vision research and applications in tasks like retrieval, captioning, visual question answering, and multi-modal classification. It's used along with BLIP-2 for Visual Question Answering (VQA) related tasks.

  2. Upgrade Jupyter Notebook and ipywidgets

     !pip3 install --upgrade jupyter ipywidgets
  3. Import the required libraries

     import torch
     from PIL import Image
     import requests
     from lavis.models import load_model_and_preprocess

    Below is what the libraries do:

    • torch: The PyTorch library, used to build, train, and run neural networks
    • Image: Imported from PIL, it provides image handling capabilities such as opening and converting images
    • requests: Downloads an image from a specified URL
    • load_model_and_preprocess: Loads a pre-trained image captioning model along with its preprocessing steps
  4. Import the base image. Replace with your actual image URL

     img_url = ''
  5. Process the image

     raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

    Below is what the function does:

    • Image.open(): Opens the raw image data. In the above code, it opens the raw image data from the HTTP response
    • requests.get(img_url, stream=True).raw: Uses the requests library to send an HTTP GET request to the URL specified in img_url. The stream=True argument streams the response instead of downloading it in full up front. Streaming is useful for large files such as images because the data downloads in chunks, which conserves memory. The .raw attribute provides access to the raw content of the response, which is the image data.
    • convert('RGB'): Converts the image to a specified mode. In this case, the image is converted to 'RGB' (Red, Green, Blue), the mode commonly used for image processing tasks. This is necessary because the original input image can be in a different color mode, and converting it to RGB ensures consistent processing.

    To view the generated RGB image, run the following command:

  6. Move the computations to the GPU memory

     device = torch.device("cuda")

    The above command creates a torch.device object that represents the CUDA device. CUDA allows you to use NVIDIA GPUs to speed up computations in machine learning and other tasks.

  7. Load the Pre-trained BLIP-2 Model

     model, vis_processors, _ = load_model_and_preprocess(

    Below is what the function does:

    • model: Holds the loaded pre-trained image captioning model
    • vis_processors: Holds the visual (image) processors. These processors handle the pre-processing steps that turn a raw image into the tensor format the model expects
    • load_model_and_preprocess: Consists of several arguments to load the model and preprocess the data. These include:
      • name: Specifies a name or identifier for a specific model
      • model_type: Specifies the type or variant of the pre-trained model to load
      • is_eval: Sets whether the model loads in evaluation (inference) mode
      • device: Specifies the device on which the model loads and runs. In this case, cuda which means the model uses available GPU memory

    You can replace the above code block with any of several other model variants. These include the following:


     model, vis_processors, _ = load_model_and_preprocess(
          name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
     )


     model, vis_processors, _ = load_model_and_preprocess(
          name="blip2_opt", model_type="pretrain_opt6.7b", is_eval=True, device=device
     )


     model, vis_processors, _ = load_model_and_preprocess(
          name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
     )


     model, vis_processors, _ = load_model_and_preprocess(
          name="blip2_opt", model_type="caption_coco_opt6.7b", is_eval=True, device=device
     )


     model, vis_processors, _ = load_model_and_preprocess(
          name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
     )

    You can use any one of the above variants to load and set up the model.

    You only need one model to generate output. Make sure only one model is loaded at a time.

  8. Load the processor

  9. Prepare the image as input using the associated processors

     image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    Below is what the code does:

    • vis_processors["eval"]: vis_processors contains various processors, and "eval" refers to a processor designed specifically for evaluation.
    • unsqueeze(0): Adds a new dimension of size 1 to the tensor. In this case, it's converting the processed image tensor from a 3-dimensional tensor to a 4-dimensional tensor
    • to(device): Moves the tensor to the GPU for processing
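
To make the shape change from unsqueeze(0) concrete, here is a minimal pure-Python sketch of the bookkeeping. It does not call PyTorch: unsqueeze_shape is a hypothetical helper written for this illustration, and the 3 x 224 x 224 image size is an assumed example rather than a value taken from this article.

```python
def unsqueeze_shape(shape, dim):
    """Return the shape a tensor would have after inserting a size-1 axis at `dim`.

    Hypothetical helper for illustration only. It mirrors the shape effect of
    tensor.unsqueeze(dim) for non-negative `dim` without needing PyTorch.
    """
    if not 0 <= dim <= len(shape):
        raise ValueError(f"dim must be between 0 and {len(shape)}")
    return shape[:dim] + (1,) + shape[dim:]


# Assumed example: a processed RGB image laid out as (channels, height, width).
single_image = (3, 224, 224)

# unsqueeze(0) prepends a batch axis: (batch, channels, height, width).
batched = unsqueeze_shape(single_image, 0)
print(batched)  # (1, 3, 224, 224)
```

The leading axis of size 1 is the batch dimension, which is why a single image still has to be wrapped before it is passed to the model.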

When the above steps are complete, the model is ready for image captioning, visual question answering, and chat-like conversations on any given image.
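
Because only one variant should be loaded at a time, it can help to keep the choice in a single place. The sketch below is one possible way to organize it, not part of LAVIS: the name and model_type pairs are copied from the variants listed in step 7, while the short keys, BLIP2_VARIANTS, and loader_kwargs are hypothetical names introduced here.

```python
# (name, model_type) pairs copied from the variants listed in step 7.
# The short keys on the left are hypothetical labels chosen for this sketch.
BLIP2_VARIANTS = {
    "opt2.7b": ("blip2_opt", "pretrain_opt2.7b"),
    "opt6.7b": ("blip2_opt", "pretrain_opt6.7b"),
    "coco_opt2.7b": ("blip2_opt", "caption_coco_opt2.7b"),
    "coco_opt6.7b": ("blip2_opt", "caption_coco_opt6.7b"),
    "flant5xl": ("blip2_t5", "pretrain_flant5xl"),
}


def loader_kwargs(variant):
    """Return keyword arguments for load_model_and_preprocess for one variant."""
    if variant not in BLIP2_VARIANTS:
        raise ValueError(
            f"unknown variant {variant!r}; choose from {sorted(BLIP2_VARIANTS)}"
        )
    name, model_type = BLIP2_VARIANTS[variant]
    return {"name": name, "model_type": model_type, "is_eval": True}


print(loader_kwargs("coco_opt2.7b"))

# In the notebook, the result would be unpacked into the loader call:
# model, vis_processors, _ = load_model_and_preprocess(
#     **loader_kwargs("coco_opt2.7b"), device=device
# )
```

Switching variants then means changing one string, which makes it harder to leave two large models loaded at once by accident.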

Caption Generation

BLIP-2 allows two types of caption generation: Single Caption generation and Multiple Caption generation. In this section, generate captions on any given image as described in the steps below.

  1. Single Caption: Generates one caption for an image. To view the single generated caption for the imported image, run the following code

     model.generate({"image": image})

    Below is what the code does:

    • model: Refers to the loaded BLIP-2 model that performs the image captioning task
    • {"image": image}: A dictionary in which the key image maps to the input image to caption. The image variable contains the processed image tensor.
  2. Multiple Caption: Generates multiple independent captions for an image

     model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)

    Below is what the code does:

    • use_nucleus_sampling=True: Applies nucleus sampling in the generation process. Nucleus sampling is also known as top-p sampling. Instead of sampling from a fixed number (k) of top options, as top-k sampling does, the model samples from the smallest set of most likely words whose combined probability reaches a threshold (p). This leads to more diverse text generation. To disable nucleus sampling, set the value to False
    • num_captions: Specifies the number of captions to generate for the input image. It's set to 3, but you can change the number to generate more or fewer captions
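
Nucleus sampling (often called top-p sampling) can be sketched in plain Python to show why it produces varied captions. This is an illustration of the general technique under stated assumptions, not the code LAVIS runs internally: nucleus_filter, sample_token, the top_p argument, and the toy probabilities are all made up for this example.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose probabilities sum to >= top_p.

    `probs` maps token -> probability. Returns a renormalized dict.
    Hypothetical helper, not part of LAVIS.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample_token(probs, top_p, rng=random):
    """Draw one token at random from the nucleus of `probs`."""
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-word distribution (made-up numbers).
next_word = {"dog": 0.5, "puppy": 0.3, "cat": 0.15, "car": 0.05}

print(sorted(nucleus_filter(next_word, top_p=0.9)))  # ['cat', 'dog', 'puppy']
print(sample_token(next_word, top_p=0.9))  # varies from run to run
```

The unlikely word is cut off, but the draw among the remaining words is random, so repeated calls can yield different captions for the same image.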

Visual Question Answering (VQA)

Zero-shot vision-to-language generation refers to the ability of a model to generate captions or descriptions for images it has never seen during training. This means the model can interpret the content of a new image. In this section, ask the model a question about the image as described below.

To ask a specific question, run the following code with a question prompt

    model.generate({
        "image": image,
        "prompt": "Question: YOUR_QUESTION_HERE? Answer:"})

In the above code, "image": image specifies the input image that you want to generate text about, and "prompt" carries the question. The model produces an answer based on the question asked. Example questions include: How many dogs are there in the picture?, Which city is this?, and Where is this monument located?

It's important to note that the model is fine-tuned on keywords like Question. This means that the generated answers are more precise when the prompt follows a template. For example, introduce a question with the Question: keyword to get more precise responses.
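
To apply the prompt template consistently, a small helper can wrap any question in the "Question: ... Answer:" form. This is a minimal sketch: format_vqa_prompt is a hypothetical function introduced here, and the whitespace and question-mark handling are assumptions about what a tidy prompt looks like.

```python
def format_vqa_prompt(question):
    """Wrap a question in the "Question: ... Answer:" template.

    Hypothetical helper: trims whitespace and makes sure the question
    ends with a question mark before applying the template.
    """
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    if not question.endswith("?"):
        question += "?"
    return f"Question: {question} Answer:"


prompt = format_vqa_prompt("  Which city is this ")
print(prompt)  # Question: Which city is this? Answer:

# In the notebook, the prompt would be passed alongside the image tensor:
# model.generate({"image": image, "prompt": prompt})
```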

Context Based Visual Question Answering for Chat-Like Conversations

The BLIP-2 model can answer more than one question about the same image by using earlier questions and answers as context. To generate answers based on a specific context, run the following code:

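context = [
    # Replace with your previous question and answer pairs
    ("PREVIOUS_QUESTION?", "PREVIOUS_ANSWER"),
]
question = "NEW_QUESTION_HERE?"
template = "Question: {} Answer: {}."
prompt = " ".join([template.format(context[i][0], context[i][1]) for i in range(len(context))]) + " Question: " + question + " Answer:"

model.generate({
    "image": image,
    "prompt": prompt
}, use_nucleus_sampling=False)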

Below is what the code does:

  • context: Initializes a list where each element is a tuple consisting of a previous question and its corresponding answer. This context applies as a history of the conversation for the model to refer to. You can add more previous questions and answers depending on the total number of answers returned by the model. Below is an example of the context field:

      context = [
        ("which city is this?", "new york city"),
        ("is it day or night?", "night"),
      ]

  • question: Sets up a new question to ask the model in the conversation. Replace NEW_QUESTION_HERE? with your actual question

  • template: Defines a template string that’s used to create the conversation history. The curly braces {} are placeholders filled by the previous question and answer pairs

  • prompt: Holds the full prompt passed to the model. It formats each context tuple with the template, joins the results, and appends the new question so that the model sees the whole conversation when it generates an answer

  • model.generate: Generates an answer based on the provided prompt and image. The use_nucleus_sampling parameter is set to False, which means nucleus sampling is not used for this generation process. Set use_nucleus_sampling to True to apply nucleus sampling to your output
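
The prompt construction described above can be checked without the model by printing the string it produces. The sketch below also shows one way to grow the history after each answer. build_prompt and ask are hypothetical helpers, and the lambda stands in for the real model call so that the example runs on its own. It assumes model.generate returns a list of strings.

```python
TEMPLATE = "Question: {} Answer: {}."


def build_prompt(context, question):
    """Join earlier (question, answer) pairs and the new question into one prompt."""
    history = [TEMPLATE.format(q, a) for q, a in context]
    return " ".join(history + [f"Question: {question} Answer:"])


def ask(context, question, answer_fn):
    """Ask one question, then append the (question, answer) pair to the history.

    `answer_fn` is a stand-in for the model call. In the notebook it could be
    lambda p: model.generate({"image": image, "prompt": p})[0]
    """
    answer = answer_fn(build_prompt(context, question))
    context.append((question, answer))
    return answer


context = [("which city is this?", "new york city")]
print(build_prompt(context, "is it day or night?"))
# Question: which city is this? Answer: new york city. Question: is it day or night? Answer:

# A fake answer function keeps this sketch runnable without the model.
ask(context, "is it day or night?", lambda prompt: "night")
print(context[-1])  # ('is it day or night?', 'night')
```

Appending each answer to context is what gives the chat-like behavior: every new prompt carries the whole conversation so far.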


In this article, you implemented AI image captioning with the BLIP-2 model on a Vultr Cloud GPU server. You prepared the server, installed libraries, and executed the model functions to generate output based on the input image. Additionally, you explored various use cases such as image captioning, VQA, and chat-like conversations based on context. For more information about the model, visit the BLIP-2 Hugging Face Space.

Next Steps

To leverage the power of your Vultr Cloud GPU server, visit the following resources to implement more solutions: