How to Use Vultr Serverless Inference in Python

Updated on 02 May, 2025

Vultr Serverless Inference allows you to run inference workloads against large language models such as Mixtral 8x7B, Mistral 7B, Meta Llama 2 70B, and more. With Vultr Serverless Inference, you do not manage any infrastructure, and you pay only for the input and output tokens you use.

This guide demonstrates the step-by-step process to start using Vultr Serverless Inference in Python.

Prerequisites

Before you begin, you must:

  * Deploy a Vultr Serverless Inference subscription and copy its API key.
  * Install Python 3 on your machine.

Set Up the Environment

  1. Create a new project directory and navigate to the project directory.

    console
    $ mkdir vultr-serverless-inference-python
    $ cd vultr-serverless-inference-python
    
  2. Create a new Python virtual environment.

    console
    $ python3 -m venv venv
    $ source venv/bin/activate
    
  3. Install the required Python packages.

    console
    (venv) $ pip install requests
    (venv) $ pip install openai
    
    Note
    You only need to install the openai package if you plan to use the OpenAI SDK approach described later in this guide; the direct API calls require only requests.

Inference via Direct API Calls

Vultr Serverless Inference provides a RESTful API to run inference workloads. You can use the requests package to make the API calls.
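
Before making chat requests, you can verify your API key and see which models are available. The following is a minimal sketch that queries the models endpoint referenced in the code comments below; it assumes the endpoint accepts the same Bearer token as the chat completions endpoint and returns JSON, so it simply prints the raw response.

    python
    import os
    import requests

    api_key = os.environ.get('VULTR_SERVERLESS_INFERENCE_API_KEY')

    # Assumption: the models endpoint uses the same Bearer token as the
    # chat completions endpoint and responds with JSON.
    response = requests.get(
        'https://api.vultrinference.com/v1/chat/models',
        headers={'Authorization': f'Bearer {api_key}'},
    )
    response.raise_for_status()
    print(response.json())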

  1. Create a new Python file named inference.py.

    console
    (venv) $ nano inference.py
    
  2. Add the following code to inference.py.

    python
    import os
    import requests
    
    api_key = os.environ.get('VULTR_SERVERLESS_INFERENCE_API_KEY')
    
    # Set the model
    # List of available models: https://api.vultrinference.com/v1/chat/models
    model = ''
    messages = [
        {
            'role': 'user',
            'content': 'What is the capital of India?'
        }
    ]
    
    headers = {
        'Authorization': f'Bearer {api_key}',
    }
    
    data = {
        'model': model,
        'messages': messages
    }
    
    response = requests.post('https://api.vultrinference.com/v1/chat/completions', headers=headers, json=data)
    llm_response = response.json()['choices'][0]['message']['content']
    
    print(llm_response)
    
  3. Run the Python script.

    console
    (venv) $ export VULTR_SERVERLESS_INFERENCE_API_KEY=<your_api_key>
    (venv) $ python inference.py
    

    Here, you make a POST request to https://api.vultrinference.com/v1/chat/completions with the required headers and request body. The messages list holds the conversation messages to generate a completion for; each message's role can be system, user, or assistant, and content is the message text.

    To maintain conversation context, you can add the previous messages to the messages list. You can also use the stream parameter to get real-time completions. For more information, refer to the Vultr Serverless Inference API documentation.
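
    For example, the following sketch carries a conversation forward by replaying the earlier turns, including a system message and the assistant's previous reply, before asking a follow-up question. The assistant reply and follow-up wording are illustrative values, not output from the API.

    python
    import os
    import requests

    api_key = os.environ.get('VULTR_SERVERLESS_INFERENCE_API_KEY')
    headers = {'Authorization': f'Bearer {api_key}'}

    # Set the model
    # List of available models: https://api.vultrinference.com/v1/chat/models
    model = ''

    # Replay the earlier turns so the model keeps the conversation context.
    messages = [
        {'role': 'system', 'content': 'You are a concise assistant.'},
        {'role': 'user', 'content': 'What is the capital of India?'},
        {'role': 'assistant', 'content': 'The capital of India is New Delhi.'},  # illustrative earlier reply
        {'role': 'user', 'content': 'What is its population?'}
    ]

    response = requests.post('https://api.vultrinference.com/v1/chat/completions', headers=headers, json={'model': model, 'messages': messages})
    print(response.json()['choices'][0]['message']['content'])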

Inference via OpenAI SDK

If you are using the OpenAI SDK for Vultr Serverless Inference, you can use the openai package to make the API calls.

  1. Create a new Python file named inference_openai.py.

    console
    (venv) $ nano inference_openai.py
    
  2. Add the following code to inference_openai.py.

    python
    import os
    import openai
    
    client = openai.OpenAI(
        api_key=os.environ.get('VULTR_SERVERLESS_INFERENCE_API_KEY'),
        base_url="https://api.vultrinference.com/v1",
    )
    
    # Set the model
    # List of available models: https://api.vultrinference.com/v1/chat/models
    model = ''
    messages = [
        {
            'role': 'user',
            'content': 'What is the capital of India?'
        }
    ]
    
    response = client.chat.completions.create(model=model, messages=messages)
    llm_response = response.choices[0].message.content
    
    print(llm_response)
    
  3. Run the Python script.

    console
    (venv) $ export VULTR_SERVERLESS_INFERENCE_API_KEY=<your_api_key>
    (venv) $ python inference_openai.py
    

    Here, the openai package sends the same chat completion request through Vultr Serverless Inference by pointing base_url at the OpenAI-compatible endpoint. The messages list works exactly as in the direct API example: each message's role can be system, user, or assistant, and content is the message text.

    To maintain conversation context, you can add the previous messages to the messages list. You can also use the stream parameter to get real-time completions. For more information, refer to the Vultr Serverless Inference API documentation.
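
    As an example of the stream parameter with the SDK, the sketch below prints tokens as they arrive. It assumes the endpoint returns OpenAI-style streaming chunks, which is what the openai package expects when stream=True is set.

    python
    import os
    import openai

    client = openai.OpenAI(
        api_key=os.environ.get('VULTR_SERVERLESS_INFERENCE_API_KEY'),
        base_url="https://api.vultrinference.com/v1",
    )

    # Set the model
    # List of available models: https://api.vultrinference.com/v1/chat/models
    model = ''
    messages = [
        {'role': 'user', 'content': 'What is the capital of India?'}
    ]

    # stream=True returns an iterator of chunks instead of a single response.
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)

    for chunk in stream:
        # Each chunk carries a small delta of the generated text.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end='', flush=True)
    print()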

Conclusion

In this guide, you learned how to use Vultr Serverless Inference in Python to run inference workloads for large language models. You also learned how to use the requests package and the OpenAI SDK to make API calls to Vultr Serverless Inference. You can now integrate Vultr Serverless Inference into your Python applications to generate completions from large language models.
