How to Use Retrieval-Augmented Generation (RAG) with Vultr Serverless Inference

Updated on 11 March, 2025

Retrieval-Augmented Generation (RAG) enhances AI models by combining real-time information retrieval with generative capabilities. When using Vultr Serverless Inference, RAG enables chat models to incorporate external, domain-specific knowledge from vector store collections. This improves the contextual accuracy of responses, reduces hallucinations, and ensures that the output is based on the most relevant and up-to-date information.

Note
The models that support RAG-based chat completion on Vultr Serverless Inference are: deepseek-r1-distill-qwen-32b, qwen2.5-32b-instruct, qwen2.5-coder-32b-instruct, llama-3.1-70b-instruct-fp8, llama-3.3-70b-instruct-fp8, deepseek-r1-distill-llama-70b, and deepseek-r1. mistral-7B-v0.3 and mistral-nemo-instruct-2407 are not compatible with RAG.

Follow this guide to use RAG-based chat completion on Vultr Serverless Inference using the Vultr API.

  1. Send a GET request to the List Collections endpoint and note the target collection's ID.

    console
    $ curl "https://api.vultrinference.com/v1/vector_store" \
        -X GET \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"
    
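If you prefer to work with the response programmatically, the sketch below parses a List Collections response and extracts the collection IDs. The field names (`vector_store`, `id`, `name`) and the sample payload are assumptions for illustration; check the actual response returned by your account.

```python
import json

# Hypothetical List Collections response body; the field names used
# here ("vector_store", "id", "name") are assumptions, not the
# documented schema. Substitute the real response from the API.
sample_response = """
{
    "vector_store": [
        {"id": "docs-collection", "name": "Product Docs"},
        {"id": "support-faq", "name": "Support FAQ"}
    ]
}
"""

# Extract every collection ID so you can pick the target collection.
collections = json.loads(sample_response)["vector_store"]
collection_ids = [c["id"] for c in collections]
print(collection_ids)
```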
  2. Send a GET request to the List Models endpoint and note the preferred inference model's ID.

    console
    $ curl "https://api.vultrinference.com/v1/models" \
        -X GET \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"
    
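Because only some models support RAG (see the note above), it can help to filter the List Models response down to RAG-capable entries before choosing one. The sketch below hard-codes the supported-model list from this guide; the response field name (`id`) and the sample entries are assumptions for illustration.

```python
# Models that support RAG-based chat completion, per the note in this guide.
RAG_MODELS = {
    "deepseek-r1-distill-qwen-32b",
    "qwen2.5-32b-instruct",
    "qwen2.5-coder-32b-instruct",
    "llama-3.1-70b-instruct-fp8",
    "llama-3.3-70b-instruct-fp8",
    "deepseek-r1-distill-llama-70b",
    "deepseek-r1",
}

# Hypothetical entries from a List Models response; the "id" field
# name is an assumption. Replace with the parsed API response.
available = [
    {"id": "llama-3.3-70b-instruct-fp8"},
    {"id": "mistral-nemo-instruct-2407"},  # not RAG-compatible
]

# Keep only the models you can use with the RAG endpoint.
rag_ready = [m["id"] for m in available if m["id"] in RAG_MODELS]
print(rag_ready)
```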
  3. Send a POST request to the RAG Chat Completion endpoint to generate a response grounded in the documents stored in your vector store collection.

    console
    $ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
        -X POST \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
        -H "Content-Type: application/json" \
        --data '{
            "collection": "{collection-id}",
            "model": "{model-id}",
            "messages": [
                {
                    "role": "user",
                    "content": "{user-input}"
                }
            ],
            "max_tokens": 512
        }'
    
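The curl request above can also be assembled in code. The sketch below builds the same JSON body and then pulls the assistant's reply out of a response; the request fields mirror the curl example, but the response shape (an OpenAI-style `choices` array) is an assumption, so verify it against an actual response.

```python
import json

def build_rag_request(collection_id, model_id, user_input, max_tokens=512):
    """Build the JSON body for the RAG Chat Completion request,
    mirroring the fields in the curl example above."""
    return {
        "collection": collection_id,
        "model": model_id,
        "messages": [{"role": "user", "content": user_input}],
        "max_tokens": max_tokens,
    }

body = build_rag_request(
    "docs-collection",                 # hypothetical collection ID
    "llama-3.3-70b-instruct-fp8",      # a RAG-capable model from the note
    "How do I rotate my API key?",
)
payload = json.dumps(body)  # send this as the POST request body

# Hypothetical response; the OpenAI-style "choices" structure is an
# assumption about the endpoint's output format.
sample_response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "You can rotate the key from the portal."}}
    ]
}
answer = sample_response["choices"][0]["message"]["content"]
print(answer)
```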

    Visit the RAG Chat Completion endpoint documentation to view additional request attributes that give you greater control when interacting with the preferred inference model.

