How to Use Retrieval-Augmented Generation (RAG) with Vultr Serverless Inference

Updated on March 11, 2025

Retrieval-Augmented Generation (RAG) enhances AI models by combining real-time information retrieval with generative capabilities. When using Vultr Serverless Inference, RAG enables chat models to incorporate external, domain-specific knowledge from vector store collections. This improves the contextual accuracy of responses, reduces hallucinations, and ensures that the output is based on the most relevant and up-to-date information.
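The retrieval-then-generation pattern described above can be sketched in a few lines of Python. The keyword-overlap retriever below is a toy stand-in for a real vector store, and the prompt-building helper is illustrative only; Vultr Serverless Inference performs the retrieval for you when you call the RAG endpoint.

```python
# Minimal sketch of the RAG pattern: retrieve relevant context first,
# then ground the model's prompt in that context. The keyword-overlap
# scoring is a toy substitute for vector similarity search.

def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the query."""
    words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved context so the model answers from it."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Because the model sees the retrieved context directly, its answer is constrained to the supplied domain knowledge rather than its training data alone, which is what reduces hallucinations.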

Note
The following models support RAG-based chat completion on Vultr Serverless Inference: deepseek-r1-distill-qwen-32b, qwen2.5-32b-instruct, qwen2.5-coder-32b-instruct, llama-3.1-70b-instruct-fp8, llama-3.3-70b-instruct-fp8, deepseek-r1-distill-llama-70b, and deepseek-r1. The mistral-7B-v0.3 and mistral-nemo-instruct-2407 models are not compatible with RAG.

Follow this guide to use RAG-based chat completion on Vultr Serverless Inference using the Vultr API.

  1. Send a GET request to the List Collections endpoint and note the target collection's ID.

    console
    $ curl "https://api.vultrinference.com/v1/vector_store" \
        -X GET \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"
    
  2. Send a GET request to the List Models endpoint and note the preferred inference model's ID.

    console
    $ curl "https://api.vultrinference.com/v1/models" \
        -X GET \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"
    
  3. Send a POST request to the RAG Chat Completion endpoint to generate a response grounded in the target collection.

    console
    $ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
        -X POST \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
        -H "Content-Type: application/json" \
        --data '{
            "collection": "{collection-id}",
            "model": "{model-id}",
            "messages": [
                {
                    "role": "user",
                    "content": "{user-input}"
                }
            ],
            "max_tokens": 512
        }'
    

    Refer to the RAG Chat Completion endpoint documentation for additional request attributes that give you greater control when interacting with the preferred inference model.
