---
title: RAG Chat Collection
url: https://docs.vultr.com/products/compute/serverless-inference/vector-store/rag-chat-collection
description: A collection of RAG (Retrieval-Augmented Generation) chat models that enhance AI responses with relevant information from your data sources.
publish_date: 2025-03-11T19:31:09.985625Z
last_updated: 2026-05-26T19:10:53.780583Z
---

# How to Use Retrieval-Augmented Generation (RAG) with Vultr Serverless Inference

Retrieval-Augmented Generation (RAG) enhances AI models by combining information retrieval at request time with generative capabilities. With Vultr Serverless Inference, RAG lets chat models draw on external, domain-specific knowledge stored in your vector store collections. This improves the contextual accuracy of responses, reduces hallucinations, and grounds the output in the most relevant information in your collection.

The RAG endpoint also supports tool calling, which allows models to invoke functions you define during RAG-based interactions. The model can then act on the knowledge it retrieves, for example by performing calculations, fetching live data, or calling APIs based on the retrieved context.

> [!NOTE]
> The models that support RAG-based chat completion on Vultr Serverless Inference are: `deepseek-r1-distill-qwen-32b`, `qwen2.5-32b-instruct`, `qwen2.5-coder-32b-instruct`, `llama-3.1-70b-instruct-fp8`, `llama-3.3-70b-instruct-fp8`, `deepseek-r1-distill-llama-70b`, and `deepseek-r1`.
>
> `mistral-7B-v0.3` and `mistral-nemo-instruct-2407` are not compatible with RAG.

Follow this guide to use RAG-based chat completion on Vultr Serverless Inference using the Vultr API.

## Generate RAG-Based Chat Completions

1. Send a `GET` request to the [**List Collections** endpoint](https://api.vultrinference.com/#tag/Vector-Store/operation/list-vector-stores) and note the target collection's ID.

    ```console
    $ curl "https://api.vultrinference.com/v1/vector_store" \
        -X GET \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"
    ```

1. Send a `GET` request to the [**List Models** endpoint](https://api.vultrinference.com/#tag/Models/operation/list-models) and note the preferred inference model's ID.

    ```console
    $ curl "https://api.vultrinference.com/v1/models" \
        -X GET \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"
    ```

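1. Send a `POST` request to the [**RAG Chat Completion** endpoint](https://api.vultrinference.com/#tag/Chat/operation/rag-chat-completion) to generate a response grounded in the collection. Replace `{collection-id}` and `{model-id}` with the IDs you noted in the previous steps, and `{user-input}` with your prompt.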

    ```console
    $ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
        -X POST \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
        -H "Content-Type: application/json" \
        --data '{
            "collection": "{collection-id}",
            "model": "{model-id}",
            "messages": [
                {
                    "role": "user",
                    "content": "{user-input}"
                }
            ],
            "max_tokens": 512
        }'
    ```

    Refer to the [**RAG Chat Completion** endpoint](https://api.vultrinference.com/#tag/Chat/operation/rag-chat-completion) reference for additional attributes that give you greater control over how the model generates its response.
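
The three requests above can also be scripted. The following is a minimal Python sketch that uses only the standard library and the same URLs, headers, and body fields as the `curl` commands. The helper names (`build_rag_payload`, `build_request`, `call_api`, `main`) are hypothetical, and the sketch assumes the API returns JSON and that `INFERENCE_API_KEY` is set in your environment. It prints the raw responses instead of parsing them, because the response layout isn't described in this guide.

```python
import json
import os
import urllib.request

API_BASE = "https://api.vultrinference.com/v1"


def build_rag_payload(collection_id, model_id, user_input, max_tokens=512):
    """Build the JSON body used in the RAG chat completion request."""
    return {
        "collection": collection_id,
        "model": model_id,
        "messages": [{"role": "user", "content": user_input}],
        "max_tokens": max_tokens,
    }


def build_request(path, api_key, payload=None):
    """Create a GET request, or a POST request when a payload is given."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if payload is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    method = "GET" if payload is None else "POST"
    return urllib.request.Request(
        f"{API_BASE}{path}", data=data, headers=headers, method=method
    )


def call_api(path, api_key, payload=None):
    """Send the request and decode the response body (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(path, api_key, payload)) as response:
        return json.load(response)


def main():
    api_key = os.environ["INFERENCE_API_KEY"]
    print(call_api("/vector_store", api_key))  # Step 1: note a collection ID
    print(call_api("/models", api_key))        # Step 2: note a model ID
    # Step 3: replace the placeholders with the IDs noted above and your prompt.
    payload = build_rag_payload("{collection-id}", "{model-id}", "{user-input}")
    print(call_api("/chat/completions/RAG", api_key, payload))
```

Call `main()` after exporting `INFERENCE_API_KEY`. Keeping the payload and request construction in separate functions lets you inspect the request before sending it.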

## Use Tool Calling with the RAG Endpoint

> [!NOTE]
> Tool calling is currently supported only on the `kimi-k2-instruct` model.

1. Define your tools using the `tools` parameter in the RAG chat request body.
1. Set `tool_choice` to `"auto"`, `"required"`, or `"none"` to control when the model triggers a tool call.
1. Send a `POST` request to the [**RAG Chat Completion** endpoint](https://api.vultrinference.com/#tag/Chat/operation/rag-chat-completion) to combine RAG retrieval and tool invocation.

    ```console
    $ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
        -X POST \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
        -H "Content-Type: application/json" \
        --data '{
            "collection": "{collection-id}",
            "model": "{model-id}",
            "messages": [
                { "role": "user", "content": "Ask a question that requires external data retrieval." }
            ],
            "tools": [
                {
                    "type": "function",
                    "function": {
                        "name": "function_name",
                        "description": "Describe the purpose of the function.",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "parameter_name": {
                                    "type": "string",
                                    "description": "Describe the expected input parameter."
                                }
                            },
                            "required": ["parameter_name"]
                        }
                    }
                }
            ],
            "tool_choice": "auto",
            "max_tokens": 512
        }'
    ```

    The model responds with a structured tool call, for example:

    ```json
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "function_name",
            "arguments": "{\"parameter_name\": \"example_value\"}"
          }
        }
      ]
    }
    ```

    Execute this function locally or through an API, then send its output back to the RAG endpoint in a follow-up request so the model can generate a complete, context-aware response.

    For detailed implementation steps and usage examples, refer to the [Tool Calling with Vultr Serverless Inference Guide](https://docs.vultr.com/how-to-use-tool-calling-with-vultr-serverless-inference).
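
As an illustration of the tool calling flow above, the following Python sketch builds one `tools` entry and executes the tool calls found in an assistant message shaped like the example response. The helper names (`build_tool`, `run_tool_calls`) and the local `function_name` implementation are hypothetical. Where the assistant message sits inside the full API response is not covered in this guide, so the sketch takes the message object directly. It also stops before the follow-up request, which the linked guide describes.

```python
import json


def build_tool(name, description, properties, required):
    """Return one entry for the `tools` array, in the shape used in the request."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def run_tool_calls(assistant_message, registry):
    """Run each requested tool call with the matching local function.

    `registry` maps function names to Python callables.
    Returns a list of (function name, result) pairs.
    """
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        name = function["name"]
        if name not in registry:
            raise KeyError(f"No local function registered for {name!r}")
        # `arguments` is a JSON-encoded string, so decode it before calling.
        arguments = json.loads(function["arguments"])
        results.append((name, registry[name](**arguments)))
    return results


# Hypothetical local implementation standing in for `function_name`.
def function_name(parameter_name):
    return f"received {parameter_name}"


# Assistant message copied from the example response above.
assistant_message = {
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "function_name",
                "arguments": '{"parameter_name": "example_value"}',
            },
        }
    ],
}

print(run_tool_calls(assistant_message, {"function_name": function_name}))
```

Looking up functions in an explicit registry means the model can only trigger functions you have chosen to expose, and an unknown name raises an error instead of being ignored.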