Retrieval-Augmented Generation (RAG) enhances AI models by combining real-time information retrieval with generative capabilities. When using Vultr Serverless Inference, RAG enables chat models to incorporate external, domain-specific knowledge from vector store collections. This improves the contextual accuracy of responses, reduces hallucinations, and ensures that the output is based on the most relevant and up-to-date information.
RAG-based chat completion is compatible with the following models: deepseek-r1-distill-qwen-32b, qwen2.5-32b-instruct, qwen2.5-coder-32b-instruct, llama-3.1-70b-instruct-fp8, llama-3.3-70b-instruct-fp8, deepseek-r1-distill-llama-70b, and deepseek-r1. mistral-7B-v0.3 and mistral-nemo-instruct-2407 are not compatible with RAG.
Follow this guide to perform RAG-based chat completions on Vultr Serverless Inference using the Vultr API.
Send a GET request to the List Collections endpoint and note the target collection's ID.
$ curl "https://api.vultrinference.com/v1/vector_store" \
-X GET \
-H "Authorization: Bearer ${INFERENCE_API_KEY}"
Send a GET request to the List Models endpoint and note the preferred inference model's ID.
$ curl "https://api.vultrinference.com/v1/models" \
-X GET \
-H "Authorization: Bearer ${INFERENCE_API_KEY}"
Send a POST request to the RAG Chat Completion endpoint to generate responses using RAG.
$ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
-X POST \
-H "Authorization: Bearer ${INFERENCE_API_KEY}" \
-H "Content-Type: application/json" \
--data '{
"collection": "{collection-id}",
"model": "{model-id}",
"messages": [
{
"role": "user",
"content": "{user-input}"
}
],
"max_tokens": 512
}'
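The endpoint responds with JSON. If the response follows the OpenAI-compatible chat completion shape with a choices array (an assumption to verify against the endpoint reference), you can extract the assistant's reply by piping the request through jq, replacing the placeholder values with your collection ID, model ID, and prompt.
$ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
    -X POST \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
    -H "Content-Type: application/json" \
    --data '{
        "collection": "{collection-id}",
        "model": "{model-id}",
        "messages": [{"role": "user", "content": "{user-input}"}],
        "max_tokens": 512
    }' | jq -r '.choices[0].message.content'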
Refer to the RAG Chat Completion endpoint documentation to view additional attributes you can apply for greater control when interacting with the preferred inference model.
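For example, sampling controls such as temperature and top_p are standard in OpenAI-compatible chat APIs. The request below is a sketch that assumes this endpoint accepts them; confirm each attribute in the endpoint reference before relying on it.
$ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
    -X POST \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
    -H "Content-Type: application/json" \
    --data '{
        "collection": "{collection-id}",
        "model": "{model-id}",
        "messages": [{"role": "user", "content": "{user-input}"}],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9
    }'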