How to Rank Documents with VultronRetriever on Vultr Serverless Inference

VultronRetriever is a family of visual document retrieval models available on Vultr Serverless Inference. Given a question and a set of document page images, such as PDF pages, scans, slides, or screenshots, the models score how relevant each page is to the question. Unlike text-based retrieval, they read each page visually, including its tables, charts, figures, and layout, with no optical character recognition (OCR) or text-extraction step. The page image itself is the document.
This guide explains how to use VultronRetriever on Vultr Serverless Inference to rank documents and pages by relevance. It covers authenticating with the application programming interface (API), ranking plain text documents, ranking document page images, and building a complete multimodal retrieval-augmented generation (RAG) pipeline that answers questions against a PDF. It also covers choosing a model tier, handling the API's limits, and scaling to large document sets.
VultronRetriever comes in three sizes:
| Model ID | Size | Use it when |
|---|---|---|
| vultr/VultronRetrieverFlash-Qwen3.5-0.8B | 0.8B | Default choice. Fast and cost-efficient, with strong quality. |
| vultr/VultronRetrieverCore-Qwen3.5-4.5B | 4.5B | A middle ground for denser documents. |
| vultr/VultronRetrieverPrime-Qwen3.5-8B | 8B | Maximum quality on hard or cluttered documents. |
Prerequisites
Before you begin, ensure you have:
- A Vultr account with a Serverless Inference subscription and its API key.
- An HTTP client such as curl to send requests to the API.
- Python 3.11 or later to run the multimodal RAG pipeline.
Authenticate with the API
Vultr Serverless Inference exposes an OpenAI-compatible API. All requests use the following base URL:

    https://api.vultrinference.com/v1

Export your Serverless Inference key as an environment variable so the commands throughout this guide can reference it in the bearer token header. Replace YOUR_API_KEY with your Serverless Inference key.

    $ export INFERENCE_API_KEY="YOUR_API_KEY"

List the models your key can access to confirm access and view the available VultronRetriever IDs.

    $ curl https://api.vultrinference.com/v1/models \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"

The three VultronRetriever models appear in the output, each with a ReRank feature.
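The same check can be scripted. The following is a minimal sketch, assuming the endpoint returns the OpenAI-style list shape (a top-level data array of objects with an id field); the helper names retriever_ids and fetch_models are hypothetical, and only the Python standard library is used.

```python
import json
import urllib.request

BASE_URL = "https://api.vultrinference.com/v1"


def retriever_ids(models_payload):
    """Pick VultronRetriever model IDs out of a /v1/models response body.

    Assumption: the body follows the OpenAI-style shape {"data": [{"id": "..."}]}.
    """
    return [
        m["id"]
        for m in models_payload.get("data", [])
        if "VultronRetriever" in m.get("id", "")
    ]


def fetch_models(api_key):
    """GET /v1/models with the bearer token header used throughout this guide."""
    req = urllib.request.Request(
        f"{BASE_URL}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


# Usage (makes a network call, so it is left commented out):
#   import os
#   print(retriever_ids(fetch_models(os.environ["INFERENCE_API_KEY"])))
```

If the list comes back empty, the key may not have access to the models, or the response shape may differ from the assumption above.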
Rank Text Documents
The /v1/rerank endpoint accepts a query and a list of documents, then returns them ordered by relevance. Start with plain text to confirm your setup before moving on to page images.
Send a rerank request with a query and four short documents.

    $ curl -X POST https://api.vultrinference.com/v1/rerank \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
        -H "Content-Type: application/json" \
        -d '{
          "model": "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
          "query": "How do transformer models handle long-range dependencies?",
          "documents": [
            "Transformers use self-attention so each token can attend to all other tokens.",
            "The weather forecast for tomorrow shows partial clouds with a high of 72 degrees.",
            "Unlike RNNs, transformers process all positions simultaneously through attention.",
            "The new restaurant downtown serves excellent pasta dishes."
          ],
          "top_n": 3
        }'

The response lists each returned document with a relevance_score, sorted from most to least relevant. The top_n field caps the response at the three highest-scoring documents:

    {
      "id": "score-ac2ebcd0dfa55f34",
      "model": "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
      "usage": { "prompt_tokens": 95, "total_tokens": 95 },
      "results": [
        { "index": 0, "relevance_score": 4.84, "document": { "text": "Transformers use self-attention..." } },
        { "index": 2, "relevance_score": 4.68, "document": { "text": "Unlike RNNs, transformers..." } },
        { "index": 1, "relevance_score": 2.71, "document": { "text": "The weather forecast..." } }
      ]
    }

The two transformer sentences score well above the unrelated weather sentence, so relevant and irrelevant content separate clearly. Scores are relative to a single query, so compare them only for the same query, never across different queries.
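To use the ranking in a program, each result has to be mapped back to the document it refers to. The sketch below illustrates one way to do that, assuming the index field is the position of the document in the request's documents list, which matches the sample response above. The function names are hypothetical.

```python
def build_rerank_request(query, documents, model, top_n=None):
    """Assemble the JSON body for POST /v1/rerank from the fields shown above."""
    body = {"model": model, "query": query, "documents": documents}
    if top_n is not None:
        body["top_n"] = top_n
    return body


def ranked_documents(documents, response):
    """Map each result's index back to the caller's original document.

    Returns (relevance_score, document) pairs, highest score first.
    Assumption: result["index"] is the position in the request's document list.
    """
    results = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return [(r["relevance_score"], documents[r["index"]]) for r in results]
```

Looking documents up by index avoids relying on the echoed document text, which is abbreviated in the sample response.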
Rank Document Page Images
Ranking page images is the primary use case for these models. Pass each page as an image using OpenAI-style content parts. Each image document is an object with a content array that holds an image_url part, and the URL is a base64 data URI of the page.
Encode a page image as base64 and store it in a variable.

    $ IMG=$(base64 < page.jpg | tr -d '\n')

Send a rerank request that mixes an image document with a plain text document.

    $ curl -X POST https://api.vultrinference.com/v1/rerank \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
        -H "Content-Type: application/json" \
        -d @- <<EOF
    {
      "model": "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
      "query": "What was Q3 revenue?",
      "documents": [
        {"content": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,$IMG"}}]},
        "You can mix plain text documents into the same request."
      ]
    }
    EOF

The image document must use the {"content": [...]} shape shown above. A relevant page scores clearly higher than an unrelated one, in the same range as relevant text.
If you pass a bare data:image/jpeg;base64,... string as a document, the API does not return an error. It treats the base64 characters as text and returns a low, meaningless score, around the same value as unrelated text. Always wrap page images in the {"content": [...]} object.
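Because the API accepts a bare data URI without complaint, a client-side guard can catch the mistake before the request is sent. The following is a minimal sketch; image_document and reject_bare_data_uris are hypothetical helper names, not part of any Vultr library.

```python
import base64


def image_document(image_bytes, mime="image/jpeg"):
    """Wrap raw image bytes in the content-parts shape used for image documents."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}
        ]
    }


def reject_bare_data_uris(documents):
    """Raise if any document is a bare data URI string.

    The API would accept such a string and score the base64 characters as text.
    """
    for i, doc in enumerate(documents):
        if isinstance(doc, str) and doc.startswith("data:image/"):
            raise ValueError(
                f"document {i} is a bare data URI; wrap it with image_document()"
            )
    return documents
```

Calling reject_bare_data_uris on the documents list just before posting turns a silent low score into an immediate error.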
Build a Multimodal RAG Pipeline
The following script combines the pieces into one file. It renders each PDF page to a compressed image, ranks the pages with VultronRetriever, and passes the highest-ranked page to a vision chat model on the same API for the final answer.
Update the package index.

    $ sudo apt update

Install the python3-venv package, which ships separately from Python on Debian and Ubuntu.

    $ sudo apt install python3-venv

Create a virtual environment to isolate the dependencies from system packages.

    $ python3 -m venv venv

Activate the virtual environment.

    $ source venv/bin/activate

Install the dependencies.

    $ pip install requests pymupdf pillow

Create the script file.

    $ nano vultron_quickstart.py

Add the following content to the file.

    """Ask questions against any PDF with VultronRetriever on Vultr Serverless Inference.

    Usage: python vultron_quickstart.py "What was Q3 revenue?" report.pdf

    Set your API key in the INFERENCE_API_KEY environment variable.
    """
    import base64
    import io
    import os
    import sys

    import fitz  # PyMuPDF
    import requests
    from PIL import Image

    API_KEY = os.environ.get("INFERENCE_API_KEY", "YOUR_API_KEY")
    BASE_URL = "https://api.vultrinference.com/v1"
    RETRIEVER = "vultr/VultronRetrieverFlash-Qwen3.5-0.8B"  # or Core-Qwen3.5-4.5B / Prime-Qwen3.5-8B
    CHAT_MODEL = "Qwen/Qwen3.6-27B"  # vision-capable; XiaomiMiMo/MiMo-V2.5-Pro and moonshotai/Kimi-K2.6 also work

    # The API rejects request bodies over ~1 MB, so pages are JPEG-compressed and
    # scored in batches. Scores for the same query are comparable across batches.
    MAX_BATCH_BYTES = 900_000
    MAX_PIXELS = 1_300_000  # the model's visual-token budget; higher resolution is wasted
    JPEG_QUALITY = 80


    def pdf_to_page_uris(pdf_path, dpi=110):
        """Render each PDF page to a compressed JPEG data URI."""
        uris = []
        with fitz.open(pdf_path) as doc:
            for page in doc:
                pix = page.get_pixmap(dpi=dpi)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                if img.width * img.height > MAX_PIXELS:
                    scale = (MAX_PIXELS / (img.width * img.height)) ** 0.5
                    img = img.resize((int(img.width * scale), int(img.height * scale)))
                buf = io.BytesIO()
                img.save(buf, format="JPEG", quality=JPEG_QUALITY)
                b64 = base64.b64encode(buf.getvalue()).decode()
                uris.append("data:image/jpeg;base64," + b64)
        return uris


    def image_doc(uri):
        """Wrap a data URI in the document shape /v1/rerank expects for images."""
        return {"content": [{"type": "image_url", "image_url": {"url": uri}}]}


    def rank_pages(query, page_uris, model=RETRIEVER):
        """Score every page against the query.

        Returns [(page_index, score)], best first, across as many API calls as
        the 1 MB body limit requires."""
        headers = {"Authorization": f"Bearer {API_KEY}"}
        scores = {}
        batch, batch_idx, batch_bytes = [], [], 0
        todo = list(enumerate(page_uris)) + [(None, None)]  # sentinel flushes the last batch
        for idx, uri in todo:
            if uri is not None and batch_bytes + len(uri) < MAX_BATCH_BYTES:
                batch.append(image_doc(uri))
                batch_idx.append(idx)
                batch_bytes += len(uri)
                continue
            if batch:
                resp = requests.post(f"{BASE_URL}/rerank", headers=headers,
                                     json={"model": model, "query": query, "documents": batch})
                resp.raise_for_status()
                for r in resp.json()["results"]:
                    scores[batch_idx[r["index"]]] = r["relevance_score"]
            batch, batch_idx, batch_bytes = [], [], 0
            if uri is not None:
                batch.append(image_doc(uri))
                batch_idx.append(idx)
                batch_bytes = len(uri)
        return sorted(scores.items(), key=lambda kv: -kv[1])


    def ask(query, pdf_path, top_pages=1):
        """Rank the PDF's pages, then hand the best one(s) to a vision chat model."""
        pages = pdf_to_page_uris(pdf_path)
        ranking = rank_pages(query, pages)
        print(f"page ranking: {[(i + 1, round(s, 2)) for i, s in ranking]}")
        content = [{"type": "text",
                    "text": f"Answer using only the attached document page(s): {query}"}]
        for idx, _ in ranking[:top_pages]:
            content.append({"type": "image_url", "image_url": {"url": pages[idx]}})
        resp = requests.post(f"{BASE_URL}/chat/completions",
                             headers={"Authorization": f"Bearer {API_KEY}"},
                             json={"model": CHAT_MODEL, "max_tokens": 1000,
                                   "messages": [{"role": "user", "content": content}]})
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]


    if __name__ == "__main__":
        query, pdf_path = sys.argv[1], sys.argv[2]
        print(ask(query, pdf_path))

Save and close the file.
Set your API key.

    $ export INFERENCE_API_KEY=YOUR_API_KEY

Run the script against any PDF. Replace "What was Q3 revenue?" with your question and report.pdf with the path to your PDF.

    $ python vultron_quickstart.py "What was Q3 revenue?" report.pdf

The script prints the page ranking, then the model's answer drawn only from the top-ranked page:

    page ranking: [(2, 4.59), (1, 2.57), (3, 2.33)]
    Based on the attached document, the Q3 revenue was $5.2 million.

Choose a Model Tier
All three tiers accept identical requests, so switching between them is a one-line change to the model field. Start with Flash (0.8B). It responds in a couple of seconds for dozens of pages and ranks accurately on typical documents. Move up to Core (4.5B) or Prime (8B) when your documents are dense or cluttered, or when the answer depends on fine-grained details inside table cells.
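The one-line change can be kept in one place with a small lookup. This is an illustrative sketch only: the tier keys ("flash", "core", "prime") and the with_tier helper are hypothetical names, while the model IDs come from the table at the top of this guide.

```python
MODEL_TIERS = {
    "flash": "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
    "core": "vultr/VultronRetrieverCore-Qwen3.5-4.5B",
    "prime": "vultr/VultronRetrieverPrime-Qwen3.5-8B",
}


def with_tier(request_body, tier):
    """Return a copy of a rerank request body that targets another tier.

    Only the model field changes; query and documents are left untouched.
    """
    return {**request_body, "model": MODEL_TIERS[tier]}
```

Re-sending the same body through with_tier(body, "prime") is a quick way to compare tiers on a page set that Flash ranks poorly.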
Handle API Limits and Errors
Keep the following limits in mind when sending requests:
| Limit | Value | What to do |
|---|---|---|
| Request body size | ~1 MB (HTTP 413 above it) | JPEG-compress pages (quality ~80) and split them across batched calls. The Python example handles this automatically. |
| Image resolution | ~1.3 megapixels per page | Downscale before sending. Higher resolution adds payload, not quality. |
| Score comparison | Scores are relative to each query | Merge rankings across batched calls for the same query. Do not compare scores across different queries. |
| Embeddings endpoint | Not available (HTTP 404) | /v1/rerank is the only route for these models. To search large corpora, use index-based retrieval, described in the next section. |
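The resolution and body-size limits reduce to two small calculations. The sketch below isolates them as pure functions, assuming the same 1.3 megapixel budget and 900,000-byte batch margin used in the script earlier in this guide; the function names are hypothetical.

```python
import math

MAX_PIXELS = 1_300_000     # per-page pixel budget from the table above
MAX_BATCH_BYTES = 900_000  # margin under the ~1 MB body limit, as in the script


def fit_to_pixel_budget(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height) scaled down, keeping aspect ratio, to fit max_pixels."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / pixels)
    return int(width * scale), int(height * scale)


def split_into_batches(sizes, max_bytes=MAX_BATCH_BYTES):
    """Group item indices so each batch's total size stays under max_bytes.

    An item larger than max_bytes still gets a batch of its own; such a page
    should be compressed further before sending.
    """
    batches, current, current_bytes = [], [], 0
    for idx, size in enumerate(sizes):
        if current and current_bytes + size >= max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

For example, a US Letter page rendered at 110 DPI is 935 by 1210 pixels, about 1.13 megapixels, so it passes through unchanged. The same page at 150 DPI is 1275 by 1650 pixels, about 2.1 megapixels, and is scaled down to roughly 1002 by 1297.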
Resolve the most common errors as follows:
- HTTP 413: The request body exceeds ~1 MB. Compress images further or send fewer pages per call.
- HTTP 400 validation error: Check the document shape. Text documents are plain strings, while image documents use the {"content": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}]} shape.
- Scores all low and similar (~1.5 to 2): The request passes a base64 image as a bare string, so the API scores it as text. Use the content-parts shape.
- Empty answer from the chat model: Raise max_tokens. Reasoning models spend tokens on reasoning before the answer, so allow at least 1000.
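These checks can be folded into client code so that failures carry a suggested next step. The following is a rough sketch under the assumption that the status codes behave as listed above; explain_status and needs_more_tokens are hypothetical helpers, and the hint strings simply restate this section.

```python
def explain_status(status_code):
    """Translate the HTTP statuses discussed in this section into a next step."""
    hints = {
        413: "Body exceeds ~1 MB: compress images further or send fewer pages per call.",
        400: "Validation error: text documents are plain strings; image documents "
             "use the content-parts shape.",
        404: "Route not available: /v1/rerank is the only route for these models.",
    }
    return hints.get(status_code, "Unexpected status: inspect the response body.")


def needs_more_tokens(answer):
    """True when the chat model returned no visible answer text.

    An empty answer suggests raising max_tokens, since reasoning models can
    spend the whole budget before producing the answer.
    """
    return answer is None or not answer.strip()
```

With the requests library, one option is to catch requests.HTTPError around raise_for_status() and log explain_status(err.response.status_code) alongside the original error.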
Scale to Large Document Sets
The /v1/rerank endpoint re-reads every page on each query, which suits document sets of tens to a few hundred pages. To search thousands of pages, the same models support an index-based pattern that embeds every page once, stores the embeddings in a vector database, and answers each query with a fast index lookup.
This pattern requires running the model yourself, because Serverless Inference does not expose an embeddings route for these models. The open weights are on Hugging Face, and you can serve them with vLLM on a Vultr Cloud GPU instance, storing multi-vector embeddings in a vector database such as Qdrant, which supports the models' MaxSim scoring natively. Each model card includes ready-to-run vLLM code:
- VultronRetrieverFlash-Qwen3.5-0.8B
- VultronRetrieverCore-Qwen3.5-4.5B
- VultronRetrieverPrime-Qwen3.5-8B
Flash runs comfortably on a single mid-range GPU.
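To make the MaxSim scoring mentioned above concrete, the sketch below assumes the usual late-interaction definition: for each query vector, take the best dot product against any page vector, then sum those maxima. The vectors are toy two-dimensional values, not real model embeddings, and in practice the vector database performs this computation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vectors, page_vectors):
    """Late-interaction score between one query and one page.

    Each query vector contributes its best match among the page's vectors.
    """
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


# Toy example: two query vectors, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]  # matches both query vectors
page_b = [[1.0, 0.0], [1.0, 0.0]]  # matches only the first query vector

ranking = sorted(
    [("page_a", maxsim(query, page_a)), ("page_b", maxsim(query, page_b))],
    key=lambda item: item[1],
    reverse=True,
)
```

Because every page's vectors are computed once and stored, only the query has to be embedded at search time, which is what makes the index-based pattern scale to thousands of pages.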
Conclusion
You have used VultronRetriever on Vultr Serverless Inference to rank text documents and document page images, and built a multimodal RAG pipeline that answers questions directly from PDF pages. Because the models read pages visually, they capture the tables, charts, and layout that text extraction misses, with no OCR step. For larger corpora, serve the open-weight models yourself to enable index-based retrieval. For more information, refer to the Vultr Serverless Inference documentation and the model cards on Hugging Face.