Rank Documents with VultronRetriever on Vultr Inference

VultronRetriever is a family of visual document retrieval models available on Vultr Serverless Inference. Given a question and a set of document page images, such as PDF pages, scans, slides, or screenshots, the models score how relevant each page is to the question. Unlike text-based retrieval, they read each page visually, including its tables, charts, figures, and layout, with no optical character recognition (OCR) or text-extraction step. The page image itself is the document.

This guide explains how to use VultronRetriever on Vultr Serverless Inference to rank documents and pages by relevance. It covers authenticating with the application programming interface (API), ranking plain text documents, ranking document page images, and building a complete multimodal retrieval-augmented generation (RAG) pipeline that answers questions against a PDF. It also covers choosing a model tier, handling the API's limits, and scaling to large document sets.

VultronRetriever comes in three sizes:

| Model ID | Size | Use it when |
| --- | --- | --- |
| `vultr/VultronRetrieverFlash-Qwen3.5-0.8B` | 0.8B | Default choice. Fast and cost-efficient, with strong quality. |
| `vultr/VultronRetrieverCore-Qwen3.5-4.5B` | 4.5B | A middle ground for denser documents. |
| `vultr/VultronRetrieverPrime-Qwen3.5-8B` | 8B | Maximum quality on hard or cluttered documents. |

Prerequisites

Before you begin, ensure you have:

A Vultr account with a Serverless Inference subscription and its API key.
An HTTP client such as curl to send requests to the API.
Python 3.11 or later to run the multimodal RAG pipeline.

Authenticate with the API

Vultr Serverless Inference exposes an OpenAI-compatible API. All requests use the following base URL:

https://api.vultrinference.com/v1

Export your Serverless Inference key as an environment variable so the commands throughout this guide can reference it in the bearer token header. Replace YOUR_API_KEY with your Serverless Inference key.
console
```
$ export INFERENCE_API_KEY="YOUR_API_KEY"
```
List the models your key can access to confirm access and view the available VultronRetriever IDs.
console
```
$ curl https://api.vultrinference.com/v1/models \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}"
```
The three VultronRetriever models appear in the output, each with a ReRank feature.
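
The same check can be scripted. The sketch below assumes the endpoint returns the OpenAI-style list shape, `{"data": [{"id": ...}, ...]}`, which this guide does not show; the helper name `retriever_ids` and the sample payload are illustrative only.

```python
def retriever_ids(models_response):
    """Return the VultronRetriever model IDs from a /v1/models response.

    Assumes the OpenAI-style shape {"data": [{"id": "..."}]}; adjust the
    keys if your response differs.
    """
    return sorted(
        m["id"] for m in models_response.get("data", [])
        if "VultronRetriever" in m.get("id", "")
    )


# Hypothetical sample payload, for illustration only.
sample = {"data": [
    {"id": "vultr/VultronRetrieverFlash-Qwen3.5-0.8B"},
    {"id": "some-org/some-chat-model"},
]}
print(retriever_ids(sample))
```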

Rank Text Documents

The /v1/rerank endpoint accepts a query and a list of documents, then returns them ordered by relevance. Start with plain text to confirm your setup before moving on to page images.

Send a rerank request with a query and four short documents.

console
```
$ curl -X POST https://api.vultrinference.com/v1/rerank \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
      "query": "How do transformer models handle long-range dependencies?",
      "documents": [
        "Transformers use self-attention so each token can attend to all other tokens.",
        "The weather forecast for tomorrow shows partial clouds with a high of 72 degrees.",
        "Unlike RNNs, transformers process all positions simultaneously through attention.",
        "The new restaurant downtown serves excellent pasta dishes."
      ],
      "top_n": 3
    }'
```

The response lists each returned document with a relevance_score, sorted from most to least relevant. The top_n field caps the response at the three highest-scoring documents:

```
{
  "id": "score-ac2ebcd0dfa55f34",
  "model": "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
  "usage": { "prompt_tokens": 95, "total_tokens": 95 },
  "results": [
    { "index": 0, "relevance_score": 4.84, "document": { "text": "Transformers use self-attention..." } },
    { "index": 2, "relevance_score": 4.68, "document": { "text": "Unlike RNNs, transformers..." } },
    { "index": 1, "relevance_score": 2.71, "document": { "text": "The weather forecast..." } }
  ]
}
```

The two transformer sentences score well above the unrelated ones, so relevant and irrelevant content separate clearly. Scores are relative to a single query, so compare them only within one response, not across different queries.
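
Each result's `index` is the position of the document in the request list, so it can be used to map scores back to your own records. A minimal sketch using the sample response from this section; the function name `attach_scores` is illustrative.

```python
def attach_scores(documents, response):
    """Pair each returned result with the original document it refers to.

    `index` in each result is the document's position in the request list.
    Returns [(document, relevance_score)], best first.
    """
    pairs = [(documents[r["index"]], r["relevance_score"])
             for r in response["results"]]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


documents = [
    "Transformers use self-attention so each token can attend to all other tokens.",
    "The weather forecast for tomorrow shows partial clouds with a high of 72 degrees.",
    "Unlike RNNs, transformers process all positions simultaneously through attention.",
    "The new restaurant downtown serves excellent pasta dishes.",
]
# Trimmed copy of the sample response from this section.
response = {"results": [
    {"index": 0, "relevance_score": 4.84},
    {"index": 2, "relevance_score": 4.68},
    {"index": 1, "relevance_score": 2.71},
]}
for text, score in attach_scores(documents, response):
    print(f"{score:.2f}  {text[:40]}")
```

Documents that fall outside `top_n` (here, the restaurant sentence) simply do not appear in the result.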

Rank Document Page Images

Ranking page images is the primary use case for these models. Pass each page as an image using OpenAI-style content parts. Each image document is an object with a content array that holds an image_url part, and the URL is a base64 data URI of the page.

Encode a page image as base64 and store it in a variable.
console
```
$ IMG=$(base64 < page.jpg | tr -d '\n')
```

Send a rerank request that mixes an image document with a plain text document.

console
```
$ curl -X POST https://api.vultrinference.com/v1/rerank \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
{
  "model": "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
  "query": "What was Q3 revenue?",
  "documents": [
    {"content": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,$IMG"}}]},
    "You can mix plain text documents into the same request."
  ]
}
EOF
```

The image document must use the {"content": [...]} shape shown above. A relevant page scores clearly higher than an unrelated one, in the same range as relevant text.

Warning

If you pass a bare data:image/jpeg;base64,... string as a document, the API does not return an error. It treats the base64 characters as text and returns a low, meaningless score, around the same value as unrelated text. Always wrap page images in the {"content": [...]} object.
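
One way to guard against this is to normalize every document before sending it. The sketch below is an illustrative helper (`to_rerank_document` is not part of the API): it wraps any bare `data:image/` URI in the content-parts shape and passes other values through unchanged.

```python
def to_rerank_document(item):
    """Return `item` in a shape suitable for /v1/rerank.

    Bare data:image/... strings are wrapped in the content-parts object so
    they are scored as images; plain text and already-wrapped documents
    pass through unchanged.
    """
    if isinstance(item, str) and item.startswith("data:image/"):
        return {"content": [{"type": "image_url", "image_url": {"url": item}}]}
    return item


docs = [
    "data:image/jpeg;base64,AAAA",  # placeholder, not a real image
    "A plain text document.",
]
print([type(to_rerank_document(d)).__name__ for d in docs])
```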

Build a Multimodal RAG Pipeline

The following script combines the pieces into one file. It renders each PDF page to a compressed image, ranks the pages with VultronRetriever, and passes the highest-ranked page to a vision chat model on the same API for the final answer.

Update the package index.
console
```
$ sudo apt update
```
Install the python3-venv package, which ships separately from Python on Debian and Ubuntu.
console
```
$ sudo apt install python3-venv
```
Create a virtual environment to isolate the dependencies from system packages.
console
```
$ python3 -m venv venv
```
Activate the virtual environment.
console
```
$ source venv/bin/activate
```
Install the dependencies.
console
```
$ pip install requests pymupdf pillow
```
Create the script file.
console
```
$ nano vultron_quickstart.py
```

Add the following content to the file.

python
```
"""Ask questions against any PDF with VultronRetriever on Vultr Serverless Inference.

Usage:
    python vultron_quickstart.py "What was Q3 revenue?" report.pdf

Set your API key in the INFERENCE_API_KEY environment variable.
"""
import base64
import io
import os
import sys

import fitz  # PyMuPDF
import requests
from PIL import Image

API_KEY = os.environ.get("INFERENCE_API_KEY", "YOUR_API_KEY")
BASE_URL = "https://api.vultrinference.com/v1"
RETRIEVER = "vultr/VultronRetrieverFlash-Qwen3.5-0.8B"  # or Core-Qwen3.5-4.5B / Prime-Qwen3.5-8B
CHAT_MODEL = "Qwen/Qwen3.6-27B"  # vision-capable; XiaomiMiMo/MiMo-V2.5-Pro and moonshotai/Kimi-K2.6 also work

# The API rejects request bodies over ~1 MB, so pages are JPEG-compressed and
# scored in batches. Scores for the same query are comparable across batches.
MAX_BATCH_BYTES = 900_000
MAX_PIXELS = 1_300_000  # the model's visual-token budget; higher resolution is wasted
JPEG_QUALITY = 80


def pdf_to_page_uris(pdf_path, dpi=110):
    """Render each PDF page to a compressed JPEG data URI."""
    uris = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            if img.width * img.height > MAX_PIXELS:
                scale = (MAX_PIXELS / (img.width * img.height)) ** 0.5
                img = img.resize((int(img.width * scale), int(img.height * scale)))
            buf = io.BytesIO()
            img.save(buf, format="JPEG", quality=JPEG_QUALITY)
            b64 = base64.b64encode(buf.getvalue()).decode()
            uris.append("data:image/jpeg;base64," + b64)
    return uris


def image_doc(uri):
    """Wrap a data URI in the document shape /v1/rerank expects for images."""
    return {"content": [{"type": "image_url", "image_url": {"url": uri}}]}


def rank_pages(query, page_uris, model=RETRIEVER):
    """Score every page against the query. Returns [(page_index, score)], best first,
    across as many API calls as the 1 MB body limit requires."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    scores = {}
    batch, batch_idx, batch_bytes = [], [], 0
    todo = list(enumerate(page_uris)) + [(None, None)]  # sentinel flushes the last batch
    for idx, uri in todo:
        if uri is not None and batch_bytes + len(uri) < MAX_BATCH_BYTES:
            batch.append(image_doc(uri))
            batch_idx.append(idx)
            batch_bytes += len(uri)
            continue
        if batch:
            resp = requests.post(f"{BASE_URL}/rerank", headers=headers,
                                 json={"model": model, "query": query, "documents": batch})
            resp.raise_for_status()
            for r in resp.json()["results"]:
                scores[batch_idx[r["index"]]] = r["relevance_score"]
            batch, batch_idx, batch_bytes = [], [], 0
        if uri is not None:
            batch.append(image_doc(uri))
            batch_idx.append(idx)
            batch_bytes = len(uri)
    return sorted(scores.items(), key=lambda kv: -kv[1])


def ask(query, pdf_path, top_pages=1):
    """Rank the PDF's pages, then hand the best one(s) to a vision chat model."""
    pages = pdf_to_page_uris(pdf_path)
    ranking = rank_pages(query, pages)
    print(f"page ranking: {[(i + 1, round(s, 2)) for i, s in ranking]}")
    content = [{"type": "text",
                "text": f"Answer using only the attached document page(s): {query}"}]
    for idx, _ in ranking[:top_pages]:
        content.append({"type": "image_url", "image_url": {"url": pages[idx]}})
    resp = requests.post(f"{BASE_URL}/chat/completions",
                         headers={"Authorization": f"Bearer {API_KEY}"},
                         json={"model": CHAT_MODEL, "max_tokens": 1000,
                               "messages": [{"role": "user", "content": content}]})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    query, pdf_path = sys.argv[1], sys.argv[2]
    print(ask(query, pdf_path))
```

Save and close the file.

Set your API key.

console
```
$ export INFERENCE_API_KEY="YOUR_API_KEY"
```

Run the script against any PDF. Replace "What was Q3 revenue?" with your question and report.pdf with the path to your PDF.
console
```
$ python vultron_quickstart.py "What was Q3 revenue?" report.pdf
```
The script prints the page ranking, then the model's answer drawn only from the top-ranked page:
```
page ranking: [(2, 4.59), (1, 2.57), (3, 2.33)]

Based on the attached document, the Q3 revenue was $5.2 million.
```
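
The batching logic inside `rank_pages` can be easier to reason about in isolation. The sketch below restates the same greedy rule as a pure function over payload sizes; `split_into_batches` is an illustrative name, and like the script it counts only data URI lengths, not the surrounding JSON.

```python
def split_into_batches(sizes, limit=900_000):
    """Greedily group item indices so each batch's total size stays under `limit`.

    An item that is itself at or above `limit` still gets a batch of its own,
    mirroring the script (the API may then reject it with HTTP 413).
    """
    batches, current, current_bytes = [], [], 0
    for idx, size in enumerate(sizes):
        if current and current_bytes + size >= limit:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Five pages of 300,000 bytes each: two per batch, because a third would
# reach the 900,000-byte cap.
print(split_into_batches([300_000] * 5))
```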

Choose a Model Tier

All three tiers accept identical requests, so switching between them is a one-line change to the model field. Start with Flash (0.8B). It responds in a couple of seconds for dozens of pages and ranks accurately on typical documents. Move up to Core (4.5B) or Prime (8B) when your documents are dense or cluttered, or when the answer depends on fine-grained details inside table cells.
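
Because only the `model` field changes, a small lookup table keeps the tier choice in one place. A minimal sketch; the `TIERS` mapping and `rerank_payload` helper are illustrative names, not part of the API.

```python
TIERS = {
    "flash": "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
    "core": "vultr/VultronRetrieverCore-Qwen3.5-4.5B",
    "prime": "vultr/VultronRetrieverPrime-Qwen3.5-8B",
}


def rerank_payload(query, documents, tier="flash"):
    """Build a /v1/rerank request body for the chosen tier."""
    return {"model": TIERS[tier], "query": query, "documents": documents}


payload = rerank_payload("What was Q3 revenue?", ["Revenue summary page."], tier="prime")
print(payload["model"])
```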

Handle API Limits and Errors

Keep the following limits in mind when sending requests:

| Limit | Value | What to do |
| --- | --- | --- |
| Request body size | ~1 MB (`HTTP 413` above it) | JPEG-compress pages (quality ~80) and split them across batched calls. The Python example handles this automatically. |
| Image resolution | ~1.3 megapixels per page | Downscale before sending. Higher resolution adds payload, not quality. |
| Score comparison | Scores are relative to each query | Merge rankings across batched calls for the same query. Do not compare scores across different queries. |
| Embeddings endpoint | Not available (`HTTP 404`) | `/v1/rerank` is the only route for these models. To search large corpora, use index-based retrieval, described in the next section. |
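
To see what the 1.3-megapixel budget means for a given page, the scaling rule used in the script can be run on its own. A sketch; `fit_within` is an illustrative name.

```python
MAX_PIXELS = 1_300_000


def fit_within(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height) scaled down, preserving aspect ratio, so the
    pixel count does not exceed `max_pixels`. Smaller images are unchanged."""
    if width * height <= max_pixels:
        return width, height
    scale = (max_pixels / (width * height)) ** 0.5
    return int(width * scale), int(height * scale)


# A US Letter page (8.5 x 11 in) at 110 DPI is 935 x 1210 px, under the budget.
print(fit_within(935, 1210))
# The same page at 200 DPI (1700 x 2200 px) is scaled down.
print(fit_within(1700, 2200))
```

At the script's default of 110 DPI, a Letter-sized page already fits, so the downscale step matters mainly for larger page sizes or higher DPI settings.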

Resolve the most common errors as follows:

HTTP 413: The request body exceeds ~1 MB. Compress images further or send fewer pages per call.
HTTP 400 validation error: Check the document shape. Text documents are plain strings, while image documents use the {"content": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}]} shape.
Scores all low and similar (~1.5 to 2): The request passes a base64 image as a bare string, so the API scores it as text. Use the content-parts shape.
Empty answer from the chat model: Raise max_tokens. Reasoning models spend tokens on reasoning before the answer, so allow at least 1000.
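
If a batch still triggers HTTP 413, one option is to halve it and retry each half. The sketch below shows that idea with an injected `send` callable so it runs without network access. `PayloadTooLarge`, `rerank_with_split`, and the fake sender are all illustrative; in real code, `send` would wrap the `requests.post` call and raise on status 413.

```python
class PayloadTooLarge(Exception):
    """Raised by `send` when the API answers HTTP 413."""


def rerank_with_split(send, documents, offset=0):
    """Score `documents`, halving the batch whenever `send` reports HTTP 413.

    `send(docs)` must return a list of {"index", "relevance_score"} results
    with indices local to `docs`. Returns {global_index: score}.
    """
    if not documents:
        return {}
    try:
        results = send(documents)
    except PayloadTooLarge:
        if len(documents) == 1:
            raise  # a single document is too large; compress it further
        mid = len(documents) // 2
        scores = rerank_with_split(send, documents[:mid], offset)
        scores.update(rerank_with_split(send, documents[mid:], offset + mid))
        return scores
    return {offset + r["index"]: r["relevance_score"] for r in results}


def fake_send(docs):
    """Stand-in for the API: rejects more than two documents per call."""
    if len(docs) > 2:
        raise PayloadTooLarge
    return [{"index": i, "relevance_score": float(len(d))} for i, d in enumerate(docs)]


print(rerank_with_split(fake_send, ["a", "bbb", "cc", "dddd", "e"]))
```

Merging the halves is safe here because scores for the same query are comparable across calls.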

Scale to Large Document Sets

The /v1/rerank endpoint re-reads every page on each query, which suits document sets of tens to a few hundred pages. To search thousands of pages, the same models support an index-based pattern that embeds every page once, stores the embeddings in a vector database, and answers each query with a fast index lookup.

This pattern requires you to run the model yourself, because Serverless Inference does not expose an embeddings route for these models. The open weights are on Hugging Face, and you can serve them with vLLM on a Vultr Cloud GPU instance and store the multi-vector embeddings in a vector database such as Qdrant, which supports the models' MaxSim scoring natively. Each model card includes ready-to-run vLLM code.

Flash runs comfortably on a single mid-range GPU.
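
For orientation, MaxSim is the late-interaction score used with multi-vector embeddings: for each query vector, take its best-matching page vector, then sum those maxima. The sketch below computes it in plain Python on toy vectors. The numbers are made up, and a production setup would rely on the vector database's own implementation rather than this function.

```python
def maxsim(query_vecs, page_vecs):
    """Sum, over query vectors, of the maximum dot product with any page vector."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy 2-dimensional embeddings, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query vectors
page_b = [[1.0, 0.0], [0.9, 0.1]]              # matches only the first well

print(maxsim(query, page_a))
print(maxsim(query, page_b))
```

Here `page_a` scores higher than `page_b` because every query vector finds a close match on it.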

Conclusion

You have used VultronRetriever on Vultr Serverless Inference to rank text documents and document page images, and built a multimodal RAG pipeline that answers questions directly from PDF pages. Because the models read pages visually, they capture the tables, charts, and layout that text extraction misses, with no OCR step. For larger corpora, serve the open-weight models yourself to enable index-based retrieval. For more information, refer to the Vultr Serverless Inference documentation and the model cards on Hugging Face.
