Can I Run Inference Workloads for Models Other than Large Language Models on Vultr Serverless Inference?

Updated on 15 September, 2025

Serverless Inference currently specializes in serving large language models with optimized GPU resources and token streaming, alongside a small set of text-to-image and text-to-speech models.


At present, Vultr Serverless Inference is purpose-built for serving large language models (LLMs). The service is tuned for transformer-based architectures that require high GPU memory, optimized token streaming, and inference container environments aligned with text generation. The following large language models are available (a request sketch follows the list):

  • mistral-nemo-instruct-2407
  • qwq-32b-awq
  • deepseek-r1-distill-qwen-32b
  • qwen2.5-32b-instruct
  • qwen2.5-coder-32b-instruct
  • hermes-3-llama-3.1-70b-fp8
  • llama-3.1-70b-instruct-fp8
  • llama-3.3-70b-instruct-fp8
  • deepseek-r1-distill-llama-70b
  • kimi-k2-instruct
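
These chat models are typically consumed over an HTTP API. The snippet below is a minimal sketch of such a request; the base URL, endpoint path, payload shape, and the `VULTR_INFERENCE_API_KEY` environment variable are assumptions made for illustration, so verify the exact values against the Vultr Serverless Inference API reference.

```python
import os
import requests

# Minimal sketch: send a chat request to a Serverless Inference model.
# The base URL and OpenAI-style payload shape are assumptions; confirm
# both against the Vultr Serverless Inference API reference.
API_KEY = os.environ["VULTR_INFERENCE_API_KEY"]  # hypothetical env var name
BASE_URL = "https://api.vultrinference.com/v1"   # assumed base URL

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-3.1-70b-instruct-fp8",  # any model from the list above
        "messages": [
            {"role": "user", "content": "Summarize what serverless inference is."}
        ],
        "max_tokens": 256,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```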

In addition to LLMs, Vultr Serverless Inference supports text-to-image models such as flux.1-dev, as well as text-to-speech models that convert text into natural-sounding speech (a model-listing sketch follows this list):

  • bark
  • bark-small
  • xtts
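
To confirm which models are currently enabled for your subscription, you can list them through the API. The sketch below assumes an OpenAI-style `GET /models` route at the same assumed base URL as above; the exact path should be checked in the Vultr Serverless Inference API reference.

```python
import os
import requests

# Minimal sketch: list the models available to your Serverless Inference
# subscription. The /models route is an assumption based on OpenAI-style
# APIs; confirm it against the Vultr Serverless Inference API reference.
API_KEY = os.environ["VULTR_INFERENCE_API_KEY"]  # hypothetical env var name
BASE_URL = "https://api.vultrinference.com/v1"   # assumed base URL

response = requests.get(
    f"{BASE_URL}/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
for model in response.json().get("data", []):
    print(model.get("id"))
```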

Models that require different runtimes, scheduling, or memory configurations are not currently supported. Workloads outside the supported text, image, and speech generation profiles can be run on general-purpose GPU instances in Vultr Compute, but they remain outside the scope of Vultr Serverless Inference today.