Can I Run Inference Workloads for Models Other than Large Language Models on Vultr Serverless Inference?

Updated on 15 September, 2025

Serverless Inference currently specializes in serving large language models with optimized GPU resources and token streaming, alongside a small set of text-to-image and text-to-speech models.


At present, Vultr Serverless Inference is purpose-built for serving large language models (LLMs). The service is tuned for transformer-based architectures that require high GPU memory, optimized token streaming, and inference container environments aligned with text generation. The following large language models are available (a request sketch follows the list):

  • mistral-nemo-instruct-2407
  • qwq-32b-awq
  • deepseek-r1-distill-qwen-32b
  • qwen2.5-32b-instruct
  • qwen2.5-coder-32b-instruct
  • hermes-3-llama-3.1-70b-fp8
  • llama-3.1-70b-instruct-fp8
  • llama-3.3-70b-instruct-fp8
  • deepseek-r1-distill-llama-70b
  • kimi-k2-instruct
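
These chat models are typically consumed over an HTTP API. The snippet below is a minimal sketch of such a request; the base URL, endpoint path, payload shape, and the `VULTR_INFERENCE_API_KEY` environment variable are assumptions made for illustration, so verify the exact values against the Vultr Serverless Inference API reference.

```python
import os
import requests

# Minimal sketch: send a chat request to a Serverless Inference model.
# The base URL and OpenAI-style payload shape are assumptions; confirm
# both against the Vultr Serverless Inference API reference.
API_KEY = os.environ["VULTR_INFERENCE_API_KEY"]  # hypothetical env var name
BASE_URL = "https://api.vultrinference.com/v1"   # assumed base URL

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-3.1-70b-instruct-fp8",  # any model from the list above
        "messages": [
            {"role": "user", "content": "Summarize what serverless inference is."}
        ],
        "max_tokens": 256,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```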

In addition to LLMs, Vultr Serverless Inference supports text-to-image models such as flux.1-dev, as well as text-to-speech models that convert text into natural-sounding speech (a model-listing sketch follows this list):

  • bark
  • bark-small
  • xtts
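
To confirm which models are currently enabled for your subscription, you can list them through the API. The sketch below assumes an OpenAI-style `GET /models` route at the same assumed base URL as above; the exact path should be checked in the Vultr Serverless Inference API reference.

```python
import os
import requests

# Minimal sketch: list the models available to your Serverless Inference
# subscription. The /models route is an assumption based on OpenAI-style
# APIs; confirm it against the Vultr Serverless Inference API reference.
API_KEY = os.environ["VULTR_INFERENCE_API_KEY"]  # hypothetical env var name
BASE_URL = "https://api.vultrinference.com/v1"   # assumed base URL

response = requests.get(
    f"{BASE_URL}/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
for model in response.json().get("data", []):
    print(model.get("id"))
```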

Models that require different runtimes, scheduling, or memory configurations are not currently supported. Workloads outside the supported text, image, and speech generation profiles can be run on general-purpose GPU instances in Vultr Compute, but they remain outside the scope of Vultr Serverless Inference today.