Vultr Serverless Inference currently specializes in serving large language models (LLMs) with optimized GPU resources and token streaming. The service is tuned for transformer-based architectures that require high GPU memory, streamed token output, and inference container environments built specifically for text generation; a hedged streaming example appears at the end of this section. The following large language models are available:
In addition to LLMs, Vultr Serverless Inference supports text-to-image models such as flux.1-dev, as well as text-to-speech models that convert text into natural-sounding speech (see the image-generation sketch at the end of this section). The following models are available:
Models that require different runtime, scheduling, or memory configurations are not currently supported. Workloads outside the supported text, image, and speech generation profiles can run on general-purpose GPU instances in Vultr Compute, but they remain outside the scope of Vultr Serverless Inference today.
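To show the token streaming behavior described above, here is a minimal sketch that streams a chat completion with the `openai` Python client, assuming the service exposes an OpenAI-compatible chat completions endpoint. The base URL, model name, and API key are placeholder assumptions, not confirmed values; substitute the ones shown for your subscription in the Vultr Customer Portal.

```python
from openai import OpenAI

# Placeholder assumptions: confirm the actual base URL, model name, and
# API key for your subscription in the Vultr Customer Portal.
client = OpenAI(
    api_key="YOUR_INFERENCE_API_KEY",
    base_url="https://api.vultrinference.com/v1",  # assumed OpenAI-compatible base URL
)

# stream=True requests server-sent chunks, so tokens print as they are
# generated instead of arriving all at once when the completion finishes.
stream = client.chat.completions.create(
    model="llama-3.1-70b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain token streaming in one sentence."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```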
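For the text-to-image path, the request below is a hypothetical sketch only: the route, payload fields, and `b64_json` response shape mirror common OpenAI-style image APIs and are assumptions rather than documented Vultr endpoints, so verify them against the service's API reference before use.

```python
import base64
import requests

# Hypothetical route and payload: the field names and response shape are
# assumptions borrowed from OpenAI-style image APIs, not Vultr documentation.
response = requests.post(
    "https://api.vultrinference.com/v1/images/generations",  # assumed route
    headers={"Authorization": "Bearer YOUR_INFERENCE_API_KEY"},
    json={
        "model": "flux.1-dev",
        "prompt": "A lighthouse at dusk, in watercolor",
    },
    timeout=120,
)
response.raise_for_status()

# Assumes the first result carries a base64-encoded image.
image_b64 = response.json()["data"][0]["b64_json"]
with open("lighthouse.png", "wb") as handle:
    handle.write(base64.b64decode(image_b64))
```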