
How to Deploy AMD Inference Microservice (AIM) on Vultr Cloud GPU

Updated on 21 November, 2025
Deploy AMD Inference Microservices on Vultr Cloud GPU to serve high-performance LLMs.

AMD Inference Microservices (AIMs) are standardized, portable services for deploying AI models on AMD Instinct™ GPUs. They run on ROCm 7 and are distributed as Docker images for easy deployment across environments. AIMs simplify model serving by automatically selecting optimal runtime configurations based on hardware and model requirements. They also provide an OpenAI-compatible API for seamless integration with existing applications.

This guide explains how to deploy an AMD Inference Microservice on Vultr Cloud GPU and use AMD Instinct™ GPUs to serve AI models through an OpenAI-compatible API.

Prerequisites

Before you begin, ensure you:

  • Have a Kubernetes cluster on Vultr with AMD Instinct™ MI325X or MI300X GPU worker nodes.
  • Have kubectl installed and configured to access the cluster.
  • Have a Hugging Face account and an access token with permission to download the Qwen/Qwen3-32B model.

Deploy the AMD Inference Microservice

This section walks you through deploying an AMD Inference Microservice (AIM) running the Qwen/Qwen3-32B model on a Kubernetes cluster equipped with AMD Instinct™ GPUs.

You’ll configure your namespace, authenticate with Hugging Face, apply the AMD device plugin, create a custom vLLM profile for MI325X, and deploy the AIM using a Kubernetes Deployment and Service. Finally, you will have a fully running AIM instance exposed via NodePort and ready to serve inference through an OpenAI-compatible API endpoint.

Note
This guide deploys the Qwen/Qwen3-32B model. If you deploy a different model from the AIM Catalog, update the ConfigMap and Deployment to match the model’s recommended profile and resources.
  1. Export your namespace. Replace YOUR_NS_NAME with your namespace name.

    console
    $ export NS=YOUR_NS_NAME
    
  2. Export your Hugging Face access token. Replace YOUR_HUGGINGFACE_TOKEN with your Hugging Face access token.

    console
    $ export HF_TOKEN=YOUR_HUGGINGFACE_TOKEN
    
  3. Create the namespace.

    console
    $ kubectl create ns $NS
    
  4. Create a Kubernetes secret containing your Hugging Face token.

    console
    $ kubectl create secret generic hf-token \
        --from-literal="hf-token=$HF_TOKEN" \
        -n $NS
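    

    Optionally, confirm that the secret exists in your namespace before continuing:

    console
    $ kubectl get secret hf-token -n $NS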
    
  5. Apply the AMD device plugin manifest.

    console
    $ kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
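    

    After the device plugin DaemonSet starts, each GPU node advertises the amd.com/gpu resource. You can optionally confirm this before continuing:

    console
    $ kubectl describe nodes | grep amd.com/gpu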
    
  6. Create a ConfigMap to store the custom vLLM profile for MI325X.

    console
    $ nano vllm-mi325x-fp16-tp1-latency.yaml
    

    Add the following content:

    yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: qwen3-32b-profile
    data:
      vllm-mi325x-fp16-tp1-latency.yaml: |
        aim_id: Qwen/Qwen3-32B
        model_id: Qwen/Qwen3-32B
        metadata:
          engine: vllm
          gpu: MI325X
          precision: fp16
          gpu_count: 1
          metric: latency
          manual_selection_only: false
          type: unoptimized
        engine_args:
          swap-space: 64
          tensor-parallel-size: 1
          max-num-seqs: 512
          dtype: float16
          max-seq-len-to-capture: 32768
          max-num-batched-tokens: 1024
          max-model-len: 32768
          no-enable-prefix-caching:
          no-enable-log-requests:
          disable-uvicorn-access-log:
          no-trust-remote-code:
          gpu-memory-utilization: 0.9
          distributed_executor_backend: mp
          reasoning-parser: qwen3
          async-scheduling:
        env_vars:
          GPU_ARCHS: "gfx942"
          HSA_NO_SCRATCH_RECLAIM: "1"
          VLLM_USE_AITER_TRITON_ROPE: "1"
          VLLM_ROCM_USE_AITER: "1"
          VLLM_ROCM_USE_AITER_RMSNORM: "1"
    

    This YAML file creates a ConfigMap that stores a custom vLLM runtime profile for running the Qwen/Qwen3-32B model on an AMD MI325X GPU. The profile defines:

    • Model metadata such as engine type (vLLM), GPU type (MI325X), precision (fp16), number of GPUs (1), and that the profile targets latency-optimized performance.
    • Engine arguments that control how vLLM runs the model, including tensor parallelism, maximum sequence lengths, batching limits, memory utilization, and various optimization flags.
    • Environment variables required for ROCm and vLLM optimizations specific to the MI325X architecture (gfx942).

    Save and exit the file.

    Note
    Create this ConfigMap only when you deploy on MI325X. You do not need this ConfigMap for MI300X.
  7. Apply the ConfigMap manifest.

    console
    $ kubectl apply -f vllm-mi325x-fp16-tp1-latency.yaml -n $NS
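    

    Optionally, verify that the ConfigMap was created:

    console
    $ kubectl get configmap qwen3-32b-profile -n $NS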
    
  8. Create the Deployment manifest.

    console
    $ nano aim-qwen3-deployment.yaml
    

    Add the following content:

    yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: minimal-aim-deployment
      labels:
        app: minimal-aim-deployment
    spec:
      progressDeadlineSeconds: 3600
      replicas: 1
      selector:
        matchLabels:
          app: minimal-aim-deployment
      template:
        metadata:
          labels:
            app: minimal-aim-deployment
        spec:
          containers:
            - name: minimal-aim-deployment
              image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.4
              imagePullPolicy: Always
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token
                      key: hf-token
                - name: AIM_PRECISION
                  value: "fp16"
                - name: AIM_GPU_COUNT
                  value: "1"
                - name: AIM_ENGINE
                  value: "vllm"
                - name: AIM_METRIC
                  value: "latency"
                - name: AIM_LOG_LEVEL_ROOT
                  value: "INFO"
                - name: AIM_LOG_LEVEL
                  value: "INFO"
                - name: AIM_PORT
                  value: "8000"
              ports:
                - name: http
                  containerPort: 8000
              resources:
                requests:
                  memory: "80Gi"
                  cpu: "8"
                  amd.com/gpu: "1"
                limits:
                  memory: "80Gi"
                  cpu: "8"
                  amd.com/gpu: "1"
              startupProbe:
                httpGet:
                  path: /v1/models
                  port: http
                periodSeconds: 10
                failureThreshold: 360
              livenessProbe:
                httpGet:
                  path: /health
                  port: http
              readinessProbe:
                httpGet:
                  path: /v1/models
                  port: http
              volumeMounts:
                - name: ephemeral-storage
                  mountPath: /tmp
                - name: dshm
                  mountPath: /dev/shm
                - name: custom-profile                                          # MI325X
                  mountPath: /workspace/aim-runtime/profiles/Qwen/Qwen3-32B     # MI325X
                  readOnly: true                                                # MI325X
          volumes:
            - name: ephemeral-storage
              emptyDir:
                sizeLimit: 256Gi
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 64Gi
            - name: custom-profile          # MI325X
              configMap:                    # MI325X
                name: qwen3-32b-profile     # MI325X
    

    Save and exit the file.

    Note
    This manifest deploys the Qwen/Qwen3-32B model on an AMD MI325X GPU. If you are deploying on an AMD MI300X GPU, remove all lines marked with the # MI325X tag.
  9. Apply the deployment manifest.

    console
    $ kubectl apply -f aim-qwen3-deployment.yaml -n $NS
    
    Note
    The deployment may take up to 5 minutes to pull the model and become ready.
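
    You can watch the pod status and follow the container logs while the image is pulled and the model weights download:

    console
    $ kubectl get pods -n $NS -w
    $ kubectl logs -f deployment/minimal-aim-deployment -n $NS
    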
  10. Create a NodePort type service to expose the deployment externally.

    console
    $ nano aim-qwen3-svc.yaml
    

    Add the following content:

    yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: minimal-aim-deployment
      labels:
        app: minimal-aim-deployment
    spec:
      type: NodePort
      ports:
        - name: http
          port: 80
          targetPort: 8000
          nodePort: 32000
      selector:
        app: minimal-aim-deployment
    

    Save and exit the file.
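
    Apply the service manifest and confirm that the NodePort is assigned.

    console
    $ kubectl apply -f aim-qwen3-svc.yaml -n $NS
    $ kubectl get svc minimal-aim-deployment -n $NS
    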

  11. Test the inference endpoint. Replace SERVER-IP with the public IP address of any worker node in your cluster.

    console
    $ curl http://SERVER-IP:32000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "Qwen/Qwen3-32B",
            "prompt": "Artificial intelligence is the field of computer science that",
            "max_tokens": 50,
            "temperature": 0
        }'
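    

    Because the endpoint is OpenAI-compatible, you can also send a chat-style request. The example below assumes the standard /v1/chat/completions route is exposed, which is typical for OpenAI-compatible vLLM servers:

    console
    $ curl http://SERVER-IP:32000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "Qwen/Qwen3-32B",
            "messages": [
                {"role": "user", "content": "Explain what an inference microservice is in one sentence."}
            ],
            "max_tokens": 100
        }'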
    

Conclusion

In this guide, you deployed an AMD Inference Microservice (AIM) on a Vultr Cloud GPU Kubernetes cluster using AMD Instinct™ hardware. You configured the required namespace, authenticated with Hugging Face, and applied the AMD device plugin. You also created a custom vLLM runtime profile for the MI325X GPU, deployed the Qwen/Qwen3-32B model, and exposed the service externally through a NodePort. By completing these steps, you now have a fully operational AIM instance capable of serving LLM inference through an OpenAI-compatible API endpoint.
