AMD Inference Microservices (AIMs) are standardized, portable services for deploying AI models on AMD Instinct™ GPUs. They run on ROCm 7 and are distributed as Docker images for easy deployment across environments. AIMs simplify model serving by automatically selecting optimal runtime configurations based on hardware and model requirements. They also provide an OpenAI-compatible API for seamless integration with existing applications.
This guide explains how to deploy an AMD Inference Microservice (AIM) on a Vultr Cloud GPU instance and use AMD Instinct™ GPUs to serve AI models with minimal setup and seamless integration.
Prerequisites
Before you begin, ensure you:
- Have access to an AMD Instinct™ MI300X/MI325X GPU.
- Create a Hugging Face Access Token.
- Create an RKE2-based Kubernetes Cluster.
- Install the AMD GPU Operator.
Deploy the AMD Inference Microservice
This section walks you through deploying an AMD Inference Microservice (AIM) running the Qwen/Qwen3-32B model on a Kubernetes cluster equipped with AMD Instinct™ GPUs.
You’ll configure your namespace, authenticate with Hugging Face, apply the AMD device plugin, create a custom vLLM profile for MI325X, and deploy the AIM using a Kubernetes Deployment and Service. Finally, you will have a fully running AIM instance exposed via NodePort and ready to serve inference through an OpenAI-compatible API endpoint.
Export your namespace. Replace `YOUR_NS_NAME` with your namespace name.

```console
$ export NS=YOUR_NS_NAME
```
Export your Hugging Face access token. Replace `YOUR_HUGGINGFACE_TOKEN` with your Hugging Face access token.

```console
$ export HF_TOKEN=YOUR_HUGGINGFACE_TOKEN
```
Create the namespace.
```console
$ kubectl create ns $NS
```
Create a Kubernetes secret containing your Hugging Face token.
```console
$ kubectl create secret generic hf-token \
    --from-literal="hf-token=$HF_TOKEN" \
    -n $NS
```
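Optionally, confirm that the secret exists before continuing. The command below only assumes the secret name `hf-token` created above.

```console
$ kubectl get secret hf-token -n $NS
```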
Apply the AMD device plugin manifest.
```console
$ kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
```
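Optionally, verify that the device plugin advertises GPU resources to the scheduler. The `amd.com/gpu` resource name matches the one requested by the Deployment later in this guide.

```console
$ kubectl describe nodes | grep -i "amd.com/gpu"
```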
Create a ConfigMap to store the custom vLLM profile for MI325X.
```console
$ nano vllm-mi325x-fp16-tp1-latency.yaml
```
Add the following content:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: qwen3-32b-profile
data:
  vllm-mi325x-fp16-tp1-latency.yaml: |
    aim_id: Qwen/Qwen3-32B
    model_id: Qwen/Qwen3-32B
    metadata:
      engine: vllm
      gpu: MI325X
      precision: fp16
      gpu_count: 1
      metric: latency
      manual_selection_only: false
      type: unoptimized
    engine_args:
      swap-space: 64
      tensor-parallel-size: 1
      max-num-seqs: 512
      dtype: float16
      max-seq-len-to-capture: 32768
      max-num-batched-tokens: 1024
      max-model-len: 32768
      no-enable-prefix-caching:
      no-enable-log-requests:
      disable-uvicorn-access-log:
      no-trust-remote-code:
      gpu-memory-utilization: 0.9
      distributed_executor_backend: mp
      reasoning-parser: qwen3
      async-scheduling:
    env_vars:
      GPU_ARCHS: "gfx942"
      HSA_NO_SCRATCH_RECLAIM: "1"
      VLLM_USE_AITER_TRITON_ROPE: "1"
      VLLM_ROCM_USE_AITER: "1"
      VLLM_ROCM_USE_AITER_RMSNORM: "1"
```
This YAML file creates a ConfigMap that stores a custom vLLM runtime profile for running the `Qwen/Qwen3-32B` model on an AMD MI325X GPU. The profile defines:
- Model metadata such as the engine type (vLLM), GPU type (MI325X), precision (fp16), number of GPUs (1), and that the profile targets latency-optimized performance.
- Engine arguments that control how vLLM runs the model, including tensor parallelism, maximum sequence lengths, batching limits, memory utilization, and various optimization flags.
- Environment variables required for ROCm and vLLM optimizations specific to the MI325X architecture (gfx942).
Save and exit the file.
Note: Create this ConfigMap only when you deploy on MI325X. You do not need this ConfigMap for MI300X.

Apply the ConfigMap manifest.
```console
$ kubectl apply -f vllm-mi325x-fp16-tp1-latency.yaml -n $NS
```
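Optionally, confirm that the ConfigMap was created in your namespace before moving on.

```console
$ kubectl get configmap qwen3-32b-profile -n $NS
```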
Create the Deployment manifest.
```console
$ nano aim-qwen3-deployment.yaml
```
Add the following content:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minimal-aim-deployment
  labels:
    app: minimal-aim-deployment
spec:
  progressDeadlineSeconds: 3600
  replicas: 1
  selector:
    matchLabels:
      app: minimal-aim-deployment
  template:
    metadata:
      labels:
        app: minimal-aim-deployment
    spec:
      containers:
        - name: minimal-aim-deployment
          image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.4
          imagePullPolicy: Always
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: hf-token
            - name: AIM_PRECISION
              value: "fp16"
            - name: AIM_GPU_COUNT
              value: "1"
            - name: AIM_ENGINE
              value: "vllm"
            - name: AIM_METRIC
              value: "latency"
            - name: AIM_LOG_LEVEL_ROOT
              value: "INFO"
            - name: AIM_LOG_LEVEL
              value: "INFO"
            - name: AIM_PORT
              value: "8000"
          ports:
            - name: http
              containerPort: 8000
          resources:
            requests:
              memory: "80Gi"
              cpu: "8"
              amd.com/gpu: "1"
            limits:
              memory: "80Gi"
              cpu: "8"
              amd.com/gpu: "1"
          startupProbe:
            httpGet:
              path: /v1/models
              port: http
            periodSeconds: 10
            failureThreshold: 360
          livenessProbe:
            httpGet:
              path: /health
              port: http
          readinessProbe:
            httpGet:
              path: /v1/models
              port: http
          volumeMounts:
            - name: ephemeral-storage
              mountPath: /tmp
            - name: dshm
              mountPath: /dev/shm
            - name: custom-profile                                        # MI325X
              mountPath: /workspace/aim-runtime/profiles/Qwen/Qwen3-32B   # MI325X
              readOnly: true                                              # MI325X
      volumes:
        - name: ephemeral-storage
          emptyDir:
            sizeLimit: 256Gi
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
        - name: custom-profile        # MI325X
          configMap:                  # MI325X
            name: qwen3-32b-profile   # MI325X
```
Save and exit the file.
Note: This manifest deploys the `Qwen/Qwen3-32B` model on an AMD MI325X GPU. If you are deploying on an AMD MI300X GPU, remove all lines marked with the `# MI325X` tag.

Apply the deployment manifest.
```console
$ kubectl apply -f aim-qwen3-deployment.yaml -n $NS
```
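You can watch the pod start and follow the container logs while the model downloads. The deployment name below matches the manifest created above.

```console
$ kubectl get pods -n $NS -w
$ kubectl logs -f deployment/minimal-aim-deployment -n $NS
```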
Note: The deployment may take up to 5 minutes to pull the model and become ready.

Create a NodePort type service to expose the deployment externally.
```console
$ nano aim-qwen3-svc.yaml
```
Add the following content:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: minimal-aim-deployment
  labels:
    app: minimal-aim-deployment
spec:
  type: NodePort
  ports:
    - name: http
      port: 80
      targetPort: 8000
      nodePort: 32000
  selector:
    app: minimal-aim-deployment
```
Save and exit the file.
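Apply the service manifest, then note the public IP address of your node, which you use as `SERVER-IP` in the next step. The commands below reuse the filename and service name defined above.

```console
$ kubectl apply -f aim-qwen3-svc.yaml -n $NS
$ kubectl get svc minimal-aim-deployment -n $NS
$ kubectl get nodes -o wide
```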
Test the inference endpoint. Replace `SERVER-IP` with the public IP address of a cluster node.
```console
$ curl http://SERVER-IP:32000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-32B",
        "prompt": "Artificial intelligence is the field of computer science that",
        "max_tokens": 50,
        "temperature": 0
    }'
```
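Because the AIM exposes an OpenAI-compatible API, you can also try a chat-style request. This is a minimal sketch that assumes the same model name and NodePort, and the standard `/v1/chat/completions` path served by vLLM-based OpenAI-compatible endpoints.

```console
$ curl http://SERVER-IP:32000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-32B",
        "messages": [{"role": "user", "content": "Summarize what an inference microservice does."}],
        "max_tokens": 100
    }'
```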
Conclusion
In this guide, you successfully deployed an AMD Inference Microservice (AIM) on a Vultr Cloud GPU instance using AMD Instinct™ hardware. You configured the required namespace, authenticated with Hugging Face, and applied the AMD device plugin. You also created a custom vLLM runtime profile for the MI325X GPU, deployed the Qwen/Qwen3-32B model, and exposed the service externally through a NodePort. By completing these steps, you now have a fully operational AIM instance capable of serving LLM inference through an OpenAI-compatible API endpoint.