AMD Inference Microservices (AIMs) are standardized, portable services for deploying AI models on AMD Instinct™ GPUs. They run on ROCm 7 and are distributed as Docker images for easy deployment across environments. AIMs simplify model serving by automatically selecting optimal runtime configurations based on hardware and model requirements. They also provide an OpenAI-compatible API for seamless integration with existing applications.
This guide explains how to deploy an AMD Inference Microservice (AIM) on a Vultr Cloud GPU instance and use AMD Instinct™ GPUs to serve AI models with minimal setup and seamless integration.
Prerequisites
Before you begin, ensure you:
- Have access to an AMD Instinct™ MI300X/MI325X GPU.
- Create a Hugging Face Access Token.
- Create an RKE2-based Kubernetes Cluster.
- Install the AMD GPU Operator.
Deploy the AMD Inference Microservice
This section walks you through deploying an AMD Inference Microservice (AIM) running the Qwen/Qwen3-32B model on a Kubernetes cluster equipped with AMD Instinct™ GPUs.
You’ll configure your namespace, authenticate with Hugging Face, apply the AMD device plugin, create a custom vLLM profile for MI325X, and deploy the AIM using a Kubernetes Deployment and Service. Finally, you will have a fully running AIM instance exposed via NodePort and ready to serve inference through an OpenAI-compatible API endpoint.
Export your namespace. Replace `YOUR_NS_NAME` with your namespace name.

```console
$ export NS=YOUR_NS_NAME
```
Export your Hugging Face access token. Replace `YOUR_HUGGINGFACE_TOKEN` with your Hugging Face access token.

```console
$ export HF_TOKEN=YOUR_HUGGINGFACE_TOKEN
```
Create the namespace.
```console
$ kubectl create ns $NS
```
Create a Kubernetes secret containing your Hugging Face token.
```console
$ kubectl create secret generic hf-token \
    --from-literal="hf-token=$HF_TOKEN" \
    -n $NS
```
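Optionally, confirm that the secret exists before continuing. The command below only assumes the secret name `hf-token` created above.

```console
$ kubectl get secret hf-token -n $NS
```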
Apply the AMD device plugin manifest.
```console
$ kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
```
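Optionally, verify that the device plugin advertises GPU resources to the scheduler. The `amd.com/gpu` resource name matches the one requested by the Deployment later in this guide.

```console
$ kubectl describe nodes | grep -i "amd.com/gpu"
```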
Create a ConfigMap to store the custom vLLM profile for MI325X.
```console
$ nano vllm-mi325x-fp16-tp1-latency.yaml
```
Add the following content:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: qwen3-32b-profile
data:
  vllm-mi325x-fp16-tp1-latency.yaml: |
    aim_id: Qwen/Qwen3-32B
    model_id: Qwen/Qwen3-32B
    metadata:
      engine: vllm
      gpu: MI325X
      precision: fp16
      gpu_count: 1
      metric: latency
      manual_selection_only: false
      type: unoptimized
    engine_args:
      swap-space: 64
      tensor-parallel-size: 1
      max-num-seqs: 512
      dtype: float16
      max-seq-len-to-capture: 32768
      max-num-batched-tokens: 1024
      max-model-len: 32768
      no-enable-prefix-caching:
      no-enable-log-requests:
      disable-uvicorn-access-log:
      no-trust-remote-code:
      gpu-memory-utilization: 0.9
      distributed_executor_backend: mp
      reasoning-parser: qwen3
      async-scheduling:
    env_vars:
      GPU_ARCHS: "gfx942"
      HSA_NO_SCRATCH_RECLAIM: "1"
      VLLM_USE_AITER_TRITON_ROPE: "1"
      VLLM_ROCM_USE_AITER: "1"
      VLLM_ROCM_USE_AITER_RMSNORM: "1"
```
This YAML file creates a ConfigMap that stores a custom vLLM runtime profile for running the `Qwen/Qwen3-32B` model on an AMD MI325X GPU. The profile defines:
- Model metadata such as the engine type (vLLM), GPU type (MI325X), precision (fp16), number of GPUs (1), and that the profile targets latency-optimized performance.
- Engine arguments that control how vLLM runs the model, including tensor parallelism, maximum sequence lengths, batching limits, memory utilization, and various optimization flags.
- Environment variables required for ROCm and vLLM optimizations specific to the MI325X architecture (gfx942).
Save and exit the file.
Note: Create this ConfigMap only when you deploy on MI325X. You do not need this ConfigMap for MI300X.

Apply the ConfigMap manifest.
```console
$ kubectl apply -f vllm-mi325x-fp16-tp1-latency.yaml -n $NS
```
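Optionally, confirm that the ConfigMap was created in your namespace before moving on.

```console
$ kubectl get configmap qwen3-32b-profile -n $NS
```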
Create the Deployment manifest.
```console
$ nano aim-qwen3-deployment.yaml
```
Add the following content:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minimal-aim-deployment
  labels:
    app: minimal-aim-deployment
spec:
  progressDeadlineSeconds: 3600
  replicas: 1
  selector:
    matchLabels:
      app: minimal-aim-deployment
  template:
    metadata:
      labels:
        app: minimal-aim-deployment
    spec:
      containers:
        - name: minimal-aim-deployment
          image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.4
          imagePullPolicy: Always
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: hf-token
            - name: AIM_PRECISION
              value: "fp16"
            - name: AIM_GPU_COUNT
              value: "1"
            - name: AIM_ENGINE
              value: "vllm"
            - name: AIM_METRIC
              value: "latency"
            - name: AIM_LOG_LEVEL_ROOT
              value: "INFO"
            - name: AIM_LOG_LEVEL
              value: "INFO"
            - name: AIM_PORT
              value: "8000"
          ports:
            - name: http
              containerPort: 8000
          resources:
            requests:
              memory: "80Gi"
              cpu: "8"
              amd.com/gpu: "1"
            limits:
              memory: "80Gi"
              cpu: "8"
              amd.com/gpu: "1"
          startupProbe:
            httpGet:
              path: /v1/models
              port: http
            periodSeconds: 10
            failureThreshold: 360
          livenessProbe:
            httpGet:
              path: /health
              port: http
          readinessProbe:
            httpGet:
              path: /v1/models
              port: http
          volumeMounts:
            - name: ephemeral-storage
              mountPath: /tmp
            - name: dshm
              mountPath: /dev/shm
            - name: custom-profile                                        # MI325X
              mountPath: /workspace/aim-runtime/profiles/Qwen/Qwen3-32B   # MI325X
              readOnly: true                                              # MI325X
      volumes:
        - name: ephemeral-storage
          emptyDir:
            sizeLimit: 256Gi
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
        - name: custom-profile        # MI325X
          configMap:                  # MI325X
            name: qwen3-32b-profile   # MI325X
```
Save and exit the file.
Note: This manifest deploys the `Qwen/Qwen3-32B` model on an AMD MI325X GPU. If you are deploying on an AMD MI300X GPU, remove all lines marked with the `# MI325X` tag.

Apply the deployment manifest.
```console
$ kubectl apply -f aim-qwen3-deployment.yaml -n $NS
```
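You can watch the pod start and follow the container logs while the model downloads. The deployment name below matches the manifest created above.

```console
$ kubectl get pods -n $NS -w
$ kubectl logs -f deployment/minimal-aim-deployment -n $NS
```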
Note: The deployment may take up to 5 minutes to pull the model and become ready.

Create a NodePort type service to expose the deployment externally.
```console
$ nano aim-qwen3-svc.yaml
```
Add the following content:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: minimal-aim-deployment
  labels:
    app: minimal-aim-deployment
spec:
  type: NodePort
  ports:
    - name: http
      port: 80
      targetPort: 8000
      nodePort: 32000
  selector:
    app: minimal-aim-deployment
```
Save and exit the file.
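Apply the service manifest, then note the public IP address of your node, which you use as `SERVER-IP` in the next step. The commands below reuse the filename and service name defined above.

```console
$ kubectl apply -f aim-qwen3-svc.yaml -n $NS
$ kubectl get svc minimal-aim-deployment -n $NS
$ kubectl get nodes -o wide
```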
Test the inference endpoint. Replace `SERVER-IP` with the public IP address of a cluster node.
```console
$ curl http://SERVER-IP:32000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-32B",
        "prompt": "Artificial intelligence is the field of computer science that",
        "max_tokens": 50,
        "temperature": 0
    }'
```
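Because the AIM exposes an OpenAI-compatible API, you can also try a chat-style request. This is a minimal sketch that assumes the same model name and NodePort, and the standard `/v1/chat/completions` path served by vLLM-based OpenAI-compatible endpoints.

```console
$ curl http://SERVER-IP:32000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-32B",
        "messages": [{"role": "user", "content": "Summarize what an inference microservice does."}],
        "max_tokens": 100
    }'
```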
Conclusion
In this guide, you successfully deployed an AMD Inference Microservice (AIM) on a Vultr Cloud GPU instance using AMD Instinct™ hardware. You configured the required namespace, authenticated with Hugging Face, and applied the AMD device plugin. You also created a custom vLLM runtime profile for the MI325X GPU, deployed the Qwen/Qwen3-32B model, and exposed the service externally through a NodePort. By completing these steps, you now have a fully operational AIM instance capable of serving LLM inference through an OpenAI-compatible API endpoint.