How to Deploy AMD Enterprise AI Platform on Vultr

AMD Enterprise AI Suite is a complete platform for building, deploying, and running AI workloads on Kubernetes tuned for AMD hardware. It can be used by system administrators, platform teams, AI researchers, and developers working on AI solutions.

This guide explains the platform's core components: AMD AI Workbench, AMD Resource Manager, Kubernetes AI Workload Orchestrator (Kaiwo), Kubernetes Platform, Cluster Forge, and AMD Inference Microservices (AIMs). It then walks you through deploying the platform on Vultr Cloud GPU infrastructure with AMD Instinct™ MI300X GPUs.

Prerequisites

Before you begin, ensure you:

Have access to an AMD Instinct™ MI300X GPU.
Have a 2TB Block Storage volume attached for workloads.
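Before you continue, you can confirm that the Block Storage volume is visible to the server. The following is a minimal sketch, not part of the official procedure. The helper name check_block_device is hypothetical, and the device path /dev/vdb is an assumption. Run lsblk to find the actual device name on your server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the given path is a block device.
check_block_device() {
    device="$1"
    if [ -b "$device" ]; then
        echo "OK: $device is a block device"
    else
        echo "MISSING: $device not found or not a block device" >&2
        return 1
    fi
}

# Assumed device name; replace it with the one that lsblk reports on your server.
check_block_device /dev/vdb || echo "Attach the Block Storage volume before continuing."
```

If the check reports MISSING, attach the volume to the server and run the check again before you start the installer.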

Note

This guide uses a .nip.io domain during the installation process.

Key Components of the Platform

AMD AI Workbench: Simplifies running fine-tuning, inference, and other jobs, and lets researchers manage AI workloads through low-code approaches to developing AI applications. With a comprehensive model catalog and integrations with MLOps tools such as MLflow, TensorBoard, and Kubeflow, AMD AI Workbench helps researchers use AI development tools efficiently.
AMD Resource Manager: Helps organizations control and optimize how users and teams access GPUs, data, and compute resources. It improves GPU utilization through fair scheduling and shared access, while offering dashboards to monitor usage across projects and departments.
Kubernetes AI Workload Orchestrator (Kaiwo): Enhances GPU efficiency by reducing idle time through intelligent scheduling. It manages AI job placement using a Kubernetes operator and supports features like multiple queues, fair sharing, quotas, and topology-aware scheduling to run workloads more effectively.
Kubernetes Platform: Serves as the core container orchestration layer that powers the deployment, scaling, and management of AI workloads. It provides the flexibility and reliability needed for tasks ranging from training large models to running production inference.
Cluster Forge: Simplifies the setup of a production-ready AI platform by automating the deployment of Kubernetes control and compute planes. It integrates open-source tools and packaged AI workloads, enabling teams using AMD hardware to get started within hours.
AMD Inference Microservices (AIMs): Streamline the process of serving AI models and LLMs by automatically selecting optimal runtime settings based on the model, hardware, and user inputs. The expanding catalog of prebuilt microservices makes deploying inference workloads fast and efficient.

Deploy AMD Enterprise AI Platform

This section walks you through deploying the AMD Enterprise AI Platform using the Bloom installer. You’ll download the Bloom binary, create its YAML configuration file, and launch the web-based installation interface. Because the interface listens only on the server's loopback address, you use an SSH tunnel to access it securely from your workstation. After confirming the final options in the interface, you can start the full platform deployment.

Download the official bloom binary.

console
```
$ wget https://github.com/silogen/cluster-bloom/releases/download/v1.2.2/bloom
```

Make the bloom binary executable.
console
```
$ chmod +x bloom
```
Create a bloom.yaml configuration file.
console
```
$ nano bloom.yaml
```
Add the following content to the file. Replace <server-ip-address> with your server’s IP address.
yaml
```
DOMAIN: <server-ip-address>.nip.io
OIDC_URL: https://kc.<server-ip-address>.nip.io/realms/airm
FIRST_NODE: true
GPU_NODE: true
CERT_OPTION: generate
USE_CERT_MANAGER: true
CLUSTER_DISKS: /dev/vdb1
CLUSTERFORGE_RELEASE: https://github.com/silogen/cluster-forge/releases/download/v1.5.2/release-enterprise-ai-v1.5.2.tar.gz
NO_DISKS_FOR_CLUSTER: false
```
In the above configuration:
- DOMAIN: The base domain the platform uses (nip.io resolves any hostname of the form <ip-address>.nip.io to that IP address, so no DNS setup is required).
- OIDC_URL: The Keycloak authentication endpoint for the airm realm.
- FIRST_NODE: Marks this server as the initial node in the cluster.
- GPU_NODE: Enables GPU capabilities for this node.
- CERT_OPTION: Defines how certificates are created (generate = auto-generate).
- USE_CERT_MANAGER: Enables Cert-Manager for managing TLS certificates.
- CLUSTER_DISKS: The disk or partition the cluster uses for storage.
- CLUSTERFORGE_RELEASE: URL of the Cluster Forge release package required for installation.
- NO_DISKS_FOR_CLUSTER: Indicates whether the cluster should run without disks (false = use the disk listed above).
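The server IP address appears in both DOMAIN and OIDC_URL, so generating the file from a single value helps keep the two in sync. The following is a minimal sketch that reuses the keys and values from the configuration above. The function name write_bloom_config is hypothetical, and 192.0.2.10 is a documentation-only placeholder address.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a bloom.yaml for the given server IP,
# using the same keys and values as the configuration in this guide.
write_bloom_config() {
    ip="$1"
    cat <<EOF
DOMAIN: ${ip}.nip.io
OIDC_URL: https://kc.${ip}.nip.io/realms/airm
FIRST_NODE: true
GPU_NODE: true
CERT_OPTION: generate
USE_CERT_MANAGER: true
CLUSTER_DISKS: /dev/vdb1
CLUSTERFORGE_RELEASE: https://github.com/silogen/cluster-forge/releases/download/v1.5.2/release-enterprise-ai-v1.5.2.tar.gz
NO_DISKS_FOR_CLUSTER: false
EOF
}

# 192.0.2.10 is a placeholder; pass your server's IP address instead.
write_bloom_config 192.0.2.10
```

To save the result, redirect the output to the file, for example write_bloom_config 192.0.2.10 > bloom.yaml, and review it before starting the installation.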
Start the installation.
console
```
$ sudo ./bloom --config bloom.yaml
```
This command starts a web interface at http://127.0.0.1:62078.
Create an SSH tunnel to access the interface.
console
```
$ ssh -L 62078:127.0.0.1:62078 <username>@<server-ip-address>
```
Replace <username> with your server username and <server-ip-address> with your server’s IP address.
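With the tunnel open, you can check from a second terminal on your workstation that the forwarded port answers before switching to the browser. The following is a minimal sketch that assumes curl is installed locally. The helper name installer_reachable is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if an HTTP server answers on the given local port.
# Without --fail, curl exits 0 for any HTTP response, regardless of status code.
installer_reachable() {
    port="$1"
    curl --silent --output /dev/null --max-time 5 "http://localhost:${port}"
}

if installer_reachable 62078; then
    echo "Installer is reachable at http://localhost:62078"
else
    echo "No response on port 62078; check that the SSH tunnel and bloom are still running."
fi
```

A failure here usually means that either the SSH session has closed or the bloom process on the server has stopped.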
Open the interface in your browser:
```
http://localhost:62078
```
Follow the instructions in the web interface and configure any pending options as required.
After completing the configuration, click Generate Configuration & Start Installation to begin the deployment.

Note
The deployment process takes approximately 20 minutes.
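After the installer finishes, you can get a rough view of cluster health by counting pods that have not yet settled. The following is a minimal sketch that assumes kubectl is available on the server and configured for the new cluster, which this guide does not cover. The helper name count_not_running is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on stdin
# and counts pods whose STATUS column (4th field) is neither Running nor Completed.
count_not_running() {
    awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Assumes kubectl is installed and pointed at the new cluster.
if command -v kubectl >/dev/null 2>&1; then
    kubectl get nodes
    echo "Pods not Running or Completed: $(kubectl get pods -A --no-headers | count_not_running)"
else
    echo "kubectl not found; run this on the cluster node after installation."
fi
```

A count of 0 only shows that no pod is pending or failing. A pod in the Running state can still have containers that are not ready, so check the READY column as well.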

Key Features of the Platform

Optimized GPU utilization and lower operational costs: Intelligent scheduling maximizes GPU usage, reduces waste, and lowers overall compute costs.
Unified AI infrastructure: Brings all AI tools and environments together into a single, consistent platform for easier collaboration and governance.
Accelerated time-to-production: Built-in microservices and streamlined workflows help teams move AI models into production faster.
AI-native workload orchestration: Purpose-built scheduling and inference services ensure efficient, high-performance execution of AI workloads on AMD Instinct™ GPUs.

Conclusion

By following this guide, you learned how to deploy the AMD Enterprise AI Platform on Vultr using AMD Instinct™ GPUs and the Bloom installer. You also explored the suite’s key components including AI Workbench, Resource Manager, Kaiwo, Cluster Forge, and AIMs, and how they work together to create a unified, high-performance AI infrastructure.
