
Vultr's Open Cluster Manager Marketplace Application is a pre-configured infrastructure automation solution designed to simplify the deployment of GPU-accelerated clusters on Vultr. It leverages Terraform and Ansible to provision and configure compute resources, networking, monitoring tools, and workload management systems. The application includes built-in support for tools such as Grafana, Prometheus, Loki, and Slurm, enabling users to monitor GPU metrics, collect logs, and manage job scheduling seamlessly across a distributed cluster.
In this guide, you’ll learn how to deploy the Open Cluster Manager application from the Vultr Marketplace, customise the infrastructure configuration to match your GPU requirements, and build a GPU cluster. You’ll also set up performance monitoring and centralized logging, access a preconfigured Grafana dashboard, and scale the cluster up or down based on your workload needs.
Prerequisites
Before you begin, you need to:
- Have access to your Vultr API Key.
- Optional: Have an existing Loki deployment for centralized logging.
Note: Configure your Vultr API key's Access Control to allow all IPv4 addresses (0.0.0.0/0) to ensure user-space tools can access the Vultr API. If using a reserved IP, only that specific IP reservation needs to be added to the ACL. Additionally, ensure the cluster manager is deployed in a region with available inventory of the desired SKU for cluster nodes. This is required for the manager to create a VPC and enable private communication between itself and all subsequent nodes.
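As an optional sanity check, you can confirm that your API key (and its access control settings) is accepted by the Vultr API before deploying anything. The command below is a minimal sketch using the public v2 account endpoint; replace the exported placeholder with your own key.
```console
# export VULTR_API_KEY="your-api-key"
# curl -s -H "Authorization: Bearer ${VULTR_API_KEY}" "https://api.vultr.com/v2/account"
```
A JSON response with your account details indicates the key is valid and the request originated from an allowed address; an authorization error usually points to the key or its ACL.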
Deploy Vultr Open Cluster Manager Marketplace Application
In this section, you deploy Vultr's Open Cluster Manager Marketplace Application on a Vultr instance. The steps involve selecting a suitable server type, region, and plan (a GPU is not required), configuring the software, and providing the necessary application variables such as your Vultr API key. Optional settings for Loki integration are also available if you wish to enable centralized logging.
Log in to your Vultr Customer Portal.
Navigate to Compute under the Products section, then click Deploy.
Select your server type, region, and plan.
Note: A GPU plan is not required for the manager instance itself; choose a region that also offers the GPU plan you intend to use for the cluster nodes.
Click Configure Software.
Under the Marketplace Apps section, search for Open Cluster Manager and select the application.
Provide the following app variables:
- vultr_api_key (Required): Your Vultr API key. This is required for the manager to communicate with your account and provision required resources.
- loki_pass (Optional): Loki basic auth password.
- loki_url (Optional): The base URL of your Loki server (e.g., https://loki.example.com).
- loki_user (Optional): Loki basic auth username.
Note: To enable centralized logging, provide loki_url, loki_user, and loki_pass values that correspond to your existing Loki instance. Avoid selecting the VPC Networks additional feature during provisioning, as the Open Cluster Manager automatically provisions a VPC network as part of the cluster setup process.
Click Deploy to start the provisioning process.
Note: The cloud-init script runs at instance deployment and performs several setup tasks. It retrieves user-provided variables (e.g., Vultr API key, Loki credentials) via the Metadata API and stores them in /etc/environment. It installs the Vultr CLI, Ansible, Terraform, and Docker, generates an SSH keypair, and adds the public key to the Vultr account. The script then creates a VPC network, updates configuration values in /root/config.yml.sample, renames it to /root/config.yml, and attaches the VPC, triggering a reboot.
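Because the first boot performs all of this setup, it can take a few minutes before the manager is ready. The commands below are an optional sanity check, not part of the official workflow, to confirm that cloud-init has finished and the installed tooling is available on the default PATH:
```console
# cloud-init status
# terraform version
# ansible --version
# docker --version
# vultr-cli version
```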
Review and customise the Infrastructure Configuration
In this section, you’ll SSH into your deployed Vultr instance and review the default configuration file located at /root/config.yml. The file is split into Terraform and Ansible sections. The Terraform configuration defines the infrastructure, such as GPU type, region, and instance count, while the Ansible section handles post-deployment setup, including monitoring with Grafana Alloy, optional Loki logging, and Slurm for GPU job scheduling. You can customise these settings based on your specific requirements.
After the deployment process completes, SSH into your Vultr instance using the root user credentials provided in the Overview section of the Vultr Customer Portal.
The default configuration file used to provision the cluster is located at /root/config.yml. Review and adjust the configuration to match your GPU requirements.
```console
# cat /root/config.yml
```
The configuration is organized into two primary sections:
Terraform Configuration
```ini
instance_plan: vcg-l40s-16c-180g-48vram
instance_gpu: l40s
instance_gpu_count: 1
os_id: 1743 # Ubuntu 22.04 LTS x64
instance_count: 2
instance_region: ewr
vpc_ids: 71aa7038-63d1-474a-962b-b84773c0a786
fwg_id:
ssh_key_ids: [c8e97f26-5e9f-4f5c-9440-265da23a2ca5]
hostprefix: ewr-cluster-node
hostsuffix: gpu.local
```
This section defines the infrastructure that Terraform provisions, including GPU type, instance count, region, and networking. It controls how your GPU instances are deployed on Vultr.
- instance_plan: The GPU instance type to deploy. vcg-l40s-16c-180g-48vram indicates a high-performance L40S instance.
- instance_gpu: Specifies the GPU model, in this case l40s.
- instance_gpu_count: Number of GPUs to attach per instance.
- os_id: Operating system ID. 1743 corresponds to Ubuntu 22.04 LTS x64 in the Vultr OS catalog.
- instance_count: Number of cluster nodes to provision.
- instance_region: Region code. ewr refers to the New Jersey region.
- vpc_ids: ID of the VPC network where the cluster will be deployed.
- fwg_id: (Optional) ID of a pre-configured firewall group.
- ssh_key_ids: List of SSH key IDs to be injected into the nodes.
- hostprefix / hostsuffix: Naming convention for cluster nodes.
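If you need to change any of these values, valid identifiers can be looked up with the Vultr CLI that the manager installs (or directly against the Vultr API). A brief, optional example; the output is paginated, so you may need to page through it to find a specific plan, OS ID, or region code:
```console
# vultr-cli plans list
# vultr-cli os list
# vultr-cli regions list
```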
Ansible Configuration
```ini
# Loki
loki_cluster_url: "https://loki.example.com"
tenant_username: "<loki_username>"
tenant_password: "StrongPassword"
```
This section includes settings used by Ansible during the post-provisioning phase to configure each GPU node. It automates the setup of:
- Grafana Alloy for GPU metrics collection.
- Loki logging integration (if configured).
- Slurm workload manager for GPU job scheduling (including Slurm workers on nodes and controller on the manager).
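If you plan to enable the Loki integration, it is worth confirming from the manager that the Loki endpoint is reachable with the credentials you supplied before building the cluster. This is an optional check; the URL and credentials below are the placeholders from the sample configuration above, and /ready is Loki's standard readiness endpoint.
```console
# curl -s -u "<loki_username>:StrongPassword" "https://loki.example.com/ready"
```
A ready response confirms the server is reachable; adjust the values to match your own deployment.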
Provision the GPU Resources
After you review and update the /root/config.yml file to match your GPU and networking requirements, you can build the cluster using the provided automation script or manually execute each step.
To automatically build the cluster, run the following script:
```console
# /root/build-cluster.sh
```
The build-cluster.sh script provisions the infrastructure using Terraform, configures it using Ansible, and, upon completion, provides a Grafana dashboard URL to monitor GPU metrics and cluster performance.
Note: If any step fails, you can safely rerun the /root/build-cluster.sh script. The playbook is designed to be idempotent and will reattempt any failed steps.
If you prefer more control or wish to troubleshoot the provisioning process, you can manually execute each step to build and configure the GPU cluster using Terraform and Ansible.
Change into the Terraform working directory.
```console
# cd /root/terraform
```
Initialize the Terraform project.
```console
# terraform init
```
Review the execution plan.
```console
# terraform plan
```
Apply the Terraform configuration to provision the GPU nodes.
```console
# terraform apply
```
Wait for all cluster nodes to be fully deployed and online before continuing.
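You can confirm the nodes are online from the Customer Portal, or directly from the manager with the Vultr CLI or the Terraform state (an optional check):
```console
# vultr-cli instance list
# terraform show
```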
Change into the Ansible configuration directory.
```console
# cd /root/ansible
```
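Optionally, verify that the manager can reach every node over SSH before applying the full playbook. This assumes the inventory file is named hosts, as used in the next step; each node should report pong.
```console
# ansible -i hosts all -m ping
```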
Run the Ansible playbook to configure all nodes.
```console
# ansible-playbook -i hosts cluster.yml
```
This playbook performs the following actions:
- Updates all GPU nodes and the manager with the latest system packages.
- Installs Grafana Alloy on each node for real-time metrics collection.
- Configures Grafana Alloy to forward logs to your Loki instance (if provided).
- Installs and sets up the Slurm Daemon (slurmd) on all GPU nodes.
- Installs and configures the Slurm Controller (slurmctld) on the manager node.
- Deploys Grafana and Prometheus containers via Docker on the manager (refer to /root/docker-compose.yml).
- Installs Prometheus Node Exporter on all cluster nodes for hardware and OS-level monitoring.
- Adds a preconfigured Node Exporter dashboard to the local Grafana instance.
Once the playbook completes, your GPU cluster is fully configured to schedule Slurm jobs and collect performance metrics.
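You can optionally confirm that the monitoring stack and Slurm services came up as expected. The commands below assume the compose file path mentioned above and a standard Slurm installation; if the Docker Compose plugin is not available, the standalone docker-compose binary provides the equivalent subcommand.
```console
# docker compose -f /root/docker-compose.yml ps
# systemctl status slurmctld --no-pager
# sinfo
```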
Print each node's hostname for confirmation.
```console
# srun -N2 hostname
```
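With the cluster configured, you can also submit batch jobs through Slurm. The script below is a minimal sketch of a two-node GPU check; the --gres request assumes GPU GRES has been defined in your Slurm configuration, and nvidia-smi assumes the NVIDIA drivers are present on the GPU nodes, so adjust it to match your setup.
```bash
#!/bin/bash
#SBATCH --job-name=gpu-check        # name shown in squeue
#SBATCH --nodes=2                   # run across both cluster nodes
#SBATCH --gres=gpu:1                # request one GPU per node (assumes GPU GRES is configured)
#SBATCH --output=gpu-check-%j.log   # per-job log file

# List the GPUs visible to the job on each allocated node.
srun nvidia-smi -L
```
Save it as, for example, gpu-check.sbatch, submit it with sbatch gpu-check.sbatch, and track it with squeue.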
Access Grafana Dashboard
After the cluster is deployed, you can access the Grafana dashboard to monitor GPU metrics and cluster performance.
Open http://<SERVER-IP>:3000 in your browser, replacing <SERVER-IP> with your actual server IP.
On the login screen, enter the following credentials:
- Username: admin
- Password: Available in the Vultr Customer Portal under the instance's Application Instructions section.
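If the login page does not load, you can check from the manager that Grafana is up on port 3000 using its standard health endpoint (an optional check):
```console
# curl -s http://localhost:3000/api/health
```
A JSON response that includes the database status indicates the Grafana container is healthy.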
View GPU Monitoring Dashboards
After logging in to Grafana, follow the steps below to access the preconfigured dashboards for monitoring system and GPU metrics across your cluster.
In the Grafana sidebar, click the Dashboards icon.
Navigate to Browse to view available dashboards.
Select a dashboard, such as the preconfigured Node Exporter dashboard, to open it and begin monitoring metrics.
Scale the Cluster
To increase/decrease the number of GPU resources in your cluster, edit the configuration file and rerun the provisioning script.
Open the cluster configuration file to adjust the number of instances.
```console
# vim /root/config.yml
```
Locate the instance_count parameter and update its value to the desired number of nodes.
```ini
...
instance_count: {int_instance_count}
...
```
Save and close the file.
Rebuild the cluster using the automation script to apply the new changes.
```console
# /root/build-cluster.sh
```
This script adjusts the instance count and reconfigures the cluster according to the instance_count value provided.
Note: Scaling up provisions additional nodes to match the new instance_count, auto-configures them, and integrates them into Grafana. Scaling down deletes nodes exceeding the new count and updates dashboards accordingly. Always back up critical data before scaling down.
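As a sketch of a non-interactive scale-up, the commands below set instance_count to 4, rebuild the cluster, and then confirm that the new nodes registered with Slurm; the sed pattern assumes the key appears on its own line, as in the sample configuration above.
```console
# sed -i 's/instance_count:.*/instance_count: 4/' /root/config.yml
# /root/build-cluster.sh
# sinfo -N
```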
Conclusion
By following this guide, you’ve successfully deployed the Vultr Open Cluster Manager Marketplace Application, provisioned GPU resources, and configured your cluster for monitoring and workload management. The automation scripts and pre-built dashboards streamline the setup process, giving you a ready-to-use GPU cluster with Grafana, Prometheus, and Slurm integration. Whether you're running compute-intensive workloads or monitoring system performance, the cluster is flexible and easy to scale as your requirements evolve.