How to Deploy and Utilize Vultr Open Cluster Manager Marketplace Application

Updated on 03 July, 2025
Guide
Deploy and manage GPU resources on Vultr using Open Cluster Manager with integrated monitoring, logging, and scaling support.

Vultr's Open Cluster Manager Marketplace Application is a pre-configured infrastructure automation solution designed to simplify the deployment of GPU-accelerated clusters on Vultr. It leverages Terraform and Ansible to provision and configure compute resources, networking, monitoring tools, and workload management systems. The application includes built-in support for tools like Grafana, Prometheus, Loki, and Slurm, enabling users to monitor GPU metrics, collect logs, and manage job scheduling seamlessly across a distributed cluster.

In this guide, you’ll learn how to deploy the Open Cluster Manager application from the Vultr Marketplace, customize the infrastructure configuration to match your GPU requirements, and build a GPU cluster. You’ll also set up performance monitoring and centralized logging, access a preconfigured Grafana dashboard, and scale the cluster up or down based on your workload needs.

Prerequisites

Before you begin, you need to:

  • Have access to your Vultr API Key.
  • Optional: Have an existing Loki deployment for centralized logging.
Note
If deploying the cluster manager on a dynamically assigned IP, the API key’s ACL must allow requests from all IPs (0.0.0.0/0) to ensure user-space tools can access the Vultr API. If using a reserved IP, only that specific IP reservation needs to be added to the ACL. Additionally, ensure the cluster manager is deployed in a region with available inventory of the desired SKU for cluster nodes. This is required for the manager to create a VPC and enable private communication between itself and all subsequent nodes.
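
Before you deploy, you can optionally confirm that your target region has inventory for the GPU plan you intend to use. The commands below are a sketch: they assume you have the Vultr CLI installed locally with your API key exported, and that your CLI version includes the regions availability subcommand. The region shown is the example value used later in this guide.

console
# export VULTR_API_KEY=<your_api_key>
# vultr-cli regions availability ewr
# vultr-cli plans list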

Deploy Vultr Open Cluster Manager Marketplace Application

In this section, you deploy Vultr's Open Cluster Manager Marketplace Application on a Vultr instance. The steps involve selecting a suitable server type, region, and plan (GPU is not required), configuring the software, and providing necessary application variables such as your Vultr API key. Optional settings for Loki integration are also available if you wish to enable centralized logging.

  1. Log in to your Vultr Customer Portal.

  2. Navigate to Compute under the Products section, then click Deploy.

  3. Select your server type, region, and plan.

    Note
    You can choose a non-GPU plan for the Open Cluster Manager instance. The application does not require GPU resources.
  4. Click Configure Software.

  5. Under the Marketplace Apps section, search for Open Cluster Manager and select the application.

  6. Provide the app variables specified below.

    • vultr_api_key (Required): Your Vultr API key. This is required for the manager to communicate with your account and provision required resources.
    • loki_pass (Optional): Loki basic auth password.
    • loki_url (Optional): The base URL of your Loki server (e.g., https://loki.example.com).
    • loki_user (Optional): Loki basic auth username.
    Note
    To enable centralized logging for your resources, enter the appropriate loki_url, loki_user, and loki_pass values that correspond to your existing Loki instance.
  7. Do not select the VPC Networks additional feature during provisioning, as the Open Cluster Manager automatically provisions a VPC network as part of the cluster setup process.

  8. Click Deploy to start the provisioning process.

    Note
    The deployment process can take 10 to 15 minutes to complete. You can monitor progress by opening the server console.

The cloud-init script runs at instance deployment and performs several setup tasks. It retrieves user-provided variables (e.g., Vultr API key, Loki credentials) via the Metadata API and stores them in /etc/environment. It installs the Vultr CLI, Ansible, Terraform, and Docker, generates an SSH keypair, and adds the public key to the Vultr account. The script then creates a VPC network, updates configuration values in /root/config.yml.sample, renames it to /root/config.yml, and attaches the VPC, triggering a reboot.
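
To inspect what the first-boot automation did, you can review the cloud-init output log and the stored variables. These checks are a quick sanity test; the log path is the standard Ubuntu cloud-init location, and the exact entries in /etc/environment depend on which app variables you supplied.

console
# tail -n 50 /var/log/cloud-init-output.log
# cat /etc/environment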

Review and Customize the Infrastructure Configuration

In this section, you’ll SSH into your deployed Vultr instance and review the default configuration file located at /root/config.yml. The file is split into Terraform and Ansible sections. The Terraform configuration defines the infrastructure such as GPU type, region, and instance count, while the Ansible section handles post-deployment setup, including monitoring with Grafana Alloy, optional Loki logging, and Slurm for GPU job scheduling. You can customize these settings based on your specific requirements.

  1. After the deployment process completes, SSH into your Vultr instance using the root user credentials provided in the Overview section of the Vultr Customer Portal.

  2. The default configuration file used to provision the cluster is located at /root/config.yml. Review and adjust the configuration based on your GPU requirements.

    console
    # cat /root/config.yml
    

    The configuration is organized into two primary sections:

    • Terraform Configuration

      ini
      instance_plan: vcg-l40s-16c-180g-48vram
      instance_gpu: l40s
      instance_gpu_count: 1
      os_id: 1743  # Ubuntu 22.04 LTS x64
      instance_count: 2
      instance_region: ewr
      vpc_ids: 71aa7038-63d1-474a-962b-b84773c0a786
      fwg_id:
      ssh_key_ids: [c8e97f26-5e9f-4f5c-9440-265da23a2ca5]
      hostprefix: ewr-cluster-node
      hostsuffix: gpu.local
      

      This section defines the infrastructure that Terraform provisions, including GPU type, instance count, region, and networking. It controls how your GPU instances are deployed on Vultr; see the lookup commands after this list for finding valid plan, OS, and region identifiers.

      • instance_plan: The GPU instance type to deploy. vcg-l40s-16c-180g-48vram indicates a high-performance L40S instance.
      • instance_gpu: Specifies the GPU model. In this case, l40s.
      • instance_gpu_count: Number of GPUs to attach per instance.
      • os_id: Operating system ID. 1743 corresponds to Ubuntu 22.04 LTS x64 in the Vultr OS catalog.
      • instance_count: Number of cluster nodes to provision.
      • instance_region: Region code. ewr refers to the New Jersey region.
      • vpc_ids: ID of the VPC network where the cluster will be deployed.
      • fwg_id: (Optional) ID of a pre-configured firewall group.
      • ssh_key_ids: List of SSH key IDs to be injected into the nodes.
      • hostprefix / hostsuffix: Naming convention for cluster nodes.
    • Ansible Configuration

      ini
      # Loki
      loki_cluster_url: "https://loki.example.com"
      tenant_username: "<loki_username>"
      tenant_password: "StrongPassword"
      

      This section includes settings used by Ansible during the post-provisioning phase to configure each GPU node. It automates the setup of:

      • Grafana Alloy for GPU metrics collection.
      • Loki logging integration (if configured).
      • Slurm workload manager for GPU job scheduling (including Slurm workers on nodes and controller on the manager).
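
Before provisioning, you can confirm that the Terraform values above (instance_plan, os_id, and instance_region) reference valid identifiers. Assuming the Vultr CLI on the manager picks up the API key stored during deployment, the following list commands show the available catalogs:

console
# vultr-cli plans list
# vultr-cli os list
# vultr-cli regions list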

Provision the GPU Resources

After you review and update the /root/config.yml file to match your GPU and networking requirements, you can build the cluster using the provided automation script or manually execute each step.

To automatically build the cluster, run the following script:

console
# /root/build-cluster.sh

The build-cluster.sh script provisions the infrastructure using Terraform, configures the infrastructure using Ansible, and, upon completion, provides a Grafana dashboard URL to monitor GPU metrics and cluster performance.
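
Conceptually, the script chains Terraform and Ansible together. The outline below is illustrative only; the working directory, inventory, and playbook names are hypothetical and may differ from what the application ships.

console
# terraform -chdir=/root/terraform init
# terraform -chdir=/root/terraform apply -auto-approve
# ansible-playbook -i /root/inventory.ini /root/cluster.yml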

Note
If you encounter any errors during the Ansible playbook execution, rerun the /root/build-cluster.sh script. The playbook is designed to be idempotent and will reattempt any failed steps.
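
When the build completes, you can confirm that the nodes registered with the Slurm controller running on the manager. The commands below assume the Slurm client tools are available on the manager and that the playbooks configured GPU generic resources (GRES); adjust them to match your setup.

console
# sinfo
# srun --gres=gpu:1 nvidia-smi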

Access Grafana Dashboard

After the cluster is deployed, you can access the Grafana dashboard to monitor GPU metrics and cluster performance.

  1. Open the following URL in your browser, replacing <SERVER-IP> with your actual server IP.

    http://<SERVER-IP>:3000
  2. On the login screen, enter the following credentials:

    • Username: admin
    • Password: Available in the Vultr Customer Portal under the instance's Application Instructions section.

View GPU Monitoring Dashboards

After logging in to Grafana, follow the steps below to access the preconfigured dashboards for monitoring system and GPU metrics across your cluster.

  1. In the Grafana sidebar, click the Dashboards icon.

  2. Navigate to Browse to view available dashboards.

  3. Select a dashboard to open it and begin monitoring metrics. A typical dashboard looks like the one below:

    Grafana Dashboard Image

Scale the Cluster

To increase or decrease the number of GPU nodes in your cluster, edit the configuration file and rerun the provisioning script.

  1. Open the cluster configuration file to adjust the number of instances.

    console
    # vim /root/config.yml
    
  2. Locate the instance_count parameter and update its value to the desired number of nodes.

    ini
    ...
    instance_count: <desired_instance_count>
    ...
    

    Save and close the file.

  3. Rebuild the cluster using the automation script to apply the new changes.

    console
    # /root/build-cluster.sh
    

    This script adjusts the number of nodes and reconfigures the cluster according to the instance_count value you provided.

Note
Scaling up adds new nodes based on the updated instance_count, auto-configures them, and integrates them into Grafana. Scaling down deletes nodes exceeding the new count and updates dashboards accordingly. Always back up critical data before scaling down.
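
After a scale operation completes, you can verify the new node count from both the Vultr side and Slurm. These commands are a quick check, assuming you run them on the manager:

console
# vultr-cli instance list
# sinfo -N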

Conclusion

By following this guide, you’ve successfully deployed the Vultr Open Cluster Manager Marketplace Application, provisioned GPU resources, and configured your cluster for monitoring and workload management. The automation scripts and pre-built dashboards streamline the setup process, giving you a ready-to-use GPU cluster with Grafana, Prometheus, and Slurm integration. Whether you're running compute-intensive workloads or monitoring system performance, the cluster is flexible and easy to scale as your requirements evolve.
