How to Install K3s with ROCm GPU Operator on Ubuntu 24.04
Introduction
K3s is a lightweight, certified Kubernetes distribution designed for resource-constrained environments, such as edge computing and IoT devices. It simplifies the deployment and management of Kubernetes clusters while maintaining compatibility with standard Kubernetes tools and APIs. When combined with the ROCm GPU Operator, K3s can efficiently manage AMD GPUs within the cluster, automating tasks such as GPU resource management and monitoring. A K3s cluster with the ROCm GPU Operator allows developers to easily deploy and scale applications that take advantage of AMD GPU acceleration, without the complexity of manual configuration or setup. This combination streamlines the process of running GPU-accelerated workloads in Kubernetes environments.
In this article, you will install K3s and Helm, followed by the installation and deployment of the ROCm GPU Operator to enable the management of AMD GPUs within your Kubernetes cluster. Additionally, you will install cert-manager to handle the automation of TLS certificate management across the cluster, ensuring secure communication between services.
Install and Configure K3s
In this section, you install K3s and Helm, point your user's Kubernetes configuration at the K3s cluster, and enable the K3s system service so that it starts automatically when the system boots.
Install K3s.
console$ curl -sfL https://get.k3s.io | sh -
Create a new .kube directory in your user home directory.
console$ mkdir -p $HOME/.kube
Add a k3s.yaml symbolic link to the config file in the .kube directory to set it as the default Kubernetes configuration file.
console$ ln -s /etc/rancher/k3s/k3s.yaml $HOME/.kube/config
Change the .kube/config file permissions to 755 to enable Helm to load the configuration file.
console$ sudo chmod 755 $HOME/.kube/config
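As a quick sanity check on the step above, the following sketch confirms that the kubeconfig is readable by your user before Helm or kubectl tries to load it. The function name check_kubeconfig is hypothetical and only illustrates the idea; the default path is the one used in this article.

```shell
# Hypothetical helper (not part of K3s or Helm): report whether the
# kubeconfig is readable by the current user. The -r test follows
# symbolic links, so it checks the file that the link points to.
check_kubeconfig() {
    cfg="${1:-$HOME/.kube/config}"
    if [ -r "$cfg" ]; then
        echo "readable: $cfg"
    else
        echo "not readable: $cfg" >&2
        return 1
    fi
}

# Example usage on the K3s host:
# check_kubeconfig && kubectl get nodes
```

If the check passes, kubectl get nodes should list the K3s node without requiring sudo.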
Install Helm to manage and install Kubernetes applications.
console$ curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
View the K3s service status and verify that it's running.
console$ sudo systemctl status k3s
Output:
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2024-11-30 17:19:18 UTC; 48s ago
       Docs: https://k3s.io
    Process: 134898 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service>
    Process: 134900 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 134902 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 134903 (k3s-server)
      Tasks: 366
     Memory: 1.0G
        CPU: 2min 39.401s
Enable the K3s system service to start automatically at boot.
console$ sudo systemctl enable k3s
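The two service checks above can be combined into one scripted test, which is handy if you automate the installation. This is a minimal sketch: the function name k3s_service_ok is hypothetical, while is-active, is-enabled, and --quiet are standard systemctl options.

```shell
# Hypothetical helper: succeed only if the k3s unit is running now
# and is also enabled to start at boot.
k3s_service_ok() {
    if systemctl is-active --quiet k3s && systemctl is-enabled --quiet k3s; then
        echo "k3s is active and enabled"
    else
        echo "k3s is not active or not enabled" >&2
        return 1
    fi
}

# Example usage:
# k3s_service_ok
```

Neither query needs sudo, because both only read the unit state.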
Install ROCm GPU Operator
In this section, you install cert-manager, a required dependency that manages TLS certificates in Kubernetes, and deploy the ROCm GPU Operator, which handles GPU management in the cluster. You then verify the deployment by listing the Kubernetes resources associated with the GPU Operator.
Add the Helm repository for installing cert-manager, if not already added.
console$ helm repo add jetstack https://charts.jetstack.io --force-update
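With the repository added, you can list the chart versions it offers before pinning one. The sketch below assumes the jetstack repository from the previous command; the function name list_chart_versions is hypothetical, and the output may also include other charts whose names match the search term.

```shell
# Hypothetical helper: print the header line plus the first few chart
# versions that Helm lists for a chart in an already-added repository.
list_chart_versions() {
    chart="${1:-jetstack/cert-manager}"
    helm search repo "$chart" --versions | head -n 6
}

# Example usage:
# list_chart_versions
# list_chart_versions jetstack/cert-manager
```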
Install cert-manager, if not already installed.
console$ helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.15.1 --set crds.enabled=true
Note that the above command installs version v1.15.1 of cert-manager. Check the official cert-manager documentation to find the latest version.
Install the ROCm GPU Operator.
console$ helm install amd-gpu-operator --namespace kube-amd-gpu --create-namespace https://github.com/ROCm/gpu-operator/releases/download/v1.0.0/gpu-operator-charts-v1.0.0.tgz
The above command deploys the AMD GPU Operator into the kube-amd-gpu namespace, enabling the management and utilization of AMD GPUs in the Kubernetes cluster.
List all the Kubernetes resources in the kube-amd-gpu namespace.
console$ kubectl get all --namespace kube-amd-gpu
Output:
NAME                                                                  READY   STATUS    RESTARTS   AGE
pod/amd-gpu-operator-gpu-operator-charts-controller-manager-6bmkfwr   1/1     Running   0          6m
pod/amd-gpu-operator-kmm-controller-5f8d79b46-j4sbw                   1/1     Running   0          6m
pod/amd-gpu-operator-kmm-webhook-server-5d5fc8bdd6-p7qfs              1/1     Running   0          6m
pod/amd-gpu-operator-node-feature-discovery-gc-78989c896-lffs2        1/1     Running   0          6m
pod/amd-gpu-operator-node-feature-discovery-master-b8bffc48b-dtw7x    1/1     Running   0          6m
pod/amd-gpu-operator-node-feature-discovery-worker-m6jfz              1/1     Running   0          6m
.........
The pods may take 5–10 minutes to transition to a "Running" status.
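Rather than re-running the listing by hand while the pods start, you can let kubectl block until the workloads are ready. The following is a minimal sketch, not part of the operator's documented procedure: the function name wait_for_deployments and the default timeout are assumptions, and it only waits on Deployments, so DaemonSet pods such as the node-feature-discovery worker are not covered.

```shell
# Hypothetical helper: block until every Deployment in a namespace
# reports the Available condition, or until the timeout expires.
wait_for_deployments() {
    ns="$1"
    timeout="${2:-600s}"
    kubectl wait --for=condition=Available deployments --all \
        --namespace "$ns" --timeout="$timeout"
}

# Example usage with the namespaces created in this article:
# wait_for_deployments cert-manager 300s
# wait_for_deployments kube-amd-gpu 600s
```

kubectl wait exits with a non-zero status if the timeout is reached first, so the function can gate later steps in a script.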
Conclusion
In this article, you installed K3s and Helm to simplify the management of Kubernetes applications. You then installed cert-manager, which automates the management of TLS certificates within Kubernetes, and deployed the ROCm GPU Operator, which manages AMD GPUs in the Kubernetes cluster. Together, these steps prepare the cluster to run GPU-accelerated workloads, such as machine learning applications, on AMD GPUs.