How to Install K3s with ROCm GPU Operator on Ubuntu 24.04

Updated on December 2, 2024

Introduction

K3s is a lightweight, certified Kubernetes distribution designed for resource-constrained environments, such as edge computing and IoT devices. It simplifies the deployment and management of Kubernetes clusters while maintaining compatibility with standard Kubernetes tools and APIs. When combined with the ROCm GPU Operator, K3s can efficiently manage AMD GPUs within the cluster, automating tasks such as GPU resource management and monitoring. A K3s cluster with the ROCm GPU Operator allows developers to easily deploy and scale applications that take advantage of AMD GPU acceleration, without the complexity of manual configuration or setup. This combination streamlines the process of running GPU-accelerated workloads in Kubernetes environments.

In this article, you will install K3s and Helm, followed by the installation and deployment of the ROCm GPU Operator to enable the management of AMD GPUs within your Kubernetes cluster. Additionally, you will install cert-manager to handle the automation of TLS certificate management across the cluster, ensuring secure communication between services.

Install and Configure K3s

In this section, you will install K3s and Helm, configure K3s, and enable the K3s system service so it starts automatically whenever the system boots.

  1. Install K3s.

    console
    $ curl -sfL https://get.k3s.io | sh -
    
  2. Create a new .kube directory in your user home directory.

    console
    $ mkdir -p $HOME/.kube
    
  3. Create a symbolic link from the k3s.yaml file to config in the .kube directory to set it as the default Kubernetes configuration file.

    console
    $ ln -s /etc/rancher/k3s/k3s.yaml $HOME/.kube/config
    
  4. Change the .kube/config file permissions to 755 to enable Helm to load the configuration file.

    console
    $ sudo chmod 755 $HOME/.kube/config
    
  5. Install Helm to manage and install Kubernetes applications.

    console
    $ curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
    
  6. View the K3s service status and verify that it's running.

    console
    $ sudo systemctl status k3s
    

    Output:

    ● k3s.service - Lightweight Kubernetes
        Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
        Active: active (running) since Sat 2024-11-30 17:19:18 UTC; 48s ago
        Docs: https://k3s.io
        Process: 134898 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service>
        Process: 134900 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
        Process: 134902 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
      Main PID: 134903 (k3s-server)
        Tasks: 366
        Memory: 1.0G
            CPU: 2min 39.401s
  7. Enable the K3s system service.

    console
    $ sudo systemctl enable k3s
    
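Optionally, verify that the node has registered with the cluster and reached a Ready state before proceeding:

    console
    $ kubectl get nodes
    

The node should report a "Ready" status shortly after the K3s service starts.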

Install ROCm GPU Operator

In this section, you will install cert-manager, a critical dependency for managing TLS certificates in Kubernetes, and deploy the ROCm GPU Operator, which facilitates GPU management in the cluster. You will also verify the deployment by listing all Kubernetes resources associated with the GPU Operator.

  1. Add the Jetstack Helm repository, which hosts the cert-manager chart, if not already added.

    console
    $ helm repo add jetstack https://charts.jetstack.io --force-update
    
  2. Install cert-manager, if not already installed.

    console
    $ helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.15.1 --set crds.enabled=true
    

    Please note that the above command installs cert-manager version v1.15.1; check the official cert-manager documentation for the latest available version.
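    Before installing the GPU Operator, you can verify that cert-manager is running:

    console
    $ kubectl get pods --namespace cert-manager
    

    The cert-manager, cainjector, and webhook pods should all report a "Running" status.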

  3. Install the ROCm GPU Operator.

    console
    $ helm install amd-gpu-operator --namespace kube-amd-gpu --create-namespace https://github.com/ROCm/gpu-operator/releases/download/v1.0.0/gpu-operator-charts-v1.0.0.tgz
    

    The above command deploys the AMD GPU Operator into the kube-amd-gpu namespace, enabling the management and utilization of AMD GPUs in the Kubernetes cluster.

  4. List all the Kubernetes resources.

    console
    $ kubectl get all --namespace kube-amd-gpu
    

    Output:

    NAME                                                                  READY   STATUS    RESTARTS   AGE
    pod/amd-gpu-operator-gpu-operator-charts-controller-manager-6bmkfwr   1/1     Running   0          6m
    pod/amd-gpu-operator-kmm-controller-5f8d79b46-j4sbw                   1/1     Running   0          6m
    pod/amd-gpu-operator-kmm-webhook-server-5d5fc8bdd6-p7qfs              1/1     Running   0          6m
    pod/amd-gpu-operator-node-feature-discovery-gc-78989c896-lffs2        1/1     Running   0          6m
    pod/amd-gpu-operator-node-feature-discovery-master-b8bffc48b-dtw7x    1/1     Running   0          6m
    pod/amd-gpu-operator-node-feature-discovery-worker-m6jfz              1/1     Running   0          6m
    .........

    The pods may take 5–10 minutes to transition to a "Running" status.
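To confirm that GPU scheduling works end to end, you can optionally create a test pod that requests an AMD GPU. The manifest below is a minimal sketch: the rocm/rocm-terminal image and the amd.com/gpu resource name are assumptions based on common ROCm conventions, so adjust them to match what your cluster actually advertises.

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: rocm-smi-test
    spec:
      restartPolicy: Never
      containers:
        - name: rocm-smi-test
          # Assumed ROCm image; replace with one available in your environment.
          image: rocm/rocm-terminal
          command: ["rocm-smi"]
          resources:
            limits:
              # Assumed resource name exposed by the AMD GPU device plugin.
              amd.com/gpu: 1
    

Apply the manifest with kubectl apply -f rocm-smi-test.yaml, then inspect the result with kubectl logs rocm-smi-test; if the GPU is visible to the container, rocm-smi should list it.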

Conclusion

In this article, you installed K3s and Helm to simplify the management of Kubernetes applications. You then installed cert-manager, a tool that automates the management of TLS certificates within Kubernetes to ensure secure communication across the cluster, and deployed the ROCm GPU Operator, which enables efficient management of AMD GPUs in the Kubernetes cluster. Together, these steps enable seamless GPU management and optimization for machine learning workloads running on AMD GPUs within the Kubernetes environment.

More Resources