Add-on: gpu

This addon enables NVIDIA GPU support on MicroK8s using the NVIDIA GPU Operator.

You can enable this addon with the following command:

microk8s enable gpu

NOTE: The GPU addon is supported on MicroK8s versions 1.22 or newer. For MicroK8s 1.21, see the MicroK8s 1.21 section at the end of this page.

Verify installation

Verify that all components are deployed and configured correctly with:

microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator

which should return:

all validations are successful
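
For scripted checks, the same verification can be wrapped in a small helper. This is a minimal sketch, not part of MicroK8s: the function name validator_ok is hypothetical, and it relies only on the success message shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads validator logs on stdin and succeeds only if
# the success message documented above is present.
validator_ok() {
  grep -q 'all validations are successful'
}

# Run the real check only where the microk8s command is available.
if command -v microk8s >/dev/null 2>&1; then
  if microk8s kubectl logs -n gpu-operator-resources \
      -lapp=nvidia-operator-validator -c nvidia-operator-validator | validator_ok; then
    echo "GPU addon validated"
  else
    echo "GPU addon not validated yet" >&2
  fi
fi
```

The validator can take a while to finish after the addon is enabled, so a negative result right after enabling does not necessarily indicate a failure.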

Deploy a test workload

Once the GPU addon is enabled, workloads can request a GPU by setting a resource limit, e.g. nvidia.com/gpu: 1. For example, you can run a cuda-vector-add test pod with:

microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
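
The pod needs some time to pull its image and run. The following sketch polls the pod phase until it finishes; the function name wait_for_pod_done and the PHASE_CMD override are assumptions made for illustration, not MicroK8s features.

```shell
#!/bin/sh
# Hypothetical helper: poll a pod until it reaches a terminal phase.
# PHASE_CMD is an assumption of this sketch; it lets the command that
# prints the phase be swapped out (for example in tests).
wait_for_pod_done() {
  pod=$1
  tries=${2:-30}   # number of polls
  delay=${3:-2}    # seconds between polls
  while [ "$tries" -gt 0 ]; do
    phase=$(${PHASE_CMD:-microk8s kubectl get pod} "$pod" \
      -o 'jsonpath={.status.phase}' 2>/dev/null) || phase=""
    case $phase in
      Succeeded) return 0 ;;
      Failed)    return 1 ;;
    esac
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 2   # timed out without reaching a terminal phase
}

# Usage: wait_for_pod_done cuda-vector-add 60 5
```

Because the test pod uses restartPolicy: OnFailure, a failing container is restarted rather than moving the pod to Failed, so the timeout (return code 2) is the more likely failure signal.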

Then check the pod’s logs to verify that the test passed:

microk8s kubectl logs cuda-vector-add

where a successful run would produce logs similar to:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

You are ready to run GPU workloads on your MicroK8s cluster!

Addon configuration options

NOTE: These require MicroK8s revision $REVISION$ or newer. Check the installed revision with snap list microk8s.
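
To read the revision in a script, the Rev column of the snap list output can be extracted. This is a sketch under the assumption that the output has a header row containing a Rev column; the helper names snap_revision and rev_at_least are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "Rev" column from
# `snap list microk8s` output read on stdin.
snap_revision() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Rev") col = i }
       NR == 2 && col { print $col }'
}

# Hypothetical helper: succeed if revision $1 is at least minimum $2.
rev_at_least() {
  [ "$1" -ge "$2" ]
}

# Print the installed revision only where snap is available.
if command -v snap >/dev/null 2>&1; then
  snap list microk8s 2>/dev/null | snap_revision
fi
```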

In the microk8s enable gpu command, the following command-line arguments may be set:

| Argument | Default | Description |
| --- | --- | --- |
| --driver $driver | auto | Supported values are auto (use host driver if found), host (force use the host driver), or operator (force use the operator driver). |
| --version $VERSION | v1.10.1 | Version of the GPU operator to install. |
| --toolkit-version $VERSION | (empty) | If not empty, override the version of the nvidia-container-runtime that will be installed. |
| --set-as-default-runtime / --no-set-as-default-runtime | true | Set the default containerd runtime to nvidia. |
| --set $key=$value | (empty) | Set additional configuration options to the GPU operator Helm chart. May be passed multiple times. For a list of options see values.yaml. |
| --values $file | (empty) | Set additional configuration options to the GPU operator Helm chart using a file. May be passed multiple times. For a list of options see values.yaml. |
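
As an illustration of how these arguments combine, the sketch below writes a small values file and prints the matching enable command. The file location and the choice of options are assumptions; toolkit.enabled is the only chart option taken from this page.

```shell
#!/bin/sh
# Write a Helm values file for the GPU operator chart.
# The path is a hypothetical location chosen for this example.
values_file="${TMPDIR:-/tmp}/gpu-values.yaml"
cat > "$values_file" <<'EOF'
toolkit:
  enabled: false
EOF

# Print the resulting command instead of running it, so it can be
# reviewed before being applied to a cluster.
echo "microk8s enable gpu --driver host --values $values_file"
```

The same option could instead be passed inline with --set toolkit.enabled=false; a values file tends to be easier to keep under version control when several options are set.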

Use host drivers and runtime

Use host NVIDIA drivers

The GPU addon uses the existing NVIDIA host drivers if they are available. Otherwise, it deploys the nvidia-driver-daemonset to dynamically build and load the NVIDIA drivers into the kernel.

In order to use host drivers, install the NVIDIA drivers before enabling the addon:

sudo apt-get update
sudo apt-get install nvidia-headless-510-server nvidia-utils-510-server

Verify that drivers are loaded by checking nvidia-smi:

nvidia-smi

Then enable the addon:

microk8s enable gpu
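
If you prefer to pin the driver mode explicitly rather than rely on the default, the check can be scripted. This is only a sketch of the idea, not the addon's own detection logic; the function name pick_driver_mode is hypothetical, and the --driver values come from the configuration options on this page.

```shell
#!/bin/sh
# Hypothetical helper: choose a --driver value based on whether a probe
# command succeeds. The probe defaults to nvidia-smi.
pick_driver_mode() {
  probe=${1:-nvidia-smi}
  if "$probe" >/dev/null 2>&1; then
    echo host       # host drivers respond, so use them
  else
    echo operator   # no working host drivers, let the operator build them
  fi
}

# Print the command that would be used on this machine.
echo "microk8s enable gpu --driver $(pick_driver_mode)"
```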

Use host nvidia-container-runtime

The GPU addon will automatically install nvidia-container-runtime, which is the runtime required to execute GPU workloads on the MicroK8s cluster. This is done by the nvidia-container-toolkit-daemonset pod.

If needed, this section documents how to install the nvidia-container-runtime manually. The steps below should be performed before enabling the GPU addon.

Install nvidia-container-runtime following the upstream instructions. At the time of writing, the instructions for Ubuntu hosts look like this:

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo apt-get install nvidia-container-runtime

This will install nvidia-container-runtime in /usr/bin/nvidia-container-runtime. Next, edit the containerd configuration file so that it knows where to find the runtime binaries for the nvidia runtime:

echo '
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
' | sudo tee -a /var/snap/microk8s/current/args/containerd-template.toml
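
Note that tee -a appends unconditionally, so running the command twice would define the same TOML table twice, which TOML does not allow. A guarded variant could look like the sketch below; the function name add_nvidia_runtime is hypothetical, and the configuration text is the same as above.

```shell
#!/bin/sh
# Hypothetical helper: append the nvidia runtime section to a containerd
# template only if it is not already present in the file.
add_nvidia_runtime() {
  file=$1
  header='[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]'
  if grep -qF -- "$header" "$file" 2>/dev/null; then
    echo "nvidia runtime already configured in $file"
    return 0
  fi
  cat >> "$file" <<'EOF'

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
EOF
}

# Usage (as root):
#   add_nvidia_runtime /var/snap/microk8s/current/args/containerd-template.toml
```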

Restart MicroK8s to reload the containerd configuration:

sudo snap restart microk8s

Finally, enable the gpu addon and make sure that the toolkit daemonset is not deployed:

microk8s enable gpu --set toolkit.enabled=false

Features

GPU addon features

  • Use the existing NVIDIA host drivers, or build the drivers and load them into the kernel dynamically at runtime.
  • Automatically install and configure the nvidia-container-runtime for containerd.
  • Configure the nvidia.com/gpu kubelet device plugin, to support resource capacity and limits on GPU nodes.
  • Multi-instance GPU (MIG) can be configured using ConfigMap resources.

GPU addon components

The GPU addon will install and configure the following components on the MicroK8s cluster:

  • gpu-feature-discovery: Runs feature discovery on all cluster nodes, to detect GPU devices and host capabilities.
  • nvidia-driver-daemonset: Runs on all GPU nodes of the cluster, builds and loads the NVIDIA drivers into the running kernel.
  • nvidia-container-toolkit-daemonset: Runs on all GPU nodes of the cluster. Once the NVIDIA drivers are loaded, installs the nvidia-container-runtime binaries and configures the nvidia runtime on containerd accordingly. By default, it sets the default runtime to nvidia, so all pod workloads can use the GPU.
  • nvidia-device-plugin-daemonset: Runs on all GPU nodes of the cluster, and configures the nvidia.com/gpu kubelet device plugin. This is used to configure resource capacity and limits for the GPU nodes.
  • nvidia-operator-validator: Validates that the NVIDIA drivers, container runtime and the kubelet device plugin have been configured correctly. Finally, it executes an example CUDA workload.

A complete installation of the GPU operator looks like this (output of microk8s kubectl get pod -n gpu-operator-resources):

NAME                                       READY   STATUS      RESTARTS   AGE    IP            NODE        NOMINATED NODE   READINESS GATES
nvidia-container-toolkit-daemonset-mjbk8   1/1     Running     0          110m   10.1.51.198   machine-0   <none>           <none>
nvidia-cuda-validator-xj2kx                0/1     Completed   0          109m   10.1.51.204   machine-0   <none>           <none>
nvidia-dcgm-nvqnz                          1/1     Running     0          110m   10.1.51.199   machine-0   <none>           <none>
gpu-feature-discovery-dn6lt                1/1     Running     0          110m   10.1.51.202   machine-0   <none>           <none>
nvidia-device-plugin-daemonset-zg76f       1/1     Running     0          110m   10.1.51.201   machine-0   <none>           <none>
nvidia-device-plugin-validator-k6hdv       0/1     Completed   0          107m   10.1.51.205   machine-0   <none>           <none>
nvidia-dcgm-exporter-9vnc5                 1/1     Running     0          110m   10.1.51.203   machine-0   <none>           <none>
nvidia-operator-validator-ntvdj            1/1     Running     0          110m   10.1.51.200   machine-0   <none>           <none>
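
A listing like the one above can be checked automatically. The sketch below treats any pod whose STATUS is neither Running nor Completed as unhealthy; the function name unhealthy_pods is hypothetical, and it assumes STATUS is the third column, as in the output shown.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pod` output on stdin, print the
# names of pods that are neither Running nor Completed, and return
# non-zero if any were found.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1; bad = 1 }
       END { exit bad }'
}

# Run against the cluster only where the microk8s command is available.
if command -v microk8s >/dev/null 2>&1; then
  if microk8s kubectl get pod -n gpu-operator-resources | unhealthy_pods; then
    echo "all GPU operator pods are Running or Completed"
  fi
fi
```

A Running status does not guarantee that all containers are ready, so the READY column is still worth a look when something misbehaves.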

MicroK8s 1.21

MicroK8s version 1.21 has been out of support since May 2022. The GPU addon included with MicroK8s 1.21 was an early alpha and is no longer functional.

Due to a problem with the way containerd is configured in MicroK8s versions 1.21 and older, the nvidia-container-toolkit-daemonset installed by the GPU operator is incompatible and leaves MicroK8s in a broken state.

It is recommended to update to a supported version of MicroK8s. However, it is possible to install the GPU operator by following the steps described in this GitHub gist.
