This addon enables NVIDIA GPU support on MicroK8s using the NVIDIA GPU Operator and offers:
- Use of any existing NVIDIA host drivers, or compilation and loading of kernel drivers dynamically at runtime.
- Installation and configuration of the `nvidia-container-runtime` for containerd.
- Configuration of the `nvidia.com/gpu` kubelet device plugin, to support resource capacity and limits on GPU nodes.
- Multi-instance GPU (MIG) configuration via ConfigMap resources.
You can enable this addon with the following command:
```
microk8s enable gpu
```
NOTE: The GPU addon is supported on MicroK8s versions 1.22 or newer. For MicroK8s 1.21, see GPU addon on MicroK8s 1.21.
NOTE: For MicroK8s 1.25 or older, if you see an error similar to:

```
Error: INSTALLATION FAILED: failed to download "nvidia/gpu-operator" at version "v22.9.0"
```

you can instead try:

```
microk8s helm repo update nvidia
microk8s enable gpu
```
## Verify installation
Verify that all components are deployed and configured correctly with:
```
microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator
```
which should return:
```
all validations are successful
```
## Deploy a test workload
Once the GPU addon is enabled, workloads can request the GPU using a limit setting, e.g. `nvidia.com/gpu: 1`. For example, you can run a `cuda-vector-add` test pod with:
```
microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```
And then check the pod’s logs to verify that everything is okay:
```
microk8s kubectl logs cuda-vector-add
```
where a successful run would produce logs similar to:
```
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
You are ready to run GPU workloads on your MicroK8s cluster!
## Addon configuration options
NOTE: These require MicroK8s revision `$REVISION$` or newer. Check the installed revision with `snap list microk8s`.
In the `microk8s enable gpu` command, the following command-line arguments may be set:
| Argument | Default | Description |
|---|---|---|
| `--driver $driver` | `auto` | Supported values are `auto` (use the host driver if found), `host` (force use of the host driver), or `operator` (force use of the operator driver). |
| `--version $VERSION` | `v1.10.1` | Version of the GPU operator to install. |
| `--toolkit-version $VERSION` | (empty) | If not empty, override the version of the `nvidia-container-runtime` that will be installed. |
| `--set-as-default-runtime` / `--no-set-as-default-runtime` | `true` | Set the default containerd runtime to `nvidia`. |
| `--set $key=$value` | (empty) | Set additional configuration options for the GPU operator Helm chart. May be passed multiple times. For a list of options, see values.yaml. |
| `--values $file` | (empty) | Set additional configuration options for the GPU operator Helm chart using a file. May be passed multiple times. For a list of options, see values.yaml. |
## Use host drivers and runtime
### Use host NVIDIA drivers
The GPU addon works with the existing NVIDIA host drivers (if available); otherwise, it will deploy the `nvidia-driver-daemonset` to dynamically build and load the NVIDIA drivers into the kernel.
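To see which path was taken after enabling the addon, you can check whether the driver daemonset was deployed (a quick sanity check; the `app=nvidia-driver-daemonset` pod label is an assumption and may vary across operator versions):

```
# If the operator built its own drivers, driver daemonset pods are listed here.
# If the host drivers were used, no such pods are deployed.
microk8s kubectl get pod -n gpu-operator-resources -lapp=nvidia-driver-daemonset
```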
In order to use host drivers, install the NVIDIA drivers before enabling the addon:
```
sudo apt-get update
sudo apt-get install nvidia-headless-510-server nvidia-utils-510-server
```
Verify that the drivers are loaded by checking `nvidia-smi`:

```
nvidia-smi
```
Then enable the addon:
```
microk8s enable gpu
```
### Use host nvidia-container-runtime
The GPU addon will automatically install `nvidia-container-runtime`, which is the runtime required to execute GPU workloads on the MicroK8s cluster. This is done by the `nvidia-container-toolkit-daemonset` pod.
If needed, this section documents how to install the `nvidia-container-runtime` manually. The steps below should be performed before enabling the GPU addon.
Install nvidia-container-runtime following the upstream instructions. At the time of writing, the instructions for Ubuntu hosts look like this:
```
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo apt-get install nvidia-container-runtime
```
This will install `nvidia-container-runtime` in `/usr/bin/nvidia-container-runtime`. Next, edit the containerd configuration file so that it knows where to find the runtime binaries for the `nvidia` runtime:
```
echo '
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
' | sudo tee -a /var/snap/microk8s/current/args/containerd-template.toml
```
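You can optionally confirm that the template now contains the `nvidia` runtime section before restarting:

```
# Show the appended runtime entries and the line following each match.
grep -A 1 'runtimes.nvidia' /var/snap/microk8s/current/args/containerd-template.toml
```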
Restart MicroK8s to reload the containerd configuration:
```
sudo snap restart microk8s
```
Finally, enable the gpu addon and make sure that the toolkit daemonset is not deployed:
```
microk8s enable gpu --set toolkit.enabled=false
```
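With `toolkit.enabled=false`, the operator should not create the toolkit daemonset; a NotFound error from the following command confirms this:

```
# Expected to fail with "NotFound" when the toolkit daemonset is disabled.
microk8s kubectl get daemonset -n gpu-operator-resources nvidia-container-toolkit-daemonset
```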
## Configure NVIDIA Multi-Instance GPU
NVIDIA Multi-Instance GPU (MIG) expands the performance and value of NVIDIA H100, A100 and A30 Tensor Core GPUs. MIG can partition the GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. This allows for serving workloads under guaranteed quality of service (QoS) while extending the reach of accelerated computing resources to every user.
After enabling the GPU addon in MicroK8s on a host with an NVIDIA GPU that supports MIG, the GPU operator will automatically deploy the `nvidia-mig-manager` daemonset on the cluster. Configuring the GPU card on the node to enable MIG is done by setting an appropriate label on the Kubernetes node.
### Enable MIG
1. First, ensure that your GPU card has support for MIG. If that is the case, then `nvidia-mig-manager` should be running in the cluster, and the node should have a `nvidia.com/mig.capable=true` label. Verify that the pod is running with:

   ```
   microk8s kubectl get pod -A -lapp=nvidia-mig-manager
   ```

   ... which would return output similar to:

   ```
   NAMESPACE                NAME                       READY   STATUS    RESTARTS   AGE
   gpu-operator-resources   nvidia-mig-manager-52mhg   1/1     Running   0          5h57m
   ```

   Also, ensure that the node has the `nvidia.com/mig.capable=true` label:

   ```
   microk8s kubectl describe node $node | grep nvidia.com
   ```

   ... the output should then show the relevant labels and their values:

   ```
   .....
   nvidia.com/gpu.present=true
   nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB
   nvidia.com/gpu.count=1
   .....
   nvidia.com/mig.capable=true
   nvidia.com/mig.strategy=single
   .....
   ```
2. Set the `nvidia.com/mig.config` label on the node with the MIG configuration you want to apply. In our example, we have an NVIDIA A100 40GB card, and we will use the `all-1g.5gb` profile, which partitions an NVIDIA A100 card into 7 `1g.5gb` GPU instances:

   ```
   microk8s kubectl label node $node nvidia.com/mig.config=all-1g.5gb
   ```

   This will automatically apply any configuration required on the GPU card to enable MIG, and will restart any running GPU workloads for the changes to take effect.

   NOTE: The label value should match one of the profiles found in the `gpu-operator-resources/default-mig-parted-config` ConfigMap. You can see the default list of profiles (along with the supported cards for each) using the following command; a sketch of the profile format is shown after this list:

   ```
   sudo microk8s kubectl get configmap -n gpu-operator-resources default-mig-parted-config -o template --template '{{ index .data "config.yaml" }}' | less
   ```

   NOTE: Consult the NVIDIA MIG partitioning documentation for the naming scheme and a more detailed explanation on the subject.
3. `mig-manager` will report the result by setting the `nvidia.com/mig.config.state` label on the node. Check it with `microk8s kubectl describe node $node | grep nvidia.com`. If the configuration has been successful, the labels should look like this:

   ```
   .....
   nvidia.com/gpu.present=true
   nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
   nvidia.com/gpu.count=7
   .....
   nvidia.com/mig.capable=true
   nvidia.com/mig.config=all-1g.5gb
   nvidia.com/mig.config.state=success
   nvidia.com/mig.strategy=single
   .....
   ```

   In case of a failure, consult the logs from the `nvidia-mig-manager` pod for details:

   ```
   microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-mig-manager
   ```
4. Finally, use `nvidia-smi` to verify that 7 GPU instances are now available for use. Run the following command on the `nvidia-driver-daemonset`:

   ```
   microk8s kubectl exec -it -n gpu-operator-resources daemonset/nvidia-driver-daemonset -- nvidia-smi -L
   ```

   ... which will produce output similar to:

   ```
   Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
   GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-c5185459-3273-c6e2-90f0-39d86b34e76d)
     MIG 1g.5gb      Device  0: (UUID: MIG-1c817b48-ee26-5471-851a-eed01baea921)
     MIG 1g.5gb      Device  1: (UUID: MIG-48f14d22-c8c9-5144-8e3a-69204b70c936)
     MIG 1g.5gb      Device  2: (UUID: MIG-d658d644-b5f2-57a5-939d-c80c15ab4d9d)
     MIG 1g.5gb      Device  3: (UUID: MIG-775c8fd6-dc7d-5ed3-8e57-003a741fcef6)
     MIG 1g.5gb      Device  4: (UUID: MIG-0845bff0-9cdc-5f24-aaf3-55f679682651)
     MIG 1g.5gb      Device  5: (UUID: MIG-be49e572-62ef-599c-a113-350a7c06bced)
     MIG 1g.5gb      Device  6: (UUID: MIG-6370eedd-389c-58dc-8b57-b435a656a45a)
   ```
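As referenced in step 2, the profiles in the `default-mig-parted-config` ConfigMap follow the upstream `mig-parted` configuration format. The following is a rough sketch of what an `all-1g.5gb`-style profile looks like in that format (an illustration, not a verbatim copy of the shipped ConfigMap):

```
version: v1
mig-configs:
  # Partition every GPU on the node into seven 1g.5gb instances.
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
```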
For more configuration options or extra MIG configuration strategies, consult the official NVIDIA documentation.
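For instance, the GPU operator Helm chart exposes a MIG strategy value; assuming the `mig.strategy` key in values.yaml, a `mixed` strategy could be selected when enabling the addon (illustrative; check values.yaml for the exact option):

```
# Illustrative: select the 'mixed' MIG strategy via a Helm chart value.
microk8s enable gpu --set mig.strategy=mixed
```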
## Features

### GPU addon features
- Use the existing NVIDIA host drivers, or build the drivers and load them into the kernel dynamically at runtime.
- Automatically install and configure the `nvidia-container-runtime` for containerd.
- Configure the `nvidia.com/gpu` kubelet device plugin, to support resource capacity and limits on GPU nodes.
- Multi-instance GPU (MIG) can be configured using ConfigMap resources.
### GPU addon components
The GPU addon will install and configure the following components on the MicroK8s cluster:

- `nvidia-feature-discovery`: Runs feature discovery on all cluster nodes, to detect GPU devices and host capabilities.
- `nvidia-driver-daemonset`: Runs on all GPU nodes of the cluster, builds and loads the NVIDIA drivers into the running kernel.
- `nvidia-container-toolkit-daemonset`: Runs on all GPU nodes of the cluster. Once the NVIDIA drivers are loaded, it installs the `nvidia-container-runtime` binaries and configures the `nvidia` runtime on containerd accordingly. By default, it sets the default runtime to `nvidia`, so all pod workloads can use the GPU.
- `nvidia-device-plugin-daemonset`: Runs on all GPU nodes of the cluster, and configures the `nvidia.com/gpu` kubelet device plugin. This is used to configure resource capacity and limits for the GPU nodes.
- `nvidia-operator-validator`: Validates that the NVIDIA drivers, container runtime and kubelet device plugin have been configured correctly. Finally, it executes an example CUDA workload.
A complete installation of the GPU operator looks like this (output of `microk8s kubectl get pod -n gpu-operator-resources`):
```
NAME                                       READY   STATUS      RESTARTS   AGE    IP            NODE        NOMINATED NODE   READINESS GATES
nvidia-container-toolkit-daemonset-mjbk8   1/1     Running     0          110m   10.1.51.198   machine-0   <none>           <none>
nvidia-cuda-validator-xj2kx                0/1     Completed   0          109m   10.1.51.204   machine-0   <none>           <none>
nvidia-dcgm-nvqnz                          1/1     Running     0          110m   10.1.51.199   machine-0   <none>           <none>
gpu-feature-discovery-dn6lt                1/1     Running     0          110m   10.1.51.202   machine-0   <none>           <none>
nvidia-device-plugin-daemonset-zg76f       1/1     Running     0          110m   10.1.51.201   machine-0   <none>           <none>
nvidia-device-plugin-validator-k6hdv       0/1     Completed   0          107m   10.1.51.205   machine-0   <none>           <none>
nvidia-dcgm-exporter-9vnc5                 1/1     Running     0          110m   10.1.51.203   machine-0   <none>           <none>
nvidia-operator-validator-ntvdj            1/1     Running     0          110m   10.1.51.200   machine-0   <none>           <none>
```
## MicroK8s 1.21
MicroK8s version 1.21 is out of support since May 2022. The GPU addon included with MicroK8s 1.21 was an early alpha and is no longer functional.
Due to a problem with the way containerd is configured in MicroK8s versions 1.21 and older, the `nvidia-toolkit-daemonset` installed by the GPU operator is incompatible and leaves MicroK8s in a broken state.
It is recommended to update to a supported version of MicroK8s. However, it is possible to install the GPU operator by following the steps described in this GitHub gist.