High Availability (HA)

HA for MicroK8s is currently only available as a tech preview for testing purposes.

A highly available Kubernetes cluster is a cluster that can withstand the failure of any one of its components and continue serving workloads without interruption. There are three requirements for a highly available Kubernetes cluster:

  1. There must be more than one node available at any time.
  2. The control plane must be running on more than one node, so that losing a single node does not render the cluster inoperable.
  3. The cluster state must be in a datastore that is itself highly available.

This documentation describes the steps needed to form an HA cluster in MicroK8s and to check its state.

As HA is a tech preview, this documentation is also a work in progress and subject to change. Please add comments if any parts aren’t working for you!

Testing HA for MicroK8s

To test the HA implementation, you will need:

  1. To install the ‘ha-preview’ version of MicroK8s.
  2. At least three nodes. For testing on a single machine, please see the documentation for installing on LXD.

Install the first node

HA is currently offered as a tech preview from the latest/edge/ha-preview channel.
On Linux, you can install this with:

sudo snap install microk8s --classic --channel=latest/edge/ha-preview

or update an existing installation with:

sudo snap refresh microk8s --classic --channel=latest/edge/ha-preview

For Windows and macOS, you can update your installation with:

multipass exec microk8s -- sudo snap refresh microk8s --classic --channel=latest/edge/ha-preview

(See the install docs for Windows and macOS if you need to install MicroK8s.)

Add at least two other nodes

As before, install the ha-preview version of MicroK8s on at least two additional machines (or LXD containers).
Follow the usual procedure for clustering (described in the clustering documentation):

On the initial node, run:

microk8s add-node

This will output a command with a generated token such as microk8s join 10.128.63.86:25000/567a21bdfc9a64738ef4b3286b2b8a69. Copy this command and run it from the next node. It may take a few minutes to successfully join.
Repeat this process (generate a token, run it from the joining node) for the third and any additional nodes.
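
When scripting the join step, the command has to be picked out of whatever text add-node prints. The following is a minimal Python sketch that does this with a regular expression. It assumes only the shape shown in the example above (microk8s join, an IPv4 address, a port, and a lowercase hexadecimal token); the function name extract_join_command is hypothetical, and nothing is assumed about the text around the command.

```python
import re

# Shape taken from the example above: "microk8s join <ip>:<port>/<token>".
# Assumption: the token is a lowercase hexadecimal string, as in that example.
JOIN_RE = re.compile(r"microk8s join \d{1,3}(?:\.\d{1,3}){3}:\d+/[0-9a-f]+")


def extract_join_command(add_node_output: str) -> str:
    """Return the first 'microk8s join ...' command found in the given text."""
    match = JOIN_RE.search(add_node_output)
    if match is None:
        raise ValueError("no 'microk8s join' command found in the output")
    return match.group(0)
```

The returned string can then be run on the joining node. Each token should be generated fresh for each node, as described above.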

Check the status

Run the status command:

microk8s status

With HA enabled, this will now inform you of the HA status and the addresses and roles of additional nodes. For example:

microk8s is running
high-availability: yes
  datastore master nodes: 10.128.63.86:19001 10.128.63.166:19001 10.128.63.43:19001
  datastore standby nodes: none
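
For automated checks, output like the example above can be parsed into a small structure. This is a sketch that assumes the status text keeps exactly the field labels shown in the example (high-availability, datastore master nodes, datastore standby nodes); the names parse_ha_status and _nodes are hypothetical helpers, not part of MicroK8s.

```python
def _nodes(line: str) -> list:
    """Split 'label: addr addr ...' into a list of addresses ('none' -> [])."""
    value = line.split(":", 1)[1].strip()
    return [] if value == "none" else value.split()


def parse_ha_status(status_text: str) -> dict:
    """Extract the HA fields from text laid out like the example above."""
    result = {"high_availability": False, "masters": [], "standbys": []}
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("high-availability:"):
            result["high_availability"] = line.split(":", 1)[1].strip() == "yes"
        elif line.startswith("datastore master nodes:"):
            result["masters"] = _nodes(line)
        elif line.startswith("datastore standby nodes:"):
            result["standbys"] = _nodes(line)
    return result


sample = """microk8s is running
high-availability: yes
  datastore master nodes: 10.128.63.86:19001 10.128.63.166:19001 10.128.63.43:19001
  datastore standby nodes: none
"""
status = parse_ha_status(sample)
print(status["high_availability"], len(status["masters"]), status["standbys"])
```

Since HA is a tech preview, the output layout may change, so a parser like this should be treated as fragile.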

Working with HA

All nodes of the HA cluster run the master control plane. A subset of the cluster nodes (at least three) maintains a copy of the Kubernetes dqlite database. Database maintenance involves a voting process through which a leader is elected. Apart from the voting nodes, there are non-voting nodes that silently keep a copy of the database. These nodes are on standby to take over the position of a departing voter. Finally, there are nodes that neither vote nor replicate the database. These nodes are called spare. To sum up, the three node roles are:

voters: replicating the database, participating in leader election
standby: replicating the database, not participating in leader election
spare: not replicating the database, not participating in leader election
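
The "at least three" voters requirement can be illustrated with majority arithmetic. dqlite is built on the Raft consensus protocol, in which electing a leader requires a strict majority of the voters. The sketch below works that out; it is general Raft-style arithmetic rather than anything MicroK8s-specific, and the function names are hypothetical.

```python
def quorum(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_voter_failures(voters: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return voters - quorum(voters)


for n in (1, 2, 3, 5):
    print(n, quorum(n), tolerated_voter_failures(n))
```

With one or two voters, no voter can be lost without losing the majority; three voters is the smallest set that survives the loss of one.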

Cluster formation, database syncing, voter and leader elections are all transparent to the administrator.

The current state of the HA cluster is shown with:

microk8s status

The output of the HA inspection reports:

  • If HA is achieved or not.
  • The voter and standby nodes.

Since all nodes of the HA cluster run the master control plane, the microk8s * commands are now available everywhere. Should one of the nodes crash, we can move to any other node and continue working without much disruption.

Almost all of the HA cluster management is transparent to the admin and requires minimal configuration. The only actions left to the administrator are adding and removing nodes. To ensure the health of the cluster, the following timings should be taken into account:

  • If the leader node gets “removed” ungracefully, e.g. it crashes and never comes back, it will take up to 5 seconds for the cluster to elect a new leader.
  • Promoting a non-voter to a voter takes up to 30 seconds. This promotion takes place when a new node enters the cluster or when a voter crashes.
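
Scripts that add or remove nodes can account for these timings by polling instead of assuming the change is immediate. The following is a minimal sketch of a generic polling helper. The name wait_until and its parameters are hypothetical, and the check callable is left to the caller (for example, a function that runs microk8s status and inspects the result).

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float,
               interval_s: float = 1.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Call check() repeatedly until it returns True or timeout_s elapses.

    Returns True if check() succeeded, False if the deadline passed first.
    check() is always called at least once.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call such as wait_until(cluster_is_ha, timeout_s=30) would match the promotion time quoted above, where cluster_is_ha is a caller-supplied function. Allowing some margin beyond 30 seconds is probably sensible, though that is an assumption rather than a documented figure.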

To remove a node gracefully, first run the leave command on the departing node:

microk8s leave

The node will be marked as ‘NotReady’ (unreachable) in Kubernetes. To complete the removal of the departing node, issue the following on any of the remaining nodes:

microk8s remove-node <node>

If we cannot call microk8s leave from the departing node, e.g. due to a node crash, we need to call microk8s remove-node with the --force flag:

microk8s remove-node <node> --force
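
The two removal paths can be summarised as a small decision, sketched here in Python. The function removal_commands is a hypothetical helper that only builds the list of commands described above together with where each should be run; it does not execute anything.

```python
def removal_commands(node: str, can_run_leave: bool) -> list:
    """Return (where_to_run, command) steps for removing `node`.

    can_run_leave is True when the departing node is still reachable and
    can run 'microk8s leave' itself; False when it has crashed.
    """
    if can_run_leave:
        return [
            ("departing node", "microk8s leave"),
            ("any remaining node", f"microk8s remove-node {node}"),
        ]
    # The departing node cannot leave gracefully, so removal is forced.
    return [("any remaining node", f"microk8s remove-node {node} --force")]


for where, command in removal_commands("node3", can_run_leave=False):
    print(f"{where}: {command}")
```

Here node3 is a placeholder for the name of the departing node.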

Add-ons on an HA cluster

Certain add-ons download and “install” client binaries. These binaries will be available only on the node the add-on was enabled from. For example, the helm client that gets installed with microk8s enable helm will be available only on the node the user issued the microk8s enable command.

Upgrading an existing cluster

If you have an existing cluster, you can upgrade to the ha-preview channel:

sudo snap refresh microk8s --channel=latest/edge/ha-preview

You then need to enable HA clustering:

microk8s enable ha-cluster

Any machines that are already nodes in a cluster will need to leave and rejoin in order to establish HA.

To do so, cycle through the nodes to drain, remove, and rejoin them:

microk8s kubectl drain <node>

On the node machine, force it to leave the cluster with:

microk8s leave

Then enable HA on that node with microk8s enable ha-cluster and re-join it to the cluster: run microk8s add-node on the master, then run the resulting microk8s join command on the node.
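
The per-node cycle can be written out as an ordered plan, which helps when several nodes have to be processed one after another. This sketch only generates the sequence of steps described above and does not run them. The function name rejoin_plan is hypothetical, and running the drain command on the master is an assumption, since the text does not say where it is issued.

```python
def rejoin_plan(nodes: list) -> list:
    """Return ordered (where_to_run, command) steps, one node at a time."""
    plan = []
    for node in nodes:
        plan.extend([
            # Assumption: the drain is issued from the master.
            ("master", f"microk8s kubectl drain {node}"),
            (node, "microk8s leave"),
            (node, "microk8s enable ha-cluster"),
            ("master", "microk8s add-node"),
            # The actual join command, with its token, is printed by add-node.
            (node, "microk8s join <address and token printed by add-node>"),
        ])
    return plan


for where, command in rejoin_plan(["node2", "node3"]):
    print(f"{where}: {command}")
```

Processing one node completely before starting the next keeps the rest of the cluster available while each node is drained and re-joined.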

What about an etcd-based HA?

MicroK8s ships upstream Kubernetes, so an etcd-based HA setup is also possible; see the upstream documentation on how this can be achieved.
The etcd approach is more involved and outside the scope of this document. Overall, you will need to maintain your own etcd HA cluster. You will then need to configure the API server and flannel to point to that etcd. Finally, you will need to provide a load balancer in front of the nodes acting as masters and configure the workers to reach the masters through the load-balanced endpoint.

