Recover from lost quorum
By default, a three-node MicroK8s cluster automatically becomes highly available (HA). In HA mode the default datastore (dqlite) implements a Raft-based protocol in which an elected leader holds the definitive copy of the database. Under normal operation, copies of the database are maintained by two further nodes. If you permanently lose the majority of the cluster members that serve as database nodes (for example, if you lose two nodes of a three-node cluster), the cluster becomes unavailable. However, if at least one database node has survived, you will be able to recover the cluster with the manual steps that follow.
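To see which nodes currently serve the datastore, run microk8s status on any node that is still up; on a healthy three-node HA cluster the output includes lines similar to the following (the addresses shown are those of the example cluster used below):

microk8s status
# microk8s is running
# high-availability: yes
#   datastore master nodes: 10.211.205.122:19001 10.211.205.253:19001 10.211.205.221:19001
#   datastore standby nodes: none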
Before you start, make sure MicroK8s is not running on the surviving node. Stopping MicroK8s is done with:

microk8s stop
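If more than one node survived, MicroK8s must be stopped on all of them before reconfiguring dqlite. A minimal sketch, assuming SSH access and hypothetical hostnames:

for host in surviving-node-1 surviving-node-2; do
  ssh "$host" sudo microk8s stop
done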
You must also make sure the lost nodes that used to form the cluster will not come back alive again. Any lost nodes that can be reinstated will have to re-join the cluster with the microk8s add-node and microk8s join process (see the documentation on clusters).
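For reference, the rejoin flow looks like this; the join URL below is illustrative, and the real one (including a one-time token) is printed by add-node:

# on a node that is part of the recovered cluster
microk8s add-node
# on the node that is rejoining
microk8s join 10.211.205.253:25000/<token>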
Dqlite stores data and configuration files under /var/snap/microk8s/current/var/kubernetes/backend/. To make a safe copy of the current state, log in to a surviving node and create a tarball of the dqlite directory:
tar -cvf backup.tar /var/snap/microk8s/current/var/kubernetes/backend
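If the recovery attempt needs to be rolled back, the archive can be restored from the filesystem root (GNU tar strips the leading / when creating the archive); make sure MicroK8s is still stopped when doing so:

tar -xvf backup.tar -C /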
In /var/snap/microk8s/current/var/kubernetes/backend the file cluster.yaml reflects the state of the cluster as dqlite sees it. Edit this file to remove the lost nodes, leaving only the ones available. For example, let’s assume a three-node cluster in which the nodes 10.211.205.122 and 10.211.205.221 are lost. In this case the cluster.yaml will initially look like this:
- Address: 10.211.205.122:19001
  ID: 3297041220608546238
  Role: 0
- Address: 10.211.205.253:19001
  ID: 9373968242441247628
  Role: 0
- Address: 10.211.205.221:19001
  ID: 3349965773726029294
  Role: 0
After removing the lost nodes, cluster.yaml should be left with only the surviving node:

- Address: 10.211.205.253:19001
  ID: 9373968242441247628
  Role: 0
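A minimal sketch of this edit with a couple of optional precautions (the extra copy and the grep check are not part of the procedure itself):

cd /var/snap/microk8s/current/var/kubernetes/backend
sudo cp cluster.yaml cluster.yaml.bak   # keep a pristine copy alongside the tarball
sudo nano cluster.yaml                  # delete the entries of the lost nodes
grep Address cluster.yaml               # expect only the surviving node(s)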
MicroK8s comes with a dqlite client utility for node reconfiguration.
The command to run is:
sudo /snap/microk8s/current/bin/dqlite \
  -s 127.0.0.1:19001 \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  k8s ".reconfigure /var/snap/microk8s/current/var/kubernetes/backend/ /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml"
The /snap/microk8s/current/bin/dqlite utility needs to be called with sudo and takes the following arguments:

- the endpoint of the (now stopped) dqlite service. We have used -s 127.0.0.1:19001 for this endpoint in the example above.
- the certificate and private key needed to access the database. These are passed with the -c and -k arguments and are found in the directory where dqlite keeps the database.
- the name of the database. For MicroK8s the database is k8s.
- the operation to be performed, in this case ".reconfigure".
- the path of the database we want to reconfigure, which is the current database under /var/snap/microk8s/current/var/kubernetes/backend.
- the end cluster configuration we want to recreate, which is reflected in the cluster.yaml we edited in the previous step.
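Putting it together, here is the same invocation with the repeated backend path factored into a variable; this is an equivalent sketch of the command above, not a different procedure:

BACKEND=/var/snap/microk8s/current/var/kubernetes/backend
sudo /snap/microk8s/current/bin/dqlite \
  -s 127.0.0.1:19001 \
  -c "$BACKEND/cluster.crt" \
  -k "$BACKEND/cluster.key" \
  k8s ".reconfigure $BACKEND/ $BACKEND/cluster.yaml"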
It should now be possible to bring the cluster back online with:

microk8s start
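To block until the node is fully up before inspecting it, microk8s status can wait for readiness:

microk8s status --wait-ready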
The lost nodes are still registered in Kubernetes but should be reported as NotReady in the output of:
microk8s kubectl get no
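Illustrative output for a three-node cluster where two nodes were lost (names, ages and versions are placeholders):

NAME     STATUS     ROLES    AGE   VERSION
node-1   Ready      <none>   25d   v1.28.3
node-2   NotReady   <none>   25d   v1.28.3
node-3   NotReady   <none>   25d   v1.28.3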
To remove the lost nodes use:
microk8s remove-node <node name>
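With the placeholder names from the listing above, that would be:

microk8s remove-node node-2
microk8s remove-node node-3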
High availability will be reattained when there are three or more nodes in the MicroK8s cluster. If the original failed nodes have been revived, or new nodes created, these can be joined to the cluster to restore high availability. See the documentation on clusters for instructions on adding nodes.