CAPV: Addressing Node Provisioning Issues Due to an Invalid State of ETCD
I recently ran into a strange scenario on a Kubernetes cluster after a sudden, unexpected crash caused by an issue in the underlying vSphere environment. In this case the cluster was a TKG cluster (in fact, it happened to be the TKG management cluster); however, the same situation could occur on any cluster managed by Cluster API Provider vSphere (CAPV).
I have seen clusters crash unexpectedly many times before, and most of the time they came back online successfully once all nodes were up and running. This time, however, some of the nodes could not boot properly, and Cluster API started reconciling them.
It started by removing one unhealthy control plane node from the cluster and attempting to replace it with a new, healthy one. Watching the process on vSphere, everything initially seemed normal: the template clone succeeded, the new machine was configured, powered on, booted properly, and so on. However, monitoring the node status on the Kubernetes cluster itself, I could see that the new node was not joining the cluster.
$ kubectl get node
NAME STATUS ROLES AGE VERSION
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-7nhgr Ready <none> 15d v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-k7d2f Ready <none> 54d v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-p6w76 Ready <none> 54d v1.27.5+vmware.1
it-tkg-mgmt-rg44c-qfhp2 Ready control-plane 54d v1.27.5+vmware.1
it-tkg-mgmt-rg44c-sbv4f Ready control-plane 29d v1.27.5+vmware.1
This cluster should have three control plane nodes. As you can see, only two are listed, and both are old ones.
Looking at the Cluster API Machine resources, I found the reason for that:
$ kubectl get machine -n tkg-system
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-7nhgr it-tkg-mgmt it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-7nhgr vsphere://4233fe14-ce2d-37d0-34d9-13286bc2a8f9 Running 16d v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-k7d2f it-tkg-mgmt it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-k7d2f vsphere://42339cc8-a26f-c879-3505-1881e34481ac Running 54d v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-p6w76 it-tkg-mgmt it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-p6w76 vsphere://4233d909-f407-9904-a4be-c165b7cd18ef Running 54d v1.27.5+vmware.1
it-tkg-mgmt-rg44c-qfhp2 it-tkg-mgmt it-tkg-mgmt-rg44c-qfhp2 vsphere://4233989e-1f0f-2807-2f91-2227e531e91b Running 54d v1.27.5+vmware.1
it-tkg-mgmt-rg44c-sbv4f it-tkg-mgmt it-tkg-mgmt-rg44c-sbv4f vsphere://42338dee-6beb-f435-b345-0492f5bab982 Running 29d v1.27.5+vmware.1
it-tkg-mgmt-rg44c-vrkvg it-tkg-mgmt vsphere://4233564d-5599-92ba-fd2a-e8437b1c90ee Provisioned 12m v1.27.5+vmware.1
Cluster API failed to complete the provisioning of the new machine: its Phase was stuck at Provisioned and never transitioned to Running.
At first, I decided to just wait and see what would happen (even though this process typically takes no more than a few minutes). Eventually the provisioning timed out (there is a 20-minute timeout for it), CAPV deleted the machine and created a new one, and the result was the same, over and over again.
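If you want to watch this remediation loop from the Cluster API side, the KubeadmControlPlane object that owns these machines is a reasonable place to look; for example (using the control plane name from this cluster):
$ kubectl get kubeadmcontrolplane it-tkg-mgmt-rg44c -n tkg-system
$ kubectl describe kubeadmcontrolplane it-tkg-mgmt-rg44c -n tkg-system
Its replica counts and conditions give a cluster-level view of the repeated replacement attempts.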
Describing the Machine resource also didn't reveal any relevant error or hint.
$ kubectl describe machine it-tkg-mgmt-rg44c-vrkvg -n tkg-system
Name:         it-tkg-mgmt-rg44c-vrkvg
Namespace:    tkg-system
Labels:       cluster.x-k8s.io/cluster-name=it-tkg-mgmt
              cluster.x-k8s.io/control-plane=
              cluster.x-k8s.io/control-plane-name=it-tkg-mgmt-rg44c
              node.cluster.x-k8s.io/esxi-host=some-esxi-01.some.domain
              topology.cluster.x-k8s.io/owned=
Annotations:  controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration:
                {"etcd":{"local":{"imageRepository":"projects.registry.vmware.com/tkg","imageTag":"v3.5.7_vmware.6","dataDir":"/var/lib/etcd","extraArgs":...
              controlplane.cluster.x-k8s.io/remediation-for: {"machine":"it-tkg-mgmt-rg44c-gn5sw","timestamp":"2023-11-30T14:39:33Z","retryCount":819}
              run.tanzu.vmware.com/resolve-os-image: image-type=ova,os-name=ubuntu
API Version:  cluster.x-k8s.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:  2023-11-30T14:39:38Z
  Finalizers:
    machine.cluster.x-k8s.io
  Generation:          3
  Owner References:
    API Version:           controlplane.cluster.x-k8s.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  KubeadmControlPlane
    Name:                  it-tkg-mgmt-rg44c
    UID:                   24136f3d-5268-4401-a5bd-3672bc5b3255
  Resource Version:    50545428
  UID:                 d74da63a-7873-414c-919f-c6020eec2a30
Spec:
  Bootstrap:
    Config Ref:
      API Version:     bootstrap.cluster.x-k8s.io/v1beta1
      Kind:            KubeadmConfig
      Name:            it-tkg-mgmt-rg44c-8bnx6
      Namespace:       tkg-system
      UID:             bd21eb6b-def3-4b67-9ab3-2b33b7fa8abe
    Data Secret Name:  it-tkg-mgmt-rg44c-8bnx6
  Cluster Name:           it-tkg-mgmt
  Infrastructure Ref:
    API Version:  infrastructure.cluster.x-k8s.io/v1beta1
    Kind:         VSphereMachine
    Name:         it-tkg-mgmt-control-plane-7fgtz-w8vrp
    Namespace:    tkg-system
    UID:          80433d34-1d05-4f34-855e-a0f26aa0440b
  Node Deletion Timeout:  10s
  Provider ID:            vsphere://4233564d-5599-92ba-fd2a-e8437b1c90ee
  Version:                v1.27.5+vmware.1
Status:
  Addresses:
    Address:  172.16.52.16
    Type:     ExternalIP
  Bootstrap Ready:       true
  Conditions:
    Last Transition Time:  2023-11-30T14:41:30Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2023-11-30T14:39:39Z
    Status:                True
    Type:                  BootstrapReady
    Last Transition Time:  2023-11-30T14:41:30Z
    Status:                True
    Type:                  InfrastructureReady
    Last Transition Time:  2023-11-30T14:41:30Z
    Reason:                NodeProvisioning
    Severity:              Warning
    Status:                False
    Type:                  NodeHealthy
  Infrastructure Ready:  true
  Last Updated:          2023-11-30T14:41:30Z
  Observed Generation:   3
  Phase:                 Provisioned
Events:
  Type    Reason             Age                   From                           Message
  ----    ------             ----                  ----                           -------
  Normal  DetectedUnhealthy  14s (x23 over 5m10s)  machinehealthcheck-controller  Machine tkg-system/it-tkg-mgmt-rg44c/it-tkg-mgmt-rg44c-vrkvg/ has unhealthy node
The only hint here was that the MachineHealthCheck (MHC) Controller marked the machine unhealthy.
Looking at the MHC resources, I could see that the CurrentHealthy column showed 2 for the control plane node group, instead of 3.
$ kubectl get mhc -n tkg-system
NAME CLUSTER EXPECTEDMACHINES MAXUNHEALTHY CURRENTHEALTHY AGE
it-tkg-mgmt-md-0-hvrt5 it-tkg-mgmt 3 100% 3 98d
it-tkg-mgmt-rg44c it-tkg-mgmt 3 100% 2 98d
That still didn’t give me any hint about the provisioning issue…
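Before going onto the nodes, another place worth checking in a CAPV cluster is the provider-side objects behind the Machine (the VSphereMachine it references and the corresponding VSphereVM), since infrastructure-level errors often surface in their conditions. Something along these lines, using the VSphereMachine name from the describe output above:
$ kubectl get vspheremachine,vspherevm -n tkg-system
$ kubectl describe vspheremachine it-tkg-mgmt-control-plane-7fgtz-w8vrp -n tkg-system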
I decided to SSH into the Kubernetes control plane nodes and look directly at the containers running on them. I immediately noticed the issue: the containers kept crashing and restarting (note the high ATTEMPT counts in the output below).
$ ssh capv@172.16.52.14
Authorized uses only. All activity may be monitored and reported.
Warning: your password will expire in 0 days
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-162-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Last login: Thu Nov 30 15:38:32 2023 from 172.16.51.5
capv@it-tkg-mgmt-rg44c-qfhp2:~$ sudo -i
root@it-tkg-mgmt-rg44c-qfhp2:~# crictl ps -a
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
8cf9e56956fd5 102ad189891ef 5 days ago Running node-driver-registrar 27 2b2dd83beb84c vsphere-csi-node-7x76b
63014d16e1658 102ad189891ef 5 days ago Exited node-driver-registrar 26 2b2dd83beb84c vsphere-csi-node-7x76b
ca1dc012adbc3 5f118433f1e9a 5 days ago Running pinniped-concierge 3 6307e606d7236 pinniped-concierge-78d876b767-z8pmc
af3f11a43f61b 15fe5073dd8c9 5 days ago Running csi-snapshotter 3 9cdb23ee0b000 vsphere-csi-controller-56cd4b889c-ls9zm
d3c38ed83a497 c56e5b7cec9c0 5 days ago Running csi-provisioner 5 9cdb23ee0b000 vsphere-csi-controller-56cd4b889c-ls9zm
869e5c4dafbf7 dddc954fe5e78 5 days ago Running vsphere-syncer 3 9cdb23ee0b000 vsphere-csi-controller-56cd4b889c-ls9zm
32f93bef88ad7 8133b410f2d7f 5 days ago Running liveness-probe 3 9cdb23ee0b000 vsphere-csi-controller-56cd4b889c-ls9zm
2f3173ebd1063 ce474044f1f86 5 days ago Running ako-tkg-system-it-tkg-mgmt 21 4656f26aa26cd ako-0
f4a84bde4df8a ac8014b83ac8b 5 days ago Running vsphere-csi-controller 62 9cdb23ee0b000 vsphere-csi-controller-56cd4b889c-ls9zm
2a1df07d205e4 edf226fded4ef 5 days ago Running csi-resizer 6 9cdb23ee0b000 vsphere-csi-controller-56cd4b889c-ls9zm
e2b5fecddf18f efe341eddba2c 5 days ago Running csi-attacher 14 9cdb23ee0b000 vsphere-csi-controller-56cd4b889c-ls9zm
00a45c08e4ab8 67a2ea74bd686 5 days ago Running kapp-controller 6 b7ba3dae27e1d kapp-controller-8644d7bcb9-8rbv6
c75c2930956e4 91e9600932c40 5 days ago Running manager 1 7ff8aaeffb3b8 ako-operator-controller-manager-846fddf65f-vsnw6
4f10ff4d06c98 fd15a392ebc42 5 days ago Running manager 3 23e0cac674f10 tkr-conversion-webhook-manager-d8fb9c6cd-b7fq8
372d6e901d38d e37132f649a90 5 days ago Running vsphere-cloud-controller-manager 8 9f383250fb775 vsphere-cloud-controller-manager-r74rw
bdd0df698309b 645fbff1a73f5 5 days ago Running antrea-ovs 3 4d538e1d7a7b6 antrea-agent-vsvvd
23293ff51ae6b 645fbff1a73f5 5 days ago Running antrea-agent 5 4d538e1d7a7b6 antrea-agent-vsvvd
b637899678459 645fbff1a73f5 5 days ago Exited install-cni 1 4d538e1d7a7b6 antrea-agent-vsvvd
b40e651219492 8133b410f2d7f 5 days ago Running liveness-probe 3 2b2dd83beb84c vsphere-csi-node-7x76b
18be8b3d3332b ac8014b83ac8b 5 days ago Running vsphere-csi-node 15 2b2dd83beb84c vsphere-csi-node-7x76b
76072dc0b8ef4 645fbff1a73f5 5 days ago Exited antrea-agent-tweaker 2 4d538e1d7a7b6 antrea-agent-vsvvd
886b11a7fe681 67a2ea74bd686 5 days ago Running kapp-controller-sidecarexec 1 b7ba3dae27e1d kapp-controller-8644d7bcb9-8rbv6
b96279477a9b7 67a2ea74bd686 5 days ago Exited kapp-controller 5 b7ba3dae27e1d kapp-controller-8644d7bcb9-8rbv6
0425d45dff72e 8adb3fe1f0f13 5 days ago Running kube-proxy 3 394913d255893 kube-proxy-57fsr
4f9f816f0c728 67a2ea74bd686 5 days ago Exited init-kapp-controller 1 b7ba3dae27e1d kapp-controller-8644d7bcb9-8rbv6
134dfbf73c032 bff7d9e209c21 5 days ago Running kube-apiserver 31 76a25f8d31b30 kube-apiserver-it-tkg-mgmt-rg44c-qfhp2
eccb0e4ee874a bff7d9e209c21 5 days ago Exited kube-apiserver 30 76a25f8d31b30 kube-apiserver-it-tkg-mgmt-rg44c-qfhp2
28631a62b2ae1 cd9c0c7429e19 5 days ago Running etcd 6 edfffa11e43ad etcd-it-tkg-mgmt-rg44c-qfhp2
d8efaf2b9608c 9c80c77189951 5 days ago Running kube-scheduler 10 681ed5149bfa2 kube-scheduler-it-tkg-mgmt-rg44c-qfhp2
c4c4b945b27ed 1ad46f3c7fbf5 5 days ago Running kube-controller-manager 11 05823fae6de1c kube-controller-manager-it-tkg-mgmt-rg44c-qfhp2
At that point, everything made more sense: since the ETCD database kept crashing, the Kubernetes API server and everything that depends on it kept crashing as well.
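To pull the ETCD container logs directly on the node, crictl can be used; roughly something like this (the same grep/cut pipeline used further below for crictl exec):
# grab the ID of the running etcd container and dump its logs
crictl logs $(crictl ps | grep etcd | grep "Running" | cut -d " " -f1)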
Diving into the ETCD database logs, I noticed a few concerning entries (a bunch of these):
{"level":"warn","ts":"2023-11-30T15:33:44.975Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"b6e6acb707bcb7be","rtt":"0s","error":"dial tcp 172.16.52.10:2380: i/o timeout"}
{"level":"warn","ts":"2023-11-30T15:33:46.896Z","caller":"etcdhttp/metrics.go:173","msg":"serving /health false; no leader"}
{"level":"warn","ts":"2023-11-30T15:33:46.974Z","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"a4f2bd5654759765","local-member-attributes":"{Name:it-tkg-mgmt-rg44c-qfhp2 ClientURLs:[https://172.16.52.14:2379]}","request-path":"/0/members/a4f2bd5654759765/attributes","publish-timeout":"7s","error":"etcdserver: request timed out"}
{"level":"warn","ts":"2023-11-30T15:33:49.976Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"b6e6acb707bcb7be","rtt":"0s","error":"dial tcp 172.16.52.10:2380: connect: no route to host"}
These entries show that this member could not reach its peer at 172.16.52.10:2380 and that the cluster had no elected leader. Due to this unstable state of ETCD, the connection to the cluster was also intermittent, and I kept getting connection errors every few minutes. For example:
The connection to the server 172.16.53.10:6443 was refused - did you specify the right host or port?
I realized that, due to the sudden crash, the state of ETCD had become corrupted: the ETCD leader election couldn't complete, and the existing control plane nodes were unstable (they kept going down and coming back up every few minutes). From the ETCD logs shown above, I figured that the current ETCD leader was 172.16.52.10, since the other healthy node (172.16.52.14) kept trying to reach it. From the 172.16.52.14 node, I opened a shell to the ETCD container using crictl exec -it $(crictl ps | grep etcd | grep "Running" | cut -d " " -f1) sh and listed the current cluster members.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key member list
The output:
a4f2bd5654759765, started, it-tkg-mgmt-rg44c-qfhp2, https://172.16.52.14:2380, https://172.16.52.14:2379, false
b6e6acb707bcb7be, started, it-tkg-mgmt-rg44c-sbv4f, https://172.16.52.10:2380, https://172.16.52.10:2379, false
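As a sanity check before transferring leadership, etcdctl can also report per-endpoint status, including which member each endpoint currently considers the leader. A sketch of that check, run from the same container (the flaky 172.16.52.10 endpoint may simply time out, which is telling in itself):
ETCDCTL_API=3 etcdctl --endpoints=https://172.16.52.14:2379,https://172.16.52.10:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key endpoint status -w table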
Then I moved the ETCD leader to 172.16.52.14 (referenced by its member ID):
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key move-leader a4f2bd5654759765
Output:
Leadership transferred from a4f2bd5654759765 to a4f2bd5654759765
Immediately after doing so, Cluster API removed the faulty node (172.16.52.10), so I knew that cluster reconciliation had started working properly again. The missing third control plane node was then created successfully, and its provisioning went smoothly.
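To confirm that remediation is actually progressing, watching the Machine objects until the replacements reach the Running phase works well, e.g.:
$ kubectl get machine -n tkg-system -w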
$ kubectl get no
NAME STATUS ROLES AGE VERSION
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-7nhgr Ready <none> 21d v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-k7d2f Ready <none> 60d v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-p6w76 Ready <none> 60d v1.27.5+vmware.1
it-tkg-mgmt-rg44c-c5kbt Ready control-plane 5d4h v1.27.5+vmware.1
it-tkg-mgmt-rg44c-qfhp2 Ready control-plane 60d v1.27.5+vmware.1
it-tkg-mgmt-rg44c-qz744 Ready control-plane 5d4h v1.27.5+vmware.1
Finally, looking at the vSphere environment, I noticed a stale, orphaned control plane machine in the inventory, which CAPV probably couldn't delete properly due to the cluster issues, so I had to remove it manually, directly from the inventory.
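For reference, that kind of cleanup can also be scripted with govc, assuming GOVC_URL and credentials are set; the VM name below is a placeholder, and you should first make sure no Machine or VSphereMachine still references the VM:
# power off (if still running) and delete the orphaned VM; placeholder name
govc vm.power -off it-tkg-mgmt-rg44c-xxxxx
govc vm.destroy it-tkg-mgmt-rg44c-xxxxx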
It took a while to figure this one out, and it felt like an edge case worth sharing.

