CAPV: Addressing Node Provisioning Issues Due to an Invalid State of ETCD

2023-12-01 7 min read Cloud Native Kubernetes Tanzu TKG

I recently ran into a strange scenario on a Kubernetes cluster after it suffered a sudden, unexpected crash caused by an issue in the underlying vSphere environment. In this case, the cluster was a TKG cluster (in fact, it happened to be the TKG management cluster); however, the same situation could have occurred on any cluster managed by Cluster API Provider vSphere (CAPV).

I have seen clusters crash unexpectedly many times before, and most of the time they successfully came back online once all nodes were up and running. In this case, however, some of the nodes could not boot properly, and Cluster API started trying to reconcile them.

It started by removing one unhealthy control plane node from the cluster and attempting to replace it with a new, healthy one. I monitored the process in vSphere, and it initially seemed normal: the clone of the template succeeded, the new machine was configured, powered on, started properly, and so on. However, watching the status of the nodes on the Kubernetes cluster itself, I could see that the new node was not joining the cluster.

$ kubectl get node

NAME                                            STATUS   ROLES           AGE   VERSION
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-7nhgr   Ready    <none>          15d   v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-k7d2f   Ready    <none>          54d   v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-p6w76   Ready    <none>          54d   v1.27.5+vmware.1
it-tkg-mgmt-rg44c-qfhp2                         Ready    control-plane   54d   v1.27.5+vmware.1
it-tkg-mgmt-rg44c-sbv4f                         Ready    control-plane   29d   v1.27.5+vmware.1

This cluster should have three control plane nodes. As you can see, only two are listed, and both are old ones.

Looking at the Cluster API Machine resources, I found the reason for that:

$ kubectl get machine -n tkg-system

NAME                                            CLUSTER       NODENAME                                        PROVIDERID                                       PHASE         AGE   VERSION
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-7nhgr   it-tkg-mgmt   it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-7nhgr   vsphere://4233fe14-ce2d-37d0-34d9-13286bc2a8f9   Running       16d   v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-k7d2f   it-tkg-mgmt   it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-k7d2f   vsphere://42339cc8-a26f-c879-3505-1881e34481ac   Running       54d   v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-p6w76   it-tkg-mgmt   it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-p6w76   vsphere://4233d909-f407-9904-a4be-c165b7cd18ef   Running       54d   v1.27.5+vmware.1
it-tkg-mgmt-rg44c-qfhp2                         it-tkg-mgmt   it-tkg-mgmt-rg44c-qfhp2                         vsphere://4233989e-1f0f-2807-2f91-2227e531e91b   Running       54d   v1.27.5+vmware.1
it-tkg-mgmt-rg44c-sbv4f                         it-tkg-mgmt   it-tkg-mgmt-rg44c-sbv4f                         vsphere://42338dee-6beb-f435-b345-0492f5bab982   Running       29d   v1.27.5+vmware.1
it-tkg-mgmt-rg44c-vrkvg                         it-tkg-mgmt                                                   vsphere://4233564d-5599-92ba-fd2a-e8437b1c90ee   Provisioned   12m   v1.27.5+vmware.1

Cluster API failed to complete the provisioning process of the new machine. Its phase was stuck at Provisioned and never transitioned to Running.

At first, I decided to just wait and see what would happen (although this process typically doesn’t take more than a few minutes). At some point, the process timed out (there is a 20-minute timeout for the provisioning process) and CAPV deleted the machine and created a new one. However, the result was the same, over and over again.
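If you want to watch this delete-and-recreate loop and look for clues while it happens, you can watch the Machine objects and tail the relevant controller logs. This is just a sketch; the namespaces and deployment names below assume a default TKG/Cluster API installation:

$ kubectl get machine -n tkg-system -w
$ kubectl logs -n capi-system deploy/capi-controller-manager --tail=100
$ kubectl logs -n capi-kubeadm-control-plane-system deploy/capi-kubeadm-control-plane-controller-manager --tail=100
$ kubectl logs -n capv-system deploy/capv-controller-manager --tail=100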

Looking at the machine resource, I also couldn’t see any relevant error or hint.

$ kubectl describe machine it-tkg-mgmt-rg44c-vrkvg -n tkg-system

Name:         it-tkg-mgmt-rg44c-vrkvg
Namespace:    tkg-system
Labels:       cluster.x-k8s.io/cluster-name=it-tkg-mgmt
              cluster.x-k8s.io/control-plane=
              cluster.x-k8s.io/control-plane-name=it-tkg-mgmt-rg44c
              node.cluster.x-k8s.io/esxi-host=some-esxi-01.some.domain
              topology.cluster.x-k8s.io/owned=
Annotations:  controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration:
                {"etcd":{"local":{"imageRepository":"projects.registry.vmware.com/tkg","imageTag":"v3.5.7_vmware.6","dataDir":"/var/lib/etcd","extraArgs":...
              controlplane.cluster.x-k8s.io/remediation-for: {"machine":"it-tkg-mgmt-rg44c-gn5sw","timestamp":"2023-11-30T14:39:33Z","retryCount":819}
              run.tanzu.vmware.com/resolve-os-image: image-type=ova,os-name=ubuntu
API Version:  cluster.x-k8s.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:  2023-11-30T14:39:38Z
  Finalizers:
    machine.cluster.x-k8s.io
  Generation:  3
  Owner References:
    API Version:           controlplane.cluster.x-k8s.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  KubeadmControlPlane
    Name:                  it-tkg-mgmt-rg44c
    UID:                   24136f3d-5268-4401-a5bd-3672bc5b3255
  Resource Version:        50545428
  UID:                     d74da63a-7873-414c-919f-c6020eec2a30
Spec:
  Bootstrap:
    Config Ref:
      API Version:     bootstrap.cluster.x-k8s.io/v1beta1
      Kind:            KubeadmConfig
      Name:            it-tkg-mgmt-rg44c-8bnx6
      Namespace:       tkg-system
      UID:             bd21eb6b-def3-4b67-9ab3-2b33b7fa8abe
    Data Secret Name:  it-tkg-mgmt-rg44c-8bnx6
  Cluster Name:        it-tkg-mgmt
  Infrastructure Ref:
    API Version:          infrastructure.cluster.x-k8s.io/v1beta1
    Kind:                 VSphereMachine
    Name:                 it-tkg-mgmt-control-plane-7fgtz-w8vrp
    Namespace:            tkg-system
    UID:                  80433d34-1d05-4f34-855e-a0f26aa0440b
  Node Deletion Timeout:  10s
  Provider ID:            vsphere://4233564d-5599-92ba-fd2a-e8437b1c90ee
  Version:                v1.27.5+vmware.1
Status:
  Addresses:
    Address:        172.16.52.16
    Type:           ExternalIP
  Bootstrap Ready:  true
  Conditions:
    Last Transition Time:  2023-11-30T14:41:30Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2023-11-30T14:39:39Z
    Status:                True
    Type:                  BootstrapReady
    Last Transition Time:  2023-11-30T14:41:30Z
    Status:                True
    Type:                  InfrastructureReady
    Last Transition Time:  2023-11-30T14:41:30Z
    Reason:                NodeProvisioning
    Severity:              Warning
    Status:                False
    Type:                  NodeHealthy
  Infrastructure Ready:    true
  Last Updated:            2023-11-30T14:41:30Z
  Observed Generation:     3
  Phase:                   Provisioned
Events:
  Type    Reason             Age                   From                           Message
  ----    ------             ----                  ----                           -------
  Normal  DetectedUnhealthy  14s (x23 over 5m10s)  machinehealthcheck-controller  Machine tkg-system/it-tkg-mgmt-rg44c/it-tkg-mgmt-rg44c-vrkvg/ has unhealthy node

The only hint here was that the MachineHealthCheck (MHC) controller had marked the machine as unhealthy.

Looking at the MHC resources, I could see that the CurrentHealthy column showed 2 for the control plane node group, instead of 3.

$ kubectl get mhc -n tkg-system

NAME                     CLUSTER       EXPECTEDMACHINES   MAXUNHEALTHY   CURRENTHEALTHY   AGE
it-tkg-mgmt-md-0-hvrt5   it-tkg-mgmt   3                  100%           3                98d
it-tkg-mgmt-rg44c        it-tkg-mgmt   3                  100%           2                98d

That still didn’t give me any hint about the provisioning issue…
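For reference, the node startup timeout and the unhealthy conditions that drive this remediation are defined on the MHC itself, so they can be inspected directly (a quick sketch using standard Cluster API fields):

$ kubectl get mhc it-tkg-mgmt-rg44c -n tkg-system -o jsonpath='{.spec.nodeStartupTimeout}{"\n"}'
$ kubectl get mhc it-tkg-mgmt-rg44c -n tkg-system -o jsonpath='{.spec.unhealthyConditions}{"\n"}'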

I decided to SSH into the Kubernetes control plane nodes and manually look at the containers running on them. I immediately noticed the issue: judging by the high ATTEMPT counts, the containers kept crashing and restarting.

$ ssh [email protected]

Authorized uses only. All activity may be monitored and reported.
Warning: your password will expire in 0 days
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-162-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
Last login: Thu Nov 30 15:38:32 2023 from 172.16.51.5

capv@it-tkg-mgmt-rg44c-qfhp2:~$ sudo -i

root@it-tkg-mgmt-rg44c-qfhp2:~# crictl ps -a
CONTAINER           IMAGE               CREATED             STATE               NAME                               ATTEMPT             POD ID              POD
8cf9e56956fd5       102ad189891ef       5 days ago          Running             node-driver-registrar              27                  2b2dd83beb84c       vsphere-csi-node-7x76b
63014d16e1658       102ad189891ef       5 days ago          Exited              node-driver-registrar              26                  2b2dd83beb84c       vsphere-csi-node-7x76b
ca1dc012adbc3       5f118433f1e9a       5 days ago          Running             pinniped-concierge                 3                   6307e606d7236       pinniped-concierge-78d876b767-z8pmc
af3f11a43f61b       15fe5073dd8c9       5 days ago          Running             csi-snapshotter                    3                   9cdb23ee0b000       vsphere-csi-controller-56cd4b889c-ls9zm
d3c38ed83a497       c56e5b7cec9c0       5 days ago          Running             csi-provisioner                    5                   9cdb23ee0b000       vsphere-csi-controller-56cd4b889c-ls9zm
869e5c4dafbf7       dddc954fe5e78       5 days ago          Running             vsphere-syncer                     3                   9cdb23ee0b000       vsphere-csi-controller-56cd4b889c-ls9zm
32f93bef88ad7       8133b410f2d7f       5 days ago          Running             liveness-probe                     3                   9cdb23ee0b000       vsphere-csi-controller-56cd4b889c-ls9zm
2f3173ebd1063       ce474044f1f86       5 days ago          Running             ako-tkg-system-it-tkg-mgmt         21                  4656f26aa26cd       ako-0
f4a84bde4df8a       ac8014b83ac8b       5 days ago          Running             vsphere-csi-controller             62                  9cdb23ee0b000       vsphere-csi-controller-56cd4b889c-ls9zm
2a1df07d205e4       edf226fded4ef       5 days ago          Running             csi-resizer                        6                   9cdb23ee0b000       vsphere-csi-controller-56cd4b889c-ls9zm
e2b5fecddf18f       efe341eddba2c       5 days ago          Running             csi-attacher                       14                  9cdb23ee0b000       vsphere-csi-controller-56cd4b889c-ls9zm
00a45c08e4ab8       67a2ea74bd686       5 days ago          Running             kapp-controller                    6                   b7ba3dae27e1d       kapp-controller-8644d7bcb9-8rbv6
c75c2930956e4       91e9600932c40       5 days ago          Running             manager                            1                   7ff8aaeffb3b8       ako-operator-controller-manager-846fddf65f-vsnw6
4f10ff4d06c98       fd15a392ebc42       5 days ago          Running             manager                            3                   23e0cac674f10       tkr-conversion-webhook-manager-d8fb9c6cd-b7fq8
372d6e901d38d       e37132f649a90       5 days ago          Running             vsphere-cloud-controller-manager   8                   9f383250fb775       vsphere-cloud-controller-manager-r74rw
bdd0df698309b       645fbff1a73f5       5 days ago          Running             antrea-ovs                         3                   4d538e1d7a7b6       antrea-agent-vsvvd
23293ff51ae6b       645fbff1a73f5       5 days ago          Running             antrea-agent                       5                   4d538e1d7a7b6       antrea-agent-vsvvd
b637899678459       645fbff1a73f5       5 days ago          Exited              install-cni                        1                   4d538e1d7a7b6       antrea-agent-vsvvd
b40e651219492       8133b410f2d7f       5 days ago          Running             liveness-probe                     3                   2b2dd83beb84c       vsphere-csi-node-7x76b
18be8b3d3332b       ac8014b83ac8b       5 days ago          Running             vsphere-csi-node                   15                  2b2dd83beb84c       vsphere-csi-node-7x76b
76072dc0b8ef4       645fbff1a73f5       5 days ago          Exited              antrea-agent-tweaker               2                   4d538e1d7a7b6       antrea-agent-vsvvd
886b11a7fe681       67a2ea74bd686       5 days ago          Running             kapp-controller-sidecarexec        1                   b7ba3dae27e1d       kapp-controller-8644d7bcb9-8rbv6
b96279477a9b7       67a2ea74bd686       5 days ago          Exited              kapp-controller                    5                   b7ba3dae27e1d       kapp-controller-8644d7bcb9-8rbv6
0425d45dff72e       8adb3fe1f0f13       5 days ago          Running             kube-proxy                         3                   394913d255893       kube-proxy-57fsr
4f9f816f0c728       67a2ea74bd686       5 days ago          Exited              init-kapp-controller               1                   b7ba3dae27e1d       kapp-controller-8644d7bcb9-8rbv6
134dfbf73c032       bff7d9e209c21       5 days ago          Running             kube-apiserver                     31                  76a25f8d31b30       kube-apiserver-it-tkg-mgmt-rg44c-qfhp2
eccb0e4ee874a       bff7d9e209c21       5 days ago          Exited              kube-apiserver                     30                  76a25f8d31b30       kube-apiserver-it-tkg-mgmt-rg44c-qfhp2
28631a62b2ae1       cd9c0c7429e19       5 days ago          Running             etcd                               6                   edfffa11e43ad       etcd-it-tkg-mgmt-rg44c-qfhp2
d8efaf2b9608c       9c80c77189951       5 days ago          Running             kube-scheduler                     10                  681ed5149bfa2       kube-scheduler-it-tkg-mgmt-rg44c-qfhp2
c4c4b945b27ed       1ad46f3c7fbf5       5 days ago          Running             kube-controller-manager            11                  05823fae6de1c       kube-controller-manager-it-tkg-mgmt-rg44c-qfhp2

Everything made more sense at that point. Since the ETCD database kept crashing, the Kubernetes API server and everything that depends on it kept crashing as well.
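The ETCD logs themselves can be pulled directly on the node with crictl. A quick sketch, assuming a single running etcd container as in the output above:

crictl logs --tail=100 $(crictl ps --name etcd -q)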

Diving into the ETCD database logs, I noticed a few concerning entries (a bunch of these):

{"level":"warn","ts":"2023-11-30T15:33:44.975Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"b6e6acb707bcb7be","rtt":"0s","error":"dial tcp 172.16.52.10:2380: i/o timeout"}
{"level":"warn","ts":"2023-11-30T15:33:46.896Z","caller":"etcdhttp/metrics.go:173","msg":"serving /health false; no leader"}
{"level":"warn","ts":"2023-11-30T15:33:46.974Z","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"a4f2bd5654759765","local-member-attributes":"{Name:it-tkg-mgmt-rg44c-qfhp2 ClientURLs:[https://172.16.52.14:2379]}","request-path":"/0/members/a4f2bd5654759765/attributes","publish-timeout":"7s","error":"etcdserver: request timed out"}
{"level":"warn","ts":"2023-11-30T15:33:49.976Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"b6e6acb707bcb7be","rtt":"0s","error":"dial tcp 172.16.52.10:2380: connect: no route to host"}

Due to the unstable state of ETCD, the connection to the cluster was also intermittent and I kept getting connection errors every few minutes. For example:

The connection to the server 172.16.53.10:6443 was refused - did you specify the right host or port?

I realized that, due to the sudden crash, the state of ETCD had become corrupted. The ETCD leader election couldn’t complete, and the existing control plane nodes were unstable (they kept going down and coming back up every few minutes). From the ETCD logs (shown above), I figured that the current ETCD leader was 172.16.52.10, since the other healthy node (172.16.52.14) kept trying to reach it. On the 172.16.52.14 node, I opened a shell to the ETCD container using crictl exec -it $(crictl ps | grep etcd | grep "Running" | cut -d " " -f1) sh and listed the current cluster members.

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key member list

The output:

a4f2bd5654759765, started, it-tkg-mgmt-rg44c-qfhp2, https://172.16.52.14:2380, https://172.16.52.14:2379, false
b6e6acb707bcb7be, started, it-tkg-mgmt-rg44c-sbv4f, https://172.16.52.10:2380, https://172.16.52.10:2379, false
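Before moving leadership, you can also ask each endpoint which member (if any) it currently considers the leader. This is a sketch using the same certificates as above; with no stable leader, the unhealthy endpoint may simply time out:

ETCDCTL_API=3 etcdctl --endpoints=https://172.16.52.10:2379,https://172.16.52.14:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key endpoint status --write-out=table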

I then moved the ETCD leader to 172.16.52.14 (by its member ID):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key move-leader a4f2bd5654759765

Output:

Leadership transferred from a4f2bd5654759765 to a4f2bd5654759765

Immediately after doing so, CAPI removed the faulty node (172.16.52.10), so I knew the cluster reconciliation had started functioning properly. The missing third control plane node was then created successfully, and this time the provisioning went smoothly.

$ kubectl get no

NAME                                            STATUS   ROLES           AGE    VERSION
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-7nhgr   Ready    <none>          21d    v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-k7d2f   Ready    <none>          60d    v1.27.5+vmware.1
it-tkg-mgmt-md-0-hvrt5-598fdcffffxl9pdq-p6w76   Ready    <none>          60d    v1.27.5+vmware.1
it-tkg-mgmt-rg44c-c5kbt                         Ready    control-plane   5d4h   v1.27.5+vmware.1
it-tkg-mgmt-rg44c-qfhp2                         Ready    control-plane   60d    v1.27.5+vmware.1
it-tkg-mgmt-rg44c-qz744                         Ready    control-plane   5d4h   v1.27.5+vmware.1
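At that point, ETCD itself could also be queried to confirm that all members were healthy again (a quick sketch, using the same certificates from any control plane node):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key endpoint health --cluster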

Finally, looking at the vSphere environment, I noticed a stale, orphaned control plane machine in the inventory, which CAPV probably couldn’t delete properly due to the cluster issues, so I had to delete it manually, directly from the inventory.

(Screenshot: the orphaned control plane VM in the vSphere inventory)
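If you prefer the CLI over the vSphere UI, govc can also locate and remove such a stale VM. This is just a sketch; it assumes govc is already configured against the same vCenter, and the VM path is a placeholder you would take from your own inventory:

$ govc find / -type m -name 'it-tkg-mgmt-rg44c-*'
$ govc vm.destroy <path-of-the-orphaned-vm>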

It took a while to figure this one out, and it felt like an edge case worth sharing.