<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TKG on Build. Run. Repeat.</title><link>https://buildrunrepeat.com/categories/tkg/</link><description>Recent content in TKG on Build. Run. Repeat.</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 01 May 2024 09:00:00 -0400</lastBuildDate><atom:link href="https://buildrunrepeat.com/categories/tkg/index.xml" rel="self" type="application/rss+xml"/><item><title>Fixing Missing TKRs in Existing TKGS Deployments</title><link>https://buildrunrepeat.com/posts/fixing-missing-tkrs-in-existing-tkgs-deployment/</link><pubDate>Wed, 01 May 2024 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/fixing-missing-tkrs-in-existing-tkgs-deployment/</guid><description>&lt;p&gt;I regularly check the &lt;a href="https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-releases/services/rn/vmware-tanzu-kubernetes-releases-release-notes/index.html"&gt;Tanzu Kubernetes Releases (TKR) release notes page&lt;/a&gt; for new updates.
Yesterday, a new TKR was released with support for Kubernetes 1.28.8. While attempting to test this new version in my TKGS environment, I realized that the TKR was not present, and I started wondering why. Normally, new TKRs become available for deployment immediately upon release, since vCenter is subscribed to the VMware public content library where all the TKRs are hosted. This time, that was not the case, so I started investigating.&lt;/p&gt;</description></item><item><title>CAPV: Addressing Node Provisioning Issues Due to an Invalid State of ETCD</title><link>https://buildrunrepeat.com/posts/capv-addressing-node-provisioning-issues-due-to-invalid-state-of-etcd/</link><pubDate>Fri, 01 Dec 2023 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/capv-addressing-node-provisioning-issues-due-to-invalid-state-of-etcd/</guid><description>&lt;p&gt;I recently ran into a strange scenario on a Kubernetes cluster after a sudden and unexpected crash caused by an issue in the underlying vSphere environment. In this case, the cluster was a TKG cluster (in fact, the TKG management cluster); however, the same situation could have occurred on any cluster managed by Cluster API Provider vSphere (CAPV).&lt;/p&gt;</description></item>
&lt;p&gt;I have seen clusters crash unexpectedly many times before, and most of the time they successfully came back online once all nodes were up and running. In this case, however, some of the nodes could not boot properly, and Cluster API started attempting their reconciliation.&lt;/p&gt;</description></item><item><title>CAPV: Fixing and Cleaning Up Idle vCenter Server Sessions</title><link>https://buildrunrepeat.com/posts/capv-fixing-and-cleaning-up-idle-vcenter-sessions/</link><pubDate>Wed, 01 Nov 2023 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/capv-fixing-and-cleaning-up-idle-vcenter-sessions/</guid><description>&lt;p&gt;I recently ran into an issue causing the vCenter server to crash almost daily. What initially seemed to be a random vCenter issue turned out to be related to CAPV (Cluster API Provider vSphere) running on some of our Kubernetes clusters. It was also an edge case I had not seen before, so I decided to document and share it here.&lt;/p&gt;
&lt;p&gt;Initially, the issue we were witnessing on the vCenter server was the following:&lt;/p&gt;</description></item><item><title>TKG 2.3: Fixing the Prometheus Data Source in the Grafana Package</title><link>https://buildrunrepeat.com/posts/tkg-2-3-fixing-the-prometheus-data-source-in-the-grafana-package/</link><pubDate>Fri, 01 Sep 2023 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/tkg-2-3-fixing-the-prometheus-data-source-in-the-grafana-package/</guid><description>&lt;p&gt;With the release of TKG 2.3, the Grafana package was finally updated from version 7.5.x to 9.5.1.
If you have deployed the new Grafana package (&lt;code&gt;9.5.1+vmware.2-tkg.1&lt;/code&gt;) or upgraded your existing one to this version, you may have run into error messages in your Grafana dashboards.&lt;/p&gt;
&lt;p&gt;For example, in the &lt;code&gt;TKG Kubernetes cluster monitoring&lt;/code&gt; default dashboard, you may have run into the &lt;code&gt;Failed to call resource&lt;/code&gt; error when opening the dashboard and noticed that a lot of the data is missing.&lt;/p&gt;</description></item><item><title>TKG: Updating Pinniped Configuration and Addressing Common Issues</title><link>https://buildrunrepeat.com/posts/tkg-updating-pinniped-config-and-addressing-common-issues/</link><pubDate>Thu, 01 Jun 2023 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/tkg-updating-pinniped-config-and-addressing-common-issues/</guid><description>&lt;p&gt;Most of the TKG engagements I&amp;rsquo;ve been involved in included Pinniped for Kubernetes authentication.
On many occasions, I have seen the configuration provided to Pinniped be incorrect or only partially correct. Common issues relate to the LDAPS integration: many environments I have seen use Active Directory as the authentication source, and Pinniped requires the LDAPS certificate, username, and password, which are often specified incorrectly. Since this configuration is not validated during deployment, you end up with Pinniped in an invalid state on your management cluster.&lt;/p&gt;</description></item><item><title>Streamlining and Customizing Windows Image Builder for TKG</title><link>https://buildrunrepeat.com/posts/streamlining-and-customizing-windows-image-builder-in-tkg/</link><pubDate>Wed, 01 Mar 2023 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/streamlining-and-customizing-windows-image-builder-in-tkg/</guid><description>&lt;p&gt;Tanzu Kubernetes Grid (TKG) is one of the few platforms providing out-of-the-box support and streamlined deployment of Windows Kubernetes clusters. VMware is actively investing in this area and constantly improving the support and capabilities around Windows on Kubernetes.&lt;/p&gt;
&lt;p&gt;Unlike Linux-based clusters, for which VMware provides pre-packaged base OS images (typically based on Ubuntu and Photon OS), VMware cannot offer pre-packaged Windows images, presumably due to licensing restrictions. Therefore, building your own Windows base OS image is one of the prerequisites for deploying a TKG Windows workload cluster.
Fortunately, VMware leverages the &lt;a href="https://github.com/kubernetes-sigs/image-builder"&gt;upstream Image Builder project&lt;/a&gt; - a fantastic collection of cross-provider Kubernetes virtual machine image-building utilities intended to simplify and streamline the creation of base OS images for Kubernetes.&lt;/p&gt;</description></item><item><title>Tanzu Kubernetes Grid GPU Integration</title><link>https://buildrunrepeat.com/posts/tkg-gpu-integration/</link><pubDate>Wed, 01 Mar 2023 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/tkg-gpu-integration/</guid><description>&lt;p&gt;I recently had to demonstrate Tanzu Kubernetes Grid and its GPU integration capabilities.
Developing a good use case and assembling the demo required some preliminary research.&lt;/p&gt;
&lt;p&gt;During my research, I reached out to Jay Vyas, staff engineer at VMware, SIG Windows lead for Kubernetes, a Kubernetes legend, and an awesome guy in general. :) For those who don&amp;rsquo;t know Jay, he is also one of the authors of the fantastic book &lt;code&gt;Core Kubernetes&lt;/code&gt; (look it up!).&lt;/p&gt;</description></item><item><title>Getting Started with Carvel ytt - Real-World Examples</title><link>https://buildrunrepeat.com/posts/getting-started-with-carvel-ytt-real-world-examples/</link><pubDate>Sun, 01 Jan 2023 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/getting-started-with-carvel-ytt-real-world-examples/</guid><description>&lt;p&gt;Over the years of working with Tanzu Kubernetes Grid (TKG), one tool has stood out as a game-changer for resource customization: Carvel’s ytt. Whether tailoring cluster manifests, customizing TKG packages, or addressing unique deployment requirements, ytt has consistently been a fundamental part of the workflow. Its flexibility, power, and declarative approach make it an essential tool for anyone working deeply with Kubernetes in a TKG ecosystem.&lt;/p&gt;
&lt;p&gt;But what exactly is ytt? Short for &lt;code&gt;YAML Templating Tool&lt;/code&gt;, ytt is part of the Carvel suite of tools designed for Kubernetes resource management. It provides a powerful, programmable approach to templating YAML configurations by combining straightforward data values, overlays, and scripting capabilities. Unlike many traditional templating tools, ytt prioritizes structure and intent, making it easier to maintain, validate, and debug configurations—particularly in complex, large-scale Kubernetes environments.&lt;/p&gt;</description></item><item><title>Replacing your vCenter server certificate? TKG needs to know about it…</title><link>https://buildrunrepeat.com/posts/replacing-your-vcenter-server-certificate-tkg-needs-to-know-about-it/</link><pubDate>Sun, 01 Jan 2023 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/replacing-your-vcenter-server-certificate-tkg-needs-to-know-about-it/</guid><description>&lt;p&gt;I recently ran into an issue where TKGm had suddenly failed to connect to the vCenter server.&lt;/p&gt;
&lt;p&gt;The issue turned out to be TLS-related, and I noticed that the vCenter server certificate had been replaced&amp;hellip;&lt;/p&gt;
&lt;p&gt;Due to the certificate issue, Cluster API components failed to communicate with vSphere, causing cluster reconciliation to fail, among other vSphere-related operations.&lt;/p&gt;
&lt;p&gt;Since all TKG clusters in the environment were deployed with the &lt;code&gt;VSPHERE_TLS_THUMBPRINT&lt;/code&gt; parameter specified, replacing the vCenter certificate breaks the connection to vSphere, since the TLS thumbprint changes as well.&lt;/p&gt;</description></item><item><title>Upgrading NSX ALB in a TKG Environment</title><link>https://buildrunrepeat.com/posts/upgrading-nsx-alb-in-a-tkg-environment/</link><pubDate>Thu, 01 Sep 2022 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/upgrading-nsx-alb-in-a-tkg-environment/</guid><description>&lt;p&gt;For quite a long time, the highest version of NSX ALB that TKG supported was &lt;code&gt;20.1.6/20.1.3&lt;/code&gt;, although &lt;code&gt;21.1.x&lt;/code&gt; had been available for a while, and I had been wondering when TKG would support it.
In the release notes of TKG &lt;code&gt;1.5.4&lt;/code&gt;, I recently noticed a note regarding NSX ALB &lt;code&gt;21.1.x&lt;/code&gt; that had been added under the &lt;code&gt;Configuration variables&lt;/code&gt; section:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;code&gt;AVI_CONTROLLER_VERSION&lt;/code&gt; sets the NSX Advanced Load Balancer (ALB) version for NSX ALB v21.1.x deployments in Tanzu Kubernetes Grid.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Customizing Elasticsearch indices using Fluent-Bit in TKG</title><link>https://buildrunrepeat.com/posts/customizing-elasticsearch-indices-using-fluent-bit-in-tkg/</link><pubDate>Mon, 01 Aug 2022 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/customizing-elasticsearch-indices-using-fluent-bit-in-tkg/</guid><description>&lt;p&gt;Fluent-Bit is currently the preferred option for log shipping in TKG and is provided out of the box as a Tanzu package that can be easily deployed on each TKG/Kubernetes cluster.&lt;/p&gt;
&lt;p&gt;A recent implementation required shipping all Kubernetes logs to Elasticsearch, complying with a specific naming convention for the Elasticsearch indices.&lt;/p&gt;
&lt;p&gt;Applying such customizations requires you to utilize the &lt;a href="https://docs.fluentbit.io/manual/pipeline/filters/lua"&gt;Lua filter&lt;/a&gt;. Using the Lua filter, you can modify incoming records by invoking custom scripts to apply your logic when processing the records.&lt;/p&gt;</description></item><item><title>Getting Harbor to trust your LDAPS certificate in TKG</title><link>https://buildrunrepeat.com/posts/getting-harbor-to-trust-your-ldaps-certificate-in-tkg/</link><pubDate>Mon, 01 Aug 2022 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/getting-harbor-to-trust-your-ldaps-certificate-in-tkg/</guid><description>&lt;p&gt;In a recent TKG implementation, it was required to configure Harbor with LDAPS rather than LDAP.&lt;/p&gt;
&lt;p&gt;I deployed the Harbor package on the TKG shared services cluster and configured LDAP. However, when testing the connection, I received an error message that was not informative at all:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;Failed to verify LDAP server with error: error: ldap server network timeout.
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;
 &lt;img alt="Screenshot" src="https://buildrunrepeat.com/posts/getting-harbor-to-trust-your-ldaps-certificate-in-tkg/images/001.png"/&gt;
&lt;/p&gt;
&lt;p&gt;Although the error message doesn&amp;rsquo;t explicitly mention a certificate issue, and there is nothing in the &lt;code&gt;harbor-core&lt;/code&gt; container logs, it immediately made sense to me that the &lt;code&gt;harbor-core&lt;/code&gt; container didn&amp;rsquo;t trust my LDAPS/CA certificate. So I started investigating how the certificate could be injected into Harbor. The Harbor package doesn&amp;rsquo;t have any input for the LDAPS/CA certificate in its data values file, so I knew I had to create &lt;a href="https://github.com/itaytalmi/vmware-tkg/blob/main/ytt-overlays/tkg-packages/harbor/ldaps-overlay/overlay-harbor-ldaps-cert.yaml"&gt;my own YTT overlay&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Getting kapp-controller to trust your CA certificates in TKG</title><link>https://buildrunrepeat.com/posts/getting-kapp-controller-to-trust-your-ca-certificates-in-tkg/</link><pubDate>Mon, 01 Aug 2022 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/getting-kapp-controller-to-trust-your-ca-certificates-in-tkg/</guid><description>&lt;p&gt;Have you ever had to deploy a package using kapp-controller from your Harbor private registry?&lt;/p&gt;
&lt;p&gt;I recently deployed the Tanzu RabbitMQ package to a TKGm workload cluster in an air-gapped/internet-restricted environment.&lt;/p&gt;
&lt;p&gt;Doing so in air-gapped environments requires you to push the packages into Harbor, then have kapp-controller deploy the package from Harbor.&lt;/p&gt;
&lt;p&gt;After adding the PackageRepository referencing my Harbor registry, I observed it couldn&amp;rsquo;t complete reconciling due to a certificate issue.&lt;/p&gt;</description></item><item><title>Is your TKG cluster name too long, or is it your DHCP Server…?</title><link>https://buildrunrepeat.com/posts/is-your-tkg-cluster-name-too-long-or-is-it-your-dhcp-server/</link><pubDate>Mon, 01 Aug 2022 09:00:00 -0400</pubDate><guid>https://buildrunrepeat.com/posts/is-your-tkg-cluster-name-too-long-or-is-it-your-dhcp-server/</guid><description>&lt;p&gt;Recently, when working on a TKGm implementation project, I initially ran into an issue that seemed very odd, as I hadn&amp;rsquo;t encountered such behavior in any other implementation before.&lt;/p&gt;
&lt;p&gt;The issue was that a workload cluster deployment hung after deploying the first control plane node. Until then, everything had seemed just fine: the cluster deployment had successfully initialized, and NSX ALB had allocated a control plane VIP. After that, however, the deployment hung completely and seemed like it wouldn&amp;rsquo;t proceed.&lt;/p&gt;</description></item></channel></rss>