Upgrading NSX ALB in a TKG Environment

2022-09-01 8 min read Cloud Native Kubernetes NSX ALB Tanzu TKG

For quite a long time, the highest NSX ALB version supported by TKG was 20.1.6/20.1.3, even though 21.1.x had been available for a while, and I had been wondering when TKG would support it. In the release notes of TKG 1.5.4, I recently noticed a note that was added regarding NSX ALB 21.1.x under the Configuration variables section:

AVI_CONTROLLER_VERSION sets the NSX Advanced Load Balancer (ALB) version for NSX ALB v21.1.x deployments in Tanzu Kubernetes Grid.

However, I couldn’t find any official reference for upgrading an existing NSX ALB instance in a TKG environment, so I did some research of my own. I found two references to the NSX ALB controller version on my TKG management cluster:

  • In the AKO Operator Add-on secret:

    kubectl get secret tkg-mgmt-cls-ako-operator-addon -n tkg-system -o jsonpath='{.data.values\.yaml}' | base64 -d
    
    #@data/values
    #@overlay/match-child-defaults missing_ok=True
    ---
    akoOperator:
      avi_enable: true
      namespace: tkg-system-networking
      cluster_name: tkg-mgmt-cls
      config:
        avi_disable_ingress_class: true
        avi_ingress_default_ingress_controller: false
        avi_ingress_shard_vs_size: ""
        avi_ingress_service_type: ""
        avi_ingress_node_network_list: '""'
        avi_admin_credential_name: avi-controller-credentials
        avi_ca_name: avi-controller-ca
        avi_controller: it-nsxalb-ctrl.terasky.demo
        avi_username: admin
        avi_password: sample-password
        avi_cloud_name: Default-Cloud
        avi_service_engine_group: Default-Group
        avi_management_cluster_service_engine_group: Default-Group
        avi_data_network: k8s-vips
        avi_data_network_cidr: 10.100.154.0/24
        avi_control_plane_network: k8s-vips
        avi_control_plane_network_cidr: 10.100.154.0/24
        avi_ca_data_b64: LS0tLS1CRUdJTiBDR......
        avi_labels: '""'
        avi_disable_static_route_sync: true
        avi_cni_plugin: antrea
        avi_control_plane_ha_provider: true
        avi_management_cluster_vip_network_name: k8s-vips
        avi_management_cluster_vip_network_cidr: 10.100.154.0/24
        avi_management_cluster_control_plane_vip_network_name: k8s-vips
        avi_management_cluster_control_plane_vip_network_cidr: 10.100.154.0/24
        avi_control_plane_endpoint_port: 6443
        avi_controller_version: 20.1.3 # The NSX ALB controller version
    
  • In the AKO Deployment Configs:

    kubectl get akodeploymentconfigs.networking.tkg.tanzu.vmware.com install-ako-for-all -o yaml
    # And
    kubectl get akodeploymentconfigs.networking.tkg.tanzu.vmware.com install-ako-for-management-cluster -o yaml
    
    apiVersion: networking.tkg.tanzu.vmware.com/v1alpha1
    kind: AKODeploymentConfig
    metadata:
      ...
      name: install-ako-for-all
      ...
    spec:
      adminCredentialRef:
        name: avi-controller-credentials
        namespace: tkg-system-networking
      certificateAuthorityRef:
        name: avi-controller-ca
        namespace: tkg-system-networking
      cloudName: Default-Cloud
      controlPlaneNetwork:
        cidr: 10.100.154.0/24
        name: k8s-vips
      controller: it-nsxalb-ctrl.terasky.demo
      controllerVersion: 20.1.3 # The NSX ALB controller version
      dataNetwork:
        cidr: 10.100.154.0/24
        name: k8s-vips
      extraConfigs:
        cniPlugin: antrea
        disableStaticRouteSync: true
        ingress:
          defaultIngressController: false
          disableIngressClass: true
        l4Config:
          autoFQDN: disabled
        networksConfig: {}
      serviceEngineGroup: Default-Group
    

As you can see, both reference the default NSX ALB controller version, which is 20.1.3. Since the AKO Operator uses the 20.1.3 API when interacting with NSX ALB, I realized that upgrading NSX ALB without updating the AKO Operator to match the new version might break compatibility between the two, so I came up with a shell script. The script takes two inputs - the TKG management cluster name and the target NSX ALB controller version - and patches the AKO Operator Add-on secret and the AKO Deployment Configs using kubectl. The script is executed only on the management cluster. Once the AKO Operator is updated and the ako-operator, load-balancer-and-ingress-service, and tanzu-addons-manager packages are reconciled, kapp-controller applies the new configuration to all workload clusters.
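In essence, the update boils down to two kubectl patch operations. Below is a minimal sketch of that logic; the function names and defaults are mine, the resource names come from the output above, and the actual script on GitHub (which also handles tanzu login and context switching) remains the authoritative version.

```shell
#!/bin/sh
# Hedged sketch of the patching logic; requires kubectl access to the
# TKG management cluster context.

CLUSTER_NAME="${1:-tkg-mgmt-cls}"
NEW_VERSION="${2:-21.1.4}"

# Rewrite the avi_controller_version line in a decoded values.yaml stream.
bump_version() {
  sed -E "s/^([[:space:]]*avi_controller_version:).*/\1 ${NEW_VERSION}/"
}

# Patch the AKO Operator Add-on secret with a re-encoded values.yaml.
patch_addon_secret() {
  secret="${CLUSTER_NAME}-ako-operator-addon"
  values=$(kubectl get secret "$secret" -n tkg-system \
    -o jsonpath='{.data.values\.yaml}' | base64 -d | bump_version)
  kubectl patch secret "$secret" -n tkg-system --type merge \
    -p "{\"data\":{\"values.yaml\":\"$(printf '%s' "$values" | base64 | tr -d '\n')\"}}"
}

# Patch spec.controllerVersion on both AKODeploymentConfig resources.
patch_adcs() {
  for adc in install-ako-for-all install-ako-for-management-cluster; do
    kubectl patch akodeploymentconfigs.networking.tkg.tanzu.vmware.com "$adc" \
      --type merge -p "{\"spec\":{\"controllerVersion\":\"${NEW_VERSION}\"}}"
  done
}
```

Calling patch_addon_secret followed by patch_adcs against the management cluster is the core of it; the package reconciliation afterwards is what propagates the change.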

Upgrade Instructions

For this example, I am upgrading a 3-node NSX ALB cluster from version 20.1.x to 21.1.4.

First, obtain the relevant upgrade package from https://portal.avipulse.vmware.com/software/vantage. It is typically a .pkg file.

Before starting the upgrade, log in to any of the NSX ALB controllers, go to Administration > Controller > Nodes, and ensure all NSX ALB controllers are healthy and active.
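If you prefer to double-check from the CLI as well, the controller cluster state can be queried over the REST API. This is a hedged sketch: it assumes the Avi /api/cluster/runtime endpoint and basic-auth admin credentials; adjust the hostname and credentials to your environment.

```shell
#!/bin/sh
# Hypothetical CLI health check against the NSX ALB controller API.
# Assumes the /api/cluster/runtime endpoint; adjust host and credentials.

CONTROLLER="it-nsxalb-ctrl.terasky.demo"

# Pull the overall cluster state out of the runtime JSON.
cluster_state() {
  python3 -c 'import sys, json; print(json.load(sys.stdin)["cluster_state"]["state"])'
}

# Usage (requires network access to the controller):
#   curl -sk -u admin:'<password>' "https://${CONTROLLER}/api/cluster/runtime" | cluster_state
# A healthy HA cluster typically reports a state such as CLUSTER_UP_HA_ACTIVE.
```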

Screenshot

Go to Administration > Controller > Software, click Upload From Computer, and select your upgrade file (e.g. controller-21.1.4-2p3-9009.pkg).

Screenshot

Wait for the upload to complete.

Screenshot

Screenshot

Once the upload completes, go to Administration > Controller > System Update, select the new version you have just uploaded at the bottom, and click Upgrade.

Screenshot

In the upgrade dialog, ensure the Upgrade All Service Engine Groups option is selected and keep the defaults, then click Continue.

Screenshot

Review any warnings raised by the pre-checks. It is usually safe to proceed. Click Confirm when ready.

Screenshot

Wait for the upgrade to complete.

Screenshot

The NSX ALB cluster VIP will be unavailable during the upgrade. However, if you wish to monitor the upgrade process, you can browse to any of the controllers. You should see the Upgrade in progress page.

Screenshot

Once the upgrade completes, you can access the UI from the cluster VIP address and observe the status of the nodes under Administration > Controller > Nodes. Ensure all nodes are active.

Screenshot

If you upgraded to version 21.x.x, you have probably noticed the new UI. As of NSX ALB 21.x.x, the UI is Clarity-based, like many other VMware products these days.

Updating TKG Configuration

Now that NSX ALB is upgraded, you must update the TKG configuration to reflect the version you upgraded to.

As mentioned before, you can do so using my script, which is available on GitHub.

Clone my TKG GitHub repository and cd into vmware-tkg/helpers/tkg-nsxalb-upgrade.

Ensure the script is executable on your machine.

chmod +x tkg-update-nsxalb-version.sh

Execute the script using the following syntax:

./tkg-update-nsxalb-version.sh <TKG_MGMT_CLUSTER_NAME> <NSXALB_CONTROLLER_VERSION>

For example:

./tkg-update-nsxalb-version.sh tkg-mgmt-cls '21.1.4'

Example output:

Base directory: .
✔  successfully logged in to management cluster using the kubeconfig tkg-mgmt-cls
Checking for required plugins...
All required plugins are already installed and up-to-date
Tanzu context tkg-mgmt-cls has been set
Setting kubectl context
Switched to context "tkg-mgmt-cls-admin@tkg-mgmt-cls".
kubectl context tkg-mgmt-cls-admin@tkg-mgmt-cls has been set
Patching AKO Operator config
secret/tkg-mgmt-cls-ako-operator-addon patched
Patching AKODeploymentConfig resources
Patching resource 'akodeploymentconfig.networking.tkg.tanzu.vmware.com/install-ako-for-all'
akodeploymentconfig.networking.tkg.tanzu.vmware.com/install-ako-for-all patched
Patching resource 'akodeploymentconfig.networking.tkg.tanzu.vmware.com/install-ako-for-management-cluster'
akodeploymentconfig.networking.tkg.tanzu.vmware.com/install-ako-for-management-cluster patched
Target cluster 'https://10.100.154.230:6443' (nodes: tkg-mgmt-cls-control-plane-cvcmd, 5+)

App 'ako-operator' is owned by 'PackageInstall/ako-operator'
Triggering reconciliation for app 'ako-operator' in namespace 'tkg-system'
2:03:00AM: Triggering reconciliation for app 'ako-operator' in namespace 'tkg-system'
2:03:00AM: Waiting for app reconciliation for 'ako-operator'
2:03:56AM: Fetching
            | apiVersion: vendir.k14s.io/v1alpha1
            | directories:
            | - contents:
            |   - imgpkgBundle:
            |       image: projects.registry.vmware.com/tkg/packages/core/ako-operator@sha256:f1fd17e8de5b92f66c566050c557fd688ffe75205a97d2a646569d3587108462
            |     path: .
            |   path: "0"
            | kind: LockConfig
            |
2:03:56AM: Fetch succeeded
2:03:56AM: Template succeeded
2:03:56AM: Deploy started (2s ago)
2:04:16AM: Deploying
            | Target cluster 'https://100.64.0.1:443' (nodes: tkg-mgmt-cls-control-plane-cvcmd, 5+)
            | Changes
            | Namespace  Name                                                  Kind                      Conds.  Age  Op  Op st.  Wait to    Rs       Ri
            | (cluster)  akodeploymentconfigs.networking.tkg.tanzu.vmware.com  CustomResourceDefinition  0/0 t   3h   -   -       reconcile  ongoing  Condition Established is not set
            | Op:      0 create, 0 delete, 0 update, 1 noop
            | Wait to: 1 reconcile, 0 delete, 0 noop
            | 2:04:14AM: ---- applying 1 changes [0/1 done] ----
            | 2:04:14AM: noop customresourcedefinition/akodeploymentconfigs.networking.tkg.tanzu.vmware.com (apiextensions.k8s.io/v1) cluster
            | 2:04:14AM: ---- waiting on 1 changes [0/1 done] ----
2:04:18AM: Deploying
            | Target cluster 'https://100.64.0.1:443' (nodes: tkg-mgmt-cls-control-plane-cvcmd, 5+)
            | Changes
            | Namespace  Name                                                  Kind                      Conds.  Age  Op  Op st.  Wait to    Rs       Ri
            | (cluster)  akodeploymentconfigs.networking.tkg.tanzu.vmware.com  CustomResourceDefinition  0/0 t   3h   -   -       reconcile  ongoing  Condition Established is not set
            | Op:      0 create, 0 delete, 0 update, 1 noop
            | Wait to: 1 reconcile, 0 delete, 0 noop
            | 2:04:14AM: ---- applying 1 changes [0/1 done] ----
            | 2:04:14AM: noop customresourcedefinition/akodeploymentconfigs.networking.tkg.tanzu.vmware.com (apiextensions.k8s.io/v1) cluster
            | 2:04:14AM: ---- waiting on 1 changes [0/1 done] ----
            | 2:04:18AM: ok: reconcile customresourcedefinition/akodeploymentconfigs.networking.tkg.tanzu.vmware.com (apiextensions.k8s.io/v1) cluster
            | 2:04:18AM: ---- applying complete [1/1 done] ----
            | 2:04:18AM: ---- waiting complete [1/1 done] ----
            | Succeeded
2:04:18AM: Deploy succeeded (1s ago)

Succeeded
Target cluster 'https://10.100.154.230:6443' (nodes: tkg-mgmt-cls-control-plane-cvcmd, 5+)

App 'load-balancer-and-ingress-service' is owned by 'PackageInstall/load-balancer-and-ingress-service'
Triggering reconciliation for app 'load-balancer-and-ingress-service' in namespace 'tkg-system'
2:04:19AM: Triggering reconciliation for app 'load-balancer-and-ingress-service' in namespace 'tkg-system'
2:04:19AM: Waiting for app reconciliation for 'load-balancer-and-ingress-service'
2:04:28AM: Fetching
            | apiVersion: vendir.k14s.io/v1alpha1
            | directories:
            | - contents:
            |   - imgpkgBundle:
            |       image: projects.registry.vmware.com/tkg/packages/core/load-balancer-and-ingress-service@sha256:10bbc6abb07ea096ca82924ad0d44881d4c076131751c46c0dc64b1b57275423
            |     path: .
            |   path: "0"
            | kind: LockConfig
            |
2:04:28AM: Fetch succeeded
2:04:28AM: Template succeeded
2:04:28AM: Deploy started (2s ago)
2:04:48AM: Deploying
            | Target cluster 'https://100.64.0.1:443' (nodes: tkg-mgmt-cls-control-plane-cvcmd, 5+)
            | Changes
            | Namespace  Name                                Kind                      Conds.  Age  Op  Op st.  Wait to    Rs       Ri
            | (cluster)  gatewayclasses.networking.x-k8s.io  CustomResourceDefinition  0/0 t   3h   -   -       reconcile  ongoing  Condition Established is not set
            | ^          gateways.networking.x-k8s.io        CustomResourceDefinition  0/0 t   3h   -   -       reconcile  ongoing  Condition Established is not set
            | Op:      0 create, 0 delete, 0 update, 2 noop
            | Wait to: 2 reconcile, 0 delete, 0 noop
            | 2:04:47AM: ---- applying 2 changes [0/2 done] ----
            | 2:04:47AM: noop customresourcedefinition/gatewayclasses.networking.x-k8s.io (apiextensions.k8s.io/v1) cluster
            | 2:04:47AM: noop customresourcedefinition/gateways.networking.x-k8s.io (apiextensions.k8s.io/v1) cluster
            | 2:04:47AM: ---- waiting on 2 changes [0/2 done] ----
2:04:50AM: Deploying
            | Target cluster 'https://100.64.0.1:443' (nodes: tkg-mgmt-cls-control-plane-cvcmd, 5+)
            | Changes
            | Namespace  Name                                Kind                      Conds.  Age  Op  Op st.  Wait to    Rs       Ri
            | (cluster)  gatewayclasses.networking.x-k8s.io  CustomResourceDefinition  0/0 t   3h   -   -       reconcile  ongoing  Condition Established is not set
            | ^          gateways.networking.x-k8s.io        CustomResourceDefinition  0/0 t   3h   -   -       reconcile  ongoing  Condition Established is not set
            | Op:      0 create, 0 delete, 0 update, 2 noop
            | Wait to: 2 reconcile, 0 delete, 0 noop
            | 2:04:47AM: ---- applying 2 changes [0/2 done] ----
            | 2:04:47AM: noop customresourcedefinition/gatewayclasses.networking.x-k8s.io (apiextensions.k8s.io/v1) cluster
            | 2:04:47AM: noop customresourcedefinition/gateways.networking.x-k8s.io (apiextensions.k8s.io/v1) cluster
            | 2:04:47AM: ---- waiting on 2 changes [0/2 done] ----
            | 2:04:50AM: ok: reconcile customresourcedefinition/gatewayclasses.networking.x-k8s.io (apiextensions.k8s.io/v1) cluster
            | 2:04:50AM: ok: reconcile customresourcedefinition/gateways.networking.x-k8s.io (apiextensions.k8s.io/v1) cluster
            | 2:04:50AM: ---- applying complete [2/2 done] ----
            | 2:04:50AM: ---- waiting complete [2/2 done] ----
            | Succeeded
2:04:50AM: Deploy succeeded (1s ago)

Succeeded
Target cluster 'https://10.100.154.230:6443' (nodes: tkg-mgmt-cls-control-plane-cvcmd, 5+)

App 'tanzu-addons-manager' is owned by 'PackageInstall/tanzu-addons-manager'
Triggering reconciliation for app 'tanzu-addons-manager' in namespace 'tkg-system'
2:04:51AM: Triggering reconciliation for app 'tanzu-addons-manager' in namespace 'tkg-system'
2:04:52AM: Waiting for app reconciliation for 'tanzu-addons-manager'
2:04:52AM: Waiting for generation 4 to be observed
2:05:01AM: Fetching
            | apiVersion: vendir.k14s.io/v1alpha1
            | directories:
            | - contents:
            |   - imgpkgBundle:
            |       image: projects.registry.vmware.com/tkg/packages/core/addons-manager@sha256:248662abcbf966fdda0b342906a6b70c19f94459f8f4d6a8d78210e6ae23c694
            |     path: .
            |   path: "0"
            | kind: LockConfig
            |
2:05:01AM: Fetch succeeded
2:05:01AM: Template succeeded
2:05:01AM: Deploy started (2s ago)
2:05:19AM: Deploying
            | Target cluster 'https://100.64.0.1:443' (nodes: tkg-mgmt-cls-control-plane-cvcmd, 5+)
            | Changes
            | Namespace  Name  Kind  Conds.  Age  Op  Op st.  Wait to  Rs  Ri
            | Op:      0 create, 0 delete, 0 update, 0 noop
            | Wait to: 0 reconcile, 0 delete, 0 noop
            | Succeeded
2:05:19AM: Deploy succeeded (2s ago)

Succeeded

Done!

That’s it. If you now look at the AKO Operator Add-on secret and the AKO Deployment Configs, you will see that the controller version has been updated. You can do so using the same commands I mentioned before:

kubectl get secret tkg-mgmt-cls-ako-operator-addon -n tkg-system -o jsonpath='{.data.values\.yaml}' | base64 -d
kubectl get akodeploymentconfigs.networking.tkg.tanzu.vmware.com install-ako-for-all -o yaml
kubectl get akodeploymentconfigs.networking.tkg.tanzu.vmware.com install-ako-for-management-cluster -o yaml

For new TKG management clusters, set the AVI_CONTROLLER_VERSION parameter to the NSX ALB controller version in your cluster config YAML file, for example AVI_CONTROLLER_VERSION: 21.1.4. According to the TKG 1.6 release notes, this is no longer required, as the NSX ALB version is detected automatically.
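For reference, the relevant lines in a management cluster configuration file might look like this (illustrative values, shown alongside a few of the other AVI_* variables):

```yaml
# Illustrative excerpt from a TKG management cluster config file.
AVI_ENABLE: "true"
AVI_CONTROLLER: it-nsxalb-ctrl.terasky.demo
AVI_CONTROLLER_VERSION: 21.1.4
AVI_CLOUD_NAME: Default-Cloud
AVI_SERVICE_ENGINE_GROUP: Default-Group
```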

I hope this helps anyone looking to upgrade NSX ALB in a TKG environment, and hopefully, this process will be automated within TKG at some point.