TKG 2.3: Fixing the Prometheus Data Source in the Grafana Package

With the release of TKG 2.3, the Grafana package was finally updated from version 7.5.x to 9.5.1. If you have deployed the new Grafana package (9.5.1+vmware.2-tkg.1) or upgraded your existing one to this version, you may have run into error messages in your Grafana dashboards.

For example, in the TKG Kubernetes cluster monitoring default dashboard, you may have encountered the "Failed to call resource" error when opening the dashboard and noticed that much of the data is missing.

Screenshot

In other dashboards, such as the Kubernetes / API Server dashboard, no errors appear, but the data is still missing.

Screenshot

Customized/non-default dashboards may also show similar symptoms.

I started investigating this issue by looking at the default Prometheus Data Source and immediately noticed an error in the URL parameter under the HTTP section.

Screenshot

Screenshot

Testing the connection to the Data Source, I received the following error message:

Error reading Prometheus: parse "prometheus-server.tanzu-system-monitoring.svc.cluster.local": invalid URI for request

Screenshot
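
You can also see the malformed URL directly in the grafana-datasource ConfigMap rendered by the package (tanzu-system-dashboards is the package's default namespace; adjust if you deployed Grafana elsewhere):

kubectl -n tanzu-system-dashboards get configmap grafana-datasource -o yaml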

Looking at the official Grafana documentation, I realized that newer Grafana versions require the full URL to the Prometheus instance, including the http:// or https:// scheme before the IP address/FQDN.

Reference: https://grafana.com/docs/grafana/v9.5/datasources/prometheus/#provisioning-example
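
For comparison, the provisioning example in the linked documentation specifies the scheme as part of the url field (localhost:9090 is just the placeholder address used by the docs):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://localhost:9090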

I then retrieved the configuration bundle used by the Grafana package with the following commands.

IMAGE_URL=$(kubectl -n tkg-system get packages grafana.tanzu.vmware.com.9.5.1+vmware.2-tkg.1 -o jsonpath='{.spec.template.spec.fetch[0].imgpkgBundle.image}')
echo $IMAGE_URL

imgpkg pull -b $IMAGE_URL -o /tmp/tkg-grafana
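
The location of values.yaml inside the pulled bundle can differ between package versions, so the easiest way to locate it is to search the output directory:

find /tmp/tkg-grafana -type f -name values.yaml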

Next, I looked at the default configuration specified in the values.yaml file of the package. Under grafana.config.datasource_yaml, I found the default URL of the Prometheus Data Source:

grafana:
  config:
    datasource_yaml: |-
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: prometheus-server.tanzu-system-monitoring.svc.cluster.local
          access: proxy
          isDefault: true

The default Data Source configuration omits the http:// or https:// scheme before the Prometheus FQDN. Older Grafana versions tolerated this, but newer versions require a full URL, so the Data Source configuration shipped with the Grafana package is no longer valid for modern Grafana versions.
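
If you want to confirm that the Prometheus service actually answers over plain HTTP before changing anything, a throwaway curl Pod does the trick. This assumes the Prometheus package exposes its service on port 80 (the package default) and uses Prometheus' standard /-/healthy endpoint:

# One-off Pod that is removed after the command completes
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://prometheus-server.tanzu-system-monitoring.svc.cluster.local/-/healthy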

To address this issue, I added grafana.config.datasource_yaml with the correct Prometheus URL to my values.yaml overrides as follows:

grafana:
  secret:
    admin_password: Vk13YXJlMSE=

  config:
    datasource_yaml: |-
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-server.tanzu-system-monitoring.svc.cluster.local
          access: proxy
          isDefault: true
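
Note that the admin_password value must be base64-encoded, as in the example above. You can encode your own password like this ('YourPasswordHere' is a placeholder):

echo -n 'YourPasswordHere' | base64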

I then updated the Grafana package on my cluster using the new data values.

tanzu package install grafana \
  --package "$PKG_NAME" \
  --version "$PKG_VERSION" \
  --values-file grafana-data-values.yaml \
  --namespace tkg-packages
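
For completeness, the $PKG_NAME and $PKG_VERSION variables in the command above correspond to the package name and version from the Package CR we queried earlier:

# Package name and version, taken from the Package CR name
PKG_NAME="grafana.tanzu.vmware.com"
PKG_VERSION="9.5.1+vmware.2-tkg.1"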

In the output, you should see that the grafana-datasource ConfigMap is updated due to this change.

8:25:53PM: Pausing reconciliation for package installation 'grafana' in namespace 'tkg-packages'
8:25:54PM: Updating secret 'grafana-tkg-packages-values'
8:25:54PM: Creating overlay secrets
8:25:54PM: Resuming reconciliation for package installation 'grafana' in namespace 'tkg-packages'
8:25:54PM: Waiting for PackageInstall reconciliation for 'grafana'
8:25:54PM: Waiting for generation 3 to be observed
8:25:56PM: Fetching
            | apiVersion: vendir.k14s.io/v1alpha1
            | directories:
            | - contents:
            |   - imgpkgBundle:
            |       image: projects.registry.vmware.com/tkg/packages/standard/grafana@sha256:7e9225bb461b470534f347a7990437c01956a603f916d0214159ad7634db08b2
            |     path: .
            |   path: "0"
            | kind: LockConfig
            |
8:25:56PM: Fetch succeeded
8:25:56PM: Template succeeded
8:25:56PM: Deploy started (2s ago)
8:25:58PM: Deploying
            | Target cluster 'https://100.64.0.1:443' (nodes: it-tkg-wld-02-npm26-n7lrs, 5+)
            | Changes
            | Namespace                Name                Kind       Age  Op      Op st.  Wait to    Rs  Ri
            | tanzu-system-dashboards  grafana-datasource  ConfigMap  2h   update  -       reconcile  ok  -
            | Op:      0 create, 0 delete, 1 update, 0 noop, 0 exists
            | Wait to: 1 reconcile, 0 delete, 0 noop
            | 8:25:59PM: ---- applying 1 changes [0/1 done] ----
            | 8:25:59PM: update configmap/grafana-datasource (v1) namespace: tanzu-system-dashboards
            | 8:25:59PM: ---- waiting on 1 changes [0/1 done] ----
            | 8:25:59PM: ok: reconcile configmap/grafana-datasource (v1) namespace: tanzu-system-dashboards
            | 8:25:59PM: ---- applying complete [1/1 done] ----
            | 8:25:59PM: ---- waiting complete [1/1 done] ----
            | Succeeded
8:25:59PM: Deploy succeeded
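
To confirm that the rendered Data Source now includes the scheme, you can grep the updated ConfigMap for the url field:

kubectl -n tanzu-system-dashboards get configmap grafana-datasource -o yaml | grep 'url:'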

You may have to delete the Grafana Pod so that a new one is created and mounts the updated configuration from the ConfigMap.

kubectl delete pod -l app.kubernetes.io/name=grafana -n tanzu-system-dashboards
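
Alternatively, you can restart the Deployment instead of deleting the Pod directly (assuming the Deployment is named grafana, matching the Pod name shown below):

kubectl -n tanzu-system-dashboards rollout restart deployment grafana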

Ensure that the new Pod is running.

kubectl get pod -n tanzu-system-dashboards

Example output:

NAME                       READY   STATUS    RESTARTS   AGE
grafana-58bf6bbb6c-vmvzx   2/2     Running   0          89s
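
You can also verify Grafana from the command line by port-forwarding to its Service and querying the health endpoint (this assumes the Service is named grafana and listens on port 80; adjust if yours differs):

# Forward local port 3000 to the Grafana Service
kubectl -n tanzu-system-dashboards port-forward svc/grafana 3000:80 &
# Query Grafana's health endpoint; "database": "ok" indicates a healthy instance
curl -s http://localhost:3000/api/health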

If you refresh the Grafana UI and view the Prometheus Data Source, you should see the updated URL with no errors.

Screenshot

You should also be able to test the connection to the Data Source successfully.

Screenshot

Since the Data Source is now valid, your dashboards should start displaying data properly.

Screenshot

I have reported the issue in the Grafana package to VMware. Hopefully, a future version of the package will fix it, but until then, the workaround provided in this post is fairly easy to implement.