CAPV: Fixing and Cleaning Up Idle vCenter Server Sessions
I recently ran into an issue that caused our vCenter server to crash almost daily. What initially seemed to be a random vCenter issue turned out to be related to CAPV (Cluster API Provider vSphere) running on some of our Kubernetes clusters. It was an edge case I had not seen before, so I decided to document and share it here.
Initially, the issue we were witnessing on the vCenter server was the following:
Could not connect to one or more vCenter Server systems: https://VCENTER_HOSTNAME:443/sdk
Investigating the vCenter server logs, we found that there were too many open sessions on the vCenter server.
Looking at the vCenter UI, under Monitor -> Sessions, we found hundreds of idle sessions, most of which were initiated by a specific service account - the one used by CAPV on some of our Kubernetes clusters.
The vCenter server crashes made much more sense at this point, since vCenter limits the number of sessions that can be connected to it, and both idle and active sessions count toward that limit. The number of open sessions can fill up quickly, especially if multiple systems or components interact with the vCenter server and do not disconnect properly.
In the screenshot below, you can see some of these idle sessions, initiated by the k8s-capv-useragent user agent. Also, the originating IP addresses are Kubernetes nodes on which the capv-controller-manager Pods reside.
We then looked at the capv-controller-manager Pod logs.
kubectl logs -n capv-system $(kubectl get pod -n capv-system -l control-plane=controller-manager -o jsonpath='{.items[].metadata.name}')
We immediately observed a clear pattern of error messages:
I1129 18:59:41.248933 1 session.go:223] "session: performing session log out and clearing session" server="VCENTER_HOSTNAME" datacenter="" key="[email protected]"
I1129 18:59:44.454398 1 session.go:174] "session: cached vSphere client session" server="VCENTER_HOSTNAME" datacenter="/Datacenter"
E1129 18:59:46.430510 1 session.go:121] "session: unable to check if vim session is active" err="ServerFaultCode: Permission to perform this operation was denied." server="VCENTER_HOSTNAME" datacenter="/Datacenter"
We also saw many 503 Service Unavailable error messages as a result of the vCenter server crashes. For example:
E1129 20:01:07.762891 1 controller.go:326] "Reconciler error" err="unable to create tags manager: POST https://VCENTER_HOSTNAME/rest/com/vmware/cis/session: 503 Service Unavailable" controller="vspherevm" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="VSphereVM" VSphereVM="default/it-cls-opu-01-np01-md-0-5d58b5c595xdd98j-jwkw5" namespace="default" name="it-cls-opu-01-np01-md-0-5d58b5c595xdd98j-jwkw5" reconcileID=741afefc-d265-4b6b-8f4c-d7dd8412b017
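To get a quick sense of how often these errors occur, the log output can be piped through a small counter. The following is a minimal sketch, not part of our original troubleshooting: the function name summarize_session_errors is hypothetical, and the two patterns are taken from the log excerpts above, so other CAPV versions may word these messages differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads capv-controller-manager log lines from stdin and
# counts the two error patterns shown in this post.
# Assumption: the message wording matches the excerpts above.
summarize_session_errors() {
  awk '
    /unable to check if vim session is active/ { denied++ }
    /503 Service Unavailable/                  { unavailable++ }
    END {
      # Uninitialized awk counters print as 0 with %d.
      printf "session_check_failures=%d\n", denied
      printf "service_unavailable=%d\n", unavailable
    }
  '
}

# Example usage (requires cluster access; --tail=-1 returns all lines when a
# label selector is used):
#   kubectl logs -n capv-system -l control-plane=controller-manager --tail=-1 \
#     | summarize_session_errors
```

A steadily growing session_check_failures count is a hint that the controller cannot validate its cached sessions.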
At that point, it was clear to us that the issue was permission-related: the "Permission to perform this operation was denied" error was preventing the service account from properly checking its sessions and terminating them. We looked at the vSphere role assigned to the service account. (Note: we use least-privilege roles containing only the necessary vSphere privileges.)
We immediately noticed that the role did not have any session-related privileges checked. After carefully looking at the required privileges, we added the Message and Validate Session privileges to the role (programmatically identified as Sessions.GlobalMessage and Sessions.ValidateSession).
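If you manage several roles, a scripted check for these two privileges can save some clicking. The sketch below is an illustration under assumptions, not what we actually ran: the function name missing_session_privileges is hypothetical, and it expects the role's privilege IDs on stdin, one per line, however you export them (for example with govc role.ls, if you happen to use govc).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads privilege IDs from stdin (one per line) and prints
# each session privilege from this post that is not in the list.
# The body runs in a subshell "( ... )" so its variables do not leak, and
# "exit" only sets the function's status: 1 if anything is missing, else 0.
missing_session_privileges() (
  granted=$(cat)
  missing=0
  for priv in Sessions.GlobalMessage Sessions.ValidateSession; do
    # -x matches the whole line, -F treats the ID as a literal string.
    if ! printf '%s\n' "$granted" | grep -qxF "$priv"; then
      echo "missing: $priv"
      missing=1
    fi
  done
  exit "$missing"
)

# Example usage (assumes govc is installed and configured; ROLE_NAME is a
# placeholder for your role):
#   govc role.ls ROLE_NAME | missing_session_privileges
```

No output and a zero exit status mean both privileges are present in the list.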
Although it may not be necessary, we deleted the capv-controller-manager Pod to restart it and ensure that new vCenter sessions are successfully established and that logs are clear.
kubectl delete pod -n capv-system -l control-plane=controller-manager
The logs were completely clean and the error messages were gone.
To clean up and terminate the old idle sessions on the vCenter server, we took the following steps:
Connect to vCenter Server using PowerCLI:
$vCenterHostname = "VCENTER_HOSTNAME"
Connect-VIServer $vCenterHostname -Force
Create a file named Get-VISession.ps1 with the following content:
Function Get-ViSession {
    <#
    .SYNOPSIS
    Lists vCenter Sessions.
    .DESCRIPTION
    Lists all connected vCenter Sessions.
    .EXAMPLE
    PS C:\> Get-VISession
    .EXAMPLE
    PS C:\> Get-VISession | Where { $_.IdleMinutes -gt 5 }
    #>
    # Uses the default connection established by Connect-VIServer.
    $SessionMgr = Get-View $DefaultViserver.ExtensionData.Client.ServiceContent.SessionManager
    $AllSessions = @()
    $SessionMgr.SessionList | ForEach-Object {
        $Session = New-Object -TypeName PSObject -Property @{
            Key            = $_.Key
            UserName       = $_.UserName
            FullName       = $_.FullName
            LoginTime      = ($_.LoginTime).ToLocalTime()
            LastActiveTime = ($_.LastActiveTime).ToLocalTime()
        }
        # Every session other than the one running this script is labeled "Idle".
        If ($_.Key -eq $SessionMgr.CurrentSession.Key) {
            $Session | Add-Member -MemberType NoteProperty -Name Status -Value "Current Session"
        }
        Else {
            $Session | Add-Member -MemberType NoteProperty -Name Status -Value "Idle"
        }
        # Minutes elapsed since the session was last active, rounded to the nearest minute.
        $Session | Add-Member -MemberType NoteProperty -Name IdleMinutes -Value ([Math]::Round(((Get-Date) - ($_.LastActiveTime).ToLocalTime()).TotalMinutes))
        $AllSessions += $Session
    }
    $AllSessions
}
Function Disconnect-ViSession {
    <#
    .SYNOPSIS
    Disconnects a connected vCenter Session.
    .DESCRIPTION
    Disconnects an open connected vCenter Session.
    .PARAMETER SessionList
    A session or a list of sessions to disconnect.
    .EXAMPLE
    PS C:\> Get-VISession | Where { $_.IdleMinutes -gt 5 } | Disconnect-ViSession
    .EXAMPLE
    PS C:\> Get-VISession | Where { $_.Username -eq "User19" } | Disconnect-ViSession
    #>
    [CmdletBinding()]
    Param (
        [Parameter(ValueFromPipeline = $true)]
        $SessionList
    )
    Process {
        # Runs once per piped session object and terminates it by its session key.
        $SessionMgr = Get-View $DefaultViserver.ExtensionData.Client.ServiceContent.SessionManager
        $SessionList | ForEach-Object {
            Write-Output "Disconnecting Session for $($_.Username) which has been active since $($_.LoginTime)"
            $SessionMgr.TerminateSession($_.Key)
        }
    }
}
Note: the above PowerShell/PowerCLI snippet is based on this VMware blog post and is slightly modified.
Import the Get-VISession.ps1 file as a PowerShell module:
Import-Module .\Get-VISession.ps1
Execute the function to clean up the Idle sessions:
Get-ViSession | Where-Object {$_.Status -eq 'Idle'} | Disconnect-ViSession
The vCenter server has not crashed since, and the number of open sessions has dropped drastically now that CAPV properly terminates its sessions.
I hope this post helps anyone who happens to run into this issue out there.



