The misery of a Solution User Certificate in vCenter

This blog will probably never be used. Then why write it? Because only I know how I felt when I did what I did 🙂

Context:

While playing around with my vSphere with Kubernetes labs, I was generating a wildcard certificate for workload management and trying to retrieve the private key of the CSR I had generated. In the process, I accidentally deleted the solution user cert/key from the VECS store. For a while, nothing seemed to have gone wrong. But soon the misery began. If you haven’t run into this so far, you are perfectly fine! I didn’t know about any of this until I landed in this trouble, so hopefully this helps someone along the way.

Oh, but please keep in mind that the method I discuss here is completely UNSUPPORTED! The supported (and, of course, the right) method is to regenerate all the solution user certs and update the vCenter integrations.

Issue reproduction:

I have a wcp-enabled cluster. Running dcli com vmware vcenter namespacemanagement software clusters list should list the clusters where wcp is enabled.

And my wcpsvc.log doesn’t report any errors.

Now, I use vecs-cli on vCenter to remove the wcp solution user cert/key pair, which ideally SHOULDN’T be done. But you know, just in case..

Running /usr/lib/vmware-vmafd/bin/vecs-cli store list on the vCenter should list all of your certificate stores, just like the one in the screenshot below. Depending on the 3rd party integrations you have, the number of stores in this list may vary.

Running /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store <storename> lists the entries in that store along with their certificates.

Certificate output here is truncated
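To see what lives in every store at a glance, the two commands above can be combined in a small loop. This is only a sketch: the function name list_vecs_aliases and the overridable VECS_CLI variable are my own additions, and it assumes the --text output of entry list contains an "Alias" line per entry.

```shell
#!/bin/bash
# Path to vecs-cli as used in this post; overridable (assumption, for testing).
VECS_CLI="${VECS_CLI:-/usr/lib/vmware-vmafd/bin/vecs-cli}"

# Print every store name followed by the aliases found in it.
# list_vecs_aliases is a hypothetical helper name, not part of vecs-cli.
list_vecs_aliases() {
    local store
    for store in $("$VECS_CLI" store list); do
        echo "[store] $store"
        # --text prints readable entry details; keep only the Alias lines.
        "$VECS_CLI" entry list --store "$store" --text | grep -i 'Alias'
    done
}

# Usage on the vCenter appliance shell:
#   list_vecs_aliases
```

Noting down the aliases before touching anything makes it much easier to tell later whether an entry has gone missing.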

Now, the reason I deleted this key is that I thought the private key associated with the CSR I generate using the vSphere UI for Workload Management (shown below) was stored in this certificate store, and I wanted to test whether the key (in the VECS store) gets regenerated every time I generate a CSR on the UI. And guess what, I failed miserably.

I ran /usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store wcp --alias wcp 🙁 That’s how dumb I was..

Doing a getcert now throws an OBJECT NOT FOUND error on the alias. The store itself still remains.

Everything continues to work well until the “wcp” solution user is called for. This could be for operations related to the service itself (like deploying your supervisor cluster, disabling wcp etc..) or for inter-service communication within vCenter via solution users.
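A quick way to check for a missing entry is to look at the exit status of getcert. The sketch below assumes vecs-cli exits non-zero when the alias cannot be found (the wcpsvc log later in this post reports exit status 216 for exactly this case); vecs_entry_exists and the VECS_CLI variable are names I made up for illustration.

```shell
#!/bin/bash
# Path to vecs-cli as used in this post; overridable (assumption, for testing).
VECS_CLI="${VECS_CLI:-/usr/lib/vmware-vmafd/bin/vecs-cli}"

# Return 0 if the alias exists in the store, non-zero otherwise.
# $1: store name, $2: alias name
# vecs_entry_exists is a hypothetical helper name, not part of vecs-cli.
vecs_entry_exists() {
    "$VECS_CLI" entry getcert --store "$1" --alias "$2" >/dev/null 2>&1
}

# Usage:
#   vecs_entry_exists wcp wcp || echo "alias wcp is missing from store wcp"
```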

Doing a quick check on the logs, you would still see the cluster status to be healthy.

Now, to put the solution user into play, I am going to disable workload management on the cluster where it is already enabled and then tail the wcpsvc logs.

Error Snippet:

2021-11-17T18:48:19.361Z debug wcp [ssolib/sts.go:87] [opID=6193a589-domain-c8] Getting HOK signer; store: wcp, alias: wcp

2021-11-17T18:48:19.38Z error wcp [ssolib/helper.go:105] [opID=6193a589-domain-c8] Failed executing shell command; cmd: '/usr/lib/vmware-vmafd/bin/vecs-cli', args: [entry getcert --store wcp --alias wcp], stdout: '', stderr: 'vecs-cli failed. Error 4312: Possible errors:

LDAP error: Unknown (extension) error

Win Error: Operation failed with error ERROR_OBJECT_NOT_FOUND (4312)

', err: exit status 216

2021-11-17T18:48:19.38Z error wcp [auth/permission.go:96] [opID=6193a589-domain-c8] Unable to remove global permissions for user vsphere.local\wcp-vmop-user-domain-c8-2fe62da9-098c-4291-a310-c2e839a95574. Error: exit status 216

2021-11-17T18:48:19.38Z error wcp [kubelifecycle/identity.go:98] [opID=6193a589-domain-c8] Error removing global permissions from vmoperator service account wcp-vmop-user-domain-c8-2fe62da9-098c-4291-a310-c2e839a95574 : exit status 216

The snippet shows wcpsvc looking for an alias named wcp, which I had unfortunately deleted from the wcp store. Until wcpsvc can read that cert/key again and remove those global permissions, the wcp disable loop won’t end.

This is when I realised how deep I had sunk 😀
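If you suspect you are in the same spot, you can search the log for the two markers from the snippet above. A minimal sketch: find_vecs_lookup_failures is a made-up helper name, and the log path in the usage line is an assumption about where wcpsvc.log lives on your vCenter, so adjust it as needed.

```shell
#!/bin/bash
# Print (with line numbers) the log lines that point at a failed VECS lookup.
# $1: path to a wcpsvc log file
# find_vecs_lookup_failures is a hypothetical helper name.
find_vecs_lookup_failures() {
    grep -n -E 'ERROR_OBJECT_NOT_FOUND|exit status 216' "$1"
}

# Usage (log path is an assumption, adjust to your appliance):
#   find_vecs_lookup_failures /var/log/vmware/wcp/wcpsvc.log
```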

Resolution:

Method 1: (Fully Supported)

Use KB https://kb.vmware.com/s/article/2112283 to regenerate the solution user certs (Option 6 on the cert tool) on vCenter, then restart the wcp services to complete the disable operation that is stuck in a loop.

Method 2: (UNSUPPORTED)

Since this is the wcp service we are talking about, this only works if you are able to SSH into one of the supervisor cluster control plane VMs.

Step 1: Copy the certs from the supervisor VM to vCenter by running the following command on vCenter:

scp root@10.xx.xx.xx:/etc/vmware/wcp/tls/wcpusr* /tmp

Step 2: Import the copied certs into the VECS wcp store using the following command:

/usr/lib/vmware-vmafd/bin/vecs-cli entry create --store wcp --alias wcp --cert /tmp/wcpusr.cert --key /tmp/wcpusr.key
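Before running the import in Step 2, it may be worth confirming that the copied cert and key actually belong together. The sketch below compares the public key embedded in the certificate with the one derived from the private key using openssl. It assumes both files are PEM encoded, which I have not verified for every release; cert_matches_key is a made-up helper name.

```shell
#!/bin/bash
# Succeed only if the certificate and the private key share the same public key.
# $1: certificate file (assumed PEM), $2: private key file (assumed PEM)
# cert_matches_key is a hypothetical helper name.
cert_matches_key() {
    local cert_pub key_pub
    # Public key embedded in the certificate.
    cert_pub="$(openssl x509 -in "$1" -noout -pubkey)" || return 1
    # Public key derived from the private key.
    key_pub="$(openssl pkey -in "$2" -pubout)" || return 1
    [ "$cert_pub" = "$key_pub" ]
}

# Usage:
#   cert_matches_key /tmp/wcpusr.cert /tmp/wcpusr.key && echo "pair matches"
```

A mismatch here would mean the files came from different places, and importing them would not bring the wcp solution user back.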

Now, checking wcpsvc.log, we can confirm that the deletion has begun. In no time, wcp on the cluster will get disabled.

You can compare the timestamps on the logs/deletion messages just to be sure 🙂

PS: I could have simply averted this whole situation had I taken a backup of the cert/key using vecs-cli entry getcert/getkey --output <path/filename> and used those backup files to recreate the entry as in Step 2 above.

I learnt my lesson the hard way. Didn’t I? Well, what can I say 🙂
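Here is roughly what that backup could look like as a small helper. It is a sketch only: backup_vecs_entry, the VECS_CLI variable and the destination file naming are my own choices, and the umask is there simply to keep the exported private key readable by its owner alone.

```shell
#!/bin/bash
# Path to vecs-cli as used in this post; overridable (assumption, for testing).
VECS_CLI="${VECS_CLI:-/usr/lib/vmware-vmafd/bin/vecs-cli}"

# Export the certificate and private key of one VECS entry to files.
# $1: store name, $2: alias name, $3: destination directory
# backup_vecs_entry and the <alias>.cert/<alias>.key naming are hypothetical.
backup_vecs_entry() {
    local store="$1" entry="$2" dest="$3"
    mkdir -p "$dest" || return 1
    "$VECS_CLI" entry getcert --store "$store" --alias "$entry" \
        --output "$dest/$entry.cert" || return 1
    # Run getkey in a subshell with a tight umask so the key file is owner-only.
    ( umask 077 && "$VECS_CLI" entry getkey --store "$store" --alias "$entry" \
        --output "$dest/$entry.key" ) || return 1
}

# Usage, before experimenting with the store:
#   backup_vecs_entry wcp wcp /root/vecs-backup
```

With those two files saved, recovering from an accidental delete is just the entry create command from Step 2 pointed at the backup.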

If this is all still over your head, I am happy to help you out 🙂 Feel free to drop me a message! And please let me know in the comments below if you can think of any better workarounds to this problem!

Happy Learning!
