Date: 2021-07-18
Author: Simon Jackson
We had a power outage in the datacentre: all servers powered off and back on again, and now vCenter is not contactable. In fact, everything on the site is offline. In the stressful event of a P0 outage, a major incident call ensued.
I jumped in to identify any connectivity issues, and found myself unable to reach the vCenter and ESXi instances on one VLAN, but able to access the host on an untagged VLAN over the same uplinks. With this info, I knew it was a vSwitch (or dvSwitch) issue, so I assumed it was a trunking or port-failover problem.
Little did I know at the time that the DVS port group's ephemeral binding, only usable with communication to vCenter, was the root cause of my problem.
Further Info: http://www.vmskills.com/2010/10/static-dynamic-and-ephemeral-binding-in.html
I went hunting for the design spreadsheet I put together when the kit was first built.
...we all use spreadsheets for this, right?
For reference, here is the data I had available to me, along with secrets:
I quickly observed that the ESXi host web client prevented any modifications to any resource attached to a dvSwitch or dvPortGroup. This makes sense, as communication back to vCenter Server was unavailable at the time.
I then went in search of some esxcli commands to reconfigure the vDS from the local CLI.
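For anyone in the same spot, a few read-only commands are handy for capturing the current state from the host shell before touching anything (names and output will obviously differ per environment):

    esxcli network vswitch dvs vmware list    # the dvSwitch, its uplinks and DVPort IDs as the host sees them
    esxcli network vswitch standard list      # any standard vSwitches already present on the host
    esxcli network ip interface list          # vmkernel NICs and which switch/port-group they sit on
    esxcfg-vswitch -l                         # the classic summary view, including DVPort IDs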
My objective was (a rough command sketch follows this list):
1. Detach vmnic0 from the dvSwitch port-group called DVS-LAN-MGMT
2. Attach vmnic0 to the vSwitch port-group called Management Network
3. Detach vmk0 from DVS-LAN-MGMT and immediately attach it to Management Network - all in one command!
4. Set the port-ID assignment to Static:Fixed, instead of Dynamic:Ephemeral, for the DVS-LAN-MGMT port-group only
5. Reverse changes 1-3 above to restore the original design
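For anyone following along at home, steps 1-3 look roughly like this - a sketch only, with placeholder values (in angle brackets) for the dvSwitch name, DVPort IDs, IP and netmask; pull the real values from esxcfg-vswitch -l and your own design sheet first:

    # Note the DVPort IDs currently used by vmnic0 and vmk0
    esxcfg-vswitch -l

    # 1. Unlink vmnic0 from the dvSwitch
    esxcfg-vswitch -Q vmnic0 -V <dvport-id-of-vmnic0> <dvswitch-name>

    # 2. Link vmnic0 to the standard vSwitch as an uplink; add the port-group
    #    (and VLAN tag, if your management VLAN is tagged) if it doesn't already exist
    esxcfg-vswitch -L vmnic0 vSwitch0
    esxcfg-vswitch -A "Management Network" vSwitch0
    esxcfg-vswitch -v <mgmt-vlan-id> -p "Management Network" vSwitch0

    # 3. Remove vmk0 from its DVPort and recreate it on "Management Network" in one line,
    #    so the management connection only drops for a moment
    esxcfg-vmknic -d -s <dvswitch-name> -v <dvport-id-of-vmk0> ; esxcfg-vmknic -a -i <vmk0-ip> -n <netmask> "Management Network"

Step 4 (the port-binding change) is a vCenter-side setting, so it has to wait until vCenter is reachable again - which is exactly what the rest of this post covers.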
This is what happened...
Note: the output was clipped from both of these commands, and VM-Names removed throughout. The important points to consider are highlighted.
esxcfg is of course very useful, and gives NO output for the above commands. If you were SSH'd in at the point of executing the second command, your connection will likely drop for a fraction of a second; don't worry, you should be okay to re-establish the SSH session.
Now the vSwitch has upstream connectivity. If we move a VM or vmkernel interface across to this vSwitch, into the right port-group (VLAN), we restore connectivity.
Note: I suggest doing all of this in one command; if you were working over SSH, the moment you detach vmk0 from the dvSwitch, you will lose connectivity!
If you are using a DCUI console, via iDRAC or something like that, then don't worry about the one-liner.
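If you are on the console, the vmk move can simply be two separate commands - same placeholders and assumptions as the sketch above - followed by a quick vmkping to confirm the management interface can reach its gateway:

    # Delete vmk0 from its DVPort, then recreate it on the standard port-group
    esxcfg-vmknic -d -s <dvswitch-name> -v <dvport-id-of-vmk0>
    esxcfg-vmknic -a -i <vmk0-ip> -n <netmask> "Management Network"

    # Confirm the management interface can reach its gateway again
    vmkping <gateway-ip>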
I accessed the UI using https://<ip-address-of-esxi-host>/ui, where I was able to move the vCenter VM from the unknown dvSwitch port-group to the vSwitch port-group called "Management Network".
Now we can access the UI of the vCenter Server using https://<ip-address-of-vcenter-server>/ui
Switching to view the DVS Config:
Editing the Management port-group, setting Port-Binding and Port-Allocation to support power outages:
1. Click on the Configure tab, and choose Edit
2. Click on Properties, and select the dropdown for Port-Binding
3. Choose Static
4. Select the next dropdown for Port-Allocation and choose Fixed
5. Click OK, then Apply
Right-click on the dvSwitch and choose Add & Manage Hosts, then:
1. Select Manage host networking, click Next
2. Select Host 1, click Next
3. Adjacent to vmnic0, select the uplink called "Uplink 1", click Next
4. Adjacent to vmk0, select the port-group called "DVS-LAN-MGMT", click Next
5. Tick to enable Virtual Machine Migration, find the vCenter Server and, for Network Adapter 1, select the port-group called "DVS-LAN-MGMT", click Next
6. Click Apply
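Back on the host shell, it's worth a quick sanity check that vmnic0 and vmk0 really have landed back on the dvSwitch (same caveats as before about names and DVPort IDs):

    esxcfg-vswitch -l                  # vmnic0 and vmk0 should show against the dvSwitch again
    esxcli network ip interface list   # vmk0 should now report the dvSwitch rather than vSwitch0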
All VMs powered on from here will now find an ephemeral port allocation successfully, even after a power outage!
All VMs that were already powered on needed a reboot, or their NIC disconnecting and then re-connecting.
I hope this explanation helps someone who finds themselves in the same sticky situation.