Restore disconnected node
Revision as of 20:55, 31 August 2014
We had an outage in the building which led to all sorts of strange things (loss of network, sporadic disappearances of servers, etc.).
One of the issues was an outage of three ESX nodes. Thankfully it was not instantaneous, which allowed us to work through the problems and restore the service back to normal. But one of the nodes remained disconnected as far as vCenter was concerned, even though we could SSH into it without a problem.
We couldn't migrate VMs off it, though, and we couldn't reach the VMs either over the network or via the console. Catch-22.
Since the VMs were on shared iSCSI storage, we decided to kill the problem node and get the VMs up on another node later. Not something you'd like to do, but there seemed to be no other choice left.
The problem node is 10.33.33.15; the good running node is 10.33.33.18. The shared storage is an iSCSI LUN hosted on a QNAP NAS appliance.
So, 10.33.33.15 goes down, but that doesn't change the fact that the node and all its VMs are marked disconnected in vCenter, and we can do nothing with them there.
SSH to 10.33.33.18 and check the storage available:
~ # df -h
Filesystem     Size      Used  Available  Use%  Mounted on
visorfs        1.4G    366.5M       1.1G   25%  /
vfat           4.0G      2.8M       4.0G    0%  /vmfs/volumes/5241a93f-b7dd45ec-e1d9-5cf3fc09c4fe
vfat         249.7M    115.0M     134.8M   46%  /vmfs/volumes/a8622371-f21c2f29-e927-779374556003
vfat         249.7M      4.0k     249.7M    0%  /vmfs/volumes/693e6c7c-feec42d1-cfdc-cf9f0d4fc01e
vfat         285.9M    145.3M     140.6M   51%  /vmfs/volumes/f57412e8-354f182d-35e2-275460791eb0
vmfs3        403.0G    563.0M     402.5G    0%  /vmfs/volumes/5241a95d-1704ff26-2d0b-5cf3fc09c4fe
vmfs3          1.9T      1.1T     891.0G   55%  /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78
The very last line is the iSCSI LUN we were looking for. And here are my VMs:
~ # ls /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78/
EXCALIBUR                   FRCBCERTIF   FRCBZABBIX1   SMDEVWEBTEST01_new
FR-C-WDS-01                 FRCBDC2      ISO           SMLPXWIKI
FR1-C-PS-01.smartandco.com  FRCBSCCM     SEVEN_64_EN   fr1-c-dc-01.smartandco.com
FR1-C-SAGE30                FRCBSOPHOS   SEVEN_64_FR   FR1-CORP-AdminC
FRCBVCENTER                 SMAPPFILES   FRCBBACKUP01  FRCBWSUS
SMARTNAGIOS
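Each of these directories should hold the VM's .vmx configuration file, which is what vim-cmd needs for registration. A quick way to enumerate them, sketched as a small helper (list_vmx is my own name, not a VMware command; the datastore path is the one from the df output above):

```shell
# list_vmx: print every .vmx config file one directory deep under a
# datastore path. Hypothetical helper, not part of ESXi.
list_vmx() {
    find "$1" -maxdepth 2 -name '*.vmx'
}
# On the ESXi host:
# list_vmx /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78
```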
So far, so good.
Let's get the first one over to this node:
~ # vim-cmd solo/registervm /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78/SMAPPFILES/SMAPPFILES.vmx
320
~ # vim-cmd vmsvc/getallvms
Vmid  Name        File                                         Guest OS         Version  Annotation
320   SMAPPFILES  [Frcbnas3_Iscsi] SMAPPFILES/SMAPPFILES.vmx   debian5_64Guest  vmx-07
Continue with the rest of the VMs until they are all registered on this node:
~ # vim-cmd solo/registervm /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78/SMLPXWIKI/SMLPXWIKI2.vmx
560
~ # vim-cmd vmsvc/getallvms
Vmid  Name          File                                             Guest OS               Version  Annotation
448   SMAPPFILES    [Frcbnas3_Iscsi] SMAPPFILES/SMAPPFILES.vmx       debian5_64Guest        vmx-07
464   EXCALIBUR     [Frcbnas3_Iscsi] EXCALIBUR/EXCALIBUR.vmx         winNetStandardGuest    vmx-07
480   FRCBBACKUP01  [Frcbnas3_Iscsi] FRCBBACKUP01/FRCBBACKUP01.vmx   windows7Server64Guest  vmx-07
496   FRCBCERTIF    [Frcbnas3_Iscsi] FRCBCERTIF/FRCBCERTIF.vmx       windows7Server64Guest  vmx-07
512   FRCBDC2       [Frcbnas3_Iscsi] FRCBDC2/FRCBDC2.vmx             winNetStandard64Guest  vmx-07
528   FRCBWSUS      [Frcbnas3_Iscsi] FRCBWSUS/FRCBWSUS.vmx           windows7Server64Guest  vmx-07
544   SMARTNAGIOS   [Frcbnas3_Iscsi] SMARTNAGIOS/SMARTNAGIOS.vmx     rhel5Guest             vmx-07
560   SMLPXWIKI     [Frcbnas3_Iscsi] SMLPXWIKI/SMLPXWIKI2.vmx        winNetStandardGuest    vmx-07
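Registering each VM by hand works, but with one VM per directory this loops nicely. A sketch, assuming every VM sits in its own directory directly under the datastore (the register_all helper and the "echo" dry-run convention are mine, not part of vim-cmd):

```shell
# register_all: run (or just print) a vim-cmd registration for every .vmx
# found one directory deep under the datastore path.
# Pass "echo" as the second argument for a dry run; omit it to execute.
register_all() {
    ds="$1"
    runner="${2:-}"
    for vmx in "$ds"/*/*.vmx; do
        [ -e "$vmx" ] || continue    # skip when the glob matched nothing
        $runner vim-cmd solo/registervm "$vmx"
    done
}
# Dry run first, then for real:
# register_all /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78 echo
# register_all /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78
```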
All was good: we started the VMs we needed on the good node (10.33.33.18) and turned to recovering the failed 10.33.33.15, which stubbornly refused to boot. We ended up powering it down, pulling out the power cables, waiting a minute, and bringing it back up. That did it.
But the VMs we so nicely re-registered on 10.33.33.18 were still registered on 10.33.33.15, and when we decided to reconnect it in vCenter we saw the whole list of them there, with vCenter saying it was going to add them to the cluster. Nice, but not what we needed.
SSH into 10.33.33.15 to un-register the VMs before re-connecting the node back into the cluster:
Here is the whole list; note that the VMIDs are not the same as the ones these VMs now have on 10.33.33.18:
~ # vim-cmd vmsvc/getallvms
Vmid  Name           File                                             Guest OS               Version  Annotation
128   FRCBDC2        [Frcbnas3_Iscsi] FRCBDC2/FRCBDC2.vmx             winNetStandard64Guest  vmx-07
224   FR1-C-SCCM-01  [Frcbnas3_Iscsi] FRCBSCCM/FRCBSCCM.vmx           windows7Server64Guest  vmx-07
272   SMLPXWIKI      [Frcbnas3_Iscsi] SMLPXWIKI/SMLPXWIKI2.vmx        winNetStandardGuest    vmx-07
400   SMARTNAGIOS    [Frcbnas3_Iscsi] SMARTNAGIOS/SMARTNAGIOS.vmx     rhel5Guest             vmx-07
480   FRCBCERTIF     [Frcbnas3_Iscsi] FRCBCERTIF/FRCBCERTIF.vmx       windows7Server64Guest  vmx-07
496   FRCBBACKUP01   [Frcbnas3_Iscsi] FRCBBACKUP01/FRCBBACKUP01.vmx   windows7Server64Guest  vmx-07
512   FRCBWSUS       [Frcbnas3_Iscsi] FRCBWSUS/FRCBWSUS.vmx           windows7Server64Guest  vmx-07
528   FR1-C-SAGE30   [Frcbnas3_Iscsi] FR1-C-SAGE30/FR1-C-SAGE30.vmx   windows7Server64Guest  vmx-07
544   EXCALIBUR      [Frcbnas3_Iscsi] EXCALIBUR/EXCALIBUR.vmx         winNetStandardGuest    vmx-07
64    SMAPPFILES     [Frcbnas3_Iscsi] SMAPPFILES/SMAPPFILES.vmx       debian5_64Guest        vmx-07
Let's take a shot at one of the VMs - SMARTNAGIOS:
~ # vim-cmd vmsvc/unregister 400
~ # vim-cmd vmsvc/getallvms
Vmid  Name           File                                             Guest OS               Version  Annotation
128   FRCBDC2        [Frcbnas3_Iscsi] FRCBDC2/FRCBDC2.vmx             winNetStandard64Guest  vmx-07
224   FR1-C-SCCM-01  [Frcbnas3_Iscsi] FRCBSCCM/FRCBSCCM.vmx           windows7Server64Guest  vmx-07
272   SMLPXWIKI      [Frcbnas3_Iscsi] SMLPXWIKI/SMLPXWIKI2.vmx        winNetStandardGuest    vmx-07
480   FRCBCERTIF     [Frcbnas3_Iscsi] FRCBCERTIF/FRCBCERTIF.vmx       windows7Server64Guest  vmx-07
496   FRCBBACKUP01   [Frcbnas3_Iscsi] FRCBBACKUP01/FRCBBACKUP01.vmx   windows7Server64Guest  vmx-07
512   FRCBWSUS       [Frcbnas3_Iscsi] FRCBWSUS/FRCBWSUS.vmx           windows7Server64Guest  vmx-07
528   FR1-C-SAGE30   [Frcbnas3_Iscsi] FR1-C-SAGE30/FR1-C-SAGE30.vmx   windows7Server64Guest  vmx-07
544   EXCALIBUR      [Frcbnas3_Iscsi] EXCALIBUR/EXCALIBUR.vmx         winNetStandardGuest    vmx-07
64    SMAPPFILES     [Frcbnas3_Iscsi] SMAPPFILES/SMAPPFILES.vmx       debian5_64Guest        vmx-07
Carry on and unregister the rest of them:
~ # vim-cmd vmsvc/unregister 64
~ # vim-cmd vmsvc/unregister 544
~ # vim-cmd vmsvc/unregister 528
~ # vim-cmd vmsvc/unregister 512
~ # vim-cmd vmsvc/unregister 496
~ # vim-cmd vmsvc/unregister 480
~ # vim-cmd vmsvc/unregister 272
~ # vim-cmd vmsvc/unregister 224
~ # vim-cmd vmsvc/unregister 128
~ # vim-cmd vmsvc/getallvms
Vmid  Name  File  Guest OS  Version  Annotation
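Typing the unregister commands one by one works, but the Vmids can also be scraped straight out of the getallvms output. A sketch (the unregister_ids helper is my own, not part of vim-cmd); it only prints the commands, so you can review them before piping to sh:

```shell
# unregister_ids: read `vim-cmd vmsvc/getallvms` output on stdin and print
# one `vim-cmd vmsvc/unregister <Vmid>` command per VM.
# Hypothetical helper; review the printed commands before running them.
unregister_ids() {
    awk 'NR > 1 && $1 ~ /^[0-9]+$/ {print "vim-cmd vmsvc/unregister " $1}'
}
# On the host:
# vim-cmd vmsvc/getallvms | unregister_ids        # review first
# vim-cmd vmsvc/getallvms | unregister_ids | sh   # then run for real
```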
Now we can reconnect the node to the cluster and wait for DRS to kick in and move something onto it. Which it later did.
Problem solved :-)