Restore disconnected node

Revision as of 20:55, 31 August 2014 by Admin

We had an outage in the building, which led to all sorts of strange things (loss of network, sporadic disappearances of servers, etc.).

One of the issues was an outage on three ESX nodes. Thankfully it was not instantaneous, which allowed us to work through the issues and restore service to normal. But one of the nodes was still shown as disconnected in vCenter, even though we could SSH into it without a problem.

We couldn't migrate VMs off it, though, and we couldn't reach the VMs either over the network or via the console. Catch-22.

Since the VMs were on shared iSCSI storage, we decided to kill the problem node and get the VMs up on another node later. Not something you'd like to do, but there seemed to be no other choice left.

The problem node is 10.33.33.15; the good running node is 10.33.33.18. Shared storage is an iSCSI LUN hosted on a QNAP NAS appliance.

So 10.33.33.15 goes down, but that doesn't change the fact that the node and all its VMs are marked as disconnected in vCenter, and we can do nothing with them there.

SSH to 10.33.33.18 and check the storage available:

~ # df -h
Filesystem                Size      Used Available Use% Mounted on
visorfs                   1.4G    366.5M      1.1G  25% /
vfat                      4.0G      2.8M      4.0G   0% /vmfs/volumes/5241a93f-b7dd45ec-e1d9-5cf3fc09c4fe
vfat                    249.7M    115.0M    134.8M  46% /vmfs/volumes/a8622371-f21c2f29-e927-779374556003
vfat                    249.7M      4.0k    249.7M   0% /vmfs/volumes/693e6c7c-feec42d1-cfdc-cf9f0d4fc01e
vfat                    285.9M    145.3M    140.6M  51% /vmfs/volumes/f57412e8-354f182d-35e2-275460791eb0
vmfs3                   403.0G    563.0M    402.5G   0% /vmfs/volumes/5241a95d-1704ff26-2d0b-5cf3fc09c4fe
vmfs3                     1.9T      1.1T    891.0G  55% /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78

The very last line is the iSCSI LUN we were looking for. And here are my VMs:

~ # ls /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78/
EXCALIBUR                   FRCBCERTIF                  FRCBZABBIX1                 SMDEVWEBTEST01_new
FR-C-WDS-01                 FRCBDC2                     ISO                         SMLPXWIKI
FR1-C-PS-01.smartandco.com  FRCBSCCM                    SEVEN_64_EN                 fr1-c-dc-01.smartandco.com
FR1-C-SAGE30                FRCBSOPHOS                  SEVEN_64_FR
FR1-CORP-AdminC             FRCBVCENTER                 SMAPPFILES
FRCBBACKUP01                FRCBWSUS                    SMARTNAGIOS

So far, so good.
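Picking the VMFS volumes out of the `df -h` output by eye works, but it can also be filtered. The following is a minimal sketch, assuming `awk` is available in the host's shell; the function name `vmfs_mounts` is made up for illustration and is not part of ESX.

```shell
# Hypothetical helper: read `df -h` output on stdin and print the size and
# mount point of every line whose first column starts with "vmfs".
vmfs_mounts() {
    awk '$1 ~ /^vmfs/ { print $2, $NF }'
}

# Usage on the host (not run here):
#   df -h | vmfs_mounts
```

The largest volume in the list is then easy to spot, which in our case was the iSCSI LUN.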

Let's get the first one over to this node:

~ # vim-cmd solo/registervm /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78/SMAPPFILES/SMAPPFILES.vmx
320

~ # vim-cmd vmsvc/getallvms
Vmid      Name                         File                         Guest OS       Version   Annotation
320    SMAPPFILES   [Frcbnas3_Iscsi] SMAPPFILES/SMAPPFILES.vmx   debian5_64Guest   vmx-07              

Continue on with the rest of the VMs until I get them all registered on this node:

~ # vim-cmd solo/registervm /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78/SMLPXWIKI/SMLPXWIKI2.vmx
560
~ # vim-cmd vmsvc/getallvms
Vmid       Name                            File                              Guest OS          Version   Annotation
448    SMAPPFILES     [Frcbnas3_Iscsi] SMAPPFILES/SMAPPFILES.vmx       debian5_64Guest         vmx-07              
464    EXCALIBUR      [Frcbnas3_Iscsi] EXCALIBUR/EXCALIBUR.vmx         winNetStandardGuest     vmx-07              
480    FRCBBACKUP01   [Frcbnas3_Iscsi] FRCBBACKUP01/FRCBBACKUP01.vmx   windows7Server64Guest   vmx-07              
496    FRCBCERTIF     [Frcbnas3_Iscsi] FRCBCERTIF/FRCBCERTIF.vmx       windows7Server64Guest   vmx-07              
512    FRCBDC2        [Frcbnas3_Iscsi] FRCBDC2/FRCBDC2.vmx             winNetStandard64Guest   vmx-07              
528    FRCBWSUS       [Frcbnas3_Iscsi] FRCBWSUS/FRCBWSUS.vmx           windows7Server64Guest   vmx-07              
544    SMARTNAGIOS    [Frcbnas3_Iscsi] SMARTNAGIOS/SMARTNAGIOS.vmx     rhel5Guest              vmx-07              
560    SMLPXWIKI      [Frcbnas3_Iscsi] SMLPXWIKI/SMLPXWIKI2.vmx        winNetStandardGuest     vmx-07        
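We registered the VMs one by one, picking only the ones we wanted. If every VM on the datastore has to be registered, a loop along the following lines could do it. This is only a sketch: the function name `register_all_vmx` and the `VIM_CMD` variable are made up for illustration, and it assumes each .vmx file sits exactly one directory below the datastore root, as in the listing above. Be careful not to register VMs that are still running on another node.

```shell
# Command used to talk to the host; overridable so the loop can be dry-run.
VIM_CMD="${VIM_CMD:-vim-cmd}"

# Hypothetical helper: register every .vmx found one level below a datastore.
# Directories without a .vmx (ISO images, for example) are skipped by the glob.
register_all_vmx() {
    datastore="$1"
    for vmx in "$datastore"/*/*.vmx; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        [ -f "$vmx" ] || continue
        echo "Registering $vmx"
        "$VIM_CMD" solo/registervm "$vmx"
    done
}

# Usage on the host (not run here):
#   register_all_vmx /vmfs/volumes/517ff6d9-b7dec161-a899-001a64792b78
```

Matching on `*.vmx` rather than on the directory name matters here, because SMLPXWIKI keeps its configuration in SMLPXWIKI2.vmx.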

All was good. We started the VMs we needed right away on the good node (10.33.33.18), then set about recovering the failed 10.33.33.15, which stubbornly refused to boot. We ended up powering it down, pulling the power cables, waiting a minute and bringing it back up. This time it booted.

But the VMs we had so nicely re-registered on 10.33.33.18 were still registered on 10.33.33.15, and when we went to reconnect it in vCenter we saw the whole list of them there, with vCenter saying that it was going to add them to the cluster. Nice, but not what we needed.

SSH into 10.33.33.15 to unregister the VMs before reconnecting the node to the cluster.

Here is the whole list. Note that the Vmids are not the same as the ones these VMs now have on 10.33.33.18:
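Starting the registered VMs can be done from the same SSH session with `vim-cmd vmsvc/power.on`, using the Vmids reported by `getallvms` on that host. A minimal sketch follows; the wrapper name `power_on_vms` and the `VIM_CMD` variable are made up for illustration.

```shell
# Command used to talk to the host; overridable so the loop can be dry-run.
VIM_CMD="${VIM_CMD:-vim-cmd}"

# Hypothetical helper: power on each Vmid given as an argument and
# report failures on stderr without stopping the loop.
power_on_vms() {
    for id in "$@"; do
        echo "Powering on Vmid $id"
        "$VIM_CMD" vmsvc/power.on "$id" || echo "power.on failed for Vmid $id" >&2
    done
}

# Usage on the host (not run here), with Vmids taken from getallvms:
#   power_on_vms 448 512
```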

~ # vim-cmd vmsvc/getallvms
Vmid       Name                             File                              Guest OS          Version   Annotation
128    FRCBDC2         [Frcbnas3_Iscsi] FRCBDC2/FRCBDC2.vmx             winNetStandard64Guest   vmx-07              
224    FR1-C-SCCM-01   [Frcbnas3_Iscsi] FRCBSCCM/FRCBSCCM.vmx           windows7Server64Guest   vmx-07              
272    SMLPXWIKI       [Frcbnas3_Iscsi] SMLPXWIKI/SMLPXWIKI2.vmx        winNetStandardGuest     vmx-07              
400    SMARTNAGIOS     [Frcbnas3_Iscsi] SMARTNAGIOS/SMARTNAGIOS.vmx     rhel5Guest              vmx-07              
480    FRCBCERTIF      [Frcbnas3_Iscsi] FRCBCERTIF/FRCBCERTIF.vmx       windows7Server64Guest   vmx-07              
496    FRCBBACKUP01    [Frcbnas3_Iscsi] FRCBBACKUP01/FRCBBACKUP01.vmx   windows7Server64Guest   vmx-07              
512    FRCBWSUS        [Frcbnas3_Iscsi] FRCBWSUS/FRCBWSUS.vmx           windows7Server64Guest   vmx-07              
528    FR1-C-SAGE30    [Frcbnas3_Iscsi] FR1-C-SAGE30/FR1-C-SAGE30.vmx   windows7Server64Guest   vmx-07              
544    EXCALIBUR       [Frcbnas3_Iscsi] EXCALIBUR/EXCALIBUR.vmx         winNetStandardGuest     vmx-07              
64     SMAPPFILES      [Frcbnas3_Iscsi] SMAPPFILES/SMAPPFILES.vmx       debian5_64Guest         vmx-07      

Let's take a shot at one of the VMs - SMARTNAGIOS:

~ # vim-cmd vmsvc/unregister 400

~ # vim-cmd vmsvc/getallvms
Vmid       Name                             File                              Guest OS          Version   Annotation
128    FRCBDC2         [Frcbnas3_Iscsi] FRCBDC2/FRCBDC2.vmx             winNetStandard64Guest   vmx-07              
224    FR1-C-SCCM-01   [Frcbnas3_Iscsi] FRCBSCCM/FRCBSCCM.vmx           windows7Server64Guest   vmx-07              
272    SMLPXWIKI       [Frcbnas3_Iscsi] SMLPXWIKI/SMLPXWIKI2.vmx        winNetStandardGuest     vmx-07              
480    FRCBCERTIF      [Frcbnas3_Iscsi] FRCBCERTIF/FRCBCERTIF.vmx       windows7Server64Guest   vmx-07              
496    FRCBBACKUP01    [Frcbnas3_Iscsi] FRCBBACKUP01/FRCBBACKUP01.vmx   windows7Server64Guest   vmx-07              
512    FRCBWSUS        [Frcbnas3_Iscsi] FRCBWSUS/FRCBWSUS.vmx           windows7Server64Guest   vmx-07              
528    FR1-C-SAGE30    [Frcbnas3_Iscsi] FR1-C-SAGE30/FR1-C-SAGE30.vmx   windows7Server64Guest   vmx-07              
544    EXCALIBUR       [Frcbnas3_Iscsi] EXCALIBUR/EXCALIBUR.vmx         winNetStandardGuest     vmx-07              
64     SMAPPFILES      [Frcbnas3_Iscsi] SMAPPFILES/SMAPPFILES.vmx       debian5_64Guest         vmx-07 
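Because the Vmid of a VM differs from host to host, it is safer to look it up by name on the host you are logged into than to reuse a number seen elsewhere. A small sketch, assuming `awk` is available and that VM names contain no spaces (true for the names above); `vmid_by_name` and `VIM_CMD` are names made up for illustration.

```shell
# Command used to talk to the host; overridable so the lookup can be dry-run.
VIM_CMD="${VIM_CMD:-vim-cmd}"

# Hypothetical helper: print the Vmid whose Name column equals the argument.
# Prints nothing if the name is not registered on this host.
vmid_by_name() {
    "$VIM_CMD" vmsvc/getallvms | awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Usage on the host (not run here):
#   vim-cmd vmsvc/unregister "$(vmid_by_name SMARTNAGIOS)"
```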

Carry on and unregister the rest of them:

~ # vim-cmd vmsvc/unregister 64 
~ # vim-cmd vmsvc/unregister 544
~ # vim-cmd vmsvc/unregister 528
~ # vim-cmd vmsvc/unregister 512
~ # vim-cmd vmsvc/unregister 496
~ # vim-cmd vmsvc/unregister 480
~ # vim-cmd vmsvc/unregister 272
~ # vim-cmd vmsvc/unregister 224
~ # vim-cmd vmsvc/unregister 128
~ # vim-cmd vmsvc/getallvms
Vmid   Name   File   Guest OS   Version   Annotation
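Typing one `unregister` per Vmid is fine for ten VMs. For a longer list the same thing can be scripted. The following is a sketch under assumptions: `awk` is available, every VM line in the `getallvms` output starts with a numeric Vmid, and the names `list_vmids`, `unregister_all` and `VIM_CMD` are made up for illustration. It unregisters everything on the host, so only run something like it on a node that should end up empty.

```shell
# Command used to talk to the host; overridable so the loop can be dry-run.
VIM_CMD="${VIM_CMD:-vim-cmd}"

# Hypothetical helper: print the Vmid column of getallvms, skipping the header
# and any line that does not start with a number (wrapped annotations, say).
list_vmids() {
    "$VIM_CMD" vmsvc/getallvms | awk 'NR > 1 && $1 ~ /^[0-9]+$/ { print $1 }'
}

# Hypothetical helper: unregister every VM currently known to this host.
unregister_all() {
    for id in $(list_vmids); do
        echo "Unregistering Vmid $id"
        "$VIM_CMD" vmsvc/unregister "$id"
    done
}

# Usage on the host (not run here):
#   unregister_all && vim-cmd vmsvc/getallvms
```

Unregistering only removes the VM from the host's inventory; the files on the shared datastore stay where they are.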

Now we can reconnect the node to the cluster and wait for DRS to kick in and move something onto it, which it later did.

Problem solved :-)
