Sunday, June 19, 2016

After an unexpected host reboot, powering on an RDM-attached virtual machine fails with the error: Incompatible device backing specified for device '0'

Last week one of our hosts restarted unexpectedly, and once it came back online we were unable to power on a VM (a passive cluster node). The error was:

Incompatible device backing specified for device '0'

HA didn’t restart this VM on another host because of a VM-to-host "must" DRS rule.

This error usually occurs when the LUN is not mapped consistently on the hosts where the primary and secondary nodes are running. In this case, however, a cross-check showed that the LUN number and naa.id were correct on the affected host.

As this was a passive node, we removed the affected drive from the VM, powered the node on, and then started investigating the issue.

On checking the vml.id of this LUN on both hosts, we found that it differed. Strangely, it was correct on the host in question but wrong on all the other hosts in the cluster. To share a LUN between nodes, it must be mapped consistently on all hosts and have the same unique vml.id (VMware Legacy ID) on each of them. Here it differed, so it seems the metadata of the RDM pointer file had become corrupted.

You can find the vml.id of a LUN as follows.

First note down the identifier of the LUN (naa.id), then run this command:
#esxcli storage core device list -d naa.id
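
In the output of this command, the vml.id is listed in the "Other UIDs" field. As a rough sketch, the Python helper below pulls that value out of saved command output so it can be compared between hosts. The sample text, its identifiers, and the function name extract_vml_ids are made up for illustration, and real output contains many more fields.

```python
import re

# Hypothetical, shortened sample of `esxcli storage core device list -d naa.id`
# output. The identifiers are invented for illustration only.
SAMPLE_OUTPUT = """\
naa.600000000000000000000000000000a1
   Display Name: Example Fibre Channel Disk (naa.600000000000000000000000000000a1)
   Size: 102400
   Device Type: Direct-Access
   Other UIDs: vml.0200050000600000000000000000000000000000a1
   Is Perennially Reserved: false
"""


def extract_vml_ids(esxcli_output):
    """Return every vml.* identifier found on the 'Other UIDs' line(s)."""
    vml_ids = []
    for line in esxcli_output.splitlines():
        # Split "Key: value" on the first colon only.
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "Other UIDs":
            vml_ids.extend(re.findall(r"vml\.[0-9A-Fa-f]+", value))
    return vml_ids


if __name__ == "__main__":
    for vml_id in extract_vml_ids(SAMPLE_OUTPUT):
        print(vml_id)
```

Run the esxcli command on each host, save the output, and feed it to the helper. The values should match on every host that shares the LUN.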

To fix this issue, remove the affected RDM disk from both nodes and then delete the RDM pointer file from the datastore (this doesn’t affect the actual data on the LUN). Then rescan the hosts for datastores and re-add the LUN as an RDM disk on both nodes. You should now be able to power on the affected node.

If for any reason this doesn’t work, remove the affected RDM disks from both nodes as above and then follow these steps:
  1. Note the NAA_ID of the LUN.
  2. Detach RDM using vSphere client.
  3. Unpresent the LUN from the host on the storage array.
  4. Rescan host storage. 
  5. Remove LUN from detached list using these commands:

    #esxcli storage core device detached list
    #esxcli storage core device detached remove -d naa.id
  6. Rescan the host storage. 
  7. Re-present LUN to host. 
  8. Rescan the hosts for datastores again.

If the LUN has been flagged as perennially reserved, removing it from the detached list can fail.

Run this command to remove the flag:

#esxcli storage core device setconfig -d naa.id --perennially-reserved=false
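
To confirm whether the flag is actually set (or cleared), the device list output for the LUN includes an "Is Perennially Reserved" field. The sketch below reads that field from saved output. The sample text and the function name is_perennially_reserved are hypothetical and only illustrate the idea.

```python
# Hypothetical, shortened sample of `esxcli storage core device list -d naa.id`
# output for a device that still carries the flag.
SAMPLE_OUTPUT = """\
naa.600000000000000000000000000000a1
   Device Type: Direct-Access
   Is Perennially Reserved: true
"""


def is_perennially_reserved(esxcli_output):
    """Return True or False from the 'Is Perennially Reserved' field.

    Returns None when the field is not present in the given text.
    """
    for line in esxcli_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "Is Perennially Reserved":
            return value.strip().lower() == "true"
    return None


if __name__ == "__main__":
    print(is_perennially_reserved(SAMPLE_OUTPUT))
```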

Now the command to remove the device should work.

# esxcli storage core device detached remove -d naa.id

Now cross-check the vml.id on the hosts. It should be the same everywhere, and after re-adding the RDM disk to the nodes you will be able to power on the VM nodes.
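
With more than a couple of hosts, the cross-check is easier if the collected vml.id values are grouped by host. This is a minimal sketch that assumes you have already gathered one vml.id per host. The host names, identifiers, and function names are invented for illustration.

```python
from collections import defaultdict


def group_hosts_by_vml(vml_by_host):
    """Group host names by the vml.id each host reports for the LUN."""
    groups = defaultdict(list)
    for host, vml_id in vml_by_host.items():
        # Normalise case and whitespace so cosmetic differences are ignored.
        groups[vml_id.strip().lower()].append(host)
    return {vml_id: sorted(hosts) for vml_id, hosts in groups.items()}


def is_consistent(vml_by_host):
    """True only when every host reports the same vml.id (False if empty)."""
    return len(group_hosts_by_vml(vml_by_host)) == 1


if __name__ == "__main__":
    # Hypothetical values collected from three hosts.
    collected = {
        "esx01": "vml.0200050000600000000000000000000000000000a1",
        "esx02": "vml.0200070000600000000000000000000000000000a1",
        "esx03": "vml.0200070000600000000000000000000000000000a1",
    }
    for vml_id, hosts in group_hosts_by_vml(collected).items():
        print(vml_id, hosts)
    print("consistent:", is_consistent(collected))
```

More than one group means the LUN is still not presented consistently. Note that the majority is not necessarily the correct side: in the incident above, the single affected host had the right value.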

Reference: VMware KB 1016210

Update: Apr 2018

I haven't tested it, but I found this workaround listed in a related KB, #205489:
  1. While adding the hard disk to the additional nodes of the cluster, select RDM Disk (instead of Existing Hard Disk) from the New device drop-down menu and click Add.
  2. Select the LUN naa.id that was added to the first node of the cluster. The LUN number may be different on this host.
  3. Verify that the disk was added successfully.

That’s it… :) 


Saturday, June 4, 2016

How to deal with an unresponsive Windows service, like the vCenter service

You might have seen this: you try to restart a Windows service and it gets stuck on stopping or, in some cases, on starting. The same thing happened to me recently when I tried to restart the vCenter service, which got stuck on stopping.

First, note down the service name, which you can find in the service's properties.

For the vCenter service it is ‘vpxd’.

Now open an elevated Windows command prompt and run this command:

C:\> sc queryex vpxd 

This shows the detailed status of the service. Note down its PID.

        SERVICE_NAME: vpxd
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 3  STOP_PENDING
                                (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x493e0
        PID                : 4061
        FLAGS              :
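
If you need to do this on several machines, the state and PID can be read from the output programmatically. The sketch below parses saved `sc queryex` text like the output above and builds (but does not run) the matching taskkill command used in the next step. The function names are made up for illustration.

```python
# Sample taken from the `sc queryex vpxd` output shown above.
SAMPLE_OUTPUT = """\
SERVICE_NAME: vpxd
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 3  STOP_PENDING
                                (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x493e0
        PID                : 4061
        FLAGS              :
"""


def parse_sc_queryex(output):
    """Return (state_name, pid) parsed from `sc queryex` output."""
    state = None
    pid = None
    for line in output.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # lines such as the "(STOPPABLE, ...)" continuation
        key, value = key.strip(), value.strip()
        if key == "STATE" and value:
            # Value looks like "3  STOP_PENDING"; keep the symbolic name.
            state = value.split()[-1]
        elif key == "PID" and value:
            pid = int(value)
    return state, pid


def build_taskkill_command(pid):
    """Return the argument list for force-killing the given PID."""
    return ["taskkill", "/f", "/pid", str(pid)]


if __name__ == "__main__":
    state, pid = parse_sc_queryex(SAMPLE_OUTPUT)
    if state in ("STOP_PENDING", "START_PENDING") and pid:
        print(" ".join(build_taskkill_command(pid)))
```

The script only prints the command, so you can review it before running it by hand.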

Now run this command, replacing xxxx with the PID you noted:

C:\> taskkill /f /pid xxxx

Here the PID is 4061, so:

C:\> taskkill /f /pid 4061

This terminates the service process immediately. Once done, you can start the service either from the GUI (Services console) or from the command prompt by running this command:

C:\> sc start vpxd

That’s it... :)