Rescan-O-Death!

VMware is great! It offers wonderful functionality, like VMotion, High Availability, Distributed Resource Scheduler and much more, right from a Graphical User Interface that any 11-year-old can understand.

 I really mean that. But beware! This wonderful GUI allows you to destroy your Virtual Infrastructure without warning. So you’d better know what you are doing.

I’d like to point you to a known issue from ESX version 3.0.1 that I found to be still present in version 3.5. Its results are not End-Of-The-World-type mayhem, but they will still make the most experienced administrator break a sweat.

I’m talking about the Rescan Storage dialog:

[Screenshot: VMware Rescan Storage dialog]

Clicking OK while both checkboxes are checked is like playing Russian Roulette with your ESX server. Most of the time you’ll be able to have a cup of coffee while this is running. But sometimes… BANG! … your ESX host hangs and you’ll be the one doing the running. Because when ESX hangs, all VMs hang. Which causes a lot of your users to want to call you. And they won’t want to discuss the weather.

The problem is caused by a deadlock situation on the HBA and can only be resolved by a hard reset of your server.

So how do we prevent this horrible event from ever occurring, you ask? Simple. Remember to never check more than one box at a time when using this dialog.

So, how’s the weather over there?


4 thoughts on “Rescan-O-Death!”

  1. I had a similar result when scanning the disks. Taking the host out of the cluster and putting it back in again helped with moving the VMs to another host, after which I could restart the host that hung.

    Another bug is that the scan will switch to a different host mid-scan if two checkboxes are checked and you change the focus to another host.

  2. Just hit this one… Nasty stuff. Worth mentioning that the HBA itself remained usable (and the VMs alive), but the “scantools” process was in deep meditation… Couldn’t kill it, of course, but killing the parent process (esxcfg-rescan) did the trick for me. That is, I was able to restart the mgmt agents and at least make the VC see the host again. Now moving the VMs to other hosts, preparing to reboot the damn thing.
    Thanks for posting this experience ;)
    P.S. Server is Sun Fire X4600

  3. Guys,
    I’m not surprised. If you are using shitty storage with a lot of SCSI reservation problems, or NFS with network interruptions, that will always happen.
    Please install ESX 4.1 u1 and check again…

    regards,
    TmaX

  4. Hey TmaX,
    As you can see, the post is from 2008 and refers to ESX 3.5. I am currently running ESXi 4.1 and have not seen the issue since ESX 4.
    Hugo
