Hyper-V Cluster Service Crash During DPM 2010 Backup and EqualLogic Hardware VSS
Hurrah! Another interesting issue while using Hyper-V with CSV. The problem was odd: if I kicked off a backup of a DPM protection group that is configured to back up using the Child Partition Snapshot of the VMs, the cluster service would crash on a random Hyper-V node, the backups of the VMs on that node would fail, and the VMs would be brought up on another node. Definitely not what I would like to have seen happen. The problem is there wasn't a whole heck of a lot of information to go on in the logs. In fact, the only really recurring error was Event ID 5121, which is expected, but only if your storage provider didn't supply a hardware VSS component.
Cluster Shared Volume 'Volume1' ('Cluster Disk - CSV') is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network through the node that owns the volume. This may result in degraded performance. If redirected access is turned on for this volume, please turn it off. If redirected access is turned off, please troubleshoot this node's connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.
EqualLogic does support hardware VSS, but you need to configure it to work. This should remove any Event ID 5121 errors... of course it didn't with mine. (Need to reboot, maybe...)
1. Launch the Remote Setup Wizard from Programs menu -> EqualLogic -> Remote Setup Wizard. (On Core, it lives in c:\Program Files\EqualLogic\bin.)
2. In the Remote Setup Wizard, select "Configure this computer to access a PS series SAN".
3. If the group is not added, add the group. If the group is already added, select it and click "Modify".
4. Verify that the Group Name, Group IP address, CHAP credentials, and password are entered correctly. Note that the CHAP credentials for VSS/VDS access could be different from those for iSCSI access.
5. Restart the EqualLogic VSS provider (net stop eqlvss & net start eqlvss).
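If you have several nodes to touch, step 5 plus a quick sanity check can be scripted. Below is a minimal sketch, assuming Python 3 is installed on the node (it isn't there by default) and that you run it from an elevated prompt. It restarts the eqlvss service with the same net commands as above, then parses the output of vssadmin list providers to see whether a matching provider is registered. The function names and the search string you pass in are my own placeholders; check what your provider is actually called in the vssadmin output.

```python
import re
import subprocess


def provider_names(vssadmin_output):
    """Pull the quoted names out of `vssadmin list providers` text output."""
    return re.findall(r"Provider name:\s*'([^']*)'", vssadmin_output)


def restart_service(name, run=subprocess.run):
    """Stop then start a Windows service via `net`, mirroring step 5."""
    # The stop may fail if the service is already stopped, so don't raise on it.
    run(["net", "stop", name], check=False)
    run(["net", "start", name], check=True)


def provider_registered(needle, run=subprocess.run):
    """Return True if any registered VSS provider name contains `needle`.

    `needle` is an assumption: use whatever substring matches the name
    your hardware provider actually reports.
    """
    result = run(["vssadmin", "list", "providers"],
                 capture_output=True, text=True, check=True)
    names = provider_names(result.stdout)
    return any(needle.lower() in n.lower() for n in names)
```

On a node you would call restart_service("eqlvss") and then something like provider_registered("EqualLogic"). The run parameter is only there so the functions can be exercised without touching a real service.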
Turns out the actual crashing was an easy fix with a patch from Microsoft: KB2494162. Install that, reboot, and let DPM run rampant. So far I've been running my protection group every hour on the hour (to test) and it hasn't hiccuped once.
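Since the crash hit a random node, the patch needs to be on every node in the cluster. Here is a rough sketch for checking a node, again assuming Python 3 is available there. It shells out to wmic qfe get HotFixID and looks for the KB number in the list; the helper names are made up for illustration, and the parsing assumes the usual one-ID-per-line output.

```python
import subprocess


def installed_hotfixes(wmic_output):
    """Parse `wmic qfe get HotFixID` output into a set of KB identifiers."""
    ids = set()
    for line in wmic_output.splitlines():
        entry = line.strip().upper()
        # Skip the "HotFixID" header row and blank lines.
        if entry.startswith("KB"):
            ids.add(entry)
    return ids


def hotfix_installed(kb, run=subprocess.run):
    """Return True if hotfix `kb` (e.g. "KB2494162") is listed on this machine."""
    result = run(["wmic", "qfe", "get", "HotFixID"],
                 capture_output=True, text=True, check=True)
    return kb.upper() in installed_hotfixes(result.stdout)
```

Run hotfix_installed("KB2494162") on each node before kicking off the next backup; any node that comes back False still needs the patch and a reboot.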