Monday Sucks: DPM BMR Restore of Hyper-V Cluster Nodes

First of all – let me tell you that I did not get to this point lightly. As a matter of fact it was quite honestly the LAST damned place I wanted to be. So how did I get here?

A while back Dell, who we purchased our EqualLogic (EQL for short) units from, contacted us about some annual maintenance that is apparently required as part of our service contract. They email you out of the blue and basically tell you that you have 30 days to schedule a time with an engineer who will assist you in updating the firmware on your EQL units. Oh and by the way, they want you to update your Host Integration Tools (HIT) Kit on all iSCSI connected servers at the same time. So, I went about collecting the information they wanted and getting some MX time set up for a weekend. I understand (and completely agree with) updating firmware and software that fixes big bugs, so I don’t really blame Dell for wanting this done, and quite honestly once a year is not too big of a deal.

So, the MX day comes along and all goes pretty well, aside from the fact that it took about 12 hours to get the whole thing done because of the EQL firmware, switch firmware and the long list of servers that needed to be done, not to mention that (per Dell) HIT Kit 3.5.1 needs to be uninstalled before moving to 4.0. Long day. However, one little issue sprang up. We run Windows Core for our Hyper-V hosts. The tech who was running the show couldn’t find the documentation for how to remove 3.5.1 from Core, so the install of 4.0 was run directly over 3.5.1. Turns out… that was a BAD BAD move. Why? Well, my Core Hyper-V hosts are basically headless and configured to boot directly from iSCSI, not from internal disks. After said 4.0 installation, the system for some reason keeps generating more and more iSCSI connections to the boot LUN until… the server crashes, and the VMs on said node fail over (ungracefully) to another host and spin back up. All told, my Hyper-V hosts crash at least once a day.

How do I know Dell’s HIT Kit is to blame? The logs… it’s all in the logs. I went from a few Event ID 116 (Removal of device xxxx on iSCSI session xxxx was vetoed by xxxx) events a week to 200+ a day per host… and the logs started going nuts right after the HIT Kit update reboot.

Note: Typically there’s about 40-50 per EQL Unit not 200 per unit. When Nagios starts sending me these alarms I’m guaranteed a host crash within minutes.

Anybody else see the pattern that starts to develop at some random time after boot? This could happen 12 hours after boot or 3 hours. No rhyme or reason. Just starts creating new connections to the LUN and never releases the old ones until the system goes down in flames.
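
If you want to watch the same trend on your own hosts, here is a rough sketch that buckets Event ID 116 entries by day. It assumes the events land in the System log and that Get-WinEvent is available on the host; the function names Get-VetoEvent and Group-EventByDay are made up for illustration and are not part of the HIT Kit or any Dell tooling.

```powershell
# Hypothetical helpers for illustration only.

# Fetch recent Event ID 116 entries.
# Assumption: these events are written to the System log.
function Get-VetoEvent {
    param([int]$Days = 7)

    $filter = @{
        LogName   = 'System'
        Id        = 116
        StartTime = (Get-Date).AddDays(-$Days)
    }

    # Get-WinEvent raises an error when nothing matches, so silence that case.
    Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue
}

# Count events per calendar day for any objects that expose TimeCreated.
function Group-EventByDay {
    param([object[]]$Events)

    if (-not $Events) { return }

    $Events |
        Group-Object { $_.TimeCreated.ToString('yyyy-MM-dd') } |
        Sort-Object Name |
        Select-Object @{ Name = 'Day'; Expression = { $_.Name } }, Count
}

# Usage on a host:
#   Group-EventByDay -Events (Get-VetoEvent -Days 14)
```

A jump from a handful of events per week to hundreds per day should stand out right away in the per-day counts.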

Case opened with Dell, spent almost 4 hours on the phone while they had tech after tech look at it and come up with bupkis. At Dell’s recommendation I’ve uninstalled 4.0 and gone back to 3.5.1. I’ve gone through the registry bit by bit looking for residual 4.0 stuff. I’ve spent all kinds of time on TechNet and Google looking for anything I can find to try and solve the issue. No joy. I’ve sent multiple DSETs and a couple of Lasso files. Nada. The final straw this morning (summary of email conversation – I was much more polite than this):

Tech: Please send more DSETs so we can send to Engineers.

Me: WTF happened to the last ones I sent on Friday?! Nothing has changed except for more crashes.

Tech: Oh… there they are teehee … let me get them to the engineers.

Me: Like what should have happened two days ago?

… sounds of birds chirping as the emails stopped coming …

Anyway, part of this was my own stupid. I should have snapshotted the boot LUNs prior to the MX (and dammit, I thought about it and then forgot to add it to my checklist). Whoops. Thankfully I was smart enough to make sure I had BMR backups being done on the Hyper-V hosts (and had tested them before actually trusting them).

The process for running a BMR restore is pretty simple – and even though these are clustered hosts the process remains pretty much the same (even from an iSCSI boot perspective).

The generic process can be found here:

The only gotchas with my setup were:

1. I had to transition all the VMs from one host to the other and put the host in MX mode (SCVMM).

2. I had to allow the iSCSI connections to start and then cancel booting from the LUN. I was then prompted to boot from CD. This way the Boot LUN was attached to the server and Windows Setup could see it. Obviously doing the restore setup from Admin Tools wasn’t an option on Core.

3. Since my BMR was apparently a little too old (2 weeks?!) I had to disjoin the host from the domain and rejoin it. Not a big deal and the cluster picked the host right back up as if nothing had changed.

4. Only do one host at a time (depends on your cluster’s ability to tolerate failure obviously).

Now I get to spend many hours monitoring every little thing to make sure the host stays stable and it doesn’t go LUN happy again. Needless to say HIT Kit 4.0 is no longer on my ‘update’ list. Here’s hoping this fixes it…

UPDATE: No go. This did not fix the issue. At this point I’m at a loss. I went back to a pre-install BMR recovery and I’m still having the same problem, which did NOT exist when the BMR backup was taken. The temporary solution for now is that I’m migrating one host to use physical disks inside the chassis (post on how I’m doing that upcoming) – so at least one host is stable and can host the critical VMs along with the ones that can’t handle spontaneous reboots very well (like SQL).

VHD Parent Locator Fixup Failed – DPM 2010 ID 3130

DPM failed to fixup VHD parent locator post backup for Microsoft Hyper-V \Backup Using Child Partition Snapshot\SERVER(SERVER) on DPMSERVER. Your recovery point is valid, but you will be unable to perform item-level recoveries using this recovery point. (ID 3130)

So I’ve seen this happen a couple of times now on my DPM 2010 installation. I wasn’t overly worried about it because it was an occasional alert and usually the next set of recovery points would be fine. However, after a couple of VMs started having the same recovery point error on a near nightly basis, it was time to start figuring out what was going on.

I found this thread which contained a link to this KB article, and that was the solution for the original poster. Unfortunately that KB didn’t really apply to me because all the hosts/VMs in question are 2008 R2. But what it did do is point me to the root of the problem itself. A while back I had set up checkpoints on the systems in question, and a short while later (when I was done testing a couple of things) I deleted all the previous checkpoints for those systems and kept only the current running point in time. The problem was that I never took the downtime to merge the differencing disks, and naturally I later forgot that I needed to. So what would end up happening is that one of these VMs would be off for, say, 5 minutes while I made configuration changes, and Hyper-V would begin the merge process. However, SCVMM didn’t show anything like that happening, so when I was done with my config changes I would boot the VM and Hyper-V would cancel the merge process.

So, in short, the solution to this error is quite simple: shut down your VM, open the Hyper-V console (not SCVMM) and wait while the merge process completes. Depending on the size of the differencing disks it may take a (long) while.

And for reference – the easiest way to determine if you have a disk merge that needs to be completed is to browse to your VM storage and check to see if there are AVHD files in there. You can also open the VM’s XML configuration and look for this line:
<disk_merge_pending type="bool">True</disk_merge_pending>
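
If you would rather not browse folders by hand, both checks can be scripted. This is only a sketch: the function names and the $VmStoragePath parameter are made up for illustration, and it assumes the merge flag and its value sit on the same line of the VM's XML configuration file.

```powershell
# Hypothetical helpers for illustration only.

# Leftover differencing disks (.avhd) mean a merge has not finished.
function Find-AvhdFile {
    param([Parameter(Mandatory = $true)][string]$VmStoragePath)

    Get-ChildItem -Path $VmStoragePath -Recurse -Filter '*.avhd' |
        Select-Object FullName, Length, LastWriteTime
}

# VM configuration files that still carry the merge-pending flag.
# Assumption: the element and its True value appear on one line.
function Find-MergePendingConfig {
    param([Parameter(Mandatory = $true)][string]$VmStoragePath)

    Get-ChildItem -Path $VmStoragePath -Recurse -Filter '*.xml' |
        Select-String -Pattern 'disk_merge_pending[^>]*>\s*True' |
        Select-Object Path, LineNumber
}

# Usage (the path is a placeholder for your own VM storage):
#   Find-AvhdFile -VmStoragePath 'D:\VMs'
#   Find-MergePendingConfig -VmStoragePath 'D:\VMs'
```

Any output from either function points at a VM that still needs its downtime for the merge.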

Forcing an Update of the Exchange 2007 GAL

Some things should be simple. Like just right clicking the GAL in the EMC and selecting update. Then updating Outlook. But Nooooooo… do you think that actually works when you need it to?! Of course not.

So, fire up the Exchange Management Shell —

PS> Get-OfflineAddressBook | Update-OfflineAddressBook

PS> Update-FileDistributionService -Identity YourCASServer

Head on over to Outlook and download the address book and your changes should be there.

Exchange 2007 Quick Tip: Find disabled AD users with active mailboxes and stop their email

Doing a little cleanup today and needed to check who was “disabled” in AD but still had mailboxes on our Exchange server. In case you weren’t aware, Exchange mailboxes remain active even if the user is disabled. As part of how we do things, we keep mailboxes around for a looooong time because people tend to leave/retire then come back on a temp basis. So when they return, they have all their old emails available. So, first let’s compare AD w/ Exchange and get a list of folks… (source for the code below)

Download Quest powershell.
Run the PS query below
> get-qaduser -includedproperties altrecipient, homeMDB -disabled | select-object -property "name", "description", "altrecipient", "homeMDB" | export-csv c:\mailboxes.csv -notypeinformation
Then sort by HomeMDB.

Now you have a list of folks to work with. The next step, if you want to stop email flowing to that mailbox, is to do one of a couple of things (there are some other options too). You can either restrict who can send email to that address (say, to a dummy account in your organization only), which will prevent anyone else from sending to that mailbox, or you can change the primary SMTP address to something else so that mail sent to the original address comes back as non-deliverable.

What’s the difference? Not much. If you restrict who can email you will get the following NDR:

Your message wasn’t delivered because of security policies. Microsoft Exchange will not try to redeliver this message for you. Please provide the following diagnostic text to your system administrator. #550 5.7.1 RESOLVER.RST.NotAuthorized; not authorized ##

If you change the primary address to something else you will get this:

The recipient’s e-mail address was not found in the recipient’s e-mail system. Microsoft Exchange will not try to redeliver this message for you. Please check the e-mail address and try resending this message, or provide the following diagnostic text to your system administrator. #550 5.1.1 RESOLVER.ADR.RecipNotFound; not found ##

I personally prefer the address not found. That to me is a little more definitive and doesn’t say oops you can’t do that, please call me and ask for permission to do it. It says oops, that address is wrong, check it and make sure you’re sending to someone who is still here.  My standard format for changing addresses is to leave the username and add in _DISABLED. So the new address looks like:

Then if you want to easily find everyone who has a disabled email… the search is like this:

> Get-Recipient -ResultSize Unlimited -Filter "EmailAddresses -like '*_DISABLED@domain.local' -And RecipientType -eq 'UserMailbox'"


Exchange 2007 Quick Tip: Getting Members of a Dynamic Distribution Group

If you need to get a listing of all the people (and email addresses) who will receive an email from a dynamic distribution group and then export that to CSV:

$members = Get-DynamicDistributionGroup -resultsize unlimited -Identity "Distribution Group Name"
Get-Recipient -resultsize unlimited -RecipientPreviewFilter $members.RecipientFilter | select Displayname,PrimarySmtpAddress | Export-Csv C:\temp\NameOfCSV.Csv

Exchange 2007 Quick Tip: Searching Email Addresses

A couple of quick tips since I had to do a similar search this morning. First let’s say you need to search all your recipients for an email address. Normally not a big deal since you should know their name and can check the mailbox. But what if it’s an email address that isn’t obvious and is on a different domain?

get-recipient -ResultSize Unlimited | where {$_.emailaddresses -match “”} | select name,emailaddresses | fl —- (source)

That will get all recipients then filter based on the domain listed. You can also of course just change the filter to the full email address and that will work as well.

Unfortunately my search didn’t return what I was looking for so I wanted to check and see if maybe it was someone who was using a forwarding address. Here’s that search as well:

Get-Mailbox | Where {$_.ForwardingAddress -ne $null} |Select Name, ForwardingAddress, DeliverToMailboxAndForward | fl —- (source)

Exchange 2007 Add To Distribution Group From Text File

Another interesting request came my way. Since I work in a school district we have a decent amount of “turnover” or whatever you may want to call it every year and at random times throughout the year. This isn’t necessarily folks leaving the district but maybe moving from one grade level to another or taking leave and having a long term substitute take their place. So the request was to have these distribution groups which are populated with a listing of teachers per grade level, generated from our student information system. Naturally the information system doesn’t provide a way to update Exchange and is not at all AD integrated. So what I did was request an export of teachers by grade level (one text file per grade level) and their (thankfully aligned with AD) username. I then placed those text files in a folder, created the distribution groups and ran the below script.

First use powershell to create a distribution group: