Migrating a 2003 R2 Physical Server to Hyper-V 2012 R2 in 2018….

Here are the details on how I was able to migrate a 2003 R2 server from physical to virtual.

I used Disk2VHD to convert the running physical machine to virtual hard drives. Note: Do not have it prep for the old Virtual Server software and do NOT have it convert to VHDX. I used a temporary disk to copy the physical drives to during the conversion process. I then used a standalone Hyper-V server to mount the OS disk it had converted to VHD.

Upgrade Steps Outline – SCVMM 2008 SP1 to 2012 SP1

 

What follows is the basic outline I followed to upgrade System Center Virtual Machine Manager 2008 SP1 running on a Windows Server 2008 R2 host to SCVMM 2012 SP1. Obviously each step has small libraries of documentation you could read, warnings you could ignore, and whatnot – but if you’re simply looking to make sure you’ve got the right steps planned out, this is what I did to upgrade our setup:

  1. Make sure your OS / Data drives are at least 60 GB. You’ll need this for all the upgrades.
  2. Download SCVMM 2012 as well as SCVMM 2012 SP1. You can’t go straight to 2012 SP1 from 2008 SP1.
  3. Upgrade SQL 2008 to 2008 R2 if you aren’t running it already. SCVMM 2012 SP1 doesn’t support 2008.
  4. BACKUP BACKUP BACKUP!
  5. Upgrade SCVMM to 2012
  6. Install SCVMM 2012 RU3  (http://support.microsoft.com/kb/2756127)
  7. Uninstall SCVMM 2012.
    I know, I know, but SCVMM 2012 does not support installation on Server 2012 and because there’s no direct upgrade path to SCVMM 2012 SP1 from 2008, you have to update the SCVMM database.
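
The 60 GB requirement in step 1 is easy to check before starting. Here is a minimal Python sketch of that pre-flight check. It assumes "at least 60 GB" refers to the total size of the drive (it reports free space as well), and the function name `drive_report` is made up for illustration:

```python
import shutil

REQUIRED_GB = 60  # figure taken from step 1 of the outline above


def drive_report(path, required_gb=REQUIRED_GB):
    """Report total and free space for the drive that holds `path`.

    Assumption: the requirement is about total drive size; compare
    against `free_gb` instead if you read it as free space.
    """
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gb": usage.total / gib,
        "free_gb": usage.free / gib,
        "big_enough": usage.total >= required_gb * gib,
    }
```

Calling `drive_report("C:\\")` on the SCVMM host returns a small dictionary you can eyeball or log before kicking off the upgrades.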

Monday Sucks: DPM BMR Restore of Hyper-V Cluster Nodes

First of all – let me tell you that I did not get to this point lightly. As a matter of fact it was quite honestly the LAST damned place I wanted to be. So how did I get here?

A while back Dell, who we purchased our EqualLogic (EQL for short) units from, contacted us about apparently requiring some annual maintenance as part of our service contract. So they email you out of the blue and basically tell you that you have 30 days to schedule a time with an engineer who will assist you in updating the firmware on your EQL units. Oh and by the way, they want you to update your Host Integration Tools (HIT) Kit on all iSCSI connected servers at the same time as well. So, I went about collecting the information they wanted and getting some MX time set up for a weekend. I understand (and completely agree with) updating firmware and software that fixes big bugs, so I don’t really blame Dell for wanting this done, and quite honestly once a year is not too big of a deal.

So, the MX day comes along and all goes pretty well aside from the fact that it took about 12 hours to get the whole thing done because of the EQL firmware, switch firmware and the long list of servers that needed to be done, not to mention that (per Dell) HIT Kit 3.5.1 needs to be uninstalled before moving to 4.0. Long day. However one little issue sprung up. We run Windows Core for our Hyper-V hosts. The tech who was running the show couldn’t find the documentation for how to remove 3.5.1 from Core – so the install of 4.0 was run directly over 3.5.1. Turns out… that was a BAD BAD move. Why? Well my Core Hyper-V hosts are basically headless and configured to boot directly from iSCSI, not from internal disks. After said 4.0 installation – the system for some reason continues to generate more and more iSCSI connections to the boot LUN until…. the server crashes, the VMs on said node fail over (ungracefully) to another host and spin back up. All told my Hyper-V hosts crash once a day, at least.

How do I know Dell’s HIT Kit is to blame? The logs… it’s all in the logs. I went from a few Event ID 116 (Removal of device xxxx on iSCSI session xxxx was vetoed by xxxx) events a week to 200+ a day per host .. and the logs started going nuts right after the HIT Kit update reboot.

Note: Typically there’s about 40-50 per EQL Unit not 200 per unit. When Nagios starts sending me these alarms I’m guaranteed a host crash within minutes.

Anybody else see the pattern that starts to develop at some random time after boot? This could happen 12 hours after boot or 3 hours. No rhyme or reason. Just starts creating new connections to the LUN and never releases the old ones until the system goes down in flames.

Case opened with Dell, spent almost 4 hours on the phone while they had tech after tech look at it and come up with bupkis. At Dell’s recommendation I’ve uninstalled 4.0 and gone back to 3.5.1. I’ve gone through the registry bit by bit looking for residual 4.0 stuff. I’ve spent all kinds of time on TechNet and Google looking for anything I can find to try and solve the issue. No joy. I’ve sent multiple DSETs, a couple Lasso files. Nada. The final straw this morning (summary of email conversation – I was much more polite than this):

Tech: Please send more DSETs so we can send to Engineers.

Me: WTF happened to the last ones I sent on Friday?! Nothing has changed except for more crashes.

Tech: Oh… there they are teehee … let me get them to the engineers.

Me: Like what should have happened two days ago?

… sounds of birds chirping as the emails stopped coming …

Anyway, part of this was my own stupid. I should have snapshotted the boot LUNs prior to the MX (and dammit, I thought about it and then forgot to add it to my checklist). Whoops. Thankfully I was smart enough to make sure I had BMR backups being done on the Hyper-V hosts (and had tested them before actually trusting them).

The process for running a BMR restore is pretty simple – and even though these are clustered hosts the process remains pretty much the same (even from an iSCSI boot perspective).

The generic process can be found here:
http://blogs.technet.com/b/dpm/archive/2010/05/12/performing-a-bare-metal-restore-with-dpm-2010.aspx

The only gotchas with my setup were:

1. I had to transition all the VMs from one host to the other and put the host in MX mode (SCVMM).

2. I had to allow the iSCSI connections to start and then cancel booting from the LUN. I was then prompted to boot from CD. This way the Boot LUN was attached to the server and Windows Setup could see it. Obviously doing the restore setup from Admin Tools wasn’t an option on Core.

3. Since my BMR was apparently a little too old (2 weeks?!) I had to disjoin the host from the domain and rejoin it. Not a big deal and the cluster picked the host right back up as if nothing had changed.

4. Only do one host at a time (depends on your cluster’s ability to tolerate failure obviously).

Now I get to spend many hours monitoring every little thing to make sure the host stays stable and it doesn’t go LUN happy again. Needless to say HIT Kit 4.0 is no longer on my ‘update’ list. Here’s hoping this fixes it…

UPDATE: No go. This did not fix the issue. At this point I’m at a loss. I went back to a pre-install BMR recovery and I’m still having the same problem – which did NOT exist when the BMR backup was taken. Temporary solution now is I’m migrating one host to use physical disks inside the chassis (post on how I’m doing that upcoming) – so at least one host is stable and can host the critical VMs along with the ones that can’t handle spontaneous reboots very well (like SQL).

VHD Parent Locator Fixup Failed – DPM 2010 ID 3130

Error:
DPM failed to fixup VHD parent locator post backup for Microsoft Hyper-V \Backup Using Child Partition Snapshot\SERVER(SERVER) on DPMSERVER. Your recovery point is valid, but you will be unable to perform item-level recoveries using this recovery point. (ID 3130)

So I’ve seen this happen a couple times now on my DPM 2010 installation. I wasn’t too overly worried about it because it was an occasional alert and usually the next set of recovery points would be fine. However, after a couple VMs started having the same recovery point error on a near nightly basis it was time to start figuring out what was going on.

I found this thread which contained a link to this KB article and that was the solution for the original poster. Unfortunately that KB didn’t really apply to me because all the hosts/VMs in question are 2008 R2. But what it did do is point me to the root of the problem itself. What I had done a while back was set up checkpoints on the systems in question and then a short while later (when I was done testing a couple things) I deleted all the previous checkpoints for those systems and kept only the current running point in time. The problem was I never took the downtime to merge the differencing disks and naturally I later forgot that I needed to. So, what would end up happening is I would have, say, 5 minutes of one of these VMs being off as I made configuration changes and Hyper-V would begin the merge process. However, SCVMM didn’t show anything like that happening, so when I was done with my config changes, I would boot the VM and Hyper-V would cancel the merge process.

So, in short, the solution to this error is quite simple: shut down your VM, open the Hyper-V console (not SCVMM) and wait while the merge process completes. Depending on the size of the differencing disks it may take a (long) while.

And for reference – the easiest way to determine if you have a disk merge that needs to be completed is to browse to your VM storage and check to see if there are AVHD files in there. You can also open the VM’s XML configuration and look for this line:
<disk_merge_pending type="bool">True</disk_merge_pending>
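
If you have more than a handful of VMs, both checks can be scripted. What follows is a rough Python sketch, not anything Hyper-V or DPM ships with. The function names are made up for illustration, and the pattern assumes the merge flag appears in the configuration as a `disk_merge_pending` element whose value is `True`. It works on the configuration text you hand it, so reading the file with the right encoding is left to you.

```python
import re
from pathlib import Path

# Assumed layout of the flag: <disk_merge_pending ...>True</disk_merge_pending>
MERGE_PENDING = re.compile(
    r"<disk_merge_pending[^>]*>\s*true\s*</disk_merge_pending>",
    re.IGNORECASE,
)


def find_avhd_files(vm_storage):
    """Return every .avhd file found anywhere under vm_storage."""
    root = Path(vm_storage)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".avhd"
    )


def has_pending_merge(config_text):
    """Return True if the VM configuration text flags a pending merge."""
    return MERGE_PENDING.search(config_text) is not None
```

Point `find_avhd_files` at the folder that holds your VM disks. Any path it returns belongs to a VM that likely still needs its shutdown-and-merge window.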

HMH Access 21st Century, DEP and You!

Alright, if you haven’t had the pleasure of meeting Access 21st Century yet – and most of you never will – then here’s a quick rundown.

First here’s the FAQ / Tech Notes page – you’re going to need it.

Houghton Mifflin Harcourt is the developer/publisher. Long story short, this application provides a (computer based) learning environment for at-risk kids. And that’s about the extent of my knowledge of what the product actually does. Sorry – I’m a systems administrator, not a teacher, so going hands on in depth with the learning environment just isn’t part of the role.

Anyway, there are a couple important things to know that may very well trip you up if you end up installing this platform for your district.

First: DO NOT, I repeat DO NOT trust the built in database backup utility in the application. I’m not sure what the story is with it – but 90% of the time it fails to provide a backup at all. When it does – you can bet that you’re going to try 9 or 10 times to get it to restore. So, by all means use the utility but have a backup plan! My personal method (after a nightmare of problems with the application today) is to stop the Access 21st Century service and take a copy of the data directory in Program Files. Yes, it’s a manual process, but I see no other option. I’ll continue backing up the VM, but these other two methods of app DB backup and data directory backup might be the only way to get the data restored. Speaking of – when you can’t restore the database properly – try this:

Continue reading “HMH Access 21st Century, DEP and You!”

Hyper-V Cluster Service Crash During DPM 2010 Backup and EqualLogic Hardware VSS

Hurrah! Another interesting issue while using Hyper-V with CSV. The problem was odd – if I kicked off a backup of a protection group in DPM which is configured to back up using the Child Partition Snapshot of the VMs, on a random Hyper-V node the cluster service would crash, the attached VMs’ backups would fail, and the VMs would be brought up on another node. Definitely not what I would like to have seen happen. The problem is there wasn’t a whole heck of a lot of information to go off of in the logs. In fact the only really recurring error is Event ID 5121, which is expected – but only if your storage provider didn’t supply a hardware VSS component.

Cluster Shared Volume ‘Volume1’ (‘Cluster Disk – CSV’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network through the node that owns the volume. This may result in degraded performance. If redirected access is turned on for this volume, please turn it off. If redirected access is turned off, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.

EqualLogic does support hardware VSS – but you need to configure it to work. This should remove any Event ID 5121 errors…. of course it didn’t with mine. (Need to reboot maybe…)

1. Launch the Remote Setup Wizard from Programs menu->EqualLogic->Remote Setup Wizard (on Core, look in c:\Program Files\EqualLogic\bin).
2. In the Remote Setup Wizard select “Configure this computer to access a PS series SAN”.
3. If the group is not added then add the group. If the group is added, select the group and click on “Modify”.
4. Verify that “Group Name”, Group IP address, CHAP credentials and the password are entered correctly. Note that the CHAP credentials for VSS/VDS access could be different from iSCSI access.
5. Restart the EqualLogic VSS provider (net stop eqlvss & net start eqlvss).

Turns out the actual crashing was an easy fix with a patch from Microsoft – KB2494162. Install that, reboot and let DPM run rampant. So far I’ve been running my protection group every hour on the hour (to test) and it’s not hiccuped once.

Configuring Broadcom iSCSI Offload on Dell R715

See this document – pages 91 to 102. The most important thing to do is make sure you select the correct adapter (page 101). Most documentation is going to show you that you need to select “Microsoft iSCSI Initiator” in iscsicpl (Discovery and Targets tabs). This is not correct if you want to use the iSCSI offload functionality of the Broadcom NICs. You need to choose the “Broadcom NetXtreme” device(s) listed in the drop downs. If you select the Microsoft iSCSI Initiator you will still get decent performance – but the iSCSI Offload performance is a decent boost (albeit with a slight impact to CPU performance).

Here’s the difference I saw using iometer (note: this was done against an iSCSI boot C: drive with a CSV in the cluster volumes directory. I know that’s not good to test against, but it’s what I had available.)

Without iSCSI Offload:

And with iSCSI Offload

So here are the basic steps in Server Core:

  1. Open BACS and expand the NIC you’re planning to use for iSCSI. There will be a NIC icon and an iSCSI icon:
  2. Highlight the iSCSI Adapter in the left panel. In the right hand panel select the Configuration tab. Default will be IPv4 DHCP. Disable that and manually enter an address (this is easier than figuring out which adapter is which in the next steps). Click Apply at the bottom. Wash, rinse and repeat for your remaining iSCSI adapters.
  3. Launch the iSCSI Initiator (iscsicpl)
  4. In the Discovery tab, click Discover Portal –> enter your SAN IP and port # –> click Advanced –> in the local adapter drop down select your iSCSI adapter (Broadcom) –> in the initiator IP make sure you select the IP you assigned in step 2.
  5. In the Targets tab (refresh your targets if it’s empty) –> highlight your target –> click Connect –> enable multipath if needed –> click Advanced –> repeat the selections you made in step 4
  6. You should now see the connections being offloaded in BACS

Dell R715 w/ EqualLogic Failover Cluster Validation Failure

Cluster Disk 0 does not support Persistent Reservations. Some storage devices require specific firmware versions or settings to function properly with failover clusters. Please contact your storage administrator or storage vendor to check the configuration of the storage to allow it to function properly with failover clusters.
Well that’s just great. Another stepping stone into my hair turning grey and being yanked out at the roots. The above is what I get when I try to validate the cluster I’m building for Hyper-V. Connected are 3 hosts, each with MPIO iSCSI connections to a pair of EqualLogic PS6000 arrays hosting a 2TB LUN for data. I thought I had followed all the correct steps to get this to work… until this happened.
A little searching brought me to this thread on TechNet which basically describes my issue to a T.

The root cause of the problem is the iqn name that is passed to the Equallogic. If you boot diskless and the first 34 characters of your IQN names are identical then you will have this problem with Windows 2008 R2 Datacentre Edition, for clustering.

Somewhere between the Microsoft ISCSI Initiator and the Broadcom driver/Boot ROM the iqn name is being concatenated to 34 characters. So when you attempt to do a Cluster Validation the requests for PR going to the Storage are essentially seen to be coming from the same host by the Equallogic, as the first 34 characters of the iqn names are identical.

So, the workaround is to make sure your IQNs are unique within the first 34 characters. I have adjusted mine to reflect the server names.

The IQN name field is supposed to accept 255 characters. It is unclear at the moment if the problem is a bug in Microsoft or Broadcom. But given the time taken to prove to Dell there actually was a problem….I am happy to go with the workaround.

I’m not going to blame Microsoft at this point… I am beginning to get the suspicion that these Broadcom NICs and drivers/firmware are nothing but complete crap and need to be killed with fire.

 

Following those hints above and setting the IQN for each server to the hostname worked perfectly.
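
If you have a lot of hosts, it’s worth checking the IQNs for collisions before running cluster validation. Below is a minimal Python sketch of that check, assuming the 34-character cutoff described in the thread. The function name and the hostnames in the usage note are made up for illustration, and lower-casing the names before comparing them is my own assumption:

```python
from collections import defaultdict

PREFIX_LEN = 34  # cutoff reported in the thread quoted above


def find_iqn_collisions(iqns, prefix_len=PREFIX_LEN):
    """Group IQNs by their first `prefix_len` characters.

    Returns a dict mapping each shared prefix to the IQNs that would
    look identical to the array. An empty dict means no collisions.
    """
    groups = defaultdict(list)
    for iqn in iqns:
        # Assumption: compare case-insensitively.
        groups[iqn[:prefix_len].lower()].append(iqn)
    return {prefix: names for prefix, names in groups.items() if len(names) > 1}
```

For example, `iqn.1991-05.com.microsoft:hyperv-node01.example.local` and `iqn.1991-05.com.microsoft:hyperv-node02.example.local` share their first 34 characters (the vendor part alone takes 26), so the function reports them as a collision. Short names like `iqn.1991-05.com.microsoft:hv01` and `iqn.1991-05.com.microsoft:hv02` come back clean.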

Dell PowerEdge R715 iSCSI Boot with Server Core 2008 R2

Note: This is written as it comes along and so you’ll get to see the failures and hopefully a wrap up with how to make it actually work.

Alright, so here’s the project: Get Windows Server Core 2008 R2 booting from iSCSI. The why of it is somewhat simple. I want to be able to iSCSI boot so I can have a set of Hyper-V host servers at my primary location completely configured and perfectly happy. Then I will replicate those LUNs to our offsite SAN hardware. When disaster strikes, I can then just configure those servers at the DR location to boot from the replicated LUNs. (in theory)

Yes, there are easier ways to do this for my scenario, like Citrix Essentials for Hyper-V and other SAN replication software which would then allow me to just fail over the setup or configure it as a geo-cluster. But reality is those cost money… and it wasn’t included in the budget for this project. We got the money for the hardware/OS and that’s pretty much it.

Server: Dell PowerEdge R715 – 12 Core AMD – (3) 4 port Broadcom BCM5709C NICs

Storage: Dell EqualLogic SAN Group – 2 PS6000s and 2 PS4000s

(2) Dell PowerConnect 6248 switches.

iSCSI Boot Learning Material:

Broadcom NetXtreme User Guide (Dell) // Original by Broadcom

Dell Instructions to Perform Boot from iSCSI (pages 21-24 & 36-38)

In the BIOS you need to set the boot order to put the Embedded NIC first in the list, followed by DVD then local storage. Second you need to enable the embedded NIC (assuming that’s what you’re using for boot) to allow for iSCSI boot instead of PXE. (UPDATE: SEE BELOW FOR MORE ON THE BIOS CONFIGURATION)
