I had this great idea for laying down a new pair of servers as virtual servers running on some reasonably decent hardware. I’ve been fiddling nonstop for the last three weeks with this beast (ask Danna; she’s ready to disown me). I’ve learned a lot about how Virtual Server works. Enough to know that what I was trying to do would never work.

I wanted a single box that only ran virtual servers. I wanted two servers: one public-facing server and one for my personal files at home. I was going to have them both share the same RAID 5 disk array so I'd never have to worry about failed hard drives again. I even had a pretty diagram:

I had everything set up and the web servers were happily running along. In fact, I had fewer problems with the web servers than I ever had when they were running on real hardware. My home server was another matter. I'm using Windows Home Server (buy this: its PC backups are the best I've seen). Sure, it worked great when I initially set it up. I could even saturate my 100Mbit network copying data to the server, so virtualization didn't hurt it much. Unfortunately, I just couldn't get it to stick around very long. As soon as I really started to load it up with files, it would disappear from the network. I couldn't log into it, and after rebooting it the event logs were filled with all sorts of nasty reports of hard drive corruption. Sometimes the virtual hard drive was so corrupt I couldn't even boot.

What's the trouble? The trouble turns out to be documented in Virtual Server's help guide. Here's what was going on: the guest OS running in the virtual server uses lazy write caching for performance. So does the host OS. If these two caches significantly outpace the speed at which data can be written to the real hard disk, the guest OS may think the disk has timed out. It will ask its controller to reset, which asks the real controller to reset, which further slows down the disk. You can see this steady decline in the Windows event log: I first saw a bunch of timeout errors. Next up were a bunch of delayed-write failure messages, followed by some rather nasty messages about the $MFT table being corrupt and unreadable. The help documentation recommends some ways around this. I tried the software remedies, but short of getting a faster disk subsystem I was out of luck.
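The feedback loop is easier to see as a toy model. Here's a minimal sketch (my own illustration, not anything from Virtual Server's documentation): writes land in the guest's cache, flush through the host's cache to the disk, and if the guest's oldest unflushed write waits too long, the guest declares a timeout and resets the controller, which slows the disk further. The rates, timeout, and reset penalty are all made-up numbers chosen just to show the shape of the problem.

```python
def simulate(write_rate, disk_rate, timeout, reset_penalty, ticks):
    """Toy model of stacked lazy write caches (guest + host) over one disk.

    All units are abstract "blocks per tick". Returns the number of
    guest-observed timeouts; each timeout triggers a controller reset that
    multiplies the effective disk rate by reset_penalty, modeling the
    death spiral described above.
    """
    guest_backlog = 0.0     # unflushed writes sitting in the guest's cache
    host_backlog = 0.0      # writes sitting in the host's cache
    oldest_wait = 0         # ticks the guest cache has been non-empty
    effective_rate = disk_rate
    timeouts = 0
    for _ in range(ticks):
        guest_backlog += write_rate                        # app writes arrive
        moved = min(guest_backlog, effective_rate)         # guest flushes to host
        guest_backlog -= moved
        host_backlog += moved
        host_backlog -= min(host_backlog, effective_rate)  # host flushes to disk
        oldest_wait = 0 if guest_backlog == 0 else oldest_wait + 1
        if oldest_wait > timeout:        # guest thinks the disk timed out
            timeouts += 1
            oldest_wait = 0
            effective_rate *= reset_penalty  # reset makes the disk even slower
    return timeouts
```

With writes slower than the disk (`simulate(5, 8, 5, 0.5, 100)`) there are no timeouts; once writes outpace the disk (`simulate(10, 8, 5, 0.5, 100)`) the backlog grows, the first timeout hits, and every reset makes the next one come sooner, which is roughly what my event log showed.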

The really interesting thing here is that while these errors occurred on one of the virtual hard drives, the corruption made its way into the other drives as well, because they all sat on the same physical RAID volume and the entire volume was slowing down significantly. And that, really, explains how all the web sites I host went down for all of yesterday. I had file table corruption in the web site server, and it took me until about 1:00 AM to get everything checked out and running again. In the end I decided to get rid of the RAID array, run Home Server as the host OS, and keep the much lower-bandwidth web servers virtual.

The diagram isn't as symmetrical, but everything works great: