2008/09/23

Moving RHEL to ESX

We have a cloud application that we purchased for in-house deployment. It's a long story that I won't share, but we have been seeing performance problems with it, and have ended up with a (physical) server in our datacenter that the vendor has configured just like their cloud resources, so we can compare performance on our (virtual) system against the way they run things.

To make sure we capture the most data we can about this system (and to demonstrate whether or not the virtualization layer is causing the slowness), we've been tasked with copying the physical server into a similarly-configured virtual machine.

Unfortunately, VMware Converter apparently can't do Linux. So we had to use the more standard drive-image-transfer toolkit of dd and netcat. But even after the image was transferred, the kernel would crash because it couldn't find the root disk.
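As an aside, the transfer itself is nothing fancy. Here's a rough sketch (the port number, device names, and VM address are made up, and both sides should be booted from rescue media so the disks are quiescent):

# On the destination VM, listen on a port and write the incoming image to the virtual disk:
nc -l -p 7000 | dd of=/dev/sda bs=1M

# On the physical source, read the whole disk and stream it across the network
# (some netcat builds want "nc -l 7000" without the -p):
dd if=/dev/sda bs=1M | nc vm.ip.address 7000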

The crash is to be expected, but only Google knows how to solve it, and only if you know the right keywords. Good screenshots of the process are at http://virtualaleph.blogspot.com/2007/05/virtualize-linux-server-with-vmware.html but note that it's modprobe.conf, not modules.conf, in RHEL5.

So here are the steps I took:

Boot from the (correct) RHEL install CD #1 with "linux rescue". Note that it has to be the correct RHEL version and architecture. Since the appliance was running the 5.2 x64 edition, my 5.1 x86 CD didn't work, and I had to download a different CD1.

Skip the networking config since it won't help, search for RHEL installations, and continue to the root shell (in read-write mode).

chroot /mnt/sysimage and edit /etc/modprobe.conf. Change the eth0 module alias to pcnet32 (you can remove the eth1 alias if you don't have a second NIC in your VM) and change the scsi_hostadapter alias to BusLogic. (Again, you can remove other aliases if you want.)
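For example, the relevant lines of /etc/modprobe.conf might go from something like the first block below to the second (the "before" module names are just placeholders; yours will match whatever hardware the physical box has):

# Before (physical hardware -- actual module names will differ):
alias eth0 bnx2
alias eth1 bnx2
alias scsi_hostadapter megaraid_sas

# After (VMware virtual hardware):
alias eth0 pcnet32
alias scsi_hostadapter BusLogic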

Then copy /boot/initrd-<version>.img to /boot/initrd-<version>-phys.img as a backup, and build a new initrd file with the new drivers: mkinitrd -v -f /boot/initrd-<version>.img <version>
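Concretely, that looks something like the following (assuming the stock RHEL 5.2 kernel, 2.6.18-92.el5; substitute the version from your existing initrd filename):

cd /boot
cp initrd-2.6.18-92.el5.img initrd-2.6.18-92.el5-phys.img
mkinitrd -v -f /boot/initrd-2.6.18-92.el5.img 2.6.18-92.el5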

If that works, you should be able to boot the VM and have it come up cleanly.

--Joe

2008/09/17

Reflections on x4500+ZFS+NFS+ESX

I was asked about my thoughts on running ESX over NFS to a ZFS backend. For posterity, here they are:

x4500+ZFS+NFS+ESX is a quite functional stack. There are a few gotchas that I've run into:

First, the ESX "storage delegate" functionality doesn't. This is supposed to change the EUID that the ESX server sends with its writes. Well, it does for most of the requests, but not for things like creating the VM's swap file. So you pretty much have to export your NFS shares with root=vmkernel.ip.address.

We have many ESX servers, so keeping the sharenfs= parameters up to date got unwieldy. I ended up putting them in a text file in the NFS share for easy editing; when I have to add or change an ESX server, I edit the file and run zfs set `cat zfs.shareprops` /pool/path/to/share.
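As a sketch (the dataset name and addresses here are made up), the properties file and the command look like this:

# /tank/esx/vms/zfs.shareprops -- one line, edited whenever an ESX host is added or changed:
sharenfs=rw=@10.10.0.0/24,root=10.10.0.21:10.10.0.22:10.10.0.23

# Re-apply the share options after editing the file:
zfs set `cat /tank/esx/vms/zfs.shareprops` tank/esx/vms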

NFS is much better than iSCSI. At least in the version I tested iSCSI with, all of the ZFS volumes presented from OpenSolaris were recognized by ESX as being the same disk. This meant that I had a dozen paths to the same VMFS datastore, some 100GB, some 500GB, etc. This Was Bad. NFS made it better.

NFS also gives you a couple of other benefits. On NFS datastores, the vmdk files are thin-provisioned by default: if you give your VM a 5TB vmdk and don't use more than 10GB of it, it only takes up 10GB of capacity on the physical disks (easy to verify with ls and du, as shown below). NFS traffic is also much better understood by troubleshooting tools (Wireshark), so it's easier to find problems like the storage delegate issue above. And it's a first-class citizen from Sun: NFS serving has been in Solaris since 1994, and isn't broken by the latest Nevada builds. Sun takes NFS seriously.
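A quick way to see the thin provisioning for yourself (the dataset and VM names here are hypothetical) is to compare a vmdk's logical size with the space it actually consumes on the pool:

# Logical size of the virtual disk, as presented to ESX:
ls -lh /tank/esx/vms/somevm/somevm-flat.vmdk

# Blocks actually allocated on the ZFS side:
du -h /tank/esx/vms/somevm/somevm-flat.vmdk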

The downside of NFS is that ESX makes all of its requests O_SYNC. This is good for ESX but bad for ZFS. Your NVRAM cards should help a lot. I ended up with a different solution: the business agreed that these are not Tier-1 VMs and they're not on Tier-1 storage, so I've turned off all ZFS sync guarantees in /etc/system:


* zil_disable turns off all synchronous writes to ZFS filesystems. Any FSYNC,
* O_SYNC, O_DSYNC, or sync NFS requests are serviced and reported as completed
* as soon as they've been transferred to main memory, without waiting for
* them to be on stable storage. THIS BREAKS THE SAFETY SEMANTICS AND CAN
* CAUSE DATA LOSS! (clients have moved on thinking the data was safely written
* but it wasn't)
* However, in our case, we can afford to lose this data. For DEV/Test systems
* rollback to the latest (hourly) snapshot is considered acceptable.
set zfs:zil_disable=1


As the comment says, this would be a bad thing. But I know that the vmdk files are crash-consistent every hour, and that's OK with the users. If they lose an hour of work, it's annoying but worth the cheaper storage.
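The hourly snapshots themselves are nothing fancy. A cron entry along these lines (dataset name hypothetical, and you'd want a matching job to prune old snapshots) does the job:

# Root crontab on the x4500: snapshot the VM filesystem at the top of every hour
0 * * * * /usr/sbin/zfs snapshot tank/esx/vms@hourly-`date +\%Y\%m\%d\%H`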

Finally, and most importantly:

MAKE SURE YOUR POOL IS CONFIGURED FOR YOUR WORKLOAD. VMs are effectively a random-read and random-write workload. There is no sequential access of the vmdk files except when you're cloning a VM. So you have to understand the read and write characteristics of your ZFS pool. RAID-Z and RAID-Z2 always read and write a full RAID stripe every time, which means ZFS has to read from all of the disks in the pool to return a single byte of data to the ESX host. Mirrored pools, on the other hand, read from a single disk, and if the checksum is correct, pass the data back to the ESX host. So in my case, I can have 44 simultaneous read requests from the ESX servers being serviced at the same time (44 disks in the pool) and/or 22 simultaneous writes (each write goes to two disks). Basically, RAID-Z[2] is bad for random workloads, but mirroring is expensive.
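For reference, a mirrored layout on the x4500 is just a long list of two-disk mirror vdevs. A heavily abbreviated version of the pool creation (device names are hypothetical) looks like:

# Each vdev is a two-way mirror: reads can be serviced by either side, and a
# write only touches the two disks in one pair. On the x4500 this continues
# for 22 pairs (44 data disks); only the first few are shown here.
zpool create tank \
  mirror c0t0d0 c1t0d0 \
  mirror c0t1d0 c1t1d0 \
  mirror c0t2d0 c1t2d0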

With this in mind, performance on the Thumper is excellent. We can easily saturate the onboard 1Gbps network link with NFS traffic; I've got link aggregation set up and can easily saturate the combined 2Gbps link as well (the aggregation setup is sketched below). I haven't seen what happens with 4 uplinks, but I'd expect that the network will still be the slowest part of the chain. Doing basic I/O benchmarks on the Thumper, I can get 1GBps out of the disks. Yes, that's 1GB per second.
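For completeness, the link aggregation on the Solaris side is just a dladm aggregation of the onboard gigabit ports (the interface names and address here are guesses, and the LACP settings have to match your switch):

# Create aggregation 1 from two onboard e1000g ports, then plumb it:
dladm create-aggr -d e1000g0 -d e1000g1 1
ifconfig aggr1 plumb 10.10.0.10 netmask 255.255.255.0 up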