2008/09/23

Moving RHEL to ESX

We have a cloud application that we purchased for in-house deployment. It's a long story that I won't share, but we have been seeing performance problems with it, and we've ended up with a (physical) server in our datacenter that the vendor configured just like their cloud resources, so we can compare performance on our (virtual) system against the way they run things.

To make sure we capture as much data as we can about this system (and to demonstrate whether or not it's the virtualization layer causing the slowness), we've been tasked with copying the physical server into a similarly-configured virtual machine.

Unfortunately, VMware Converter apparently can't do Linux. So we had to use the more standard drive-image-transfer toolkit of dd and netcat. But even after the image was transferred, the kernel would crash because it couldn't find the root disk.
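For reference, the transfer was the usual netcat-listener-plus-dd arrangement; a rough sketch, where the device names, port, and the VM's IP address are placeholders (and depending on your netcat flavor, the listener may be nc -l 9000 instead):

# On the target VM, booted from rescue media, listen and write the image:
nc -l -p 9000 | dd of=/dev/sda bs=1M

# On the physical server, stream the source disk across:
dd if=/dev/sda bs=1M | nc 10.0.0.5 9000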

This is to be expected, but only Google knows how to solve it, and only if you know the right keywords. Good screenshots of the process are at http://virtualaleph.blogspot.com/2007/05/virtualize-linux-server-with-vmware.html, but note that in RHEL5 it's modprobe.conf, not modules.conf.

So here are the steps I took:

Boot from the (correct) RHEL install CD #1 with "linux rescue". Note that it has to be the correct RHEL version and architecture: since the appliance was running the 5.2 x64 edition, my 5.1 x86 CD didn't work, and I had to download a different CD1.

Skip the networking config since it won't help, let it search for RHEL installations, and continue on to the root shell (with the installation mounted read-write).

chroot /mnt/sysimage and edit /etc/modprobe.conf. Change the eth0 module alias to pcnet32 (you can remove the eth1 alias if you don't have a second NIC in your VM) and change the scsi_hostadapter alias to BusLogic. (Again, you can remove the other aliases if you want.)
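For a single-NIC, single-controller VM, the edited /etc/modprobe.conf ends up looking something like this:

alias eth0 pcnet32
alias scsi_hostadapter BusLogic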

Then copy /boot/initrd-<version>.img to initrd-<version>-phys.img as a backup, and build a new initrd with the new drivers in it: mkinitrd -v -f /boot/initrd-<version>.img <version>
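With a hypothetical 2.6.18-92.el5 kernel (substitute whatever uname -r reports inside the chroot), that works out to:

cp /boot/initrd-2.6.18-92.el5.img /boot/initrd-2.6.18-92.el5-phys.img
mkinitrd -v -f /boot/initrd-2.6.18-92.el5.img 2.6.18-92.el5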

If that works, you should be able to boot the VM and have it come up cleanly.

--Joe

2008/09/17

Reflections on x4500+ZFS+NFS+ESX

I was asked about my thoughts on running ESX over NFS to a ZFS backend. For posterity, here they are:

x4500+ZFS+NFS+ESX is a quite functional stack. There are a few gotchas that I've run into:

First, the ESX "storage delegate" functionality doesn't work. It's supposed to change the EUID that the ESX server sends with its writes. Well, it does for most requests, but not for things like creating the VM's swap file. So you pretty much have to export your NFS shares with root=vmkernel.ip.address.
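In sharenfs terms that looks something like the following (the dataset name and addresses are made up; the rw= and root= lists are colon-separated):

zfs set sharenfs='rw=10.0.0.11:10.0.0.12,root=10.0.0.11:10.0.0.12' tank/nfs/esx01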

We have many ESX servers, so keeping the sharenfs= parameters up to date got unwieldy. I ended up putting them in a text file in the NFS share for easy editing; when I have to add or change an ESX server, I edit the file and run zfs set `cat zfs.shareprops` /pool/path/to/share
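The file just holds the whole property assignment on one line, so adding an ESX server is a quick edit plus the zfs set above. With made-up addresses, zfs.shareprops contains something like:

sharenfs=rw=10.0.0.11:10.0.0.12:10.0.0.13,root=10.0.0.11:10.0.0.12:10.0.0.13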

NFS is much better than iSCSI. At least in the version I did my iSCSI testing on, all of the ZFS volumes presented from OpenSolaris were recognized by ESX as being the same disk. This meant I had a dozen paths to the same vmfs datastore, some 100GB, some 500GB, etc. This Was Bad. NFS made it better.

NFS also gives you a couple of other benefits. On NFS datastores, the vmdk files are thin-provisioned by default, which means that if you give your VM a 5TB vmdk and don't use more than 10GB of it, it only takes up 10GB of capacity on the physical disks. NFS is also much better understood by troubleshooting tools (wireshark), so it's easier to find problems like the storage delegate issue above. And it's a first-class citizen from Sun: NFS serving has been in Solaris since 1994, it isn't broken by the latest Nevada builds, and Sun takes NFS seriously.

The downside of NFS is that ESX makes all of its requests O_SYNC. This is good for ESX but bad for ZFS performance. Your nvram cards should help a lot. I ended up with a different solution: the business agreed that these are not Tier-1 VMs, and they're not on Tier-1 storage, so I've turned off all ZFS sync guarantees in /etc/system:


* zil_disable turns off all synchronous writes to ZFS filesystems. Any FSYNC,
* O_SYNC, D_SYNC, or sync NFS requests are serviced and reported complete
* as soon as they've been transferred to main memory, without waiting for
* them to be on stable storage. THIS BREAKS THE SAFETY SEMANTICS AND CAN
* CAUSE DATA LOSS! (clients have moved on thinking the data was safely written
* but it wasn't)
* However, in our case, we can afford to lose this data. For DEV/Test systems,
* rollback to the latest (hourly) snapshot is considered acceptable.
set zfs:zil_disable=1


As the comment says, this would be a bad thing in general. But I know that the vmdk files are crash-consistent every hour, and that's OK with the users: if they lose an hour of work, it's annoying, but worth the cheaper storage.

Finally, and most importantly:

MAKE SURE YOUR POOL IS CONFIGURED FOR YOUR WORKLOAD. VMs are effectively a random-read and random-write workload; there is no sequential access of the vmdk files except when you're cloning a VM. So you have to understand the read and write characteristics of your ZFS pool. RAID-Z and RAID-Z2 read and write a full RAID stripe every time, which means the pool has to read from every disk in the stripe to return even a single byte of data to the ESX host. Mirrored pools, on the other hand, read from a single disk, and if the checksum is correct, pass the data back to the ESX host. So in my case, I can have 44 simultaneous read requests from the ESX servers being serviced at the same time (44 disks in the pool) and/or 22 simultaneous writes (each write goes to two disks). Basically, RAID-Z[2] is bad for random workloads, but mirroring is expensive.
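As a sketch of what that means at pool-creation time (device names are illustrative, and on a real x4500 you'd pair disks across controllers):

zpool create tank \
  mirror c0t0d0 c1t0d0 \
  mirror c0t1d0 c1t1d0 \
  mirror c0t2d0 c1t2d0
# ...and so on, until you have 22 mirror pairs across the 44 data disks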

With this in mind, performance on the thumper is excellent. We can easily saturate the onboard 1Gbps network link with NFS traffic, and with link aggregation I can easily saturate the combined 2Gbps link. I haven't seen what happens with 4 uplinks, but I'd expect the network to still be the slowest part of the chain. Doing basic I/O benchmarks on the thumper, I can get 1GBps out of the disks. Yes, that's 1GB per second.
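For the curious, the aggregation was set up with the plain, pre-Crossbow dladm syntax, roughly like this (the e1000g interface names are the x4500's onboard ports; the key and address are placeholders, so check against your own build):

dladm create-aggr -d e1000g0 -d e1000g1 1
ifconfig aggr1 plumb 10.0.0.20 netmask 255.255.255.0 up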

2008/07/28

How to grow an iSCSI-presented zvol in 3 easy steps

Well, ok, it's not quite 3 easy steps.

A couple of things don't work. First: iscsitadm modify target -z <newsize> <target>. This only works if the iscsi target's backing store is a regular file, which in the case of a zvol, it is not.

The easy bit: Make the zvol bigger:
zfs set volsize=200G tank/iscsi/thevol

Now we have to hack around in the iscsi parameters file. Locate the /etc/iscsi/tgt/<target-name>/params.<lun#> file that corresponds to the right target and LUN, and change the <size> parameter to the new size of the bigger volume, in hex, in 512-byte blocks. Or in other words,
zfs get -Hp volsize tank/iscsi/thevol | perl -lane 'printf("%llx", $F[2]/512)'


Once that's done, apparently you have to bounce the iscsitgtd to get it to reread the params file.
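On builds where the target daemon runs under SMF, that's roughly the following (the FMRI is from memory, so confirm it with svcs first):

svcs "*iscsitgt*"
svcadm restart system/iscsitgt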

Then on to the initiator...

The second thing that doesn't work: running format against c3tAREALLYLONGSTRINGOFDIGITSFORTHEDISKGUIDd0s0 and changing the partitions there, since I'm using EFI labels and the filesystem is mounted; it says very strongly
partition> label
Unable to get current partition map.
Cannot label disk while it has mounted partitions.


So I have to go in the other way. While I'm in format, I print out the current partition table and make note of the Last Sector for each slice. I also run prtvtoc against the disk to get any other useful bits.
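Saving a copy of the current label somewhere safe doesn't hurt either:

prtvtoc /dev/rdsk/c3tAREALLYLONGSTRINGOFDIGITSFORTHEDISKGUIDd0s0 > /tmp/vtoc.before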

Then I can make the actual partition changes with fmthard:
fmthard -s - /dev/rdsk/c3tAREALLYLONGSTRINGOFDIGITSFORTHEDISKGUIDd0s0

At first, just copy in the line(s) for the slices you already have, but move slice 8 to the end of the disk:
*                            First       Sector      Last
* Partition  Tag  Flags      Sector       Count      Sector    Mount Directory
        0     2    00            34   251641754   251641787   /zones/mars/data
        8    11    00     419413982       16384   419430365


Then (check it in format to make sure the disk is still healthy) change the Last Sector and Sector Count for the real partition: the new Last Sector is s8's First Sector minus 1, and the new Sector Count is s8's First Sector minus 34 (the slice's own First Sector).
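Using the table above as a worked example: s8's First Sector is 419413982, so slice 0's new Last Sector is 419413981 and its new Sector Count is 419413948 (419413982 minus 34). The lines fed back through fmthard -s - then become:

0 2 00 34 419413948 419413981
8 11 00 419413982 16384 419430365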

Then it's a simple growfs -M /zones/mars/data /dev/rdsk/c3tAREALLYLONGSTRINGOFDIGITSFORTHEDISKGUIDd0s0

--Joe

2008/06/26

Goodbye to a problem indicator

We have a new Sun T5220 with the new Niagara-II chip in it. I don't know if it's 8-core/8-thread or 4-core/16-thread but it shows up in Solaris as 16 CPUs.

I'm running a CPU-intensive process (bzip) and prstat is only showing it as using 1.2% of the CPU. On another, older system, the same bzip job takes up 25% of the 4-proc box.

On the old system, it's clear that bzip is a single-CPU bottleneck (because it's single-threaded). On the new one, a process running flat-out but only showing 1% doesn't look like much of anything, so that old rule-of-thumb indicator is gone.
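One way to get that signal back is per-thread microstate accounting, where a pegged thread still shows up near 100% in the USR column no matter how many CPUs the box has (assuming the process is actually named bzip2):

prstat -mL -p `pgrep -d, bzip2`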

--Joe

2008/06/22

Forcing kernel dump on Opensolaris x86

Just in case you were caught off guard by the documentation that points to using "rip::c" in mdb: that apparently doesn't work on OpenSolaris, at least not on build 70, which is what we have installed on our x4500s.

But on the bright side, I found a mention on the OpenSolaris wikia FAQ (http://opensolaris.wikia.com/wiki/Miscellaneous_FAQ) that says I should use $<systemdump instead. And that works.
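For the record, the sequence is roughly this; it drops the live system into kmdb and then panics it, so only run it on a box you actually intend to crash (the [0]> is the kmdb prompt, not something you type):

mdb -K
[0]> $<systemdump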