
2013/09/09

Greenplum DCA and my roll-my-own ETL host

I'm trying to get my new Dell server with its 10GigE network cards to talk to the back-end switch of my Greenplum DCA.
The first hurdle: Brocade doesn't seem to understand the difference between a support matrix that says "using non-Brocade cables is not supported" and a software feature that checks whether the inserted standards-compliant cable was manufactured by Brocade (vs. a standards-compliant cable made/sold by Dell) and, if not, disables the port. The Dell sales tool didn't point out this incompatibility either. Other than that, I'm in good shape.
Once there's a link at the SFP+ layer, however, the Greenplum switches are not set up for ETL work out of the box... And of course, since these back-end switches are not connected to the "real" network, I have to ssh-tunnel to get to the Switch Admin web tools.
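A local port-forward does the trick; the host and switch names here are hypothetical, with the DCA master standing in as the box that can see both networks:

# forward a local port to the switch's web UI, via the master host
ssh -L 8443:back-end-switch:443 gpadmin@mdw
# then browse to https://localhost:8443/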
The unused ports on the switches are preconfigured as link-aggregation members, and so won't come up without even more of these cables. So first, I have to take them out of the CEE LAG groups (after disabling the port via Port Administration): Switch Administration -> CEE -> Link Aggregation, edit LAG Group 2, and take out Te 0/18.
Then back over to Port Admin to change the port to L2 Access mode, and we can enable it.
And finally, back over to Switch Administration -> CEE -> VLAN, edit VLAN 199, and add the Te 0/18 interface to the VLAN. (The CLI equivalent is sketched below.)
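If you'd rather skip the web UI, I believe the CLI equivalent on these Brocade CEE switches looks roughly like this. The syntax is from memory and unverified against this FOS release, so treat it as a sketch:

configure terminal
interface TenGigabitEthernet 0/18
 no channel-group
 switchport
 switchport mode access
 switchport access vlan 199
 no shutdown
end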
And we have packets moving.
Testing with "gpssh -f hostfile ping -c 3 etl1-1" and "gpssh -f hostfile ping -c 3 etl1-2" confirms connectivity from all the DCA hosts.
--Joe

2012/04/13

OpenSSL to Java keystores

I've been creating SSL configurations for various groups in the company, and since I like the standard command line, I've been doing it via OpenSSL. However, some groups use Java-based SSL servers that need their .key and .cert in the Java Keystore format.

So, to get the whole instruction set together in one place:

openssl genrsa -out servername.key 2048
# Generate the signing request. (Note: no -x509 here; adding -x509 would
# produce a self-signed certificate instead of a CSR.)
openssl req -new -key servername.key -out servername.csr
#
# Send off the CSR to get it signed, and pull down the intermediate CA certificates that our internal authority uses to sign.
#
# Bundle the signed cert, the intermediate chain, and the private key into PKCS#12.
openssl pkcs12 -export -in servername.cert -certfile intermediate.cert -inkey servername.key > servername.p12
# Give it a password at least 6 characters long so that Java doesn't complain.
keytool -importkeystore -srckeystore servername.p12 -destkeystore servername.jks -srcstoretype pkcs12
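To double-check what ended up in the keystore, keytool can list it (you'll be prompted for the password you just set):

keytool -list -v -keystore servername.jks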

2012/02/17

Yet another annoyance

I tend to keep a lot of stuff on my hard drive. Modern drives are big, and modern filesystems don't suffer from the long, fragmented free lists that made the old advice to "keep the disks less than 90% full" smart. I defrag occasionally, and (at least on my laptop) I use a high-speed SD card configured for ReadyBoost to soften application-launch-induced disk seeks.

So for several months I've been getting popups (no, not malware) reporting that I'm running out of disk space. These are official-looking Windows "Warning Event Notification" popups, reporting that "disk free space has fallen below the configured threshold." They're annoying: they display in the center of the screen (even when the machine is locked or logged off) and steal focus from my work.

It turns out this particular message is caused by the Dell OpenManage Client utility that the company uses to set the BIOS password for the system, and it's controlled by a registry key: HKLM\SOFTWARE\Dell\OpenManage\Client\SysInfo\HDDThresholdValue. I set it to 0 to get rid of the messages entirely.
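From an elevated command prompt, that's one reg command. I'm assuming the value is a REG_DWORD, so check the existing type with reg query first:

reg query "HKLM\SOFTWARE\Dell\OpenManage\Client\SysInfo" /v HDDThresholdValue
reg add "HKLM\SOFTWARE\Dell\OpenManage\Client\SysInfo" /v HDDThresholdValue /t REG_DWORD /d 0 /f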

--Joe

2011/08/24

Listening ports

One of our many applications wouldn't start, with an obscure message that had nothing to do with the underlying problem (nsrexecd "Cannot start portmapper", to be specific, and to make sure this is googleable for the next person).

It turns out that another process had been randomly assigned one of the ports Networker listens on, as the local port of an outgoing TCP connection. Which, of course, meant that Networker couldn't bind that port to LISTEN. This is the first time this has happened to us, but it's a potential time bomb for any service that listens on specific ports, such as Oracle, Weblogic, SAP, etc.

Linux controls which ports are randomly assigned using two sysctls: ip_local_port_range and ip_local_reserved_ports. Unfortunately, the Oracle installer prerequisite check requires that ip_local_port_range be set wrong (1024-65500, which includes Oracle's own listener port), so we have to work with the other one, ip_local_reserved_ports. It's a comma-separated list of ranges, so I picked generous ranges for our big three applications: Oracle (1520-1530), SAP (3200-3699), and Networker (7937-8065).

sysctl net.ipv4.ip_local_reserved_ports=1520-1530,3200-3699,7937-8065
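To make the reservation stick across reboots, the same setting goes in /etc/sysctl.conf:

# /etc/sysctl.conf
net.ipv4.ip_local_reserved_ports = 1520-1530,3200-3699,7937-8065

# verify it took:
sysctl net.ipv4.ip_local_reserved_ports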


--Joe

2011/06/13

Link aggregation in a cross-platform environment

Everybody in the world knows that LACP (802.3ad) is the standard for Link Aggregation and Control, right? Well, not exactly.

We have VMware ESX and Solaris servers connected to our Cisco edge switches. Sounds good, right? We'd like to bond the multiple gig-E NICs into a multi-gigabit aggregate. Sounds good, right? Well, it's not so easy.

ESX doesn't support true 802.3ad aggregation. They fake it with their vSwitch NIC-teaming properties: the same load-balancing as an L3-hash LACP setup (hash of the source and destination IPs), without calling it that. Fortunately, they use the same hash algorithm as Cisco, so we can work with it.

On the Cisco side, we add the interfaces to a channel-group with mode "on" (static, no LACP negotiation). This uses the switch-wide default port-channel load-balance setting, which we had to set to src-dst-ip, as shown below.
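For reference, the IOS configuration amounts to a few lines (interface and group numbers here are made up):

! global: hash on source+destination IP, to match the ESX teaming policy
port-channel load-balance src-dst-ip
!
interface GigabitEthernet0/1
 channel-group 10 mode on
interface GigabitEthernet0/2
 channel-group 10 mode on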

Unfortunately, since that setting is a global switch option rather than per port-channel, our Solaris boxes (which speak LACP properly) can't use Layer-4 balancing (hash of source and destination IPs and ports). This sucks, because the Solaris boxes are the heavy network hitters (backup servers) that could really use the extra bandwidth from spreading their many TCP connections across multiple links.

I'm not sure who to blame here: VMware for not doing LACP, or Cisco for not allowing different load-balancing methods on different port-channel groups.

--Joe

2011/05/31

Oh yeah, the rest of EMC World

The last days of EMC World were fairly uneventful. I was called in on a couple of work problems, which made it hard to concentrate on the talks. But from what I could tell, they were all either high-level "cloud is king" pitches or very introductory sessions, so I didn't really get much out of them.

I did have a nice seafood dinner at the Rio after the conference closed out, and a quite forgettable plane ride home.

Now, back to the real world.

--Joe

Cleaning up View Composer VMs

We've had frequent issues where our VMware View desktops get into a state of "Provisioning Error (missing)", with a popup box saying that a "Virtual Machine with Input Specification already exists".

This symptom is described pretty well in http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1008658, but here's some more info:

At least in the version of Composer 4.5 that I'm running, the sviconfig command doesn't know the RemoveSviClone operation that the KB references. So it's the manual way for me.

This seems to happen when the Composer database gets out of sync with the ADAM database that View uses (can we please pick ONE database?).

This weekend's problems came when the Oracle DB that supports our VirtualCenter, View Composer, and Update Manager environments had a corrupted file. I had to roll back to a previous Oracle state, which naturally meant that it wasn't quite the same as ADAM.

The manual cleanup (besides being MSSQL-specific in its table names and interface references) requires a significant amount of C&P to run through in SQL*Plus. So I declared an Oracle procedure that, given a VM name, cleans up the data automatically:


create or replace procedure cleanup_clone
  ( p_vmname in varchar )
as
begin
  -- remove the VM and computer name reservations
  delete from SVI_VM_NAME where NAME = p_vmname;
  delete from SVI_COMPUTER_NAME where NAME = p_vmname;
  -- remove the disk records that hang off the clone entry
  delete from SVI_SC_PDISK_INFO where PARENT_ID in
    (select id from SVI_SIM_CLONE where VM_NAME = p_vmname);
  delete from SVI_SC_BASE_DISK_KEYS where PARENT_ID in
    (select id from SVI_SIM_CLONE where VM_NAME = p_vmname);
  -- and finally the clone entry itself
  delete from SVI_SIM_CLONE where VM_NAME = p_vmname;

  commit;

end cleanup_clone;


With this in place, I can "execute cleanup_clone('uscimposer-99');" at the SQL*Plus prompt (having logged in as the Composer user) and it nicely wipes out the input specification for that VM, so a new one can be provisioned. The only other manual step, then, is to remove the provisioning-errored VM from the View Admin interface.

--Joe

2011/05/11

EMC World 2011

I've made it through the 2nd day of EMC World, and am starting on the third. Tuesday brought some interesting talks on Networker and enterprise-app performance tuning (specifically MSSQL).

But the driving theme of the conference has me a bit confused. "IT As A Service" sounds great, and we keep hearing about how ITAAS can deliver benefits through standardization (aka the service catalog).

At least in my experience, though, there's a problem: the service catalog is never "good". That is to say, it's either incomplete (sorry, we don't have MySQL in the catalog), or overly restrictive (pick a different DB platform for your LAMP app), or forces the business into shadow-IT operations (run your own d*** database). And when tool selection is business-driven, that's a problem.

The service catalog as I see it will cover maybe 90% of the requirements, and every process/function will need a slightly different 10%. In order to deliver to those processes, ITAAS has to deal with the one-off "oh yeah, MySQL had to be installed in /usr/local instead of the standard /apps/mysql-version to make this OOTB app work" gotchas that plague sysadmins.

And, of course, technology moves ahead faster than the service book. In particular, marketing to business decision-makers moves a helluva lot faster. Think about iPhone/tablet/Android adoption: IT has had to completely rethink what kind of device a user will be coming from. It's not a corporate-owned laptop running an image-deployed copy of Windows XP with IE 6; it's now the iPad the CEO bought for his daughter.

So how does ITAAS respond to these shifting sands? That's the brazilian-dollar question. Do we chase the business's tail and add too many poorly-supported products to our service catalog? Do we lock the business into the properly-blessed old way of doing things, and out of the innovation that drives us?

--Joe

2011/05/09

EMC World 2011

I'm here at EMC World 2011, taking advantage of their "Bloggers Lounge", where they have better WiFi and more comfortable chairs.

So far, the conference is unremarkable. The first keynote could be summarized as "Cloud, blah, blah, lots of data, blah, new products, blah, blah, blah." Nothing particularly groundbreaking.

But still, this being the first travel I've been sent on in almost 5 years, I'm looking forward to the rest of it. Lots of topics that can help my quest for Infrastructure Strategy.

2010/11/08

Building stand-alone Collectd plugin - Part 2

Actually, this wasn't as difficult as I was expecting.

The only real challenge was in the fact that the ESX Guest SDK libraries are only distributed as a shared object, which means my plugin had to dlopen() the required library, rather than being able to link it in statically. Luckily, I was able to cannibalize some of the example guest SDK code for this.

Here's the basic idea:
// Function pointer for "how much CPU time have we gotten?"
VMGuestLibError (*GuestLib_GetCpuUsedMs)(VMGuestLibHandle handle, uint64 *cpuUsedMs);

// In the plugin_init function, I dlopen("libvmGuestLib.so") and resolve the symbol:
GuestLib_GetCpuUsedMs = dlsym(dlHandle, "VMGuestLib_GetCpuUsedMs");
// ...and open the GuestLib handle. Then, in each plugin_read loop, I call
// GuestLib_UpdateInfo(glHandle); to refresh the stats, and can pull the latest data:
glError = GuestLib_GetCpuUsedMs(glHandle, &CpuUsedMs);
values[0].counter = CpuUsedMs;
// ...and finish with plugin_dispatch_values().


Then I copied the resulting .libs/esx_guest* and the libvmGuestLib.so into the ~collectd/lib/collectd/ directory (where it looks for plugin SOs) and fired it up. I also had to add entries to the types.db for my data sources.
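The types.db additions are one line per type; the format is <type> <ds-name>:<ds-type>:<min>:<max>, and mine looked something like this (the type names are my own, not stock collectd ones):

esx_cpu_used_ms    value:COUNTER:0:U
esx_mem_ballooned  value:GAUGE:0:U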

From here, I get cool graphs for my Linux VMs like these.


I'll put this code up on my personal site when I get a chance, and contribute this documentation to the collectd project.

--Joe

2010/11/03

Building stand-alone Collectd plugin

I'm working on building a plugin for the collectd data collection system (www.collectd.org) that will gather stats on our ESX RedHat VMs through the VMware Guest API.

Here's what I've found so far...

Unpack collectd, configure, and make it.
In a separate directory, create the source files for the new plugin.

Minimal plugin code:

#include "collectd.h"
#include "common.h"
#include "plugin.h"

static int my_read(void) {
    value_t values[1];
    value_list_t vl = VALUE_LIST_INIT;

    values[0].counter = 0;

    vl.values = values;
    vl.values_len = 1;
    sstrncpy (vl.host, hostname_g, sizeof (vl.host));
    sstrncpy (vl.plugin, "test1", sizeof (vl.plugin));
    sstrncpy (vl.type, "counter", sizeof (vl.type));

    plugin_dispatch_values (&vl);

    return 0;
}

void module_register(void) {
    plugin_register_read ("test1", my_read);
}


Then I use libtool to build the .o:
$ libtool --mode=compile gcc -DHAVE_CONFIG_H -I ../collectd-4.10.1/src -Wall -Werror -g -O2 -MT test1.lo -MD -MP -MF test1.Tpo -c -o test1.lo test1.c
and link it:
$ libtool --tag=CC --mode=link gcc -Wall -Werror -g -O2 -module -avoid-version -o test1.la -rpath /apps/collectd-4.10.1/lib/collectd -lpthread -ldl test1.lo

This generates the files in .libs/test1.*, which I copy into the $prefix/lib/collectd/ directory and enable in my config.
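Enabling it is a single line in collectd.conf (path per the prefix above):

# $prefix/etc/collectd.conf
LoadPlugin test1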

So much for part 1... Up next, getting actual data.

--Joe

2010/07/09

Windows NLB clustering and SIDs

A colleague has been working to set up a MS NLB cluster for a set of .NET machines. As is our standard practice, these are created as ESX VMs, and for convenience, we create them based on our standard template. Then the VMware guest customization process runs, and we have a VM we can turn over to the application team.

The problem in this case (the reason I'm involved in a Windows issue) is that NLB wasn't starting. There were various false-starts with configuration items randomly disappearing (why is only one of the NICs in the selection box on this system?).

Somehow, it was suggested that maybe NLB wouldn't start because the SIDs of the two VMs were the same. Of course not, since we run the guest customization, which does a NewSID(1m). That'd be impossible.

But it turns out the SIDs were the same: popping the machines out of the domain and NewSID'ing them resolved the issue. Who'd'a thunk?

On further reflection, the system's SID is probably the best option for a locally-unique identifier to use to map the loadbalancing traffic via NLB. There has to be some way for all the cluster members to agree on who cares about which packets, so why not use the SID as part of the hash function? Makes perfect sense, since SIDs are, of course, unique.

--Joe

2010/03/29

Is today the last $weekday of the month?

Classic sysadmin problem: how to run a script on the last Sunday (or whatever day) of the month. This snippet is correct for all days in years between 1904 and 2096, inclusive, even leap years. It messes up on February 22 in century years, except when the year is divisible by 400 (works in 1600, 2000, 2400, etc., but breaks in 2100, 2200, 2300). But I'll be retired by then...

My answer, in Korn Shell:


# Day of the week to run on:
DOW=Sunday

OLDLC_TIME=$LC_TIME
export LC_TIME=C
# Requires /usr/bin/ksh to run correctly.
# I couldn't be bothered to get the expr quoting and backquoting working in /bin/sh
case `date +%b` in
Jan|Mar|May|Jul|Aug|Oct|Dec) D=31;;
Apr|Jun|Sep|Nov) D=30;;
Feb) D=$(( 28 + ( $(date +%Y) % 4 == 0 ) ));;
# FIXME: Y2.1K bug here.
esac

if [[ `date +%A` == "$DOW" ]] && [[ $(( $(date +%e) + 7 )) -gt $D ]]
# then next $DOW is after the end of this month, so today's the last $DOW of this month
then
....
fi

LC_TIME=$OLDLC_TIME

2009/09/29

Idle curiosity about iLOM

Why does the service processor on our brand new Sun T5240 server have a SPARC 885 processor, and run Linux? Why not (Open)Solaris?

Kinda ironic that Sun boots its latest servers with Linux.

Maybe it's the fact that it has 32MB of flash to work with, and only 128MB of RAM. But that should be enough to run Solaris.

--Joe

2009/08/19

Netapp - Waster of space

We have a Netapp that we use to provide Tier-2 LUNs to our SAN. It was price-competitive on raw disk space, but I didn't realize at the time just how much overhead this appliance had.

The obvious overhead is RAID-DP and hot-spare drives. Easily calculated: 1 hot spare per 30 drives of each size, and DP is 2 parity drives per plex. That's 6 wasted drives out of the 28 in two shelves, leaving 22 * 266GB drives usable = 5.7TB.

I'd heard that space is reserved for OS and bad-block overhead (about 10%) so that brings us down to 5.2TB usable.

Well, the web interface shows the aggregate as 4.66TB. So that's 600GB I haven't accounted for. But still, 4.66 TB is a good amount of space.

From the aggregate, we create a flexvol (note that this places 20% by default as inaccessible snap reserve space). On the flexvol, we create LUNs and present them to our servers. And here's where the space consumption is nasty:

By default, if you create a 1TB LUN, OnTAP reserves 1TB of disk blocks in the volume. That's nice, and exactly what I'd expect. Although in practice, we use thin provisioning (lun create -o noreserve) for most of our LUNs.

What I didn't expect going in was that the first time you create a snapshot, OnTAP would reserve ANOTHER 1TB for that LUN. And interestingly enough, that 1TB is never touched until there's no other space in the volume.

OK, that guarantees a full overwrite of the ENTIRE LUN can't fail after you take a snapshot. But it reduces the usable LUN-allocation space to 2.33TB. And if you have multiple snapshots, those don't seem to go into the snap reserve, but rather are in addition to the 2*LUNsize that is already allocated.

So out of a raw disk capacity of (28*266) 7.2 TB (which is quoted as 28*300GB disks = 8.2TB) we get just over 2TB of space that can be used for holding actual system data.

Wow.

Now, there are non-default settings that can change this, but they're only available at the CLI, not the web interface:

snap reserve <volname> 0 - sets the snap reserve from 20% to 0%, which is recommended for volumes that hold only LUNs.
vol options <volname> fractional_reserve <pct> - changes the % of LUN size that is reserved when a LUN snapshot is taken.

It is not entirely clear what happens to a LUN when its delta becomes larger than the fractional_reserve. Some documentation says it may take the LUN offline, but I would hope that only would happen if there's no remaining space in the volume (like what happens with snapshot overflow in traditional NAS usages). But it's not clear.

As far as I can tell, the current best practice is to set the snap reserve to the amount of change you expect in the volume, set the fractional_reserve to the amount of change you expect in the LUN, and set up volume auto-grow and/or snapshot auto-delete to make sure you have free space when things get full (see the sketch below).
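Put together, the setup would look something like this at the CLI; the volume name, sizes, and percentages are placeholders, so check them against your own change rates (and the syntax against your OnTAP release):

snap reserve vol_t2 0
vol options vol_t2 fractional_reserve 20
vol autosize vol_t2 -m 6t -i 100g on
snap autodelete vol_t2 on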

On the gripping hand, the default options make sure that you have to buy a lot of disks to get the storage you need.

--Joe

2009/04/03

Discovering R for performance analysis

I've seen references in various conferences and performance blogs about the "R" statistical analysis package, and how it can be used to data mine system performance data. I'm going to learn it.

Fun.

2009/03/06

Firewall project

A big consumer of my time this week (and last) has been building a pilot implementation of a new internet-facing DMZ. Well, that understates the requirements a bit. Corporate requires a special "reverse proxy" system sitting in the internet-facing parts, so we have to make some major changes anyway, but I wasn't happy with just having a DMZ out there; it needs to be reliable, preferably more reliable than our internet feed. And since we have more than one datacenter, with more than one internet provider, why not take advantage of that?

Basically, the goal is to have a single IP address (for www.dom.ain) that is internet-routed through both datacenter ISPs, with Linux doing some magic so that packets can come or go through either pipe. Apparently there are companies that make such magic happen for lots of $$$, but in this economy, they aren't an option. And since Linux is free (and my time is already paid for), here's a chance to save the company money. That's what I sold to management, anyway.

It should be simple enough: advertise the magic netblock out both pipes, put a Linux router on each link as the gateway for that block, NAT the magic.xxx address of www to the internal IP address of the apache server, and toss out-of-state packets over to the peer so the firewalls between this box and the apache server never see them.

In ascii:

Internet --- Linux ---- FW --+-- LAN --- apache
             ^-v             |
Internet --- Linux ---- FW --+

(The ^-v is the state-sync link between the two Linux routers.)


(We've assumed that the WAN is important enough internally that if it's down, our external site is going to have problems anyway. Which is true, unfortunately. WAN outages between our 2 main datacenters tend to break everything even for local users.)

So far I've gotten 3/4 of the packet-handling working for a single system using just iptables (rules sketched below): a nat PREROUTING DNAT rewrites the magic.xxx address to apache's address, a POSTROUTING MASQUERADE gives apache something routable to return the packets to, and I can see the entries in the /proc/net/ip_conntrack file. Unfortunately, I can't seem to find how nat is supposed to de-masquerade the packets back according to the state that caused them.
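The rules so far amount to something like this; the addresses are from the example below, and the outbound interface name is a placeholder:

# rewrite the public www address to the real web server
iptables -t nat -A PREROUTING -d 192.168.1.13 -j DNAT --to-destination 192.168.6.13
# source-NAT the forwarded traffic so apache has a routable return path through this box
iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE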

I have a packet coming in from 10.0.0.5 (client) -> 192.168.1.13 (www) (the magic block is 192.168.1/24). It leaves my box as 192.168.5.182 (lx-int) -> 192.168.6.13 (www-web0). www-web0 gets the SYN, and sends its SYN+ACK back: 192.168.6.13 -> 192.168.5.182. I see those packets on the wire, and it's what I'd expect.

What I don't see is a way to take that SYN+ACK, look up in the connection tracking table for the original client and rewrite it to be 192.168.1.13 -> 10.0.0.5.

--Joe

2009/02/18

VMware View 3.0 and proxies

Oops, I never blogged the first part of this story. Oh well, maybe later. In brief, we have VMware VDM to satisfy das corporate security. It was working for people on our LAN and on the corporate network, and I got it to work from the internet (while requiring a valid smartcard (SSL user certificate) before letting a user in). This was a cool project I'll have to document here some time.

Well, time moves on and VMware View Manager 3.0 (nee VDM 3.0) was released and implemented in this environment.

The first problem started when a home user upgraded their View client to 3.0, as prompted on the login page; that's when smartcard authentication from the internet stopped working. A little investigation (watching network traffic, decrypting with Wireshark, etc.) showed that while the old client would send an HTTPS POST just like IE, the new client didn't send the user's SSL certificate. But since VMware never supported this sort of setup anyway, I just worked through it (another cool solution I'll have to post later). A little bit of rearchitecture, and I was able to still protect enough of the View environment to make me feel secure and to convince the security people that it was sufficient.

Now I've got a similar error from the corporate network. Same message: "Connection to View server could not be established". But WTF? This is on the LAN; there shouldn't be a proxy problem. IE works just fine*, but View can't connect.

* That is to say, IE worked fine with the proxy, but the proxy requires user authentication, which is cached for the browser session, and I didn't think of that until later.

So I fired up Wireshark again, and once again: the first couple of View CONNECT :443 requests happily sent the Proxy-Authorization: header just like IE does, but the last one tried to CONNECT without that header, and was tossed back a Squid 407 Proxy Authentication Required.

Ah, that's a relatively easy one to fix, if only I could get the proxy admin to turn off authentication (nope, that's verboten), or do the same sort of magic as I did on the outside firewall deployment (eww, that'd be messy), or maybe bypass the proxy entirely? I mean, they're on the LAN. Luckily, VMware apparently thought of this and implemented an undocumented registry key, HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware VDM\ProxyBypass, which contains a MultiSZ list of names or IPs that View should connect to directly instead of using the proxy.
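Setting it from a command line would look roughly like this (the server name is a placeholder; reg.exe separates multiple REG_MULTI_SZ entries with \0):

reg add "HKLM\SOFTWARE\VMware, Inc.\VMware VDM" /v ProxyBypass /t REG_MULTI_SZ /d "view.internal.example.com" /f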

Did I mention that all of this new behavior is undocumented? And that what I'd been doing in the first place was both unsupported and completely WORKING?

I'd guess that the new View client switched from a standard MS HttpRequest method to something they threw together without the nice functionality that IE bundles into its method. Oh well. It's working again now.

--Joe

2008/09/23

Moving RHEL to ESX

We have a cloud application that we purchased for in-house deployment. It's a long story that I won't share, but we have been seeing performance problems with it, and have ended up with a (physical) server in our datacenter that the vendor configured just like their cloud resources, so we can compare performance on our (virtual) system against the way they run things.

To make sure we capture the most data we can about this system (and to demonstrate that it either is or is not the virtualization layer causing slowness) we've been tasked with copying the physical server into a similarly-configured virtual machine.

Unfortunately, VMware Converter apparently can't do Linux, so we had to use the more standard drive-image-transfer toolkit of dd and netcat (sketched below). But even after the image was transferred, the kernel would crash because it couldn't find the root disk.
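For the record, the dd-over-netcat transfer is the usual pipe. Device names and the port are examples, and depending on your netcat build the listener may need -l -p instead of -l:

# on the target VM (booted from rescue media), listen and write the image:
nc -l 1234 | dd of=/dev/sda bs=1M

# on the physical source:
dd if=/dev/sda bs=1M | nc target-vm-ip 1234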

This is to be expected, but only Google knows how to solve it, if you know the right keywords. Good screenshots of the process are at http://virtualaleph.blogspot.com/2007/05/virtualize-linux-server-with-vmware.html, but note that it's modprobe.conf, not modules.conf, in RHEL5.

So here's the steps I took:

Boot from the (correct) RHEL install CD #1 with "linux rescue". Note that it has to be the correct RHEL version and architecture. Since the appliance was running 5.2 x64 edition, my 5.1 x86 cd didn't work, and I had to download a different CD1.

Skip the networking config since it won't help, search for RHEL installations, and continue to the root shell (in read-write mode).

chroot /mnt/sysimage and edit /etc/modprobe.conf: change the eth0 module alias to pcnet32 (you can remove the eth1 alias if you don't have a second NIC in your VM) and change the scsi_hostadapter alias to BusLogic (again, you can remove other aliases if you want). The result is shown below.
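After the edit, the relevant lines of /etc/modprobe.conf should read:

alias eth0 pcnet32
alias scsi_hostadapter BusLogic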

Then copy /boot/initrd-<version>.img to initrd-<version>-phys.img as a backup, and build a new initrd with the new drivers: mkinitrd -v -f /boot/initrd-<version>.img <version>

If that works, you should be able to boot the VM and have it come up cleanly.

--Joe

2008/09/17

Reflections on x4500+ZFS+NFS+ESX

I was asked about my thoughts on running ESX over NFS to a ZFS backend. For posterity, here they are:

x4500+ZFS+NFS+ESX is a quite functional stack. There are a few gotchas that I've run into:

First, the ESX "storage delegate" functionality doesn't work. It's supposed to change the EUID that the ESX server sends with its writes. Well, it does for most of the requests, but not for things like creating the VM's swap file. So you pretty much have to export your NFS shares with root=vmkernel.ip.address.

We have many ESX servers, so keeping the sharenfs= parameters up to date got unwieldy. I ended up putting them in a text file in the NFS share for easy editing, and when I have to add or change an ESX server, I edit the file and run zfs set `cat zfs.shareprops` /pool/path/to/share (sketch below).
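The props file is just the full property assignment on one line; something like this sketch, with hypothetical host names:

# zfs.shareprops -- the complete sharenfs value
sharenfs=rw=esx1:esx2,root=esx1:esx2

# apply after editing:
zfs set `cat zfs.shareprops` pool/vmstore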

NFS is much better than iSCSI. At least in the version I tested iSCSI on, all of the ZFS volumes presented from OpenSolaris were recognized by ESX as being the same disk. This meant that I had a dozen paths to the same VMFS datastore, some 100GB, some 500GB, etc. This Was Bad. NFS made it better.

NFS also gives you a couple of other benefits. On NFS datastores, the vmdk files are thin-provisioned by default; this means that if you give your VM a 5TB vmdk and don't use more than 10GB, it takes up 10GB of capacity on the physical disks. NFS traffic is also much better understood by troubleshooting tools (Wireshark), so it's easier to find problems like the storage-delegate issue above. And it's a first-class citizen from Sun: NFS serving has been in Solaris since 1994, and isn't broken by the latest Nevada builds. Sun takes NFS seriously.

The downside of NFS is that ESX makes all of its requests O_SYNC. This is good for ESX but bad for ZFS. Your NVRAM cards should help a lot. I ended up with a different solution: the business agreed that these are not Tier-1 VMs, and they're not on Tier-1 storage, so I've turned off all ZFS sync guarantees in /etc/system:


* zil_disable turns off all synchronous writes to ZFS filesystems. Any FSYNC,
* O_SYNC, D_SYNC, or sync NFS requests are serviced and reported complete
* as soon as they've been transferred to main memory, without waiting for
* them to be on stable storage. THIS BREAKS THE SAFETY SEMANTICS AND CAN
* CAUSE DATA LOSS! (clients have moved on thinking the data was safely written,
* but it wasn't)
* However, in our case, we can afford to lose this data. For dev/test systems,
* rollback to the latest (hourly) snapshot is considered acceptable.
set zfs:zil_disable=1


As the comment says, this would normally be a bad thing. But I know that the vmdk files are crash-consistent every hour, and that's OK with the users: if they lose an hour of work, it's annoying but worth the cheaper storage.

Finally, and most importantly:

MAKE SURE YOUR POOL IS CONFIGURED FOR YOUR WORKLOAD. VMs are effectively a random-read and random-write workload; there is no sequential access of the vmdk files except when you're cloning a VM. So you have to understand the read and write characteristics of your ZFS pool. RAID-Z and RAID-Z2 always read and write a full RAID stripe, which means reading from all of the disks in the group to return even a single byte of data to the ESX host. A mirrored pool, on the other hand, reads from a single disk and, if the checksum is correct, passes the data back to the ESX host. So in my case, I can have 44 simultaneous read requests from the ESX servers being serviced at the same time (44 disks in the pool) and/or 22 simultaneous writes (each write goes to two disks). Basically, RAID-Z[2] is bad for random workloads, but mirroring is expensive. (Pool layout sketched below.)
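For the curious, building the mirrored pool on a thumper looks something like this, truncated; the controller/target numbers vary by slot, and the real command lists all 22 pairs:

zpool create vmpool \
  mirror c0t0d0 c1t0d0 \
  mirror c0t1d0 c1t1d0 \
  mirror c0t2d0 c1t2d0
# ...and so on, through all 22 mirror pairs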

With this in mind, performance on the thumper is excellent. We can easily saturate the onboard 1Gbps network link with NFS traffic, and with link aggregation I can easily saturate the combined 2Gbps link. I haven't seen what happens with 4 uplinks, but I'd expect the network to still be the slowest part of the chain. Doing basic I/O benchmarks on the thumper, I can get 1GBps out of the disks. Yes, that's 1GB per second.