2009/03/06

Firewall project

A big consumer of my time this week (and last week) has been building a pilot implementation of a new internet-facing DMZ. Well, that's understating the requirements a bit. Corporate requires a special "reverse proxy" system to sit in the internet-facing parts, so we have to make major changes anyway, but I wasn't happy with just having a DMZ out there; it needs to be reliable. Preferably more reliable than our internet feed. We have more than one datacenter, each with its own internet provider, so why not take advantage of that?

Basically, the goal is to have a single IP address (for www.dom.ain) that is internet-routed through both datacenter ISPs, and have Linux do some magic so that packets can come or go through whichever pipe. Apparently, there are companies that make such magic happen for lots of $$$, but in this economy they aren't an option. And since Linux is free (and my time is already paid for), here's a chance to save the company money. That's what I sold to management, anyway.

It should be simple enough: advertise the magic netblock out both pipes, put a Linux router on each link as the gateway for that block, NAT the magic.xxx address of www to the internal IP address of the apache server, and toss out-of-state packets over to the peer router so that the firewalls between this box and the apache server never see them.

In ascii:

Internet --- Linux ---- FW --+-- LAN --- apache
              ^-v            |
Internet --- Linux ---- FW --+


(We've assumed that the WAN is important enough internally that if it's down, our external site is going to have problems anyway. Which is true, unfortunately. WAN outages between our 2 main datacenters tend to break everything even for local users.)

So far I've gotten about 3/4 of the packet-handling working for a single system using just iptables. A nat PREROUTING DNAT rule rewrites the magic.xxx address to apache's address, a POSTROUTING MASQUERADE gives apache something routable to return the packets to, and I can see the entries in /proc/net/ip_conntrack. Unfortunately, I can't figure out how nat is supposed to de-masquerade the reply packets according to the state entry that created them.
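
For the record, the rules boil down to something like this (a sketch from memory; eth0 as the internet-facing interface and eth1 as the firewall-facing one are assumptions, and the addresses are the ones described below):

# rewrite the magic www address to the real apache box behind the firewalls
iptables -t nat -A PREROUTING -i eth0 -d 192.168.1.13 -p tcp --dport 80 \
    -j DNAT --to-destination 192.168.6.13
# masquerade the forwarded packets so apache has a routable address to reply to
iptables -t nat -A POSTROUTING -o eth1 -d 192.168.6.13 -j MASQUERADE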

I have a packet coming in from 10.0.0.5 (client) -> 192.168.1.13 (www); the magic block is 192.168.1.0/24. It leaves my box as 192.168.5.182 (lx-int) -> 192.168.6.13 (www-web0). www-web0 gets the SYN and sends its SYN+ACK back, 192.168.6.13 -> 192.168.5.182. I see those packets on the wire, and it's what I'd expect.

What I don't see is a way to take that SYN+ACK, look up the original client in the connection tracking table, and rewrite the packet to be 192.168.1.13 -> 10.0.0.5.

--Joe

2009/02/19

Photo Archiving

This is in response to BenR's post at http://www.cuddletech.com/blog/pivot/entry.php?id=1016; my reply can't seem to get past his comment-spam filter.

As a fellow father and sys/storage admin, I have similar questions. Have you made the jump to video already? A MiniDV tape at LP quality (90 minutes) -- a little less than DVD quality, but with worse compression -- eats up 15GB of disk space when I dump the AVI stream. Not to mention the gigabytes of SD and CF cards from the camera.

I'm confident in my 3-tier archiving scheme: An active in-the-house full-quality copy on simple disk, a "thumbnail" (screen-resolution or compressed video) version on S3, and two copies of the original format on DVD - one onsite and one offsite.

I expect to have to move off DVD media periodically, but I can put that off until the higher-capacity disk wars play out. Every file on the DVDs is md5sum'd, and I know I can use ddrescue to pull data blocks off either wafer if S3 and my home drive die, assuming a scratch doesn't hit both discs in the same place. It'd be nice to have an automatic system to track which file is on which DVD, but I haven't implemented such an HSM yet.
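
The checksum step is nothing fancy; something like this, run against each disc after burning (the disc label and paths are just examples):

# record checksums for everything on the mounted disc
(cd /media/dvd && find . -type f -exec md5sum {} +) > ~/archive/md5/disc-0042.md5
# later, verify a disc against its stored list
(cd /media/dvd && md5sum -c ~/archive/md5/disc-0042.md5)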

I'm enough of a pack rat to keep a DVD drive and probably a computer that can read it essentially forever, and if not, there's always eBay.

The biggest problem I face is not deleting all of the content from a card (or tape) before popping it back into the camera and adding more. So when I copy a card into the "system", I might already have duplicate copies of some of the pictures. I'd love to be able to deduplicate those and store only one copy (plus links to it). Even better would be a content-aware dedup that could tell that x.jpg is the same picture as Y.raw... (and that song_64kvbr.mp3 can be derived from song.flac)

But I haven't put that together yet, either.
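
If I did, the exact-duplicate half is the easy part: something like the sketch below at least lists files whose bytes already match (the content-aware matching of x.jpg vs Y.raw is the part I don't have an answer for).

# group files by md5; uniq -w32 compares only the 32-character hash column
find /archive -type f -print0 | xargs -0 md5sum | sort \
    | uniq -w32 --all-repeated=separate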

--Joe

2009/02/18

VMware View 3.0 and proxies

Oops, I haven't blogged the first part of this story. Oh well, maybe later. In brief, we have VMware VDM to satisfy das corporate security. It was working for people on our LAN and on the corporate network, and I got it to work from the internet, requiring a valid smartcard (SSL user certificate) before letting a user in. That was a cool project I'll have to document here some time.

Well, time moves on and VMware View Manager 3.0 (nee VDM 3.0) was released and implemented in this environment.

The first problem we noticed started when a home user upgraded their View client to 3.0, as prompted on the login page. That's when the smartcard authentication from the internet stopped working. A little investigation (watching network traffic, decrypting it with Wireshark, etc.) showed that while the old client would send an HTTPS POST just like IE, the new client didn't send the user's SSL certificate. But since VMware never supported this sort of setup anyway, I just worked through it (another cool solution I'll have to post later). A little bit of rearchitecture, and I was able to still protect enough of the View environment to make me feel secure and to convince the security people that it was sufficient.

Now, I've got a similar error from the corporate network. Same message: "Connection to View server could not be established". But WTF? This is on the LAN; there shouldn't be a proxy problem. IE works just fine*, but View can't connect.

* That is to say, IE worked fine with the proxy, but the proxy requires user authentication, which is cached for the browser session, and I didn't think of that until later.

So I fired up Wireshark again. Once again, the first couple of View CONNECT :443 requests happily sent the Proxy-Authorization: header just like IE does, but the last one tried to CONNECT without that header and was tossed back a 407 Proxy Authentication Required by Squid.

Ah, that's a relatively easy one to fix, if only I could get the proxy admin to turn off authentication (nope, that's verboten), or do the same sort of magic as I did on the outside firewall deployment (eww, that'd be messy), or maybe bypass the proxy entirely for this? I mean, they're on the LAN. Luckily VMware apparently thought of this and implemented an undocumented registry key, HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware VDM\ProxyBypass, which contains a REG_MULTI_SZ list of names or IPs that View should connect to directly instead of going through the proxy.
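
For anyone else hunting for it, setting the key from an admin command prompt looks roughly like this (the server names are placeholders for your own View connection servers; \0 separates the MULTI_SZ entries):

reg add "HKLM\SOFTWARE\VMware, Inc.\VMware VDM" /v ProxyBypass /t REG_MULTI_SZ /d "view1.example.com\0view2.example.com" /f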

Did I mention that all of this new behavior is undocumented? And that what I'd been doing in the first place was both unsupported and completely WORKING?

I'd guess that the new View client switched from a standard MS HttpRequest method to something they threw together without the nice functionality that IE bundles into its method. Oh well. It's working again now.

--Joe

2008/09/23

Moving RHEL to ESX

We have a cloud application that we've purchased for in-house deployment. It's a long story that I won't share, but we've been seeing performance problems with it, and we've ended up with a (physical) server in our datacenter that the vendor has configured just like their cloud resources, so we can compare performance on our (virtual) system against the way they run things.

To make sure we capture the most data we can about this system (and to demonstrate that it either is or is not the virtualization layer causing slowness) we've been tasked with copying the physical server into a similarly-configured virtual machine.

Unfortunately, VMware Converter apparently can't do Linux. So we had to use the more standard drive-image-transfer toolkit of dd and netcat. But even after the image was transferred, the kernel would crash because it couldn't find the root disk.
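
The transfer itself is the usual dance, something like this (device names, host name, and port are examples, and the exact netcat flags vary by flavor; the target is the new VM booted from a rescue CD with an empty virtual disk attached):

# on the target VM: listen and write the incoming image to its blank disk
nc -l -p 9000 | dd of=/dev/sda bs=1M
# on the physical source: stream the whole disk across the network
dd if=/dev/sda bs=1M | nc target-vm 9000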

This is to be expected, but only Google knows how to solve it -- if you know the right keywords. Good screenshots of the process are at http://virtualaleph.blogspot.com/2007/05/virtualize-linux-server-with-vmware.html, but note that it's modprobe.conf, not modules.conf, in RHEL5.

So here are the steps I took:

Boot from the (correct) RHEL install CD #1 with "linux rescue". Note that it has to be the correct RHEL version and architecture. Since the appliance was running 5.2 x64 edition, my 5.1 x86 cd didn't work, and I had to download a different CD1.

Skip the networking config since it won't help, let it search for RHEL installations, and continue to the root shell (in read-write mode).

chroot /mnt/sysimage and edit /etc/modprobe.conf. Change the eth0 module alias to pcnet32 (you can remove the eth1 alias if you don't have a second NIC in your VM) and change the scsi_hostadapter alias to BusLogic. (Again, you can remove other aliases if you want.)

Then copy /boot/initrd-<version>.img to /boot/initrd-<version>-phys.img as a backup, and build a new initrd file with the new devices: mkinitrd -v -f /boot/initrd-<version>.img <version>
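
Putting those steps together, the session in the rescue shell looks something like this (assuming the RHEL 5.2 kernel is 2.6.18-92.el5; substitute whatever version the image actually has):

chroot /mnt/sysimage
# /etc/modprobe.conf should end up with the VMware-friendly drivers:
#   alias eth0 pcnet32
#   alias scsi_hostadapter BusLogic
vi /etc/modprobe.conf
cp /boot/initrd-2.6.18-92.el5.img /boot/initrd-2.6.18-92.el5-phys.img
mkinitrd -v -f /boot/initrd-2.6.18-92.el5.img 2.6.18-92.el5
exit    # leave the chroot, then exit the rescue shell and reboot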

If that works, you should be able to boot the VM and have it come up cleanly.

--Joe

2008/09/17

Reflections on x4500+ZFS+NFS+ESX

I was asked about my thoughts on running ESX over NFS to a ZFS backend. For posterity, here they are:

x4500+ZFS+NFS+ESX is a quite functional stack. There are a few gotchas that I've run into:

First, the ESX "storage delegate" functionality doesn't. It's supposed to change the EUID that the ESX server sends with its writes. Well, it does for most of the requests, but not for things like creating the VM's swap file. So you pretty much have to export your NFS shares with root=vmkernel.ip.address.

We have many ESX servers, so keeping the sharenfs= parameters in sync got unwieldy. I ended up putting them in a text file in the NFS share for easy editing; when I have to add or change an ESX server, I edit the file and run zfs set `cat zfs.shareprops` /pool/path/to/share
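
The file is just the property assignment on one line, so the whole thing is a sketch like this (the pool/dataset name and the ESX vmkernel IPs are made up):

# zfs.shareprops lives in the share itself and holds a single line, e.g.:
#   sharenfs=rw=@192.168.20.0/24,root=192.168.20.11:192.168.20.12
zfs set `cat /tank/vmware/nfs01/zfs.shareprops` tank/vmware/nfs01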

NFS is much better than iSCSI. At least in the version I did iSCSI testing with, all of the ZFS volumes presented from OpenSolaris were recognized by ESX as being the same disk. This meant I had a dozen paths to the same VMFS datastore, some 100GB, some 500GB, etc. This Was Bad. NFS made it better.

NFS also gives you a couple of other benefits. On NFS datastores, the vmdk files are thin-provisioned by default: if you give your VM a 5TB vmdk and don't use more than 10GB, it only takes up 10GB of capacity on the physical disks. NFS is also much better understood by troubleshooting tools (Wireshark), so it's easier to find problems like the storage-delegate issue above. And it's a first-class citizen from Sun: NFS serving has been in Solaris since 1994 and isn't broken by the latest Nevada builds. Sun takes NFS seriously.

The downside of NFS is that ESX makes all of its requests O_SYNC. This is good for ESX but bad for ZFS performance. Your NVRAM cards should help a lot. I ended up with a different solution: the business agreed that these are not Tier-1 VMs and they're not on Tier-1 storage, so I've turned off all ZFS sync guarantees in /etc/system:


* zil_disable turns off all synchronous writes to ZFS filesystems. Any FSYNC,
* O_SYNC, D_SYNC, or sync NFS requests are serviced and reported completed
* as soon as they've been transferred to main memory, without waiting for
* them to be on stable storage. THIS BREAKS THE SAFETY SEMANTICS AND CAN
* CAUSE DATA LOSS! (Clients move on thinking the data was safely written
* when it wasn't.)
* However, in our case, we can afford to lose this data. For DEV/Test systems,
* rollback to the latest (hourly) snapshot is considered acceptable.
set zfs:zil_disable=1


As the comment says, this would be a bad thing. But I know that the vmdk files are crash-consistent every hour, and that's OK with the users. If they lose an hour of work, it's annoying, but it's worth the cheaper storage.

Finally, and most importantly:

MAKE SURE YOUR POOL IS CONFIGURED FOR YOUR WORKLOAD. VMs are effectively a random-read and random-write workload; there is no sequential access of the vmdk files except when you're cloning a VM. So you have to understand the read and write characteristics of your ZFS pool. RAID-Z and RAID-Z2 always read and write a full RAID stripe every time, which means a read has to touch all of the disks in the set to return a single byte of data to the ESX host. Mirrored pools, on the other hand, read from a single disk and, if the checksum is correct, pass the data back to the ESX host. So in my case, I can have 44 simultaneous read requests from the ESX servers being serviced at the same time (44 disks in the pool) and/or 22 simultaneous writes (each write goes to two disks). Basically, RAID-Z[2] is bad for random workloads, but mirroring is expensive.
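
For reference, the mirrored layout is built up pair by pair, something like this (device names are illustrative; the real x4500 spreads the pairs across its six controllers):

# mirrored pairs: each read can be satisfied by one disk, each write costs two
zpool create tank \
    mirror c0t0d0 c1t0d0 \
    mirror c0t1d0 c1t1d0 \
    mirror c0t2d0 c1t2d0
# ...and so on through the remaining pairs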

With this in mind, performance on the thumper is excellent. We can easily saturate the onboard 1Gbps network link with NFS traffic; with link aggregation I can easily saturate the combined 2Gbps link. I haven't seen what happens with 4 uplinks, but I'd expect the network to still be the slowest part of the chain. Doing basic I/O benchmarks on the thumper, I can get 1GBps out of the disks. Yes, that's 1GB per second.