Simplified ESX patching

Here's what I've set up for our VMware ESX servers...

I have a space that's accessible via HTTP (snippet from httpd.conf):
<Directory /usslsbds001/esxpatches>
    Options +Indexes
    Order allow,deny
    Allow from all
</Directory>

Alias /esxpatches/ "/usslsbds001/esxpatches/"

In there, I have directories corresponding to the dates VMware has released patches (the ones I'm interested in, anyway):
# pwd
/usslsbds001/esxpatches
# ls -l
total 33
drwxr-xr-x 5 root root 5 Dec 6 15:24 20071115
drwxr-xr-x 8 root root 8 Dec 6 15:25 20071130
drwxr-xr-x 3 root root 13 Dec 12 14:13 latest
drwxr-xr-x 2 root root 14 Dec 12 14:14 packed

packed has the downloaded tgz files, $YYYYMMDD has the extracted patches for that date, and latest has the unpacked directory of 3.0.2 update 1, plus symlinks ESX-1234567 -> ../YYYYMMDD/ESX-1234567. When a patch is superseded, I `chmod 0` it and remove its link from latest.
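The moving parts are just directories and symlinks, so the whole scheme can be sketched in a few lines (hypothetical patch IDs, and a scratch directory standing in for the real share):

```shell
set -e
root=$(mktemp -d)
# per-date directories of extracted patches, plus packed and latest
mkdir -p "$root/esxpatches/packed" \
         "$root/esxpatches/20071115/ESX-1001234" \
         "$root/esxpatches/20071130/ESX-1005678" \
         "$root/esxpatches/latest"
cd "$root/esxpatches/latest"
ln -s ../20071115/ESX-1001234 ESX-1001234
ln -s ../20071130/ESX-1005678 ESX-1005678

# when ESX-1001234 is superseded: make it unreadable, drop its link
chmod 0 "$root/esxpatches/20071115/ESX-1001234"
rm ESX-1001234
ls
```

After the superseding step, latest only lists the patches that should still be applied.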

I also have a scriptwriter that generates a set of esxupdate commands:
# cat ../latest/make-install.sh
#!/bin/sh
# generate an "install" file for the ESX patches in the current directory

DS=`date +%Y%m%d%H%M`

ls | grep -v install | while read patch ; do
    echo "esxupdate -n -r http://`uname -n`/esxpatches/latest/$patch update" >> install.$DS
done

rm -f install && ln -s install.$DS install

All of this rolls together on the ESX service console by simply doing (make sure the HTTP client is open in the firewall):
GET http://thestorageplace/esxpatches/latest/install | sh

and rebooting...



Straddling the firewall with Zones

Our zonehosts have multiple NICs, on multiple subnets. This means that they have multiple default routes defined, so non-local packets are passed to those default routers in a round-robin fashion. In the past, this has not been a problem, because these default routers are actually just routers.

However now, I am creating a set of zonehosts that will be straddling a firewall. And like any good firewall, they will drop packets that are coming in on the "wrong" interface. So here's what I had to do to make this work:

Here's the config for this example (addresses shown as placeholders):
ce0 (<subnet-A>) -> fw interface
ce1 (<subnet-B>) -> fw interface

On the global zone, edit /etc/ipf/ipf.conf to add the following rules for each interface:
block out quick on ce0 to ce1:<router-B> from <subnet-B> to any
block out quick on ce1 to ce0:<router-A> from <subnet-A> to any

Now all the packets are put on their correct interface.
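For concreteness, here's how those rules might look with made-up documentation addresses (192.0.2.0/24 behind ce0, 198.51.100.0/24 behind ce1, the firewall at .1 on each; substitute your real subnets and gateways):

```text
# /etc/ipf/ipf.conf -- hypothetical addresses
# force traffic sourced from each subnet out its own interface
block out quick on ce0 to ce1:198.51.100.1 from 198.51.100.0/24 to any
block out quick on ce1 to ce0:192.0.2.1 from 192.0.2.0/24 to any
```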

The only remaining question is "how does this deal with IPMP and link failures". That's something for this afternoon's research.



Solution to a VMware license problem

I'd been having a problem with our VirtualCenter installation: I'd removed an ESX server from the inventory, then tried to add it back. This operation would fail with an error message of "There are not enough licenses to perform the operation", and an event would show up reading "Not enough CPU licenses".

Now, we have plenty of VC agent licenses (especially since I'd just removed that same server from the inventory), so I opened a call with VMware. After making the mistake of calling it a license problem (which bounced me to a FlexLM-only support group that couldn't bounce me back -- but they did validate the license file I was using, and verify that yes, we are really licensed) I was able to talk to a moderately useful representative.

We walked through the log collection process, gathered a bunch of data, discovered a corrupt VM in the inventory (removed it), gathered more logs, and I went home for the day. The next morning, the ESX server added with no problems. So I closed the case.

Now, over the weekend, we had one of our ESX servers die. I got paged and was told (third-hand: the user reported to ops, who reported to another SE, who told me) that something was wrong with $otherserver. Oh well, I logged in and could tell what they were complaining about -- $server was unresponsive. Unfortunately, I hadn't turned on HA on that cluster, so it didn't fix itself automatically, and I wasn't able to migrate the VMs to the other host (even though the VMs are on shared disk). So I deleted $server, added the VMs to the inventory via $otherserver, booted the VMs, and went on with my thanksgiving.

Today, when I booted $server (power was off, and I didn't have the DRAC configured, also the KVM was unplugged -- I think this was the original problem) and tried to add it back to the inventory, *POOF* same "There are not enough licenses to perform the operation". So do I open another mostly-useless support call? No! I'll fix it myself this time.

`strings -10 /usr/lib/vmware/vpx/vpxa | grep / | more` eventually turned up the config file /etc/vmware/vpxa.cfg. I ran `service vmware-vpxa stop`, mv'd that config file to a backup, and added the server back in.

And the fscking thing worked. Grr. The newly-created vpxa.cfg file is identical to the old one, too.



Learn something new

Every once in a while I pick up a new trick... Here's Solaris's answer to "what if a file has weird whitespace in its name?" The GNU userland has `find -print0` and `xargs -0`: since you can't have a \0 as part of a filename, it's safe to use as a delimiter.

find [ ... ] -exec cmd {} +

I had to use this when I was searching and cataloging (and checksumming) files from various previous hard drives. I had transferred all the data over to a ZFS pool (with compression) from a couple of Windows installs, and needed a good way to walk through "/tank/hdc/Documents and Settings/" and "/tank/hdc/Program Files" nicely. And out (of google) pops something that I've missed for years.
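Here's a quick demo of the trick in a scratch directory (file names made up):

```shell
# two files whose names would wreck a naive `cmd $(find ...)` pipeline
d=$(mktemp -d)
printf 'hello\n' > "$d/Plain File.txt"
printf 'world\n' > "$d/Another  Weird Name.txt"

# -exec ... {} + batches as many paths as fit into one invocation,
# like xargs, but the names never go through shell word-splitting
find "$d" -type f -exec wc -l {} +
```

Same effect as `find "$d" -type f -print0 | xargs -0 wc -l`, without needing the GNU extensions.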

Not that I would have really expected to look for this gem in the man page, since I already knew how to use find. But there it was. I guess it's a documented interface now.


(P.S. With multibyte characters in file names, is it really safe to assume that \0 will never occur? For UTF-8, at least, yes: no byte of a multibyte character is ever zero.)


wishful feature: zfs splitfs

I have a tank/data filesystem, with my important "stuff" in it, including /tank/data/oracle and /tank/data/webcontent. This is a production system, so I can't shut down to move the data around. I need to quota off the web content so it doesn't run Oracle out of space.

So what I'd like to do is...
zfs splitfs tank/data/webcontent
zfs set quota=5g tank/data/webcontent

Conceptually, it seems simple enough. Just create the appropriate new zfs filesystem entries in the pool with its root inode pointing at an existing directory. No data copying necessary.

Unfortunately, I think it would not work, because there may be open files on the new(ly partitioned) filesystem, so the (fsid, inode) pair on those open files would have to be changed to (newid, inode) in every process. Atomically. As part of the update to the zpool metadata. Or else the kernel would have to be able to realize that the same inode is referenced by two different filesystems. :(



Installing SunCluster

We've bought into Sun Cluster (AKA Java Availability Suite), and it's now my job to install it. I have 3x SunFire V490s: 32G of memory each, 4 physical CPUs (8 cores), each with a quad-gig card (plus its 2 onboard NICs) and a dual-port SAN card.

Solaris 10 update 4 (08/07). Sun Cluster 3.2.

Notes on the installation:

So far, it's been pretty straightforward, except when I was trying to create the cluster: when it rebooted the first node, it never noticed that the node had rebooted in cluster mode.

This is because the rpc/bind (portmap) service is set to only allow local connections out of the box. A quick `svccfg -s network/rpc/bind setprop config/local_only = false` plus an `svcadm refresh network/rpc/bind` (on all nodes) and the cluster is now up and running.

Bug opened.



Installing OpenSolaris b63 on VMware Workstation 5

I've got the Big Disk Server (an x4500), and since it's going to be an iSCSI target, I have to install it for now with OpenSolaris post-build-54: b54 is where the iscsitgt code got its putback.

Anyway, I've installed b63 on that monster, but before I can get everything working, I have a week-long "vacation" to learn EMC Control Center administration. So what should I do while I'm free? I'll work on setting up the rest of the administrative niceties that I need for the BDS.

Since I already have VMware Workstation on my new laptop, I'll install a b63 box, give it a couple of virtual disks for the ZFS layer, and see what code I can crank out while I'm gone.

It's never that easy, though.

Bug1: recent builds of OpenSolaris (specifically the mpt driver) cause VMware to crash. So it's IDE disks for me.
Bug2: the default X config makes the screen resolution bigger than my laptop's LCD, so I have to scroll through. Since I prefer a text-based install, I'd rather turn off X entirely.
Bug3: There's almost no documentation on how to get it to do that. There's the old "nowin" command line option (still in this version according to the docs) but I can't figure out how to pass that to grub. And the menu I'm given has 3 options (Install, Add drivers, or Shell) rather than the 7 the documentation shows.
Bug4: I don't want to install the whole distribution. The damn thing beeps if I haven't selected things correctly. Even though I have my laptop muted. And the VMware audio disconnected. And a headphone plug in the jack. How the fsck is it getting the beep through?

Bug3's workaround is to use the "Solaris Express" menu option in grub, rather than "Solaris Express Developer Edition". Grr.



NDMP tape restores

(Background: last week, "they" decreed that users should clear up unused disk space. Being a technology company, at least one user decided to write a script to clean up all his unused files, and ran this script on /net, or something. Anyway, there are now three important areas of the file shares that have no content any more. It's kinda interesting to note that all three of these areas had "test" as a component of the directory path.)

The environment: NetApp filers, no snapshots of this space, monthly full backups controlled by EMC Networker via NDMP to TAN-attached SDLT600s.

In the past, these sorts of problems would be handled by either Ops or my group, depending on the year and where the (give us something to do) vs. (do it right) pendulum was swinging. Currently, it's pointing at Ops. Except that their documentation is incomplete, so I have to get involved throughout; but on the bright side, they'll watch the tapes spin overnight. Assuming the restores go well.

Naturally, the restores aren't going well, otherwise I wouldn't be blogging about them.

My test restores (grab 1 file off tape) worked. The first restore worked using the nwrecover GUI. It was able to pull 200GB off tape and put it back onto the "autotest" share in about 24 hours.

$COWORKER's test restores (grab a couple of files off tape) didn't. They failed with an error of "NDMP Service Log: Only one path for each src and dst can be specified." Restore #2 (2GB of web content) broke with the same error message. Restore #3 (1MB of user scripts) failed also.

Well, ok, the error message reads like Networker's putting something weird in the NDMP protocol.

A dig-in with Ethereal should help, and maybe I can figure out what inputs it needs to get the right outputs. Or not. Ethereal has some understanding of NDMP, but doesn't seem willing to splice the multi-packet NDMP requests back together and display them in a way that makes sense to me. Oh well.

But I know that NDMP is simply a Command & Control (C&C) protocol, the actual on-tape format is whatever native format the data server uses. In the case of Netapp, it's compatible with ufsdump on solaris. All I have to do is position the tape to the correct file, and pull the data over onto my Solaris backup server. Since I have shared tape drives, this'll be easy.

# find the volume, tape file, and record offset for the saveset
mminfo -q ssid=123456789 -r volume,mediafile,mediarec
# load the tape from slot 16 into drive 1
sjimm 0.100.0 slot 16 drive 1
# skip forward to the saveset's file on tape
mt -f /dev/rmt/ciqsjb01_1 fsf 3
# interactive ufsrestore with blocking factor 60
ufsrestore -ivfb /dev/rmt/ciqsjb01_1 60

browse to the data, and restore it. This works well for restore #2. The data comes back to the backup server, and it's ready to be copied off.

The backup for restore #3, on the other hand, spans 2 tapes. This makes things much more complicated. As I mentioned, NDMP is purely C&C. When a tape gets full, NDMP simply pauses the writes until the NDMP tape server has the next one ready, then resumes the writes. There's no feedback to the dump process that the tape has been changed, so dump considers it to be a single-volume backup. And in between the "unload the tape" and "next tape is ready" steps, Networker naturally puts "load the next tape" (makes sense) and "write a Networker label on the tape" (which adds file marks to the tape, which I have to skip before passing the next block to ufsrestore).

So how do I fake out ufsrestore to use the 6th file on tape 1, then when that runs out of data (rather than abort with an i/o error) wait until I load the next tape, then seek forward 3 files, and continue reading? Something like "(dd if=/dev/rmt/thefirsttape ; dd if=/dev/rmt/thesecondtape) | ufsrestore -ivfb - 60" should work, except that I can't tie up both tape drives for that long, and I don't trust Ops not to break things. I need it to switch tapes in the drive.

But this doesn't work, and I don't know why: mt gets an i/o error on the 2nd tape.
( # Have the first tape positioned correctly
dd if=/dev/rmt/ciqsjb01_1 bs=61440
# unload tape 1, swap tape 2 into the drive
mt -f /dev/rmt/ciqsjb01_1 offl
sjimm 0.100.0 drive 1 slot 22
sjimm 0.100.0 slot 23 drive 1
# skip the label file marks on tape 2
mt -f /dev/rmt/ciqsjb01_1 fsf 3
dd if=/dev/rmt/ciqsjb01_1 bs=61440 ) | ufsrestore -ivfb - 60

This should work, right?
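At least the splicing idea itself checks out, with plain files standing in for the tape device and `cat` standing in for ufsrestore (all names here are made up):

```shell
set -e
# two files play the role of the two tape volumes
d=$(mktemp -d)
printf 'first half,' > "$d/tape1"
printf ' second half\n' > "$d/tape2"

# the reader on the right sees one continuous stream across the
# "volume change" inside the subshell
( dd if="$d/tape1" bs=61440 2>/dev/null
  # ...this is where the real procedure unloads volume 1, loads
  # volume 2, and skips Networker's label files with mt fsf...
  dd if="$d/tape2" bs=61440 2>/dev/null ) | cat > "$d/joined"
cat "$d/joined"
```

So the concatenation is sound; the trouble is somewhere in the tape handling, not the pipe.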

In the end, I've opened a call with EMC. This is apparently a bug between Networker <7.2.2 and OnTap versions >=7.2.2, and it's fixed in the latest version of Networker. But in the meantime, a full-saveset recover will work, and I have that running now.



Fun's over!!!

Looks like my fun with filesystems is over... http://sunsolve.sun.com/search/document.do?assetkey=1-26-102899-1 describes the same sort of output as http://enterprise-sa.blogspot.com/2006/11/fun-with-filesystems.html

Oh well.



The Good-Enough trap

There's a danger out there... It grows in little alcoves and cubicles, where a Group needs a piece of software to fill a particular need.

Smart People create The Solution. Sometimes there is an explicit description of the Requirements, which does not go beyond a handful of users. With or without Requirements, the Smart People select and begin work on The Platform. This selection is based on many criteria: how familiar it is to them (or the learning curve), wanting to add "$Platform programming" to the resume, having just read an article on how $Platform makes $task easy, etc. Usually not on the list are "ability to scale", "backup support", "algorithmic efficiency", "we have supported hardware to run it on", or "plays nicely with other applications".

But The Solution is created and used by the Group. And it works. And the Group is more productive because of it. So naturally, since other groups want to be more productive, they want to be Users of The Solution too. And as the Users grow in number and timezone-diversity, the limitations of The Platform become more apparent, at least to those of us on the back end. Faults, inefficiencies, downtimes, management headaches: these are usually hidden from the Users (or at least aren't visible enough, often enough, to generate real complaints). Eventually, Leadership recognizes the value of The Solution (or at least, they recognize the value of the increased productivity), and The Solution becomes an integral part of the business.

At this point, we are put into a difficult position. Limitations of The Platform, unscalable User management tools, hardware choices, etc. mean that The Solution needs to be upgraded, improved, or otherwise replaced. But naturally, work on The Solution really doesn't fall into the area of expertise of the Group any more (or at best, fixing The Solution doesn't generate billable hours), and the Smart People who developed it in the first place have either left the company or are too busy to reimplement it. So it comes down to IT having to choose between: 1) support and maintain the Unmaintainable, or 2) replace it with "the IT way", and deal with the costs of development as well as retraining the Users (and the political cost of insulting the Smart People's Solution). Usually we're stuck with (1).



Ask the right question, and you have the answer

I really hate when I'm sending an email to ask a highly-technical question, and in the process of formulating the question find the answer.

We're looking for the Next Big Disk Array to replace the Previous Big Disk Array, which has lately been showing its age in the performance arena. This is the Big Disk Array that we use as a Networker adv_file device, where we write the Big Database backups.

There are lots of people who sell BDAs. I can pretty much characterise the product options as:
  • Proprietary (usually web-based) interface that doesn't integrate with any other management tool (that's another rant to be ranted someday)
  • Proprietary ASIC on a controller board (possibly redundant Active/Active, or Active/Passive)
  • Some number of 1, 2, or 4Gb Fibre and/or 1G iSCSI ports
  • Cache memory, usually up to 2GB
  • as many disks as will fit in that number of rack units
And it takes a 3-page PDF to marketspeak that. Anyway, from a performance standpoint, the only two numbers ever referenced are the uplink speed (!look! we have 4Gb fibre) and maximum throughput (which is never explicitly defined).

Max throughput, I generally assume, means "read whatever the optimal block size is out of cache, and imagine that the whole array is that fast" (cf. peak transfer rates from consumer disk drives). Unless the unit supports expansion units, in which case it's "get as many expansion units as we can install, stripe a single disk group across all of them, and then report the aggregate throughput from that".

Neither is particularly helpful for me to figure out if we can write "database_backup.tar" onto the array fast enough. But I digress.

The question I was trying to ask is:

Where does it make sense to perform I/O reordering, redundancy, and cacheing:
  • On the array's controller card (which is a custom ASIC with 2GB of cache) -or-
  • In the Solaris I/O stack (including ZFS) on a server with 8GB of RAM and knowledge of the application's I/O pattern and years of performance optimization
In addition, this is not an exclusive-or: the Solaris layer is still going to be optimizing its I/O pattern, possibly with wrong assumptions about the performance and parallelism of the LUN. Or even worse: our PBDA couldn't act as a single big LUN, so the Solaris layer is queueing 3 I/Os in parallel to what it thinks are 3 different disks, when in fact they must be serialized by the controller with a long seek in between. This is clearly not optimal.

(Which reminds me... the custom ASIC has virtually no ability to actually measure or tune the performance of the system. There is no concept of exposing performance or profiling data, and there's no way to determine that these seeks are really causing the slowness. On the Solaris side, OTOH, there are things like seeksize.d that can help figure out why the fscking thing is so slow.)

Just framing the question has taken me from 60/40 in favor of JBOD to about 95% in favor of it.


Fighting Rogue

I'm about >< close to going "rogue sysadmin".

What does that mean? I'm very close to just saying "screw it" when it comes to any sort of collaborative decision-making about technology, and just implementing what I think is best. That's what makes me a "senior technical lead", right? That I know best? Or at least that I can make a decision about what's best for my team without having to get buy-in from a half-dozen other managers whose groups have wildly different and conflicting goals?

Why? I'm very angry about several projects that have been stalled waiting for other groups to buy in on a framework that will solve everyone's problems. The missing functionality (monitoring) has very much come to the fore over the past 30 days or so, since the big SAP upgrade. And especially with Saturday's DST patching.



One helluva worklist

This is my first work-tuesday since november. (last tuesday counts as a work-monday since monday was a holiday) Last night, I realized I had forgotten how draining this work stuff is. I'm glad I have a supportive wife at home.

Met with $PHB yesterday from 3:00-4:50 for an hour-long meeting. Discussed what's happened since I was last in the office...

The company is going with ScaleFarce for CRM. This was a surprising turnaround, given the (admittedly fourth-hand) account of the negative reception this deal got the first time around. And in order to make it easy to integrate "their" systems with "ours", we're going to implement MS BizTalk as an inter-middleware layer.

Did I mention we have no experience (as a company) with MS BizTalk? And that my group is expected to deliver a production-quality landscape (including DEV and TEST systems) by early march?

Also, $COWORKER{"leftcoast"} is moving up the coast a couple of hours to work in a sales office (rather than in the building with the datacenter). And since he wants to be on a management track (he even has an MBA), he's going to be given lots of responsibility. $PHB wants to virtually split his org between "Systems Infrastructure" and "Application Infrastructure" (but since there aren't enough people, everybody gets an SI hat and an AI hat).

He wants me to (long-term) sit as technical advisor/architect atop both teams. I'm probably going to be saddled with (short-term) technical team lead responsibilities over the SI hats. I think I agree with the long-term plan, but I'm not so sure about being a "team lead". I definitely don't want the administrative headache of being a real manager (with a box on the org chart): salary, reviews, budgeting, HR issues, etc., not to mention the endless "IT Manager Meeting"s.

After discussing history, we went on to list the team's major projects (as we see them) for 2006. The list filled the whiteboard. It's going to be an interesting year.

Catching up on sudokus:




Time Off In Lieu (although in this case, it's more appropriately In Labor)

I'm back to work today for what feels like the first time since before thanksgiving. That makes today one hell of a monday.

I was supposed to be back last tuesday, but my son went to the hospital that night, so I was out the rest of the week. (He's fine; it was a virus that caused a high fever but no other impact.)

But that doesn't mean nothing's been happening around here since then. There was a flurry of activity the second weekend in December when I migrated the production SAP database to a new symmetrix, with a new LUN layout.

Then even more fun when I was trying to reformat parts of that new symmetrix to support our BCVs, which caused the production database to lose I/Os. That was real fun, let me tell you. (Maybe in another post.)

We had an unusually quiet (for me) end-of-year: $PHB sent out the holiday on-call schedule, and I was not on it. (I still got called once during my vacation, but only once.)

Last week, I managed to copy the production SAP database over to the sandbox server without missing too much time at home (it was the afternoon we got back from the hospital, and I was able to get it done while he was sleeping), so the SAP team can run through another "trial" upgrade in prep for our upgrade in Feb.

Otherwise, it's been very quiet.

Oh yeah, and the networking team has moved out of my row of cubicles, over to the other side of the basement. So no more shouting over the cube wall "teh Intarweb's broke".

I'm already a week behind in my new year's resolutions:
1. 1400x1050
2. blog the "sudoku of the day"
3. Clean up my cubicle
4. Get my home computer working right (get it to stop locking up when I have the USB wireless adapter connected)
5. Install the copy of Adobe Premiere Elements SWMBO got me for christmas and learn to use it (by finishing the video of my cousin's wedding, and the kids videos)

Also on my todo list... need to set up the virtualization lab at the end of the hall. I've got 2 shiny new Dell 490s with dual dual-core (and I think HT and VT) "workstations" that I need to get working :)

So the sudoku for 1/1 reads: