Enterprise SA: 2006

2006/11/21

Legacy and transition

I hate trying to transition work to other people. I'm at the hospital right now helping SWMBO have our 2nd baby. So I'll be off for a while.

So I'm leaving several projects unfinished... the SAP upgrade sandbox systems, the BEA monitoring project, the Oracle installation & monitoring project, the whole EMC upgrade, the cluster implementation, as well as supporting the treasury project, the Hyperion upgrade, the WebFOCUS upgrade... not to mention the usual stuff. Much of it is in the critical path for our big SAP upgrade (4.5 to 6.0) in February.

And I guess I'm just not comfortable that I can successfully hand these projects over to the rest of my team.

Previously, I have interpreted this as a lack of communication on my part-- I haven't taken the time over the past 2 months (not like this wasn't a planned leave) to make sure that the rest of the team has the knowledge to keep these projects moving. Now, I'm not so sure that I could have done anything differently.

The members of the team who have the skills to take up any of these projects are vastly overcommitted (not all of these projects are just mine -- I just advise and consult on some of them), and I don't think I can help the remaining team learn what they would need to learn in order to make meaningful contributions to these projects. For example, they're Windows administrators, and this is a Solaris problem... it doesn't help if I basically use them as a speech-to-commandline interpreter.

Trouble is, I'm the most technically-skilled unix guy on the team, so I get in the critical path of so many projects. But am I realistically supposed to be able to transfer knowledge about ongoing problems where I'm also new to them?

Oh well, this post took a long time to come out, and lots of stuff has happened since then. The question still remains, though: How am I supposed to get everything done, including training a backup, when the whole team (me and all potential backups) is overcommitted?

--Joe

2006/11/16

Fun with Filesystems

I think there's a race condition in Solaris... we had a filesystem get full with Oracle archivelogs, so I removed them, then checked to see what effect that had:


# rm D*_60[012345]?_*.dbf
# df -h .
Filesystem             size   used  avail capacity  Mounted on
/oracle/D01/saparch    5.9G 16384E   6.4G 301049486643838% /oracle/D01/saparch

A moment later, it was happy:


# df -h .
Filesystem             size   used  avail capacity  Mounted on
/oracle/D01/saparch    5.9G   257M   5.6G     5%    /oracle/D01/saparch

This is not the first time I've noticed some weirdness with removing data on S10. Last time, I wiped out a copy of our big Oracle database (rm -rf sapdata*/*), which only took a few seconds, but unmounting the filesystem took over 8 hours.
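
One plausible reading of that first df line, offered as a guess rather than a diagnosis: if "used" is computed as size minus free in unsigned 64-bit arithmetic, counted in kilobytes, then a moment where free space briefly exceeds the size wraps around to just under 2^64 KB, which is 16384 exabytes. The sketch below assumes exactly that; the sizes are approximations of the 5.9G and 6.4G figures above, and the function name is made up for illustration.

```python
# Sketch only: assumes df derives "used" as total - free in unsigned
# 64-bit arithmetic, counted in 1 KiB units. Sizes approximate the
# 5.9G and 6.4G figures from the df output.
KIB_PER_GIB = 1024 ** 2
KIB_PER_EIB = 1024 ** 5   # 2**50 KiB in one EiB
U64 = 2 ** 64

def used_kib_unsigned(total_kib, free_kib):
    """Subtract the way an unsigned 64-bit counter would (wraps below zero)."""
    return (total_kib - free_kib) % U64

total_kib = int(5.9 * KIB_PER_GIB)   # filesystem size
free_kib = int(6.4 * KIB_PER_GIB)    # free space momentarily exceeds size

used_kib = used_kib_unsigned(total_kib, free_kib)
used_eib = round(used_kib / KIB_PER_EIB)
used_pct = used_kib * 100 // total_kib   # integer percentage, also absurdly large

print(f"used: {used_eib}E")
print(f"capacity: {used_pct}%")
```

The "used" figure comes out at 16384E, matching the output. The percentage lands in the same order of magnitude as the one df printed but not on the exact number, so the real calculation presumably uses a slightly different denominator.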

--Joe

2006/10/17

Write something

It's been over a month since I last posted. It's not like I haven't been dealing with lots of enterprise-SA type material, just that I've been too busy to even breathe, much less distill my thoughts into something for this site. But since I'm sick right now, I sorta have a little bit of time on my hands...

Some of the recent topics that are worth discussing (probably in their own posts, or several posts)...

Thoughts from the monitoring meeting (discussions about what we need for enterprise monitoring, but not all related to monitoring): false buy vs. build dichotomy; fundamental architectural difference between BB-style and SNMP trap... (no explicit "OK" status) ; Industry combination of Monitoring tools with Management tools; the myth of Agentless monitoring; SNMP support on Windows (SNMP Informant)
An Infrastructures.org mailing list post Message-ID: <20060818174228.B26037@so.lanier.com>
The usefulness of professional services and consultancy in enterprise application deployment: experiences with CA, EMC, and Hyperion
Why the hell can't I keep my desk clean?
I miss going to conferences: VMworld is on now, LISA is in December. I'm expecting a new baby about halfway between, and there's no way I can go out of town for a week.
I hate being sick. Daytime TV sucks even with satellite and a DVR. If I'd known I was going to be sick this long, I should have joined NetFlix.

Not a bad topic list... now, discuss amongst myself.

--Joe

2006/08/25

The Ultimate P2V

There's been a lot of talk about the "Blue Pill" trick where a hypothetical virus would use the new x86 virtualization features (VT or Pacifica) to move a running OS under a hypervisor (where the virus would run undetectably). It would be very interesting to extend this into a positive technology...

Imagine a program that uses Blue Pill to move the OS under a hypervisor. That's fine, but the OS is still coupled to the physical devices (network cards, disks, etc). Now have the hypervisor generate a virtual (hotplug) PCI bus and attach it to the running OS. And have it hotplug a vmnic and an emulated SCSI controller. The OS notices the new redundant paths to the disks (standard multipathing software) and fails over all the network connections onto the virtual card. Then the hypervisor virtually unplugs the real PCI bus, and we're left with a completely virtualized (i.e. VMotion-able) machine. Without downtime.

That would be really cool.

This would require:

A bluepill-compatible hypervisor that can create virtual hotplug PCI buses, and that can transport running VMs across physical machines
An OS that supports PCI hotplug, dynamic disk multipathing, and transparent network failover
All the disks on the physical system being on a SAN or otherwise multihosted

--Joe

2006/07/18

DamnDamnDamnDamn

The hard drive in my work laptop is in the process of dying. That is to say, it has died (bluescreen: kernel inpage error) but has occasionally spun up enough to boot Windows.

Just long enough for the backup software to load and start a backup, not long enough for the backup to finish.

On the bright side, Support has sent me a new drive, and it's an 80GB: a 20GB upgrade from what I had. So I should have enough space now for some of the virtual machines I've been meaning to create.

Unfortunately, I still haven't finished installing my software on the new image (so far going on 4 hours of work). The only reason I have email is because OWA actually works through Firefox on Linux. Whoda thunk?

--Joe

2006/06/22

Enterprise Monitoring

In the grand quest for the "One True Ring^W^W^WSilver Bullet^W^WIntegrated Solution", this week's goal is to reduce the number of tools we're using for enterprise monitoring.

Currently we have 5 major players:

Microsoft Operations Manager
Big Brother
MRTG
Cesura
Custom-written "check" scripts

MOM has "Management support" (and, therefore $$$), Big Brother has a rich history of success (and is free), MRTG is tightly integrated with the way Networking does their stuff, Cesura has gone out of business (but they had some really cool demo technology), and of course, nobody really knows what those scripts do.

On the bright side, there's this Hobbit project I've been following for a while, which looks like a better Big Brother than BB...

On the really bright side, I've not been tasked with getting all this crap together.

I just get called on to get it working because $COWORKER[0] doesn't know Solaris at all (production enterprise is Solaris) but he's the MOM wizard, and $COWORKER[1] needs to learn more about our environment (relative new guy), and needs some visibility in the larger organization. I just happen to be the only expert in the monitoring world, just like everywhere else.

So because there's money for MOM, we're looking to see if there's any way to get non-Windows platforms to work with this Microsoft solution. As it happens, there are several third-party addons (management pack extensions) that purport to "monitor" non-Windows clients. Also the Windows guys love MOM because it has links to MSKB articles about how to tune Exchange servers when there's a low memory alert, for example.

The first extension we tried (from eXc Software) sucked. Not so much that the eXc software sucked, but OOTB, it monitors 4 items: Total CPU usage (alert if CPU is >10% busy), CPU usage by process (alert if a single process is eating more than 10% of the CPU), disk free space, and swap space usage. And that's it. Anything more and we have to write our own JScript (or VB) test that runs on the MOM server, leverages their "clientless" (aka telnet) interface to gather status on the server, and then parses the output to create a MOM/WMI event. And then maintain that code. Not exactly what we had in mind.

But eXc also has an SNMP extension agent to monitor Solaris via SNMP, so we'll try that too. A few clickety-clicks later, I've configured the basic SNMP service that's installed with the community names and it's running on our test box. Except that the software is exclusively trap-driven. And the Solaris side doesn't have any (readily apparent) way to throw the traps. Basically the eXc stack is just the Solaris trap MIBs pre-configured.

Well, if we're going down the SNMP route, let's see what MOM can do on its own. After all, it says it can monitor via SNMP. One KB article later (and lowering the monitoring standards significantly), I have our Linux-based Digi CM console server happily SNMP-trapping into MOM. Ok, it was a lot more than just the KB article: there was also some registry editing, MIB compiling, MIB editing so it would be acceptable to the MS SMI compiler, interpreting the help page for the MS SMI compiler, some minor VB scripting and finally, turning a checkbox on on the Digi. And we still only have SNMP traps. No queries, no performance trending, no performance alerts. Also, no MIB translation (so you have to be able to recognize that 1.3.6.1.4.1.332.10.14.14.0.2 means "authentication failure", which I'm sure we'll get good at in no time at all).
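
Short of real MIB translation, the stopgap is a hand-maintained lookup table. Here's a minimal sketch of that idea; the table and function names are made up for illustration, and the only entry taken from our setup is the Digi trap OID mentioned above.

```python
# Hypothetical hand-maintained OID table. The only entry taken from the
# post is the Digi trap OID; anything else added is the maintainer's
# own mapping, not something derived from a MIB.
TRAP_NAMES = {
    "1.3.6.1.4.1.332.10.14.14.0.2": "authentication failure",
}

def describe_trap(oid):
    """Return a readable name for a trap OID, or the normalized OID if unknown."""
    normalized = oid.strip().lstrip(".")   # tolerate a leading dot and whitespace
    return TRAP_NAMES.get(normalized, normalized)

print(describe_trap(".1.3.6.1.4.1.332.10.14.14.0.2"))
print(describe_trap("1.3.6.1.4.1.332.10.14.14.0.99"))
```

Unknown OIDs fall through unchanged, so at least nothing gets hidden while the table is incomplete.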

So back to the drawing board... there's 2 other extension packs for MOM that we're going to try out... one from here in Cincinnati (version 1.0 was released last week) and one that appears to be a whole management infrastructure that surrounds and integrates MOM (and happens to do non-Windows clients too)

The really unfortunate thing is (as I mentioned above) there's this Hobbit project, which would leverage our existing Big Brother clients and successes, and looks like it would be fairly straightforward to implement and has a reasonably sane, extensible architecture (but it isn't MOM -- the Windows guys really like MOM)

So I ask myself "what would it take to make Hobbit work with MOM?" (at least as well as the SNMP integration or the other products did)

Hobbit's backend consists of passing messages along "channels". Messages such as "serverX is down" and channels such as "status" or "page" (or "data"), passed via IPC to worker modules. It should just be a Small Matter Of Programming to create a worker module that would accept "stachg" (status change) and/or "data" channels, massage them into something like WBEM events, and toss them across to the WMI receiver on the MOM server. I mean heck, if VB can massage SNMP traps into WBEM, surely it can't be that hard. There's even sample channels in the hobbit distribution.
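
To make the shape of that worker concrete, here is a rough sketch. Big caveats: the pipe-delimited message layout is a simplified stand-in and not Hobbit's actual channel format, the event dictionary is a hypothetical shape rather than a MOM or WBEM schema, and the severity mapping is my own guess. The hand-off to the MOM server is left as a callback.

```python
# Sketch of a channel worker. The pipe-delimited layout below is a
# simplified stand-in, NOT Hobbit's real message format, and the event
# dict is a hypothetical shape, not a MOM or WBEM schema.
SEVERITY = {"red": "critical", "yellow": "warning", "purple": "warning"}

def parse_status_change(line):
    """Split 'stachg|host|test|old|new' into a dict, or None if it doesn't match."""
    parts = line.strip().split("|")
    if len(parts) != 5 or parts[0] != "stachg":
        return None
    _, host, test, old, new = parts
    return {"host": host, "test": test, "old": old, "new": new}

def to_event(change):
    """Turn a parsed status change into a generic event record."""
    return {
        "source": change["host"],
        "check": change["test"],
        "severity": SEVERITY.get(change["new"], "informational"),
        "message": f"{change['host']}:{change['test']} went "
                   f"{change['old']} -> {change['new']}",
    }

def run(stream, forward):
    """Read messages from stream and pass each resulting event to forward()."""
    for line in stream:
        change = parse_status_change(line)
        if change is not None:
            forward(to_event(change))

# Stand-in usage: a list instead of the channel, print instead of MOM.
run(["stachg|serverX|conn|green|red"], print)
```

The real work would be in the two pieces stubbed out here: reading the actual channel messages, and whatever it takes to get an event into the WMI receiver.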

I think it'd take a couple of days of programming (and learning how MOM is different than Microsoft's WMI is different than WBEM). Unfortunately I'm the only one in the group who can code. And with everything else that's going on, the chances of me taking a couple of days are exceptionally slim.

Oh well, maybe somebody else will read this and think it's a cool, easy idea.

--Joe

2006/06/02

Need to build a secure (public) download site

I have a fairly simple task in front of me: Provide a place for random internet users to download (via anon. ftp, http and/or https) one of a set of several 300MB files. (Oh yeah, and they have no budget for hardware)

From this, I add the "usual" Enterprise Systems requirements: It has to be

manageable
secure
reliable

Seems straightforward: We have a Solaris 10 system in the DMZ in the central datacenter, it has enough mirrored disk space (over 20GB free) and it's running an application that's "more important" than this little download site, so reliability isn't a problem. If I create a zone on this server, it will be no less manageable than any of the rest (ok, the other) of the DMZ-based virtualization servers we have deployed.

That just leaves the "secure" requirement. There's lots of "interesting" opportunities there, though...

I think ideally the zone would be a minimally installed zone (with just enough software to make apache and ftpd work) with everything mounted read-only from the global zone, and with a helper zone (only accessible to the LAN-side) having read-write access to the space (accessed via scp), with firewall rules allowing only (anyone->dlserver:80,443, and ftp) and (lan->helper:22). Oh yeah, and with traffic shaping to prevent this from eating too much of our outbound internet feed.
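
Written out as data, the intended access policy looks something like the sketch below. This is a model for checking the intent, not firewall configuration. The LAN range is a placeholder, port 21 stands in for "ftp" (control channel only, FTP data connections need extra handling that this ignores), and the names are made up for illustration.

```python
# Sketch of the intended policy as data. The LAN range and the use of
# port 21 for FTP are assumptions; FTP data connections need extra
# handling that this model ignores.
import ipaddress

LAN = ipaddress.ip_network("10.0.0.0/8")   # hypothetical internal range

RULES = [
    # (source network, or None for anyone; destination zone; allowed ports)
    (None, "dlserver", {21, 80, 443}),
    (LAN, "helper", {22}),
]

def allowed(src_ip, dest, port):
    """Return True if some rule permits src_ip to reach dest on port."""
    addr = ipaddress.ip_address(src_ip)
    for network, rule_dest, ports in RULES:
        if rule_dest != dest or port not in ports:
            continue
        if network is None or addr in network:
            return True
    return False

print(allowed("203.0.113.7", "dlserver", 443))
print(allowed("203.0.113.7", "helper", 22))
```

Anything not matched by a rule is denied, which is the point: the download zone is reachable from anywhere on three ports, and the read-write helper is reachable only from inside over ssh.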

The firewall rules are easy... that's someone else's problem. "They" don't do traffic shaping, however, so I get to figure out the Solaris IPQOS functionality, if I get that far.

So how do you create a minimalist zone? Answers as I find them...

--Joe

2006/05/26

Two new annoyances in one day

Microsoft OWA that we use (2003) apparently uses Microsoft Word (or at least some component of MS Office) as an ActiveX control to create email messages. I noticed this when trying to send a quick email to the team from my in-laws (who don't have Office installed), and it kept popping up the "preparing to install Microsoft Office" dialog box. I cancelled the installation (since they don't have media or a license), and it fell back to a plain textbox for the message body, but the email I was responding to wasn't quoted.
Outlook apparently requires IE in order to respond to meeting requests...

Open Internet Explorer
Go to the File menu, and choose "Work Offline"
Exit IE
Outlook will still be online, will still get email, etc.
But if you try to respond to a meeting invitation, it will say that you're "Working Offline". Even though Outlook is online.

Sheesh.

2006/05/03

Thought for the day

Give a man a fish, he'll eat for a day.
Teach a man to fish, he'll eat for a lifetime.
Convince a man to focus on his core competencies while paying you to fish for him, and you have a perpetual revenue stream.

2006/04/30

Bad day (not Enterprise)

I'm having a bad computer day.

On Friday, SWMBO's work laptop (Dell D600, just like mine) started reporting "Primary Hard disk not found". Naturally, this is the system where we've been keeping a lot of our personal data while she's been on leave, and it's the one she's been using to surf the web all day... So I'm going to try to (hopefully) recover some of the data off it.

So I went out to Microcenter, bought a USB-IDE adapter (cable, really cool) and am trying to get it working. Nope. Just a little clicking sound, and the drive doesn't spin up. Damn it.

The other problem is with the home PC upstairs... Since I don't have a network cable run there, I figured it'd work to use the wireless network. I've had a D-Link DWL-122 USB wireless (802.11g) adapter (bought it a while ago for this purpose) but when I leave it plugged in, the system locks up solid. No mouse movement, no keyboard, nothing. If I don't have the wireless adapter plugged in, the system'll be fine for days. There're some net references to USB-related lockups with AMD processors and Via chipsets, some of which may be resolved by switching out the USB controller for a PCI one.

So while I'm at Microcenter, I figured I'll get a 2-port USB card to see if this'll fix the problem (assuming that I won't be able to get SWMBO's HDD back to life) so that she can be online during the day. But just as I walked in, there's a sale on PCI 802.11g wireless cards, so I pick one of those up too. I figure if the USB thing doesn't work out, the PCI card should work.

Nope. When I installed the USB card, it was recognized fine, but the DWL-122 "Cannot start. code 10". And that's the case any time it's plugged into the new card, but it works fine with the onboard ports. After a couple of retry, refresh, reinstall, reboot, reboot, reboot, reboot, I gave up. Of course the PCI card will work...

Or maybe not. About 1 minute after I installed the card & booted, the damn thing locked up again.

And SWMBO's HDD still won't spin up.

--Joe

2006/04/28

Oh how I hate DLTs

So we have a long history of hating DLT drives here, beginning back before I had anything to do with backups, when $COSA[0] would get paged almost every night because of a jukebox failure (which boiled down to the op not closing the jukebox door), and going on through 3 more generations of DLT technology (7k, 8k, and now SDLT600). It seems that DLT is keeping up with its heritage.

A previous DLT8000 library had lots of problems, which we theorized was because of dust & environmental contaminants, so we switched to Qualstar libraries. These have the best dust filters in the industry. For this particular replacement, we switched to LTO-2, and have had virtually no problems.

On the other hand, $THEY, when it came time to replace their aging DLT libraries (8k 2/20, 7k 4/48, 4k 2/28), decided that they didn't want to follow our success, but would rather venture off into the brave new world of SDLT. And since the SDLT600s were current, that's what they'd get (despite the fact that these are the only drives in the company that can read this media).

So that's the background... recently, I've had to kick this particular library at least twice per week, with other interventions required (not by me) probably more often. Sometimes the drive has the 3 blinking lights, sometimes not. Usually, the drive reports that there's no tape loaded. Sometimes the library agrees. Even more rarely, Networker agrees.

Sometimes pulling & reinstalling the drive fixes the problem. Today we had to unload the drive, reseat the drive, (watch Networker load a tape before we could stop it), try to unload the drive (it couldn't), reseat the drive, then got an error message of "logical unit communication failure". So we bounced the library, and now it's working (for now)

And of course, the web interface for this library is just barely functional, so in order to actually do anything, I have to walk to the datacenter. With $COSA[1] following.

All this so that $COSA[1] can dump 250GB of filer data off to tape before he deletes it.

Sheesh
--Joe

First Post! What's it all about anyway?

I'm going to try this blog thing again... The past (couple) of times I've created a website with the intention of updating it regularly haven't worked very well, mostly because I think I don't journal well.

Also because I haven't had a particular focus for the sites, they end up as just a buncha random crap that I don't bother to update, yet for some reason I keep migrating from one site to another. Maybe this will be different.

So here's my focus for this site: "Collect my thoughts about Enterprise System Administration". What does that mean? Well, this is a space for:

Ideas about server management (configuration management)
Ideas about application infrastructure (applitecture)
The interrelation between office politics, policies, and technology
Wouldn't it be cool if...
Problem solving
And ideally, if I ever get around to actually implementing some of these great ideas, they'd be here too.

Please note that these are my opinions, which are not endorsed or sponsored by my employer. There may be information that is specific to my employer's systems and landscape, and your mileage may vary.
