2006/06/22

Enterprise Monitoring

In the grand quest for the "One True Ring^W^W^WSilver Bullet^W^WIntegrated Solution", this week's goal is to reduce the number of tools we're using for enterprise monitoring.

Currently we have 5 major players:
MOM has "Management support" (and, therefore $$$), Big Brother has a rich history of success (and is free), MRTG is tightly integrated with the way Networking does their stuff, Cesura has gone out of business (but they had some really cool demo technology), and of course, nobody really knows what those scripts do.

On the bright side, there's this Hobbit project I've been following for a while, which looks like a better Big Brother than BB...

On the really bright side, I've not been tasked with getting all this crap together.

I just get called on to get it working because $COWORKER[0] doesn't know Solaris at all (production enterprise is Solaris) but he's the MOM wizard, and $COWORKER[1] needs to learn more about our environment (relative new guy), and needs some visibility in the larger organization. I just happen to be the only expert in the monitoring world, just like everywhere else.

So because there's money for MOM, we're looking to see if there's any way to get non-Windows platforms to work with this Microsoft solution. As it happens, there are several third-party addons (management pack extensions) that purport to "monitor" non-Windows clients. Also the Windows guys love MOM because it has links to MSKB articles about how to tune Exchange servers when there's a low memory alert, for example.

The first extension we tried (from eXc Software) sucked. Not so much that the eXc software sucked, but OOTB, it monitors 4 items: Total CPU usage (alert if CPU is >10% busy), CPU usage by process (alert if a single process is eating more than 10% of the CPU), disk free space, and swap space usage. And that's it. Anything more and we have to write our own JScript (or VB) test that runs on the MOM server, leverages their "clientless" (aka telnet) interface to gather status on the server, and the parses the output to create a MOM/WMI event. And then maintain that code. Not exactly what we had in mind.

But eXc also has an SNMP extension agent to monitor Solaris via SNMP, so we'll try that too. A few clickety-clicks later, I've configured the basic SNMP service that's installed with the community names and it's running on our test box. Except that the software is exclusively trap-driven. And the Solaris side doesn't have any (readily apparent) way to throw the traps. Basically the eXc stack is just the Solaris trap MIBs pre-configured.

Well, if we're going down the SNMP route, let's see what MOM can do on its own. After all, it says it can monitor via SNMP. One KB article later, (and lowering the monitoring standards significantly) and I have our linux-based Digi CM console server happily SNMP-trapping into MOM. Ok, it was a lot more than just the KB article, there was also some registry editing, MIB compiling, MIB editing so it would be acceptable to the MS SMI compiler, interpreting the help page for the MS SMI compiler, some minor VB scripting and finally, turning a checkbox on on the Digi. And we still only have SNMP traps. No queries, no performance trending, no performance alerts. Also, no MIB translation (so you have to be able to recognize that 1.3.6.1.4.1.332.10.14.14.0.2 means "authentication failure", which I'm sure we'll get good at in no time at all)

So back to the drawing board... there's 2 other extension packs for MOM that we're going to try out... one from here in Cincinnati (version 1.0 was released last week) and one that appears to be a whole management infrastructure that surrounds and integrates MOM (and happens to do non-Windows clients too)

The really unfortunate thing is (as I mentioned above) there's this Hobbit project, which would leverage our existing Big Brother clients and successes, and looks like it would be fairly straightforward to implement and has a reasonably sane, extensible architecture (but it isn't MOM -- the Windows guys really like MOM)

So I ask myself "what would it take to make Hobbit work with MOM?" (at least as well as the SNMP integration or the other products did)

Hobbit's backend consists of passing messages along "channels". Messages such as "serverX is down" and channels such as "status" or "page" (or "data"), passed via IPC to worker modules. It should just be a Small Matter Of Programming to create a worker module that would accept "stachg" (status change) and/or "data" channels, massage them into something like WBEM events, and toss them across to the WMI receiver on the MOM server. I mean heck, if VB can massage SNMP traps into WBEM, surely it can't be that hard. There's even sample channels in the hobbit distribution.

I think it'd take a couple of days of programming (and learning how MOM is different than Microsoft's WMI is different than WBEM). Unfortunately I'm the only one in the group who can code. And with everything else that's going on, the chances of me taking a couple of days is exceptionally slim.

Oh well, maybe somebody else will read this and think it's a cool, easy idea.

--Joe

No comments: