Enterprise SA

2007/05/29

Fun's over!!!

Looks like my fun with filesystems is over... http://sunsolve.sun.com/search/document.do?assetkey=1-26-102899-1 describes the same sort of output as http://enterprise-sa.blogspot.com/2006/11/fun-with-filesystems.html

Oh well.

--Joe

2007/04/28

There's a danger out there... It grows in little alcoves and cubicles, where a Group needs a piece of software to fill a particular need.

Smart People create The Solution. Sometimes there is an explicit description of the Requirements, which does not go beyond a handful of users. With or without Requirements, the Smart People select and begin work on The Platform. This selection is based on many criteria: how familiar it is to them (or the learning curve), wanting to add "$Platform programming" to the resume, "I just read an article on how $Platform makes $task easy", etc. Usually not on the list: "ability to scale" or "backup support" or "algorithmic efficiency" or "we have supported hardware to run it on" or "plays nicely with other applications".

But The Solution is created and used by the Group. And it works. And the Group is more productive because of it. So naturally, since other groups want to be more productive, they want to be Users of The Solution too. And as the Users grow in number and timezone-diversity, the limitations of The Platform become more apparent. At least to those of us on the back end. Faults, Inefficiencies, Downtimes, Management headaches: these are usually hidden from the Users (or at least aren't visible enough often enough to generate real complaints). Eventually, Leadership recognizes the value of The Solution (or at least, they recognize the value of the increased productivity), and The Solution becomes an integral part of the business.

At this point, we are put into a difficult position. Limitations of The Platform, unscalable User management tools, hardware choices, etc. mean that The Solution needs to be upgraded, improved, or otherwise replaced. But naturally, work on The Solution really doesn't fall into the area of expertise of the Group any more (or at best, fixing The Solution doesn't generate billable hours), and the Smart People who developed it in the first place have either left the company or are too busy to reimplement it. So it comes down to IT having to choose between: 1) Support and Maintain the Unmaintainable, or 2) Replace it with "the IT way", and deal with the costs of development as well as retraining the Users (and the political cost of insulting the Smart People's Solution). Usually we're stuck with (1).

--Joe

2007/03/20

Ask the right question, and you have the answer

I really hate when I'm sending an email to ask a highly-technical question, and in the process of formulating the question find the answer.

We're looking for the Next Big Disk Array to replace the Previous Big Disk Array, which has lately been showing its age in the performance arena. This is the Big Disk Array that we use as a Networker adv_file device, where we write the Big Database backups.

There's lots of people who sell BDAs. I can pretty much characterise the product options as:

Proprietary (usually web-based) interface that doesn't integrate with any other management tool (that's another rant to be ranted someday)
Proprietary ASIC on a controller board (possibly redundant Active/Active, or Active/Passive)
Some number of 1, 2, or 4Gb Fibre and/or 1Gb iSCSI ports
Cache memory, usually up to 2GB
As many disks as will fit in that number of rack units

And it takes a 3-page PDF to marketspeak that. Anyway, from a performance standpoint, the only two numbers ever referenced are the uplink speed (!look! we have 4Gb fibre) and maximum throughput (which is never explicitly defined).

Max throughput, I generally assume, means "read whatever the optimal block size is out of cache, and imagine that the whole array is that fast" (cf. peak transfer rates from consumer disk drives). Unless the unit supports expansion units, in which case it's "get as many expansion units as we can install, stripe a single disk group across all of them, and then report the aggregate throughput from that".

Neither is particularly helpful for me to figure out if we can write "database_backup.tar" onto the array fast enough. But I digress.
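
For what it's worth, the number I actually care about is easy enough to compute. Here's a minimal sketch in Python, assuming a hypothetical 2 TB backup and a 6-hour window (neither figure comes from our environment, and the function names are made up for illustration), using binary units (1 GB = 1024 MB):

```python
def required_mb_per_s(backup_gb: float, window_hours: float) -> float:
    """Sustained write rate (MB/s) needed to land backup_gb within window_hours."""
    return backup_gb * 1024 / (window_hours * 3600)


def fits_window(backup_gb: float, window_hours: float, sustained_mb_per_s: float) -> bool:
    """True if an array sustaining sustained_mb_per_s can finish in the window."""
    return sustained_mb_per_s >= required_mb_per_s(backup_gb, window_hours)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    need = required_mb_per_s(backup_gb=2048, window_hours=6)
    print(f"need roughly {need:.0f} MB/s, sustained, for the whole window")
```

The figure to compare against is a sustained sequential write rate for the whole array, which is exactly the number the data sheets don't give.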

The question I was trying to ask is:

Where does it make sense to perform I/O reordering, redundancy, and caching:

On the array's controller card (which is a custom ASIC with 2GB of cache) -or-
In the Solaris I/O stack (including ZFS) on a server with 8GB of RAM and knowledge of the application's I/O pattern and years of performance optimization

In addition, this is not an exclusive-or: the Solaris layer is still going to be optimizing its I/O pattern, possibly with wrong assumptions about the performance and parallelism of the LUN. Or even worse, our PBDA couldn't act as a single big LUN, so the Solaris layer is queueing 3 I/Os in parallel to what it thinks are 3 different disks, but which in fact must be serialized by the controller with a long seek in between. This is clearly not optimal.
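
To put rough numbers on that serialization penalty, here's a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names, the 8 ms transfer time, and the 12 ms seek are made up, not measurements from our array. The model is deliberately crude: the host expects three independent disks (so the slowest I/O sets the elapsed time), while the controller really does one I/O at a time with a seek between each.

```python
def parallel_ms(transfers_ms):
    """What the host expects: independent disks, so the slowest I/O sets the time."""
    return max(transfers_ms)


def serialized_ms(transfers_ms, seek_ms):
    """What happens if the LUNs share spindles: one I/O at a time,
    with a long seek between consecutive I/Os."""
    return sum(transfers_ms) + seek_ms * (len(transfers_ms) - 1)


if __name__ == "__main__":
    transfers = [8.0, 8.0, 8.0]  # hypothetical per-I/O transfer times, in ms
    seek = 12.0                  # hypothetical long seek between LUN regions, in ms

    expected = parallel_ms(transfers)
    actual = serialized_ms(transfers, seek)
    print(f"host expects {expected:.0f} ms, controller takes {actual:.0f} ms "
          f"({actual / expected:.1f}x slower)")
```

With those made-up inputs the three "parallel" I/Os take several times longer than the host planned for, and the host has no way of knowing why.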

(Which reminds me... the custom ASIC has virtually no ability to actually measure or tune any performance of the system. There is no concept of exposing performance or profiling data, and there's no way to determine that these seeks are really causing the slowness. On the Solaris side, OTOH, there are things like seeksize.d that can help figure out why the fscking thing is so slow.)

Just framing the question has taken me from 60/40 in favor of JBOD to about 95% in favor of it.

2007/03/12

Fighting Rogue

I'm about >< close to going "rogue sysadmin".

What does that mean? I'm very close to just saying "screw it" when it comes to any sort of collaborative decision making about technology, and I'm just going to implement what I think is best. That's what makes me a "senior technical lead", right? That I know best? Or that I can at least make a decision of what's best for my team without having to get buy-in from a half-dozen other managers whose groups have wildly different and conflicting goals?

Why? I'm very angry about several projects that have been stalled waiting for other groups to buy-in on a framework that will solve everyone's problems. The missing functionality (monitoring) has very much come to the front over the past 30 days or so, since the big SAP upgrade. And especially with Saturday's DST patching.

--Joe

2007/01/09

One helluva worklist

This is my first work-Tuesday since November. (Last Tuesday counts as a work-Monday, since Monday was a holiday.) Last night, I realized I had forgotten how draining this work stuff is. I'm glad I have a supportive wife at home.

Met with $PHB yesterday from 3:00-4:50 for an hour-long meeting. Discussed what's happened since I was last in the office...

The company is going with ScaleFarce for CRM. This was a surprising turnaround given the (admittedly fourth-hand) account of the negative reception of this deal the first time. And in order to make it easy to integrate "their" systems with "ours", we're going to implement MS BizTalk as an inter-middleware layer.

Did I mention we have no experience (as a company) with MS BizTalk? And that my group is expected to deliver a production-quality landscape (including DEV and TEST systems) by early March?

Also, $COWORKER{"leftcoast"} is moving up the coast a couple of hours to work in a sales office (rather than in the building with the datacenter). And since he wants to be on a management track (he even has an MBA), he's going to be given lots of responsibility. $PHB wants to virtually split his org between "Systems Infrastructure" and "Application Infrastructure" (but since there aren't enough people, everybody gets an SI hat and an AI hat).

He wants me to (long-term) sit as technical advisor/architect atop both teams. I'm probably going to be saddled with (short-term) technical team lead responsibilities over the SI hats. I think I agree with the long-term plan, but I'm not so sure about being a "team lead". I definitely don't want the administrative headaches of being a real manager (with a box on the org chart): salary, reviews, budgeting, HR issues, etc., not to mention the endless "IT Manager Meeting"s.

After discussing history, we went on to list the team's major projects (as we see them) for 2006. The list filled the whiteboard. It's going to be an interesting year.

Catching up on sudokus:
1/2:
132548769
478296153
695173842
913425678
784619235
256837491
567981324
321754986
849362517
1/3:
354629718
621857394
879314526
186732945
297485163
435961872
713546289
948273651
562198437

--Joe