A recent "Ask Slashdot" asked what information a sysadmin should take to an executive. Here's what I think. I've picked this up from a variety of sources, including a very-skilled manager.
--------------------------
There are three key things that executives want to hear:
1) What has the department done in the past? The core of this point is the question "Does the past justify continued investment?" and its corollary, "We've sunk so much money into IT; what have we gotten from it?" This is where usage statistics (website hits, business transaction data, dollars-per-downtime and nines of availability, return on cost-saving measures, etc.) are presented. Keep it high-level, with drill-down slides available but presented only on request. Focus on the trends of service delivery vs. IT budget and/or headcount (a minimal sketch of one such trend metric follows this list).
2) What is the department doing now? Here we focus on what is happening with their current business. This is where a primary element of capacity planning comes in: the Headroom Metric. How much additional user load can we support on our current systems and network before service degrades? In concrete terms, ignoring everything except CPU: if you're delivering 100 pages per second while using 40% of the server's CPU, the server tops out around 250 pp/s, so you have a headroom of 150 additional pp/s. Extrapolate this to the business need: if the marketing department has launched 5 campaigns this year, the current systems may be able to support 10, but should not be expected to support 20 without additional investment. Note that to be accurate, this headroom metric must look at end-to-end utilization: disk, memory, network, and most importantly administration effort (see the headroom sketch after this list).
3) What will the department do in the future? What are the business-focused projects the department is working on? How will the investment in these projects result in money coming into, or staying in, the business? What is the Return on Capital, the Return on Investment? (A back-of-the-envelope ROI example follows this list.)
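For point 1, here's a minimal Python sketch of one such trend metric. All the figures are hypothetical, purely for illustration; substitute your own usage and budget data:

    # Business transactions delivered per IT dollar, year over year.
    history = {
        2007: (1_200_000, 900_000),   # (transactions, IT budget in $)
        2008: (1_800_000, 950_000),
        2009: (2_600_000, 1_000_000),
    }
    for year, (txns, budget) in sorted(history.items()):
        print(f"{year}: {txns / budget:.2f} transactions per IT dollar")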
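For point 2, a minimal sketch of the headroom arithmetic, using the 100 pp/s at 40% CPU figures above. The function and resource names are mine, not any standard tool's:

    def headroom(current_load, utilization):
        """Additional load supportable before a resource saturates."""
        capacity = current_load / utilization   # projected load at 100% busy
        return capacity - current_load

    # 100 pages/sec at 40% CPU -> 250 pp/s capacity -> 150 pp/s of headroom
    print(headroom(100, 0.40))   # 150.0

    # End to end, the real headroom is the minimum across every resource:
    utilizations = {"cpu": 0.40, "disk": 0.55, "network": 0.25}
    print(min(headroom(100, u) for u in utilizations.values()))   # ~81.8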
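And for point 3, the back-of-the-envelope ROI calculation, again with hypothetical numbers:

    # Simple first-year ROI: (benefit - cost) / cost.
    cost = 120_000       # project investment, $
    benefit = 180_000    # money brought into or kept in the business, $
    print(f"First-year ROI: {(benefit - cost) / cost:.0%}")   # 50%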
As far as timing, there should be at least an annual "full report" on the state of IT. Depending on the dynamics of the business and the scope of the projects, quarterly updates should be sufficient, unless something changes significantly. You don't want to face this with a "we haven't done anything since the last report" status. But it's also important to reconnect with the executives regularly, so that they don't forget what you're doing, and so that you can react and change to meet their changing business plans.
The most important thing we in IT can do is to be aligned with the business. This means focusing on the things that matter: delivering the product or service in exchange for money. Everything else is overhead. And the better your IT department is at aligning itself, the better you look when an outsourcer tries to talk your executives into cutting everything except the "core competencies".
--Joe
2009/09/29
Idle curiosity about iLOM
Why does the service processor on our brand new Sun T5240 server have a SPARC 885 processor, and run Linux? Why not (Open)Solaris?
Kinda ironic that Sun boots its latest servers with Linux.
Maybe it's the fact that it has 32MB of flash to work with, and only 128MB of RAM. But that should be enough to run Solaris.
--Joe
2009/08/27
FSF Windows 7 sins
I don't normally post political messages here, but this one's important, I think.
The Free Software Foundation has posted 7 Windows 7 "sins" at http://windows7sins.org/, and I think they left out what in my mind is the most important issue. It's sorta covered in "Corrupting Education" and "Lock-In", but not really:
With Windows 7 (and Office 2007 before that, and Vista before that, and XP before that, and Windows 9x/W2K before that) users will have to retire all of their existing training in the Windows user interface in favor of the newest cosmetic decisions Microsoft has made for its products.
I don't argue that there aren't significant productivity benefits in the current Windows shell (vs. Program Manager in NT 3.x and Windows 3.x) or in the improvements from '95 to XP. I haven't seen much of Vista's Aero or the new Windows 7 UI, and I'm sure all of the changes have been through serious interface testing.
But when I switched from Office 2000 to Office 2007, I had a rather steep learning curve to deal with the "Ribbon" UI. Even though I taught Office 97 to Computers 101 users in grad school (and was able to carry that through to O2K), I was lost with the new "Where the h*** did the menu go" interface. (OK, if I were an Excel developer, would I consider search & replace a General (Home) thing, or a Data thing? It used to be in the Edit menu...)
But I relearned. And I was able to relearn because, as I was growing up, the UI changed dramatically (from Write on my Apple ][+ to PC/WordPerfect to WPfW to vim/TeX and on to MS Office*). But for someone who's used to, and has memorized, the keystrokes/mouse clicks to insert a text box, this is a whole new ballgame.
When I was applying for jobs after college, for example, one of the companies asked that I take an "aptitude test" which included things like typing speed and accuracy, formatting documents, generating mail merges, etc. This computer-based test was graded on whether you clicked the right menu option first. If you picked "Edit" instead of "Tools" (or if you right-clicked and chose "Format"), you got the question wrong. Not that this was a good test, but it's typical for the industry. And the answers completely changed when 2K7 came out.
Of course, in my line of work, we're more concerned about the OS than about the Office apps. So it's things like the changes in networking that annoy me about Vista. Wow, the way I set up a dialup connection has changed. Hmm, I wonder what happens if I right-click here... and so on. I have to learn a whole new way to fix things that go wrong. Not to mention that Vista Home's interface is quite different from Vista Business's.
And I'd expect that the various Windows 7 editions will look different too. After all, would the wizard that helps grandma connect to the wireless internet at Starbucks be the best way for IT professionals to diagnose an 802.1X authentication problem? If I learn how to do it on my home PC, will that apply to the real business world?
--Joe
2009/08/19
NetApp - Waster of space
We have a NetApp that we use to provide Tier-2 LUNs to our SAN. It was price-competitive on raw disk space, but I didn't realize at the time just how much overhead this appliance has.
The obvious overhead is RAID-DP and hot spare drives. Easily calculated: 1 hot spare per 30 drives of each size, plus 2 parity drives per RAID-DP group. That comes to 6 drives out of the 28 in two shelves, leaving 22 * 266GB drives usable = 5.7TB.
I'd heard that space is reserved for the OS and bad-block overhead (about 10%), so that brings us down to 5.2TB usable.
Well, the web interface shows the aggregate as 4.66TB. So that's 600GB I haven't accounted for. But still, 4.66 TB is a good amount of space.
From the aggregate, we create a FlexVol (note that by default this sets aside 20% as inaccessible snap reserve space). On the FlexVol, we create LUNs and present them to our servers. And here's where the space consumption gets nasty:
By default, if you create a 1TB LUN, OnTAP reserves 1TB of disk blocks in the volume. That's nice, and exactly what I'd expect, although in practice we use thin provisioning (lun create -o noreserve) for most of our LUNs.
What I didn't expect going in was that the first time you create a snapshot, OnTAP reserves ANOTHER 1TB for that LUN. And interestingly enough, that 1TB is never touched until there's no other space in the volume.
OK, that guarantees you'll never run out of snapshot space, even if you overwrite the ENTIRE LUN after you take a snapshot. But it reduces the usable LUN-allocation space to 2.33TB. And if you have multiple snapshots, those don't seem to go into the snap reserve, but rather come on top of the 2*LUNsize that is already allocated.
So out of a raw disk capacity of (28*266) 7.2TB (quoted by marketing as 28 * 300GB disks = 8.2TB), we get just over 2TB of space that can be used for holding actual system data.
Wow.
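Here's the whole waterfall in one rough Python sketch. The drive counts and percentages are my reconstruction of the figures above, not vendor-exact math:

    # From raw disk to reserved-LUN space, layer by layer.
    drives, right_sized_gb = 28, 266   # 300GB disks right-size to ~266GB
    spares, parity = 2, 4              # hot spares + RAID-DP parity drives
    raw = drives * right_sized_gb                             # ~7.2 TB
    after_raid = (drives - spares - parity) * right_sized_gb  # 22 data disks
    after_wafl = after_raid * 0.90     # ~10% OS/bad-block overhead
    aggregate = 4.66 * 1024            # what the web interface reports
    volume = aggregate * (1 - 0.20)    # default 20% snap reserve
    max_lun = volume / 2               # reserved LUN + 100% fractional reserve
    for name, gb in [("raw", raw), ("after RAID-DP + spares", after_raid),
                     ("after WAFL overhead", after_wafl),
                     ("aggregate (as reported)", aggregate),
                     ("volume after snap reserve", volume),
                     ("max space-reserved LUN", max_lun)]:
        print(f"{name:>26}: {gb / 1024:5.2f} TB")

Depending on whether you count the snap reserve against the reserved LUN, you land somewhere between roughly 1.9TB and 2.33TB, which is how 7.2TB raw becomes "just over 2TB" of usable data.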
Now, there are non-default settings that can change that, but they're only available at the CLI, not the web interface:
# snap reserve vol_name 0 - this sets the volume's snap reserve from the default 20% to 0%, which is recommended for volumes that hold only LUNs.
# vol options vol_name fractional_reserve <pct> - this changes the percentage of LUN size that is reserved when a LUN snapshot is taken.
It is not entirely clear what happens to a LUN when its delta becomes larger than the fractional_reserve. Some documentation says OnTAP may take the LUN offline; I would hope that happens only when there's no remaining space in the volume (like snapshot overflow in traditional NAS usage). But it's not clear.
As far as I can tell, the current best practice is to set the snap reserve to the amount of change you expect in the volume, set the fractional_reserve to the amount of change you expect in the LUN, and set up volume auto-grow and/or snapshot auto-delete to make sure you have free space when things get full.
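If memory serves, the 7-mode commands for that look something like the lines below; check the man pages on your OnTAP release, since the exact syntax varies and the volume name and size values here are purely illustrative:

    # vol autosize vol_name -m 6t -i 100g on - let the volume grow into the aggregate as it fills
    # snap autodelete vol_name on - delete the oldest snapshots when the volume runs low on space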
On the gripping hand, the default options make sure that you have to buy a lot of disks to get the storage you need.
--Joe
2009/07/13
SCSI disk identifiers
To whoever thought they'd be cute and put the VT100 "clear screen" escape sequence in their disk identifier: I want to buy you a drink.
The probe-scsi-all output wasn't nice.
Hemlock. Your choice of flavors.
--Joe