
2011/05/09

EMC World 2011

I'm here at EMC World 2011, taking advantage of their "Bloggers Lounge" where they have better WiFi and more comfortable chairs.

So far, the conference is unremarkable: the first keynote could be summarized as "Cloud, blah, blah, lots of data, blah, new products, blah, blah, blah." Nothing particularly groundbreaking.

But still, this being the first business travel I've done in almost 5 years, I'm looking forward to it. There are lots of topics here that can help my quest for Infrastructure Strategy.

2009/08/19

Netapp - Waster of space

We have a Netapp that we use to provide Tier-2 LUNs to our SAN. It was price-competitive on raw disk space, but I didn't realize at the time just how much overhead this appliance had.

The obvious overhead is RAID-DP and hot spare drives, easily calculated: 1 hot spare per 30 drives of each size, and RAID-DP is 2 parity drives per raid group. That's 6 wasted drives out of the 28 in two shelves, leaving 22 * 266GB drives usable = 5.7TB.

I'd heard that space is reserved for OS and bad-block overhead (about 10%) so that brings us down to 5.2TB usable.

Well, the web interface shows the aggregate as 4.66TB. So that's 600GB I haven't accounted for. But still, 4.66 TB is a good amount of space.

From the aggregate, we create a flexvol (note that this sets aside 20% by default as inaccessible snap reserve space). On the flexvol, we create LUNs and present them to our servers. And here's where the space consumption is nasty:

By default, if you create a 1TB LUN, OnTAP reserves 1TB of disk blocks in the volume. That's nice, and exactly what I'd expect, although in practice we use thin provisioning (lun create -o noreserve) for most of our LUNs.
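For reference, here's roughly what the two forms look like; the first reserves the full 1TB up front, the second only consumes blocks as data is actually written (the volume name, LUN names, and ostype are illustrative, not our actual ones):

lun create -s 1t -t solaris /vol/tier2vol/lun_fat
lun create -s 1t -t solaris -o noreserve /vol/tier2vol/lun_thin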

What I didn't expect going in was that the first time you create a snapshot, OnTAP would reserve ANOTHER 1TB for that LUN. And interestingly enough, that 1TB is never touched until there's no other space in the volume.

OK, that guarantees the snapshot stays intact even if you overwrite the ENTIRE LUN after you take it. But it reduces the usable space for LUN allocation to 2.33TB. And if you have multiple snapshots, their deltas don't seem to go into the snap reserve, but rather are in addition to the 2*LUNsize that is already allocated.

So out of a raw disk capacity of 7.2TB (28 * 266GB, which is quoted as 28 * 300GB disks = 8.2TB) we get just over 2TB of space that can be used for holding actual system data.
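Strung together, the arithmetic above looks like this:

echo '28 * 266'       | bc -l   # 7448 GB right-sized raw capacity (marketed as 28 x 300GB)
echo '22 * 266'       | bc -l   # 5852 GB (~5.7TB) after RAID-DP parity and hot spares
echo '22 * 266 * .9'  | bc -l   # ~5267 GB (~5.2TB) after OS/bad-block overhead
echo '4.66 / 2'       | bc -l   # 2.33TB of the aggregate left for LUNs once snapshots double-reserve them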

Wow.

Now, there are non-default settings that can change that, but they're only available at the CLI, not the web interface:

# snap reserve 0 - this will set the snap reserve from 20% to 0%, which is recommended for volumes that hold only LUNs.
# vol options fractional_reserve ## - This changes the % of LUNsize that is reserved when a LUN snapshot is taken.
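Applied to an actual volume (name made up), the first command drops the 20% volume snap reserve and the second stops OnTAP from reserving another 100% of each LUN at snapshot time:

snap reserve tier2vol 0
vol options tier2vol fractional_reserve 0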

It is not entirely clear what happens to a LUN when its delta becomes larger than the fractional_reserve. Some documentation says it may take the LUN offline, but I would hope that would only happen if there's no remaining space in the volume (like what happens with snapshot overflow in traditional NAS usage). But it's not clear.

As far as I can tell, the current best practice is to set the snap reserve to the amount of change you expect in the volume, set the fractional_reserve to the amount of change you expect in the LUN, and set up volume auto-grow and/or snapshot auto-delete to make sure there's free space when things get full.
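I haven't settled on exact numbers yet, but the knobs in question look roughly like this (volume name and sizes are illustrative, syntax from memory): grow the volume in 100GB steps up to 6TB, and start deleting old snapshots when the volume itself gets close to full.

vol autosize tier2vol -m 6t -i 100g on
snap autodelete tier2vol trigger volume
snap autodelete tier2vol on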

On the gripping hand, the default options make sure that you have to buy a lot of disks to get the storage you need.

--Joe

2009/02/19

Photo Archiving

This is in response to BenR's post at http://www.cuddletech.com/blog/pivot/entry.php?id=1016 which I can't seem to get past his comment-spam filter.



As a fellow father and sys/storage admin, I have similar questions. Have you made the jump to video already? A MiniDV tape at LP (90 mins) quality -- a little less than DVD quality, but with worse compression -- eats up 15GB of disk space when I dump the AVI stream. Not to mention the gigabytes of SD and CF cards from the camera.

I'm confident in my 3-tier archiving scheme: An active in-the-house full-quality copy on simple disk, a "thumbnail" (screen-resolution or compressed video) version on S3, and two copies of the original format on DVD - one onsite and one offsite.

I expect to have to move off DVD media periodically, but I can put that off until the higher-capacity disc wars play out. Every file on the DVDs is md5sum'd, and I know I can use ddrescue to pull data blocks off either wafer if S3 and my home drive die, assuming the scratch doesn't hit both discs in the same place. It'd be nice to have an automatic system to track which file is on which DVD, but I haven't implemented such an HSM yet.
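The checksumming side of that is nothing fancy; for each DVD's worth of files it's essentially (staging path made up):

cd /staging/dvd_2009_02
find . -type f -exec md5sum {} + > MANIFEST.md5
md5sum -c MANIFEST.md5      # re-run later against the disc, or against what ddrescue recovered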

I'm enough of a pack rat to keep a DVD drive and probably a computer that can read it essentially forever, and if not, there's always eBay.

The biggest problem I face is not deleting all of the content from a card (or tape) before popping it back into the camera and adding more. So when I copy a card into the "system" I might pick up duplicate copies of pictures that are already there. I'd love to be able to deduplicate those and store only one copy (and links to it). And even better would be a content-aware dedup that could tell that x.jpg is the same picture as Y.raw... (and that song_64kvbr.mp3 can be derived from song.flac)

But I haven't put that together yet, either.
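For the easy half of the problem (byte-identical duplicates from re-copied cards), a dumb first pass would be to group files by checksum; only the content-aware part (x.jpg vs Y.raw) really needs new tooling. A sketch, with a made-up archive path:

find /archive/photos -type f -exec md5sum {} + | sort > /tmp/sums
cut -c1-32 /tmp/sums | uniq -d | while read sum; do
    grep "^$sum" /tmp/sums      # print each group of byte-identical files
done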

--Joe

2008/09/17

Reflections on x4500+ZFS+NFS+ESX

I was asked about my thoughts on running ESX over NFS to a ZFS backend. For posterity, here they are:

x4500+ZFS+NFS+ESX is a quite functional stack. There are a few gotchas that I've run into:

First, the ESX "storage delegate" functionality doesn't work. It's supposed to change the EUID that the ESX server sends with its writes. Well, it does for most requests, but not for things like creating the VM's swap file. So you pretty much have to export your NFS shares with root=vmkernel.ip.address

We have many ESX servers, so keeping the sharenfs= parameters up to date got unwieldy. I ended up putting them in a text file in the NFS share for easy editing; when I have to add or change an ESX server, I edit the file and run zfs set `cat zfs.shareprops` /pool/path/to/share
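The props file is just the whole sharenfs value on one line, so adding a host is a one-line edit and a re-run (the vmkernel addresses here are made up):

# zfs.shareprops lives in the share itself and contains the whole property, e.g.:
#   sharenfs=rw=10.1.1.21:10.1.1.22:10.1.1.23,root=10.1.1.21:10.1.1.22:10.1.1.23
cd /pool/path/to/share && zfs set `cat zfs.shareprops` /pool/path/to/share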

NFS is much better than iSCSI. At least in the version I did my iSCSI testing on, all of the ZFS volumes presented from OpenSolaris were recognized by ESX as being the same disk. This meant that I had a dozen paths to the same vmfs datastore, some 100GB, some 500GB, etc. This Was Bad. NFS made it better.

NFS also gives you a couple of other benefits. On NFS datastores, the vmdk files are thin-provisioned by default; this means that if you give your VM a 5TB vmdk and don't use more than 10GB, it takes up 10GB of capacity on the physical disks. NFS is also much better understood by troubleshooting tools (Wireshark), so it's easier to find problems like the storage delegate issue above. And it's a first-class citizen from Sun: NFS serving has been in Solaris since 1994, and isn't broken by the latest Nevada builds. Sun takes NFS seriously.

The downside of NFS is that ESX makes all its requests O_SYNC. This is good for ESX but bad for ZFS performance. Your nvram cards should help a lot. I ended up with a different solution: the business agreed that these are not Tier-1 VMs and they're not on Tier-1 storage, so I've turned off all ZFS sync guarantees in /etc/system:


* zil_disable turns off all synchronous writes to ZFS filesystems. Any FSYNC,
* O_SYNC, D_SYNC, or sync NFS requests are serviced and reported complete
* as soon as they've been transferred to main memory, without waiting for
* them to be on stable storage. THIS BREAKS THE SAFETY SEMANTICS AND CAN
* CAUSE DATA LOSS! (clients have moved on thinking the data was safely written
* but it wasn't)
* However, in our case, we can afford to lose this data. For DEV/Test systems
* rollback to the latest (hourly) snapshot is considered acceptable.
set zfs:zil_disable=1


As the comment says, this could be a bad thing. But I know that the vmdk files are crash-consistent as of every hourly snapshot, and that's OK with the users. If they lose an hour of work, it's annoying but worth the cheaper storage.
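The hourly snapshots themselves are nothing exotic; a crontab entry along these lines does it (dataset name and rotation scheme are illustrative, not necessarily exactly what we run):

# root crontab: keep 24 rotating hourly snapshots of the VM dataset
0 * * * * /usr/sbin/zfs destroy pool/vms@hour`date +\%H` 2>/dev/null; /usr/sbin/zfs snapshot pool/vms@hour`date +\%H`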

Finally, and most importantly:

MAKE SURE YOUR POOL IS CONFIGURED FOR YOUR WORKLOAD. VMs are effectively a random-read and random-write workload; there is no sequential access of the vmdk files except when you're cloning a VM. So you have to understand the read and write characteristics of your ZFS pool. RAID-Z and RAID-Z2 read and write a full RAID stripe every time, which means every read touches all of the disks in the RAID-Z group just to return a single byte of data to the ESX host. Mirrored pools, on the other hand, read from a single disk, and if the checksum is correct, pass the data back to the ESX host. So in my case, I can have 44 simultaneous read requests from the ESX servers being serviced at the same time (44 disks in the pool) and/or 22 simultaneous writes (each write goes to two disks). Basically, RAID-Z[2] is bad for random workloads, but mirroring is expensive.
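Concretely, a pool of two-way mirrors instead of RAID-Z groups looks like this (abbreviated to three pairs, device names made up):

# mirrored pairs: each read needs only one disk, each write hits two
zpool create vmpool \
    mirror c0t0d0 c1t0d0 \
    mirror c0t1d0 c1t1d0 \
    mirror c0t2d0 c1t2d0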

With this in mind, performance on the thumper is excellent. We can easily saturate the onboard 1Gbps network link with NFS traffic; with link aggregation I can easily saturate the combined 2Gbps link. I haven't seen what happens with 4 uplinks, but I'd expect the network to still be the slowest part of the chain. Doing basic I/O benchmarks on the thumper, I can get 1GBps out of the disks. Yes, that's 1GB per second.
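For reference, the aggregation setup on builds of this vintage is something like the following (interface names and address made up; newer builds use a different dladm syntax):

dladm create-aggr -d e1000g0 -d e1000g1 1                 # bundle two GbE ports as aggr1
ifconfig aggr1 plumb 10.1.1.5 netmask 255.255.255.0 up    # bring the aggregated link up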

2008/07/28

How to grow an iSCSI-presented zvol in 3 easy steps

Well, ok, it's not quite 3 easy steps.

A couple of things that don't work: iscsitadm modify target -z. This only works if the iSCSI target's backing store is a regular file, which in the case of a zvol, it is not.

The easy bit: Make the zvol bigger:
zfs set volsize=200G tank/iscsi/thevol

Now we have to hack around in the iSCSI parameters file: locate the /etc/iscsi/tgt//params.# file that corresponds to the right target and LUN, and change the <size> parameter to the new size of the bigger volume, in hex, counted in 512-byte blocks. Or in other words,
zfs get -Hp volsize tank/iscsi/thevol | perl -lane 'printf("%llx", $F[2]/512)'
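(For the 200G volume above, that prints 19000000, which is 419430400 512-byte blocks.)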


Once that's done, apparently you have to bounce the iscsitgtd to get it to reread the params file.
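The target daemon runs under SMF, so the bounce is just (FMRI as I remember it on these builds):

svcadm restart svc:/system/iscsitgt:default
svcs -p iscsitgt     # make sure iscsitgtd came back up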

Then on to the initiator...

Running format on c3tAREALLYLONGSTRINGOFDIGITSFORTHEDISKGUIDd0s0 and changing the partitions there won't work, since I'm using EFI labels and it says very strongly:
partition> label
Unable to get current partition map.
Cannot label disk while it has mounted partitions.


So I have to go in the other way. While I'm in format, I print out the current partition table and make note of the Last Sector for each slice. I also run prtvtoc against the disk to get any other useful bits.

Then I can make the actual partition changes with fmthard:
fmthard -s - /dev/rdsk/c3tAREALLYLONGSTRINGOFDIGITSFORTHEDISKGUIDd0s0

At first, just copy in the line(s) for the slices you already have, but move slice 8 to the end of the disk:
* First Sector Last
* Partition Tag Flags Sector Count Sector Mount Directory
0 2 00 34 251641754 251641787 /zones/mars/data
8 11 00 419413982 16384 419430365


Then (after checking in format that the disk is still healthy) change the Last Sector and Sector Count for the real partition. (Last is s8's first -1, and the sector count is s8's first -34.)
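Using the numbers in the table above (s8's first sector is 419413982), the finished slice 0 line fed to fmthard works out to:

0 2 00 34 419413948 419413981 /zones/mars/data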

Then it's a simple growfs -M /zones/mars/data /dev/rdsk/c3tAREALLYLONGSTRINGOFDIGITSFORTHEDISKGUIDd0s0

--Joe

2008/01/15

What's going onto the disk?

On our thumpers, `iostat -nxz 5` gives a good picture of what's happening on each of the disks, and `zpool iostat 5` gives a good overall picture of how fast things are going at the moment, but neither of these breaks the picture down into "Who's writing?"

A Solution:
# fsstat -T u `zfs list -H -o mountpoint -t filesystem` 5
1200435752
0 0 0 0 0 0 0 0 0 0 0 /uscisbds001
0 0 0 106 0 0 0 0 0 53 217K /uscisbds001/esx3
0 0 0 0 0 0 0 0 0 0 0 /uscisbds001/esxpatches
0 0 0 2 0 0 0 0 0 0 0 /uscisbds001/isos
0 0 0 0 0 0 0 0 0 0 0 /uscisbds001/nfs
0 0 0 0 0 0 0 0 0 0 0 /uscisbds001/saperisrv1
0 0 0 0 0 0 0 0 0 0 0 /uscisbds001/saperisrv2
0 0 0 2 0 0 0 0 0 0 0 /uscisbds001/templates
0 0 0 1.56K 0 0 0 0 0 654 15.8M /uscisbds001/temprestore


I can see that somebody's writing a buncha data to temprestore, and a little bit is happening in the esx3 directory.

--Joe

2007/06/16

NDMP tape restores

(Background: last week, "they" decreed that users should clear up unused disk space. Being a technology company, at least one user decided to write a script to clean up all his unused files, and ran this script on /net, or something. Anyway, there are now three important areas of the file shares that have no content any more. It's kinda interesting to note that all three of these areas had "test" as a component of the directory path.)

The environment: Netapp filers, no snapshots of this space, monthly full backup controlled by EMC Networker via NDMP to TAN-attached SDLT600s

In the past, these sorts of problems would be handled by either Ops or my group, depending on the year and where the (give us something to do) vs. (do it right) pendulum was swinging. Currently, it's pointing at Ops. Except that their documentation is incomplete, so I have to get involved throughout; but on the bright side, they'll watch the tapes spin overnight. Assuming the restores go well.

Naturally, the restores aren't going well, otherwise I wouldn't be blogging about them.

My test restores (grab 1 file off tape) worked. The first restore worked using the nwrecover GUI. It was able to pull 200GB off tape and put it back onto the "autotest" share in about 24 hours.

$COWORKER's test restores (grab a couple of files off tape) didn't. They failed with an error of "NDMP Service Log: Only one path for each src and dst can be specified." Restore #2 (2GB of web content) broke with the same error message. Restore #3 (1MB of user scripts) failed also.

Well, OK, the error message reads like Networker is putting something weird in the NDMP protocol.

A dig-in with Ethereal should help, and maybe I can figure out what inputs it needs to get the right outputs. Or not. Ethereal has some understanding of NDMP, but doesn't seem to be willing to splice back together the multi-packet NDMP requests and display them in a way that makes sense to me. Oh well.

But I know that NDMP is simply a command-and-control (C&C) protocol; the actual on-tape format is whatever native format the data server uses. In the case of Netapp, it's compatible with ufsdump on Solaris. All I have to do is position the tape to the correct file and pull the data over onto my Solaris backup server. Since I have shared tape drives, this'll be easy.

mminfo -q ssid=123456789 -r volume,mediafile,mediarec   # which tape, file number, and record the saveset starts at
sjimm 0.100.0 slot 16 drive 1                           # load that tape into drive 1
mt -f /dev/rmt/ciqsjb01_1 fsf 3                         # position the tape at the file holding the dump
ufsrestore -ivfb /dev/rmt/ciqsjb01_1 60                 # interactive restore straight off the tape, blocking factor 60

browse to the data, and restore it. This works well for restore #2. The data comes back to the backup server, and it's ready to be copied off.

The backup for restore #3, on the other hand, spans 2 tapes, which makes things much more complicated. As I mentioned, NDMP is purely C&C. When a tape gets full, NDMP simply pauses the writes until the NDMP tape server has the next one ready, then resumes. There's no feedback to the dump process that the tape has been changed, so dump considers it a single-volume backup. And in between the "unload the tape" and "next tape is ready" steps, Networker naturally puts "load the next tape" (makes sense) and "write the Networker label on the tape" (which adds file marks to the tape, which I have to skip before passing the next block to ufsrestore).

So how do I fake out ufsrestore to use the 6th file on tape 1, then when that runs out of data (rather than abort with an i/o error) wait until I load the next tape, then seek forward 3 files, and continue reading? Something like "(dd if=/dev/rmt/thefirsttape ; dd if=/dev/rmt/thesecondtape) | ufsrestore -ivfb - 60" should work, except that I can't tie up both tape drives for that long, and I don't trust Ops not to break things. I need it to switch tapes in the drive.

But this doesn't work, and I don't know why. mt gets an i/o error on the 2nd tape.
( # Have the first tape positioned correctly before starting
dd if=/dev/rmt/ciqsjb01_1 bs=61440             # stream the first tape's portion of the dump
mt -f /dev/rmt/ciqsjb01_1 offl                 # rewind and eject tape 1
sjimm 0.100.0 drive 1 slot 22                  # put tape 1 back in its slot
sjimm 0.100.0 slot 23 drive 1                  # load tape 2
mt -f /dev/rmt/ciqsjb01_1 fsf 3                # skip the Networker label file marks
dd if=/dev/rmt/ciqsjb01_1 bs=61440 ) | ufsrestore -ivfb - 60   # continue the stream into ufsrestore


This should work, right?

In the end, I've opened a call with EMC. This is apparently a bug between Networker <7.2.2 and OnTap version >=7.2.2, and it's fixed in the latest version of Networker. But in the meantime, a full-saveset recover will work, and I have that running now.

--Joe

2007/03/20

Ask the right question, and you have the answer

I really hate when I'm sending an email to ask a highly-technical question, and in the process of formulating the question find the answer.

We're looking for the Next Big Disk Array to replace the Previous Big Disk Array, which has lately been showing its age in the performance arena. This is the Big Disk Array that we use as a Networker adv_file device, where we write the Big Database backups.

There are lots of people who sell BDAs. I can pretty much characterize the product options as:
  • Proprietary (usually web-based) interface that doesn't integrate with any other management tool (that's another rant to be ranted someday)
  • Proprietary ASIC on a controller board (possibly redundant Active/Active, or Active/Passive)
  • Some number of 1,2, or 4Gb Fibre and/or 1G iSCSI ports
  • Cache memory, usually up to 2GB
  • as many disks as will fit in that number of rack units
And it takes a 3-page PDF to marketspeak that. Anyway, from a performance standpoint, the only two numbers ever referenced are the uplink speed (!look! we have 4Gb fibre) and maximum throughput (which is never explicitly defined).

Max throughput, I generally assume, means "read whatever the optimal block size is out of cache, and imagine that the whole array is that fast" (cf. peak transfer rates from consumer disk drives). Unless the unit supports expansion units, in which case it's "get as many expansion units as we can install, stripe a single disk group across all of them, and then report the aggregate throughput from that"

Neither is particularly helpful for me to figure out if we can write "database_backup.tar" onto the array fast enough. But I digress.

The question I was trying to ask is:

Where does it make sense to perform I/O reordering, redundancy, and caching:
  • On the array's controller card (which is a custom ASIC with 2GB of cache) -or-
  • In the Solaris I/O stack (including ZFS) on a server with 8GB of RAM and knowledge of the application's I/O pattern and years of performance optimization
In addition, this is not an exclusive-or: the Solaris layer is still going to be optimizing its I/O pattern, possibly with wrong assumptions about the performance and parallelism of the LUN. Or even worse: our PBDA couldn't act as a single big LUN, so the Solaris layer queues 3 I/Os in parallel to what it thinks are 3 different disks, but which in fact must be serialized by the controller with a long seek in between. This is clearly not optimal.

(Which reminds me... the custom ASIC has virtually no ability to measure or tune the performance of the system. There is no concept of exposing performance or profiling data, and there's no way to determine whether these seeks are really causing the slowness. On the Solaris side, OTOH, there are things like seeksize.d that can help figure out why the fscking thing is so slow.)

Just framing the question has taken me from 60/40 in favor of JBOD to about 95% in favor of it.

2006/11/16

Fun with Filesystems

I think there's a race condition in Solaris... we had a filesystem get full with Oracle archivelogs, so I removed them, then checked to see what effect that had:

# rm D*_60[012345]?_*.dbf
# df -h .
Filesystem size used avail capacity Mounted on
/oracle/D01/saparch 5.9G 16384E 6.4G 301049486643838% /oracle/D01/saparch

A moment later, it was happy:

# df -h .
Filesystem size used avail capacity Mounted on
/oracle/D01/saparch 5.9G 257M 5.6G 5% /oracle/D01/saparch


This is not the first time I've noticed some weirdness with removing data on S10. Last time, I wiped out a copy of our big Oracle database (rm -rf sapdata*/*), which only took a few seconds, but unmounting the filesystem took over 8 hours.

--Joe