2007/06/16

NDMP tape restores

(Background: last week, "they" decreed that users should clear up unused disk space. This being a technology company, at least one user decided to write a script to clean up all his unused files, and ran this script on /net, or something. Anyway, there are now three important areas of the file shares that no longer have any content. It's kinda interesting to note that all three of these areas had "test" as a component of the directory path.)

The environment: Netapp filers, no snapshots of this space, monthly full backup controlled by EMC Networker via NDMP to TAN-attached SDLT600s.

In the past, these sorts of problems would be handled by either Ops or my group, depending on the year, and where the (give us something to do) vs. (do it right) pendulum is swinging. Currently, it's pointing at Ops. Except that their documentation is incomplete, so I have to get involved throughout. On the bright side, they'll watch the tapes spin overnight. Assuming the restores go well.

Naturally, the restores aren't going well, otherwise I wouldn't be blogging about them.

My test restores (grab 1 file off tape) worked. The first restore worked using the nwrecover GUI. It was able to pull 200GB off tape and put it back onto the "autotest" share in about 24 hours.

$COWORKER's test restores (grab a couple of files off tape) didn't. They failed with an error of "NDMP Service Log: Only one path for each src and dst can be specified." Restore #2 (2GB of web content) broke with the same error message. Restore #3 (1MB of user scripts) failed also.

Well, ok, the error message reads like Networker's putting something weird in the NDMP protocol.

A dig-in with Ethereal should help, and maybe I can figure out what inputs it needs to get the right outputs. Or not. Ethereal has some understanding of NDMP, but doesn't seem to be willing to splice back together the multi-packet NDMP requests and display them in a way that makes sense to me. Oh well.

But I know that NDMP is simply a Command & Control (C&C) protocol; the actual on-tape format is whatever native format the data server uses. In the case of Netapp, it's compatible with ufsdump on Solaris. All I have to do is position the tape to the correct file, and pull the data over onto my Solaris backup server. Since I have shared tape drives, this'll be easy.

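# Find the volume, file number, and record number holding the saveset
mminfo -q ssid=123456789 -r volume,mediafile,mediarec
# Move the tape from slot 16 into drive 1
sjimm 0.100.0 slot 16 drive 1
# Skip forward 3 files on the tape
mt -f /dev/rmt/ciqsjb01_1 fsf 3
# Interactive, verbose restore from the tape with a blocking factor of 60
ufsrestore -ivfb /dev/rmt/ciqsjb01_1 60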

Then browse to the data, and restore it. This works well for restore #2. The data comes back to the backup server, and it's ready to be copied off.

The backup for restore #3, on the other hand, spans 2 tapes. This makes things much more complicated. As I mentioned, NDMP is purely C&C. When a tape gets full, NDMP simply pauses the writes until the NDMP tape server has the next one ready, then resumes the writes. There's no feedback to the dump process that the tape has been changed, so dump considers it to be a single-volume backup. And in between the "unload the tape" and the "next tape is ready" steps, Networker naturally puts "load the next tape" (makes sense) and "write Networker label on the tape" (which adds file marks to the tape, which I have to skip before passing the next block to ufsrestore).

So how do I fake out ufsrestore to use the 6th file on tape 1, then when that runs out of data (rather than abort with an i/o error) wait until I load the next tape, then seek forward 3 files, and continue reading? Something like "(dd if=/dev/rmt/thefirsttape ; dd if=/dev/rmt/thesecondtape) | ufsrestore -ivfb - 60" should work, except that I can't tie up both tape drives for that long, and I don't trust Ops not to break things. I need it to switch tapes in the drive.

But this doesn't work, and I don't know why. mt gets an i/o error on the 2nd tape.
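( # Have the first tape positioned correctly
# Stream the rest of the dump from tape 1
dd if=/dev/rmt/ciqsjb01_1 bs=61440
# Unload tape 1 and put it back in slot 22
mt -f /dev/rmt/ciqsjb01_1 offl
sjimm 0.100.0 drive 1 slot 22
# Load tape 2 from slot 23 and skip past the Networker label files
sjimm 0.100.0 slot 23 drive 1
mt -f /dev/rmt/ciqsjb01_1 fsf 3
# Continue the stream from tape 2
dd if=/dev/rmt/ciqsjb01_1 bs=61440 ) | ufsrestore -ivfb - 60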


This should work, right?
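
For what it's worth, the plumbing half of the idea can be checked without touching a tape drive. Here's a minimal sketch, with ordinary temp files standing in for the two tape segments. The file names and contents are made up for the illustration, and it says nothing about why mt fails on the real second tape. It only shows that everything the subshell writes to stdout reaches the reader as one continuous stream:

```shell
# Sketch only: ordinary files stand in for the two tape segments.
# tape1.img and tape2.img are hypothetical names, not real devices.
tmp=$(mktemp -d)
printf 'first half of the dump\n'  > "$tmp/tape1.img"
printf 'second half of the dump\n' > "$tmp/tape2.img"

# Everything the subshell writes to stdout goes down the one pipe,
# so the reader (cat here, ufsrestore in real life) sees a single stream.
joined=$(
  (
    dd if="$tmp/tape1.img" bs=61440 2>/dev/null
    # On real hardware the tape swap and the fsf would happen here.
    dd if="$tmp/tape2.img" bs=61440 2>/dev/null
  ) | cat
)
printf '%s\n' "$joined"
rm -rf "$tmp"
```

One thing this makes obvious: anything else inside the parentheses that writes to stdout lands in the stream too. If mt or sjimm print status there (I haven't checked whether they do), that output would need to be sent to stderr or /dev/null.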

In the end, I've opened a call with EMC. This is apparently a bug between Networker <7.2.2 and OnTap version >=7.2.2, and it's fixed in the latest version of Networker. But in the meantime, a full-saveset recover will work, and I have that running now.

--Joe