Common Event Format parsing

I've got some data in "Common Event Format" from our new Arcsight appliance, and I need to get it (or at least major parts of it) into a relational database. This should be fairly straightforward, except that the CEF format doesn't lend itself to be parsed easily.

CEF (if you're not aware) is a supposed standard that HP/Arcsight has for exchanging event data. I've found it described at various dead links to the arcisght.com website, or one active location at http://mita-tac.wikispaces.com/file/view/CEF+White+Paper+071709.pdf .

In theory, it has everything needed to wrap up any sort of event data into a convenient wrapper format. It's a pipe-delimited format, UTF-8 encoded, and each line indicates the CEF version (CEF:0 in all the data I have) so it's futureproof.

Except that it isn't really pipe-delimited. Sure, the first 7 columns are pipe-delimited, and have well-defined column names. And pipes embedded in the first 7 columns must be escaped with a backslash, and there's no support for quoting the value to escape the contents. But oh well, other than that, it's just a matter of looking for the first non-escaped 7 pipes.

It's the 8th column that's giving me fits, though. In order to make CEF a useful standard, everything interesting about the event is stuffed into the "Extension" field, which is made up of key=value pairs, where the keys and values are vendor-defined.

This Extension field is not pipe-delimited. It's space-separated key=value pairs. And the value can contain space characters without any protection. The only thing that's restricted in the values are \\, \=, \r, and \n. The following is a perfectly legal extension:

foo=bar baz=0
This straightforwardly sets two keys (foo and baz) to their appropriate values. Another valid extension is
foo=bar anotherkey=c:\\program files\\ceci n'est pas une pipe (|) has an \= to us!\n\\ so go away baz=0
This would set the same keys as above (foo and baz) plus the "anotherkey" would be set to
c:\program files\ceci n'est pas une pipe (|) has an = to us!
\ so go away

So to parse the CEF record, first I need to look at the first 7 columns where the only legal escapes are \| and \\, and I get 7 nicely-named fields. Then take the rest of the line, and split it on unescaped =, look back one word from there, and that becomes the key, and everything up to the last word before the next = is the value. (I'm pretty sure that the key can not contain a space, but that's not stated in the spec)

Here's what I came up with to parse out the extension pairs. Note that I'm not a great perl optimizer, suggestions are welcome.

        # Pull off thefirst keyword
        (undef,$key,$extension) = split( /([^\s]+)=/, $extension, 2);
        while ( $key ne "" ) {
                # split returns the value, the part that matches the () in the split
                # expression, and the rest of the string.
                ($prevval,$nextkey,$extension)=split( /([^\s\\]+)=/, $extension, 2);
                ($line{$key}=$prevval) =~s/\s+$//; # Store the discovered key/value pair