CEF (if you're not aware) is a supposed standard that HP/Arcsight has for exchanging event data. I've found it described at various dead links to the arcisght.com website, or one active location at http://mita-tac.wikispaces.com/file/view/CEF+White+Paper+071709.pdf .
In theory, it has everything needed to wrap up any sort of event data into a convenient wrapper format. It's a pipe-delimited format, UTF-8 encoded, and each line indicates the CEF version (CEF:0 in all the data I have) so it's futureproof.
Except that it isn't really pipe-delimited. Sure, the first 7 columns are pipe-delimited, and have well-defined column names. And pipes embedded in the first 7 columns must be escaped with a backslash, and there's no support for quoting the value to escape the contents. But oh well, other than that, it's just a matter of looking for the first non-escaped 7 pipes.
It's the 8th column that's giving me fits, though. In order to make CEF a useful standard, everything interesting about the event is stuffed into the "Extension" field, which is made up of key=value pairs, where the keys and values are vendor-defined.
This Extension field is not pipe-delimited. It's space-separated key=value pairs. And the value can contain space characters without any protection. The only thing that's restricted in the values are \\, \=, \r, and \n. The following is a perfectly legal extension:
foo=bar baz=0This straightforwardly sets two keys (foo and baz) to their appropriate values. Another valid extension is
foo=bar anotherkey=c:\\program files\\ceci n'est pas une pipe (|) has an \= to us!\n\\ so go away baz=0This would set the same keys as above (foo and baz) plus the "anotherkey" would be set to
c:\program files\ceci n'est pas une pipe (|) has an = to us! \ so go away
So to parse the CEF record, first I need to look at the first 7 columns where the only legal escapes are \| and \\, and I get 7 nicely-named fields. Then take the rest of the line, and split it on unescaped =, look back one word from there, and that becomes the key, and everything up to the last word before the next = is the value. (I'm pretty sure that the key can not contain a space, but that's not stated in the spec)
Here's what I came up with to parse out the extension pairs. Note that I'm not a great perl optimizer, suggestions are welcome.
# Pull off thefirst keyword (undef,$key,$extension) = split( /([^\s]+)=/, $extension, 2); while ( $key ne "" ) { # split returns the value, the part that matches the () in the split # expression, and the rest of the string. ($prevval,$nextkey,$extension)=split( /([^\s\\]+)=/, $extension, 2); ($line{$key}=$prevval) =~s/\s+$//; # Store the discovered key/value pair $key=$nextkey; }