Greenplum DCA and my roll-my-own ETL host

I'm trying to get my new Dell server with its 10GigE network cards to talk to the back-end switch of my Greenplum DCA.
Brocade doesn't seem to understand the difference between a support matrix that says "Using non-Brocade cables is not supported" and a software feature that checks whether the inserted standards-compliant cable was manufactured by Brocade (rather than made or sold by Dell) and, if not, turns off the port. Other than that, and other than the Dell sales tool not pointing out this incompatibility, I'm in good shape.
Once there's a link at the SFP+ layer, however, the Greenplum switches are not set up for ETL work out of the box... And of course, since these back-end switches are not connected to the "real" network, I have to ssh-tunnel to get to the Switch Admin web tools.
The unused ports on the switches are set up as link aggregation members, and so do not work without even more of these cables. So first, I have to take them out of the CEE LAG groups: disable the port via Port Admin, then go to Switch Administration -> CEE -> Link Aggregation, edit LAG Group 2, and take out Te 0/18.
Then it's back over to Port Admin to change the port to L2 Access mode, after which we can enable it.
And finally, back over to Switch Administration -> CEE -> VLAN, edit VLAN 199, and add the Te 0/18 interface to the VLAN.
And we have packets moving.
I tested with "gpssh -f hostfile ping -c 3 etl1-1" and "gpssh -f hostfile ping -c 3 etl1-2".


Moloch packet capture

I'm working to set up a full packet capture environment for our network, and so far Moloch is quite attractive. It seems "easy" to get started and so far is scaling out nicely. Unfortunately, it is almost completely undocumented. There's clearly a lot of power under the covers, but I'm having to dig through the source to figure it out. Oh well, I used to be a programmer. Here's some of what I have found so far.

The easybutton-build.sh script works well. It downloads specific known-working versions of various dependencies (yara, libpcap, libnids, maxmind's geoip API) which is reasonable, and a libglib version, which is not. Really, let's not have to rebuild from scratch to fix a bug in a shared library. Just use the versions that the distribution provides unless there's a really good reason.

        apt-get install libgeoip-dev libglib2.0-dev libpcap-dev libnids-dev

In my case (Ubuntu 12.10) this gives me the right version of geoip, +.14 versions of glib, the right version of libpcap, and -.01 version of libnids. Let's see if it all works with these minor differences.

Now, on to the hacking...


Common Event Format parsing

I've got some data in "Common Event Format" from our new ArcSight appliance, and I need to get it (or at least major parts of it) into a relational database. This should be fairly straightforward, except that CEF doesn't lend itself to easy parsing.

CEF (if you're not aware) is a supposed standard that HP/ArcSight has for exchanging event data. I've found it described at various dead links to the arcsight.com website, and at one active location: http://mita-tac.wikispaces.com/file/view/CEF+White+Paper+071709.pdf .

In theory, it has everything needed to wrap up any sort of event data into a convenient wrapper format. It's a pipe-delimited format, UTF-8 encoded, and each line indicates the CEF version (CEF:0 in all the data I have) so it's futureproof.

Except that it isn't really pipe-delimited. Sure, the first 7 columns are pipe-delimited and have well-defined column names. Pipes embedded in those 7 columns must be escaped with a backslash, and there's no support for quoting a value to protect its contents. But oh well, other than that, it's just a matter of looking for the first 7 unescaped pipes.
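
A rough sketch of that header split, in Python purely for illustration. The function name split_cef_header is made up here, and the sketch assumes the only escapes in the header are \| and \\, and that whatever follows the 7th unescaped pipe is the extension.

```python
def split_cef_header(line, fields=7):
    """Return (list of 7 unescaped header fields, raw extension string).

    Hypothetical helper: assumes \\| and \\\\ are the only header escapes.
    """
    parts = []
    current = []
    i = 0
    while i < len(line) and len(parts) < fields:
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] in "|\\":
            # Escaped pipe or backslash: keep the literal character.
            current.append(line[i + 1])
            i += 2
        elif ch == "|":
            # Unescaped pipe: the current field is complete.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    if len(parts) < fields:
        raise ValueError("expected %d unescaped pipes" % fields)
    # Everything after the 7th unescaped pipe is the extension, untouched.
    return parts, line[i:]
```

Called on a line such as CEF:0|Vendor|Prod\|uct|1.0|100|Name|5|foo=bar baz=0, this should give back seven fields (the third being "Prod|uct") and leave "foo=bar baz=0" to be dealt with separately.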

It's the 8th column that's giving me fits, though. In order to make CEF a useful standard, everything interesting about the event is stuffed into the "Extension" field, which is made up of key=value pairs, where the keys and values are vendor-defined.

This Extension field is not pipe-delimited. It's space-separated key=value pairs, and a value can contain space characters without any protection. The only escape sequences defined for values are \\, \=, \r, and \n. The following is a perfectly legal extension:

foo=bar baz=0
This straightforwardly sets two keys (foo and baz) to their appropriate values. Another valid extension is
foo=bar anotherkey=c:\\program files\\ceci n'est pas une pipe (|) has an \= to us!\n\\ so go away baz=0
This would set the same keys as above (foo and baz), and would also set "anotherkey" to
c:\program files\ceci n'est pas une pipe (|) has an = to us!
\ so go away

So to parse a CEF record, I first need to look at the first 7 columns, where the only legal escapes are \| and \\, and that gives me 7 nicely-named fields. Then I take the rest of the line and split it on each unescaped =: the word immediately before the = becomes the key, and everything up to the last word before the next = is the value. (I'm pretty sure that a key cannot contain a space, but that's not stated in the spec.)
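
The extension half of that procedure can be sketched as follows, again in Python purely for illustration. The names KEY_RE, unescape and parse_extension are made up for this sketch, and it leans on the unconfirmed assumption above that a key is a single word with no spaces, backslashes, or equals signs in it.

```python
import re

# A key is a whole word (preceded by start-of-string or whitespace) made of
# characters that are not whitespace, backslash, or '=', followed by '='.
# Because a backslash cannot be part of the word, an escaped "\=" never matches.
KEY_RE = re.compile(r"(?:^|\s)([^\s\\=]+)=")

def unescape(value):
    """Turn the four extension escapes (\\\\, \\=, \\n, \\r) into literal characters."""
    mapping = {"\\": "\\", "=": "=", "n": "\n", "r": "\r"}
    return re.sub(r"\\([\\=nr])", lambda m: mapping[m.group(1)], value)

def parse_extension(ext):
    """Split a CEF extension string into a dict of unescaped key/value pairs."""
    pairs = {}
    matches = list(KEY_RE.finditer(ext))
    for idx, m in enumerate(matches):
        start = m.end()
        # The value runs up to the whitespace in front of the next key,
        # or to the end of the string for the last pair.
        if idx + 1 < len(matches):
            end = matches[idx + 1].start()
        else:
            end = len(ext)
        pairs[m.group(1)] = unescape(ext[start:end].rstrip())
    return pairs

print(parse_extension("foo=bar baz=0"))  # {'foo': 'bar', 'baz': '0'}
```

Finding every key first and then slicing the values out between them avoids having to look "back one word" while scanning. One cost of the rstrip is that trailing spaces at the end of a value are thrown away, even if they were meant to be there.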

Here's what I came up with to parse out the extension pairs. Note that I'm not a great Perl optimizer; suggestions are welcome.

        # Pull off the first keyword
        (undef,$key,$extension) = split( /([^\s]+)=/, $extension, 2);
        while ( $key ne "" ) {
                # split returns the value, the part that matches the () in the split
                # expression, and the rest of the string.
                ($prevval,$nextkey,$extension)=split( /([^\s\\]+)=/, $extension, 2);
                ($line{$key}=$prevval) =~s/\s+$//; # Store the discovered key/value pair