The Importance of Non-Binary File Headers



  • At the previous company I worked for, we needed to design a “Generic Raw Format” for logging incoming messages to a flat file. The messages would be coming in from a variety of organisations, in a variety of payload formats, implemented in a variety of languages, with payloads both text-based and binary. What we needed to standardise were the timestamps and how to delineate messages.

    We wanted a header at the start of the file to indicate byte-ordering, version numbers, and things like that. Initially we decided on a text-format header, with “key=value” pairs terminated by newlines, followed by a binary section just in case one of the vendors needed to store binary data in the header too.

    However, when I came to implement it, I decided it would be better to use a binary format for at least the most important fields. There was a circularity: the most important field was the “header length”, needed to identify the start of the first message, and it was difficult to know what the length would be until the header had been constructed, difficult to back-patch the length in, and complicated to work out what to do if you hadn’t reserved enough space for it. Furthermore, for identifying whether we needed to byte-swap the data, it would be much simpler to store a ‘1’ in a 4-byte binary field and have the reader check whether it came out as ‘1’ or as a byte-swapped ‘1’ (there’s a sketch of this after the exchange below). With the text-based header, it was proposed to have “BIGENDIAN=Y” or “BIGENDIAN=N”, leading to the possibility of developers getting confused with “LITTLENDIAN=Y” and “BIGENDIAN=1”, not to mention the question of “which architecture am I on now? Which one is big-endian again?” in both readers and writers.

    So the proposal was that a small binary header consisting of two integer fields should be used, in addition to the text-format “key=value” pairs (and the “arbitrary binary extra header” of dubious usefulness). All these points were made in a meeting with the CTO and division technical heads. The proposal to use binary fields was knocked back. The CTO explained that it was important to be able to “cat” a file and see some meaningful numbers.

    “But we’re going to be writing a low-level dump tool for this file format anyway! We can use that to dump the header.”
    “But operations people might not have access to that tool.”
    “Well they need access to it.”
    “Well that would be one more thing for them to have to worry about.”
    “I wouldn’t recommend ‘cat’ as a way to view a binary file anyway – it can put your terminal into a strange state, and backspace/carriage return characters can create a lot of confusion.”
    “Well, that’s what we’re used to around here.”
    “Well I tried ‘cat’ on the binary header format and I’m still able to see the text header parameters.”
    “But it might not always work.”
    “How are the various organisations going to deal with the circularity in the header length?”
    “There are ways around that. Write them a library.”
    At this point I ran out of things to say.
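
    For what it's worth, the byte-order probe I was arguing for is only a few lines of C. A minimal sketch (the function names are mine, purely for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Writer: store the value 1 in a 4-byte field, in native byte order. */
    static int write_byte_order_probe(FILE *out) {
        uint32_t probe = 1;
        return fwrite(&probe, sizeof probe, 1, out) == 1 ? 0 : -1;
    }

    /* Reader: 1 = same byte order as the writer, 0 = every field needs
       swapping, -1 = not a valid header at all. */
    static int check_byte_order_probe(FILE *in) {
        uint32_t probe;
        if (fread(&probe, sizeof probe, 1, in) != 1) return -1;
        if (probe == UINT32_C(1))          return 1;
        if (probe == UINT32_C(0x01000000)) return 0;  /* byte-swapped 1 */
        return -1;
    }

    Neither side ever has to answer "which one is big-endian again?": the writer stores its native 1, and the reader just checks which of the two patterns comes out.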


  • Java Dev

    If the file's binary, I don't see the problem with a binary header... Maybe add a human-recognizable magic value at the start so you have an inkling you're looking at a valid file.


  • Discourse touched me in a no-no place

    @tcotco said:

    At this point I ran out of things to say.

    I'll email him for you and tell him what a stupid idea it is, if you'd like.



  • Is it really bad that I don't think XML is entirely inappropriate here? It's WTFy for sure, but it would at least give you something in which to encapsulate this data.

    Also, fuck endianness. Pick an endianness and fucking run with it, seriously, for the good of mankind pick an endianness. I had enough of that shit when dealing with Exif tags.



  • IMO, XML would work as long as an XSD was provided and was brutally enforced.

    So no, "But Financial is full of turd slinging monkeys and they can't figure out how to spell 'Cheeseburger' correctly! Can't you just translate 'Chi-s-berger', you know it will always be that way?"

    Having said that, I have to figure out how to fix an XML header parser that assumes the size is no bigger than 1K while spamming the crap out of metadata in internal nodes. Who the fuck serializes XML to a binary format? Sometimes I really hate fucking .NET and .NET 1.0/1.1 developers.



  • I would have absolutely no problem with having an XSD and whatnot. Isn't that the only sane way to do XML anyway?

    I do not envy your task in this.



  • I've found you can use XML without XSD, as long as you are the owner of the XML in question and never-ever-never-ever-never-ever accept it from outside input.

    I'm hoping I'll be able to avoid/rectify my header issue because the software in question has been fielded for 10+ years without issue, and the file's header should in theory have gotten bigger as time went along.

    We are looking at a rewrite soon, and this time I will enforce the non-use of binary serialization of an XML file with deadly force. Thus Haribo sugar-free Gummi-Bears may find their way into some developers' offices...


  • FoxDev

    @MathNerdCNU said:

    Thus Haribo sugar-free Gummi-Bears may find their way into some developers' orifices...

    i misread that initially.... i think it works better that way.



  • That reminds me of that old Netscape 4.x mail storage format. Technically it had a "binary header" (all fixed-length fields), but the values (message length, message flags like read or deleted, etc.) were stored in a way that looks like text (i.e. fixed-length hex numbers), and the header was terminated with a CRLF. Quite easy to parse, easy to patch, and it still won't break your terminal if you cat the file (the rest of which was 100% text anyway).

    So for your "General raw data format" you could use a header like this:

    uint32 MAGIC_VALUE        ("GRDF" in correct endianness)
    uint8  MESSAGE_LENGTH[8]  (hex-encoded message length, padded with leading zeroes; assuming 4GB is enough - if not, use 16 hex digits)
    uint8  TERMINATOR_MAGIC   (always '\n')

    Very human readable, cat'able, and easy to binary patch. :)
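
    A sketch of what writing and reading that header might look like in C (function names are my own; assuming the 8-hex-digit variant):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Header layout from above: "GRDF" + 8 hex digits + '\n' = 13 bytes. */
    static int write_grdf_header(FILE *out, uint32_t message_length) {
        char header[14];
        snprintf(header, sizeof header, "GRDF%08" PRIX32 "\n", message_length);
        return fwrite(header, 1, 13, out) == 13 ? 0 : -1;
    }

    static int read_grdf_header(FILE *in, uint32_t *message_length) {
        char header[14] = {0};
        char *end;
        if (fread(header, 1, 13, in) != 13) return -1;
        if (memcmp(header, "GRDF", 4) != 0 || header[12] != '\n') return -1;
        header[12] = '\0';                 /* terminate the hex digits */
        *message_length = (uint32_t)strtoul(header + 4, &end, 16);
        return (end == header + 12) ? 0 : -1;
    }

    Because every field is fixed-length, the "how long is the header?" circularity from the original post disappears, and cat still shows you "GRDF" followed by a readable length.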



  • @tcotco said:

    a “Generic Raw Format” for logging incoming messages to a flat file.

    So you chose to re-invent tar, instead of just using it?



  • @Arantor said:

    Is it really bad that I don't think XML is entirely inappropriate here. It's WTFy for sure but it would at least give you something in which to encapsulate this data.

    Definitely use JSONx.



  • @chubertdev said:

    Definitely use JSONx.

    Aw hell yeah, Hannibal.



  • @flabdablet said:

    So you chose to re-invent tar, instead of just using it?

    You still need to define what each message looks like within the tar archive.

    I was thinking RFC2822 would work, but I'm sure there are many formats that have already been designed with library implementations available. No need to poorly re-invent formats unless your requirements are very specific and performance trumps all.



  • @another_sam said:

    You still need to define what each message looks like within the tar archive.

    @tcotco said:

    The messages would be coming in from a variety of organisations, in a variety of payload formats, implemented in a variety of languages, with payloads both text-based and binary. What we needed to standardise were the timestamps and how to delineate messages.

    ustar is well supported by existing library code, easily solves both the listed requirements, has general purpose filetype and pathname fields in its per-component headers that would lend themselves very naturally to payload origin and format identification, requires no format conversion at all on component bodies, and was designed to be written out with strict sequential access - all of which means it's ideal for accumulating a log of disparate incoming stuff. If for some reason you can't simply save your incoming messages as individual files, I can think of no good reason not to use tar.
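
    If the tar route were taken, appending each message as an archive member is only a few calls with a library like libarchive. A sketch (the origin/timestamp.format pathname convention is my own invention, standing in for the metadata fields mentioned above):

    #include <archive.h>
    #include <archive_entry.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Append one message to an open tar archive as a ustar member. */
    static int log_message(struct archive *a, const char *origin,
                           const char *format, time_t timestamp,
                           const void *payload, size_t len) {
        char path[256];
        struct archive_entry *e = archive_entry_new();
        snprintf(path, sizeof path, "%s/%lld.%s",
                 origin, (long long)timestamp, format);
        archive_entry_set_pathname(e, path);
        archive_entry_set_filetype(e, AE_IFREG);
        archive_entry_set_perm(e, 0644);
        archive_entry_set_size(e, (int64_t)len);
        archive_entry_set_mtime(e, timestamp, 0);
        if (archive_write_header(a, e) != ARCHIVE_OK) {
            archive_entry_free(e);
            return -1;
        }
        archive_write_data(a, payload, len);
        archive_entry_free(e);
        return 0;
    }

    Opening the archive is just archive_write_new(), archive_write_set_format_ustar() and archive_write_open_filename(); close it with archive_write_close() and archive_write_free().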



  • That's what I get for not paying enough attention reading the post. Looks like tar would be just about perfect.

    Using 'cat' to view binary files is still a WTF and if you do it you deserve what you get.



  • As far as I understand 'tar', it's a format for packing multiple files into a single file. This is not our situation. We are trying to package millions of short messages into a single file, giving a timestamp (but not a filename!) to each message. Each message was from about 100 to 300 bytes. A 4-byte timestamp and a 4-byte message-length field add 8 bytes to each message. I don't know how much 'tar' would add, but it was designed for files, not messages, so I wouldn't expect it to be a very useful format.
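
    For concreteness, here's a minimal sketch of that framing in C (names are hypothetical; both fields go out in native byte order, which is exactly what a byte-order probe in the file header is there to disambiguate):

    #include <stdint.h>
    #include <stdio.h>

    /* 8 bytes of framing per message: a 4-byte timestamp plus a 4-byte
       length, followed immediately by the raw payload. */
    static int write_message(FILE *out, uint32_t timestamp,
                             const void *payload, uint32_t len) {
        if (fwrite(&timestamp, sizeof timestamp, 1, out) != 1) return -1;
        if (fwrite(&len, sizeof len, 1, out) != 1) return -1;
        return fwrite(payload, 1, len, out) == len ? 0 : -1;
    }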



  • @MathNerdCNU said:

    I've found you can use XML without XSD, as long as you are the owner of the XML in question and never-ever-never-ever-never-ever accept it from outside input.

    I've found schemas to mostly be used by those who weren't going to be a problem anyways. A few years ago I was on an EDI project where we exchanged data with another group in our company. We agreed to use XML and I was tasked with defining the data specification. I gave them an XSD with documentation. They asked me for a few examples and I gave them some.

    Fast forward a few weeks to integration testing. Everything we sent them was barfed up by their processor. They took a look at it and said "Ha, here's the problem - you put a carriage return after the first line in all of your examples, but there is no carriage return in the data produced in the test". I pulled out the XSD, showed them where whitespace-ignore is defined, and told them to fix their code to match the specification. They threw a tantrum and pushed the issue a few managers up the chain. It turned out that they had simply ignored the XSD and rolled their own XML parser by pretending the input was simply an overly-verbose text file.

    The moral of the story is that the XSD is just another form of documentation and no more likely to be read or honored than any other documentation.



  • tar adds a 512-byte header to each component file. Component files are stored as-is inside the tar archive, with enough zero bytes appended to make them occupy a multiple of 512 bytes each (there's a worked example of the arithmetic below). tar archives can be compressed and decompressed on-the-fly because they're only ever written at the end and never require backpatching (tar is short for "tape archive" after all), and since the format specifies that unused header fields are to be zero-filled, they'd effectively compress to virtually nothing after the first one.

    Largely because they don't have per-archive headers, only per-component headers, tar archives are also relatively robust in the face of corruption.

    The log file you talked about, with messages in different formats coming from different places, is in fact a multi-member archive with members in disparate formats. And even though you don't have an immediate requirement to store format or data source metadata, it's the kind of thing that would almost certainly scope creep its way into the design at some point.
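
    To put numbers on the overhead, here's the blocking arithmetic as a tiny C sketch (my own figures, assuming the uncompressed ustar layout described above):

    #include <stdio.h>

    /* Uncompressed size of one tar member: a 512-byte header plus the
       payload rounded up to the next multiple of 512 bytes. */
    static unsigned long tar_member_size(unsigned long payload_len) {
        return 512 + ((payload_len + 511) / 512) * 512;
    }

    int main(void) {
        /* A typical 200-byte message costs 1024 bytes in a tar archive,
           versus 208 bytes with an 8-byte timestamp+length frame. */
        printf("%lu\n", tar_member_size(200));   /* prints 1024 */
        return 0;
    }

    So before compression tar is roughly five times the size for messages in the 100-300 byte range; the argument above is that on-the-fly compression claws most of that back, since the zero-filled header fields compress to almost nothing.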



  • @Jaime said:

    rolled their own XML parser

    This is of course TRWTF, but it's surprisingly common. Seems to me that the most likely reason for that is that XML itself was designed and documented by people who genuinely think of it as non-hideous.

    Note that I am not attempting to suggest that XML is not useful, merely that it is horribly ugly in almost every respect.


  • Discourse touched me in a no-no place

    @flabdablet said:

    @Jaime said:

    rolled their own XML parser

    This is of course TRWTF, but it's surprisingly common.

    The mind boggles that anyone would even want to, given there are functional parsers available.

