Mbox



  • mbox, the story of a WTF. (those who don't trust wikipedia are advised to skip to the external link section)

    I recently got into an argument about this "format" on slashdot, and I think that this confusion proves something - a little interoperability can be a very dangerous thing.

    Some people evidently think that it's safe to move files in this format between mail clients, or to concatenate them together. (actually, concatenation generally is safe, as long as the files came from the same program). Now, you can try it - maybe it is safe. Maybe the worst that will happen is slight mangling of lines that have the word "From" near the left side. But this is, I think, a case where it would be much easier to avoid screwing up if the different formats were actually clearly incompatible.



  • I think I saw that argument.

    Was this in the article about Thunderbird losing its two core developers? Someone bitched that the mail storage format was proprietary (sort of), and someone else replied and said you can just open it in mutt or whatever. And so it went.

    Clearly, maildir is superior. :-) 



  • So all I have to do to screw with mail clients is this?

    From



    Yeah, mbox is a strange one. Even in its "standard" format it's pretty bad: using "From" as a separator in a format that is used to store email was such a poor design decision I can't decide if it was just ill thought out or outright malicious. Of course other developers have recognised the problem and then "fixed" it in their own variants by using a different separator. At last count the mbox importer I wrote for Whisper (my own mail client) can handle four different "mbox" formats, all with different separators.

    Of course all of this pales in comparison to CSV files. Those have all sorts of interesting escaping and quoting rules depending on which variant you're dealing with.



  • @rbowes said:

    So all I have to do to screw with mail clients is this?

    From

     

    It depends on a lot more than that. Most mail is sent as multipart MIME, so that "From" will sit inside a multipart block and hopefully not be identified as the start of a new header block.



  • The Real WTF(tm) is that mbox is still so much better than the proprietary format Outlook uses.



  • @rbowes said:

    So all I have to do to screw with mail clients is this?

    From

    Any remotely decent mailer will automatically toss your mail body into quoted-printable encoding, and encode the F, giving you "=46rom" which any mail client will understand and display correctly. Virtually everything is q-p nowadays anyway. Tacky mailers just stick a > or space in front of it, which is why you sometimes get mails saying ">From" in their body.
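
    A rough Python sketch of the "=46rom" trick described above. It uses the standard library's `quopri` module for the quoted-printable part; the helper name `qp_encode_hiding_from` and the sample line are made up for illustration, and this is not a claim about how any particular mailer implements it.

```python
import quopri

def qp_encode_hiding_from(line: bytes) -> bytes:
    """Quoted-printable encode a body line, then hide a leading 'From '.

    Encoding the 'F' as '=46' means the stored line no longer starts with
    the mbox separator string, while any QP-aware reader decodes it back.
    """
    encoded = quopri.encodestring(line)
    if encoded.startswith(b"From "):
        encoded = b"=46" + encoded[1:]
    return encoded

original = b"From now on, 1+1=2\n"
encoded = qp_encode_hiding_from(original)
decoded = quopri.decodestring(encoded)

print(encoded)
print(decoded == original)
```

    The point is that the round trip is lossless: nothing has to guess whether a ">" was part of the original text.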



  • @asuffield said:

    Tacky mailers just stick a > or space in front of it, which is why you sometimes get mails saying ">From" in their body.

    Actually, I'm pretty sure this is done by some mail servers, not clients.



    Actually the mbox format is a fairly mature, stable and very portable file format for mailboxes. It is trivial to port the file to other formats, and extremely easy to parse. And if all else fails, you can always open it and fix it by hand, if needed.

    For you to appreciate its elegance, you must first understand its intended use, that of maintaining e-mail mailboxes. The delimiter line is intended to represent minimal "index" information, such as the sender and the timestamp, so that an index can easily be built. With a simple index of file offsets, you can scan the file efficiently for a particular message. Also, back in the elden days, in the long lost time before spam, most people archived their e-mail and there was little reason to delete messages, since most (if not all) of the messages you received were relevant or significant; so the flat, vertical format was appropriate for a continuously growing file.
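
    A minimal Python sketch of the kind of offset index described above. It assumes the simplest convention (a message starts at a line beginning with "From " at the start of the file or right after a blank line); `build_offset_index` and the sample mailbox are hypothetical, and real mbox variants may mark message boundaries differently.

```python
import io

def build_offset_index(stream):
    """Return (byte_offset, from_line) pairs, one per message found.

    Assumption: a message starts at any line beginning with b"From "
    that is either the first line or directly follows a blank line.
    """
    index = []
    offset = 0
    prev_blank = True  # the start of the file counts as "after a blank line"
    for line in stream:
        if prev_blank and line.startswith(b"From "):
            text = line.rstrip(b"\r\n").decode("ascii", "replace")
            index.append((offset, text))
        prev_blank = line in (b"\n", b"\r\n")
        offset += len(line)
    return index

sample = (
    b"From alice@example.com Mon Jan  1 00:00:00 2007\n"
    b"Subject: first\n"
    b"\n"
    b"hello\n"
    b"\n"
    b"From bob@example.com Tue Jan  2 00:00:00 2007\n"
    b"Subject: second\n"
    b"\n"
    b"world\n"
)

index = build_offset_index(io.BytesIO(sample))
for offset, from_line in index:
    print(offset, from_line)
```

    With the offsets in hand, a reader can seek straight to a message instead of rescanning the whole file.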

    Sure it seems ancient now, but that does not mean it is obsolete. E-mail has always had an immediacy and significance to it, so it was always important to keep the data in a format that would be hard to corrupt, almost impossible to lose, and easy to transport. After all, if for some reason a single message choked up your mailbox reader, you were only a few keystrokes away from fixing it with a text editor, and you would always be able to retrieve data from the other messages in a pinch. Try doing that with a binary and compressed format.

    And what about the possibility of corrupting the mailbox with a stray delimiter keyword? Well, that's a potential flaw of all delimited formats, be it tab, space, newline, or "From " line delimited. Heck, the SMTP protocol marks the end of the message data with a single dot (.) on a line by itself -- talk about a potential misinterpretation of data! That's why such formats build into their specification a requirement to escape the delimiter in some way. For CSV, it is to quote fields containing the delimiter; for the SMTP protocol, it is to double up single dots on a line (..); and for the mbox format it is to add a closing angle bracket (>) to the front of any line starting with the string "From ". This is done automatically by mail clients or even the mail servers (the same as they do for the SMTP dots), and completely abstracted from the end-user. The client (or the server) will then interpret the escaped delimiter and render it appropriately, just like, say, Excel displays CSV fields without quotation marks, and automagically adds them when saving the file.
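
    To make the SMTP comparison concrete, here is a small Python sketch of dot-stuffing. As I understand the transparency rule in the SMTP spec, the sender adds one extra dot to any line that begins with a dot (not only a lone dot), and the receiver strips one. The function names and sample body are made up for illustration.

```python
def dot_stuff(lines):
    """Sender side: add one extra '.' to any line that starts with '.'."""
    return ["." + line if line.startswith(".") else line for line in lines]

def dot_unstuff(lines):
    """Receiver side: drop one leading '.' from any line that starts with '.'."""
    return [line[1:] if line.startswith(".") else line for line in lines]

body = ["Hello", ".", "..hidden", "Bye"]
wire = dot_stuff(body)          # what travels between the DATA command and the final "."
restored = dot_unstuff(wire)

print(wire)
print(restored == body)
```

    Because every leading dot gets exactly one more dot on the way out and loses exactly one on the way in, the escaping is fully reversible, which is the property an escaping scheme needs.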

    As for it being safe "to move files in this format between mail clients", well it is -- as long as they support mbox format. However, the most appropriate way is to use a migration tool (usually available or built into the application). This guarantees that any index or other necessary internal files are created. It also safeguards against some minor differences in the programs' assumptions about the file. Although mbox is a standard format, some programs differ in the way the actual content is stored. As pointed out already, some will automatically store all content as quoted-printable, or some may decide to add additional information to the "From " delimiter line (Eudora adds the sender's name). These things won't "mangle" or corrupt the file, but they may cause the application to interpret it differently, if it can't find the stuff it expected was inserted there by itself.

    All in all, it's still a relevant format today, and obviously very popular.

        -dZ.

     



  • @DZ-Jay said:

    Actually the mbox format is a fairly mature, stable and very portable file format for mailboxes. It is trivial to port the file to other formats, and extremely easy to parse.

    How do you know where the message ends? Is it the From line of the next message, or do you count bytes from the start of the body based on the Content-Length? How are lines in the message body beginning with “From ” encoded? Ok, so it’s “>From ”? All the time? (I don’t actually think that’s the case in the content-length-based ones, but whatever) Fair enough, but how are lines beginning with “>From ” encoded? (hint: it varies) Or do you just not care about a few stray characters, or dropped characters in the decoding? (i.e. if it quotes From as >From, leaves >From as >From, and decodes both to From.)
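
    Here is a short Python sketch of the difference between the two quoting behaviours in question. The lossy one quotes only lines starting with "From " (often called "mboxo"); the reversible one also adds a ">" to lines that already look quoted (often called "mboxrd"). The function names and sample lines are made up for illustration, and actual programs may behave differently again.

```python
import re

def quote_simple(line):
    """Lossy scheme: only lines starting with 'From ' get a '>' prefix."""
    return ">" + line if line.startswith("From ") else line

def unquote_simple(line):
    """Strip one '>' from '>From ' lines; cannot tell if it was original."""
    return line[1:] if line.startswith(">From ") else line

_NEEDS_QUOTE = re.compile(r">*From ")
_IS_QUOTED = re.compile(r">+From ")

def quote_reversible(line):
    """Reversible scheme: also quote lines that already start with '>From '."""
    return ">" + line if _NEEDS_QUOTE.match(line) else line

def unquote_reversible(line):
    """Strip exactly one '>' from any line matching '>+From '."""
    return line[1:] if _IS_QUOTED.match(line) else line

body = ["From here on it gets odd", ">From a quoted reply", "plain text"]

simple_round_trip = [unquote_simple(quote_simple(l)) for l in body]
reversible_round_trip = [unquote_reversible(quote_reversible(l)) for l in body]

print(simple_round_trip)       # the second line has lost its '>'
print(reversible_round_trip)   # identical to body
```

    Mixing the two is where it goes wrong: a reader that unquotes one way, given a file written the other way, silently adds or drops ">" characters.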

    And if all else fails, you can always open it and fix it by hand, if needed.

    For you to appreciate its elegance, you must first understand its intended use, that of maintaining e-mail mailboxes. The delimiter line is intended to represent minimal "index" information, such as the sender and the timestamp, so that an index can easily be built.

    Actually, the contents of that “delimiter line” are not defined at all beyond “From ”. (And it doesn’t just delimit: it ‘brackets’ the message, along with a blank line at the end of the message. A blank line followed by the From line separates messages, except that there’s no blank line at the beginning of the file and no From line at the end of the file.)

    With a simple index of file offsets, you can scan the file efficiently for a particular message.

    Such an index is not part of the format.

    Also, back in the elden days, in the long lost time before spam, most people archived their e-mail and there was little reason to delete messages, since most (if not all) of the messages you received were relevant or significant; so the flat, vertical format was appropriate for a continuously growing file.

    Sure it seems ancient now, but that does not mean it is obsolete. E-mail has always had an immediacy and significance to it, so it was always important to keep the data in a format that would be hard to corrupt, almost impossible to lose, and easy to transport. After all, if for some reason a single message choked up your mailbox reader, you were only a few keystrokes away from fixing it with a text editor, and you would always be able to retrieve data from the other messages in a pinch. Try doing that with a binary and compressed format.

    And what about the possibility of corrupting the mailbox with a stray delimiter keyword? Well, that's a potential flaw of all delimited formats, be it tab, space, newline, or "From " line delimited. Heck, the SMTP protocol marks the end of the message data with a single dot (.) on a line by itself -- talk about a potential misinterpretation of data! That's why such formats build into their specification a requirement to escape the delimiter in some way. For CSV, it is to quote fields containing the delimiter; for the SMTP protocol, it is to double up single dots on a line (..); and for the mbox format it is to add a closing angle bracket (>) to the front of any line starting with the string "From ". This is done automatically by mail clients or even the mail servers (the same as they do for the SMTP dots), and completely abstracted from the end-user. The client (or the server) will then interpret the escaped delimiter and render it appropriately, just like, say, Excel displays CSV fields without quotation marks, and automagically adds them when saving the file.

    As for it being safe "to move files in this format between mail clients", well it is -- as long as they support mbox format.

    Well, it’s safe as long as they support the SAME mbox format - however, they often won’t say which mbox format they support.

    Although mbox is a standard format, some programs differ in the way the actual content is stored.

    Or, you know, in how it tells WHERE THE MESSAGES END. (and a failure in such a mechanism will more likely result in silently cutting off the end of a message, or appending part of the next message to the message body, rather than any sort of coherent error message. You may not notice it until you’ve destroyed the messages in question irretrievably). I fail to see how this isn’t “mangling” and corrupting the file.

    As pointed out already, some will automatically store all content as quoted-printable, or some may decide to add additional information to the "From " delimiter line (Eudora adds the sender's name). These things won't "mangle" or corrupt the file, but they may cause the application to interpret it differently, if it can't find the stuff it expected was inserted there by itself.

    Then how the HELL is this a standard format, or even a format of any kind? I can only conclude you’re still not aware of the issues, and didn’t even click the Wikipedia link I posted (and, as I said, if you don’t trust Wikipedia itself, it links to plenty of further reading describing the same issues)

    Here’s some links:

