The Problematic Document Format (PDF)



  • You know the feeling when you think you know how something works and suddenly you are hit by a crapton of wtf regardless? Apparently, it's not Adobe's fault that their reader needs patching all the time.

     EDIT: metric tons



  • @Linked site said:

    Have you ever looked at the specifications for the PDF file format? You can download them from here (PDF).

    I lol'ed.

    Edit: Admittedly, the specs for HTML are in HTML too. I'd like to see the specs for JPEG as a JPEG image though.



  •  Interesting.

    So a truly barebones, empty PDF document may be produced by opening Notepad, not typing anything, and saving it as *.pdf?



  •  They're going at it half-assed. What they need to do is make it so you can run virtual machines in PDF. Then you'll finally be able to have everything you could possibly need.



  • @F-Secure said:

    PDF files can contain 3D objects, complete with embedded JavaScript?

     

     Bonus points to whoever writes the first flight simulator or FPS contained entirely within a PDF file.




  • @DOA said:

     They're going at it half-assed. What they need to do is make it so you can run virtual machines in PDF.
     

    A titillating concept! Where may I subscribe to your newsletter?



  • @dhromed said:

     Interesting.

    So a truly barebones, empty PDF document may be produced by opening Notepad, not typing anything, and saving it as *.pdf?


    Doesn't work in Okular. Does that mean Okular fails at PDF standards or that I fail at recognizing humor? Probably the latter, given my nationality.



  • @derula said:

    ... or that I fail at recognizing humor? Probably the latter, given my nationality.

    Then why am I laughing at your self-reference?



  • @steenbergh said:

    Then why am I laughing at your self-reference?

    Dunno. Maybe you're not German?



  • Interestingly, opening that cmd thing in Foxit pops it up, but using Adobe Reader prompts you before launching. Isn't Foxit supposed to be the more secure one?



  • @MiffTheFox said:

    Interestingly, opening that cmd thing in Foxit pops it up, but using Adobe Reader prompts you before launching. Isn't Foxit supposed to be the more secure one?

    But he's not exploiting a bug or security vulnerability. He's simply using the crazy nonsense in the PDF spec. As for Adobe Reader giving a warning, he shows how you can change the warning to anything you want, like "Click Yes to Continue".


  • @dhromed said:

     Interesting.

    So a truly barebones, empty PDF document may be produced by opening Notepad, not typing anything, and saving it as *.pdf?

    That's a kind of philosophical question.  Can a null document really be said to be a null PDF, rather than a null JPG or a null file of any other type?  Are you suggesting that we can distinguish this null file from others, in that, despite them all being null, this one specifically does not have a PDF header as opposed to some other null file which might, for example, not have a JPG header?

    If a null pdf gets its extension renamed in a forest, and no format-specific parser opens it, does the type of its contents change?




  • What would be the correct MIME type for the empty file?

    (There's an idea for an April 1st RFC here. Damn, it's April 2nd.)

     



  • @Wrongfellow said:

    @F-Secure said:

    PDF files can contain 3D objects, complete with embedded JavaScript?

     

     Bonus points to whoever writes the first flight simulator or FPS contained entirely within a PDF file.


     

    That sounds soo tempting... I always find ways to use things in a way they are not supposed to be used. Or find ways to use things that are not supposed to be used at all. I've already done horrible, horrible things with c, brainfuck, cellular automata, c++ templates,...



  • @AnonymousCoward said:

    That sounds soo tempting... I always find ways to use things in a way they are not supposed to be used. Or find ways to use things that are not supposed to be used at all. I've already done horrible, horrible things with c, brainfuck, cellular automata, c++ templates,...
    Links or it didn't happen.



  • Many of the files that you can right-click-create-new in Windows are empty files, including .zip, .docx, .bmp, .txt and other types of files. That would mean that an empty file can be a file of many formats at the same time.



  • @DaveK said:

    @dhromed said:

     Interesting.

    So a truly barebones, empty PDF document may be produced by opening Notepad, not typing anything, and saving it as *.pdf?

    That's a kind of philosophical question.  Can a null document really be said to be a null PDF, rather than a null JPG or a null file of any other type?  Are you suggesting that we can distinguish this null file from others, in that, despite them all being null, this one specifically does not have a PDF header as opposed to some other null file which might, for example, not have a JPG header?

    If a null pdf gets its extension renamed in a forest, and no format-specific parser opens it, does the type of its contents change?


     

    There's no reason a file can't be of more than one format. I'd be willing to bet that damn near any file you can find is a valid plain text file, for some character encoding.



  • @DaveK said:

    That's a kind of philosophical question.  Can a null document really be said to be a null PDF, rather than a null JPG or a null file of any other type? 
    If your OS depends on file extensions for hints as to what application the file should be opened with, then the answer is yes.



  • @Someone You Know said:

    There's no reason a file can't be of more than one format. I'd be willing to bet that damn near any file you can find is a valid plain text file, for some character encoding.

     

    You can always invent new character encodings.



  • @Someone You Know said:

    There's no reason a file can't be of more than one format.
    See also: JPEG-Zip files
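
    A rough sketch of why that trick works, in Python: a ZIP reader locates the archive from the end of the file, while an image decoder starts from the first bytes, so one file can satisfy both. The stub bytes below are only a stand-in for a real image, and the names (make_polyglot, hello.txt) are made up for illustration.

```python
import io
import zipfile

# Stand-in for real image data: a JPEG start-of-image marker, some filler,
# and the end-of-image marker. A real polyglot would use a complete JPEG.
JPEG_STUB = b"\xff\xd8\xff\xe0" + b"\x00" * 16 + b"\xff\xd9"


def make_polyglot(leading: bytes) -> bytes:
    """Return `leading` followed by a small ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("hello.txt", "hidden in plain sight")
    return leading + buf.getvalue()


data = make_polyglot(JPEG_STUB)

# An image sniffer would look at the first bytes...
starts_like_jpeg = data.startswith(b"\xff\xd8\xff")

# ...while the ZIP reader finds the central directory at the end.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = archive.namelist()
    text = archive.read("hello.txt").decode()
```

    Python's zipfile module tolerates the leading bytes here; other ZIP tools may be stricter about data in front of the archive.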



  • @DaveK said:

    @dhromed said:

    So a truly barebones, empty PDF document may be produced by opening Notepad, not typing anything, and saving it as *.pdf?

    That's a kind of philosophical question.  Can a null document really be said to be a null PDF, rather than a null JPG or a null file of any other type? 

    I was once testing a product that was meant to take some figures and produce a file for import into another application.  For some reason it was producing a zero byte file.  In my defect report I listed the steps to reproduce and gave the name of the file but I was advised that they wouldn't look at the problem until I sent them a copy of the zero byte file.



  •  @ender said:

    @AnonymousCoward said:
    That sounds soo tempting... I always find ways to use things in a way they are not supposed to be used. Or find ways to use things that are not supposed to be used at all. I've already done horrible, horrible things with c, brainfuck, cellular automata, c++ templates,...
    Links or it didn't happen.

    I'm a bit hesitant to post anything major, not yet at least. Anyway, here's an implementation of a certain non-primitive recursive function which is calculated during compilation. It can handle arbitrarily long input & output values and just might overflow your stack. The code is c++ with only the following keywords used: struct, template, typename, typedef, int and return. Using enumerations would have been too easy.

     

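    // Peano-style naturals encoded as types: Zero, Next<Zero>, Next<Next<Zero> >, ...
    struct Zero {};

    template<typename a>
    struct Next
    {
        typedef a previous;   // the predecessor, i.e. n - 1
    };

    typedef Next<Zero> One;
    typedef Next<One> Two;

    // General case (m > 0, n > 0): A(m, n) = A(m - 1, A(m, n - 1))
    template<typename m, typename n>
    struct Ackermann
    {
        typedef typename Ackermann<typename m::previous,typename Ackermann<m,typename n::previous>::value>::value value;
    };

    // A(0, n) = n + 1
    template<typename n>
    struct Ackermann<Zero,n>
    {
        typedef Next<n> value;
    };

    // A(m, 0) = A(m - 1, 1)
    template<typename m>
    struct Ackermann<m,Zero>
    {
        typedef typename Ackermann<typename m::previous,Next<Zero> >::value value;
    };

    // A(0, 0) matches both partial specializations above, so spell it out
    // to avoid an ambiguity error if it is ever requested directly.
    template<>
    struct Ackermann<Zero,Zero>
    {
        typedef One value;
    };

    int main()
    {
        // The result is the type itself: seven nested Next<> around Zero, since A(2, 2) = 7.
        Ackermann<Two,Two>::value x;
        return 0;
    }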
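
    For anyone who would rather check the result than count nested Next<> in a compiler message, here is a plain runtime sketch of the same three cases in Python. It is not the template version, only the recursion it encodes, and the function name is purely illustrative.

```python
def ackermann(m: int, n: int) -> int:
    """Ackermann function, mirroring the three template cases."""
    if m == 0:                # corresponds to Ackermann<Zero, n>
        return n + 1
    if n == 0:                # corresponds to Ackermann<m, Zero>
        return ackermann(m - 1, 1)
    # General case: A(m - 1, A(m, n - 1))
    return ackermann(m - 1, ackermann(m, n - 1))


result = ackermann(2, 2)
print(result)  # 7, i.e. seven Next<> wrapped around Zero
```

    Like the template version, this recurses deeply for larger inputs, so keep m small.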


  • Discourse touched me in a no-no place

    @AnonymousCoward said:

    which is calculated during compilation.
    http://en.wikipedia.org/wiki/Template_metaprogramming



  • @PJH said:

    @AnonymousCoward said:
    which is calculated during compilation.
    http://en.wikipedia.org/wiki/Template_metaprogramming

    Yes, I know about Wikipedia and even about Boost; there is quite a bit of useful stuff you can do with template metaprogramming. I like to do the stupid and useless. In the previous example, using enums would make it more readable and probably even faster. I just wanted to see if I could do without them.



  • @DaveK said:

    That's a kind of philosophical question.  Can a null document really be said to be a null PDF, rather than a null JPG or a null file of any other type?  Are you suggesting that we can distinguish this null file from others, in that, despite them all being null, this one specifically does not have a PDF header as opposed to some other null file which might, for example, not have a JPG header?

    If a null pdf gets its extension renamed in a forest, and no format-specific parser opens it, does the type of its contents change?

     

    I would go so far as to say that digital data has no format, and all formatting/interpretation is only imposed on it by that which is interpreting the data. This is the inherent problem in storing digital information.

    Best you could do with an arbitrary sequence of digital information is run some kind of statistical analysis or geometric rearranging to check for patterns, but even then you might not figure out what the data contains.
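
    One cheap example of such a statistical check is byte entropy: plain text usually scores well below 8 bits per byte, while compressed or encrypted data comes close to 8. A minimal Python sketch, with a made-up function name and no claim that it identifies any particular format:

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0  # nothing to measure in an empty file
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


low = byte_entropy(b"aaaaaaaa")          # a single repeated byte
high = byte_entropy(bytes(range(256)))   # every byte value exactly once
```

    Even a high or low score only hints at what kind of data it might be, which is rather the point being made above.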



  • @too_many_usernames said:

    I would go so far as to say that digital data has no format, and all formatting/interpretation is only imposed on it by that which is interpreting the data.
     

    This applies to all information.

    While it's true, I'm not sure how useful the fact is, other than uttering it to make sure one realises it consciously and explicitly.


  • Discourse touched me in a no-no place

    @too_many_usernames said:

    Best you could do with an arbitrary sequence of digital information is run some kind of statistical analysis or geometric rearranging to check for patterns, but even then you might not figure out what the data contains.
    Well that's basically what file does.



  • @PJH said:

    @too_many_usernames said:
    Best you could do with an arbitrary sequence of digital information is run some kind of statistical analysis or geometric rearranging to check for patterns, but even then you might not figure out what the data contains.
    Well that's basically what file does.

    So, what does file say about an empty file?



  • @Abdiel said:

    @PJH said:

    @too_many_usernames said:
    Best you could do with an arbitrary sequence of digital information is run some kind of statistical analysis or geometric rearranging to check for patterns, but even then you might not figure out what the data contains.
    Well that's basically what file does.

    So, what does file say about an empty file?



  • Discourse touched me in a no-no place

    @morbiuswilters said:

    @Abdiel said:

    @PJH said:

    @too_many_usernames said:
    Best you could do with an arbitrary sequence of digital information is run some kind of statistical analysis or geometric rearranging to check for patterns, but even then you might not figure out what the data contains.
    Well that's basically what file does.

    So, what does file say about an empty file?


    [beerfax@server tmp]$ touch x.jpg
    [beerfax@server tmp]$ file x.jpg
    x.jpg: empty
    [beerfax@server tmp]$ file -bi x.jpg
    application/x-empty
    [beerfax@server tmp]$ ls -l x.jpg
    -rw-rw-r-- 1 beerfax beerfax 0 Apr 9 19:16 x.jpg
    I'm sure the first file command reports something different on my home/work distro, but the emphasis is the same - it's a file with bugger all in it. (Not sure which distro the above came from - uname -a reports nothing useful.)
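
    For the curious, the core idea is matching leading bytes against known signatures. A toy Python sketch of that idea follows; the signature table is a tiny hand-picked subset and the function name is invented, so treat it as an illustration rather than how file is actually implemented.

```python
# A few well-known leading signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
]


def sniff(data: bytes) -> str:
    """Guess a type from content alone, ignoring any file name."""
    if not data:
        return "empty"   # no bytes, so no signature can match
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "data"        # unknown content


print(sniff(b""))            # empty
print(sniff(b"%PDF-1.4\n"))  # PDF document
```

    The extension never enters into it, which is why x.jpg above comes back as empty rather than as an image.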


  • @dhromed said:

    This applies to all information.

    While it's true, I'm not sure how useful the fact is, other than uttering it to make sure one realises it consciously and explicitly.

     

    Oh, I like to think it's slightly more useful than other questionably useful truths, like packages of wheat bread that say "Contains Wheat Ingredients."

    I'd like to think that.


  • Discourse touched me in a no-no place

    @too_many_usernames said:

    Oh, I like to think it's slightly more useful than other questionably useful
    truths, like packages of wheat bread that say "Contains Wheat Ingredients."
    Or that milk, shock horror!, contains dairy products. Or that packs of peanuts may contain nuts...



    Deja vu.



    Then again, I did learn something off that thread before I saw the relevant QI episode - peas are, apparently, not nuts. They're quite sane.



  • @PJH said:

    peas are, apparently, not nuts.

    That is just nuts!



  • @AnonymousCoward said:

    @PJH said:
    peas are, apparently, not nuts.
    That is just nuts!

    Not A Nut – 01:37
    — Rathergood



  • @RTapeLoadingError said:

    @DaveK said:

    @dhromed said:

    So a truly barebones, empty PDF document may be produced by opening Notepad, not typing anything, and saving it as *.pdf?

    That's a kind of philosophical question.  Can a null document really be said to be a null PDF, rather than a null JPG or a null file of any other type? 

    I was once testing a product that was meant to take some figures and produce a file for import into another application.  For some reason it was producing a zero byte file.  In my defect report I listed the steps to reproduce and gave the name of the file but I was advised that they wouldn't look at the problem until I sent them a copy of the zero byte file.

    I hope you sent them the real zero-byte file that the application generated, and didn't succumb to the temptation to fabricate an entirely new zero-byte file from scratch!  That would be terrible dishonesty, don't ya know.




  • @DaveK said:

    @RTapeLoadingError said:

    @DaveK said:

    @dhromed said:

    So a truly barebones, empty PDF document may be produced by opening Notepad, not typing anything, and saving it as *.pdf?

    That's a kind of philosophical question.  Can a null document really be said to be a null PDF, rather than a null JPG or a null file of any other type? 

    I was once testing a product that was meant to take some figures and produce a file for import into another application.  For some reason it was producing a zero byte file.  In my defect report I listed the steps to reproduce and gave the name of the file but I was advised that they wouldn't look at the problem until I sent them a copy of the zero byte file.

    I hope you sent them the real zero-byte file that the application generated, and didn't succumb to the temptation to fabricate an entirely new zero-byte file from scratch!  That would be terrible dishonesty, don't ya know.


    Actually, they asked for a copy of the file. So you should copy the file somewhere, and send them the copy. You know, so that you don't lose the original file.



  • @Abdiel said:

    @DaveK said:

    @RTapeLoadingError said:

    @DaveK said:

    @dhromed said:

    So a truly barebones, empty PDF document may be produced by opening Notepad, not typing anything, and saving it as *.pdf?

    That's a kind of philosophical question.  Can a null document really be said to be a null PDF, rather than a null JPG or a null file of any other type? 

    I was once testing a product that was meant to take some figures and produce a file for import into another application.  For some reason it was producing a zero byte file.  In my defect report I listed the steps to reproduce and gave the name of the file but I was advised that they wouldn't look at the problem until I sent them a copy of the zero byte file.

    I hope you sent them the real zero-byte file that the application generated, and didn't succumb to the temptation to fabricate an entirely new zero-byte file from scratch!  That would be terrible dishonesty, don't ya know.


    Actually, they asked for a copy of the file. So you should copy the file somewhere, and send them the copy. You know, so that you don't lose the original file.

    Wow, we're going to end up with lots of copies of this file lying around and getting backed up every night.  I hope the archiver has a good de-duping system, one that can tell the difference between different zero-sized file contents and doesn't mix them up when you restore it!




  • @DaveK said:

    @RTapeLoadingError said:

    @DaveK said:

    @dhromed said:

    So a truly barebones, empty PDF document may be produced by opening Notepad, not typing anything, and saving it as *.pdf?

    That's a kind of philosophical question.  Can a null document really be said to be a null PDF, rather than a null JPG or a null file of any other type? 

    I was once testing a product that was meant to take some figures and produce a file for import into another application.  For some reason it was producing a zero byte file.  In my defect report I listed the steps to reproduce and gave the name of the file but I was advised that they wouldn't look at the problem until I sent them a copy of the zero byte file.

    I hope you sent them the real zero-byte file that the application generated, and didn't succumb to the temptation to fabricate an entirely new zero-byte file from scratch!  That would be terrible dishonesty, don't ya know.

     

    Being a professional I sent them a copy of the actual zero byte file and not a mock up.  I reckon I could have fooled them with a fake one though.

    At this point in the testing process the relationship between the customer (who I was representing) and the software vendor had become pretty toxic so I was in "anything for an easy life" mode.  Incidentally, I listed a few of the more memorable WTFs here...

     http://forums.thedailywtf.com/forums/p/14246/210380.aspx#210380 



  • @PJH said:

    @morbiuswilters said:


    [beerfax@server tmp]$ touch x.jpg
    [beerfax@server tmp]$ file x.jpg
    x.jpg: empty
    [beerfax@server tmp]$ file -bi x.jpg
    application/x-empty
    [beerfax@server tmp]$ ls -l x.jpg
    -rw-rw-r-- 1 beerfax beerfax 0 Apr 9 19:16 x.jpg
    A wild JPG appears!  PJH uses text!  It's super effective!  JPG faints.


  • @belgariontheking said:

    A wild JPG appears!  PJH uses text!  It's super effective!  JPG faints.
     

    I'm in two minds about you posting while drunk and high.

    Meanwhile, continue;.



  • @dhromed said:

     Interesting.

    So a truly barebones, empty PDF document may be produced by opening Notepad, not typing anything, and saving it as *.pdf?

     

    Nope. The PDF specification requires a header, a footer and an EOF marker in the file. The header tells it what version it is (and has 2 other bytes before the line feed), so an example would be: "%PDF-1.4". The footer tells it where to find the xref section (which lists the starting position of each object in the file, and where the other xref sections are). And then there's the EOF marker "%%EOF".

    The interesting thing is that the format is appendable. Just remove the EOF marker, add what you want (even adding already defined objects to "overwrite" the object), add a new xref section, and add the footer and EOF marker.

    The other interesting thing is that you can embed objects that are not referenced anywhere (this is designed to allow you to keep "versioned" objects even when you delete them in later revisions). So you can trivially embed any stream data you want into an object, and just never reference it. The specification is REALLY flexible (a bit too flexible IMHO, considering the spec is something like 600+ pages long). But basic reading is actually quite easy once you know what to parse for (I built a reader that extracts text using PHP in only around 500 lines with comments and formatting. It was 500 lines because I pulled all the data from the PDF into a PHP Object so that I could later rewrite the file)...
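
    To make that layout concrete, here is a hedged Python sketch that writes a tiny one-page PDF with a classic xref table and then checks the three pieces described above (header, pointer to the xref section, EOF marker). The function names and the page contents are made up for illustration, and the check only understands plain xref tables, not the cross-reference streams that newer files may use.

```python
def build_minimal_pdf() -> bytes:
    """Build a blank one-page PDF, recording each object's byte offset."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")            # header with the version
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))              # where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"           # entry 0 is always free
    for off in offsets:
        out += b"%010d 00000 n \n" % off      # 20-byte in-use entries
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def looks_like_pdf(data: bytes) -> bool:
    """Check header, EOF marker, and that startxref points at an xref table."""
    if not data.startswith(b"%PDF-"):
        return False
    tail = data[-1024:]                       # the footer lives near the end
    eof = tail.rfind(b"%%EOF")
    sx = tail.rfind(b"startxref")
    if eof == -1 or sx == -1 or sx > eof:
        return False
    fields = tail[sx + len(b"startxref"):eof].split()
    if not fields or not fields[0].isdigit():
        return False
    offset = int(fields[0])
    return data[offset:offset + 4] == b"xref"


pdf = build_minimal_pdf()
print(looks_like_pdf(pdf))   # True
print(looks_like_pdf(b""))   # False
```

    By this check, at least, the empty Notepad file does not qualify.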



  • @ircmaxell said:

    .... The specification is REALLY flexible (a bit too flexible IMHO, considering the spec is something like 600+ pages long).  ..... 

     

     

    Sherlock Holmes: "Dr Watson, I have found that I am equipped with a penis. Quick hold it man before we loose it"

    Dr Watson: *sigh*


  • Discourse touched me in a no-no place

    @Helix said:

    Sherlock Holmes: "Dr Watson, I have found that I am equipped with a penis. Quick hold it man before we loose it"

    Dr Watson: sigh
    I take it Dr. Watson is exasperated as I am at Holmes' misspelling of the word 'lose'?



  • @PJH said:

    @Helix said:
    Sherlock Holmes: "Dr Watson, I have found that I am equipped with a penis. Quick hold it man before we loose it"
    Dr Watson: *sigh*
    I take it Dr. Watson is exasperated as I am at Holmes' misspelling of the word 'lose'?
    Hey, Holmes is hopped up on cocaine.  Cut him some slack.



  •  @PJH said:

    @Helix said:
    Sherlock Holmes: "Dr Watson, I have found that I am equipped with a penis. Quick hold it man before we loose it"
    Dr Watson: *sigh*
    I take it Dr. Watson is exasperated as I am at Holmes' misspelling of the word 'lose'?

     

    Either that or a typo of 'loosen' implying more lube.

    It was a novel piece I composed as an analogy (even equivalence) to ircmaxell's post.

