Which WTF is worse?



  • Two themes that appear every now and then in the mainstream media:

    a) Some description of an obscure new computing concept, said to be "more capable (in terms of computability) than the Turing machine", but the current implementation is a simulation of the new concept on a standard PC.

    b) A new compression algorithm, able to losslessly compress *all* files.

    Fortunately for the "inventors", mainstream journalists don't know much about CS...

    Makes me wanna cry "When the scientists claimed to have a proof that this is impossible, they did not refer to a statistical proof with a confidence value of 0.98. No, it's really really impossible!"
     




  • I believe in lossless compression. If such articles were left out of the mainstream media, where would be the loss?

    Q.E.D. 98%



  • Lossless compression of all files.



  • I may be picking nits here, but all files can be losslessly compressed. What's the WTF?

    (Obviously, the complete statement would be "losslessly compress *all known files in the universe* so they fit on a *single floppy disk*!! ZOMG!!!!1cos0)



  • Breaking News: P=NP.  Film at eleven.



  • @RiX0R said:

    I may be picking nits here, but all files can be losslessly compressed. What's the WTF?

    Perhaps you'd better learn something about compression.

    For any given algorithm, there are some files that will grow larger. There is no possible algorithm that will losslessly compress all files, as a moment's thinking about the number of possible files of size 'n' bits as compared to the number of possible files of size < 'n' bits should show you: there aren't enough distinct combinations of smaller strings of bits to distinguish amongst all possible longer strings of bits.

    Compression algorithms work by making more-frequently encoded combinations smaller at the expense of making less-frequently encoded combinations larger.
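
    To make the counting concrete (nothing beyond the pigeonhole principle here): there are 2^n distinct files of exactly n bits, but only

    2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1

    files of fewer than n bits. A compressor that shrank every n-bit file would have to map 2^n inputs onto at most 2^n - 1 outputs, so at least two inputs would end up with the same output and the decompressor couldn't tell them apart. Already for n = 3: eight possible files, only seven shorter ones.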



  • @DaveK said:

    @RiX0R said:

    I may be picking nits here, but all files can be losslessly compressed. What's the WTF?

    Perhaps you'd better learn something about compression.

    Define "compression".

    To anal folk like you and me, "compression" means "making the original smaller". And technically, you're right: lossless compression always makes a nonzero subset of all possible files larger.

    To normal people, "compression" often means "subjecting the original to a compression algorithm". So technically, HE'S right: all files can be losslessly compressed. It's just that this doesn't necessarily make them smaller.

    Lossy compression is easy. It's called "del".



  • @CDarklock said:

    @DaveK said:
    @RiX0R said:

    I may be picking nits here, but all files can be losslessly compressed. What's the WTF?

    Perhaps you'd better learn something about compression.

    Define "compression".

    To anal folk like you and me, "compression" means "making the original smaller". And technically, you're right: lossless compression always makes a nonzero subset of all possible files larger.

    To normal people, "compression" often means "subjecting the original to a compression algorithm". So technically, HE'S right: all files can be losslessly compressed. It's just that this doesn't necessarily make them smaller.

    Lossy compression is easy. It's called "del".

    To put an end to this part of the discussion: when I wrote "compression", I was referring to claims that the new algorithm would be able to make all files (assumed: larger than some minimum size) smaller. Obviously, every file can be run through the zip algorithm... but that doesn't necessarily make it smaller.



  • @DaveK said:

    Compression algorithms work by making more-frequently encoded combinations smaller at the expense of making less-frequently encoded combinations larger.

    This is a gross misstatement. While it is true that you can't have an algorithm that will losslessly compress all possible bitstreams, this is not at all the reason, and it's not how compression algorithms "work". The basic concept of data compression is to identify true redundancy (ie, cases where the number of bits in the message is larger than the number of bits of information in the message) and reduce it. It is not about a trade-off, and many practical compression schemes do not involve one. It is all about information content and redundancy.

    If your input data was a set of files comprised entirely of zero bytes, then you could compress them by storing them as a binary integer describing their length. This would eliminate the redundancy inherent in the input data, reducing the size of all the files. There would be no possible input files which would grow as a result of this compression. No trade-off has occurred.
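
    For a sense of scale (under that all-zeros assumption only), take a file of 1,000,000 zero bytes:

    length = 1,000,000 = 0x0F4240  ->  3 bytes of output

    i.e. a compression ratio of better than 300,000:1, and still perfectly lossless.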

    (Curiously, if you do all this on a pair of supercooled caesium atoms instead, none of it holds. On a quantum scale, you can compress anything you like and it'll get smaller - the uncertainty principle prevents you from reading out the entire file, but you can pick any set of N bits to read and it'll give you the right answer, where N is the amount of classical information present. Simultaneously defying the laws of classical physics while being completely unable to use it for anything in the macro world - quantum theory at its best)



  • @asuffield said:

    @DaveK said:

    Compression algorithms work by making more-frequently encoded combinations smaller at the expense of making less-frequently encoded combinations larger.

    This is a gross misstatement. While it is true that you can't have an algorithm that will losslessly compress all possible bitstreams, this is not at all the reason, and it's not how compression algorithms "work". The basic concept of data compression is to identify true redundancy (ie, cases where the number of bits in the message is larger than the number of bits of information in the message) and reduce it. It is not about a trade-off, and many practical compression schemes do not involve one. It is all about information content and redundancy.

    If your input data was a set of files comprised entirely of zero bytes, then you could compress them by storing them as a binary integer describing their length. This would eliminate the redundancy inherent in the input data, reducing the size of all the files. There would be no possible input files which would grow as a result of this compression. No trade-off has occurred.

    (Curiously, if you do all this on a pair of supercooled caesium atoms instead, none of it holds. On a quantum scale, you can compress anything you like and it'll get smaller - the uncertainty principle prevents you from reading out the entire file, but you can pick any set of N bits to read and it'll give you the right answer, where N is the amount of classical information present. Simultaneously defying the laws of classical physics while being completely unable to use it for anything in the macro world - quantum theory at its best)

    DaveK is describing the desired effect and you are describing the usual way to get there. Maybe "work" is the wrong word, though...



  • Here's what I do.  I compress all the files on my computer into a single .zip file.  Then I run a lossless compression to make it smaller.  Then I do it again and again and again until I get a one-character file that contains all the information.  Then I open it with my text editor and write down the character on a piece of paper.  That's how I do my backups.

     



  • @newfweiler said:

    Here's what I do.  I compress all the files on my computer into a single .zip file.  Then I run a lossless compression to make it smaller.  Then I do it again and again and again until I get a one-character file that contains all the information.  Then I open it with my text editor and write down the character on a piece of paper.  That's how I do my backups.

    Be careful with this.  I once tried to do a system restore from a "q" and wound up with a Windows 95 machine loaded with nothing but Solitaire and what appeared to be a hijacked copy of Castle Wolfenstein.  All my project code, Word documents, financial records, completely gone.....



  • Download the internet, print it, take a photo of it, scan it on a wooden table...



  • @asuffield said:

    If your input data was a set of files comprised entirely of zero bytes, then you could compress them by storing them as a binary integer describing their length. This would eliminate the redundancy inherent in the input data, reducing the size of all the files. There would be no possible input files which would grow as a result of this compression. No trade-off has occurred.

    Unless, of course, my file was a single byte long (or even a bit, depending on the filesystem), in which case the two+ bytes required for the binary integer would be larger than the initial file.

    It's sort of like that mathematical proof (whose name I can't remember) which proves that no function can exist that will determine whether or not any Turing-complete subroutine will become caught in an infinite loop.



  • @wgh said:

    Be careful with this.  I once tried to do a system restore from a "q" and wound up with a Windows 95 machine loaded with nothing but Solitaire and what appeared to be a hijacked copy of Castle Wolfenstein.  All my project code, Word documents, financial records, completely gone.....

     

    Perhaps your paper was infected by a virus? I think I heard of that one...



    Variable-length encodings are your friend.  For the specific case asuffield described (which is extremely contrived, but we'll work with it) it's possible to compress every file containing a string of 0x00 bytes to no larger than itself.  Consider:

    Empty file compresses to empty file.

    1-byte 0x00 file compresses to 1-byte file containing 0x01.

    2-byte 0x00 0x00 file compresses to 1-byte file containing 0x02.

    ...

    255-byte 0x00 file compresses to 1-byte file containing 0xFF.

    256-byte 0x00 file compresses to 2-byte file containing 0x01 0x00.

    257-byte 0x00 file compresses to 2-byte file containing 0x01 0x01.

    et cetera.

    The trick here is that asuffield has very strongly restricted the input so that we don't even have to keep track of what symbol we're encoding.  This kind of run-length encoding is fairly inefficient for a lot of cases where the input is less restricted.



  • @Albatross said:

    It's sort of like that mathematical proof (whose name I can't remember) which proves that no function can exist that will determine whether or not any Turing-complete subroutine will become caught in an infinite loop.

    1. I think it's called Gödel's incompleteness theorem.
    2. Actually, the statement is less severe. IIRC, for the set of all possible n-bit programs, there is no program of length less than n that takes any of these n-bit programs as input and determines whether the given program will stop.
    3. This does not mean that there is no program shorter than n bits that can determine, for some n-bit programs, whether they eventually stop. Therefore, the theorem doesn't rule out automated code analysis. I remember this subject came up in some other thread.



  • No, he's referring to a stronger statement generally known as the "Halting problem":

    There is no Turing machine T that, given the description of another Turing machine M and that machine's input I, will output H iff M halts when applied to I and will output H' iff M does not halt when applied to I.  The encoded lengths of T, M, and I have nothing to do with it.

    You're right that this is a general case problem that is solvable for some subsets of computable functions, though.  There's also a trivial computable function that, given a description of a Turing machine M, that machine's input I, and a number N, will return whether M, when applied to I, halts in fewer than N transitions.
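
    That bounded check is easy to sketch.  Here's a toy version (everything below is made up for illustration: a one-field struct stands in for "machine plus input", and step() performs one transition):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy stand-in for "a Turing machine M applied to input I": the whole
       configuration is one integer, and each call to step performs one
       transition.  This particular machine halts when the state reaches 0. */
    struct machine { long state; };

    bool step (struct machine *m) {
      if (m->state == 0)
        return false;   /* already halted */
      m->state -= 1;    /* one transition */
      return true;
    }

    /* The decidable question: does M halt within at most n transitions?
       This always terminates, because the loop body runs at most n times. */
    bool halts_within (struct machine *m, unsigned long n) {
      for (unsigned long i = 0; i < n; ++i)
        if (!step (m))
          return true;
      return m->state == 0;   /* may have halted on exactly the n-th transition */
    }

    int main () {
      struct machine m = { 10 };
      printf ("halts within 5 transitions?  %d\n", halts_within (&m, 5));   /* prints 0 */

      m.state = 10;
      printf ("halts within 20 transitions? %d\n", halts_within (&m, 20));  /* prints 1 */
    }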

    (Halting problem threads are almost as frustrating as the Monty Hall problem.) 



  • @wgh said:

    @newfweiler said:

    Here's what I do.  I compress all the files on my computer into a single .zip file.  Then I run a lossless compression to make it smaller.  Then I do it again and again and again until I get a one-character file that contains all the information.  Then I open it with my text editor and write down the character on a piece of paper.  That's how I do my backups.

    Be careful with this.  I once tried to do a system restore from a "q" and wound up with a Windows 95 machine loaded with nothing but Solitaire and what appeared to be a hijacked copy of Castle Wolfenstein.  All my project code, Word documents, financial records, completely gone.....

    Try it again.  I just tried uncompressing your "q" and I got everything.  All your project code, Word documents and financial records.  Oh, I cashed out your brokerage account but I left your savings and checking accounts; I'm not a total bastard.  Those pictures you collect are interesting.

     



  • @Tweenk said:

    @Albatross said:

    It's sort of like that mathematical proof (whose name I can't remember) which proves that no function can exist that will determine whether or not any Turing-complete subroutine will become caught in an infinite loop.

    1. I think it's called Gödel's incompleteness theorem.

    Gödel's incompleteness theorem says something else, along the lines of "in every consistent formal system powerful enough to express arithmetic, there are true propositions that cannot be proved within the system". Gödel published his theorem before Turing's work on computability, so it's rather unlikely that Gödel mentioned Turing completeness in his theorem.

     

    2. Actually, the statement is less severe. IIRC, for the set of all possible n-bit programs, there is no program of length less than n that takes any of these n-bit programs as input and determines whether the given program will stop.
    3. This does not mean that there is no program shorter than n bits that can determine, for some n-bit programs, whether they eventually stop. Therefore, the theorem doesn't rule out automated code analysis. I remember this subject came up in some other thread.

    Of course a program can determine for some programs whether or not they will stop. In fact, for the analysis to be impossible, the program to analyse must continuously keep taking more memory, so that no memory configuration appears twice. If the same memory configuration appears twice, the analyzing program can determine for sure that the target program is caught in an infinite loop.
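
    Here's a toy sketch of that idea (all made up for illustration: the target "program" is a single byte of state updated by a fixed rule, so every possible memory configuration fits in a 256-entry table):

    #include <stdbool.h>
    #include <stdio.h>

    #define HALTED 0x100   /* made-up sentinel meaning "the program has stopped" */

    /* Toy target "program": its entire memory is one byte of state, updated by
       a fixed rule with no further input.  It halts if it ever reaches 1. */
    unsigned next (unsigned s) {
      if (s == 1)
        return HALTED;
      return (s + 2) % 256;
    }

    /* The analyzer: remember every configuration seen so far.  Because the
       state space is finite, the run must either halt or revisit some
       configuration, and a revisit proves an infinite loop. */
    bool loops_forever (unsigned start) {
      bool seen[256] = { false };
      unsigned s = start;

      while (s != HALTED) {
        if (seen[s])
          return true;   /* same memory configuration twice: stuck forever */
        seen[s] = true;
        s = next (s);
      }
      return false;      /* reached the halting state */
    }

    int main () {
      printf ("start at 0: loops forever? %d\n", loops_forever (0));   /* prints 1 */
      printf ("start at 7: loops forever? %d\n", loops_forever (7));   /* prints 0 */
    }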



  • @newfweiler said:

    @wgh said:

    Be careful with this.  I once tried to do a system restore from a "q" and wound up with a Windows 95 machine loaded with nothing but Solitaire and what appeared to be a hijacked copy of Castle Wolfenstein.  All my project code, Word documents, financial records, completely gone.....

    Try it again.  I just tried uncompressing your "q" and I got everything.  All your project code, Word documents and financial records.  Oh, I cashed out your brokerage account but I left your savings and checking accounts; I'm not a total bastard.  Those pictures you collect are interesting.

    I think it depends on the font. If you write the q in Frutiger Next, it will uncompress to Vista. In Tahoma, it uncompresses to XP; in Arial, it uncompresses to Win95. And yes, those interesting pictures are the result of your personal style of writing. 



  • @Angstrom said:

    Variable-length encodings are your friend.  For the specific case asuffield described (which is extremely contrived, but we'll work with it) it's possible to compress every file containing a string of 0x00 bytes to no larger than itself.  Consider:

    Empty file compresses to empty file.

    1-byte 0x00 file compresses to 1-byte file containing 0x01.

    2-byte 0x00 0x00 file compresses to 1-byte file containing 0x02.

    ...

    255-byte 0x00 file compresses to 1-byte file containing 0xFF.

    256-byte 0x00 file compresses to 2-byte file containing 0x01 0x00.

    257-byte 0x00 file compresses to 2-byte file containing 0x01 0x01.

    et cetera.

    The trick here is that asuffield has very strongly restricted the input so that we don't even have to keep track of what symbol we're encoding.  This kind of run-length encoding is fairly inefficient for a lot of cases where the input is less restricted.

     

    You are most certainly wrong.   Consider...

    0.  Assume the algorithm is not allowed to compress a file into something larger than the start file.

    1.  Start with a file called "start" of length n.

    2.  Compress this to a file named "compressed" of length m (where m < n).

    3.  Assume that "compressed" has absolutely no redundancy. 

    4.  Create another file named "other" of length m which is a bit-by-bit copy of "compressed".
     
    5.  Compress "other" to a file named "other-compressed."  Since "other" has no redundancy and the algorithm cannot generate a file larger than the input it must output a bit-by-bit copy of "other".  (you can add in fancy rules to get around this like: output a copy but decrement it by 1, or shift all the bits, but my argument still holds for these you just have to change a couple things.)

    6.  At this point you have 2 files ("other-compressed" and "compressed") which are exactly the same.  However, when you decompress them, they need to turn into 2 distinctly different entities.  This is a contradiction, so such an algorithm is impossible.

     

    NOTE:  This whole argument falls apart if you take away the constraint that m < n, because in that case "other" and "start" could be exactly the same, and therefore the decompression would work.
     



  • @newfweiler said:

    @wgh said:
    @newfweiler said:

    Here's what I do.  I compress all the files on my computer into a single .zip file.  Then I run a lossless compression to make it smaller.  Then I do it again and again and again until I get a one-character file that contains all the information.  Then I open it with my text editor and write down the character on a piece of paper.  That's how I do my backups.

    Be careful with this.  I once tried to do a system restore from a "q" and wound up with a Windows 95 machine loaded with nothing but Solitaire and what appeared to be a hijacked copy of Castle Wolfenstein.  All my project code, Word documents, financial records, completely gone.....

    Try it again.  I just tried uncompressing your "q" and I got everything.  All your project code, Word documents and financial records.  Oh, I cashed out your brokerage account but I left your savings and checking accounts; I'm not a total bastard.  Those pictures you collect are interesting.

    Um, those are for a research project.



  • Can you take a picture of that q on a wooden table and fax it to me so that I can see what these pictures are? ;)



  • @tster said:

    @Angstrom said:

    Variable-length encodings are your friend.  For the specific case asuffield described (which is extremely contrived, but we'll work with it) it's possible to compress every file containing a string of 0x00 bytes to no larger than itself.  Consider:

    Empty file compresses to empty file.

    1-byte 0x00 file compresses to 1-byte file containing 0x01.

    2-byte 0x00 0x00 file compresses to 1-byte file containing 0x02.

    ...

    255-byte 0x00 file compresses to 1-byte file containing 0xFF.

    256-byte 0x00 file compresses to 2-byte file containing 0x01 0x00.

    257-byte 0x00 file compresses to 2-byte file containing 0x01 0x01.

    et cetera.

    The trick here is that asuffield has very strongly restricted the input so that we don't even have to keep track of what symbol we're encoding.  This kind of run-length encoding is fairly inefficient for a lot of cases where the input is less restricted.

     

    You are most certainly wrong.   Consider...

    0.  Assume the algorithm is not allowed to compress a file into something larger than the start file.

    1.  Start with a file called "start" of length n.

    2.  Compress this to a file named "compressed" of length m (where m < n).

    3.  Assume that "compressed" has absolutely no redundancy. 

    4.  Create another file named "other" of length m which is a bit-by-bit copy of "compressed".
     
    5.  Compress "other" to a file named "other-compressed."  Since "other" has no redundancy and the algorithm cannot generate a file larger than the input it must output a bit-by-bit copy of "other".  (you can add in fancy rules to get around this like: output a copy but decrement it by 1, or shift all the bits, but my argument still holds for these you just have to change a couple things.)

    6.  At this point you have 2 files ("other-compressed" and "compressed") which are exactly the same.  However, when you decompress them, they need to turn into 2 distinctly different entities.  This is a contradiction, so such an algorithm is impossible.

    NOTE:  This whole argument falls apart if you take away the constraint that m < n, because in that case "other" and "start" could be exactly the same, and therefore the decompression would work.

    While I applaud your rigor, and you're absolutely right in the general case, you're ignoring a quirk of the specific case asuffield presented: the compressor's inputs are only those files containing nothing but the byte 0x00, repeated N times.  The algorithm whose output I described uses the minimum number of bytes required to hold N.  It's an extreme example of a spot where RLE is actually an efficient coding mechanism that would never, ever arise in real life.

    In this case, the outputs from the compression function are not acceptable inputs to the compression function: they all contain at least one non-zero byte.  This is actually a more strict limit on the compressor, so it doesn't matter.

    PS: in case you're sure I'm wrong, here's a fast and dirty C hack implementing the algorithm for files up to ULONG_MAX bytes long:

    [owen@verdandi zerocomp]$ cat ./zerocomp.c
    #include <limits.h>
    #include <stdio.h>

    #define BLOCK_SIZE 512

    unsigned maskof (unsigned bits) {
      return (1 << bits) - 1;
    }

    unsigned count_zeros (char *block, size_t size) {
      unsigned count = 0;

      for (unsigned i = 0; i < size; ++i)
        if (!block[i])
          ++count;

      return count;
    }

    int main () {
      unsigned long count = 0;

      char block[BLOCK_SIZE];
      size_t read_size;

      while (0 < (read_size = fread (block, sizeof(*block), BLOCK_SIZE, stdin))) {
        if (read_size != count_zeros (block, read_size)) {
          fprintf (stderr, "Non-zero input byte.\n");
          return -1;
        }
        count += read_size;
      }

      while (count) {
        char output = (char) (count & maskof (CHAR_BIT));

        size_t written_size = fwrite (&output, sizeof(output), 1, stdout);
        if (written_size == 0) {
          perror ("zerocomp");
          return -1;
        }

        count >>= CHAR_BIT;
      }
    }
    [owen@verdandi zerocomp]$ dd if=/dev/zero bs=1 count=1 | ./zerocomp | (./dump; echo)
    01
    [owen@verdandi zerocomp]$ dd if=/dev/zero bs=1 count=255 | ./zerocomp | (./dump; echo)
    FF
    [owen@verdandi zerocomp]$ dd if=/dev/zero bs=1 count=256 | ./zerocomp | (./dump; echo)
    00 01
    [owen@verdandi zerocomp]$ dd if=/dev/zero bs=1 count=65536 | ./zerocomp | (./dump; echo)
    00 00 01

    Dump is just a tool that writes stdin's bytes to stdout in hex.  That dd command produces count zeros.  I cheated a little and made the output format little-endian because it's easier to generate in place.
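
    And for symmetry, a rough sketch of the matching decompressor (untested, and it assumes exactly the little-endian format above: read the length back, low byte first, then emit that many zero bytes):

    #include <limits.h>
    #include <stdio.h>

    /* Inverse of zerocomp: rebuild the little-endian count from stdin, then
       write that many 0x00 bytes to stdout.  Assumes the count fits in an
       unsigned long, just like the compressor's counter. */
    int main () {
      unsigned long count = 0;
      unsigned shift = 0;
      int c;

      while ((c = getchar ()) != EOF) {
        count |= (unsigned long) (unsigned char) c << shift;
        shift += CHAR_BIT;
      }

      while (count) {
        if (putchar (0) == EOF) {
          perror ("zerodecomp");
          return -1;
        }
        --count;
      }
    }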



  • (Edit timeout)

    It even satisfies "output size must be no larger than input size" for the 0-byte edge case.  It outputs a zero-byte file. 



  • @Albatross said:

    It's sort of like that mathematical proof (whose name I can't remember) which proves that no function can exist that will determine whether or not any Turing-complete subroutine will become caught in an infinite loop.

     


    No, it's not. It's a simple counting argument, exactly like the proof that no cryptographic hash function can avoid collisions on all inputs.



  • LEWL@FORUM!

     

    :-D



  • @Angstrom said:

    Variable-length encodings are your friend.  For the specific case asuffield described (which is extremely contrived, but we'll work with it) it's possible to compress every file containing a string of 0x00 bytes *to no larger than itself*.

    Ooh!  I can do this one.  I think the function is called cp.



  • @ammoQ said:

     

    I think it depends on the font. If you write the q in Frutiger Next, it will uncompress to Vista. In Tahoma, it uncompresses to XP; in Arial, it uncompresses to Win95. And yes, those interesting pictures are the result of your personal style of writing. 

    I tried it and you're right!  I decompressed a handwritten i with a heart for a dot, and one with a smiley face for a dot, and ... well, try it yourself but don't try the smiley face one if you're at work.

     



  • @newfweiler said:

    I tried it and you're right!  I decompressed a handwritten i with a heart for a dot, and one with a smiley face for a dot, and ... well, try it yourself but don't try the smiley face one if you're at work.

    You've made me curious. I tried that i with a smiley thing, but it uncompressed to a COBOL development environment. Strange... maybe it's because I've written the smiley like this: :-(( 



  • @Angstrom said:

    In this case, the outputs from the compression function are not acceptable inputs to the compression function: they all contain at least one non-zero byte.  This is actually a more strict limit on the compressor, so it doesn't matter.

    Which is the whole point really (and I was hoping that people would get it without needing this lengthy explanation and demonstration). If there is no requirement for the compressor's output to be valid input to itself, then the proof does not work (and the statement isn't true). It's not just a quirk of the example; it's the reason for the example.

    This is interesting for a very large class of specialised compression algorithms, which are designed to operate on specific kinds of data. All the best compressors are specialised (if you think about it, they have to be - this discussion is part of the reason). And the reason why DaveK's original statement is wrong: specialised compressors behave like this contrived example, and do not have to make a trade-off.

    Incidentally, while your algorithm satisfies the requirements, it's suboptimal. One byte of zero does not comprise 8 bits of information; it's only one bit, and should be stored as such (yeah, your filesystem has issues with this, I know - it's interesting when you're compressing multiple files into one archive). Three bytes of zero comprise log_2(3) bits of information, or about 1.58 bits (yes, you can store fractional bits of information; you need to use an arithmetic coder to do it, and it's a vital part of some of the best compression schemes). The only "file of zero bytes" that can't be stored in a smaller amount of space is the zero-byte file (which is why my example allowed for same-length output, just to bag the pathological case - I could alternatively have specified "all files comprised entirely of one or more zero bytes" and got the same result).



  • I didn't realize we were talking about only files with the byte 0x00.  I was talking about the general case.

