Two themes that appear every now and then in the mainstream media:

a) Some description of an obscure new computing concept, said to be "more capable (in terms of computability) than the Turing machine", although the current implementation is a simulation of the new concept on a standard PC.

b) A new compression algorithm, able to losslessly compress *all* files.

Fortunately for the "inventors", mainstream journalists don't know much about CS...

Makes me wanna cry: "When the scientists claimed to have a proof that this is impossible, they did not mean a statistical proof with a confidence value of 0.98. No, it's really, really impossible!"


I believe in lossless compression. If such articles were left out of the mainstream media, where would be the loss?

Q.E.D. 98%


I may be picking nits here, but all files can be losslessly compressed. What's the WTF?

(Obviously, the complete statement would be "losslessly compress *all known files in the universe* so they fit on a *single floppy disk*!! ZOMG!!!!1cos0)

Breaking News: P=NP. Film at eleven.

I may be picking nits here, but all files can be losslessly compressed. What's the WTF?

Perhaps you'd better learn something about compression.

For any given algorithm, there are some files that will grow larger. There is no possible algorithm that will losslessly compress *all* files, as a moment's thinking about the number of possible files of size 'n' bits compared to the number of possible files of size < 'n' bits should show you: there aren't enough distinct combinations of smaller strings of bits to distinguish amongst all possible longer strings of bits.

Compression algorithms *work* by making more-frequently encoded combinations smaller at the expense of making less-frequently encoded combinations larger.
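The counting argument above can be spelled out in a few lines of C (a throwaway sketch; the function names are mine, not anyone's library):

```c
/* Counting the pigeonholes: there are 2^n distinct files of exactly
   n bits, but only 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1 files of
   fewer than n bits.  A lossless compressor is an injective map, so
   it can't squeeze all 2^n files into the 2^n - 1 shorter slots. */

unsigned long files_of_length(unsigned n) {
    return 1UL << n;                     /* 2^n */
}

unsigned long files_shorter_than(unsigned n) {
    unsigned long sum = 0;
    for (unsigned k = 0; k < n; ++k)     /* geometric series */
        sum += 1UL << k;
    return sum;                          /* = 2^n - 1: always one short */
}
```

Whatever n you pick (up to the word size), files_shorter_than(n) comes out exactly one less than files_of_length(n) - which is the whole impossibility proof in two functions.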

@RiX0R said: I may be picking nits here, but all files can be losslessly compressed. What's the WTF?

Perhaps you'd better learn something about compression.

Define "compression".

To anal folk like you and me, "compression" means "making the original smaller". And technically, you're right: lossless compression always makes a nonzero subset of all possible files larger.

To normal people, "compression" often means "subjecting the original to a compression algorithm". So technically, HE'S right: all files can be losslessly compressed. It's just that this doesn't necessarily make them smaller.

Lossy compression is easy. It's called "del".

@DaveK said: @RiX0R said: I may be picking nits here, but all files can be losslessly compressed. What's the WTF?

Perhaps you'd better learn something about compression.

Define "compression".

To anal folk like you and me, "compression" means "making the original smaller". And technically, you're right: lossless compression always makes a nonzero subset of all possible files larger.

To normal people, "compression" often means "subjecting the original to a compression algorithm". So technically, HE'S right: all files can be losslessly compressed. It's just that this doesn't necessarily make them smaller.

Lossy compression is easy. It's called "del".

To put an end to this part of the discussion: when I wrote "compression", I referred to claims that the new algorithm would be able to make all files (assumed: larger than some minimum size) smaller. Obviously, every file can be run through the zip algorithm... but that doesn't necessarily make it smaller.

Compression algorithms *work* by making more-frequently encoded combinations smaller at the expense of making less-frequently encoded combinations larger.

This is a gross misstatement. While it is true that you can't have an algorithm that will losslessly compress all possible bitstreams, this is not at all the reason, and it's not how compression algorithms "work". The basic concept of data compression is to identify true redundancy (i.e., cases where the number of bits in the message is larger than the number of bits of information in the message) and reduce it. It is not about a trade-off, and many practical compression schemes do not involve one. It is all about information content and redundancy.

If your input data was a set of files comprised entirely of zero bytes, then you could compress them by storing them as a binary integer describing their length. This would eliminate the redundancy inherent in the input data, reducing the size of all the files. There would be no possible input files which would grow as a result of this compression. No trade-off has occurred.

(Curiously, if you do all this on a pair of supercooled caesium atoms instead, none of it holds. On a quantum scale, you can compress anything you like and it'll get smaller - the uncertainty principle prevents you from reading out the entire file, but you can pick any set of N bits to read and it'll give you the right answer, where N is the amount of classical information present. Simultaneously defying the laws of classical physics while being completely unable to use it for anything in the macro world - quantum theory at its best)

@DaveK said: Compression algorithms *work* by making more-frequently encoded combinations smaller at the expense of making less-frequently encoded combinations larger.

This is a gross misstatement. While it is true that you can't have an algorithm that will losslessly compress all possible bitstreams, this is not at all the reason, and it's not how compression algorithms "work". The basic concept of data compression is to identify true redundancy (i.e., cases where the number of bits in the message is larger than the number of bits of information in the message) and reduce it. It is not about a trade-off, and many practical compression schemes do not involve one. It is all about information content and redundancy.

If your input data was a set of files comprised entirely of zero bytes, then you could compress them by storing them as a binary integer describing their length. This would eliminate the redundancy inherent in the input data, reducing the size of all the files. There would be no possible input files which would grow as a result of this compression. No trade-off has occurred.

(Curiously, if you do all this on a pair of supercooled caesium atoms instead, none of it holds. On a quantum scale, you can compress anything you like and it'll get smaller - the uncertainty principle prevents you from reading out the entire file, but you can pick any set of N bits to read and it'll give you the right answer, where N is the amount of classical information present. Simultaneously defying the laws of classical physics while being completely unable to use it for anything in the macro world - quantum theory at its best)

DaveK is describing the desired effect and you are describing the usual way to get there. Maybe "work" is the wrong word, though...

Here's what I do. I compress all the files on my computer into a single .zip file. Then I run a lossless compression to make it smaller. Then I do it again and again and again until I get a one-character file that contains all the information. Then I open it with my text editor and write down the character on a piece of paper. That's how I do my backups.


Here's what I do. I compress all the files on my computer into a single .zip file. Then I run a lossless compression to make it smaller. Then I do it again and again and again until I get a one-character file that contains all the information. Then I open it with my text editor and write down the character on a piece of paper. That's how I do my backups.

Be careful with this. I once tried to do a system restore from a "q" and wound up with a Windows 95 machine loaded with nothing but Solitaire and what appeared to be a hijacked copy of Castle Wolfenstein. All my project code, Word documents, financial records, completely gone.....

If your input data was a set of files comprised entirely of zero bytes, then you could compress them by storing them as a binary integer describing their length. This would eliminate the redundancy inherent in the input data, reducing the size of all the files. There would be no possible input files which would grow as a result of this compression. No trade-off has occurred.

Unless, of course, my file was a single byte long (or even a bit, depending on the filesystem), in which case the two+ bytes required for the binary integer would be larger than the initial file.

It's sort of like that mathematical proof (whose name I can't remember) which proves that no function can exist that will determine whether or not *any* Turing-complete subroutine will become caught in an infinite loop.

Be careful with this. I once tried to do a system restore from a "q" and wound up with a Windows 95 machine loaded with nothing but Solitaire and what appeared to be a hijacked copy of Castle Wolfenstein. All my project code, Word documents, financial records, completely gone.....

Perhaps your paper was infected by a virus? I think I heard of that one...


Variable-length encodings are your friend. For the specific case asuffield described (which is extremely contrived, but we'll work with it) it's possible to compress every file containing a string of 0x00 bytes to no larger than itself. Consider:

Empty file compresses to empty file.

1-byte 0x00 file compresses to 1-byte file containing 0x01.

2-byte 0x00 0x00 file compresses to 1-byte file containing 0x02.

...

255-byte 0x00 file compresses to 1-byte file containing 0xFF.

256-byte 0x00 file compresses to 2-byte file containing 0x01 0x00.

257-byte 0x00 file compresses to 2-byte file containing 0x01 0x01.

et cetera.

The trick here is that asuffield has very strongly restricted the input so that we don't even have to keep track of what symbol we're encoding. This kind of run-length encoding is fairly inefficient for a lot of cases where the input is less restricted.

It's sort of like that mathematical proof (whose name I can't remember) which proves that no function can exist that will determine whether or not *any* Turing-complete subroutine will become caught in an infinite loop.

- I think it's called Gödel's incompleteness theorem.
- Actually, the statement is less severe. IIRC, for a set of all possible n-bit long programs, there is no program taking any of these n-bit long programs as input and determining whether the given program will stop whose length is less than n.
- This does not mean that there is no program that can determine whether *some* of the n-bit programs eventually stop that is shorter than n bits. Therefore, the theorem doesn't rule out automated code analysis. I remember this subject came up in some other thread.

No, he's referring to a stronger statement generally known as the "Halting problem":

There is *no* Turing machine T that, given the description of another Turing machine M and that machine's input I, will output H iff M halts when applied to I and will output H' iff M does not halt when applied to I. The encoded lengths of T, M, and I have nothing to do with it.

You're right that this is a general case problem that is solvable for some subsets of computable functions, though. There's also a trivial computable function that, given a description of a Turing machine M, that machine's input I, and a number N, will return whether M, when applied to I, halts in fewer than N transitions.

(Halting problem threads are almost as frustrating as the Monty Hall problem.)
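That last point - "halts in fewer than N transitions" is computable - is easy to sketch in C. The step() rule below (the Collatz iteration) is a hypothetical stand-in for one transition of a machine M on input I, with reaching 1 counting as halting:

```c
/* The full halting problem is undecidable, but "does M halt on I
   within n steps?" is trivially computable: just run it and count.
   step() here iterates the Collatz rule as a toy stand-in for a
   Turing machine transition; reaching 1 counts as halting. */

static unsigned long step(unsigned long x) {
    return (x % 2 == 0) ? x / 2 : 3 * x + 1;
}

/* Returns 1 iff the iteration reaches 1 in at most n steps. */
int halts_within(unsigned long x, unsigned long n) {
    for (unsigned long i = 0; i <= n; ++i) {
        if (x == 1)
            return 1;        /* halted after at most n transitions */
        x = step(x);
    }
    return 0;                /* still going; no verdict beyond n steps */
}
```

halts_within(27, 200) returns 1 (27 famously takes 111 steps to reach 1), while halts_within(27, 10) returns 0 - which, crucially, tells you nothing about whether it halts eventually.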

@newfweiler said: Here's what I do. I compress all the files on my computer into a single .zip file. Then I run a lossless compression to make it smaller. Then I do it again and again and again until I get a one-character file that contains all the information. Then I open it with my text editor and write down the character on a piece of paper. That's how I do my backups.

Be careful with this. I once tried to do a system restore from a "q" and wound up with a Windows 95 machine loaded with nothing but Solitaire and what appeared to be a hijacked copy of Castle Wolfenstein. All my project code, Word documents, financial records, completely gone.....

Try it again. I just tried uncompressing your "q" and I got everything. All your project code, Word documents and financial records. Oh, I cashed out your brokerage account but I left your savings and checking accounts; I'm not a total bastard. Those pictures you collect are interesting.


@Albatross said: It's sort of like that mathematical proof (whose name I can't remember) which proves that no function can exist that will determine whether or not *any* Turing-complete subroutine will become caught in an infinite loop.

- I think it's called Gödel's incompleteness theorem.

Gödel's incompleteness theorem says something else, along the lines of "in every mathematical system, there are true propositions that cannot be proved within the system". Gödel published his theorem before Turing's work, so it's rather unlikely that Gödel mentioned Turing completeness in his theorem.

- Actually, the statement is less severe. IIRC, for a set of all possible n-bit long programs, there is no program taking any of these n-bit long programs as input and determining whether the given program will stop whose length is less than n.
- This does not mean that there is no program that can determine whether *some* of the n-bit programs eventually stop that is shorter than n bits. Therefore, the theorem doesn't rule out automated code analysis. I remember this subject came up in some other thread.

Of course a program can determine for some programs whether or not they will stop. In fact, for the analysis to be impossible, the program being analysed must continuously keep taking more memory, so that no memory configuration appears twice. If the same memory configuration appears twice, the analyzing program can determine for sure that the target program is caught in an infinite loop.
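Here's a toy C version of that argument: pretend a program's entire memory is a single byte, so there are only 256 possible configurations to track. The next_state() rule (Collatz taken mod 256) and the names are made up for illustration; state 1 means "halted":

```c
/* With finite memory there are finitely many configurations; if one
   ever repeats, the program is certainly in an infinite loop.  Toy
   model: the whole memory is one byte, next_state() (Collatz mod
   256, a made-up rule) is one execution step, state 1 = "halted". */

static unsigned char next_state(unsigned char s) {
    return (s % 2 == 0) ? s / 2 : (unsigned char)(3 * s + 1);
}

/* Run until the halt state or until a configuration repeats. */
int halts(unsigned char s) {
    unsigned char seen[256] = {0};
    while (s != 1) {
        if (seen[s])
            return 0;        /* same configuration twice: infinite loop */
        seen[s] = 1;
        s = next_state(s);
    }
    return 1;                /* reached the halt state */
}
```

halts(27) returns 1, but halts(85) returns 0: 3*85+1 wraps around to 0, and 0 just keeps halving to 0 forever. Every call terminates, because any run longer than 256 steps must repeat a state.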

@wgh said: Try it again. I just tried uncompressing your "q" and I got everything. All your project code, Word documents and financial records. Oh, I cashed out your brokerage account but I left your savings and checking accounts; I'm not a total bastard. Those pictures you collect are interesting.

I think it depends on the font. If you write the q in Frutiger Next, it will uncompress to Vista. In Tahoma, it uncompresses to XP; in Arial, it uncompresses to Win95. And yes, those interesting pictures are the result of your personal style of writing.

Variable-length encodings are your friend. For the specific case asuffield described (which is extremely contrived, but we'll work with it) it's possible to compress every file containing a string of 0x00 bytes to no larger than itself. Consider:

Empty file compresses to empty file.

1-byte 0x00 file compresses to 1-byte file containing 0x01.

2-byte 0x00 0x00 file compresses to 1-byte file containing 0x02.

...

255-byte 0x00 file compresses to 1-byte file containing 0xFF.

256-byte 0x00 file compresses to 2-byte file containing 0x01 0x00.

257-byte 0x00 file compresses to 2-byte file containing 0x01 0x01.

et cetera.

The trick here is that asuffield has very strongly restricted the input so that we don't even have to keep track of what symbol we're encoding. This kind of run-length encoding is fairly inefficient for a lot of cases where the input is less restricted.

You are most certainly wrong. Consider...

0. Assume algorithm is not allowed to compress a file into something larger than the start file.

1. Start with a file called "start" of length n.

2. Compress this to a file named "compressed" of length m (where m < n).

3. Assume that "compressed" has absolutely no redundancy.

4. Create another file named "other" of length m which is a bit-by-bit copy of "compressed".

5. Compress "other" to a file named "other-compressed." Since "other" has no redundancy and the algorithm cannot generate a file larger than the input, it must output a bit-by-bit copy of "other". (You can add in fancy rules to get around this, like: output a copy but decrement it by 1, or shift all the bits, but my argument still holds for these; you just have to change a couple of things.)

6. At this point you have 2 files ("other-compressed", and "compressed") which are exactly the same. However, when you decompress them they need to turn into 2 distinctly different entities. This is a contradiction, so therefore this is impossible.

NOTE: This whole argument falls apart if you take away the constraint that m < n, because in that case "other" and "start" could be exactly the same, and therefore the decompression would work.

@wgh said: @newfweiler said: Try it again. I just tried uncompressing your "q" and I got everything. All your project code, Word documents and financial records. Oh, I cashed out your brokerage account but I left your savings and checking accounts; I'm not a total bastard. Those pictures you collect are interesting.

Um, those are for a research project.


@Angstrom said: Variable-length encodings are your friend. For the specific case asuffield described (which is extremely contrived, but we'll work with it) it's possible to compress every file containing a string of 0x00 bytes to no larger than itself. Consider:

Empty file compresses to empty file.

1-byte 0x00 file compresses to 1-byte file containing 0x01.

2-byte 0x00 0x00 file compresses to 1-byte file containing 0x02.

...

255-byte 0x00 file compresses to 1-byte file containing 0xFF.

256-byte 0x00 file compresses to 2-byte file containing 0x01 0x00.

257-byte 0x00 file compresses to 2-byte file containing 0x01 0x01.

et cetera.

The trick here is that asuffield has very strongly restricted the input so that we don't even have to keep track of what symbol we're encoding. This kind of run-length encoding is fairly inefficient for a lot of cases where the input is less restricted.

You are most certainly wrong. Consider...

0. Assume algorithm is not allowed to compress a file into something larger than the start file.

1. Start with a file called "start" of length n.

2. Compress this to a file named "compressed" of length m (where m < n).

3. Assume that "compressed" has absolutely no redundancy.

4. Create another file named "other" of length m which is a bit-by-bit copy of "compressed".

5. Compress "other" to a file named "other-compressed." Since "other" has no redundancy and the algorithm cannot generate a file larger than the input, it must output a bit-by-bit copy of "other". (You can add in fancy rules to get around this, like: output a copy but decrement it by 1, or shift all the bits, but my argument still holds for these; you just have to change a couple of things.)

6. At this point you have 2 files ("other-compressed" and "compressed") which are exactly the same. However, when you decompress them they need to turn into 2 distinctly different entities. This is a contradiction, so therefore this is impossible.

NOTE: This whole argument falls apart if you take away the constraint that m < n, because in that case "other" and "start" could be exactly the same, and therefore the decompression would work.

While I applaud your rigor, and you're absolutely right in the general case, you're ignoring a quirk of the *specific case* asuffield presented: the compressor's inputs are only those files containing nothing but the byte 0x00, repeated N times. The algorithm whose output I described uses the minimum number of bytes required to hold N. It's an extreme example of a spot where RLE is actually an efficient coding mechanism that would never, ever arise in real life.

In this case, the outputs from the compression function are not acceptable inputs to the compression function: they all contain at least one non-zero byte. This is actually a more strict limit on the compressor, so it doesn't matter.

PS: in case you're sure I'm wrong, here's a fast and dirty C hack implementing the algorithm for files up to MAX_ULONG bytes long:

[owen@verdandi zerocomp]$ cat ./zerocomp.c
#include <limits.h>
#include <stdio.h>

#define BLOCK_SIZE 512

unsigned maskof (unsigned bits) {
  return (1 << bits) - 1;
}

unsigned count_zeros (char *block, size_t size) {
  unsigned count = 0;
  for (unsigned i = 0; i < size; ++i)
    if (!block[i])
      ++count;
  return count;
}

int main () {
  unsigned long count = 0;
  char block[BLOCK_SIZE];
  size_t read_size;

  /* Pass 1: verify the input is all zero bytes and count them. */
  while (0 < (read_size = fread (block, sizeof(*block), BLOCK_SIZE, stdin))) {
    if (read_size != count_zeros (block, read_size)) {
      fprintf (stderr, "Non-zero input byte.\n");
      return -1;
    }
    count += read_size;
  }

  /* Pass 2: emit the count as little-endian bytes, low byte first. */
  while (count) {
    char output = (char) (count & maskof (CHAR_BIT));
    size_t written_size = fwrite (&output, sizeof(output), 1, stdout);
    if (written_size == 0) {
      perror ("zerocomp");
      return -1;
    }
    count >>= CHAR_BIT;
  }
}

[owen@verdandi zerocomp]$ dd if=/dev/zero bs=1 count=1 | ./zerocomp | (./dump; echo)

01

[owen@verdandi zerocomp]$ dd if=/dev/zero bs=1 count=255 | ./zerocomp | (./dump; echo)

FF

[owen@verdandi zerocomp]$ dd if=/dev/zero bs=1 count=256 | ./zerocomp | (./dump; echo)

00 01

[owen@verdandi zerocomp]$ dd if=/dev/zero bs=1 count=65536 | ./zerocomp | (./dump; echo)

00 00 01

Dump is just a tool that outputs stdin's bytes to stdout as hex. That dd command produces *count* zeros. I cheated a little and made the output format little-endian because it's easier to generate.

(Edit timeout)

It even satisfies "output size must be no larger than input size" for the 0-byte edge case. It outputs a zero-byte file.
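For completeness, the matching decompressor is even shorter; decoding the little-endian count is the whole job. This is my hypothetical companion sketch, not part of zerocomp itself, and decode_count is a name I made up:

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */

/* Reassemble zerocomp's little-endian output bytes into the original
   count of 0x00 bytes (low byte first, matching the compressor). */
unsigned long decode_count(const unsigned char *buf, size_t len) {
    unsigned long count = 0;
    for (size_t i = 0; i < len; ++i)
        count |= (unsigned long) buf[i] << (i * CHAR_BIT);
    return count;
}
```

Decompression is then just writing decode_count(...) zero bytes to stdout; e.g. the bytes 00 01 decode to 256, matching the dd count=256 run above.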
