Amazing compression with BackupExec



  • Don't you just love the way BackupExec helps ensure you're really getting the most out of your tapes?  No wasting space with this program, no sir:

    [IMG]http://i154.photobucket.com/albums/s268/myxiplx/100-200GBtapefullwith67GBstored.jpg[/IMG]

    Yes, you're reading that right.  It's somehow managing to fill 100/200GB tapes (100GB native, 200GB compressed) with 67GB of data.  Negative compression ratios ftw!



  • LOL. It should be smart enough not to try compressing data that already has high information density. I bet that data is already heavily compressed, i.e. backup dumps that are already zipped, etc. Negative compression ratios in this case are FULLY NATURAL. There is a simple and logical reason why magical compression algorithms (ones where everything fed to them ALWAYS comes out smaller) don't exist.



  • Way back in college we had to do an implementation of Huffman's algorithm.  IIRC we were told that Huffman coding was supposed to be able to guarantee compression on 'large files', though the definition of large files was conveniently omitted, as was an explanation for why you couldn't then just keep running Huffman's over and over again until the file ceased to be 'large'.
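
    Here's roughly what that exercise looked like, redone as a quick Python sketch (my own illustration, not the original assignment, and it ignores the cost of shipping the codebook itself):

        # Build a Huffman codebook from byte frequencies and report how big one
        # coding pass would be.  Decoding and storing the tree are omitted.
        import heapq
        from collections import Counter

        def huffman_code(data: bytes) -> dict:
            """Return a {byte: bit-string} codebook for the given data."""
            freq = Counter(data)
            # Heap entries: (weight, tie-break, {symbol: code-so-far}).
            heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
            heapq.heapify(heap)
            if len(heap) == 1:                      # degenerate one-symbol input
                return {sym: "0" for sym in heap[0][2]}
            while len(heap) > 1:
                w1, _, c1 = heapq.heappop(heap)
                w2, i, c2 = heapq.heappop(heap)
                merged = {s: "0" + c for s, c in c1.items()}
                merged.update({s: "1" + c for s, c in c2.items()})
                heapq.heappush(heap, (w1 + w2, i, merged))
            return heap[0][2]

        data = b"the quick brown fox jumps over the lazy dog " * 200
        codes = huffman_code(data)
        bits = sum(len(codes[b]) for b in data)
        print(f"{len(data)} bytes in, roughly {bits // 8} bytes of Huffman output")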



  • @jcoehoorn said:

    Way back in college we had to do an implementation of Huffman's algorithm.  IIRC we were told that Huffman coding was supposed to be able to guarantee compression on 'large files', though the definition of large files was conveniently omitted, as was an explanation for why you couldn't then just keep running Huffman's over and over again until the file ceased to be 'large'.

    You need information theory to really explain it. Basically, a given hunk of X bits of data contains Y bits of information, where Y <= X (and the rest is "redundant" data). It is impossible to transmit the data using less than Y bits. Any halfway decent compression algorithm gets as close as it can to Y on the first pass, so no further passes are going to accomplish anything, and you're probably adding some header and housekeeping information to the pile (so Y increases with each extra pass).

    To really understand it, you need a postgraduate education. This is a relatively new body of theory (usually considered to begin with Shannon in 1948) and a major field of research.
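
    The "no further passes" part is easy to see empirically. A throwaway Python sketch using zlib (nothing to do with BackupExec, just the quickest compressor to hand):

        # Repeated compression: a huge win on pass 1, then each extra pass
        # mostly just adds its own header/housekeeping bytes.
        import zlib

        data = b"backup " * 100000        # highly redundant, ~700 KB
        for i in range(1, 5):
            data = zlib.compress(data, 9)
            print(f"after pass {i}: {len(data)} bytes")
        # Typically: a massive drop on the first pass, then the size creeps
        # back up a little on every later pass.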



  • Maybe the program saved the backup as XML?



  • @asuffield said:

    @jcoehoorn said:

    Way back in college we had to do an implementation of Huffman's algorithm.  IIRC we were told that Huffman coding was supposed to be able to guarantee compression on 'large files', though the definition of large files was conveniently omitted, as was an explanation for why you couldn't then just keep running Huffman's over and over again until the file ceased to be 'large'.

    You need information theory to really explain it. Basically, a given hunk of X bits of data contains Y bits of information, where Y <= X (and the rest is "redundant" data). It is impossible to transmit the data using less than Y bits. Any halfway decent compression algorithm gets as close as it can to Y on the first pass, so no further passes are going to accomplish anything, and you're probably adding some header and housekeeping information to the pile (so Y increases with each extra pass).

    To really understand it, you need a postgraduate education. This is a relatively new body of theory (usually considered to begin with Shannon in 1948) and a major field of research.

    Yes and no... there are rare cases where the actual number of information bits, Y, is smaller than anything a general-purpose algorithm will detect.

    For example, the following text:

    "This entire line is reversed after the colon : noloc eht retfa desrever si enil eritne sihT"

    Contains redundant information that is easy for a human to spot, but would not be spotted by most algorithms.

    As a slightly less trivial example, a 1024 x 768 pixel 24-bit colour-depth Mandelbrot fractal picture contains no more "information" than a 10 line program that draws it, but I have yet to see a generic compression system that would reduce it to anything close to that number of bits.
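
    To make that concrete, here is roughly that "10 line program" (a Python sketch of my own that writes a plain PPM file; slow, but deliberately tiny):

        # The whole "information content" of the 1024x768 24-bit image this
        # writes is essentially just this program text.
        W, H, MAXIT = 1024, 768, 100
        with open("mandel.ppm", "wb") as f:
            f.write(b"P6 %d %d 255\n" % (W, H))
            for y in range(H):
                for x in range(W):
                    c = complex(3.5 * x / W - 2.5, 2.0 * y / H - 1.0)
                    z, n = 0j, 0
                    while abs(z) <= 2 and n < MAXIT:
                        z, n = z * z + c, n + 1
                    f.write(bytes([255 * n // MAXIT] * 3))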
     



  • @asuffield said:

    @jcoehoorn said:

    Way back in college we had to do an implementation of Huffman's algorithm.  IIRC we were told that Huffman coding was supposed to be able to guarantee compression on 'large files', though the definition of large files was conveniently omitted, as was an explanation for why you couldn't then just keep running Huffman's over and over again until the file ceased to be 'large'.

    You need information theory to really explain it. Basically, a given hunk of X bits of data contains Y bits of information, where Y <= X (and the rest is "redundant" data). It is impossible to transmit the data using less than Y bits. Any halfway decent compression algorithm gets as close as it can to Y on the first pass, so no further passes are going to accomplish anything, and you're probably adding some header and housekeeping information to the pile (so Y increases with each extra pass).

    To really understand it, you need a postgraduate education. This is a relatively new body of theory (usually considered to begin with Shannon in 1948) and a major field of research.

     

    To understand the basics, you just need to understand the pigeonhole principle, and that (b^1 + b^2 + ... + b^(n-1)) < b^n for any base b >= 2 and positive integer n.  The "really explain it" part is understanding why good algorithms are able to do such a good job on typical uncompressed (text, image, sound, etc.) files (at the expense of doing a bad job on already-compressed data and atypical random junk).
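
    The counting argument in a few lines of Python, for the b = 2 (bits) case:

        # There are fewer bit strings shorter than n bits than there are bit
        # strings of exactly n bits, so no lossless scheme can shrink them all.
        b = 2
        for n in (1, 4, 8, 16):
            shorter = sum(b ** k for k in range(n))   # lengths 0 .. n-1
            exact = b ** n
            print(f"n={n:2}: {shorter} strings shorter than n bits, {exact} of length n")
        # shorter < exact every time, so any map into shorter outputs must send
        # two different inputs to the same output -- i.e. it loses information.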

     



  • @GettinSadda said:

    @asuffield said:

    @jcoehoorn said:

    Way back in college we had to do an implementation of Huffman's algorithm.  IIRC we were told that Huffman coding was supposed to be able to guarantee compression on 'large files', though the definition of large files was conveniently omitted, as was an explanation for why you couldn't then just keep running Huffman's over and over again until the file ceased to be 'large'.

    You need information theory to really explain it. Basically, a given hunk of X bits of data contains Y bits of information, where Y <= X (and the rest is "redundant" data). It is impossible to transmit the data using less than Y bits. Any halfway decent compression algorithm gets as close as it can to Y on the first pass, so no further passes are going to accomplish anything, and you're probably adding some header and housekeeping information to the pile (so Y increases with each extra pass).

    To really understand it, you need a postgraduate education. This is a relatively new body of theory (usually considered to begin with Shannon in 1948) and a major field of research.

    Yes and no... there are rare cases where the actual number of information bits, Y, is smaller than anything a general-purpose algorithm will detect.

    You're not thinking information-theoretically. It has been proven impossible for any algorithm to determine the number of actual bits of information contained within an arbitrary string of bits; this is the essence of the proof that perfect compression is impossible. But that's not the level on which information theory usually operates. What you can do is take a set of a million 'typical' documents and measure the average information content, even though you can't state what the information content of any single document in that set is.

    Essentially, we can reason about the algorithms without needing to solve that problem.
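
    For the "measure the average over a set of documents" part, a hedged sketch (Python, order-0 byte frequencies only, with a tiny made-up sample set standing in for a real corpus):

        # Order-0 entropy estimate averaged over a sample set.  It says nothing
        # exact about any single string, but the average is a usable yardstick
        # for "typical" data of that kind.
        import math
        import zlib
        from collections import Counter

        def entropy_bits_per_byte(data: bytes) -> float:
            counts = Counter(data)
            n = len(data)
            return -sum((c / n) * math.log2(c / n) for c in counts.values())

        corpus = [
            b"Dear customer, please find attached the invoice for March. " * 200,
            b"SELECT name, total FROM orders WHERE total > 100;\n" * 300,
            zlib.compress(b"already squeezed " * 5000),   # already-compressed member
        ]
        for d in corpus:
            print(f"{entropy_bits_per_byte(d):.2f} bits/byte in a {len(d)}-byte sample")
        avg = sum(entropy_bits_per_byte(d) for d in corpus) / len(corpus)
        print(f"average: {avg:.2f} bits/byte (8.00 would mean no order-0 redundancy)")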



  • @death said:

    LOL. It should be smart enough not to try compressing data that already has high information density. I bet that data is already heavily compressed, i.e. backup dumps that are already zipped, etc. Negative compression ratios in this case are FULLY NATURAL. There is a simple and logical reason why magical compression algorithms (ones where everything fed to them ALWAYS comes out smaller) don't exist.

    Even if it's already compressed, a competent compression algorithm will only expand a file by one bit.
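
    The reason the worst case can be capped is just a "did compression actually help?" flag. A sketch in Python (using a whole flag byte instead of a single bit, for simplicity):

        # Try to compress; if that made things bigger, store the original with
        # a "raw" marker instead.  A careful format would spend one bit, not a
        # whole byte, on the marker.
        import zlib

        def safe_compress(data: bytes) -> bytes:
            packed = zlib.compress(data, 9)
            if len(packed) < len(data):
                return b"\x01" + packed    # marker: compressed payload follows
            return b"\x00" + data          # marker: raw payload follows

        def safe_decompress(blob: bytes) -> bytes:
            return zlib.decompress(blob[1:]) if blob[:1] == b"\x01" else blob[1:]

        already_packed = zlib.compress(b"some backup data " * 1000)
        out = safe_compress(already_packed)
        assert safe_decompress(out) == already_packed
        print(len(already_packed), "->", len(out))   # at worst, one marker byte bigger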



  • eh, it might be one of several things:

     

    (1)  The tapes are 100GB "unformatted".  Once you lay down inter-block gaps, sync bits, preambles, headers, and CRCs, you could easily lose 10 to 20 percent.

     

    (2)  Tapes are known to be a bit unreliable.  A good backup program would be wise to write out ECC data to recover from errors.  That ECC data takes up extra space, of course.
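
    Back-of-the-envelope, those two effects compound (the percentages here are made-up placeholders, not specs for any particular drive or product):

        # Illustrative only -- both overhead figures below are assumptions.
        native_capacity_gb = 100.0
        formatting_overhead = 0.15   # inter-block gaps, sync bits, preambles, headers, CRCs
        ecc_overhead = 0.10          # extra redundancy written by the backup software

        usable_gb = native_capacity_gb * (1 - formatting_overhead) * (1 - ecc_overhead)
        print(f"usable payload: {usable_gb:.1f} GB out of {native_capacity_gb:.0f} GB native")
        # ~76.5 GB with these made-up numbers: a "full" tape can hold a lot less
        # than the sticker says before compression even enters into it.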

     

     



  • I find it unlikely that this is a BE problem.

    LTO has hardware compression, so you have one of the following configurations:

    Hardware on, software off.

    Hardware off, software on.

    Hardware on, software on.

     
    Using software compression when you have hardware available is usually a bad idea.  If this is what you're doing, then simply turn on hardware compression and turn off software.  You're likely to see improvement.  Maybe it's a BE WTF, but at least it's easily fixed.

     
    If you're using both, the WTF is obvious.

     
    If you're using hardware only, then it's a WTF but it's not a BE WTF.  Even if your data is encrypted, I would expect it to come out very close to a 1:1 ratio at worst.
     



  • @Carnildo said:

    @death said:
    LOL. It should be smart enough not to try compressing data that already has high information density. I bet that data is already heavily compressed, i.e. backup dumps that are already zipped, etc. Negative compression ratios in this case are FULLY NATURAL. There is a simple and logical reason why magical compression algorithms (ones where everything fed to them ALWAYS comes out smaller) don't exist.

    Even if it's already compressed, a competent compression algorithm will only expand a file by one bit.

    There is no lower limit. It could increase the size by a fraction of a bit.

    Don't expect to understand what a fraction of a bit is unless you're familiar with compression coding mechanisms. 
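
    For the curious: with arithmetic or range coding, a symbol of probability p costs about -log2(p) bits, which is almost never a whole number. A toy illustration in Python (not any particular coder's internals):

        # Ideal cost in bits of coding one symbol of probability p.
        import math

        for p in (0.5, 0.9, 0.99, 0.999):
            print(f"p = {p}: {-math.log2(p):.4f} bits")
        # A near-certain symbol costs a tiny fraction of a bit, which is how an
        # "is this stream compressed?" marker can expand an already-compressed
        # file by less than one whole bit.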

