Xerox scanners/photocopiers randomly alter numbers in scanned documents





  • Website ain't too clever either





  • Lossy compression is lossy. Film at eleven.



  • Lossy compression is lossy. Film at eleven.

    Person who did not read TFA misses whole point. Film at 11.



  • @flabdablet said:

    Lossy compression is lossy.
    Lossy compression isn't the issue here.



  • @PJH said:

    @flabdablet said:
    Lossy compression is lossy.
    Lossy compression isn't the issue here.

    Correct. A lousy compression algorithm is the issue here.



  • @flabdablet said:

    Lossy compression is lossy. Film at eleven.

    Lossy is one thing. Compression that takes one value and changes it to another is the issue.

    Xerox would scan a '6', the compression routine would convert this to an '8' and then print it out like that.

    Imagine the fun in accounts departments, legal departments, government offices etc.

    "Our oil platform supply contract is for 8 months, at a rate of $12.5m per month."



  • @PJH said:

    @flabdablet said:
    Lossy compression is lossy.
    Lossy compression isn't the issue here.

    Lossy compression is exactly the issue here. Remember that, loosely speaking, "lossy" compression is defined as a process which reduces the amount of data used to represent the "original" in a non-reversible way, i.e. you can't reconstruct the original from the compressed data, as opposed to lossless compression, which allows you to reconstruct the original down to the last bit.

    So most common lossy image compression algorithms achieve a higher compression rate (as compared to lossless ones) by sacrificing image quality. However, the algorithm in question, JBIG2, simply replaces parts (pixel blocks) of the image which are "similar enough" to each other with a single instance of the block, thus saving space by only having to encode the block itself once. This allows it (under ideal circumstances) to maintain a fairly high image quality; however, that trade-off is paid for with a loss in, well, let's call it the "fidelity" of the image rather than its quality.
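The block-substitution behaviour described above can be sketched as a toy model (hypothetical code for illustration, not the actual JBIG2 coder): glyph bitmaps are compared with a crude pixel-difference measure, and any glyph within a threshold of an already-stored symbol is replaced by a reference to that symbol.

```python
# Toy model of JBIG2-style symbol substitution (illustration only,
# not the real coder). Glyphs are flat bitmaps; any glyph "similar
# enough" to an already-stored symbol is replaced by a reference to it.

def hamming(a, b):
    """Number of differing pixels between two equal-sized bitmaps."""
    return sum(p != q for p, q in zip(a, b))

def compress(glyphs, threshold):
    """Return (symbol_table, references). Each glyph becomes an index
    into the symbol table; near-matches share one stored symbol."""
    symbols, refs = [], []
    for g in glyphs:
        for i, s in enumerate(symbols):
            if hamming(g, s) <= threshold:   # "similar enough"
                refs.append(i)               # reuse the existing symbol
                break
        else:
            symbols.append(g)                # store a new symbol
            refs.append(len(symbols) - 1)
    return symbols, refs

# 3x5 bitmaps of a blurry '8' and '6' that differ in a single pixel.
eight = (1,1,1, 1,0,1, 1,1,1, 1,0,1, 1,1,1)
six   = (1,1,1, 1,0,0, 1,1,1, 1,0,1, 1,1,1)

symbols, refs = compress([eight, six], threshold=2)
```

With a loose threshold the '6' and the '8' collapse into one symbol, so the decoded page shows the same 8 twice; with threshold=0 the scheme degenerates into a lossless symbol dictionary and the glyphs stay distinct.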



  • @Quango said:

    Lossy is one thing. Compression that takes one value and changes it to another is the issue.

    Xerox would scan a '6', the compression routine would convert this to an '8' and then print it out like that.

    Yes, that's what a JBIG2 compression artefact looks like.

    Perhaps somebody's boss took this wonderful piece of JBIG2 marketing fluff at face value before deciding not to specify the use of its rather less squishy lossless mode:

    @jbig2.com said:

    When JBIG2 compression is done properly, any perceptual differences between the compressed file and the original will be enhancements and not degradations.

    Since nobody with the slightest bit of image processing experience could possibly doubt the veracity of such a claim, it follows that eights are better than sixes. Is that a problem?



  • @Anonymouse said:

    Lossy compression is exactly the issue here. Remember that, loosely speaking, "lossy" compression is defined as a process which reduces the amount of data used to represent the "original" in a non-reversible way, i.e. you can't reconstruct the original from the compressed data, as opposed to lossless compression, which allows you to reconstruct the original down to the last bit.

    So most common lossy image compression algorithms achieve a higher compression rate (as compared to lossless ones) by sacrificing image quality. However, the algorithm in question, JBIG2, simply replaces parts (pixel blocks) of the image which are "similar enough" to each other with a single instance of the block, thus saving space by only having to encode the block itself once. This allows it (under ideal circumstances) to maintain a fairly high image quality; however, that trade-off is paid for with a loss in, well, let's call it the "fidelity" of the image rather than its quality.

    Exactly. So, TRWTF is anyone that uses JBIG2 for any purpose without testing the compressor to determine if it maintains the required fidelity. I would include the standards bodies and Adobe in this "anyone" for ratifying it as two official standards and including it in PDF without making the limitations very clear. If the original authors were not aware of this sort of "corruption" then they have no business designing compression algorithms.

    As to where the buck would stop if anyone tried to claim losses due to this, I expect it would stop with the person who clicked the scan button. I can't believe that someone like Xerox, in the duplicating business for many years, doesn't guard itself very well against any loss or damage to customers caused by inaccurate duplication.



  • @flabdablet said:

    it follows that eights are better than sixes. Is that a problem?
     

    8 > 6

    obviously



  • @dhromed said:

    @flabdablet said:

    it follows that eights are better than sixes. Is that a problem?
     

    8 > 6

    obviously

    RESOLVED INVALID



  • @Anonymouse said:

    @PJH said:
    @flabdablet said:
    Lossy compression is lossy.
    Lossy compression isn't the issue here.
    Lossy compression is exactly the issue here.

    So most common lossy image compression algorithms achieve a higher compression rate (as compared to lossless ones) by sacrificing image quality. However, the algorithm in question, JBIG2, simply replaces parts (pixel blocks) of the image which are "similar enough" to each other with a single instance of the block, thus saving space by only having to encode the block itself once.

    There's something more than just a lousy lossy compression algorithm going on here.

    A 6 does sort of look like an 8 with a small piece missing, so it would seem reasonable that the algorithm would fill in a couple of pixels, making the 6 look like an 8. But that's not really the case here. The author points this out in another example near the bottom of the article:

    This is not a simple pixel error either, one can clearly see the
    characteristic dent the 8 has on the left side in contrast to a 6.

    In other words, the algorithm isn't just filling in pixels to make a 6 look like an 8, it's actually changing a 6 into an 8. In the first example in the article you can see where a 3 becomes a 2, a 1 becomes a 3 and a 2 becomes a 3. Those numbers are sufficiently different from each other that it just can't be explained as "it replaces parts of the image which are "similar enough" to each other".

    Even though they say there's no OCR going on here, OCR being fed by shitty compression is the most logical answer.

     



  • This is all so they can claim the internal HD stores 100,000 pages! instead of 10,000 pages!

    ... meanwhile, nobody using the copier even knows it *has* an internal HD or how to use it.



  • @blakeyrat said:

    This is all so they can claim the internal HD stores 100,000 pages! instead of 10,000 pages!

    ... meanwhile, nobody using the copier even knows it *has* an internal HD or how to use it.

    Putting a hard drive in a copier is TRWTF.

     


  • @El_Heffe said:

    @blakeyrat said:

    This is all so they can claim the internal HD stores 100,000 pages! instead of 10,000 pages!

    ... meanwhile, nobody using the copier even knows it *has* an internal HD or how to use it.

    Putting a hard drive in a copier is TRWTF.

     

    Household appliance? It's a PC now. Printer? PC. TV? PC. Phone? PC. Refrigerator? PC. PC? Mac PC.


  • Overused joke is overused, news at midnight. Movie continues at 0:30.



  • The real WTF is posting this one day after it was on Ars' front page. Film at 8/21/1963, 19:00.



  • @TGV said:

    The real WTF is posting this one day after it was on Ars' front page.

    Are you presuming everyone on here bothers reading Ars?



  • @El_Heffe said:

    There's something more than just a lousy lossy compression algorithm going on here.

    A 6 does sort of look like an 8 with a small piece missing, so it would seem reasonable that the algorithm would fill in a couple of pixels, making the 6 look like an 8. But that's not really the case here. The author points this out in another example near the bottom of the article:

    This is not a simple pixel error either, one can clearly see the
    characteristic dent the 8 has on the left side in contrast to a 6.

    In other words, the algorithm isn't just filling in pixels to make a 6 look like an 8, it's actually changing a 6 into an 8. In the first example in the article you can see where a 3 becomes a 2, a 1 becomes a 3 and a 2 becomes a 3. Those numbers are sufficiently different from each other that it just can't be explained as "it replaces parts of the image which are "similar enough" to each other".

     

    Actually, yes, it can. The scanners apparently use JBIG2 compression, which, going by rudimentary research on Wikipedia, works by encoding all 'similar' repetitions of a section of the image as a single symbol. Which leads to the issue that, in lossy mode, a blurry '6' is similar enough to a blurry '8' from the same image that both will be encoded as an '8', and once decoded you'll see that same 8 repeated.

    Now, this is still a WTF if Xerox didn't make it apparent that the scanner software used lossy compression, or provide settings for proper lossless compression of important images with text. However, I suspect that they did, but most users simply don't bother to read the documentation that well and just left it at the default setting.

     



  • @flabdablet said:

    Perhaps somebody's boss took this wonderful piece of JBIG2 marketing fluff at face value before deciding not to specify the use of its rather less squishy lossless mode:

    That's the problem: "Perceptually" lossless just means it looks the same, when all you're looking at is the page as a whole. In other words, an optical illusion perfectly calculated to play on a well-known human bias.



  • @dhromed said:

    8 > 6

    Is it just me, or does this look like someone about to get a blowjob?



  • @PJH said:

    @TGV said:
    The real WTF is posting this one day after it was on Ars' front page.
    Are you presuming everyone on here bothers reading Ars?



  • @Quango said:


    Lossy is one thing. Compression that takes one value and changes it to another is the issue.

    That's pretty much the definition of lossy compression.

     

    Xerox would scan a '6', the compression routine would convert this to an '8' and then print it out like that.

    That's what lossy compression does. It takes the original information and replaces it with something judged "sufficiently similar" but that takes less space to store. Here, the compression engine had an "86958" image already encoded and judged that a "66958" image was sufficiently similar, so it saved space by replacing one image piece with a reference to the other. Many lossy compression algorithms do this.

     


  • @joelkatz said:

    Many lossy compression algorithms do this.

    Most don't result in lots of alterations to printed values when doing local copying. (The ultimate in lossy compression is to convert the entire image into a single color and store just that. Yes, it loses a little bit of information, but the compression ratio is excellent and the algorithm is fast!!!)
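The parenthetical "ultimate" compressor above is easy to write down; a deliberately absurd sketch, assuming a grayscale page represented as a flat list of integer pixels:

```python
# The "ultimate" lossy compressor: the compressed file is one number
# (the mean intensity); decompression paints every pixel with it.
# Compression ratio: superb. Fidelity: none whatsoever.

def compress(pixels):
    return sum(pixels) // len(pixels)     # the entire compressed "file"

def decompress(mean, width, height):
    return [mean] * (width * height)      # every pixel identical

page = [0, 0, 255, 255]                   # a 2x2 "document"
blob = compress(page)
restored = decompress(blob, 2, 2)
```

Fast, tiny, and it turns every scanned contract into a uniformly gray page, which is at least an honest kind of data loss.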



  • @joelkatz said:

    @Quango said:


    Lossy is one thing. Compression that takes one value and changes it to another is the issue.

    That's pretty much the definition of lossy compression.

     

    Xerox would scan a '6', the compression routine would convert this to an '8' and then print it out like that.

    That's what lossy compression does. It takes the original information and replaces it with something judged "sufficiently similar" but that takes less space to store. Here, the compression engine had an "86958" image already encoded and judged that a "66958" image was sufficiently similar, so it saved space by replacing one image piece with a reference to the other. Many lossy compression algorithms do this.

     

    Which is why you are supposed to test your lossy compression algorithms and parameters and check that the choices fit with the application. This one obviously does not fit, especially for a compression setting that the UI calls "normal".



  • @witchdoctor said:

    Which is why you are supposed to test your lossy compression algorithms and parameters and check that the choices fit with the application. This one obviously does not fit, especially for a compression setting that the UI calls "normal".

    Jeez. All they had to do was read this 328-page user manual for the copier, dated December 2010. It's right there at the top of page 107:

    The Quality / File Size settings allow you to choose between scan image quality and file size.

    • Normal/Small produces small files by using advanced compression techniques. Image quality is
    acceptable but some quality degradation and character substitution errors may occur with some
    originals.

     



  • @joelkatz said:

    That's what lossy compression does. It takes the original information and replaces it with something judged "sufficiently similar" but that takes less space to store. Here, the compression engine had an "86958" image already encoded and judged that a "66958" image was sufficiently similar, so it saved space by replacing one image piece with a reference to the other. Many lossy compression algorithms do this.

    You are technically correct, but if you think that's an acceptable use of the term "lossy compression" you're an idiot. Lossy compression is supposed to lose non-important information like color tones or shapes or gradients, not actual information. Otherwise you might as well replace every page with a blank page.



  • @anonymous234 said:

    non-important information like color tones or shapes or gradients, not actual information.
     

    There is no distinction between the two.



  • @anonymous234 said:

    @joelkatz said:
    That's what lossy compression does. It takes the original information and replaces it with something judged "sufficiently similar" but that takes less space to store. Here, the compression engine had an "86958" image already encoded and judged that a "66958" image was sufficiently similar, so it saved space by replacing one image piece with a reference to the other. Many lossy compression algorithms do this.

    You are technically correct, but if you think that's an acceptable use of the term "lossy compression" you're an idiot. Lossy compression is supposed to lose non-important information like color tones or shapes or gradients, not actual information. Otherwise you might as well replace every page with a blank page.

    It is an acceptable use of the term lossy compression. The exact same algorithm with the same parameters used on a picture of a cat wouldn't be much of a problem. They just picked the worst possible combination of lossy compression algorithm and parameters for compressing scans of documents. And called that combination "normal" compression in an office copier.

    And putting a warning somewhere in the manual and in a small message at the bottom of the config screen does not excuse them calling it "normal". Because when you read normal in your office copier you don't expect a warning like "this might randomly replace text in your documents to save space".



  • @witchdoctor said:

    It is an acceptable use of the term lossy compression. The exact same algorithm with the same parameters used on a picture of a cat wouldn't be much of a problem. They just picked the worst possible combination of lossy compression algorithm and parameters for compressing scans of documents. And called that combination "normal" compression in an office copier.

    Makes me wonder about Ray Kurzweil's claim that lossy compression is appropriate for uploaded human brains. Maybe he's hoping people like me will be replaced with duplicate copies of him.



  • @anonymous234 said:

    @joelkatz said:
    That's what lossy compression does. It takes the original information and replaces it with something judged "sufficiently similar" but that takes less space to store. Here, the compression engine had an "86958" image already encoded and judged that a "66958" image was sufficiently similar, so it saved space by replacing one image piece with a reference to the other. Many lossy compression algorithms do this.

    You are technically correct, but if you think that's an acceptable use of the term "lossy compression" you're an idiot. Lossy compression is supposed to lose non-important information like color tones or shapes or gradients, not actual information. Otherwise you might as well replace every page with a blank page.

    That would make the very same algorithm "lossy compression" if color tones were unimportant but no longer "lossy compression" if used in an application where precise color tones were vital. The description "lossy compression" describes the algorithm itself, not the context in which it is used.

     



  • TRWTF is using lossy compression in a copier in the first place, though this seems related to the other RWTF of giving them a hard drive.



  • @Medinoc said:

    TRWTF is using lossy compression in a copier in the first place,

     

    Kind of. As if 1-bit reproduction isn't bad enough already.

     



  • @dhromed said:

    @Medinoc said:
    TRWTF is using lossy compression in a copier in the first place,
    Kind of. As if 1-bit reproduction isn't bad enough already.

    I can't tell if this is supposed to be a joke, a case of a timepod being opened after 30 years, or Dhromed's company buying copiers exclusively from the city dump.



  • @blakeyrat said:

    @dhromed said:
    @Medinoc said:
    TRWTF is using lossy compression in a copier in the first place,
    Kind of. As if 1-bit reproduction isn't bad enough already.

    I can't tell if this is supposed to be a joke, a case of a timepod being opened after 30 years, or Dhromed's company buying copiers exclusively from the city dump.

    blakey, do you seriously need more than one non-background color for your scanned text?

