What's the point?

Junkieman

Rewriting a legacy system is no trivial matter, and this particular system is a story on its own, but it is the reason I learned about the "packed decimal" number format.

To put it simply, packed decimal is a way of storing numbers in a "packed" or "compressed" form. This is achieved by eliminating the decimal point: you always assume two decimal places when converting. More importantly (read: bad), the last digit can be a number or a letter.

To skip all the details, the idea is this:

If the last digit is an alpha character, get the ASCII value for it, subtract 16, and get the ASCII character of the result. That is the last digit's value. Oh, and make the result negative (because it contained an alpha character).

Thus, 1234A becomes -12341 (A = ascii(val(A) - 16) = 1, since A is ASCII 65 and 65 - 16 = 49 is the character 1)
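For anyone who wants to try the rule, here is a minimal Python sketch of it as described above, not the actual pre-processor. The function name `decode_field` and the `implied_decimals` parameter are made up for illustration. Under plain ASCII the rule as stated maps A to 1 (65 - 16 = 49, the code for the character 1), so only the letters A through I produce a valid digit; anything else is rejected.

```python
from decimal import Decimal


def decode_field(raw: str, implied_decimals: int = 2) -> Decimal:
    """Decode a numeric text field whose last character may carry the sign.

    Hypothetical helper: applies the rule from the post (letter -> ASCII
    code minus 16 -> digit, and the number becomes negative).
    """
    raw = raw.strip()
    if not raw:
        raise ValueError("empty field")
    body, last = raw[:-1], raw[-1]
    negative = last.isalpha()
    if negative:
        # Rule from the post: ASCII value, subtract 16, back to a character.
        last = chr(ord(last) - 16)
    digits = body + last
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a valid field: {raw!r}")
    # Shift the implied decimal point into place.
    value = Decimal(int(digits)).scaleb(-implied_decimals)
    return -value if negative else value
```

With the two implied decimal places, `decode_field("1234A")` gives `Decimal("-123.41")`; passing `implied_decimals=0` gives the unscaled `Decimal("-12341")`.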

Why?! you ask. Well, to save space, of course! Back in the day storage was a big issue, so I understood why they did what they did.

The perversion in all this? Our system had to import records from a fixed-width flat file that used this format. I wrote a pre-processor to read in the flat file, calculate the correct values, and write them out to a new text file. Immediately I noticed the new file was dramatically smaller than the original: 8 MB vs 40 MB (a test file; live data goes up to 1 GB).

"Oh crap," I thought, "it's probably missing some rows." Confused as I was, after verifying that all rows and columns were accounted for, it clicked: the original is a fixed-column-length format, while I wrote a tab-delimited file (which is preferred for our system).

The fixed-length file pads every field with superfluous whitespace, which bloats the file like you won't believe: about 5 times bigger than it needs to be.
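To illustrate where the size difference comes from, here is a rough Python sketch of turning one fixed-width record into a tab-delimited line. The column widths and sample values are invented; the real file layout isn't shown here.

```python
def fixed_to_tsv(line: str, widths: list[int]) -> str:
    """Slice a fixed-width record by column widths and join with tabs."""
    fields = []
    pos = 0
    for width in widths:
        # Each field occupies exactly `width` characters; drop the padding.
        fields.append(line[pos:pos + width].strip())
        pos += width
    return "\t".join(fields)


# Hypothetical layout: 10-char ID, 20-char name, 8-char amount.
WIDTHS = [10, 20, 8]
sample = "42".ljust(10) + "Widget".ljust(20) + "1234A".rjust(8)
converted = fixed_to_tsv(sample, WIDTHS)
```

In this made-up record the fixed-width line is 38 characters and the tab-delimited one is 15, so most of the original is padding.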

"WHY?!" I shout at the roof. "Why go to the trouble of using the 'packed decimal' number format when tab-delimited was clearly the solution?" With storage being such a big issue, they still used the most bloated form of transferring data.

Benanov

We have a saying at my office.

Why?

Because they're idiots.

(I've seen so many little WTFs at this place that my girlfriend now uses the phrase.)

rox_midge

Probably because the file is either coming from or headed to a mainframe, or the file format was designed by someone who came from a mainframe background.

A fixed-width file format has a lot of advantages when you're trying to process a large amount of data. Among them, you can seek to a specific record without having to parse the entire file up to that point. Mainframes use them a lot so that they can chew through ridiculous amounts of data very, very quickly.

modelnine

Storing data in a fixed-width column set makes sense if you want to be able to seek in the data fast (i.e., because you don't read/process it sequentially). You can calculate the start of any record in the file by simply multiplying the record width in bytes by the record ID you're looking for and seeking to that offset, which is exactly what pretty much any database does to optimize row accesses, besides having indexes (and I guess you're loading the data from the TSV file into one).
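As a rough sketch of that offset calculation in Python: the record width, the layout, and the name `read_record` are all assumptions for illustration, and the records here have no line terminator (if the real file has one, it counts toward the width).

```python
import io

RECORD_WIDTH = 16  # hypothetical: 4-char ID + 7-char name + 5-char amount


def read_record(f, record_id: int, width: int = RECORD_WIDTH) -> bytes:
    """Jump straight to record `record_id` (0-based) without scanning."""
    f.seek(record_id * width)  # offset = record width * record ID
    data = f.read(width)
    if len(data) != width:
        raise IndexError(f"no record {record_id}")
    return data


# Build a small in-memory "file" of three fixed-width records.
rows = [("0001", "ALPHA", "1234A"), ("0002", "BETA", "00500"), ("0003", "GAMMA", "0099C")]
blob = b"".join(f"{rid:<4}{name:<7}{amount:>5}".encode("ascii") for rid, name, amount in rows)
third = read_record(io.BytesIO(blob), 2)
```

Doing the same on a tab-delimited file would mean reading every earlier line first, because the line lengths vary.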

As such, this part isn't bad design, and encoding the sign in the last digit saves you one byte per number, i.e. 100/recordwidth % of the total storage for a fixed-length field, which I guess is quite some space (in the megabytes) for the number of rows you're handling, even on the test file.

What I personally find disturbing though is the fact that the numbers are stored as plaintext; encoding them as binary integers of a fixed width would make so much more sense...
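A quick sketch of what that could look like with Python's `struct` module, assuming a signed 32-bit big-endian integer holding the amount in hundredths. The width, byte order, and function names are my choices for illustration, not anything taken from the file format in question.

```python
import struct

# ">i" = big-endian signed 32-bit integer, always 4 bytes.
AMOUNT_FORMAT = ">i"


def encode_amount(hundredths: int) -> bytes:
    """Store an amount (in hundredths) as a fixed-width binary integer."""
    return struct.pack(AMOUNT_FORMAT, hundredths)


def decode_amount(data: bytes) -> int:
    """Inverse of encode_amount."""
    return struct.unpack(AMOUNT_FORMAT, data)[0]


packed = encode_amount(-12341)
```

Four bytes cover every value from -2,147,483,648 to 2,147,483,647, sign included, and the field stays fixed width so seeking still works. The catch is that both sides have to agree on width and byte order, which plain text sidesteps.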

belgariontheking

@Benanov said:

(I've seen so many little WTFs at this place that my girlfriend now uses the phrase.)

Yes, and my wife.

Anyways, this is wildly different than the packed-decimal that I'm used to, which is where someone figured out that they only need four bits to describe a digit, so they could fit two digits in one byte.
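Roughly, the two-digits-per-byte idea looks like this in Python. This is a bare-bones sketch of the nibble packing only, with made-up function names; it handles unsigned digit strings and leaves out any sign handling.

```python
def pack_bcd(digits: str) -> bytes:
    """Pack a string of decimal digits two per byte (one per 4-bit nibble)."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"digits only, got {digits!r}")
    if len(digits) % 2:
        digits = "0" + digits  # pad to an even number of digits
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


def unpack_bcd(data: bytes) -> str:
    """Turn packed bytes back into a digit string (keeps any pad zero)."""
    return "".join(f"{byte >> 4}{byte & 0x0F}" for byte in data)


packed_digits = pack_bcd("12340")
```

Here the five digits of "12340" fit in three bytes (hex `012340`) instead of five characters of text.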

@Benanov said:

someone figured out that they only need four bits to describe a digit, so they could fit two digits in one byte.

That means they can fit a hundred different numbers in a byte! Amazing!

m0ffx

@belgariontheking said:

Anyways, this is wildly different than the packed-decimal that I'm used to, which is where someone figured out that they only need four bits to describe a digit, so they could fit two digits in one byte.

There are sound reasons for using "Binary Coded Decimal". Sounds like in the system you experienced, they had been storing the numbers as text, then realised that was double what was needed.

belgariontheking

@m0ffx said:

There are sound reasons for using "Binary Coded Decimal". Sounds like in the system you experienced, they had been storing the numbers as text, then realised that was double what was needed.

Yep. It was a mainframe system that ran COBOL and PL/1. I bet that takes some of you back. Black screens with floating green letters?

pitchingchris

@belgariontheking said:

Black screens with floating green letters?

Don't forget black and orange!