Playing hardball with vendors

slurpy

Back in 1996, I worked for a prominent financial institution that outsourced the task of generating monthly internal and customer statements. The vendor would print out the customer statements, and also the internal versions, which included things like how much the commission was for each customer transaction. Both were provided to us in paper form. The customer statements were mailed, and the internal statements were microfiched, and subsequently stored as bitmaps. The vendor, however, maintained the database with the actual info that was used to generate the statements.

Someone got the idea to allow customers to call up old statements via the web. Simple enough, just re-generate the statements from the source, but don't print the commission info. Except that the vendor would not give us the info in the database, at least not without an exorbitant bribe, erm, fee. They did agree to give us the data in Windows Metafile (wmf) format.

For those not familiar with wmf, it's basically a list of records, each representing one character/line/shape, with font, color, x-y location, etc. The list is in no apparent sequence, and the order in which you encounter the text characters in the file has no bearing on how they appear on the page.

My boss asked me for leverage. Not a solution mind you, just leverage.

After much playing with it, I came up with the following that would at least allow us to regenerate the page with the relevant info omitted (no, ocr was not a viable option at the time).

1. Run some freeware I found on the web to convert wmf to Java source
2. Run a shell script on said Java to get the list of character/x-y coords, and put the records in an array
3. Write some code to scan the array (the ordering is essentially random) to effect 'strcmp' (e.g.: find any instance of the first char in the string to be found, then find any instance of the next char that has approximately the same y-coord, but a slightly increased x-coord, etc.)
4. Write some higher level code to locate the extent of a column of numbers beneath a heading, given its starting x-y and width (basically, scan for the max y-coord below the 'header' string where the first non-digit appears - the next header)
5. Remove all characters in this region from the list
6. Repeat for multiple regions (columns of commission values, etc.) that had to be removed
7. Recompose the Java program without the commission info
8. Compile it
9. Run it to generate the page
10. Save the page for subsequent display
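
For illustration, steps 3 to 5 can be sketched roughly as below. This is a minimal Python sketch, not the original code (which was generated Java plus a shell script). The `Glyph` record, the function names, the character pitch, the tolerances, and the sample page are all made up for the example, and y is assumed to grow downward on the page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Glyph:          # hypothetical record: one character and its position
    ch: str
    x: int
    y: int            # assumed to grow downward on the page

def find_string(glyphs, text, pitch=10, x_tol=3, y_tol=3):
    """Step 3: find glyphs spelling `text` left to right, or return None."""
    if not text:
        return []
    for start in glyphs:
        if start.ch != text[0]:
            continue
        run = [start]
        for want in text[1:]:
            prev = run[-1]
            # The list is unordered, so every character needs a full scan.
            nxt = next((g for g in glyphs
                        if g.ch == want
                        and abs(g.y - prev.y) <= y_tol
                        and abs(g.x - (prev.x + pitch)) <= x_tol), None)
            if nxt is None:
                break
            run.append(nxt)
        else:
            return run    # inner loop finished without a miss
    return None

def column_bottom(glyphs, x0, width, header_y):
    """Step 4: y of the first non-numeric glyph below the header, or None."""
    below = [g for g in glyphs if x0 <= g.x <= x0 + width and g.y > header_y]
    stops = [g.y for g in below if not (g.ch.isdigit() or g.ch in ".,-")]
    return min(stops) if stops else None

def remove_region(glyphs, x_min, x_max, y_min, y_max):
    """Step 5: drop every glyph inside the rectangle (bounds inclusive)."""
    return [g for g in glyphs
            if not (x_min <= g.x <= x_max and y_min <= g.y <= y_max)]

# Made-up page: a "COMM" header, three digits under it, the start of the
# next header ("T"), and one unrelated character, in scrambled order.
page = [
    Glyph("2", 110, 71), Glyph("M", 130, 50), Glyph("A", 10, 70),
    Glyph("C", 100, 50), Glyph("T", 100, 120), Glyph("7", 100, 90),
    Glyph("O", 111, 49), Glyph("1", 100, 70), Glyph("M", 119, 51),
]
header = find_string(page, "COMM")
top = max(g.y for g in header)
left, width = header[0].x, 40
bottom = column_bottom(page, left, width, top)
cleaned = remove_region(page, left, left + width, top + 1, bottom - 1)
```

The search here is greedy: it takes the first glyph that fits and never backtracks, which is enough for a sketch but could miss a match on a crowded page. It also only handles strings without spaces, since a space may not exist as a record at all.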

Given the hardware available at the time, it took about 9 minutes per page to complete. But it could be done offline: we could throw all the computers in the firm at it (at night, plus background processing during the day) and be done in about a month.

My boss told the vendor to shove their bribe, erm, fee, because we had a workable solution.

They gave us the data in the format we needed at no charge just to keep the contract.

stevekj

There are quite a lot of WTFs here!

Outsourcing not just the task of generating customer and internal statements, but the actual data itself
Using bitmaps of microfiched printouts as backups (almost as good as printing a web page out, putting it on a wooden table, and taking a picture!)
And of course WMF format, which is close enough typographically to WTF that it needs no further elaboration.

I like the final solution in all its Rube Goldberg complexity, and the fact that it was successfully used to strongarm the vendor!

Benanov

Wow.

Just...wow.

You one-upped an insane vendor by being more insane, in order to get the real solution. Good show.

My internal big-O calculator just overflowed trying to figure out just how inefficient that was.

The real WTF is that you didn't sort the array made in step 2. ;)

slurpy

@Benanov said:

Wow.

Just...wow.

You one-upped an insane vendor by being more insane, in order to get the real solution. Good show.

My internal big-O calculator just overflowed trying to figure out just how inefficient that was.

The real WTF is that you didn't sort the array made in step 2. ;)

That was, in fact, my first thought, but it didn't really help. If you sorted by character, you'd end up with all the A's on the page together, then all the B's, etc. If you sorted by x/y coordinates, then the letters would be out of sequence - with no real organization whatsoever.

I used the term 'array' a bit liberally; I wound up building dual cross-linked trees so that you could scan by letter, then use a pointer to get to a hash of coordinates. It was hokey, but drastically sped it up.
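
A two-level index along those lines might look roughly like the Python sketch below. It is not the original structure, which is described only as cross-linked trees with a hash of coordinates. Here the first level is a dict keyed by character and the second is a dict keyed by a coarse grid cell. The `Glyph` tuple, the `CELL` size, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import NamedTuple

class Glyph(NamedTuple):   # hypothetical record
    ch: str
    x: int
    y: int

CELL = 16  # assumed grid cell size, in the same units as the coordinates

def build_index(glyphs):
    """Map character -> grid cell -> list of glyphs in that cell."""
    index = defaultdict(lambda: defaultdict(list))
    for g in glyphs:
        index[g.ch][(g.x // CELL, g.y // CELL)].append(g)
    return index

def lookup(index, ch, x, y, x_tol, y_tol):
    """All glyphs `ch` within tolerance of (x, y), visiting only nearby cells."""
    cells = index.get(ch)
    if not cells:
        return []
    hits = []
    for cx in range((x - x_tol) // CELL, (x + x_tol) // CELL + 1):
        for cy in range((y - y_tol) // CELL, (y + y_tol) // CELL + 1):
            for g in cells.get((cx, cy), ()):
                if abs(g.x - x) <= x_tol and abs(g.y - y) <= y_tol:
                    hits.append(g)
    return hits

# Made-up data for a quick check.
glyphs = [Glyph("A", 10, 10), Glyph("A", 200, 10), Glyph("B", 12, 11)]
idx = build_index(glyphs)
near = lookup(idx, "A", 12, 12, 5, 5)
```

With an index like this, "find the next character near this spot" touches a handful of cells instead of rescanning every record on the page.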

slurpy

@stevekj said:

There are quite a lot of WTFs here!

Outsourcing not just the task of generating customer and internal statements, but the actual data itself
Using bitmaps of microfiched printouts as backups (almost as good as printing a web page out, putting it on a wooden table, and taking a picture!)
And of course WMF format, which is close enough typographically to WTF that it needs no further elaboration.

I like the final solution in all its Rube Goldberg complexity, and the fact that it was successfully used to strongarm the vendor!

WMF wasn't our idea - it was all the vendor was willing to give us. I had to start from there, and find a path back to a readable page. *sigh*

mallard

I suppose the commission info wasn't in predictable locations?
If it was, I would convert to a bitmap and draw white boxes over the "secret" data...

maratcolumn1

Good job. I wonder why rendering the wmf and showing customers an image
instead of text didn't work. You could tell them this is because of
security.

My story: once, working for the oldest IT company, I wrote a tool sorting
localizable and unlocalizable artwork files apart. As source data I used
a table where for each file it was noted whether it is localizable
or not. The table was literally a table, in the PDF format. No OCR
involved! I'm still proud of it, although I didn't really try to ask for a
format change.

slurpy

@mallard said:

I suppose the commission info wasn't in predictable locations?
If it was, I would convert to a bitmap and draw white boxes over the "secret" data...

The statement was formatted such that there were a series of transactions for a given day, with the commission under each. The last transaction for the day had a daily-total. Then there was some space, and another day's transactions. At the bottom of the month, there was a daily subtotal, and a monthly total. Depending upon how many transactions the person did, the line(s) with the commission-data moved vertically on the page. You had to first figure out the transaction count in order to calculate where the subtotals might be. Add to that pagination with headers, and footers that could vary in height (footnotes) and it was extremely difficult to do.

The pages had a watermark logo, so you couldn't just draw over them (of course, you would have no way of knowing that).

Also, 2 more WTFs... when they converted the original data to wmf, they must have scanned the bitmaps and done OCR, as there were infrequent, random period, comma, dash, [reverse] single quote and asterisk characters in what a human would see as open-space areas on the page. I could only surmise that there were imperfections in the paper of the original printed document, and these scanned/OCR'd as characters. I had to find and delete them, making sure not to accidentally delete any financial information.
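
One way such a cleanup could work, sketched in Python under assumptions: treat a punctuation glyph as a speck only if no letter or digit sits close to it, so that a decimal point or minus sign inside a number survives. The radii, the noise character set, and the `(ch, x, y)` tuple layout are invented for the example; the post does not say how the original code decided.

```python
NOISE_CHARS = set(".,-`'*")   # characters the post lists as stray marks

def drop_specks(glyphs, x_radius=20, y_radius=8):
    """Remove noise characters that have no letter or digit nearby.

    `glyphs` is a list of (ch, x, y) tuples; the radii are assumed values.
    """
    kept = []
    for ch, x, y in glyphs:
        if ch in NOISE_CHARS:
            has_real_neighbour = any(
                o_ch not in NOISE_CHARS
                and abs(o_x - x) <= x_radius
                and abs(o_y - y) <= y_radius
                for o_ch, o_x, o_y in glyphs
            )
            if not has_real_neighbour:
                continue          # isolated mark in open space: drop it
        kept.append((ch, x, y))
    return kept

# Made-up data: "1.5" is real financial text, the other two are specks.
sample = [("1", 100, 50), (".", 110, 50), ("5", 120, 50),
          ("*", 400, 300), ("`", 405, 302)]
cleaned = drop_specks(sample)
```

Only non-noise characters count as neighbours here, so two specks that land next to each other do not protect one another. A speck that falls right beside real text would still slip through.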

The other, more interesting wtf, was that the OCR didn't put all the characters on the same line at the same y-coordinate (I'm not talking about descenders here), nor did it place monospace characters at even x-coordinate intervals. There was a plus/minus 5 pixel vertical (random) variation, and a plus/minus 7 pixel horizontal (random) variation. Considering that it was a monospace font, it drove me nuts, as you had to do a mathematical 'squint' to see if something was next to something else. I affectionately named it the jiggle-factor.
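
The 'squint' can be written down as a pair of tolerance checks. The Python sketch below is an illustration rather than the original code. It assumes each glyph wobbles independently, so two glyphs on the same line can differ by up to twice the per-glyph variation. The `pitch` value (nominal distance between adjacent monospace characters) is hypothetical, since the post does not give it.

```python
JIGGLE_X = 7   # plus/minus horizontal variation per glyph, from the post
JIGGLE_Y = 5   # plus/minus vertical variation per glyph, from the post

def same_line(a, b):
    """True if two (ch, x, y) glyphs could share a text line."""
    return abs(a[2] - b[2]) <= 2 * JIGGLE_Y

def follows(a, b, pitch):
    """True if b plausibly sits one character cell to the right of a."""
    return same_line(a, b) and abs((b[1] - a[1]) - pitch) <= 2 * JIGGLE_X

# Made-up glyphs with an assumed pitch of 40 pixels.
a = ("1", 100, 200)
b = ("2", 146, 207)   # 46 right, 7 down: inside both tolerances
c = ("3", 180, 200)   # 80 right: two cells over, not adjacent
d = ("4", 140, 215)   # 15 down: a different line
```

The check is only unambiguous when the pitch is more than four times `JIGGLE_X` (28 pixels here). Below that, the window for "one cell over" overlaps the window for "two cells over", and something extra is needed to tell them apart.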

merreborn

OCR sucked back then. It's great now, however. Not only does it get it right, it does font matching and image extraction. There's amazing OCR software that comes with the premium JFAX account, IIRC.