How to compare data



  • The thread about good programmers made me think of this situation that happened to me several years ago.  At the time I was doing alot of work with geographical data.  We have customers with legacy systems with their geo data in tabular format.  So I wrote a Cobol program that compared their legacy data with data that came from GIS software like the ESRI products.  (The GIS is converted to tabular data first, of course.)  My program simply compared each component of a street address between the two data sets.  For example, does direction1 == direction2, streetname1 == streetname2 etc.  Can't be easier than that (although there are other issues with comparing this kind of data, but we won't go there).

    One particular customer strongly disliked the statistical results of the comparison.  (What do you mean only 63% matched??)  I verified that my program was working correctly, and reran it several times with no increase in the match rate.  So boss several steps above me (who used to be a programmer) decided to give it a shot.  Fine by me, whatever.  He did and sent the customer slightly better results.  About a week later, he sent me his program with the suggestion that I use it from now on because it worked better.  I took a look.  Here is how his program worked:

    Read in all data from file1.  For each record read, do this:

        If direction1 contains data, set the first byte of an array = 1

       If streetname1 contains data, set the second byte of the array = 1, etc until the file is completely processed....

    Okay, so he's creating basically an array of bitmaps indicating whether there is data in the file or not.  Odd...   but I continue...

    Read in all data from file2.  For each record read, do the same loop as above but incrementing each array position.  End up with the array full of ones, twos and zeroes indicating where data is.

    Finally, scan through the array again and if the array position ==2, read the records from both files and compare the actual data to see if it's the same.

    Seriously.  I had to read the program through three times, and also run it in debug to figure out how in hell it worked.  And then I ran it and my program in a "deathmatch" against the same data.  Different results but it would have taken too long to figure out exactly why.  But after getting the basica logic flow.. I don't really need to know why.  It's trash.



  • @jetcitywoman said:

    Different results but it would have taken too long to figure out exactly why.
     

    If I understand this correctly, the results are different because your version compares all the fields, while his only compares those for which both records have data, ignoring any fields which are empty in at least one of the records. 

    For instance, your version would consider "123 Some Street, Apt. 34" and "123 Some Street" to be different, while his would consider them to be the same.



  • @Someone You Know said:

    @jetcitywoman said:

    Different results but it would have taken too long to figure out exactly why.
     

    If I understand this correctly, the results are different because your version compares all the fields, while his only compares those for which both records have data, ignoring any fields which are empty in at least one of the records. 

    For instance, your version would consider "123 Some Street, Apt. 34" and "123 Some Street" to be different, while his would consider them to be the same.

    No, as far as I understand it, "123 Some Street, Apt. 34" and "123 Some Street" would be classified different in both, but "123 Some Street, Apt. 34" and "" would be skipped inthe boss^n version as would "" and "123 Some Street".

    If the results are quoted as a percentage, I guess the boss^n version would give a higher percentage because many non-matched entries would be ignored.



  • In a situation like this you simply ask the following:

    Do you want high numbers or correct numbers?

    If the answer is something other then the short "correct numbers" then you are wasting time writing any other than a random number generator with a lower bound threshold written out as a percentage.  The only way to make it consistant is to either do away with the random number and simply write "Hello World" but with numbers instead or give them the correct answer.

     When they ask why the value never changes and seems to be inaccurate, you state that it was written to simply be high, not correct as per request.


  • Winner of the 2016 Presidential Election

    @KattMan said:

    The only way to make it consistant is to either do away with the random number and simply write "Hello World" but with numbers instead or give them the correct answer.

     When they ask why the value never changes and seems to be inaccurate, you state that it was written to simply be high, not correct as per request.

    No, no, no. What you do is seed the PRNG with a checksum or hash of the data. That way the processing time is believable and the number only changes when the data changes.



  • @joe.edwards said:

    @KattMan said:

    The only way to make it consistant is to either do away with the random number and simply write "Hello World" but with numbers instead or give them the correct answer.

     When they ask why the value never changes and seems to be inaccurate, you state that it was written to simply be high, not correct as per request.

    No, no, no. What you do is seed the PRNG with a checksum or hash of the data. That way the processing time is believable and the number only changes when the data changes.

    I see.  You've done this before huh?



  • @jetcitywoman said:

    But after getting the BASICA logic flow.. I don't really need to know why.  It's trash.

    TRWTF is your boss used BASICA. Why did he use a line-number BASIC?



  • @jetcitywoman said:

    Seriously.  I had to read the program through three times, and also run it in debug to figure out how in hell it worked.  And then I ran it and my program in a "deathmatch" against the same data.  Different results but it would have taken too long to figure out exactly why.  But after getting the basica logic flow.. I don't really need to know why.  It's trash
     

    I did the same thing at a job two years ago.  My boss had written an address-mashing program in C several years prior (he's not a very good programmer) so he wanted a new one written, with the idiotic stipulation that the results be the same.  I wrote a program in C++ that did approximately the same thing, but I couldn't figure out how to reproduce some of the bugs his code had, so he wouldn't let us switch to it for actual data processing...



  • @samanddeanus said:


    TRWTF is your boss used BASICA. Why did he use a line-number BASIC?

     

    Once upon a time that's all there was.



  • @Heron said:

    @jetcitywoman said:

    Seriously.  I had to read the program through three times, and also run it in debug to figure out how in hell it worked.  And then I ran it and my program in a "deathmatch" against the same data.  Different results but it would have taken too long to figure out exactly why.  But after getting the basica logic flow.. I don't really need to know why.  It's trash
     

    I did the same thing at a job two years ago.  My boss had written an address-mashing program in C several years prior (he's not a very good programmer) so he wanted a new one written, with the idiotic stipulation that the results be the same.  I wrote a program in C++ that did approximately the same thing, but I couldn't figure out how to reproduce some of the bugs his code had, so he wouldn't let us switch to it for actual data processing...

     

    Sometimes you can't fix bugs because other layers or processes are counting on those bugs being in place.  Windows has lots of hacks in it where they continue reproducing bugs from previous versions for specific pieces of software.  Not that this is what your boss was doing, but I just wanted to say that there could be a reason for this. 



  • @tster said:

    Sometimes you can't fix bugs because other layers or processes are counting on those bugs being in place.  Windows has lots of hacks in it where they continue reproducing bugs from previous versions for specific pieces of software.  Not that this is what your boss was doing, but I just wanted to say that there could be a reason for this. 

     

    Well he insisted on it because we were filtering existing addresses in the database against the output of the program, and he didn't want to change the way the filtering was done.  Ah well.  I'm glad I left that job behind for greener pastures...



  •  I remember reading a PDF online, complete with examples, about this exact phenomenon, where old Windows bugs had to be maintained because programs depended on them, but I've lost track of it; any chance anybody here remembers what I'm talking about?



  • @Veinor said:

     I remember reading a PDF online, complete with examples, about this exact phenomenon, where old Windows bugs had to be maintained because programs depended on them, but I've lost track of it; any chance anybody here remembers what I'm talking about?

     

    Probalby was the book "The old new thing". The author also got a blog explaining some more recent stuff.



  • @Veinor said:

     I remember reading a PDF online, complete with examples, about this exact phenomenon, where old Windows bugs had to be maintained because programs depended on them, but I've lost track of it; any chance anybody here remembers what I'm talking about?

     

    I never used any of them, but wasn't that somewhat fixed with compatibility mode? e.g. if you select windows 98 in vista compatibility mode it will emulate (some of?) the bugs windows 98 had, among other things ofcourse.

    Maybe we should ask a wine developer. Somehow I've got a hunch they have a better understanding of the windows API bugs than microsoft themselves.



  • @tster said:

    @Heron said:

    @jetcitywoman said:

    Seriously.  I had to read the program through three times, and also run it in debug to figure out how in hell it worked.  And then I ran it and my program in a "deathmatch" against the same data.  Different results but it would have taken too long to figure out exactly why.  But after getting the basica logic flow.. I don't really need to know why.  It's trash
     

    I did the same thing at a job two years ago.  My boss had written an address-mashing program in C several years prior (he's not a very good programmer) so he wanted a new one written, with the idiotic stipulation that the results be the same.  I wrote a program in C++ that did approximately the same thing, but I couldn't figure out how to reproduce some of the bugs his code had, so he wouldn't let us switch to it for actual data processing...

     

    Sometimes you can't fix bugs because other layers or processes are counting on those bugs being in place.  Windows has lots of hacks in it where they continue reproducing bugs from previous versions for specific pieces of software.  Not that this is what your boss was doing, but I just wanted to say that there could be a reason for this. 

     

    This practice isn't limited to just software.  Microsoft's OOXML "standard" (in quotes because it's full of fail) mandates some incorrect behavior, such as not counting the year 2000 as a leap year because some more than a decade old version of Excel introduced that bug and people apparently have spreadsheets that depend on it.

    Why couldn't they just say: "We discovered a bug in the product.  It will be fixed in the next version, go fix your things if they depend on this." 



  • @dtech said:

    @Veinor said:

     I remember reading a PDF online, complete with examples, about this exact phenomenon, where old Windows bugs had to be maintained because programs depended on them, but I've lost track of it; any chance anybody here remembers what I'm talking about?

     

    I never used any of them, but wasn't that somewhat fixed with compatibility mode? e.g. if you select windows 98 in vista compatibility mode it will emulate (some of?) the bugs windows 98 had, among other things ofcourse.

    Maybe we should ask a wine developer. Somehow I've got a hunch they have a better understanding of the windows API bugs than microsoft themselves.

    They started this long befoe Windows XP. Windows 95 does not have the version number 4.0 because at the time many programs detected "modern Windows versions" by checking for (WINVER_MAJOR >= 3 && WINVER_MINOR >= 11). Since 0 < 11, Windows 4.0 obviously had to be older than Windows 3.11. (I think they might have even checked for WINVER_MAJOR == 3.)

    What did Microsoft do? Make Windows 95 version 3.95, since a lot of applications depended on that check to run and they couldn't determine which result an old app expected.

    There's a lot of stuff like that; with Windows 2000 they introduced shivs– er, shims, which are small bug-reproduction libraries that are loaded when certain programs are detected. Is it a wonder the Windows codebase is an unmaintainable mess with all the application-specific hacks and workarounds they use?

    I don't think it's Microsoft's fault alone, either – the debacle between Apple and Adobe over the former phasing out Carbon showed that even if you put a big "DEPRECATED! DO NO USE!" sticker on an API, the developers will still happily use it and expect it to work. When you get big enough that 90% of all developers target your platform (and an even higher percentage of all bad developers, I'd think) you find that you can't easily fix bugs wthout making your next release incompatible with several major programs, the developers of which are unwilling to change their two-year release cycle to push a compatibility update. Since your next release is supposed to support those apps you can choose to either turn your codebase into a mudball or disappoint your customers. And since Microsoft is essentially a marketing company it was clear who made that decision.



  • @tdb said:

    Why couldn't they just say: "We discovered a bug in the product.  It
    will be fixed in the next version, go fix your things if they depend on
    this." 

    [url]http://blogs.msdn.com/oldnewthing/archive/2003/12/24/45779.aspx[/url]

    Real world sucks. 8=[]



  • Nice tangent, but it doesn't accurately reflect the WTF that I was trying to communicate.  Let me clarify that there weren't any bugs in my program.  The statistics were accurate, albeit low because the tabular data in the customers old geo file had been hand-entered for the last 30 years, and since it was pre-GIS and ascii tabular data it allowed them to type any old bullcrap they wanted into it.  so instead of well-formatted addresses, very often I would find things like "0 OAK BRIDGE SOUTH SIDE NEAR MAIN ST".  So while the nifty new GIS contained "Main St" and might have contained an entry for the bridge, it did not have that exact spelling of that exact location. 

    Since our tabular file was stored in "old fashioned" flat indexed files, it was broken into fields.  Street direction, name, type, low address number, high address number, etc.  so the above example might have been stored by the system as 

    low address = 0

    high address = 0

    direction = '' (spaces or nulls)

    name = "OAK BRIDGE SOUTH SIDE NEAR MAIN"

    type = "ST" 

    No matter how big a stick I tried, I could NOT pound it into my manager's head that these things made the data un-comparable.   No, I was just being uncooperative.  These were the "other issues" that I alluded to in the  OP.  The primary WTF, though was how he eschewed my KISS logic for something so blazingly obscure and unnecessarily difficult.



  • @dtech said:

    I never used any of them, but wasn't that somewhat fixed with compatibility mode? e.g. if you select windows 98 in vista compatibility mode it will emulate (some of?) the bugs windows 98 had, among other things ofcourse.
    Until Windows XP, all compatibility issues had to be solved inside the Windows core, resulting in several special cases littering the codebase. These can never be removed simply because of how many versions of Windows have behaved like that.

    Starting with Windows XP, there's an actual "application compatibility layer" which avoids introducing "fixes" into the core codebase of Windows by using "shims" (small DLLs that are loaded in place of the real counterparts which change a small part of functionality but redirect all other calls to the real ones) and patches (against the broken app itself). Microsoft can create shims and patches and has full control of the system compatibility database, developers can pick which shims apply to which programs through developer compatibility databases (Application Compatibility Toolkit), and end-users can select several "modes" for an application which apply a predefined selection of shims, the most common being the W9XVersionLie.

    (crap, next time I need to read until the end of the thread before shooting my mouth off. Thanks, j6cubic for the explanation that isn't a Massive Wall of Text.)
    @tdb said:
    This practice isn't limited to just software.  Microsoft's OOXML "standard" (in quotes because it's full of fail) mandates some incorrect behavior, such as not counting the year 2000 as a leap year because some more than a decade old version of Excel introduced that bug and people apparently have spreadsheets that depend on it.
    ... huh?

    The year 1900 is counted as a leap year in Microsoft Excel products because a 1980's version of VisiCalc incorrectly counted it as such, and the marketing team wanted to boast bug-for-bug compatibility with VisiCalc for their competing spreadsheet product, Excel. (It's also why there are two different epochs for Excel files, one being 1904 and the other being 1900.) They have not corrected this behavior because it's assumed that, yes, people have spreadsheets that will report "incorrect" results if they fix it, either because they already worked around it and the fix would cause overcompensation or because it was not worked around and the results would be different, which is considered "incorrect". However, nothing in the OOXML standard refers to the year 2000 that I'm aware of.

    The OOXML "standards" are full of fail though. It's basically almost a direct conversion of the binary OLE file format that pre-XML Office files used into such a snowball of hairy XML that nothing but Microsoft's own products could hope to be compatible.

    Anyway, back on topic...@jetcitywoman said:

    No matter how big a stick I tried, I could NOT pound it into my manager's head that these things made the data un-comparable.   No, I was just being uncooperative.
    Even though I know it's a hoax, I'm reminded of this story:



  • @TwelveBaud said:

    @tdb said:

    This practice isn't limited to just software.  Microsoft's OOXML "standard" (in quotes because it's full of fail) mandates some incorrect behavior, such as not counting the year 2000 as a leap year because some more than a decade old version of Excel introduced that bug and people apparently have spreadsheets that depend on it.
    ... huh?

    The year 1900 is counted as a leap year in Microsoft Excel products because a 1980's version of VisiCalc incorrectly counted it as such, and the marketing team wanted to boast bug-for-bug compatibility with VisiCalc for their competing spreadsheet product, Excel. (It's also why there are two different epochs for Excel files, one being 1904 and the other being 1900.) They have not corrected this behavior because it's assumed that, yes, people have spreadsheets that will report "incorrect" results if they fix it, either because they already worked around it and the fix would cause overcompensation or because it was not worked around and the results would be different, which is considered "incorrect". However, nothing in the OOXML standard refers to the year 2000 that I'm aware of.

     

    It seems that I had the leap year thing wrong way around.  Thanks for putting me straight in that.  Now that I think of it, it makes more sense that way - why implement the year%100 exception without also implementing the year%400 exception to the exception.

    As such, there are also no reference to the year 2000 in OOXML.  However, googling for "ooxml leap year" returns plenty of references to the year 1900.  I found this one to be the most clear: [url]http://grokdoc.net/index.php/EOOXML_objections#The_Gregorian_Calendar[/url] 

    This one has an actual quote from ooxml: [url]http://www.robweir.com/blog/2006/10/leap-back.html[/url]

    It's mentioned here too (a bit past the halfway point): [url]http://ostatic.com/blog/ooxml-why-is-it-bad-and-what-can-we-do[/url]

    Another (short) reference with a screenshot: [url]http://avi.alkalay.net/2008/03/the-ooxml-date-bug.html[/url]

    Yet one, which has a couple of links of its own: [url]http://www.oreillynet.com/onlamp/blog/2007/01/a_most_ingenious_paradox_make.html[/url]



  • This is ... wow. I tended to believe that MS' XML formats were a step in the right direction after all, but now I'm convinced it's just another step on the spot.

    The shims thing is also rather odd.



  • @derula said:

    The shims thing is also rather odd.
    Well, think about it.

    "Northwind SMA" is a hypothetical application developed for Northwind Traders, a massive multinational company with hundreds of thousands of employees, each of which has a desktop and a laptop. Northwind SMA was done by a contracting team in 8 months several years ago for ten million dollars, and manages all of Northwind Traders' day-to-day business. Because Northwind SMA could not run on Windows 3.1, the developers hardcoded it to run on "Windows 95 or greater", except they don't actually check for the "or greater" case: It requires exactly Windows 95.

    During a pilot program to upgrade the company to Windows Fundamentals for Legacy PC's, it was discovered that Northwind SMA does not run, saying that it requires Windows 95. "Windows Fundamentals for Legacy PC's broke Northwind SMA. This must be fixed." There are several things that can be done:

    • Northwind Traders does not upgrade Windows, and sticks with Windows 95. Microsoft loses out on $20m in sales.
    • Windows Fundamentals for Legacy PC's is recompiled with a new chunk of code that causes it to always report being Windows 95. Other applications which require Windows 98 or later will fail, the OS will no longer be able to be upgraded, and significant expenses in testing and recertification of the modified operating system will be incurred.
    • Northwind SMA gets a special flag in the system. When the OS sees it trying to load the "get a version number" DLL, it loads a decoy shim DLL in its place, that dutifully lies about the version number. Although all program loads are slowed a little as Windows checks a massive database, the behavior needed to get Northwind SMA working only affects Northwind SMA; all other programs get the full functionality they were expecting.

    A highly-paid IT consultant reviews each choice carefully, then decides to go with option number 3, introduced in Windows 2000 (thanks for the correction; Windows XP just made it visible to people without special tools). The flag is added to the Application Compatibility Database and deployed to all computers along with the Windows FLP install. Northwind SMA runs on the system without issue, and other applications are not affected.

    (Before Windows 2000, option 3 was not available. Northwind Traders would have to ask Microsoft to change their operating system, a major expense, or not upgrade, also a major expense. Once the initial investment of the AppCompat infrastructure was spent, the incremental cost of fixing broken applications is so low even customers could do it, so it's the obvious choice.)


Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.