The right $tool for file comparison



  • From PHP manual comments:
    http://fi2.php.net/manual/fi/function.file.php#39830


    /* this a little function I wrote that checks if two ascii files are the same. it opens the file then removes the spaces then coverts the crc32 to base 10 and compares the results. */

    function fcheck($tool)
     {    if(fopen("file.01", "r") != FALSE){
        $fp1 = 'file.02';
        $fp = 'semcond.01';
     
        $data = implode(" ", file($fp));
        $data1 = implode(" ", file($fp1));
        $test1 = (dechex(crc32($data)));
        $test2 = (dechex(crc32($data1)));
          if($test1 == $test2)
              $sfv_checksum = 'TRUE';
                else
                  $sfv_checksum = 'FALSE';
              return $sfv_checksum;
        }
     
     }




  • yep, there's a distressing trend to use gigabytes of RAM where kilobytes would do.

     

    Many won't believe this, but back in the 1970's many computers had just KILOBYTES of RAM. My PDP-8 had only 8K and it was quite possible to handle many megabytes of data.  Typically you'd process the file sequentially, with only a disk block or two in RAM at a time.    Compare with the above code which needs like  64 times  the file size to run quickly.

     

     



  • Even if you assume that this function would only work on small files, it's still a pretty big WTF.

    if (file_exists($file1) && file_exists($file2) && (file_get_contents($file1) == file_get_contents($file2))){

      return true;

    } else {

      return false;

    }

    Obviously this would still have resource issues if you're working with files that are 100 MB, but it's still better than reading the file, imploding the lines, calculating a CRC-32 of both files, and converting the checksum to decimal.
     



  • Those are interesting Extended Versions of a "boolean" return, I must say.

    I'm also amused by the parameter $tool and how it can be pronounced "stool" which is fitting because it's not used and can be flushed down the toilet.
     



  • Wow... just, wow. This should be front-page material for CodeSOD.

    Ok, I haven't posted anything when logged in yet, but I couldn't resist this one. The most obvious WTF has been spotted; the entire files are read into memory, which could be done way more efficient. But there are way more WTF's. Let me try some:

    1. Passing a parameter which is then unused

    2. using fopen() to test the existence of a file

    3. Testing the existence of a file that isn't even used

    4. Small one: comparing it with false, but not type checking. It should obviously be !==, although that probably wont make a difference 

    5. Comparing build-in filenames

    6. Each line is read and then joined with spaces between them. Why not just use file_get_contents and leave the returns? Even worse: The description says spaces are removed. It doesn't. In fact, it only adds some.

    7. Converting the two crc32's (integer values) to hex for comparing them. Why not just compare the integers? And he even states it is converted to base 10, which is thus exactly the opposite.

    8. Why the parenthesis around the dechex? Another minor one, but... I can't even imagine any reason for that. 

    9. Returning a string with "TRUE" or "FALSE", not just a boolean value.

    10. Not returning anything if the first file (unused) doesn't exist.

    This thing is... insane... It's obviously written by someone completely clueless on PHP. It looks a lot like trial and error... Especially the error part.

     



  • Also the poor variable names. $fp and $fp1 but $test1 and $test2.

    I can't imagine the code this function was ripped out of was any better either. 



  • "2. using fopen() to test the existence of a file"

    Note that there is a small but non-zero chance that the file might be gone between the time it takes to execute file_exists() and fopen(). How would you test the existence of a file then? 



  • @Sunstorm said:

    "2. using fopen() to test the existence of a file"

    Note that there is a small but non-zero chance that the file might be gone between the time it takes to execute file_exists() and fopen(). How would you test the existence of a file then?

     

    Is there a file_lock function?



  • @dhromed said:

    @Sunstorm said:
    "2. using fopen() to test the existence of a file"

    Note that there is a small but non-zero chance that the file might be gone between the time it takes to execute file_exists() and fopen(). How would you test the existence of a file then?

    Is there a file_lock function?
    No, but there's a function named much more obviously: flock(). I imagine using this command and having PHP run off with a small group of other server-side languages.
     



  • @Sunstorm said:

    "2. using fopen() to test the existence of a file"

    Note that there is a small but non-zero chance that the file might be gone between the time it takes to execute file_exists() and fopen(). How would you test the existence of a file then?

    Yeah, obviously there wouldn't be anything wrong testing it this way if the handle was actually *used*. But in fact, the handle nor the file is actually used in the rest of the function.



  • Come on, cut him some slack! At least the function doesn't print out the two files, place the printouts on a wooden table, photograph them each with a digital camera, and use image processing/OCR to compare the photos...


    And yes, I'm sure there are functions in the PHP library that can do each of those things.



  • @Evo said:

    @Sunstorm said:

    "2. using fopen() to test the existence of a file"

    Note that there is a small but non-zero chance that the file might be gone between the time it takes to execute file_exists() and fopen(). How would you test the existence of a file then?

    Yeah, obviously there wouldn't be anything wrong testing it this way if the handle was actually used. But in fact, the handle nor the file is actually used in the rest of the function.

    Also it is never closed. Yay, fd leakage!



  • @aquanight said:

    @Evo said:

    @Sunstorm said:

    "2. using fopen() to test the existence of a file"

    Note that there is a small but non-zero chance that the file might be gone between the time it takes to execute file_exists() and fopen(). How would you test the existence of a file then?

    Yeah, obviously there wouldn't be anything wrong testing it this way if the handle was actually used. But in fact, the handle nor the file is actually used in the rest of the function.

    Also it is never closed. Yay, fd leakage!

    If I remember right PHP handles that, after the script has been done resources are released. If it's good practice is another thing. I know the image library (my favority toy in PHP) used to leak memory if you didn't free up the image resources correctly.

     

    [offtopic]Maybe that's the best way to describe PHP. A room full of toys.[/offtopic]



  • @msarnoff said:

    Come on, cut him some slack! At least the function doesn't print out the two files, place the printouts on a wooden table, photograph them each with a digital camera, and use image processing/OCR to compare the photos...

    And yes, I'm sure there are functions in the PHP library that can do each of those things.

    Except for the photographing (specialized module for robotics interface?), pretty sure, yes. Convert the pages to PDF, convert to big jpegs and then OCR 'em.



  • @Evo said:

    9. Returning a string with "TRUE" or "FALSE", not just a boolean value.






    And storing it in a temporary variable before returning.



  • @Dragnslcr said:

    Even if you assume that this function would only work on small files, it's still a pretty big WTF.

    if (file_exists($file1) && file_exists($file2) && (file_get_contents($file1) == file_get_contents($file2))){

      return true;

    } else {

      return false;

    }

    Obviously this would still have resource issues if you're working with files that are 100 MB, but it's still better than reading the file, imploding the lines, calculating a CRC-32 of both files, and converting the checksum to decimal.
     

    We all know what happens when we assume something, right?  You make an ass out of u and... well, just yourself :P  It's these little "helper" functions that work their way down deep in the bowels of the library of functions/classes for the entire company and get used by other people who didn't actually read the code.  They assumed that this method would work with any file (function name is fcheck, not fcheckforsmallfilesonly), and they happily toss in a couple of GB files to compare, bringing the server to a halt.  I'm ashamed to say I've worked something similar: in-memory database merging libraries.  Of course the boss's idea was that the files would be small and "RAM is cheap".  That still doesn't help you when the server can only contain 4GB RAM.  Oh well.  It was eventually re-written properly when a client's server kept dying on them.

    Live and learn.  Hopefully learn from someone else.



  • @Daid said:

    @aquanight said:
    @Evo said:

    @Sunstorm said:

    "2. using fopen() to test the existence of a file"

    Note that there is a small but non-zero chance that the file might be gone between the time it takes to execute file_exists() and fopen(). How would you test the existence of a file then?

    Yeah, obviously there wouldn't be anything wrong testing it this way if the handle was actually used. But in fact, the handle nor the file is actually used in the rest of the function.

    Also it is never closed. Yay, fd leakage!

    If I remember right PHP handles that, after the script has been done resources are released. If it's good practice is another thing. I know the image library (my favority toy in PHP) used to leak memory if you didn't free up the image resources correctly.

     

    [offtopic]Maybe that's the best way to describe PHP. A room full of toys.[/offtopic]

    I must point out the "after the script has been done" - obviously the fd leaked as long as the script is doing things, and if this function is called several hundred times, you could wind up with a lot of fds leaked :) .


Log in to reply