Random path



  • Here is a snippet of code that generates a path for uploaded files:

    function generatePath($org, $filemd5)
    {
       srand(micro_seed());
       $directorydepth = rand(1,4);
    

       $imgdestination = "";
       for($i=0; $i<$directorydepth; $i++)
       {
         $imgdestination .= substr($filemd5, $i, 1)."/";
         if(!is_dir(imagedirectory($org).$imgdestination))
           if(!mkdir_owner(imagedirectory($org).$imgdestination))
             errorpage(100);
       }
       return $imgdestination;
    }

    Basically, it takes a random number of characters from the md5sum for the names of subdirectories, giving a structure like this:

    ./1/orig_995897.jpg
    ./1/7/2/c/orig_995873.jpg
    ./e/orig_995841.jpg
    ./3/orig_995849.jpg
    ./3/orig_994183.jpg
    ./6/orig_995898.jpg
    ./d/3/orig_995939.jpg
    ./2/2/2/orig_1013854.jpg
    ./b/orig_995902.jpg
    

    I have only one question: WHY?
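    To make the problem concrete, here is a minimal sketch (in Python rather than PHP, purely for illustration) of what the snippet above does: a randomly-sized prefix of the file's md5 becomes the directory path, so the destination is re-rolled on every call.

    ```python
    import hashlib
    import random

    def generate_path(file_bytes):
        """Mirror the PHP: a random 1-4 char md5 prefix becomes the directory path."""
        file_md5 = hashlib.md5(file_bytes).hexdigest()
        depth = random.randint(1, 4)  # re-rolled on every call, like rand(1,4)
        return "".join(c + "/" for c in file_md5[:depth])

    # Two uploads of the same bytes can land in different directories:
    data = b"same file contents"
    path_a = generate_path(data)
    path_b = generate_path(data)
    # path_a and path_b frequently differ, so duplicates accumulate on disk.
    ```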


  • Discourse touched me in a no-no place

    @jpa said:

    [code]srand(micro_seed());[/code]
    Is that to generate "even more random" numbers?



  • Yes

    Of course, that's no longer needed since PHP 4.2.0, but PHP used to have a really crappy PRNG.

  • Discourse touched me in a no-no place

    @Shinhan said:

    Yes

    Of course, that's no longer needed since PHP 4.2.0, but PHP used to have a really crappy PRNG.

    Wrong. If the RNG needs seeding manually (which is a WTF in itself, but as you mentioned fixed in 4.2.0), you do it once at the top of your script, not every time you happen to call a function that uses rand().



  • @jpa said:

    I have only one question: why THE FUCK ?
    FTFY



  • @jpa said:

    I have only one question: WHY?
    Maybe they want to avoid hitting a file-system limit on the number of files in a directory?



  • @Zecc said:

    @jpa said:

    I have only one question: WHY?
    Maybe they want to avoid hitting a file-system limit on the number of files in a directory?

    That's not usually a problem. However, as the number of files in a single folder grows, some operations can become very slow (like when using ext2, or ext3 without dir_index; doing "ls" on such a directory is an O(n^2) operation).
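    The usual fix for that is deterministic fixed-depth sharding on a hash prefix, the same scheme git uses for `.git/objects/ab/cdef...`. A minimal sketch (in Python, for illustration; the function name is hypothetical):

    ```python
    import hashlib

    def shard_path(file_bytes):
        """Fixed two-level md5-prefix sharding: 256 buckets, fully deterministic."""
        digest = hashlib.md5(file_bytes).hexdigest()
        return f"{digest[0]}/{digest[1]}/"

    # Same bytes always map to the same bucket, so lookups need no guessing
    # and duplicate uploads overwrite rather than accumulate.
    assert shard_path(b"photo") == shard_path(b"photo")
    ```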


  • ♿ (Parody)

    The best scenario I can come up with is that he was worried that file names would collide, and making random directories with random depths would reduce the chances of an actual collision. However, with the PRNG seeded, the same file should go to the same place.



  • @boomzilla said:

    The best scenario I can come up with is that he was worried that file names would collide, and making random directories with random depths would reduce the chances of an actual collision. However, with the PRNG seeded, the same file should go to the same place.

    Only if micro_seed() were to return the same output each time. The whole point of seeding the PRNG is to ensure that the result is different (pseudo-random).
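    The distinction is easy to demonstrate: a PRNG seeded with the same value replays the same sequence, so only a seed derived from something constant per file (say, the md5) would make generatePath deterministic. A quick Python illustration:

    ```python
    import random

    # Same seed, same "random" depths:
    random.seed(42)
    a = [random.randint(1, 4) for _ in range(3)]
    random.seed(42)
    b = [random.randint(1, 4) for _ in range(3)]
    assert a == b  # identical seed replays the identical sequence

    # Seeding with the current time (what micro_seed() presumably does)
    # gives a different seed per call, hence a different path per upload.
    ```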


  • ♿ (Parody)

    @dohpaz42 said:

    @boomzilla said:
    The best scenario I can come up with is that he was worried that file names would collide, and making random directories with random depths would reduce the chances of an actual collision. However, with the PRNG seeded, the same file should go to the same place.

    Only if micro_seed() were to return the same output each time. The whole point of seeding the PRNG is to ensure that the result is different (pseudo-random).


    Oops, you're right. I hadn't looked that closely, but assumed (while writing my comment) that they were seeding with something from the name or md5 or whatever. I still think that collision avoidance is the most likely reasoning behind this. Which isn't to say it's not a WTFy way of doing it.



  • I'm going to assume that this scheme exists for roughly the same reasons as address space randomization.

     



  • Simple- if they didn't randomise it, then anyone who had the file would be able to work out where to find it.



  • @jpa said:

    Here is a snippet of code that generates a path for uploaded files:

    function generatePath($org, $filemd5)
    {
       srand(micro_seed());
       $directorydepth = rand(1,4);
    
       $imgdestination = "";
       for($i=0; $i<$directorydepth; $i++)
       {
         $imgdestination .= substr($filemd5, $i, 1)."/";
         if(!is_dir(imagedirectory($org).$imgdestination))
           if(!mkdir_owner(imagedirectory($org).$imgdestination))
             errorpage(100);
       }
       return $imgdestination;
    }
    

    Basically, it takes a random number of characters from the md5sum for the names of subdirectories, giving a structure like this:

    ./1/orig_995897.jpg
    ./1/7/2/c/orig_995873.jpg
    ./e/orig_995841.jpg
    ./3/orig_995849.jpg
    ./3/orig_994183.jpg
    ./6/orig_995898.jpg
    ./d/3/orig_995939.jpg
    ./2/2/2/orig_1013854.jpg
    ./b/orig_995902.jpg
    

    I have only one question: WHY?

    This is simply a retarded implementation of sharding. As mentioned above, directory lookups can be O(n^2), so splitting a large number of files across directories often improves performance. Unfortunately, this nitwit did it non-deterministically, so finding a file with only its name would involve up to four guesses. Maybe he didn't want to "waste" the intermediate directories by not putting files in them.

    Of course, he forgot to account for the fact that a file stored multiple times may not overwrite the previous version of itself. Also, a quarter of the files will end up one directory deep, meaning that his sharding algorithm only reduces the maximum file count per directory by a factor of 64. A similar naive implementation two levels deep would be four times better in terms of file count. A hash-bucket like implementation should be hundreds of times better. This guy definitely "out-clevered" himself. For added fun, this implementation breaks horribly and randomly after somebody stores a file named "a" (it gives a one in four chance of creating a scenario where one in sixteen saves will fail without a relevant error message).

    I actually have a bigger problem with forcing the caller to do the md5 before calling the function, and with the fact that a function named "generatePath" has the side effect of creating the directory structure and uses a global to decide where to create those directories. I also don't like that the function decides what to do on an error condition instead of letting it bubble up to the caller.
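    Putting those points together, a version without the WTFs might look like this sketch (in Python, for illustration; names like store_file are hypothetical): the hash is computed inside, the depth is fixed so the path is deterministic, and directory creation plus error handling stay with the caller.

    ```python
    import hashlib
    import os

    def generate_path(file_bytes, depth=2):
        """Return a deterministic shard path like 'a/b/' from the file's md5."""
        digest = hashlib.md5(file_bytes).hexdigest()
        return "".join(c + "/" for c in digest[:depth])

    def store_file(root, name, file_bytes):
        """Caller-side storage: no globals, errors propagate as exceptions."""
        rel = generate_path(file_bytes)
        os.makedirs(os.path.join(root, rel), exist_ok=True)
        with open(os.path.join(root, rel, name), "wb") as f:
            f.write(file_bytes)
    ```

    With a fixed depth of two, duplicate uploads of the same bytes always land in the same directory, so they overwrite instead of piling up.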
