Splitting folders without compression



  • I've worked myself into a horrible situation where I need to distribute 2TB of data from a server to a number of clients with different HDD space remaining. Don't ask. Anyway, I need to split a folder (let's say it's 250GB) into 25 'archives' of 10GB without actually compressing it (since this would cost way too much time). I tried the features in PowerArchiver, but that skips a lot of files and takes about 6 hours to do what is essentially just copying.

    Has anyone tried this before? I'm running Windows Server 2003, btw, so shell scripts aren't going to work. Any ideas?



  • You can try [url=http://unxutils.sourceforge.net/]tar[/url]. Don't know about the speed though...



  • Could you explain a little more about the problem?

    It sounds like you need to take your huge server folder, split it up into 25 chunks, and give 1 chunk each to 25 smaller "client" machines.

    Does the data need to be readable on the client side after it is copied?  Are there any single files > 10GB?

    It sounds like you could write a simple application to do this using anything up to and including javascript...



  • @Diep-Vriezer said:

    I've worked myself into a horrible situation where I need to distribute 2TB of data from a server to a number of clients with different HDD space remaining. Don't ask. Anyway, I need to split a folder (let's say it's 250GB) into 25 'archives' of 10GB without actually compressing it (since this would cost way too much time). I tried the features in PowerArchiver, but that skips a lot of files and takes about 6 hours to do what is essentially just copying.

    Has anyone tried this before? I'm running Windows Server 2003, btw, so shell scripts aren't going to work. Any ideas?

    This is an instance of the bin-packing problem. A correct solution to this problem is NP-complete, so you have to approximate, and the best approach depends on the distribution of file sizes. As a result, none of the standard tools really try - they just skip files (which works best for large numbers of very small files, presumably the most common backup scenario). You'll probably have to implement something. In future, endeavour to avoid this problem; it's a pain.



  • I use winrar a lot, but not on that mass of files. It has an uncompressed option and lets you split the archive up into chunks of any size. No idea how it would perform with 250GB though.

     

    BTW, I hope to Bob that this isn't your daily backup scheme! 



  • @RaspenJho said:

    Could you explain a little more about the problem?

    I need to move from a software dynamic disk span solution (containing 4 disks) to a RAID 5 hardware solution, using the same 4 disks plus an additional one.

    @RaspenJho said:

    Does the data need to be readable on the client side after it is copied?  Are there any single files > 10GB?

    The data doesn't have to be directly readable, and there are some files larger than 10GB. The quickest and easiest way to do this is to just iterate through all the files, copy (or cut) them to a folder location until that folder grows to >10GB, and then start a new folder (a rough sketch of this approach is at the end of this post). I do however have a lot of files which are around 4GB (DVDs, for instance), and quite a few which are >20GB (imported MPEG). The 10GB limit makes for easy distribution, but it's not really important.

    @RayS said:

    I use winrar a lot, but not on that mass of files. It has an uncompressed option and lets you split the archive up into chunks of any size. No idea how it would perform with 250GB though.

    I tried PowerArchiver, which is really one of the greatest tools I've ever used. I'm thinking it works about the same as WinRAR. Anyway, using the 'uncompressed' option still takes ages; plus, you have to manually put files back into the huge archive, which takes even longer.

    Anyway I'll just start writing my own program, as TAR is one of the formats I already tried. Oh and NO this isn't my daily backup routine ;)
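
    A minimal sketch of that greedy "fill a folder until it passes 10GB, then start a new one" approach, assuming a Ruby interpreter is available on the box; SOURCE and DEST below are placeholder paths, not anything from the actual server:

    #!/usr/bin/ruby -w
    # Copy files into part_001, part_002, ... folders of roughly 10 GiB each.
    require 'find'
    require 'fileutils'

    SOURCE = 'D:/data'     # assumption: the folder to split
    DEST   = 'D:/split'    # assumption: where the part_NNN folders are created
    LIMIT  = 10 * 1024**3  # 10 GiB per part

    part, used = 1, 0
    Find.find(SOURCE) do |path|
      next unless File.file?(path)
      size = File.size(path)
      # Start a new part once the current one would overflow.
      # A single file bigger than LIMIT simply gets a part of its own.
      if used > 0 && used + size > LIMIT
        part += 1
        used = 0
      end
      target = File.join(DEST, format('part_%03d', part), path.sub(SOURCE, ''))
      FileUtils.mkdir_p(File.dirname(target))
      FileUtils.cp(path, target)   # swap in FileUtils.mv to cut instead of copy
      used += size
    end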



  • (Slight mistake in my earlier post: it's NP-hard, not NP-complete. Still too hard) 

    @Diep-Vriezer said:

    Anyway I'll just start writing my own program, as TAR is one of the formats I already tried. Oh and NO this isn't my daily backup routine ;)

    Don't try too hard - solve it in 100 lines or less; no more effort is justified. For files of this distribution, use the First Fit algorithm: allocate N 'bins' of suitable size, and put each file into the first bin where it'll fit. (The only fast algorithm that will pack these files more tightly is to sort the list of files into decreasing order first, which is probably impractical due to the large number of small files.)



  • @asuffield said:

    (Slight mistake in my earlier post: it's NP-hard, not NP-complete. Still too hard) 

    @Diep-Vriezer said:

    Anyway I'll just start writing my own program, as TAR is one of the formats I already tried. Oh and NO this isn't my daily backup routine ;)

    Don't try too hard - solve it in 100 lines or less; no more effort is justified. For files of this distribution, use the First Fit algorithm: allocate N 'bins' of suitable size, and put each file into the first bin where it'll fit. (The only fast algorithm that will pack these files more tightly is to sort the list of files into decreasing order first, which is probably impractical due to the large number of small files.)


    A compromise algorithm is possible.  Sort files by size if their size is greater than a heuristically determined threshold.  Throw them in an array if smaller.  Append smaller to greater.  First fit from there.

    Note that if speed is important, the calls to get each file's size will swamp everything else in the loop.  Caching them is a good idea.

    What the hell, a Ruby implementation in under 60 lines.  Be kind.  

    #!/usr/bin/ruby -w
    require 'find'

    $threshold = 1024   # bytes; files smaller than this count as "small"
    $directory = '/Users/sollaa/Projects/'
    $bucket_size = 1024 # bytes per bucket
    $buckets = 10

    class Phile  # just a filename paired with its cached size
      attr_reader :size, :filename
      def initialize(filename, size)
        @filename = filename
        @size = size
      end
    end

    class PhileArray
      attr_reader :role
      def initialize(size, role = nil)  # role == :unsorted marks the initial, unlimited list
        @array = []
        @size = size                    # remaining capacity in bytes (nil for the unsorted list)
        @role = role
      end
     
      def sort_by_size
        smaller = []
        bigger = []
        self.each {|file| file.size < $threshold ? smaller << file : bigger << file }
        bigger.sort! {|a, b| b.size <=> a.size}  # largest files first
        bigger.concat(smaller)                   # append the small files and return the combined list
      end
     
      def each(&block)
        @array.each(&block)
      end
     
      def push(file)
        return @array << file if role == :unsorted  # the unsorted list takes everything
        if @size >= file.size                       # accept the file only if it still fits
          @size -= file.size
          @array << file
        else
          false                                     # tells the caller to try the next bucket
        end
      end
     
      def process  # fill this in with calls to your archiver
      end
    end

    unsorted = PhileArray.new(nil, :unsorted)
    Find.find($directory) do |file|
      next unless File.file?(file)                    # skip directories
      unsorted.push Phile.new(file, File.size(file))  # cache each size once, up front
    end
    sorted = unsorted.sort_by_size

    # The block form gives every bucket its own PhileArray; passing the object as a
    # second argument to Array.new would share a single instance between all of them.
    buckets = Array.new($buckets) { PhileArray.new($bucket_size) }
    sorted.each do |file|
      buckets.each {|bucket| break if bucket.push(file)}  # first fit; a file too big for every bucket is skipped
    end
    buckets.each {|bucket| bucket.process}



  • @Diep-Vriezer said:

    I need to move from a software dynamic disk span solution (containing 4 disks) to a RAID 5 hardware solution, using the same 4 disks plus an additional one.

    How are you usually backing up the data? In an ideal world, you'd just run a backup, take it offline, set up the RAID 5 and restore the backup. Since you don't seem to have a backup solution, you have quite a problem no matter what.

    I hope you realize that RAID 5 is not a backup solution. It only protects you from hardware failure, not from data loss. If a user or a broken software program decides to delete files / format the array, the data can't be restored. That's why you have daily backups :)


    As far as your problem goes, just buy 4 new drives, set up the new RAID and copy the data 1:1. Use the old drives to create regular backup copies, or use them as spares. I don't think this is a problem that needs a software solution.



  • Download cygwin or SFU (you'll understand why shortly).

    Use the find command in the root of the big-ass folder, roughly like this:

    find -type f -size +500M -printf '%s %p\n' | sort -n

    You'll get output similar to this:

    3514810368 ./FC-5-x86_64-DVD-Unity-20060523.iso
    3525195776 ./FC-6-i386-DVD.iso
    3717459968 ./CentOS-5.0-i386-bin-DVD.iso
    4088006656 ./FC-6-x86_64-DVD.iso
    4287268864 ./CentOS-5.0-x86_64-bin-DVD.iso

    Use this to find all the big files that would be inefficient to archive or pack. Move these off to remote disks by hand, starting with all the smallest (over the cutoff size, 500MB in this case). Just drag-n-drop in explorer if need be.

    Finally use WinRAR or WinZIP or some other tool to create split archives of the remaining files. This part won't take forever, and because the files are smaller you'll have an informative progress bar. You can and should compress these archives lightly (a fast compression setting) to make the network bottleneck less of a chokepoint.
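
    If installing cygwin or SFU isn't an option, roughly the same listing can be produced with a few lines of Ruby, assuming an interpreter is on the box; the 500MB cutoff and the current directory just mirror the example above:

    #!/usr/bin/ruby -w
    # Rough stand-in for the find | sort pipeline above: list every file over a
    # cutoff, smallest first, without needing cygwin/SFU installed.
    require 'find'

    cutoff = 500 * 1024**2  # 500MB, as in the example above
    big = []
    Find.find('.') do |path|
      next unless File.file?(path)
      size = File.size(path)
      big << [size, path] if size > cutoff
    end
    big.sort.each {|size, path| puts "#{size} #{path}"}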



  • tar cf - folders | split -b 10737418240 - ARC

    cat ARC* | tar xf -

    Or something to that effect.
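
    (For reference, 10737418240 is 10 × 1024^3 bytes, i.e. exactly 10 GiB. If the split on hand is GNU coreutils, -b 10G should work as shorthand, though older Windows ports may not accept the suffix.)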



  • @TehFreek said:

    tar cf - folders | split -b 10737418240 - ARC

    cat ARC* | tar xf -

    Unfortunately it's considerably slower because you end up copying the data between processes. On any data set this large, it's almost always worthwhile to do it in one pass. 



  • I second the notion of making a backup, creating the RAID 5 configuration, then restoring the backup. I know that when I do major, potentially catastrophic changes to the configuration of a server I always need a backup anyway. You may as well use a backup for this purpose as well. At least you'll know it works!

    As an alternative, I might suggest running to your local CompUSA and purchasing a few large external drives. Copy the data over (which should be faster over Firewire/USB 2.0 than Ethernet), then do your RAID 5 upgrade. Copy the data back, and either keep the drives for potential future upgrades or return them to the store complaining that the color clashed with the server.

    OK, returning the drives isn't the most ethical choice, obviously.
     



  • @APH said:

    I second the notion of making a backup, creating the RAID 5 configuration, then restoring the backup. I know that when I do major, potentially catastrophic changes to the configuration of a server I always need a backup anyway. You may as well use a backup for this purpose as well. At least you'll know it works!

    As an alternative, I might suggest running to your local CompUSA and purchasing a few large external drives. Copy the data over (which should be faster over Firewire/USB 2.0 than Ethernet), then do your RAID 5 upgrade. Copy the data back, and either keep the drives for potential future upgrades or return them to the store complaining that the color clashed with the server.

    OK, returning the drives isn't the most ethical choice, obviously. 

    The drives are "LVT" -> "Leuk voor thuis", which is Dutch for "nice for at home".



  • @asuffield said:

    @TehFreek said:

    tar cf - folders | split -b 10737418240 - ARC

    cat ARC* | tar xf -

    Unfortunately it's considerably slower because you end up copying the data between processes. On any data set this large, it's almost always worthwhile to do it in one pass. 

    But this is going to be a pretty classic case of being disk-bound.  Are a couple of memory copies really going to matter compared to all the disk reads and writes?

     

    I was wondering the same thing about the claim that compression was too expensive.



  • @no name said:

    @asuffield said:
    @TehFreek said:

    tar cf - folders | split -b 10737418240 - ARC

    cat ARC* | tar xf -

    Unfortunately it's considerably slower because you end up copying the data between processes. On any data set this large, it's almost always worthwhile to do it in one pass. 

    But this is going to be a pretty classic case of being disk-bound.  Are a couple of memory copies really going to matter compared to all the disk reads and writes?

    Even 10% is important when it's applied to something measured in hours. Disks may be slow, but they're not so slow that nothing else matters any more.



  • @asuffield said:

    @no name said:
    @asuffield said:
    @TehFreek said:

    tar cf - folders | split -b 10737418240 - ARC

    cat ARC* | tar xf -

    Unfortunately it's considerably slower because you end up copying the data between processes. On any data set this large, it's almost always worthwhile to do it in one pass. 

    But this is going to be a pretty classic case of being disk-bound.  Are a couple of memory copies really going to matter compared to all the disk reads and writes?

    Even 10% is important when it's applied to something measured in hours. Disks may be slow, but they're not so slow that nothing else matters any more.

     

    But is it going to be anywhere even close to 10%?  And is developing and testing a custom solution for a one-shot job going to save you more than that overhead, even if it is 10%?

