Best way to get size of directory.

Polygeekery

I am running into some serious performance issues when retrieving directory sizes on our backup software, due to the nature of how the files are stored and the need to get an accurate size for billing.

Here is basically how things work now: When we take an initial backup, we generate hashes for all files on the client, compare them to hashes of files on the server and if they are not the same we transfer them to a directory with a name that corresponds to the UUID of said client and a subfolder that corresponds to the date and time. If we do already have them on the server, then the file is moved to a \Common folder and hard links are created in the filesystem at each relevant directory location to point to the location of said file. The most recent backup location for each backup is held in \UUID\Current.

When we first started, very few files were held in \Common and we could just query the filesystem attributes on the \UUID\Current folders and get a fairly accurate representation. Good enough for billing purposes, and most of those were not billed anyway so it was of little consequence.

Then we moved to 2012R2, which has excellent deduplication, and now that has gone out the window as dedup heavily skews the filesystem attributes. So I started using a 'du' script to dump disk usage results to a CSV and then consumed it from there. Now, with well over 50M unique files per server and people wanting file versioning back over 90 days with 2-4 backups per day, the hard links have grown massively and a du script will take up to 12 hours to walk the filesystem and return results for disk usage.

Now, my first inclination was to just say screw it on the server side and collect the information at the client and send it when we accept the connection for the backup, place that in the database and use that to generate billing information. That seems like a poor idea though. If we were to miss anything, there would be potential for abuse. I would much rather keep this server side if possible. But I am having trouble coming up with any way to collect accurate information considering the usage of hard links and deduplication.

It gets even worse when we bring in data on removable storage, such as copying the information in from an external hard drive before we do the initial backup. In those cases, all of the conventional solutions we have come up with fall on their face except for a 'du' script, as all of the initial files end up in the \Common folder. That is a minor issue though, and one that will take a fair amount of work to fix, so it is on the back burner.

I am really hoping that there is something basic that I am missing, but everything that we have tried so far server side either trips up on the hard links and/or dedup, or takes for freaking ever like 'du'. Feel free to make fun of me if you solve this problem in 10 seconds, I will just be happy to put it behind me.

PleegWat

If it's really plain du, there's an option --max-depth to not print subdirectories beyond a certain point. I assume there's more logic though, as I don't think du by itself deduplicates hardlinked entries.
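For reference, a hard-link-aware walk can be sketched as below. This is only an illustration, not the script from the original post: the function name `unique_size` is made up, and it assumes that `st_dev` plus `st_ino` identify a file uniquely (true on POSIX, and as far as I know recent Python 3 fills them in from the volume serial and file index on NTFS). It sums logical sizes, so it says nothing about what dedup does to the on-disk footprint.

```python
import os

def unique_size(root):
    """Total bytes under root, counting each hard-linked file only once."""
    seen = set()   # (device, inode) pairs already counted
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hard link to a file we already counted
            seen.add(key)
            total += st.st_size
    return total
```

It still has to stat every directory entry, so it will not be any faster than the existing walk; it only shows where the deduplication of links happens.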

accalia

well it seems to me that what you really want to do is scan the filesystem once and record everything into a database, then as new files are added/modified add or update the records in the database. why rescan the filesystem constantly for things that aren't going to change?
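A rough sketch of that idea, using SQLite purely for illustration. The table layout and the function names (`open_index`, `record_file`, `remove_file`, `client_total`) are all assumptions, not anything from the actual backup software.

```python
import sqlite3

def open_index(path=":memory:"):
    """Open (or create) the size index. One row per client file."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " client TEXT NOT NULL,"
        " path TEXT NOT NULL,"
        " size INTEGER NOT NULL,"
        " PRIMARY KEY (client, path))"
    )
    return db

def record_file(db, client, path, size):
    """Insert a new file or overwrite the size of an existing one."""
    db.execute(
        "INSERT OR REPLACE INTO files (client, path, size) VALUES (?, ?, ?)",
        (client, path, size),
    )
    db.commit()

def remove_file(db, client, path):
    """Forget a file the client has deleted."""
    db.execute("DELETE FROM files WHERE client = ? AND path = ?", (client, path))
    db.commit()

def client_total(db, client):
    """Sum of recorded sizes for one client (0 if none recorded)."""
    row = db.execute(
        "SELECT COALESCE(SUM(size), 0) FROM files WHERE client = ?", (client,)
    ).fetchone()
    return row[0]
```

With this the billing query is a single `SUM` instead of a filesystem walk; the cost moves to keeping the rows in step with what the receiving service actually stores.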

Polygeekery

@accalia said:

why rescan the filesystem constantly for things that aren't going to change?

That is a good point...and I will have to give it further thought. Changes are tricky, as they will occur in so many different ways and with so many different effects.

accalia

@Polygeekery said:

Changes are tricky, as they will occur in so many different ways and with so many different effects.

well the changes you care about are when the client adds/edits/deletes files, no? you don't actually care about the dedup percentage or any crap like that (for the purposes of billing the customer, obv you care about it for an actual server architecture standpoint thingie)?

so meter on the receiving service. :-D

blakeyrat

Why don't you just keep a running total, and do the full scan monthly (or so) to find discrepancies?
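Something like this, as a minimal sketch. The name `reconcile` and the `tolerance` parameter are made up for illustration; the scanned figure would come from whatever the full scan reports.

```python
def reconcile(running_total, scanned_total, tolerance=0):
    """Compare the running total against a full scan.

    Returns (corrected_total, drift, needs_attention). The scan is treated
    as the source of truth, so the corrected total is always the scanned one.
    """
    drift = running_total - scanned_total
    needs_attention = abs(drift) > tolerance
    return scanned_total, drift, needs_attention
```

Logging the drift each month also tells you whether the running total is trustworthy enough to stretch the interval between full scans.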

Polygeekery

@blakeyrat said:

Why don't you just keep a running total, and do the full scan monthly (or so) to find discrepancies?

...good idea. But that leads me to another thought...collect the stats client-side to show them what their usage is at any point in time and then run the full script once per month for billing purposes only. If the client side software is ever spoofed in any way, it doesn't matter to me as I still have accurate numbers and it will only affect them.

Plus, now that you led me to this thought (which, really, I should have come up with a long time ago) it would be trivially easy to implement.

blakeyrat

@Polygeekery said:

...good idea. But that leads me to another thought...collect the stats client-side to show them what their usage is at any point in time and then run the full script once per month for billing purposes only.

When you say client, do you mean the actual customer's computer, or like a web front-end that you control?

@Polygeekery said:

Plus, now that you led me to this thought (which, really, I should have come up with a long time ago) it would be trivially easy to implement.

Duh, Blakeyrat is a genius. Pay me royalties.

Polygeekery

The actual machine being backed up. There is a GUI interface currently where they can choose directories, retention policy, whether or not to pull system images, etc. It already has a list of directories, all we need to do is pull the file size attributes from there, add them up and display them as a running total on that tab. Easy.

blakeyrat

Oh yeah. Just add an asterisk or something to it, so people don't think the number is final.

Polygeekery

@blakeyrat said:

Duh, Blakeyrat is a genius. Pay me royalties.

I will knock a dollar off the $50 you owe me.

Bonus for me, it lets me change my punchline so that you owe me $49 and @abarker $50.

Onyx

Can't you set a filesystem watcher? I know Windows has the equivalent of inotify you can hook into, but I don't know what it's called. I know Qt hooks into it so it is programmatically accessible.

Basically, get totals periodically for verification purposes (like blakey suggested) and set up a watch. Every time a file is added / modified the watcher triggers and you can adjust the size accordingly.

This should be a snap for new files at least. I don't know in what way files get modified in this system though, that's the only tricky bit. I guess you could have a watch that triggers on write_open, records the size at that point, a second watch triggers on write_close and calculates the difference. New files would, of course, have initial size set to 0, meaning that new_size - old_size will always equal new_size anyway.
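In sketch form, with the watcher itself left out. The `on_write_open` and `on_write_close` names are hypothetical callbacks, not a real watcher API; whatever notification mechanism is used would have to call them with the file sizes it observes.

```python
class SizeTracker:
    """Keeps a running total by applying size deltas from write events."""

    def __init__(self):
        self.total = 0
        self._open_sizes = {}  # path -> size recorded when the write began

    def on_write_open(self, path, current_size):
        # Remember the size before the write; a brand-new file reports 0.
        self._open_sizes[path] = current_size

    def on_write_close(self, path, new_size):
        # If we never saw the open, treat the file as new (old size 0).
        old_size = self._open_sizes.pop(path, 0)
        self.total += new_size - old_size
```

For a new file, `old_size` is 0 and the delta is just `new_size`, which is the case described above. A file that shrinks produces a negative delta, so the total goes down as expected.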

blakeyrat

@Onyx said:

Can't you set a filesystem watcher? I know Windows has the equivalent of inotify you can hook into, but I don't know what it's called. I know Qt hooks into it so it is programmatically accessible.

I'd hope they were doing that anyway.

EDIT: then again, I'd hope someone would have thought of just keeping a running total before, and apparently that didn't happen. So who knows.

Polygeekery

@blakeyrat said:

Oh yeah. Just add an asterisk or something to it, so people don't think the number is final.

Yeah, already thought of that. In all honesty though the difference in precision between the two systems is not that critical. Our billing is done in 100GB chunks, so we would have to screw it up pretty badly for them to complain. ;)
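To put numbers on that, here is the arithmetic as a small sketch. Two assumptions that are not stated anywhere above: that a partial chunk is billed as a whole one (rounding up), and that "GB" means decimal gigabytes. The names `CHUNK_BYTES` and `billable_chunks` are made up.

```python
CHUNK_BYTES = 100 * 10**9  # assumption: 100 decimal GB per billing chunk

def billable_chunks(total_bytes, chunk_bytes=CHUNK_BYTES):
    """Number of chunks to bill, rounding any partial chunk up (assumed policy)."""
    return -(-total_bytes // chunk_bytes)  # ceiling division on integers
```

An estimate that is off by a few GB only changes the bill when the true usage sits within that margin of a 100GB boundary.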

blakeyrat

@Polygeekery said:

Our billing is done in 100GB chunks, so we would have to screw it up pretty badly for them to complain.

Or the computer crashed and you didn't save your running total often enough. Make no assumes.
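One common way to make the saved total survive a crash is to write it to a temporary file and rename it into place, so a reader only ever sees the old value or the new one. A minimal sketch, with hypothetical names (`save_total`, `load_total`) and a JSON file chosen only for illustration:

```python
import json
import os
import tempfile

def save_total(path, total):
    """Write the total to a temp file, flush it, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"total": total}, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        os.replace(tmp, path)     # replaces the old file in one step
    except BaseException:
        os.unlink(tmp)            # clean up the temp file on failure
        raise

def load_total(path, default=0):
    """Read the saved total, or return default if nothing was saved yet."""
    try:
        with open(path) as f:
            return json.load(f)["total"]
    except FileNotFoundError:
        return default
```

How often to call `save_total` is still a trade-off: any updates since the last save are lost in a crash, which is what the periodic full scan is there to catch.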

Polygeekery

@Onyx said:

Can't you set a filesystem watcher? I know Windows has the equivalent of inotify you can hook into, but I don't know what it's called. I know Qt hooks into it so it is programmatically accessible.

Yes, that is being done. The problem is that on Windows at least those numbers drift considerably from actual. It is the hard links that cause it I believe. In our case the numbers are pretty much gibberish after heavy updating (several concurrent backups).
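If the hard links really are the cause, one way to picture a fix is to key the running total on the underlying file rather than on the path, and only count a file when its first link appears and uncount it when its last link goes away. This is a sketch under that assumption, not what our watcher does today; `LinkAwareTotal`, `add_link`, `remove_link` and the `file_id` argument are all hypothetical, and it does nothing about dedup.

```python
class LinkAwareTotal:
    """Running total that counts each underlying file once, however many links it has."""

    def __init__(self):
        self.total = 0
        self._links = {}  # file_id -> set of paths currently linked to it
        self._sizes = {}  # file_id -> size counted for that file

    def add_link(self, file_id, path, size):
        paths = self._links.setdefault(file_id, set())
        if not paths:
            # First link to this file: count its size exactly once.
            self.total += size
            self._sizes[file_id] = size
        paths.add(path)

    def remove_link(self, file_id, path):
        paths = self._links.get(file_id)
        if not paths or path not in paths:
            return  # unknown link, nothing to undo
        paths.discard(path)
        if not paths:
            # Last link gone: the file's bytes are no longer held.
            self.total -= self._sizes.pop(file_id)
            del self._links[file_id]
```

A per-path watcher would have added the size twice for a file linked from two backup folders; here the second `add_link` leaves the total unchanged.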