Re-optimizing data



  • I'm sorry that this is a long explanation, but you can skip to the end if you're familiar with GIS vector data, like Shapefiles. Or just skip it entirely.

    The Shapefile has been an industry standard for decades, and it's a relatively decent format for storing geographic feature data (points, lines, or polygons representing real-world features). It uses a .SHP file containing the geographic coordinates to draw each feature, and a .DBF containing user-defined extra attributes (metadata) like each feature's name, purpose, construction material, etc. It's fairly compact: 30 KB for a small group of point features, or upwards of 5-10 MB for a larger polygon dataset with lots of metadata (like string fields).
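    For the curious, the .shp main file starts with a fixed 100-byte header (a mix of big- and little-endian fields, per the published ESRI spec), so sniffing one takes a few lines of Python. A minimal sketch, using a synthetic header in place of a real file:

```python
import struct

SHAPE_TYPES = {0: "Null", 1: "Point", 3: "PolyLine", 5: "Polygon"}

def parse_shp_header(buf: bytes) -> dict:
    """Parse the fixed 100-byte main header of a .shp file."""
    if len(buf) < 100:
        raise ValueError("shapefile header is 100 bytes")
    file_code, = struct.unpack(">i", buf[0:4])         # big-endian magic, always 9994
    if file_code != 9994:
        raise ValueError("not a shapefile")
    file_len_words, = struct.unpack(">i", buf[24:28])  # file length in 16-bit words
    version, shape_type = struct.unpack("<ii", buf[28:36])  # little-endian from here
    xmin, ymin, xmax, ymax = struct.unpack("<4d", buf[36:68])
    return {
        "file_bytes": file_len_words * 2,
        "version": version,
        "shape_type": SHAPE_TYPES.get(shape_type, str(shape_type)),
        "bbox": (xmin, ymin, xmax, ymax),
    }

# Build a tiny synthetic header: a 416 KB Polygon file with rough UK bounds.
hdr = struct.pack(">i", 9994) + b"\x00" * 20 + struct.pack(">i", 416 * 1024 // 2)
hdr += struct.pack("<ii", 1000, 5)
hdr += struct.pack("<8d", -8.6, 49.9, 1.8, 60.8, 0, 0, 0, 0)
info = parse_shp_header(hdr)
```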

    My company provides a web-based service component to deliver that data via the OGC standard WFS. However, we don't directly support the Shapefile format. No explanation why, just that we don't. Instead, server admins need to convert their Shapefiles to our custom format, and I recently had the joy of looking into what this actually entails.

    Our conversion utility breaks the Shapefile down into smaller geographic squares and stores them in subfolders and flat files. Folders are named for the geographic area they represent, with subfolders breaking each area down into smaller and smaller pieces. I guess the theory is that if you know what area you want, you just go to that folder. Within the final level of folders is a collection of XML-ish files that each contain the features that touch the given area. I don't mean references or pointers to features, or just the relevant chunk for that area: each file has all the information for every feature in that area, as well as their metadata from the DBF. And if the group of files within a folder exceeds a predefined limit, our web service won't serve it and we have to rerun this process using smaller squares.

    Say for example that one such square covers a tiny 200'x200' area where Utah, Arizona, Colorado, and New Mexico meet. The XML-ish file would contain the complete shape definition and all metadata for those 4 states, even though all the surrounding squares would contain most of the same information.
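    A toy model of that four-corners case (not the vendor's actual code; geometries reduced to invented bounding boxes) shows how fast the copies pile up:

```python
# Toy model: each tile stores a FULL copy of every feature that touches it.
# Feature geometry is reduced to a bounding box; coordinates are invented.
features = {
    "Utah": (0.0, 1.0, 1.0, 2.0),        # (xmin, ymin, xmax, ymax)
    "Arizona": (0.0, 0.0, 1.0, 1.0),
    "Colorado": (1.0, 1.0, 2.0, 2.0),
    "New Mexico": (1.0, 0.0, 2.0, 1.0),
}

def touches(tile, box):
    """True if the two axis-aligned boxes overlap or share an edge/corner."""
    txmin, tymin, txmax, tymax = tile
    xmin, ymin, xmax, ymax = box
    return not (xmax < txmin or xmin > txmax or ymax < tymin or ymin > tymax)

# The four tiles meeting at the shared corner point (1, 1):
tiles = [(0.5, 0.5, 1.0, 1.0), (1.0, 0.5, 1.5, 1.0),
         (0.5, 1.0, 1.0, 1.5), (1.0, 1.0, 1.5, 1.5)]

copies = sum(1 for t in tiles for box in features.values() if touches(t, box))
# 4 unique features, but 16 full copies: every tile stores all four states.
```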

    You end up with multiple levels of nested folders, hundreds and hundreds of subfolders; inside are thousands of ~10 KB files with non-standard XML contents containing the same information duplicated over and over, and in the end the whole thing is multiple times larger than the original data. I dare you to try moving a folder in Windows with 3000 subfolders and 20,000 small files.

    Today I found the pièce de résistance: The original Shapefile contained 6 detailed polygons with only 4 metadata attributes, covering roughly the area of the UK, with an original file size of 416 KB. A trivial dataset by all accounts. After it was converted to our format, the result had more than 800,000 subfolders and 1.1 million files, taking up over 200 GB. Just getting the properties of this folder takes 10 minutes.
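    Back-of-the-envelope, for scale:

```python
original = 416 * 1024           # 416 KB source shapefile
converted = 200 * 1024 ** 3     # "over 200 GB" after conversion
files = 1_100_000

inflation = converted / original        # roughly a 500,000x blow-up
avg_file_kb = converted / files / 1024  # ~190 KB per XML-ish file, on average
```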

    This is what happens when you think to yourself "I know how we can do this easier".



  • @Manni_reloaded said:

    After it was converted to our format, the result had more than 800,000 subfolders and 1.1 million files, taking up over 200 GB. Just getting the properties of this folder takes 10 minutes.

    I wonder how small that folder would get if you zipped it....



  • Sounds like your average tiling process...



  • Amid this steaming pile, I think my favorite bit was this:

    @Manni_reloaded said:

    a collection of XML-ish files



  • The developer probably thought it was a brilant idea at the time.



  • @Anketam said:

    The developer probably thought it was a brilant idea at the time.

    No doubt. And it probably worked really well for whatever test data he was playing with.



  • @barfoo said:

    I wonder how small that folder would get if you zipped it....

    Probably not that small: ZIP archives are not "solid" (they compress all files separately).

    A solid RAR archive on the other hand...
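    The difference is easy to demonstrate with plain zlib: compress 1,000 identical "files" separately (ZIP-style) versus as one solid stream:

```python
import os
import zlib

# 1,000 "files" that are all copies of the same 1 KB of incompressible data,
# mimicking the duplicated feature blobs scattered across the tile folders.
record = os.urandom(1024)
files = [record] * 1000

# ZIP-style: each file compressed on its own. No compressor can shrink random
# bytes, and the duplication across files is invisible, so nothing is saved.
separate = sum(len(zlib.compress(f)) for f in files)

# Solid-style: one stream over all files. The repeats fall inside the
# compressor's 32 KB window and collapse to almost nothing.
solid = len(zlib.compress(b"".join(files)))
```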



  • @Medinoc said:

    @barfoo said:

    I wonder how small that folder would get if you zipped it....

    Probably not that small: ZIP archives are not "solid" (they compress all files separately).


    barfoo probably used "zipped it" as a catch-all for "compressed it".

    @Medinoc said:

    A solid RAR archive on the other hand...

    bzip/gzip/tar/rar ... which are better for what? ISTR reading that one was good for a small number of large files, another more suited to a large number of small files...

    .. can anyone compare and contrast their experiences?




  • @Cassidy said:

    bzip/gzip/tar/rar ... which are better for what? ISTR reading that one was good for a small number of large files, another more suited to a large number of small files...

    .. can anyone compare and contrast their experiences?

    Who gives a shit whether the extremely wrong bad horrible data storage mechanism is better zipped or not? It's still wrong bad and horrible either way.

    Jebus this forum.



  • @blakeyrat said:

    Who gives a shit whether the extremely wrong bad horrible data storage mechanism is better zipped or not?

    I certainly didn't (give a shit, that is). I was interested in people's experiences with compression formats on the whole. 

    Thought it would balance out the story of inflating a 416KB file to 200GB of data.

    @blakeyrat said:

    It's still wrong bad and horrible either way.

    It is. I'd accepted it and moved on.



  • @boomzilla said:

    Amid this steaming pile, I think my favorite bit was this:

    @Manni_reloaded said:

    a collection of XML-ish files

    The file extension is XML. The contents have elements you'd expect in a WFS query result like <gml:point>blah blah</gml:point>. But there's a chunk of encoded data at the beginning of the file, probably byte offsets to individual items if I had to guess. There's no <?xml ...> header tag. And I've found places where the text values weren't properly escaped for XML, so there's &'s and >'s right there in the text.

    I've tried a couple of XML readers; they declare these are not valid files. I had to write a custom parser to extract the data. And yes, I see the irony in this solution.
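    For illustration, the kind of lenient extraction this forces you into might look like the sketch below (the sample blob and tag names are invented, not the actual file layout):

```python
import re

# Invented sample mimicking the files described: opaque leading bytes, no
# <?xml?> declaration, and an unescaped & and > inside a text value.
blob = (
    b"\x00\x12\x34\x56OFFS"
    b"<gml:Point><gml:pos>12.3 45.6</gml:pos></gml:Point>"
    b"<name>Smith & Sons > Main St</name>"
)

text = blob.decode("latin-1")
text = text[text.index("<"):]   # drop the binary prefix; XML parsers choke on it

# Regex "parsing": wrong in general, but the only option for not-quite-XML.
positions = re.findall(r"<gml:pos>(.*?)</gml:pos>", text)
names = re.findall(r"<name>(.*?)</name>", text)
```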



  • @blakeyrat said:

    @Cassidy said:
    bzip/gzip/tar/rar ... which are better for what? ISTR reading that one was good for a small number of large files, another more suited to a large number of small files...

    .. can anyone compare and contrast their experiences?

    Who gives a shit whether the extremely wrong bad horrible data storage mechanism is better zipped or not? It's still wrong bad and horrible either way.

    Jebus this forum.


    I care, because the dataset might be so horrifically wrong that compressing it actually inflates the size of the archive-- and then we've discovered a whole new level of WTF. And that's what this forum is all about.
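    And that inflation isn't hypothetical, at least at the margins: feed a compressor data with no redundancy and the framing overhead alone makes the output bigger:

```python
import os
import zlib

# 64 KB of random bytes: no redundancy for the compressor to exploit.
data = os.urandom(64 * 1024)
packed = zlib.compress(data)

# Block framing, headers, and the checksum make the "compressed" output larger.
grew = len(packed) > len(data)
```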




  • @Manni_reloaded said:

    The file extension is XML....

    There's no <?xml ...> header tag.

    And I've found places where the text values weren't properly escaped for XML, so there's &'s and >'s right there in the text.

    I've tried a couple XML readers, they declare these are not valid files.


    Summary: these files with an .XML filename extension do not contain well-formed XML.

    So.... does any documentation exist that sheds some light upon the thought processes behind this fuckwitted design decision? I mean, it looks like some developer happened upon XML for some reason but didn't quite understand it well enough to fully exploit it.

    ObXMLDefence: yes, I'm a lover of XML and the benefits it brings. Yes, I understand it gets the blame when it's used badly and inappropriately. Rather like Excel, etc.
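    For the record, a stock parser bails on exactly the unescaped-ampersand case described above (a bare > in text is actually legal XML, for what it's worth):

```python
import xml.etree.ElementTree as ET

# A raw '&' in text is not well-formed XML; a conforming parser must reject it.
try:
    ET.fromstring("<name>Smith & Sons</name>")
    well_formed = True
except ET.ParseError:
    well_formed = False

# Escaped properly, the entity round-trips back to the original text.
name = ET.fromstring("<name>Smith &amp; Sons</name>").text
```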




  • Can you get to a fine enough "resolution" (for lack of a better word) where you'd hit the Windows' path length limit?



  • @Nexzus said:

    Can you get to a fine enough "resolution" (for lack of a better word) where you'd hit the Windows' path length limit?

    Doesn't seem like much of a challenge.




  • @dhromed said:

    @Nexzus said:

    Can you get to a fine enough "resolution" (for lack of a better word) where you'd hit the Windows' path length limit?

    Doesn't seem like much of a challenge.


    Indeed, I've downloaded a *ahem* torrent *ahem*, er, Linux ISO that hit that limit!



  • @dhromed said:

    @Nexzus said:
    Can you get to a fine enough "resolution" (for lack of a better word) where you'd hit the Windows' path length limit?

    Doesn't seem like much of a challenge.

    Unicode to the rescue (that just sounds wrong)! Though I suspect they could manage to break the fuzzy 32,767 limit, too.
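    Rough arithmetic (the root folder and 8-character tile names are made up for illustration) says the classic limit falls surprisingly early:

```python
# Rough depth estimate before the tile-folder paths blow the Windows limits.
MAX_PATH = 260      # classic Win32 path limit
EXTENDED = 32767    # approximate limit with the \\?\ long-path prefix

root = len(r"C:\wfs_cache")   # hypothetical cache root
per_level = 1 + 8             # one backslash plus an 8-char tile name, e.g. "NW_00042"

depth_classic = (MAX_PATH - root) // per_level    # levels before MAX_PATH
depth_extended = (EXTENDED - root) // per_level   # levels before even \\?\ gives up
```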



  • @boomzilla said:

    Unicode to the rescue (that just sounds wrong)! Though I suspect they could manage to break the fuzzy 32,767 limit, too.
    Unicode?? You mean UNC paths?!



  • @topspin said:

    @boomzilla said:
    Unicode to the rescue (that just sounds wrong)! Though I suspect they could manage to break the fuzzy 32,767 limit, too.

    Unicode?? You mean UNC paths?!

    I guess I could have meant that if I wanted to be wrong. Not that you can't also construct long UNC paths using Unicode and some special sauce.



  • @Manni_reloaded said:

    And if the group of files within a folder exceeds a predefined limit, our web service won't serve it and we have to rerun this process using smaller squares.

    Say for example that one such square covers a tiny 200'x200' area where Utah, Arizona, Colorado, and New Mexico meet. The XML-ish file would contain the complete shape definition and all metadata for those 4 states, even though all the surrounding squares would contain most of the same information.


    Was there some recursion limit, or did it stop subdividing on the intersection because it ran out of disk space?
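    For context, a toy subdivision (bounding boxes only, coordinates invented) shows why the splitter can never win at that shared corner on its own: every level of halving still has a tile that all four features touch, so something else has to stop it:

```python
# Toy check: at a point where four features meet, subdividing never reduces
# the per-tile feature count, so a splitter keyed on "too many features per
# tile" can only stop at some other limit (recursion depth, disk, patience).
features = {
    "Utah": (0, 1, 1, 2), "Arizona": (0, 0, 1, 1),
    "Colorado": (1, 1, 2, 2), "New Mexico": (1, 0, 2, 1),
}
corner = (1.0, 1.0)   # the shared four-corners point

def touches(tile, box):
    """True if the two axis-aligned boxes overlap or share an edge/corner."""
    txmin, tymin, txmax, tymax = tile
    xmin, ymin, xmax, ymax = box
    return not (xmax < txmin or xmin > txmax or ymax < tymin or ymin > tymax)

crowded_levels = 0
half = 1.0
for _ in range(40):   # 40 halvings: tiles roughly 1e-12 across
    half /= 2
    tile = (corner[0] - half, corner[1] - half, corner[0] + half, corner[1] + half)
    if all(touches(tile, box) for box in features.values()):
        crowded_levels += 1
# Every one of the 40 levels still has a tile that all four features touch.
```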

