CSV Db

Captain

I have a bunch of CSV files I need to parse and transform into a single large report-like object. This sucker will be big, with probably hundreds of columns. I have been doing some simple transformations with Vim, but the scope of the rest is kind of enormous, and I'd like to automate it.

In particular, I have files that are organized by year. So I have a file for 2015, one for 2014, etc.

Each file, being a CSV file, has lots of columns. But I know for a fact that the 2015 and 2014 files don't have the same columns. I need to transform both files into a single file that contains data for 2014 and 2015, for all the columns they have in common.

I've looked into Python's CSV reader and CSVDict reader, and they look nice and all, but they parse the file into a list of dictionaries data structure, whereas I really want a more useful data structure to transform.

What would you do? Normally I'd use Haskell, and a CSV file library, and churn out some algorithm code. But surely this is a solved problem?

Kian

CSV is one of those formats that you think should be standardized, but isn't. I doubt you'll find a library that handles files quite like the ones you have.

By the way, do you have headers for the columns?

Captain

CSV is one of those formats that you think should be standardized, but isn't. I doubt you'll find a library that handles files quite like the ones you have.

It's the LibreOffice dialect of CSV. It shouldn't be too bad. I can transform that into another dialect easily enough with Vim.

By the way, do you have headers for the columns?

Yes, and that's another problem. They have a header, and also sub-headers.

Kian

@Captain said:

They have a header, and also sub-headers.

"Sub-headers"? I'm not familiar with that concept for CSV files. How do they work?

Captain

In a household:

Cats,           , Dogs, 
Meows, Scratches, Barks, Ruined Carpets
1500,       9001, 200,   5

The really crappy thing is that the csv file is more like:

By city:

Hepatitis,,      HIV,
Number, Z-score, Number, Z-score
500,    1.3,     40,     -0.25

blakeyrat

Learn Gawk, sounds like a perfect use for it.

Or write a quick-and-dirty application in a programming language you like.

Yamikuronue

...that is not a CSV anymore XD

Captain

You're right. It's a shitty CSV exported Excel spreadsheet. I tried to get in touch with the people who produced the data (http://countyhealthrankings.org) so I wouldn't have to scrape a scrape of a scrape but they just blew me off.

locallunatic

With dropping non-matching columns and the wonky "sub-headers", this looks like a quick and dirty app like blakey suggested is going to be the quick way to go. Is this a one off thing or are you going to be reusing whatever you throw together? That could change how much time you want to spend on your app.

Captain

I hope it's a one-off thing. It makes my head hurt.

boomzilla

I'd do this in two steps. First: merge the headers / sub-headers. Then merge the columns from the various files. I'm not aware of anything that would already be set up to do it, but it doesn't sound that hard in whatever language is familiar.

Kian

@Yamikuronue said:

...that is not a CSV anymore XD

As long as they keep the same number of records per line, even if those are empty or whitespace, it is. It's just not a proper table.

The other question I had: Are the headers / sub-headers consistent across years?

I'd just write a one off application to parse it.

accalia

@Yamikuronue said:

...that is not a CSV anymore XD

well, it is, it's just a less useful one. still a quick SED/AWK script can fix that

Yamikuronue

Oh, you're right, I misread that as being only one comma between top headers

WernerCD

I would be tempted to force the data into a database of some flavor (MSSQL is the one I'm most familiar with) and then you can do Sql selects/joins to your hearts content... Might be a "if all you have is a hammer" solution, but I've done it successfully before.

Desire to use openrowset is the real wtf?

RaceProUK

If you're gonna use a DB to sort out the data, then SQLite is probably better; no need to have a full DB server, and bindings exist for a lot of languages already

Maciejasjmj

If it's a one-off job... Yeah, it's like 5-6 lines in Powershell, and probably a similar amount of much uglier lines in whatever Linux shell you're using.

Or just parse this into data structures, merge, and serialize back. DB is an overkill to me if you only need to match headers.

And protip: use something designed to work with CSV, don't roll your own parser, or you'll be in a world of hurt and edge cases.

Captain

I highly doubt it's 5-6 lines in any language, considering I have to calculate the transpose of the data structure, find invariants, and use the invariants within each line of the original data structure. (Which is why I was looking for a specialized data structure...)

Captain

I do like the idea of a DB, but I'd have to figure out a way to "automatically" import the data, otherwise I'm stuck writing out table definitions of the same complexity I'm trying to avoid.

Maciejasjmj

Transpose what? Take the header/subheader pairs off both files, merge removing duplicates, then for each data point in a row look up where it belongs and put it there.

It's not gonna be a speed monster, but it should do the job.

WernerCD

I don't think I'd install a database JUST for this... but I normally have a Sql Instance spinning idly on my dev box. So I wouldn't personally be going from 0 to 60 per se... I'd already be half way there - if not further

RaceProUK

Oh.

In that case, carry on ;)

WernerCD

It's been AGES since I've had to do CSV based files... but I do know I've used something like this to import CSVs into a table: https://host4asp.net/import-csv-file-using-sql-server-management-studio/

I've also done ODBC Connections via hand crafted table definitions, and just done Sql against tables that are really CSVs via something like: http://help.interfaceware.com/kb/608 - with a good definition, you can point the ODBC at the file and then point SQL at the ODBC. Bam, SQL against CSV files.

These are a bit of effort for one offs - and I would only consider them because I've used them enough in the past to need less effort than other options like command line...

BUT this sounds like a lot of data manipulation, so building actual SQL tables (even if with something like SqlLite or one of the many alternatives ( http://stackoverflow.com/questions/417917/alternatives-to-sqlite )) MIGHT be worth it since there really is a LOT of power inside even the most basic Sql processors.

TheCPUWizard

Given the limited information I would probably Excel->XML->DB

sloosecannon

Or use SQL directly on the csv

Disclaimer: I have no idea how well that works on real data, I've only used it on pet projects. Pretty nifty tool though.