Fear ye, all DBAs who dare to tread here (or: The Column of Doom™)



  • (First: this is in Coding Help because I really do need help here -- despite the title, DBAs are quite welcome, just brace yourself for the :wtf:)

    So, the application I work on takes in a bunch of structured text files in the general form

    RECA KEY=42 FOO=BLAH BAR=BLUB ...
    RECB KEY=43 FOO=BLEGH BAZ=FLUB ...
    ...
    

    and loads them into an Oracle DB so that we have a master copy of them in a neatly backed up place, considering that the data in them is quite business-critical (or will be when we hit production, at least). As an aside: it does some processing on the data in them as well, but that's irrelevant here.

    One of these files contains a bunch of information relevant to how the geography of the system's data (as much of the data the system works on is mapped to physical objects trackside) is related to the logical topology of the system; in other words, which servers and workstations "own" what portions of the data/geography, if you will. Most of the data records, including the bulk of the record types in this file, follow a semi-sane relational data model: everything has a surrogate key, ownership is established in a reasonably normal foreign-key way save for a few mildly :wtf:y spots, and queries on this data return results in a reasonable span of time.

    Then, you run into a type of object that exists solely to perform the actual logical-to-physical mappings. This is fine, until you look at how they implemented it. You'd think there'd be a foreign key in various places, or even a many-to-many association table, but no! Instead, they implemented this mapping in the "parent" record (i.e. the one that performs the logical-to-physical mappings, which look much like a tree, or perhaps a small forest of trees) as a comma-separated list of child record surrogate keys drawn from multiple types of child records.

    :headdesk: :headdesk: :headdesk: :headdesk: :headdesk: :headdesk: :headdesk: :headdesk:

    This all means that the only way to perform queries against this column is to use a substring match of some sort, e.g. INSTR(wtf_column, some_other_foreign_key) > 0. Worse yet, some of these lists are long enough that they exceed the 4000-byte size limit on the Oracle VARCHAR2 type, so we are forced to store them as CLOBs instead (we could break them up across multiple columns, but that'd make them an even worse :wtf: to query against).
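    As an aside (illustrative names only -- wtf_parent and child here are made up), a bare INSTR join like that can also produce false matches, since key 4 is a substring of 42; padding both sides with the list delimiter sidesteps that, though Oracle still has to scan every list for every row:

        -- naive: INSTR(p.wtf_column, c.key) > 0 would match key 4 inside 42
        -- REPLACE strips the spaces after the commas before delimiter-padding
        SELECT p.id, c.key
          FROM wtf_parent p
          JOIN child c
            ON INSTR(',' || REPLACE(p.wtf_column, ' ') || ',',
                     ',' || TO_CHAR(c.key) || ',') > 0;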

    The overall result is that queries that involve this table in multiple places wind up taking minutes at best, and several hours in the worst case (the last attempt at a query came back with 3357 rows in 20504.522 seconds, and yielded fewer rows than it probably ought to), despite the table wtf_column lives in having all of 510 rows. And an ordinary index is of no use, because it will neither help with the substring ops nor do anything useful with a CLOB. So, what am I supposed to do to query this in a reasonable span of time? Is a DIY approach, i.e. sucking all the data into my C++ code and letting it grind away at cranking out the report this query is for, going to be any faster than the pile of creeping epoxy that my queries are turning Oracle into? Or is Oracle Text the only salvation for what may be the ultimate database design :wtf:?

    P.S. our vendor controls the data format -- their argument for doing it the way they are right now is that it's more human readable and better optimized for textual representation, given that they hand-hack these files whenever their custom-written data-manipulation GUI tool falls down on the job. I sincerely doubt that they'd be receptive to arguments involving "Oracle's choking on this", as well, considering they have a habit of Jeffing problems away with "this isn't in the requirements!!!1111!!!!" arguments or charging exorbitant amounts for change orders.



  • @tarunik said:

    P.S. our vendor controls the data format

    Do they control the database schema?



  • @blakeyrat said:

    Do they control the database schema?

    Effectively, yes they do -- we follow their lead because our life would turn into utter insanity otherwise, considering that the tool writes said structured text files back to disk after processing them, and both the original creator and the end recipient of the files must be able to read the files this tool writes. Worse yet, both the original creator and the end recipient of the files are quite sensitive to errors, and are likely to core or hang if the files are wrong :facepalm:.



  • I don't see what's preventing you from parsing the files into a proper many-to-many table relation while also keeping the original text data and also writing said structured text files back to disk.



  • @blakeyrat said:

    I don't see what's preventing you from parsing the files into a proper many-to-many table relation while also keeping the original text data and also writing said structured text files back to disk.

    I do have the hooks needed to special-case such a thing...that might just be the best way to go here, especially considering the report needed isn't a one-off.



  • Yeah, my first instinct here is either refactor the table, or if you can't, create your own properly normalized tables that are updated from the other (perhaps by trigger or a batch job).



  • This.



  • @cartman82 said:

    Yeah, my first instinct here is either refactor the table, or if you can't, create your own properly normalized tables that are updated from the other (perhaps by trigger or a batch job).

    A full refactoring isn't feasible -- but the table's being loaded by a batch job to begin with, so hooking that batch job to maintain a properly normalized many-to-many relationship table on the side would be possible.

    (We can add to the schema, just not modify existing bits without exposing a gigantic pile of :wtf:.)
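    For concreteness, the side table I have in mind would look something like this (names invented for illustration):

        -- normalized mapping, maintained by the existing batch load
        CREATE TABLE wtf_parent_child (
            parent_key NUMBER       NOT NULL,
            child_key  NUMBER       NOT NULL,
            child_type VARCHAR2(10),  -- children come from several record types
            CONSTRAINT pk_wtf_pc PRIMARY KEY (parent_key, child_key)
        );
        CREATE INDEX ix_wtf_pc_child ON wtf_parent_child (child_key);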


  • Discourse touched me in a no-no place

    Yep. Do that.



  • Adding to my preaching, another problem here is why it wasn't implemented that way in the first place. Because: duh.

    Make sure whoever originally came up with this isn't allowed within 100 miles of your fix.



  • @tarunik said:

    I do have the hooks needed to special-case such a thing...that might just be the best way to go here, especially considering the report needed isn't a one-off.

    It's hard to imagine something worse than joining on stuff hidden in CLOBs.



  • Can you even join on stuff hidden in a CLOB once it's past 4000 chars (and therefore you cannot convert it to a string)?



  • @loopback0 said:

    Can you even join on stuff hidden in a CLOB once it's past 4000 chars (and therefore you cannot convert it to a string)?

    You can, but it's hideously expensive unless you can join on an Oracle Text index on that CLOB.
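    The rough shape of that, if memory serves (hypothetical names; CONTEXT is the general-purpose Oracle Text index type, and it needs explicit syncing after DML to stay current):

        CREATE INDEX ix_wtf_text ON wtf_parent (wtf_column)
            INDEXTYPE IS CTXSYS.CONTEXT;

        SELECT p.id
          FROM wtf_parent p
         WHERE CONTAINS(p.wtf_column, '42') > 0;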



  • @blakeyrat said:

    Adding to my preaching, another problem here is why it wasn't implemented that way in the first place. Because: duh.

    Make sure whoever originally came up with this isn't allowed within 100 miles of your fix.

    Thankfully -- I have that power. Whoever came up with the original underlying design of the wtf_column=(BLUB, BLEGH, GLUB, BLAH, ...) mess will never see this code, guaranteed.

    As to why it wasn't done right in the first place? I already gave the arguments I heard (directly from $vendor's chief architect) in the OP...they also argue that it actually performs better than using proper foreign keys when reading from a flat-file, which I don't buy.



  • @tarunik said:

    As to why it wasn't done right in the first place? I already gave the arguments I heard (directly from $vendor's chief architect) in the OP...

    But they didn't write your database schema.

    Mitchell and Webb: "Are we the baddies?" – 02:50
    — CrystalRoseCreations

    "Well, of course the allies are going to say we're the baddies."

    "Yes, but... they didn't get to design our uniforms."



  • I didn't know it was even possible. Because it's fucking silly.

    TIL.

    edit: It would fit in with the Oracle policy of giving you just enough rope to hang yourself though.



  • @tarunik said:

    they also argue that it actually performs better than using proper foreign keys when reading from a flat-file, which I don't buy.

    That sounds plausible to me, actually. They reduce I/O, and the string data is just string data for them. And they probably aren't doing the sorts of things you need to deal with inside the DB.



  • @blakeyrat said:

    But they didn't write your database schema.

    You missed my point. :triangular_flag_on_post: for whoosh. The whole point of our schema design is to mirror their data model as an Oracle DB. Had we had free rein to design the data our own way, we wouldn't have done it this way.



  • @tarunik said:

    You missed my point. for whoosh. The whole point of our schema design is to mirror their data model as an Oracle DB.

    Right; but I still (as I stated before) don't get why that precludes having a proper many-to-many relationship table. The only reason I can think of is the guy who implemented this in the first place has absolutely no idea how data normalization works.

    @tarunik said:

    Had we had free rein to design the data our own way, we wouldn't have done it this way.

    I'm not saying you can modify the text files, but you are in control of your own DB schema. And fixing this requires no more than having control of your own DB schema.



  • An Oracle Whoosh™?


  • Discourse touched me in a no-no place

    So we have the same issues with idiot inbound data formats and needing to preserve them in the database.

    We do a straight load, and then normalize the data on import, and denormalize it back on update.



  • @tarunik said:

    The whole point of our schema design is to mirror their data model as an Oracle DB. Had we had free rein to design the data our own way, we wouldn't have done it this way.

    But if you're just mirroring the data, you don't need to join anything, so who cares? CLOB away. In fact, just put the whole fucking file in a CLOB and be done with it.

    But I think that's not correct, and someone dropped the ball here. Reports and any query with a join denormalize data for various purposes. This is just someone not thinking very carefully about what they were doing.



  • Yup. I've done it with Twitter API data. You have to keep the raw JSON in a text column (since Twitter is prone to adding/removing fields with little or no notice), but nothing stops you from parsing out the relevant values and making sane relations on them.


  • mod

    @boomzilla said:

    It's hard to imagine something worse than joining on stuff hidden in CLOBs.

    Have mercy on me, but I think I came up with some worse options:

    [spoiler]Join on stuff hidden in a bit-masked int field?[/spoiler]

    [spoiler]Join on stuff hidden in an encrypted CLOB?[/spoiler]



  • @abarker said:

    Join on stuff hidden in a bit-masked int field?

    I think this is better than the CLOB. IIRC, CLOBs are stored...who knows where, certainly not with the rest of the row data. And rightfully so, since they can be gigabytes in size. Also, one could probably come up with a function-based index that would make the bit-mask a lot less painful.

    @abarker said:

    Join on stuff hidden in an encrypted CLOB?

    Eh...yes, but I think I'll say that encryption is just a more thorough form of hiding than unencrypted CLOBification. :smirk:


  • mod

    @tarunik said:

    The whole point of our schema design is to mirror their data model as an Oracle DB.

    Based on reading the topic so far[1]:

    As long as you can convert between your DB schema and their flat file data schema - in both directions - it doesn't matter how close your DB schema is to their data model. Using the schema you described in a database is just plain dumb. Keep the original in a text field for ease of output, but then have the appropriate data structures to support the operations that a database needs. Querying on a partial string match in a CLOB is a WTCF.

    [1] Yes, I know it's been said a few times, but it looks to need repeating.


  • mod

    How about this then: Join on stuff hidden and split across multiple CLOBs. :trollface:



  • You're a bad bad man.



  • That's awful and you should be ashamed of yourself. Still, you're joining on CLOBs. While I didn't specify, the spirit of my imagination was something worse that didn't involve CLOBs.



  • @boomzilla said:

    something worse that didn't involve CLOBs

    Join on bitfields ?



  • @blakeyrat said:

    Yup. I've done it with Twitter API data. You have to keep the raw JSON in a text column (since Twitter is prone to adding/removing fields with little or no notice), but nothing stops you from parsing out the relevant values and making sane relations on them.

    You could even keep the raw in a jsonb column and then you get at least some performance even on brand-new queries.



  • @VinDuv said:

    Join on bitfields ?

    Already been suggested, and I gave my reasons why I don't think it's worse.



  • @riking said:

    You could even keep the raw in a jsonb column and then you get at least some performance even on brand-new queries.

    SQL Server has native XML support, but does not yet have native JSON support. Still, we could have converted the JSON to XML then queried with XPATH but... whatever, I haven't worked at that company for like 4 years now.



  • @tarunik said:

    The whole point of our schema design is to mirror their data model as an Oracle DB. Had we had free rein to design the data our own way, we wouldn't have done it this way.

    The thing is that if I had a flat file that says

    parent=abc, children=(def, ghi, jkl, mno)
    

    I would take it to mean it contains a vector value. But a database can't do a vector value, so I'd normalize it to an m:n relation like:

    parent | child
    -------+-------
    abc    | def
    abc    | ghi
    abc    | jkl
    abc    | mno
    

    The original format is trivially reconstructed by

    select parent, group_concat(child) from ... group by parent;
    

    (I don't work with Oracle in particular, so they may have a different name for that function, but it should be standardish) while being simple to query.

    And I would make it a general rule - whenever the flat file contains a list of keys, it explodes to a relation table. So the correspondence between the schema and the files is still regular, easily described, but it is also reasonable to work with.

    And yes, the format is reasonable in flat files, because

     parent=abc, children=(def, ghi, jkl, mno)
    

    is shorter than

    parent=abc, child=def
    parent=abc, child=ghi
    parent=abc, child=jkl
    parent=abc, child=mno
    

    plus any other attributes related to abc in a separate table. Plus, it actually is easier to work with if you are loading the data into some graph structure rather than a relational database. I've done that a couple of times in work projects. Just convert it to a relation if that's what you need to use.



  • @Bulb said:

    (I don't work with oracle in particular, so I am not sure they don't have different name for that function, but it should be standardish)

    LISTAGG from 11g. Roll your own for 10g.
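    So the reconstruction upthread would be spelled roughly like this on 11gR2 (illustrative table/column names):

        SELECT parent,
               LISTAGG(child, ', ') WITHIN GROUP (ORDER BY child) AS children
          FROM parent_child
         GROUP BY parent;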



  • @loopback0 said:

    Can you even join on stuff hidden in a CLOB once it's past 4000 chars (and therefore you cannot convert it to a string)?

    Probably. It's LONG that'll break if you even write a WHERE clause on it.


  • SockDev

    @abarker said:

    Join on stuff hidden in an encrypted CLOB?

    @abarker said:
    How about this then: Join on stuff hidden and split across multiple CLOBs.

    The Evil Ideas thread is :arrows_counterclockwise: :twisted_rightwards_arrows: :repeat: :potable_water:

    <!-- Emoji'd by MobileEmoji 0.2.0-->


  • @abarker said:

    Join on stuff hidden in a bit-masked int field?

    This is much, much easier than joining on a CLOB -- a function-based index can be used to index bitmasked fields.
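    For example (sketch only -- t and flags are stand-ins):

        -- index the extracted bit; the query has to use the same
        -- BITAND expression for the index to be eligible
        CREATE INDEX ix_t_flag4 ON t (BITAND(flags, 4));

        SELECT * FROM t WHERE BITAND(flags, 4) = 4;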

    @abarker said:

    Keep the original in a text field for ease of output, but then have the appropriate data structures to support the operations that a database needs. Querying on a partial string match in a CLOB is a WTCF.

    That's the approach I'll take -- while tapping into the file read process itself is likely to be mildly annoying, there's also the option of bulk-moving the data into the normalized table as a post-processing pass. Perhaps this'd be a decent use of a stored procedure?
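    If the post-processing pass wins out, the usual Oracle idiom for exploding a comma-separated list into rows is REGEXP_SUBSTR plus a CONNECT BY row generator -- fine for a 510-row table, though it doesn't scale to huge ones. A sketch (invented names; REGEXP_COUNT needs 11g, and the LEVEL cap must exceed the longest list):

        SELECT p.id AS parent_key,
               TRIM(REGEXP_SUBSTR(p.wtf_column, '[^,]+', 1, lvl.n)) AS child_key
          FROM wtf_parent p
          JOIN (SELECT LEVEL AS n FROM dual CONNECT BY LEVEL <= 1000) lvl
            ON lvl.n <= REGEXP_COUNT(p.wtf_column, '[^,]+');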



  • @loopback0 said:

    LISTAGG from 11g. Roll your own for 10g.

    11.2



  • @PleegWat said:

    It's LONG that'll break if you even write a WHERE clause on it.

    True, but LONG should have died a long time ago.



  • Quite so. It's in USER_TAB_PARTITIONS.


  • mod

    @boomzilla said:

    That's awful and you should be ashamed of yourself. Still, you're joining on CLOBs. While I didn't specify, the spirit of my imagination was something worse that didn't involve CLOBs.

    :moving_goal_post:

    But fine:

    Begin with a situation as outlined in the OP. But instead of putting the data in a CLOB, put the data from each record in a separate text file. Have the path for each text file stored in a text field where the CLOB used to be. :smiling_imp:



  • HEY! How come we can't use both the Coding Help and I Hate Oracle Club tags at the same time? :angry:


  • mod

    @powerlord said:

    HEY! How come we can't use both the Coding Help and I Hate Oracle Club tags at the same time? :angry:

    Because they're discocategories, not tags. :rolleyes:



  • Because anyone using Oracle has given up with help?



  • @abarker said:

    Begin with a situation as outlined in the OP. But instead of putting the data in a CLOB, put the data from each record in a separate text file. Have the path for each text file stored in a text field where the CLOB used to be.

    LOL...you've reinvented Oracle CLOBs. :laughing:



  • @powerlord said:

    HEY! How come we can't use both the Coding Help and I Hate Oracle Club tags at the same time? :angry:

    You can't really blame this one on Oracle, though I suppose it acted as an enabler.


  • mod

    @boomzilla said:

    LOL...you've reinvented Oracle CLOBs. :laughing:

    But in a way that the DB engine won't natively handle it for you. You'll have to build code to access it all yourself. :smiling_imp:



  • @boomzilla said:

    LOL...you've reinvented Oracle CLOBs. :laughing:

    Except worse.


    When a noSQL database added joins, this remark was made:

    Congratulations, you've just reinvented SQL, except it's worse.



  • I'm not a DBA, but with most noSQL databases, you can make a view that takes the comma-separated field and emits a record for each pair of (record id, comma-separated list value). Is something similar possible with Oracle?

