SQL design for storing lots of similar-ish structures



  • I have a system where I receive many byte streams. These are stored in a simple table with a receive timestamp, a GUID, and a 'processed' flag.

    That's great so far. I can receive all my data, have some way of identifying it forever and knowing if I've attempted to decode it yet.

    The next bit is giving me a headache.

    Each stream represents a report. They have codes inside for the type of report, the type of system that made them, the serial number of the remote equipment, the sub-type of report (it may be a 'flash' report with some data omitted), the acquisition dates covered etc..
    Generally the reports are similar in that some things (like the items above) will be present in all reports and some will either present the data in different levels of detail or have extra data segments entirely.

    Should I just have a table for each report type and subtype or should I try and decompose everything so there is a 'basic' table with the things that will always be present and then linked tables for subparts x,y,z,z' etc... depending on what is actually present.

    The 'table for every report type' seems like it won't have a chance of catching me out later when we make a new report that doesn't quite fit into the demarcations between reports we have at the moment, but it doesn't feel very SQLy.

    Any recommendations?



  • @Cursorkeys just add varchar(255) ReportType to your existing table.


  • Discourse touched me in a no-no place

    GUESS WHAT!!!!!

    This is what Document Store Databases are for. YOU FOUND IT!

    Metadata to a SQL table.
    Actual report to DSDB.

    Downside: You have to touch 2 systems to query them.
    Downside: Aggregation will be a performance bitch (you will want to pluck out data that gets routinely aggregated and shove it in SQL)


  • Discourse touched me in a no-no place

    Alternatively, if you're running recent SQL Server and can represent the reports as json or XML, you can do the same right in SQL.



  • @anotherusername said in SQL design for storing lots of similar-ish structures:

    @Cursorkeys just add varchar(255) TableType to your existing table.

    That's what I'm worried about :)

    @Weng said in SQL design for storing lots of similar-ish structures:

    GUESS WHAT!!!!!

    This is what Document Store Databases are for. YOU FOUND IT!

    Metadata to a SQL table.
    Actual report to DSDB.

    Downside: You have to touch 2 systems to query them.
    Downside: Aggregation will be a performance bitch (you will want to pluck out data that gets routinely aggregated and shove it in SQL)

    Thanks, that seems like a good idea.
    So possibly MongoDB? There seems to be good support for C# for that one after a quick Google.



  • @Cursorkeys said in SQL design for storing lots of similar-ish structures:

    Should I just have a table for each report type and subtype or should I try and decompose everything so there is a 'basic' table with the things that will always be present and then linked tables for subparts x,y,z,z' etc... depending on what is actually present.

    If you have a lot of variation it may be worth it to split them out. I have some things like this and then different views to put them together into groups that share various similarities. It's convenient but not as performant as having everything in a single table.

    As for the document bit...what are you doing with that part of the data? Is it just a blob that you don't really care about? Or only care about once?



  • @Weng said in SQL design for storing lots of similar-ish structures:

    Alternatively, if you're running recent SQL Server and can represent the reports as json or XML, you can do the same right in SQL.

    Yes, it's SQL Server 2008 R2 currently. It shouldn't be a problem to get them into XML. I'll look into that too then, it would be neater.

    Edit: FILETABLE is not available in SQL Server 2008, that's a bugger.

    @boomzilla said in SQL design for storing lots of similar-ish structures:

    As for the document bit...what are you doing with that part of the data? Is it just a blob that you don't really care about? Or only care about once?

    It gets made into Excel reports. Some Excel reports cover just one document, some pull data from up to 50 documents based on thresholds/dates/unit serial numbers.
    It's most likely any one document will keep getting referred to at least once a month (monthly overview and month-on-month trend reports).



  • It might be worth to take a look at the Entity-Attribute-Value pattern for representing the metadata of your reports.

    You lose some of the advantages of an RDBMS (such as constraints, easy querying and proper data types), but if you have a really variable set of data it might be more effective than a number of tables.


  • Discourse touched me in a no-no place

    @Cursorkeys said in SQL design for storing lots of similar-ish structures:

    Any recommendations?

    Can you identify what attributes you're going to query or aggregate over ahead of time? Those sorts of things are much easier if you can do them in conventional SQL.

    (This is a reasonable case for denormalizing the data as you aren't really too worried about updates; the original reports remain the definitive source of truth. The extracted info columns are just there for querying convenience.)



  • @Cursorkeys said in SQL design for storing lots of similar-ish structures:

    It gets made into Excel reports. Some Excel reports cover just one document, some pull data from up to 50 documents based on thresholds/dates/unit serial numbers.
    It's most likely any one document will keep getting referred to at least once a month (monthly overview and month-on-month trend reports).

    So...the stream that you're saving. Is that the report or whatever itself or the input to generate the report? I'm a little bit confused at what's going on. I guess my question, regarding putting some of it into some other data store is: How painful is it to access it from a normal SQL server BLOB?

    Unless moving it out makes things a lot better, I'd leave it there.

    In my schema, I have something like 70 objects that are all sort of the same that share a parent table. Then they each have their own table that stores data specific to them. There are various link tables, too, when they relate to other objects. But the main parent table and the children tables are linked by their PKs. I'm using Java / Hibernate / Oracle, and Hibernate calls this inheritance strategy Joined (as opposed to Table Per Class or Single Table).

    My initial take is that you'd have your GUID and the stream and any other guaranteed common attributes in the main table, then split out other stuff. Perhaps you have a couple levels of this depending on the level of commonality of attributes among reports. So pulling together a single report object would require joining at least 2 tables.

    But if, when you put together a report that uses multiple of these, you can simply pull it by the GUID it's super easy because you're just pulling up one table. If you need the whole object, then it's still easy if you're using an ORM. If not, then this is going to be a PITA, but then if you're already having to look up various fields depending on the type for your processing it's going to be that no matter what.



  • @Cursorkeys How many different report types are there? 3? or 300?

    If it's 300, take Weng's advice and store the report as a JSON or XML document (depending on which your DBMS supports better.) Using a completely different DB for that is ridonkulous, though, don't do that.

    If it's 3, use a "star schema" arrangement. Put all the data common to all reports in a single "center" table. Data unique to specific reports put in tables with a foreign key to the "center" table. Left join as needed to get the data back out.



  • @boomzilla said in SQL design for storing lots of similar-ish structures:

    @Cursorkeys said in SQL design for storing lots of similar-ish structures:

    It gets made into Excel reports. Some Excel reports cover just one document, some pull data from up to 50 documents based on thresholds/dates/unit serial numbers.
    It's most likely any one document will keep getting referred to at least once a month (monthly overview and month-on-month trend reports).

    So...the stream that you're saving. Is that the report or whatever itself or the input to generate the report? I'm a little bit confused at what's going on. I guess my question, regarding putting some of it into some other data store is: How painful is it to access it from a normal SQL server BLOB?

    It's the input to generate human-readable reports. It's a large amount of datalogging readings for the most part so is much too fine-grained to be useful directly.

    I'm not sure about having it as a BLOB. I'll take a look at how performant that would be.

    Unless moving it out makes things a lot better, I'd leave it there.

    In my schema, I have something like 70 objects that are all sort of the same that share a parent table. Then they each have their own table that stores data specific to them. There are various link tables, too, when they relate to other objects. But the main parent table and the children tables are linked by their PKs. I'm using Java / Hibernate / Oracle, and Hibernate calls this inheritance strategy Joined (as opposed to Table Per Class or Single Table).

    My initial take is that you'd have your GUID and the stream and any other guaranteed common attributes in the main table, then split out other stuff. Perhaps you have a couple levels of this depending on the level of commonality of attributes among reports. So pulling together a single report object would require joining at least 2 tables.

    But if, when you put together a report that uses multiple of these, you can simply pull it by the GUID it's super easy because you're just pulling up one table. If you need the whole object, then it's still easy if you're using an ORM. If not, then this is going to be a PITA, but then if you're already having to look up various fields depending on the type for your processing it's going to be that no matter what.

    Yep, tables of metadata sound like a good plan. I should be able to have at least most of the info I need to be able to identify what documents are required as metadata without having to dig through loads of them every time.



  • @blakeyrat said in SQL design for storing lots of similar-ish structures:

    @Cursorkeys How many different report types are there? 3? or 300?

    If it's 300, take Weng's advice and store the report as a JSON or XML document (depending on which your DBMS supports better.) Using a completely different DB for that is ridonkulous, though, don't do that.

    If it's 3, use a "star schema" arrangement. Put all the data common to all reports in a single "center" table. Data unique to specific reports put in tables with a foreign key to the "center" table. Left join as needed to get the data back out.

    It's 11 different types currently. I'll look into the star-schema arrangement. Having to add another DB is definitely something I'd want to do as the last choice.



  • @Cursorkeys said in SQL design for storing lots of similar-ish structures:

    I'm not sure about having it as a BLOB.

    Oh, well, whatever it is. BLOB / CLOB...you said "byte stream" so I assumed BLOB.

    @Cursorkeys said in SQL design for storing lots of similar-ish structures:

    It's 11 different types currently.

    That sounds like a good candidate for the star schema (funny world where blakey gives the name and someone else describes it), though if they're not too different, and having some stuff be null isn't a problem, keeping them all in one table might not be too bad.



  • Or store the things that are not common for all types in nullable fields.





  • @groo That makes sense if you can identify one or two fields that are sometimes optional.

    If you have a dozen of very different content types and you try to chuck everything into the same table, then you'll only end up with a lot of partially filled rows.

    Partial tables (a shared parent table + separate tables for derived entities) are a cleaner approach, EAV can also be a sane approach if you're willing to accept the consequences and limitations.



  • I had previously asked a similar question in this thread.

    It might help:

    https://what.thedailywtf.com/topic/18655/alternatives-to-eav-anti-pattern/





  • IMO you're overthinking it. Forget about xml blobs, star schemas, mongo, bongo and all this crap.

    One table per report. Primary key, timestamps. Views as needed. That's it.

    If you're using an ORM, you can maybe do some inheritence mapping in the code. But KISS in the database.


Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.