Representing a tree in a relational database (efficiently)



  • (Somewhat inspired by the current frontpage article) Luckily, I'm no lazy developer but just a student, so I can afford asking this question without losing my face :)

    So basically, how WOULD you store a tree (with variable, arbitrary numbers of children) in a relational database?

    The only real approach I see here is storing the tree "upside down",  giving each node a "parent" field that contains the primary key of the parent node. But that seems horribly inefficient to me if you wanted to traverse the tree from top to bottom for example.

    Any better ideas? 



  • We discussed this a while back: http://forums.worsethanfailure.com/forums/thread/100920.aspx

    I used 'Modified Preorder Tree Traversal' quite successfully as part of the geographic search engine I described in another thread.

    -cw



  • If you give each record a parent field, that's the way to do it, and it's quite efficient.  You'll have a tree where the nodes are not ordered.  If that's what you want it's fine.

    To find a node's children (traversing top-down) all you need is "select * where parentId = ?".  Put an index on parentId and it's quite efficient.

    --------------
    [url=http://mailboxesbeverlyhills90210.com/]Beverly Hills Mailboxes[/url]



  • That's to find the direct children of the parent; but if you want all of the descendent nodes of a parent -- including the children of the children of the children of the children -- it gets messier.

    -cw



  • If you want all the descendants of a parent, then yes, you can't do that with one query at that point.  That's a problem that should probably be solved not in the SQL layer.  If you want really fast answers to queries like that, represent it as a tree in memory, with the DB providing your persistent storage.  A DB isn't a substitute for structures in memory.



  • @JavaCoder said:

    If you want all the descendants of a parent, then yes, you can't do that with one query at that point.

    In Oracle, you can, using the SELECT ...START WITH ... CONNECT BY ... statement. 

     

    That's a problem that should probably be solved not in the SQL layer. If you want really fast answers to queries like that, represent it as a tree in memory, with the DB providing your persistent storage. A DB isn't a substitute for structures in memory.

    Depending on the size of the tree, it might take quite a lot of memory to do that.



  • A lot depends on how the tree is going to be used and how often it will be changed.   Something like the preorder traversal makes the queries super fast, but at the expense of having to recalculate a lot of things and write to a significant number of rows if you add or remove from the tree. 

    In my case, I was storing geographic regions --  a hundred thousand of them or so.  Expensive to calculate what is inside of what,but we didn't have to do it all that often.    On the other hand, if you were maintaining, say, the employees at microsoft and you were adding hundreds and removing hundreds of nodes everyday, that would get expensive and you'd need a way to mitigate that.

    It's always about picking the right solution for the problem.

    -cw



  • @CodeWhisperer said:

    It's always about picking the right solution for the problem.

    And a relational database probably isn't it.

    We would need to see the real problem description to find a saner solution. "Build the tree structure in memory" is not a bad first cut, but the right answer for large data sets probably looks more like "build a structure on disk that allows you to rapidly navigate the tree structure". There are loads of algorithms for that kind of thing, and we can't even begin to guess at which one might be right from the data provided.



  • It totaly depents on the problem and what you are going to do with it.

    If your maximum branch dept is 3, you might go for 3 tables. (roots, branches, leafs ;) not as uncommon as you think: Product group -> Product -> Product property for a webshop for example?)

    If you need a lot of moving around of parent nodes (moving whole brance X to parent Y) then you better got for the 1 table with parent field approuce.

    If it needs to be super fast, you might not go for a DB at all. 



  • Here is a trivial example of a tree structure in SQL2K5, and the code needed to traverse the tree (down):

    if object_id('tempdb..#tree') is not null
      drop table #tree
    

    create table #tree (
    parent_id int null,
    id int not null,
    descr varchar(255)
    )

    insert #tree ( parent_id, id, descr )
    select null, 1, 'root' union
    select 1, 2, 'trunk' union
    select 2, 3, 'branch 1' union
    select 2, 4, 'branch 2' union
    select 3, 5, 'leaf 1/1' union
    select 3, 6, 'leaf 1/2' union
    select 4, 7, 'leaf 2/1'

    ;with tree_rec as (
    select parent_id, id, descr
    from #tree
    where id = 3 --> this is the anchor node... change it at will

    union all

    select T.parent_id, T.id, T.descr
    from #tree T
    join tree_rec R
    on T.parent_id = R.id
    )
    select *
    from tree_rec



  • How terrible a solution would it be to store a key to a parent node for each record and reconstruct the tree in your programming / scripting language of choice?



  • @gotaq said:

    How terrible a solution would it be to store a key to a parent node for each record and reconstruct the tree in your programming / scripting language of choice?

    That depends how big your resulting tree is...

    You would have to either:
       1)  Read the entire dataset into memory and parse it out manually  (chunky)
       2)  Query the database for each level you need (chatty)

    The optimal solution is to let the database handle the data, assuming your RDMS can handle it without jumping through too many hoops.  I know that both Oracle and SQL2K5 can handle this need efficiently.

    There are all sorts of approaches to hierarchies, each with their own complexities, each bringing different "pluses" and "minuses" to the table.  The simple parent/child relationship (shown above) is about the simplest recursive hierarchy you can create in a relational database.



  • @RaspenJho said:

    @gotaq said:

    How terrible a solution would it be to store a key to a parent node for each record and reconstruct the tree in your programming / scripting language of choice?

    That depends how big your resulting tree is...

    You would have to either:
       1)  Read the entire dataset into memory and parse it out manually  (chunky)
       2)  Query the database for each level you need (chatty)

    The optimal solution is to let the database handle the data, assuming your RDMS can handle it without jumping through too many hoops.  I know that both Oracle and SQL2K5 can handle this need efficiently.

    There are all sorts of approaches to hierarchies, each with their own complexities, each bringing different "pluses" and "minuses" to the table.  The simple parent/child relationship (shown above) is about the simplest recursive hierarchy you can create in a relational database.

    It sounds like Option 1, above, might work if the tree only needed to be constructed once, but read from or searched frequently.

    I have to admit that the SQL query above goes a bit over my head (yeah, I'm a bit of a novice).  It looks to me like it's creating a temp table and inserting 7 records, and then selecting one of them and its direct children.  Will that query recurse to an arbitrary depth, or is there a way to make it do so, without adding an additional subselect / union for each additional level?

    And what sort of data does it return?  It looks like it would be a flat list of nodes that would then have to be reconstructed into a tree -- I suppose there isn't any reasonable way to get an actual tree structure returned from a database?



  • It doesn't matter if the data need be constructed only once...  The RDBMS is still going to do it more efficiently than custom middle-tier code...  This is where caching comes into play.

    As far as the example, I created a temp table and added records to it just to "simulate" a real hierarchical data structure.  The real implementation would be users or rules or some other hierarchy that really makes sense for an application.  The temp table build-out is not part of the demonstration of recursiveness.

    The query above uses a SQL2K5 feature called Common Table Expressions (CTE).  You will note the two distinct select statements within the CTE, one of which selects the "anchor" member(s), and the other one selects the descendants (Note how the descendant query references the CTE that contains it).  It handles an arbitrary number of levels, up to the max recursion limit of the database.

    Here are some sample result sets from the query I posted:

    -- anchor node = 1
    

    parent_id id descr


    NULL 1 root
    1 2 trunk
    2 3 branch 1
    2 4 branch 2
    4 7 leaf 2/1
    3 5 leaf 1/1
    3 6 leaf 1/2

    -- anchor node = 3

    parent_id id descr


    2 3 branch 1
    3 5 leaf 1/1
    3 6 leaf 1/2



  • So either way you have to build the tree in the middle tier -- it's just that you can have the database return the data for the specific sub-branch of the tree you need?

    I'm still not clear on how the SQL statement recurses, but that probably just means I need to study up on my SQL (is that particular query specific to SQL Server 2005?)



  • I'm sorry for the rather vague formulation. I just thought, since apparently relational databases are commonly accepted as superior to hierarchal systems and since it was presented as an "abc" question, there would be a commonplace solution for it.

    But as for concrete problems, consider for example (really just an example, because I'm curious. I'm not actually planning to build such a thing) a forum with a threaded view. Common operations would include: Adding nodes in arbitrary places, removing leaves, fetching all nodes of a subtree.



  • @PSWorx said:

    I'm sorry for the rather vague formulation. I just thought, since apparently relational databases are commonly accepted as superior to hierarchal systems and since it was presented as an "abc" question, there would be a commonplace solution for it.

    As a rule, most things which are commonly accepted are wrong (because there are far more idiots in the world than people who can count past ten without taking off their shoes). People don't use relational databases because they're the best way, they use them because you can drop ten grand on Oracle and blame somebody else when it doesn't work. The rest is just efforts to justify a CYA decision, or to work around a lack of anything resembling in-house developers who are any good at their jobs. The number of cases where they are genuinely the best solution, from a technical perspective, is so small that you're relatively unlikely to ever encounter one (mostly megacorp stuff). You're somewhat more likely to encounter a case where they are a crappy technical solution but the best choice from a financial perspective - but even that doesn't come around every day, and the decision is usually not made on that basis. "Nobody ever got fired for buying a database".

     

    But as for concrete problems, consider for example (really just an example, because I'm curious. I'm not actually planning to build such a thing) a forum with a threaded view. Common operations would include: Adding nodes in arbitrary places, removing leaves, fetching all nodes of a subtree.

    Every decent mail client implements this. (Note that a web forum is just a really crappy mail client). I can't think of any that involve databases. Most just build the tree in memory on the fly; a few stash a copy of the tree on disk so that they don't have to rebuild it every time. The essential realisation here is that the mail client is going to have to scan the entire current folder anyway, in order to display the from/date/subject list, and the cost of building the tree structure in memory is negligible by comparison. It's a little more subtle with the typical forum interface, but the same principle applies: the cost of finding thread structure is small compared to all the other work that has to be done.

    I fully expect that most web forums use databases, because most of them are written by incompetent PHP morons. Try not to emulate them.
     



  • @asuffield said:

    Every decent mail client implements this. (Note that a web forum is just a really crappy mail client). I can't think of any that involve databases. Most just build the tree in memory on the fly; a few stash a copy of the tree on disk so that they don't have to rebuild it every time. The essential realisation here is that the mail client is going to have to scan the entire current folder anyway, in order to display the from/date/subject list, and the cost of building the tree structure in memory is negligible by comparison. It's a little more subtle with the typical forum interface, but the same principle applies: the cost of finding thread structure is small compared to all the other work that has to be done.

    I fully expect that most web forums use databases, because most of them are written by incompetent PHP morons. Try not to emulate them.

    I've yet to encounter mail client or forum software that builds tree structure on the fly from a sequence of mails/posts, using subject titles or otherwise, in a way that is anywhere near reliable. Usually they get confused on the full depth and branching of a long thread and miss out on postings. So I believe most modern forum software will store tree structure based on user intent (read: me clicking a "reply" or "quote" button). In that sense, forum software is not just a mail client, crappy or not. Mail is inherently linear (chronological) and single user; Forums are tree like and multi user.
     



  • @asuffield said:

    I fully expect that most web forums use databases, because most of them are written by incompetent PHP morons. Try not to emulate them.
    I'm curious to know why you think databases (I'm assuming you mean relational databases here) are inappropriate for web forums and how you would approach the problem.

    Say we're dealing with a simple web forum where there is only one root forum, and threads are created by users in this forum, and posts are made by users into threads. Let's also say that it's import to track posts and threads equivalently by who posted them, as well as the standard query-by-thread approach. It's not really a tree, since posts have two different parents (threads and users) depending on how you look at it. I'm curious about how you, or anyone thinks of this.



  • @gotaq said:

    So either way you have to build the tree in the middle tier -- it's just that you can have the database return the data for the specific sub-branch of the tree you need?

    I'm still not clear on how the SQL statement recurses, but that probably just means I need to study up on my SQL (is that particular query specific to SQL Server 2005?)

    The underlined bolded bits point out the recursion...

    ;with tree_rec as (
      select  parent_id, id, descr
      from    #tree
      where   id = 3    --> this is the anchor node...  change it at will
    

    union all

    select T.parent_id, T.id, T.descr
    from #tree T
    join tree_rec R
    on T.parent_id = R.id
    )
    select *
    from tree_rec



  • @JvdL said:

    I've yet to encounter mail client or forum software that builds tree structure on the fly from a sequence of mails/posts, using subject titles or otherwise, in a way that is anywhere near reliable.

    Gnus, mutt, thunderbird, evolution...

    Mail messages are threaded based on the In-Reply-To header (and a few others) that is inserted by the sending mail client. If yours does not work, get a new one.


    So I believe most modern forum software will store tree structure based on user intent (read: me clicking a "reply" or "quote" button). In that sense, forum software is not just a mail client, crappy or not. Mail is inherently linear (chronological) and single user; Forums are tree like and multi user.

    Mail is neither linear nor single-user. It is inherently out-of-order and would make no sense if it wasn't multi-user. I can only presume that you have never used a mailing list. 



  • @Welbog said:

    @asuffield said:

    I fully expect that most web forums use databases, because most of them are written by incompetent PHP morons. Try not to emulate them.
    I'm curious to know why you think databases (I'm assuming you mean relational databases here) are inappropriate for web forums and how you would approach the problem.

    Say we're dealing with a simple web forum where there is only one root forum, and threads are created by users in this forum, and posts are made by users into threads. Let's also say that it's import to track posts and threads equivalently by who posted them, as well as the standard query-by-thread approach. It's not really a tree, since posts have two different parents (threads and users) depending on how you look at it. I'm curious about how you, or anyone thinks of this.

    (Yes, I meant to say "relational databases", or even "RDBMSes", in that sentence) 

    Firstly, I wouldn't approach any problem by building a web forum. They're a classic example of the "pounding in nails with your forehead" syndrome. The entire concept is a shoddy reinvention of mail and news systems, and has no particularly good excuse for existing. If I were building a system to tackle that problem, it would look rather more like groups.google.com or gmane.org - those are two decent examples of how to approach the problem in a sane manner.

    Once you start thinking of it as a user interface to a newsgroup or mailing list, the shape of the problem becomes more apparent. You typically are interested only in recent postings, so the primary storage system should be chronological. A daily/monthly maildir or mbox format would therefore be a reasonable first cut for the primary storage, and the standard tools can be used to do most of the work on that level.

    The rest is all in the indexing. You'll want an index to find any message by its ID - that's easy to maintain, since it's insert-only, and the only point at which we're going to need any storage structure more sophisticated than an append-only file. A typical display algorithm then works like this: read the N most recent additions to the primary store. For each one, follow its 'parent' pointer (ie, In-Reply-To field or equivalent) and use the index to fetch that post, building the tree in memory as you go. That gives you the path up to the root of the thread.

    To find sibling posts, you'll need an index in the opposite direction: for each message ID, it stores the (append-only) list of message IDs that are in-reply to it. In a maildir format, this can be constructed in a file alongside each message, as each message arrives into the system. You can use these files to find as many or as few sibling posts as you want to display in the thread.

    Note that the search algorithms I have described here can be implemented in about 20 or 30 lines of code and the lookup is linear in the number of posts you are examining - nothing is ever going to be faster than that, and it's going to make any RDBMS look slothful. The insertion code should come in under 100 lines.

    If you have a desire to display posts based on users, that can be accomplished with another append-only file, one per user, storing the postings they have made in chronological order. 



  • @asuffield said:

    @JvdL said:

    I've yet to encounter mail client or forum software that builds tree structure on the fly from a sequence of mails/posts, using subject titles or otherwise, in a way that is anywhere near reliable.

    Gnus, mutt, thunderbird, evolution...

    Mail messages are threaded based on the In-Reply-To header (and a few others) that is inserted by the sending mail client. If yours does not work, get a new one.

    While Thunderbird builds a perfect tree from mail headers, and mail clients that do this are apparently considered "worth their salt": I REALLY HATE TREES. The visual display of such a beast reduces my ability to converse efficiently via email. The branching quickly grows out of control, and active leafs are not necessarily on the far end of a list of email items. On particularly "multi-htreaded" conversations with 1 other person, it's really preferrable to have their replies at the bottom/top* of a big linear list, and reply in sequence.

    *) interesting to note that at home, I have TB display new emails at the bottom, but at work, Outlook displays them at the top. The difference between them appears to be as big as the two main {accolade} styles: [vertical aligned] vs. [ { immediately after the control statement ]: no significant difference.



  • @dhromed said:

    While Thunderbird builds a perfect tree from mail headers, and mail clients that do this are apparently considered "worth their salt": I REALLY HATE TREES. The visual display of such a beast reduces my ability to converse efficiently via email. The branching quickly grows out of control, and active leafs are not necessarily on the far end of a list of email items. On particularly "multi-htreaded" conversations with 1 other person, it's really preferrable to have their replies at the bottom/top* of a big linear list, and reply in sequence.

    mutt happily switches between those two views (and a bunch of others) with a couple of keystrokes. I don't use thunderbird, so I'm not sure if it can do the same - of the decent email clients, it's probably the most clumsy and least flexible.



  • @asuffield said:

    mutt happily switches between those two views (and a bunch of others) with a couple of keystrokes. I don't use thunderbird, so I'm not sure if it can do the same

    TB can switch.



  • @CodeWhisperer said:

    That's to find the direct children of the parent; but if you want all of the descendent nodes of a parent -- including the children of the children of the children of the children -- it gets messier.

    -cw

    Not really, SELECT * FROM tree WHERE parent is null

    gets all parents, then you use those ID's to get all children



  • @jminkler said:

    Not really, SELECT * FROM tree WHERE parent is null

    gets all parents, then you use those ID's to get all children

     That's different than what I said.  I was referring to getting all the descendents of a given parent.  So, say you have an company org structure in the db, I could ask "Tell me all the people that the Sr. Vice President has under him" and get back myself, my boss, his boss, plus all the people in all lateral organizations. 

    If you just used a simple 'pointer to parent', you'd need to ask for A = all the people who's parent was the VP; then ask for B = all the people whose parent is in A, then C = all the people whose parent is in B, and so on until you get to the leaf nodes where they are the parent to nobody.

    In the case of the modified preorder design, each node is stored with a 'min child id' and 'max child id'.   You first query the parent you are interested in, it returns "max id = A, min id = B"; then you query for all entries who's id is between min and max and voila, you get the entire set back in one query.

    -cw



  • You are describing nested sets...  Check this url out:

     



  • @RaspenJho said:

    You are describing nested sets...  Check this url out:

     



    Nested sets FTW:

    http://www.dbmsmag.com/9603d06.html

    http://www.mrnaz.com/static/articles/trees_in_sql_tutorial/



  • To answer the OP's original question, we need to know which RDBMS he is using.

    SQL Server 2005 can handle hiearchies very easily and very efficiently, as mentioned.   I would never use the nested set model there.  SQL 2000 can do it pretty well also, using a UDF, but it is not quite a simple or efficient as in SQL 2005.  I can provide some links and examples if necessary.

    For other database engines, the nested set *may* be the way to go, but it still depends on exactly what you are modeling.

    I think I would recommend avoiding the reinventing the wheel to try to improve upon a relational database approach, as has been suggested here.  Sometimes when people don't understand how databases work or how to to use  them properly (i.e., writing clean, efficient SQL, good indexes and constraints, a good data model, etc) they think they are better off writing their own systems from scratch instead.  I suppose in that case maybe they are, but I would suggest simply learning SQL instead and leveraging proven technologies.

     



  • With apologies to the "I hate Oracle" members, check out Oracle's

    CONNECT BY

    ORDER SIBLINGS BY

    START WITH

    and LEVEL

    A thing of beauty for this type of thing.



  • @LoztInSpace said:

    With apologies to the "I hate Oracle" members, check out Oracle's

    CONNECT BY

    ORDER SIBLINGS BY

    START WITH

    and LEVEL

    A thing of beauty for this type of thing.

     

    I agree here... Long time Oracle hater... But I know and love these functions, even as a SQL-Server guy (And Oracle 10 even has some new ones, so you can retrieve the root element better... I think it was called "connectByRoot")

    But I dont see these functions as a big showstopper for any other dbms. Even on SQL Server 2000 it was not hard to perform recursive querries, and still keep using a set based aproach. The suggested nested sets offer a great performance boost, since you are able to retrieve the data you need from a single index scan, but it was already mentioned that those solutions require a constant "upkeep" so you can keep your data in shape. And since I am a fierce "hater of Triggers" (There are some valid uses, but I never encountered a trigger that was doing something "valid"), I dont trust them doing something reliable... ;)

    So, if you have static data, then you can consider a nested set. But what happens if you have a record, and you need to move, or delete it with a precomputed nested set? Say from position 17? This means you have to renumber a whole lot of records if the set contains 10.000.354.123 rows...

    The important thing is that you need to know what you want to do with the data. Small trees, which contain only product groups, or geographical location might be a valid candidate for those. For big trees, that "wont fit in memory", a nested set might be a problem, which will only show once you got more records, then you most likely have in your test DB...

     


Log in to reply