Querying tree structure in db

Kiyote

Brief background:

We store the client location structure in an organization table. (I'm going to strip it of its technical nicities in order to be brief.)

LocationId

Name

ParentLocationId

Now, this is all basically a nice tree structure, but this is, of course, incredibly hard to query against to find the list of locations at and below a given point in the tree.

Our current solution was to add a LocationKey field, which is a string composed of a separated list of Ids of every location above you.

For example:

Location A, Location B, Location C, each are children in order from left to right.

A is the parent of B is the parent of C.

So, the LocationKey for A is "A".

The LocationKey for B is "A.B"

The LocationKey for C is "A.B.C"

This way, if we want to find each and every location at and below a location, we simply do:

"SELECT * FROM LOCATION WHERE LOCATION_KEY LIKE '(some location's key)%'"

If you want to find every location *below* you (not at and below) you just add a "." to the like clause. '(some key).%'

So, is there a better way to do this? We have found it to be quite fast, but perhaps we're missing something obvious? We can't possibly be the only ones trying to store trees in a table, but there is remarkbly little information out on the 'net about the concept.

lpope187

If you are using SQL Server 2005 you can use CTEs (Common Table Expressions) to accomplish outputting hierarchies. I know Oracle has a similar method, but I don't remember the syntax. I'd search the documentation for recursive querying for the database your using - if it has a similar feature you should get close.

rdrunner

Hello...

I have some questions...

1. Which language are you developing in?

-> Is it .Net?

2. Is it resonable to keep the "Locations" all in a client as a lookup table?

-> If 2 times "Yes..."

Take a look at recursively navigating a relation you add to a Dataset. It makes querries like this on the CLIENT "quite easy". Look at the Datarow.getChildRows function there...

3. If you want to do this on a server... Which DB are you using(answer differs by a lot depending on DB)?

SQL Server 2000:

Write a SP to perform those querries.

1. Select the "root" node you need into a table variable. Add a "Level" integer to the row

2. while your last querry returned any rows and "@Level < 100" (optional parameter = MaxLevels)

set @Level = Level + 1 //Important... If you change this after the querry, your @@Rowcount will be reset to 1 and you get 100 itterations

SELECT * FROM LOCATION WHERE parentLocationID in (select id from @MyTmpTable where level = @Level -1)

SQL Server 2005:

Take a look at BOL at the CTE like already suggested. They are perfect for recursive querries...

Oracle:

Take a look at the "connect by" clause that is supported by oracle

Hope this helps...

P.s. If you need "real" Code for any example let me know... Currently I dont have any DB/Development Stuff on my box... fighting with Vista RC2... Damn... This feels so "naked" :D

SilverDirk

So wierd... I was just hopping on TDWTF forums to ask this very question.

The best I had come up with so far was to define the record mostly as you had, and use a WTFish query to pull it in. (thus the idea to ask on TDWTF forums, heh)

create table Linkage (
  ParentID integer,
  ChildID integer
);

create table DataTable {
   ID integer,
   // etc
   // etc
);

select L0.ParentID as ParentID, {data fields}
from Linkage L0 left join DataTable d on d.ID = L0.ChildID
where L0.ParentID = {Root}
union all
select L1.ParentID as ParentID, {data fields}
from Linkage L0 left join Linkage L1 on L1.ParentID = L0.ChildID left join DataTable d on d.ID = L1.ChildID
where L0.ParentID = {Root}
union all
...
select L5.ParentID as ParentID, {data fields}
from Linkage L0 ......
...

I generate the query in a loop for as many levels of hierarchy as I need. Its ugly, but seems to work ok.

My situation is more complicated because I don't have a tree... I have a directed graph. The logical way to handle it is to query each record individually, add records it links to to a "work list" (implemented as a set keyed on NodeID), then query links of items in the work list until I have as many results as I want or exhaust the nodes of the graph. but it seems inefficient to run a bunch of sequential queries, since I'll get hit with communications turnaround time on each iteration.   So I came up with the above WTF to pull accross a batch at a time.

Any suggestions?

CodeWhisperer

Depending on how big your dataset is, and how often you have to update it, you might want to try something like Modified Pre-order Tree Traversal.

Basically, you put a few extra values on each row that describes where in the tree it is, then you can very quickly query against that value to find everything underneath a specific node in your tree. I used this in building a geographic database a while back (the country has states, states have counties, counties have census blocks, etc.), it worked like a charm and blindingly fast. But doing updates requires that you probably have to touch every row in the table if you do an insert/delete/update, so it's not for cases where you want to change the data constantly (at least not on big data sets). In my case, the geographic boundaries don't change all that often so it wasn't a problem, but your mileage may vary.

Maybe I'm remembering the name wrong, because I'm not finding the search results I'd expect, but here* is a decent blog post about it with diagrams and everything :) Hope it helps.

-cw

*huh...I can't seem to paste in a link, and the HTML editor seems wonked, so here's the url: http://iamcam.wordpress.com/2006/03/24/storing-hierarchical-data-in-a-database-part-2a-modified-preorder-tree-traversal/

rdrunner

Hi...

I dont like using a path of an object inside the object itself. If there are any updates to the location in the tree you need to be very carefull to make sure you update everything correct..

Here is an example SQL that will show you how to implement a function on a SQL Server 2000 to querry this type of data:

create function dbo.get_cat ( @cat_id int )
returns @t table ( cat_id int , mother_cat_id int, cat_name varchar(30) , level int )
as
begin
declare @l int
set @l = 0
insert into @t select cat_id, mother_cat_id , cat_name, @l from yourtable where mother_cat_id = @cat_id

while @@row_count>0
begin
set @l = @l + 1

insert into @t select d.cat_id, d.mother_cat_id, d.cat_name, @l
from yourtable d
join @t t
on t.level = @l -1
and t.cat_id = d.mother_cat_id
end

return
end

nonDev

If your data is mostly static and/or you don't need to support versioning you might look into the object path option. The basic idea is to have each object contain a complete path to itself eg. object id4 path might be id1/id2/id3/id4/. This way you can query for object's child object or descendants in single like query eg. select * from table where path like 'id1/%'. It's a pain to maintain the correct path for each object but it pays off if you need quick retrieval. It's a trade-off. Either you read fast or you write fast :)

Then there's always Celko's nested sets .

rdrunner

Well...

If you have LOTS of Rows in you table, then the "full Path" might cause some performance problems... especially if you select where path like '%id1025743%' go get all children from a node that is not the root of a path it will cause an Index scan to find the records it needs... In that case you will write slow and read slow with the path implementation ;)

Also depending on the depth of the path the index could grew quite large, since you will be storing string. Another thing to consider is that you wont be able to have trees that are nested in an arbitery(spelling? I am a damn kraut...) depth... It will be limited to the size of your path sting...

I recently read a nice artice about recursive querries and an alternate solution to the path problem... It had to do with oranizing the path as a "value" and not as a string... I am trying to find it but i am currently unlucky ;)

rdrunner

Edit:

Frédéric BROUARD / T-SQL

Recursive Queries in SQL:1999 and SQL Server 2005 – SQLServerCentral

How many of you have written resursive queries in SQL? Or any language since school for that matter? Not many people write recusrive queries because of the complexity, the difficulty to understand how they work, and the chance for heap overflows. However, SQL Server 2005 implements Common Table...

P.s.: It was of course an article that also mentioned Joes Celko's Book ;)

nonDev

Well.. One could denormalize all the way. Have a table containing id, depth, ancestorId, order and level information. This way you have a record handy for every question. Sure it's lots of records but this is TDWTF right?

vr602

PLEASE please please implement this as a Nested Set. It's exactly the right thing for you, regardless of what dbms you are using. Here's a good explanation, starting about 1/4 of the way down at "The Nested Set Model".

MySQL :: Developer Zone

I'll try to explain it briefly, but the above link is much better. Each record has a Left-ID and a Right-ID. You number each element "round in a circle" starting at the top element and going down then across. I hear you say wtf? OK if top element is A which is parent to B and C, and B has children D and E then draw it as a usual tree structure, then go round anti-clockwise allocating Left-ID and Right-ID:
      1.A.10
        |        |
    2.B.7 8.C.9
   |         |
3.D.4 5.E.6

The joy of this is that, all the children of an element have Left-ID ( and Right-ID ) between parent's Left-ID and Right-ID. And the parents of any element have Left-ID < child's Left-ID and Right-ID > child's Right-ID. And leaf-level nodes have Left-ID = Right-ID - 1. So finding all parents or all children is super-simple.
Children: select ... where Left-ID > @parent_left_id and Right-ID < @parent_right_id order by Left-ID
Parents: select ... where Left-ID < @child_left_id and Right-ID > @child_left_id order by Left-ID
The only real overhead is when you insert or delete an element, you have to update lots of other elements to correct their Left-ID and Right-ID, adding or subtracting 2. So it's worth sorting bulk inserts for example, or you'll hit a lot of updates on entries that are already in your hierarchy. But it's not a big worry with fairly static data. As long as your "insert" and "delete" methods are carefully written, it isn't a problem.
Also, we implemented this with a Level-ID, to facilitate "what's my immediate parent"- type queries. You can calculate level by counting parents. Hope this helps. Regards.

kasm

I was actually wondering how I would go about doing this sort of thing a few weeks ago... although I would be doing it in MySQL, I've found that Oracle has exactly the functionality I would want built in:

http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96540/queries4a.htm

Does anyone know how I would do this sort of thing in MySQL, or am I stuck using subqueries or using application logic to build the tree?

Edit: Apparently this is a much requested feature:

http://bugs.mysql.com/bug.php?id=2341

nonDev

All this talk gave me an idea on how to store hierarchical data more efficiently. I've written up a small article at http://orangebeta.blogspot.com/2006/11/on-storing-ordered-hierarchies-in-sql.html

CodeWhisperer

I guess I'm not sure how this is more effecient than the 'modified postfix traversal' method. You store a lot more data than it does, adding/removing a node is still often going to require rebuilding the heirarchy data for many of the nodes, and the query to find descendants is a lot more complex.

-cw

nonDev

[quote user="CodeWhisperer"]the 'modified postfix traversal' method.[/quote]

I must say I haven't heard (had success with google) of that method before. Can you elaborate?

CodeWhisperer

My bad, 'preorder', not 'postfix'...gah.

It appears to be the same idea as 'nested sets'; each node has three pieces of data attached (a standard id, and then a 'left' and 'right' value). The basic idea is that the children of a node in the heirarchy will have ids that fall between the left & right values of that node). So, if you have a node with a left value of 9 and a right value of 13, then any row with an id between 10 & 12 will be a child. So, two additional pieces of info per row, and a nice simple query to find children.

I don't have a pretty picture to show, but this guy's blog entry shows an example:

Private Site

-cw

Kiyote

Doesn't this imply that a node can at most 2 children? We have a totally arbitrary tree structure to store in the database. It represents the organizational heirarchy of a location, and in this case comprises 15,000 discrete locations organized in to a ragged 15 level tree.

Thus far we have found storing the pre-computed path to be the fastest and (relatively speaking) easiest way to implement this.

The added wrinkle, of course, is that all of the locations are versioned in time. That is, say a location changes its name on January 14, 2007....up until that time any selection from the database will return name A, after that, it returns name B. It's a nightmare of a system, which is why I'm looking for ideas on how to simplify the implementation.

Kiyote

Ah, okay, I gave a quick once-over to the articles. I understand now. It has nothing to do with the actual number of children.

But as far as I can tell, all of these different methods are just various ways of doing the same attempt to deal with the impedence mismatch of the data-format-to-storage-format.

The article from MySQL seems to advocate per-event high-processor cost (recompute the tree every time you touch it) to save on storage and retrieval costs. ie - It may be faster to do a SELECT BETWEEN than a LIKE.

So, what I gather then is that there are many ways to attempt to solve this problem, all of which have their advantages and disadvantages, none of them inherently better than the other?