I'm working on a problem at work. I'm presenting the data here as if it were a tag cloud that I were implementing, because fundamentally it is the same many-to-many type of data model (although what I'm actually working with has nothing to do with blogs).
So I have three tables like this (greatly simplified):
Post
id
Post
id
1
2
3
Link
p_id t_id
1 1
1 2
2 1
2 2
3 1
3 3
Tag
id
1
2
3
What I would like to do is write a query to give me all posts which are tagged with all of a given set of tags. For instance, if I wanted to get all posts tagged by both tags 1 and 2, the query should give me posts 1 and 2 (but not post 3 since it doesn't have tag 2). Right now, I am doing it like this:
select Post.id from Post where
exists(select Post.id from Link where Post.id = Link.p_id and Link.t_id = 1) and
exists(select Post.id from Link where Post.id = Link.p_id and Link.t_id = 2);
I don't think this is the best way to do this, as it involves n subqueries for n tags. Once in production, I might have millions of posts and thousands of tags, and it wouldn't be uncommon to need to run this query on 100 tags or so. Right now, that would require 100 subqueries which I don't think will scale very well (I haven't populated large amounts of data to test yet).
If I were trying to run a query for all posts containing any of a given set of tags, it would be much easier:
select distinct Post.id from Post, Link where Post.id = Link.p_id and Link.t_id in (T1, T2, ...);
That query I think would scale well for 100 tags to search on. But like I said, that's not what I'm doing. :)
Any non-WTF solutions that could be suggested would be appreciated.