How should I be writing this query?



  • I'm working on a problem at work.  I'm presenting the data here as if it were a tag cloud that I were implementing, because fundamentally it is the same many-to-many type of data model (although what I'm actually working with has nothing to do with blogs).

    So I have three tables like this (greatly simplified):

    Post
    id

    1
    2
    3

    Link
    p_id t_id

    1 1
    1 2
    2 1
    2 2
    3 1
    3 3

    Tag
    id

    1
    2
    3

    What I would like to do is write a query to give me all posts which are tagged with all of a given set of tags. For instance, if I wanted to get all posts tagged by both tags 1 and 2, the query should give me posts 1 and 2 (but not post 3 since it doesn't have tag 2). Right now, I am doing it like this:

    select Post.id from Post where
    exists(select Post.id from Link where Post.id = Link.p_id and Link.t_id = 1) and
    exists(select Post.id from Link where Post.id = Link.p_id and Link.t_id = 2);

    I don't think this is the best way to do this, as it involves n subqueries for n tags. Once in production, I might have millions of posts and thousands of tags, and it wouldn't be uncommon to need to run this query on 100 tags or so. Right now, that would require 100 subqueries which I don't think will scale very well (I haven't populated large amounts of data to test yet).

    If I were trying to run a query for all posts containing any of a given set of tags, it would be much easier:

    select distinct Post.id from Post, Link where Post.id = Link.p_id and Link.t_id in (T1, T2, ...);

    That query I think would scale well for 100 tags to search on. But like I said, that's not what I'm doing. :)

    Any non-WTF solutions that could be suggested would be appreciated.



  • Simple query (2 tags) 

    select a.p_id from link a, link b where a.p_id = b.p_id and a.t_id=1 and b.t_id=2

     
    Complicated query (4 tags)

    select p_id from link where t_id in (:t1, :t2, :t3, :t4) group by p_id having count(*) = 4

     (assuming the entries in link are unique; otherwise replace "*" by "distinct t_id")
     

     

     

     



  • @ammoQ said:

    Simple query (2 tags) 

    select a.p_id from link a, link b where a.p_id = b.p_id and a.t_id=1 and b.t_id=2

     
    Complicated query (4 tags)

    select p_id from link where t_id in (:t1, :t2, :t3, :t4) group by p_id having count() = 4

     (assuming the entries in link are unique; otherwise replace "" by "distinct t_id")

    I find the Simple query less obvious than the Complicated one.



  • @dhromed said:

    @ammoQ said:

    Simple query (2 tags) 

    select a.p_id from link a, link b where a.p_id = b.p_id and a.t_id=1 and b.t_id=2

     
    Complicated query (4 tags)

    select p_id from link where t_id in (:t1, :t2, :t3, :t4) group by p_id having count() = 4

     (assuming the entries in link are unique; otherwise replace "" by "distinct t_id")

    I find the Simple query less obvious than the Complicated one.

    now that you say that...  



  • @ammoQ said:

    @dhromed said:
    @ammoQ said:

    Simple query (2 tags) 

    select a.p_id from link a, link b where a.p_id = b.p_id and a.t_id=1 and b.t_id=2

     
    Complicated query (4 tags)

    select p_id from link where t_id in (:t1, :t2, :t3, :t4) group by p_id having count() = 4

     (assuming the entries in link are unique; otherwise replace "" by "distinct t_id")

    I find the Simple query less obvious than the Complicated one.

    now that you say that...  

    But wait,

    count(*)?

    You did that on purpose, right? To see if we're awake.

    count(1)



  • At least in Oracle, there is absolutely no performance difference between "count(*)" and "count(1)".



  • @ammoQ said:

    At least in Oracle, there is absolutely no performance difference between "count(*)" and "count(1)".

    Ah, kay.


Log in to reply