Outer Join or WHERE NOT EXISTS



  • I think I know already, but which would be faster if you wanted to find only the records that don't join?

    Method 1: Do an outer join and then a WHERE table2.whateverfield IS NULL

    Method 2: do a WHERE NOT EXISTS 
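
    For concreteness, a minimal sketch of the two forms (the table and column names here are made up):

        -- Method 1: anti-join via LEFT OUTER JOIN, keeping only the unmatched rows
        SELECT t1.*
        FROM table1 t1
        LEFT OUTER JOIN table2 t2 ON t2.id = t1.id
        WHERE t2.id IS NULL;

        -- Method 2: the same anti-join expressed with NOT EXISTS
        SELECT t1.*
        FROM table1 t1
        WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.id = t1.id);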


  • Discourse touched me in a no-no place

    Can't you trial both on the dataset concerned and do a DESCRIBE on them?



  • @PJH said:

    Can't you trial both on the dataset concerned and do a DESCRIBE on them?
    That sounds like less work than posting here.  I was looking more for some general intuition, maybe someone who paid attention in DB class.


  • Discourse touched me in a no-no place

    @belgariontheking said:

    I was looking more for some general intuition, maybe someone who paid attention in DB class.
    My general intuition would be that in any case where it matters (i.e. the query itself takes a while), the difference between the queries would be minimal on any sensible DB (or no difference whatsoever, i.e. they resolve in the background to the same thing.)



    Then again, sod's law would probably invoke itself with the particular query you have in mind with the database you have in mind.



    I think you're probably going to have to profile it, instead of guessing (or relying on others' guesses.)



  • I did some research on this back when SQL Server 2005 was new.



    The consensus was that WHERE NOT EXISTS produces a generally more optimal (and faster) query plan. We implemented a rule that unless a column was needed from the OUTER table, we would use the WHERE NOT EXISTS syntax.



    Deviation from the rule required an analysis of the two equivalent queries and close inspection of the execution plans.



  • Well, for what it's worth, I tried similar queries on MS SQL 2008 on tables on the order of 130k and 3k rows, both joined on an integer primary key.  End result is that although the query plans were slightly different, each was 50% of the batch cost based on the actual execution plan.

    The reality is that although there may be general rules for query performance, there is no substitute for actually profiling the queries.



  • @PJH said:

    I think you're probably going to have to profile it
    Well sumbitch.  It turns out that the outer join was with a subquery defined as a derived table with an alias, so NOT EXISTS wouldn't be appropriate anyway.  Neither would NOT IN.

    Oh well.  asking useless questions FTL.
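
    (For anyone wondering what that shape looks like, a rough guess at the kind of query involved - the real tables and subquery aren't shown here, so these names are invented:)

        -- Anti-join against a derived table (an aliased subquery);
        -- unmatched rows are kept with the usual IS NULL filter.
        SELECT t1.*
        FROM table1 t1
        LEFT OUTER JOIN (SELECT customer_id, SUM(amount) AS total
                         FROM orders
                         GROUP BY customer_id) o ON o.customer_id = t1.id
        WHERE o.customer_id IS NULL;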


  • Discourse touched me in a no-no place

    @belgariontheking said:

    @PJH said:
    I think you're probably going to have to profile it
    Well sumbitch.  It turns out that the outer join was with a subquery defined as a derived table with an alias, so NOT EXISTS wouldn't be appropriate anyway.  Neither would NOT IN.


    I refer the 'onorable Rivan to the reply I gave some hours ago:

    @me said:
    Then again, sod's law would probably invoke itself with the particular query you have in mind with the database you have in mind.



  • OUTER JOIN and WHERE NOT EXISTS do fundamentally different things.  If you change a NOT EXISTS-type query into an outer join, be careful that there isn't more than one matching row.  You can probably still use WHERE NOT EXISTS if you pull the subquery into a common table expression (assuming SQL Server 2005 or later; Oracle has a similar feature).
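
    A rough sketch of what that might look like (hypothetical names; SQL Server 2005+ syntax, and Oracle has the same WITH clause):

        -- Pull the subquery out into a CTE, then anti-join against it with NOT EXISTS.
        WITH order_totals AS (
            SELECT customer_id, SUM(amount) AS total
            FROM orders
            GROUP BY customer_id
        )
        SELECT t1.*
        FROM table1 t1
        WHERE NOT EXISTS (SELECT 1 FROM order_totals o WHERE o.customer_id = t1.id);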



  • What Jamie says is true. The two are very different unless you have a unique constraint on your FK, in which case the query plan should work itself out to be the same.
    Where you can interchange them (and I don't think a subquery would make a difference - it's the nature of the data, not the source), EXISTS should be faster as it can stop as soon as it gets a positive answer. NOT EXISTS will probably perform much the same as the outer join, because it cannot know the answer until it has exhausted all possibilities.

    EXISTS and IN are also different when it comes to NULLs, so they are not always interchangeable either (unless there is a NOT NULL constraint in place, in which case they can be) - see the sketch after this post.


    Basically, my outlook is

    a) Even if it looks like there are multiple ways of doing the same thing, there are going to be differences, as the designers would not bother implementing the same thing twice for no reason.

    b) Where two options are provably identical (IN with not null vs EXISTS) then the optimiser should generate the same plan for both cases so why worry about which is faster?

    c) Always check assumptions for b) by looking at the query plan :)
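
    A hypothetical illustration of the NULL point: if table2.id is nullable and holds even one NULL, the two forms below stop being interchangeable.

        -- Returns no rows at all if table2.id contains any NULL, because
        -- "t1.id NOT IN (1, 2, NULL)" evaluates to UNKNOWN for every row.
        SELECT t1.*
        FROM table1 t1
        WHERE t1.id NOT IN (SELECT t2.id FROM table2 t2);

        -- Unaffected by NULLs in table2.id; rows with no match are returned.
        SELECT t1.*
        FROM table1 t1
        WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.id = t1.id);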



  • @LoztInSpace said:

    a) Even if it looks like there are multiple ways of doing the same thing, there are going to be differences, as the designers would not bother implementing the same thing twice for no reason.
     

     Thread hijack:  is BETWEEN faster than doing a manual (min <= value <= max) in SQL, for whatever quantity of "faster"
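
    In other words (with a made-up table and column), whether these two get the same plan:

        -- BETWEEN form
        SELECT * FROM orders WHERE amount BETWEEN 100 AND 200;

        -- Manual form; the standard defines BETWEEN as shorthand for exactly this
        SELECT * FROM orders WHERE amount >= 100 AND amount <= 200;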



  • @belgariontheking said:

    Method 1: Do an outer join and then a WHERE table2.whateverfield IS NULL

    Method 2: do a WHERE NOT EXISTS

     

    I put my bet on Method 2.



  • @dhromed said:

     Thread hijack:  is BETWEEN faster than doing a manual (min <= value <= max) in SQL, for whatever quantity of "faster"
     

    I don't think so.



  • @ammoQ said:

    @dhromed said:
     Thread hijack:  is BETWEEN faster than doing a manual (min <= value <= max) in SQL, for whatever quantity of "faster"
    I don't think so.
    I agree with fearless leader. 

    When looking through explain plans, it looks like the optimizer actually translates BETWEEN straight into its <= and >= counterparts.



  • @belgariontheking said:

    @ammoQ said:

    @dhromed said:
     Thread hijack:  is BETWEEN faster than doing a manual (min <= value <= max) in SQL, for whatever quantity of "faster"
    I don't think so.
    I agree with fearless leader. 

    When looking through explain plans, it looks like the optimizer actually translates BETWEEN straight into its <= and >= counterparts.

    BETWEEN was faster in the early '90s.  I think I still have an install disk for SQL Server 4.2, I wonder if it will install on anything?  There is a lot of twenty year old database advice floating around the Internet that doesn't apply anymore.  I also know a lot of people who still make choices ignoring the fact that SQL Server has been doing ad-hoc batch caching for fourteen years.



  • @Jaime said:

    @belgariontheking said:

      

    When looking through explain plans, it looks like the optimizer actually translates BETWEEN straight into its <= and >= counterparts.

    BETWEEN was faster in the early '90s.  I think I still have an install disk for SQL Server 4.2, I wonder if it will install on anything?  There is a lot of twenty year old database advice floating around the Internet that doesn't apply anymore.  I also know a lot of people who still make choices ignoring the fact that SQL Server has been doing ad-hoc batch caching for fourteen years.

    Yep--gotta specify database & version on a question like that ... I just got an explain back from a Teradata 3 server that shows BETWEEN getting translated the same way, so there's another DBMS where it doesn't matter.



  • @Jaime said:

    There is a lot of twenty year old database advice floating around the Internet that doesn't apply anymore.

    See also COUNT(1) vs COUNT(*), order of tables in joins, ANSI vs old-school joins...



  • @LoztInSpace said:

    See also COUNT(1) vs COUNT(*), order of tables in joins, ANSI vs old-school joins...
     

    A lot of tuning advice is based on urban legends, incorrect measuring methods and outdated technology.

    How about this one: queries run 4 times faster when you write the select statement with a capital S?


    erich@troubadix SQL> set timing on
    erich@troubadix SQL> select count(*) from user_tab_columns;

      COUNT(*)
    ----------
             0

    Elapsed: 00:00:00.28
    erich@troubadix SQL> Select count(*) from user_tab_columns;

      COUNT(*)
    ----------
             0

    Elapsed: 00:00:00.07
    erich@troubadix SQL>          

    (The one who replies to this with "caching" didn't get the joke)

     



  • @ammoQ said:

    (The one who replies to this with "caching" didn't get the joke)
     

    I don't get the joke either, because I'm just that thick.

    @ammoQ said:

    COUNT(1) vs COUNT(*),

    In MSSQL, afaik, no difference, as it's smart enough to ignore the * in count(*)

     



  • @dhromed said:

    @ammoQ said:
    COUNT(1) vs COUNT(*),
     

    learn2quote



  • @ammoQ said:

    learn2quote
     

    :care

    It was not a particularly damaging ad-hominem quote, I think, though of course I admit that I never paid attention to the username out of ignorant unroutine.



  • @SCSimmons said:

    Yep--gotta specify database & version on a question like that
    OK ... the original question was Whoracle, and all of my comments were Whoracle specific



  • Sorry to be so slow replying, but the fastest is Method 3. Which is, update your inner table, setting a flag field, just using a direct inner join and = (equals). Then select the entries where the flag isn't set. Faster than outer join or not exists in many cases.



  • @vr602 said:

    Sorry to be so slow replying, but the fastest is Method 3. Which is, update your inner table, setting a flag field, just using a direct inner join and = (equals). Then select the entries where the flag isn't set. Faster than outer join or not exists in many cases.
    Wait, I think you're confused.

    While I'm creating the inner table, I have no idea what will be in the inner join and what will be on the outer join, so how can I expect to set a flag?



  • @belgariontheking said:

    @vr602 said:

    Sorry to be so slow replying, but the fastest is Method 3. Which is, update your inner table, setting a flag field, just using a direct inner join and = (equals). Then select the entries where the flag isn't set. Faster than outer join or not exists in many cases.
    Wait, I think you're confused.

    While I'm creating the inner table, I have no idea what will be in the inner join and what will be on the outer join, so how can I expect to set a flag?

    You could use a temp table, but I still think it's a bad idea.  Any DBMS worth a penny will automatically make a temporary work table to process a JOIN or an IN if it will speed up the process.  Explicitly coding it this way takes a bunch of options away from the database engine.  Modifying your schema to accommodate a query as vr602 suggested sounds insane to me.  I see many hours of debugging multi-threading issues in vr602's future.
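
    If you really did need to materialise the intermediate result yourself, a session-local temp table at least keeps the flag out of the shared schema - a rough SQL Server-style sketch with made-up names:

        -- Materialise the matching keys once...
        SELECT DISTINCT t2.id
        INTO #matched
        FROM table2 t2;

        -- ...then anti-join against the temp table instead of flagging shared rows.
        SELECT t1.*
        FROM table1 t1
        WHERE NOT EXISTS (SELECT 1 FROM #matched m WHERE m.id = t1.id);

        DROP TABLE #matched;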



  • I agree with Jamie.  "Method 3" is completely retarded.  How could anything that does writes possibly be slower than something that doesn't?  The DB has to identify the rows, update them, then identify the rows that were not updated.  All you are doing is introducing extra I/O, not to mention how this would possibly work with more than one user.

    Any DB engine would be able to work this out without resorting to this nonsense.



  • One key difference between method 1 and method 2 is that the LEFT JOIN query could return extra rows from the outer table, in the case of a many-to-one relationship; whereas the NOT EXISTS query would not.

    Just mentioning it because I didn't see this point in the thread so far.



  • @RaspenJho said:

    One key difference between method 1 and method 2 is that the LEFT JOIN query could return extra rows from the outer table, in the case of a many-to-one relationship; whereas the NOT EXISTS query would not. Just mentioning it because I didn't see this point in the thread so far.

    Mentioned on 2/2/2010 by me.



  • @LoztInSpace said:

    @Jaime said:

    There is a lot of twenty year old database advice floating around the Internet that doesn't apply anymore.

    See also COUNT(1) vs COUNT(*), order of tables in joins, ANSI vs old-school joins...

    While your point is generally valid, as recently as a couple of years ago I significantly improved the performance of one of our queries by changing the order of the tables in the FROM clause.

    To be fair, although this was an Oracle 9i database, it was operating in rules-based mode because the primary application we were running on it had that as a requirement. Fortunately the next version of the app (to which we upgraded at the end of 2008) can deal with the CBO. Now we just have to deal with the (database, not app) bug that means we have to have the DB set to not rewrite OR queries as UNIONs, so a query like "... where indexedfield1 = 'A' or indexedfield2 = 'B'" has terrible performance.
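
    The rewrite that has to stay switched off looks roughly like this (table name invented): each branch of the UNION can drive its own index, whereas the single OR predicate ends up performing badly.

        -- The query as written, with one OR across two indexed columns:
        SELECT * FROM sometable
        WHERE indexedfield1 = 'A' OR indexedfield2 = 'B';

        -- The OR-expansion rewrite, roughly; logically equivalent as long as
        -- the base rows are distinct, and each branch can use its own index.
        SELECT * FROM sometable WHERE indexedfield1 = 'A'
        UNION
        SELECT * FROM sometable WHERE indexedfield2 = 'B';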



  • @Scarlet Manuka said:

    Now we just have to deal with the (database, not app) bug that means we have to have the DB set to not rewrite OR queries as UNIONs, so a query like "... where indexedfield1 = 'A' or indexedfield2 = 'B'" has terrible performance.
    While your bug definitely sounds retarded, have you ever heard of the IN operative?



  • @Scarlet Manuka said:

    @LoztInSpace said:

    @Jaime said:

    There is a lot of twenty year old database advice floating around the Internet that doesn't apply anymore.

    See also COUNT(1) vs COUNT(*), order of tables in joins, ANSI vs old-school joins...

    While your point is generally valid, as recently as a couple of years ago I significantly improved the performance of one of our queries by changing the order of the tables in the FROM clause. To be fair, although this was an Oracle 9i database, ....

    Oh for fuck's sake.  This completely underlines the point; it is not an exception.  The CBO was introduced in 1992.  1992!!  Just google "top 10 1992" if you want a reminder of how long ago that was.

    Hey, as recently as yesterday I managed to increase my RAMDISK by loading "HIMEM.SYS" in my config.sys.  Performance increase is noticeable even without pressing the turbo button.

    There is no point to your "point" as we are talking exactly about how irrelevant 20+ year old "truths" can be.



  • @LoztInSpace said:

    I agree with Jamie.  "Method 3" is completely retarded.  How could anything that does writes possibly be slower than something that doesn't?

    You need to have another look at this thought, and insert a "not", or maybe change "faster" for "slower". I guess that's what you meant anyway. Also, you're wrong: There are times where this is much faster. That may seem "retarded" to you, but it's a reality in certain database cases.


  • @vr602 said:

    @LoztInSpace said:

    I agree with Jamie.  "Method 3" is completely retarded.  How could anything that does writes possibly be slower than something that doesn't?

    You need to have another look at this thought, and insert a "not", or maybe change "faster" for "slower". I guess that's what you meant anyway. Also, you're wrong: There are times where this is much faster. That may seem "retarded" to you, but it's a reality in certain database cases.

    It's never a reality.  A temp table or table variable will work just as well in 99.999% of cases, and will not introduce multi-user bugs or lock contention.  Even the temp table solution should only be used in those edge cases where the DBMS can't optimize the query.



  • Bored at work, catching up on old threads. What's your point?

    @belgariontheking said:

    @Scarlet Manuka said:

    Now we just have to deal with the (database, not app) bug that means we have to have the DB set to not rewrite OR queries as UNIONs, so a query like "... where indexedfield1 = 'A' or indexedfield2 = 'B'" has terrible performance.
    While your bug definitely sounds retarded, have you ever heard of the IN operative?

    indexedfield1 and indexedfield2 are not the same field. That's why I put the numbers at the end...

    @LoztInSpace said:

    @Scarlet Manuka said:

    While your point is generally valid, as recently as a couple of years ago I significantly improved the performance of one of our queries by changing the order of the tables in the FROM clause. To be fair, although this was an Oracle 9i database, ....

    Oh for fucks sake.  This completely underlines the point, it is not an exception.  The CBO was introduced in 1992.

    There is no point to your "point" as we are talking exactly about how irrelevant 20+ year old "truths" can be.

    It appears that you have missed my point altogether, which is that as recently as 2008 the current version of an enterprise application *still* required the RBO and would not work with the CBO. So when you have to work within constraints like that, 20 year old database advice can be very relevant indeed. Did you not read the part where I said it significantly improved the performance of that query? How can you claim it is irrelevant if it has a profound effect?



  • Did you have to read the whole thread again to work up a nice rage after all that time?



  • Who needs rage when you have boredom?



  • @Scarlet Manuka said:

    Who needs pants when you have boredom?
    BTKTFY

