[RESOLVED] Join versus Where Exists
I want to get data out of a single table, restricted to where an ID field exists on another table. The ID may exist more than once on the second table, so doing a join would cause me to get duplicate values from the first table. However, it's also possible that there would be records on the first table where the data I retrieve is identical, and I would want each of these to be retrieved - so I can't use Distinct.
I can do it easily enough with this:
Code:
Select A.Field1
, A.Field2
, A.Etc
From tblAgent A
Where Exists (Select 1 from tblTransaction T where T.FirmID = A.ParentFirmID)
... but I worry that this would be very inefficient.
Is there a neater way to do it? A particular type of join, perhaps?
Thanks...
Re: Join versus Where Exists
Exists is perfect for this need.
You can SELECT * in this case as the WHERE part of an EXISTS is all that is ever evaluated. Don't go thinking that SELECT 1 is faster!
If this was a stored procedure and if SPEED was a factor, you could build a "distinct list of PARENTFIRMID's" into a TEMP TABLE and then SELECT and JOIN from that.
Re: Join versus Where Exists
What DBMS are you using?
I'd probably start by getting the DISTINCT IDs from T first... then either use an inner join to the sub-query or use in an IN clause ... IF SQL Server 2008 or newer, I'd poss the DISTINCT ID into a CTE (Common Table Expression) then join to it. Although if it's this simple, the CTE is probably overkill.
Code:
select
A.Field1, A.Field2, A.Etc
from tblAgent A
inner join (select distinct FirmID from tblTransaction T)
on A.ParentFirmID = T.FirmID
Code:
select
A.Field1, A.Field2, A.Etc
from tblAgent A
where A.FirmID in (select distinct FirmID from tblTransaction T)
Code:
; with T as (
select distinct FirmID from tblTransaction
)
select
A.Field1, A.Field2, A.Etc
from tblAgent A
inner join T
on A.ParentFirmID = T.FirmID
-tg
Re: Join versus Where Exists
szlamany - Our coding standards say to never use SELECT *, so I tend to stick with that even if it doesn't matter.
It's SQL 2008 R2, and it's to be used for a view that will only be used as part of an SSIS package. Speed may be a factor, so I think I might go for the CTE. I just wasn't sure whether that (or a temp table) would necessarily be faster, but presumably the IF EXISTS would run for every record...?
Re: Join versus Where Exists
Quote:
Our coding standards say to never use SELECT *
This is probably one of the rare cases you should break the coding standards. It's a good standard if the * is actually going to be evaulated but in the case it isn't so the standard is non-sensical. I don't think it will actually make any difference because I don't believe your 1 will get evaluated either but I'm not 100% sure on that. At best your select 1 will perform the same as Select * but at worst it will be worse.
Re: Join versus Where Exists
The fact it might run for every record scared me - thus the suggestion of a TEMP TABLE to join from instead.
You can always check out the actual execution plan - see how that addresses the EXISTS.
Re: Join versus Where Exists
My personal experience has taught me:
CTEs tend to be faster than sub queries
Sometimes a temp table or a table variable can be faster than a CTE... I havent' found what the tipping point is, but it probably has to do with the number of CTEs (I had a bout a dozen of them) and their complexity, and the fact that some of the CTEs fed into other CTEs which fed into... you get the idea.
Above all, starting off with minimal data then joining to other tables, is the most effective performance enhancer I've seen with SQL. The natural instinct is to start with everything, then whittle it down... but if you can invert that logic, you can squeeze out some interesting time savings.
Code:
; with T as (
select distinct FirmID from tblTransaction
)
select
A.Field1, A.Field2, A.Etc
from T
inner join tblAgent A
on A.ParentFirmID = T.FirmID
Since T has the least amount of rows/data in it, when you then join to tblAgent, assuming ParentFirmID is indexed, you can get some huge performance gains since it will only then include the rows that actually match. if FirmID in tblTransaction is indexed, then your CTE should result in just an index scan, speeding things up a tad more, and if it's clustered... it doesn't get much more better than that.... AH... actually there is one more tweak, I'm a little surprised szlamany didn't mention it - change the DISCTING to a GROUP BY:
Code:
; with T as (
select FirmID from tblTransaction gorup by FirmID
)
select
A.Field1, A.Field2, A.Etc
from T
inner join tblAgent A
on A.ParentFirmID = T.FirmID
The net effect is the same, how it arrives at it will be different. I think at that point you will have squeezed as much performance out of that as you can.
-tg
Re: Join versus Where Exists
Guys - thank you. The execution plan was fairly non-committal, probably because there's no actual data in the Agent table anymore (someone helpfully dropped and recreated it), but it did look as if the EXISTS would have a detrimental effect. Alternating between SELECT * and SELECT 1 didn't seem to make any difference at all.
I've elected to go with the CTE, and swap the DISTINCT for a GROUP BY as suggested by TG.
Thanks for your help, everyone - much appreciated.
Re: Join versus Where Exists
Hmmm - I'm not sure whether I've successfully rated your posts. I just get the message that "you must spread some reputation around before rating this post" or words to that effect. I got the same message for each of you. :(
Re: Join versus Where Exists
Quote:
Originally Posted by
techgnome
actually there is one more tweak, I'm a little surprised szlamany didn't mention it - change the DISCTING to a GROUP BY:
I do usually beat up DISTINCT - have for years...
Current info on the interweb seems to indicate that DISTINCT and GROUP BY now reduces down to the same execution plan. In the old days (SQL 2000-ish) GROUP BY was done during RESULT SET building and DISTINCT was done after RESULT SET building - thus my dislike.
I've also watched junior programmers of mine use DISTINCT to get around phantom rows created by JOIN's to one-to-many links. That's just poor coding in general!
btw - spread rep around means you need to give to others - we are blocked due to the fact that you've given to us without doing others as well. Keeps people from repeat rep'ing each other I guess...
Re: Join versus Where Exists
Quote:
Originally Posted by
szlamany
btw - spread rep around means you need to give to others - we are blocked due to the fact that you've given to us without doing others as well. Keeps people from repeat rep'ing each other I guess...
That's a little mean if it's always the same bunch of people who help you out.