[RESOLVED] Why is "not in" slower than "join"

Printable View

Feb 13th, 2008, 07:42 AM
srisa

[RESOLVED] Why is "not in" slower than "join"

I have this query to retrieve rows from table r which do not have a corresponding entry in the table ra.

Code:

SELECT name, r_id FROM r WHERE r_id NOT IN (SELECT r_id FROM ra GROUP BY r_id);

This one is taking about 8 seconds to execute.
On the other hand, this query

Code:

SELECT r.name, r.r_id FROM r LEFT JOIN ra on r.r_id = ra.r_id WHERE ra.r_id is NULL;

is taking less than a second.
Can you explain the reason for the difference in time taken?
Thank you.
Feb 13th, 2008, 08:06 AM
szlamany

Re: Why is "not in" slower than "join"

If you are using MS SQL then you can review the execution plan and see exactly why it's different.

From a 30,000 ft view...

I have always felt that JOIN is the natural SQL engine experience. The optimizer might actually be smart enough to go to the table ra first in that query (you would have to view the execution plan to really know that).

NOT IN is a tough construct to understand. If you imagine for a second it might actually build a "list" and then hit each r_id against that list. Building that can't be fast...

btw - you can use JOIN instead of LEFT JOIN and actually get rid of your WHERE statement. JOIN will only return a row from the table on the "left" if the sister row exists in the table on the "right" - which is what your WHERE clause is testing.
Feb 13th, 2008, 08:17 AM
FunkyDexter

Re: Why is "not in" slower than "join"

Quote:

btw - you can use JOIN instead of LEFT JOIN and actually get rid of your WHERE statement. JOIN will only return a row from the table on the "left" if the sister row exists in the table on the "right" - which is what your WHERE clause is testing.

That won't work because he's actually test that there's not a row in the sister table.

I must admit I tend to use a not exists clause for that myself but I've never performance tested it.
Feb 13th, 2008, 08:18 AM
srisa

Re: Why is "not in" slower than "join"

Quote:

Originally Posted by szlamany

If you are using MS SQL then you can review the execution plan and see exactly why it's different.

btw - you can use JOIN instead of LEFT JOIN and actually get rid of your WHERE statement. JOIN will only return a row from the table on the "left" if the sister row exists in the table on the "right" - which is what your WHERE clause is testing.

I am using MySQL. Will 'EXPLAIN' review the execution plan.
I am not sure if JOIN will do what I want. I want to know r_id's in table r which are NOT in table r_id. I can't claim to be knowledgeable than you but JOIN will give me rows which have entry in both tables.
Feb 13th, 2008, 08:24 AM
szlamany

Re: Why is "not in" slower than "join"

Quote:

Originally Posted by FunkyDexter

That won't work because he's actually test that there's not a row in the sister table.

oops - I didn't catch that the WHERE clause was looking for that condition!

JOIN will not work in this situation...
Feb 13th, 2008, 08:30 AM
szlamany

1 Attachment(s)

Re: Why is "not in" slower than "join"

In MS SQL you can see an execution plan.

I did this set of queries

Code:

Create Table #Test1 (ID1 int) Create Table #Test2 (ID2 int) Insert into #Test1 Select MasId From Retfiles.dbo.Person_T Insert into #Test2 Select MasId From Retfiles.dbo.Person_T Where MasId<15000 Select * From #Test1 Left Join #Test2 on #Test2.Id2=#Test1.Id1 Where #Test2.Id2 is null Select * From #Test1 Where Id1 not in (Select Id2 From #Test2) Drop Table #Test1 Drop Table #Test2

And this is the execution plan - note the additional steps taken in the second query.

The plan only represents the two SELECT queries - you can see that the first one (JOIN) is 13% of the total batch time - the NOT IN was 87% of the total batch time. That's about 8 times slower.
Feb 13th, 2008, 09:02 AM
srisa

Re: Why is "not in" slower than "join"

There is a statement called 'EXPLAIN' which gives out info about the performance of a query in MySQL. And here is the link which gives detailed info about the 'EXPLAIN'.explain
Feb 13th, 2008, 09:10 AM
szlamany

Re: Why is "not in" slower than "join"

Have you used EXPLAIN on your queries to see if the info given is meaningful? The example on the link you gave is rather simplistic.
Feb 13th, 2008, 09:20 AM
srisa

1 Attachment(s)

Re: Why is "not in" slower than "join"

I have attached screen shot of the result as given by phpmyadmin.
First result is for the join, second one is for the not in.
Feb 13th, 2008, 09:43 AM
FunkyDexter

Re: Why is "not in" slower than "join"

I've never examined the results of explain before but looking at the results you got and checking against the info on your link, two things imediately stand out:-
1. The link says that Type is the all important column and gives a list of which types are prefereble. In your results the join has an all and a ref, the 'in' has an all and an index. a ref is prefereable to an index so that's one reason it's faster.
2. Look at the row count being processed. The join processes 1 and 5545, the 'in' processes 5530 and 5565. As it's a subquery I suspect you should actually multiply those two numbers to get a meaningful indication of performance, which means the 'in' would be something like 5000 time less efficient! I suspect that lots of other factors come into play meaning you won't see anything like that much difference but it's not a good starting point.

Quote:

That's about 8 times slower

I'm surprised it makes that much difference :eek:. I don't suppose your in a position to test out what a 'not exists' would do are you?
Feb 13th, 2008, 09:54 AM
srisa

Re: Why is "not in" slower than "join"

With the "not in" we assume that the optimizer will retrieve all the r_id's from ra table and do something like

Code:

select name, r_id from r where r_id not in (1,2,3,4);

But that doesn't seem to be the case. It appears that a full table scan of ra table is being performed for every row of table r. Is that how the optimizer works? Can we somehow force it to perform the subquery just once?
Feb 13th, 2008, 10:21 AM
szlamany

Re: Why is "not in" slower than "join"

Quote:

Originally Posted by srisa

Can we somehow force it to perform the subquery just once?

Yes - you can.

That's what LEFT JOIN does!!!

That's why it's faster - that's why it's more natural. It's the whole essence of the set-based query processor that SQL uses.
Feb 13th, 2008, 10:24 AM
srisa

Re: Why is "not in" slower than "join"

Thanks folks. This was a good learning experience.
Feb 13th, 2008, 10:27 AM
szlamany

1 Attachment(s)

Re: [RESOLVED] Why is "not in" slower than "join"

Ok - I did these two and compared them

Code:

Select * From #Test1 Left Join #Test2 on #Test2.Id2=#Test1.Id1 Where #Test2.Id2 is null Select * From #Test1 Where not exists (Select Id2 From #Test2 Where #Test2.ID2=#Test1.ID1)

And the NOT EXISTS was faster - but only by 1% over the whole run. Actually the two execution plans are pretty similar. The FILTER that the first one had is what the NOT EXISTS eliminated.

That FILTER is the actual WHERE ID2 IS NULL construct.
Feb 13th, 2008, 10:35 AM
srisa

1 Attachment(s)

Re: [RESOLVED] Why is "not in" slower than "join"

And this is what EXPLAIN says for 'not exists'.
Feb 14th, 2008, 04:07 AM
FunkyDexter

Re: [RESOLVED] Why is "not in" slower than "join"

So it looks like the Not exists would probably be fairly efficient on MySQL but rotton on SQLServer. Left join's definitely the way to go on both though. Thanks for the results, guys.:thumb:
Feb 14th, 2008, 07:26 AM
szlamany

Re: [RESOLVED] Why is "not in" slower than "join"

Actually NOT EXISTS was 49% of the run time in that example - the LEFT JOIN was 51%. Basically the query optimizer selected nearly identical execution plans - just a different hash match at the end.
Feb 14th, 2008, 10:58 AM
FunkyDexter

Re: [RESOLVED] Why is "not in" slower than "join"

Oh I see. Sorry, I missread your post. Pretty similar overall then.