Alternative to OUTER JOIN (SELECT....

**Axcontrols2** · Oct 13th, 2018, 09:56 AM

I use "OUTER JOIN" because I want all rows from TABLE_A even if no matching row in B exists.
I use "JOIN ( SELECT DISTINCT" because I only want one row from B if more than one exists.

Code:

SELECT col_a, col_b, col_c, b_id FROM TABLE_A
LEFT OUTER JOIN ( SELECT DISTINCT b_id FROM TABLE_B ) AS B WHERE B.b_id = TABLE_A.id

The above works fine and fast on sensible servers. For example JET, MSSQL and PC running MySQL 8 and another server running MariaDB 10 on a 64bit ARM processor.

But, for testing, I also run my queries against MySQL 5.1 running on a single core armada processor.

But on that old server the above query takes forever.

If I just do a standard OUTER JOIN on both tables (without JOIN(SELECT) then I get more rows than I want but, even on the old server, it's fast. So it has nothing to do with the number of rows being returned.

I understand that the SELECT inside the JOIN means the server has to do a lot of queries and the problem is just that the old server hasn't got the power to do the work.

So is there an alternative syntax that avoids the use of the JOIN(SELECT?

Thanks

**si_the_geek** · Oct 13th, 2018, 10:13 AM

One alternative (which will be a bit slower for JET etc) is to create a temporary table from the sub-query, and in the query join to that temporary table.

It should be much quicker for the armada device, as it is presumably being limited by memory rather than processor, and this would mean that you avoid the need for memory usage of storing b_id (because the temporary table means it is stored to disk etc).

**Axcontrols2** · Oct 13th, 2018, 11:08 AM

Code:

CREATE TEMPORARY TABLE B SELECT DISTINCT b_id FROM TABLE_B;
SELECT col_a, col_b, col_c, b_id FROM TABLE_A
LEFT OUTER JOIN B ON B.b_id = TABLE_A.id

I tried that - thought it would work. But no.

It is strange though - if I do a join on, for example, 4 real tables and return 20,000 rows even the armada returns all rows quickly. But if I just JOIN on one Temporary table then it's slow.

ExecuteNonQuery executes in around 3500ms. I assume that includes the time it took to create the temp table. Then Reader .Read returns rows quickly showing the syntax is good but every so often Reader .Read hangs for maybe 500ms to 1000ms causing the whole thing to take forever.

Also tried it using CREATE VIEW and joining on that but it works just the same.

**FunkyDexter** · Oct 15th, 2018, 02:18 AM

The syntax in your original query looks wrong and I doubt it runs on anything. You've mixed up an ANSII style join with the older NON-ANSII style. If you use the JOIN syntax you need an ON to specify the relationship. If you want to specify the relationships in a WHERE clause you express the tables (or sub-queries) as a comma separated list.

Anyway, I believe this should give you what you want in a nice pure ANSII form and is probably the most performant way of doing it since it avoids any correlation issues:-

Code:

SELECT A.col_a, A.col_b, A.col_c, B.b_id
FROM TABLE_A A
LEFT JOIN TABLE_B
	on A.id = B.b_id
GROUP BY A.col_a, A.col_b, A.col_c, B.b_id, A.id

Edit>Note that this will give you the id from Table B if it's there or NULL if it isn't. It strikes me, though, that you may not need to join to Table B at all. You could just use:-

Code:

SELECT A.col_a, A.col_b, A.col_c, A.id
FROM TABLE_A A

The difference being that this will always return a value in the last column, regardless of whether there is a row in Table B or not.

Edit2>

if I just JOIN on one Temporary table then it's slow

Hard to be sure without seeing execution plans etc. but I would hazard a guess that your Temp table isn't indexed while your physical table is.

**Axcontrols2** · Oct 16th, 2018, 04:33 AM

Originally Posted by FunkyDexter

The syntax in your original query looks wrong and I doubt it runs on anything.

Sorry. You're right. I just keyed in that 'example' query without double checking. It should have been "ON" not "WHERE". Something like:

SELECT col_a, col_b, col_c, b_id FROM TABLE_A
LEFT OUTER JOIN ( SELECT DISTINCT b_id FROM TABLE_B ) AS B ON B.b_id = TABLE_A.id

[QUOTE=FunkyDexter;5325621]
Anyway, I believe this should give you what you want in a nice pure ANSII form and is probably the most performant way of doing it since it avoids any correlation issues:-

Code:

SELECT A.col_a, A.col_b, A.col_c, B.b_id
FROM TABLE_A A
LEFT JOIN TABLE_B
	on A.id = B.b_id
GROUP BY A.col_a, A.col_b, A.col_c, B.b_id, A.id

I never tried it with GROUP. I'll give that a whirl.

Originally Posted by FunkyDexter

It strikes me, though, that you may not need to join to Table B at all. You could just use:-
[code]SELECT A.col_a, A.col_b, A.col_c, A.id
FROM TABLE_A A

I don't understand how that will give me 'anything' from table B

Thanks for your comment about indexing on the temp table. Having said that, there's only around 20,000 rows in the temp table and other 'real' tables with that few rows don't perform much better with or without indexing. But I'll test it anyway. But first I'll try your GROUP BY suggestion.

**Axcontrols2** · Oct 16th, 2018, 06:16 AM

Originally Posted by FunkyDexter

Anyway, I believe this should give you what you want in a nice pure ANSII form and is probably the most performant way of doing it since it avoids any correlation issues:-

Code:

SELECT A.col_a, A.col_b, A.col_c, B.b_id
FROM TABLE_A A
LEFT JOIN TABLE_B
	on A.id = B.b_id
GROUP BY A.col_a, A.col_b, A.col_c, B.b_id, A.id

Tested that. It does work on most 'sensible' servers but on a low powered Armada device the ExecuteReader() fails with an error indicating that the /tmp/ folder isn't big enough.

I can increase the size of the tmp folder and it'll work but the current size is the default for that device so want to leave it like that.

**FunkyDexter** · Oct 16th, 2018, 06:56 AM

I don't understand how that will give me 'anything' from table B

The only thing you're selecting from B is B.b_id which is equal to A.ID. So my thinking was that you could just select A.ID without needing to join to B.

It won't produce quite the same result. If there's a value in A.ID (not a null) which doesn't appear in B.b_id you would get the value if you selected A.ID and a null if you selected B.b_id from a left join. So it really depends on whether that could occur in your data and, if it can, whether you care. If it can't occur or you don't care it would give you a more efficient query.

Thanks for your comment about indexing

It wasn't necessarily a recommendation. More a guess at why you might be seeing a performance hit when you used the temp table. It might be a good idea to index it or it might not. If you're processor bound it would be a good thing but if you're disk or memory bound (which it sounds like you are on the Armada device) then it might just push you over the edge and break.

on a low powered Armada device

I'm afraid I know nothing about Armada devices or how to best work with them so I can't help there. If Shaggy were here he'd probably make a joke about "Log Shipping" or something.

**Axcontrols2** · Oct 16th, 2018, 08:53 AM

@FunkyDexter. First let me thank you for your input. Really appreciated.

The 'Armada' device I'm referring to is a Western Digital 'MyCloud' storage device which has MySQL 5.1 built in. It has 256MB Ram and a 1GHz 1 core Armada processor.

I just use that as a MySQL test server because it's low on processor power, low on memory and I can physically pick it up and stick it on the other end of a slow network link.

I do test with proper (real) servers with lots of power and memory running MySQL 8 and MariaDB 10. But I optimize my application and queries to run in worst case scenario.

But let's say that I don't actually need the value from table B. But I do need to know if any matching rows in B exist.
So is there a way to get all the data from A and a simple null/not null flag that indicates if 'any' matching row/rows exist in B.

**FunkyDexter** · Oct 16th, 2018, 09:22 AM

There's a few ways. The one that jumps out to me is a CASE EXISTS. I don't know if it will perform better or worse than the join. It introduces a correlated sub query which is bad for performance but it would remove the Group By so it might make up for it there. And it should reduce the memory usage (This is SQL Server syntax but I think it would be the same for MySQL:-

Code:

Select	A.col_a,
		A.col_b,
		A.col_c,
		Case When Exists (Select *
							From Table_B B
							Where B.b_id = A.ID) then 1 else 0 end as TheresARecordInTableB
From Table_A A

I normally steer people away from corelated sub-queries because they have to be resolved individually for each row (known as RBAR) but where you're memory bound it might be more efficient. And since it's an EXISTS query the engine doesn't need to fully resolve it, it just checks the stats on the index. It might improve things, it might make them worse. Suck it and see on that one.

Bear in mind I'm no MySQL expert. Over the years I've gotten pretty good at predicting what SQLServer will do with a given query but MySQL's still a bit of a mystery to me.

**Axcontrols2** · Oct 16th, 2018, 12:32 PM

I've never used a Sub Query in the column list before. Looks weird.

But. That works. Thank you.

It's an Inventory program and many years ago each inventory item had a 'Option' field which could contain something like Red, Green, Blue etc. That was on the basis that, in those days, if an inventory item had a different 'Option' it also had a different part number so therefore a different record in table A so, at that time, table B didn't exist.

But the requirement changed and a single part number can now have many options (or no options) So now table A has one row per unique part number and the options, if any exist, are in table B

The grid loads the OnHand inventory from table A and if options exist in table B the grid row shows a small [+] icon. The user can click that and the row expands downwards to show additional rows with the data from table B and the [+] becomes [-] so the user can close the options again.

When populating the OnHand inventory I'm currently doing an "OUTER JOIN(SELECT DISTINCT ......" on table B just to get B.ID and using that, if it exists, to decide whether to show the [+] options button on the row.

But the WD NAS just hasn't got the guts to do that sub query "SELECT DISTINCT" in the JOIN so I was looking for other ways to do it. That works fine on real servers but I don't actually have any control over what server will be used in future so I test for worst case scenario.

Your code works fine on any of my servers.

Another solution I started using yesterday was to add a field to table A and to populate that with 1 or 0 whenever options were added/removed. Now I don't need to read table B at all. I can see in table A whether options exist and show my [+] button.

Not sure I like the idea of adding that column to table A though. If ever I edit code that adds/removes options to B I have to remember to keep that flag in A sync'd.

What do you think about the idea of adding that column?

Anyway. Thank you once again.

**si_the_geek** · Oct 16th, 2018, 01:08 PM

As it is just for whether or not you display the [+] , you could load the b_id's separately (ordered) and store them in an array/list, and do a binary search of that array/list.

The extra field might work, but leaves you open to mistakes... if anybody ever manually edits one of the tables the values could be wrong, but your program would trust those values.

**techgnome** · Oct 16th, 2018, 01:25 PM

Does MySQL Support scalar functions? If so, then you could create a scalar function that accepts an ID and returns 0/1 if there are related records or not, and either use it in the query, or as a computed field in Table A ... not sure if it would be any better or worse than a related query or a sub query, but it might be an option.

-tg

**FunkyDexter** · Oct 17th, 2018, 05:12 AM

I've never used a Sub Query in the column list before

Generally I steer people away from them. Have a read up on "Correllated Sub Query Performance" to get an idea why. On a decent server it'll probably perform worse than your original query or my suggested query in post 5. But on your mini server it'll use less memory so you're not hitting the caps that those queries are. You're kinda damned if you do and damned if you don't on this one so, if the sub-query is giving you "good enough" performance, I'd just quietly go with it and have your excuses ready in case you ever have to show it to a DBA.

I reckon Si and TG's suggestions are both worth pursuing as well. Si's will mean an extra round trip to the server but it saves having to issue a potentially poor performing sub query and it takes some load off the server and distributes it onto the client PCs which is a good thing if you're having to worry about small servers. I doubt TG's will make much difference to performance either way (though it might, hard to predict without trying) but it will make the query neater and more maintainable which is always a good thing.

Thread: Alternative to OUTER JOIN (SELECT....

Thread Tools

Display

Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Re: Alternative to OUTER JOIN (SELECT....

Posting Permissions