-
Jun 23rd, 2018, 08:10 AM
#1
Removing duplicates
I have this SQL statement to find duplicate rows in a table....
SQL Code:
select A, B, count(*) as count FROM table group by A, B having count>1
The table being queried also has an int field named Version which is unique for the grouped records.
Can some please show me how to take those results and remove the duplicate the has the lowest version?
For example if the result is
A|B|count
a|b|2
then there are 2 records that could look like this
A|B|version
a|b|1
a|b|2
I need to remove the record where version = 1
Thanks for looking
Kevin
Last edited by kebo; Jun 23rd, 2018 at 08:21 AM.
Process control doesn't give you good quality, it gives you consistent quality.
Good quality comes from consistently doing the right things.
Vague general questions have vague general answers. A $100 donation is required for me to help you if you PM me asking for help. Instructions for donating to one of our local charities will be provided.
______________________________ Last edited by kebo : Now. Reason: superfluous typo's
-
Jun 23rd, 2018, 05:20 PM
#2
Re: Removing duplicates
What if there's 3?
A|B|version
a|b|1
a|b|2
a|b|3
Should only 1 be removed? or 1 & 2? I ask because it changes how the query gets built.
This is something I've dealt with a lot in the past. Usually I'm only concerned with the most recent, rather than simply removing the lowest version.
Also, what DBMS are you using? SQL Server or Access or something else? May or may not make a difference.
-tg
-
Jun 23rd, 2018, 05:48 PM
#3
Re: Removing duplicates
If you only want to keep the highest Version then'
Code:
"Select A, B, Version ... ORDER BY A,B,Version"
Dim valA As String = "", valB As String = ""
For i As Integer = dt.Rows.Count - 1 To 0 Step -1
If valA = dt.Rows(i).Item("A").ToString AndAlso valB = dt.Rows(i).Item("B").ToString Then
dt.Rows(i).Delete()
Else
valA = dt.Rows(i).Item("A").ToString
valB = dt.Rows(i).Item("B").ToString
End If
Next
This was written free handed and untested but it seems something close to this would do the job.
-
Jun 23rd, 2018, 06:03 PM
#4
Re: Removing duplicates
Soooo.... let's give a vb.net solution in the Database Development forum. :P
I'd rather do it all in SQL (which I have done before) where all the slicing and dicing can be done on sets of data rather than resorting to looping through things.
if you only want just the max line...
Code:
select ST.A, ST.B From SomeTable ST
inner join (select A, B, max(Version) MaxVer from SomeTable Group By A,B) M on ST.A=M.A and ST.B = M.B and ST.Version = M.MaxVer
If all you want to do is simply eliminate the lowest version...
Code:
select ST.A, ST.B From SomeTable ST
inner join (select A, B, min(Version) MinVer from SomeTable Group By A,B) M on ST.A=M.A and ST.B = M.B and ST.Version <> M.MinVer
-tg
-
Jun 23rd, 2018, 06:31 PM
#5
Re: Removing duplicates
The .net solution would be simples, but I don't want to go that way.
The reality is that I want only the newest version (the highest value), but since all of my data is showing only 2 duplicates so I guess either way would work. I'll give it shot.
Thanks
I'm on MySQL btw.
Process control doesn't give you good quality, it gives you consistent quality.
Good quality comes from consistently doing the right things.
Vague general questions have vague general answers. A $100 donation is required for me to help you if you PM me asking for help. Instructions for donating to one of our local charities will be provided.
______________________________ Last edited by kebo : Now. Reason: superfluous typo's
-
Jun 23rd, 2018, 09:26 PM
#6
Re: Removing duplicates
Code:
The .net solution would be simples, but I don't want to go that way.
No problem, I was guessing you needed to clean up a database table by permanently removing the duplicates, which my example would do when you called the datatable Update method. tg example doesn't do that. At least I don't think it does, my SQL skills aren't what they use to be. Hell, come to think about it, none of my skills are what they use to be.
Last edited by wes4dbt; Jun 23rd, 2018 at 10:06 PM.
-
Jun 24th, 2018, 09:48 AM
#7
Re: Removing duplicates
no, it doesn't delete the duplicates... its not supposed to. I don't think that was the intention. I've worked with a lot of cases where you need to keep versions of data... this seemed like a similar case... otherwise, why is there a version field? It's for cases where you need time-sensitive changes, and can't delete or update the original row.
-tg
-
Jun 24th, 2018, 10:38 AM
#8
Re: Removing duplicates
In my case it is actually supposed to remove the records, but I'll accept what tg has posted because I think I can work it out from there.
When we do a release, the version value for the record is rev'ed and we generate a set of "Golden Docs" for the release. These doc are the historical record and they contain every bit of data known about the release at the time of release so there is no reason to maintain the old stuff in the DB. Unfortunately, there was a twist made to how my customer handles releases a number of months ago, and that left about 200 old versions in the database that need dealing with.
Process control doesn't give you good quality, it gives you consistent quality.
Good quality comes from consistently doing the right things.
Vague general questions have vague general answers. A $100 donation is required for me to help you if you PM me asking for help. Instructions for donating to one of our local charities will be provided.
______________________________ Last edited by kebo : Now. Reason: superfluous typo's
-
Jun 24th, 2018, 02:39 PM
#9
Re: Removing duplicates
What database system are you using?
In Sql Server and Amazon Redshift you can use Row_Number Over (Partition By <field(s)> Order By <field(s)>) then use the Where clause to filter by that row number field keeping all the 1's.
For example using Sql Server:
Code:
With CTE As (
Select A
,B
,Row_Number() Over (Partition By A, B Order By A, B) As [RowNum]
From Table
)
Select A
,B
From CTE
Where [RowNum] = 1;
Edit: I should clarify this will keep the first instance of each (whether it has any duplicates) if you want to delete just the duplicates change it from a Select From CTE into a Delete From CTE and change the Where to "> 1" rather than "= 1"
For MySQL databases you'll have to create your own ranking technique: sql - ROW_NUMBER() in MySQL - Stack Overflow
Last edited by JuggaloBrotha; Jun 24th, 2018 at 09:23 PM.
-
Jun 24th, 2018, 03:30 PM
#10
Process control doesn't give you good quality, it gives you consistent quality.
Good quality comes from consistently doing the right things.
Vague general questions have vague general answers. A $100 donation is required for me to help you if you PM me asking for help. Instructions for donating to one of our local charities will be provided.
______________________________ Last edited by kebo : Now. Reason: superfluous typo's
-
Jun 28th, 2018, 08:38 AM
#11
Re: Removing duplicates
Perhaps something like the following (I don't have MySql so this may not work).
Code:
SELECT *
FROM dbo.Customer
WHERE CustomerIdentifier NOT IN ( SELECT MIN(CustomerIdentifier)
FROM dbo.Customer
GROUP BY CompanyName ,
ContactName ,
ContactTitle ,
Address ,
City ,
PostalCode );
See the following
https://social.technet.microsoft.com...-identity.aspx
-
Jul 2nd, 2018, 02:57 AM
#12
Re: Removing duplicates
I don't know MySQL syntax very well but I believe both the following are ANSII standard so should work to actually do the delete (in this I've assumed you're deleting everything except the latest - I don't think you ever answered whether that was what you were after). I'm not in apposition to test so I've included a couple of selects you can use to see what the result will be before running :-
Code:
Delete NotMax
--Select NotMax.*
From Table NotMax
Left Join (Select A, B, Max(VersionNo) as VersionNo
From Table
Group By A, B) IsMax
On NotMax.A = IsMax.A
And NotMax.B = IsMax.B
And NotMax.VersionNo = IsMax.VersionNo
Where IsMax.A is null
Delete
--Select *
From Table NotMax
Where VersionNo < (Select Max(VersionNo)
From Table IsMax
Where IsMax.A = NotMax.A
And IsMax.B = NotMax.B)
Note, if MySQL supports an Except it ca be done much more simply but I don't think that's Ansii standard so I haven't included it here.
The best argument against democracy is a five minute conversation with the average voter - Winston Churchill
Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|