Results 1 to 12 of 12

Thread: Removing duplicates

  1. #1

    Thread Starter
    Still learning kebo's Avatar
    Join Date
    Apr 2004
    Location
    Gardnerville,nv
    Posts
    3,758

    Removing duplicates

    I have this SQL statement to find duplicate rows in a table....
    SQL Code:
    1. select A, B, count(*) as count
    2. FROM table
    3. group by A, B
    4. having count>1

    The table being queried also has an int field named Version which is unique for the grouped records.

    Can some please show me how to take those results and remove the duplicate the has the lowest version?

    For example if the result is

    A|B|count
    a|b|2


    then there are 2 records that could look like this

    A|B|version
    a|b|1
    a|b|2

    I need to remove the record where version = 1

    Thanks for looking
    Kevin
    Last edited by kebo; Jun 23rd, 2018 at 08:21 AM.
    Process control doesn't give you good quality, it gives you consistent quality.
    Good quality comes from consistently doing the right things.

    Vague general questions have vague general answers.
    A $100 donation is required for me to help you if you PM me asking for help. Instructions for donating to one of our local charities will be provided.

    ______________________________
    Last edited by kebo : Now. Reason: superfluous typo's

  2. #2
    Smooth Moperator techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,541

    Re: Removing duplicates

    What if there's 3?

    A|B|version
    a|b|1
    a|b|2
    a|b|3

    Should only 1 be removed? or 1 & 2? I ask because it changes how the query gets built.
    This is something I've dealt with a lot in the past. Usually I'm only concerned with the most recent, rather than simply removing the lowest version.
    Also, what DBMS are you using? SQL Server or Access or something else? May or may not make a difference.


    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  3. #3
    PowerPoster
    Join Date
    Sep 2005
    Location
    Modesto, Ca.
    Posts
    5,206

    Re: Removing duplicates

    If you only want to keep the highest Version then'

    Code:
    "Select A, B, Version ...  ORDER BY A,B,Version"
    
                Dim valA As String = "", valB As String = ""
                For i As Integer = dt.Rows.Count - 1 To 0 Step -1
                    If valA = dt.Rows(i).Item("A").ToString AndAlso valB = dt.Rows(i).Item("B").ToString Then
                        dt.Rows(i).Delete()
                    Else
                        valA = dt.Rows(i).Item("A").ToString
                        valB = dt.Rows(i).Item("B").ToString
                    End If
                Next
    This was written free handed and untested but it seems something close to this would do the job.

  4. #4
    Smooth Moperator techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,541

    Re: Removing duplicates

    Soooo.... let's give a vb.net solution in the Database Development forum. :P
    I'd rather do it all in SQL (which I have done before) where all the slicing and dicing can be done on sets of data rather than resorting to looping through things.

    if you only want just the max line...
    Code:
    select ST.A, ST.B From SomeTable ST
    inner join (select A, B, max(Version) MaxVer from SomeTable Group By A,B) M on ST.A=M.A and ST.B = M.B and ST.Version = M.MaxVer

    If all you want to do is simply eliminate the lowest version...
    Code:
    select ST.A, ST.B From SomeTable ST
    inner join (select A, B, min(Version) MinVer from SomeTable Group By A,B) M on ST.A=M.A and ST.B = M.B and ST.Version <> M.MinVer
    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  5. #5

    Thread Starter
    Still learning kebo's Avatar
    Join Date
    Apr 2004
    Location
    Gardnerville,nv
    Posts
    3,758

    Re: Removing duplicates

    The .net solution would be simples, but I don't want to go that way.

    The reality is that I want only the newest version (the highest value), but since all of my data is showing only 2 duplicates so I guess either way would work. I'll give it shot.
    Thanks

    I'm on MySQL btw.
    Process control doesn't give you good quality, it gives you consistent quality.
    Good quality comes from consistently doing the right things.

    Vague general questions have vague general answers.
    A $100 donation is required for me to help you if you PM me asking for help. Instructions for donating to one of our local charities will be provided.

    ______________________________
    Last edited by kebo : Now. Reason: superfluous typo's

  6. #6
    PowerPoster
    Join Date
    Sep 2005
    Location
    Modesto, Ca.
    Posts
    5,206

    Re: Removing duplicates

    Code:
    The .net solution would be simples, but I don't want to go that way.
    No problem, I was guessing you needed to clean up a database table by permanently removing the duplicates, which my example would do when you called the datatable Update method. tg example doesn't do that. At least I don't think it does, my SQL skills aren't what they use to be. Hell, come to think about it, none of my skills are what they use to be.
    Last edited by wes4dbt; Jun 23rd, 2018 at 10:06 PM.

  7. #7
    Smooth Moperator techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,541

    Re: Removing duplicates

    no, it doesn't delete the duplicates... its not supposed to. I don't think that was the intention. I've worked with a lot of cases where you need to keep versions of data... this seemed like a similar case... otherwise, why is there a version field? It's for cases where you need time-sensitive changes, and can't delete or update the original row.

    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  8. #8

    Thread Starter
    Still learning kebo's Avatar
    Join Date
    Apr 2004
    Location
    Gardnerville,nv
    Posts
    3,758

    Re: Removing duplicates

    In my case it is actually supposed to remove the records, but I'll accept what tg has posted because I think I can work it out from there.

    When we do a release, the version value for the record is rev'ed and we generate a set of "Golden Docs" for the release. These doc are the historical record and they contain every bit of data known about the release at the time of release so there is no reason to maintain the old stuff in the DB. Unfortunately, there was a twist made to how my customer handles releases a number of months ago, and that left about 200 old versions in the database that need dealing with.
    Process control doesn't give you good quality, it gives you consistent quality.
    Good quality comes from consistently doing the right things.

    Vague general questions have vague general answers.
    A $100 donation is required for me to help you if you PM me asking for help. Instructions for donating to one of our local charities will be provided.

    ______________________________
    Last edited by kebo : Now. Reason: superfluous typo's

  9. #9
    PowerPoster JuggaloBrotha's Avatar
    Join Date
    Sep 2005
    Location
    Lansing, MI; USA
    Posts
    4,286

    Re: Removing duplicates

    What database system are you using?

    In Sql Server and Amazon Redshift you can use Row_Number Over (Partition By <field(s)> Order By <field(s)>) then use the Where clause to filter by that row number field keeping all the 1's.

    For example using Sql Server:
    Code:
    With CTE As (
      Select A
            ,B
            ,Row_Number() Over (Partition By A, B Order By A, B) As [RowNum]
      From Table
    )
    Select A
          ,B
    From CTE
    Where [RowNum] = 1;
    Edit: I should clarify this will keep the first instance of each (whether it has any duplicates) if you want to delete just the duplicates change it from a Select From CTE into a Delete From CTE and change the Where to "> 1" rather than "= 1"

    For MySQL databases you'll have to create your own ranking technique: sql - ROW_NUMBER() in MySQL - Stack Overflow
    Last edited by JuggaloBrotha; Jun 24th, 2018 at 09:23 PM.
    Currently using VS 2015 Enterprise on Win10 Enterprise x64.

    CodeBank: All ThreadsColors ComboBoxFading & Gradient FormMoveItemListBox/MoveItemListViewMultilineListBoxMenuButtonToolStripCheckBoxStart with Windows

  10. #10

    Thread Starter
    Still learning kebo's Avatar
    Join Date
    Apr 2004
    Location
    Gardnerville,nv
    Posts
    3,758

    Re: Removing duplicates

    I'm on MySQL
    Process control doesn't give you good quality, it gives you consistent quality.
    Good quality comes from consistently doing the right things.

    Vague general questions have vague general answers.
    A $100 donation is required for me to help you if you PM me asking for help. Instructions for donating to one of our local charities will be provided.

    ______________________________
    Last edited by kebo : Now. Reason: superfluous typo's

  11. #11
    Karen Payne MVP kareninstructor's Avatar
    Join Date
    Jun 2008
    Location
    Oregon
    Posts
    6,686

    Re: Removing duplicates

    Perhaps something like the following (I don't have MySql so this may not work).

    Code:
    SELECT  *
    FROM    dbo.Customer
    WHERE   CustomerIdentifier NOT IN ( SELECT  MIN(CustomerIdentifier)
                                        FROM    dbo.Customer
                                        GROUP BY CompanyName ,
                                                ContactName ,
                                                ContactTitle ,
                                                Address ,
                                                City ,
                                                PostalCode );
    See the following
    https://social.technet.microsoft.com...-identity.aspx

  12. #12
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,902

    Re: Removing duplicates

    I don't know MySQL syntax very well but I believe both the following are ANSII standard so should work to actually do the delete (in this I've assumed you're deleting everything except the latest - I don't think you ever answered whether that was what you were after). I'm not in apposition to test so I've included a couple of selects you can use to see what the result will be before running :-
    Code:
    Delete NotMax
    --Select NotMax.*
    From Table NotMax
    Left Join (Select A, B, Max(VersionNo) as VersionNo
    			From Table
    			Group By A, B) IsMax
    	On NotMax.A = IsMax.A
    	And NotMax.B = IsMax.B
    	And NotMax.VersionNo = IsMax.VersionNo
    Where IsMax.A is null
    
    
    
    Delete
    --Select *
    From Table NotMax
    Where VersionNo < (Select Max(VersionNo)
    					From Table IsMax
    					Where IsMax.A = NotMax.A
    					And IsMax.B = NotMax.B)
    Note, if MySQL supports an Except it ca be done much more simply but I don't think that's Ansii standard so I haven't included it here.
    The best argument against democracy is a five minute conversation with the average voter - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width