Remove duplicates in sql?



  • I have a postgresql database which got triplicates of its entries. (That part is fixed)

    So now I have

    id | name
    1  | bob
    1  | bob
    1  | bob
    2  | jane
    2  | jane
    2  | jane

    Can I remove the two extras and leave the one for the whole set of 400+?

    Thanks :)
     



  • This should do what you want, but you probably ought to put unique constraints on the table such that it doesn't happen to begin with.  Data integrity is an oft overlooked but vital part of systems. 

    Select ID, Name from Blah Group By ID, Name

     



  • I fixed the table and the code - but I wanted to remove all the duplicates in one query.

    How does a select remove the extras? 



  • Okay I gotcha now - It was the end of the day and I wanted to go home.  Offhand, I can't think of a way to remove the duplicates in one shot.  What I'd do is create a temporary table and select the unique records into it.  Then you delete the contents of the original and then reinsert the contents from the temp table back into the original.

    So you would have something along the lines of the following.

    Insert Into Temp_MyTable (ID, Name) Select ID, Name From MyTable Group By ID, Name

    Delete From MyTable

    Insert Into MyTable (ID, Name) Select ID, Name From Temp_MyTable Order By ID

    If you need it done in one shot so that the database is still usable while your working on it, you could wrap the statements in a transaction so that all the modifications are committed at the same time.
     



  • How about:

    select distinct id,name from mytable

     



  • [quote user="lpope187"]

    Okay I gotcha now - It was the end of the day and I wanted to go home.  Offhand, I can't think of a way to remove the duplicates in one shot.  What I'd do is create a temporary table and select the unique records into it.  Then you delete the contents of the original and then reinsert the contents from the temp table back into the original.

    So you would have something along the lines of the following.

    Insert Into Temp_MyTable (ID, Name) Select ID, Name From MyTable Group By ID, Name

    Delete From MyTable

    Insert Into MyTable (ID, Name) Select ID, Name From Temp_MyTable Order By ID

    If you need it done in one shot so that the database is still usable while your working on it, you could wrap the statements in a transaction so that all the modifications are committed at the same time.
     

    [/quote]

     

    That makes sense, thanks :)     



  • [quote user="lpope187"]create a temporary table and select the unique records into it.  Then you delete the contents of the original and then reinsert the contents from the temp table back into the original.[/quote]

    Wouldn't a [code]SELECT INTO [original table][/code] remove the need for a temp table?
     



  • You cant "Select into" the table you plan deleting the lines from...

     What i would STRONGLY suggest is to get the ID fixed first. There needs to be a unique identifier for each record. If there is one already, you can clear all the rows with a single delete statement:

     Delete from table where UniqueID not in (select min(uniqueID) from Table group by [All], [other], [stuff])

     This will eliminate the need for a temp table.

    Also after you fix this I would also suggest adding a constraint to the table that will actually enforce uniquness... That way you will notice if there is more code that needs to be fixed :)



  • Just noticed the add at the top of the page of this thread.

     

    dedupe software only £325

    Simple, effective, works with most data

    formats.  Free trial download.

     

    So for £325, you get software that runs 3 lines of code!

     

    That's a price WTF if I ever saw one



  • I don't know if there's similar functionality in postgre, but SQL Server 2005 lets you do this neat little trick: (Lines starting with -- are comments)

    --Before
    SELECT * FROM deleteduplicates;
    

    --Show the results of attaching row numbers partitioned by id
    WITH [numberedduplicates](row_number, id, name) AS (
    SELECT
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY id),
    id,
    name
    FROM deleteduplicates
    )
    SELECT * FROM numberedduplicates;

    --Delete the rows that aren't the first row for a particular id
    WITH [numberedduplicates](row_number, id, name) AS (
    SELECT
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY id),
    id,
    name
    FROM deleteduplicates
    )
    DELETE FROM numberedduplicates WHERE row_number > 1;

    --After
    SELECT * FROM deleteduplicates;

    The resulting output:

    id          name
    ----------- --------------------------------
    1           bob
    1           bob
    1           bob
    2           jane
    2           jane
    2           jane
    3           mary
    4           jeff
    4           jeff
    

    (9 row(s) affected)

    row_number id name


    1 1 bob
    2 1 bob
    3 1 bob
    1 2 jane
    2 2 jane
    3 2 jane
    1 3 mary
    1 4 jeff
    2 4 jeff

    (9 row(s) affected)

    (5 row(s) affected)

    id name


    1 bob
    2 jane
    3 mary
    4 jeff

    (4 row(s) affected)

    I just found out about this one the other day. :)


Log in to reply