Remove duplicates in sql?

jesirose

I have a postgresql database which got triplicates of its entries. (That part is fixed)

So now I have

id | name
1 | bob
1 | bob
1 | bob
2 | jane
2 | jane
2 | jane

Can I remove the two extras and leave the one for the whole set of 400+?

Thanks :)

lpope187

This should do what you want, but you probably ought to put unique constraints on the table such that it doesn't happen to begin with. Data integrity is an oft overlooked but vital part of systems.

Select ID, Name from Blah Group By ID, Name

jesirose

I fixed the table and the code - but I wanted to remove all the duplicates in one query.

How does a select remove the extras?

lpope187

Okay I gotcha now - It was the end of the day and I wanted to go home. Offhand, I can't think of a way to remove the duplicates in one shot. What I'd do is create a temporary table and select the unique records into it. Then you delete the contents of the original and then reinsert the contents from the temp table back into the original.

So you would have something along the lines of the following.

Insert Into Temp_MyTable (ID, Name) Select ID, Name From MyTable Group By ID, Name

Delete From MyTable

Insert Into MyTable (ID, Name) Select ID, Name From Temp_MyTable Order By ID

If you need it done in one shot so that the database is still usable while your working on it, you could wrap the statements in a transaction so that all the modifications are committed at the same time.

and_h

How about:

select distinct id,name from mytable

jesirose

[quote user="lpope187"]

Okay I gotcha now - It was the end of the day and I wanted to go home. Offhand, I can't think of a way to remove the duplicates in one shot. What I'd do is create a temporary table and select the unique records into it. Then you delete the contents of the original and then reinsert the contents from the temp table back into the original.

So you would have something along the lines of the following.

Insert Into Temp_MyTable (ID, Name) Select ID, Name From MyTable Group By ID, Name

Delete From MyTable

Insert Into MyTable (ID, Name) Select ID, Name From Temp_MyTable Order By ID

If you need it done in one shot so that the database is still usable while your working on it, you could wrap the statements in a transaction so that all the modifications are committed at the same time.

[/quote]

That makes sense, thanks :)

dhromed

[quote user="lpope187"]create a temporary table and select the unique records into it. Then you delete the contents of the original and then reinsert the contents from the temp table back into the original.[/quote]

Wouldn't a [code]SELECT INTO [original table][/code] remove the need for a temp table?

rdrunner

You cant "Select into" the table you plan deleting the lines from...

What i would STRONGLY suggest is to get the ID fixed first. There needs to be a unique identifier for each record. If there is one already, you can clear all the rows with a single delete statement:

Delete from table where UniqueID not in (select min(uniqueID) from Table group by [All], [other], [stuff])

This will eliminate the need for a temp table.

Also after you fix this I would also suggest adding a constraint to the table that will actually enforce uniquness... That way you will notice if there is more code that needs to be fixed :)

Cratig

Just noticed the add at the top of the page of this thread.

dedupe software only £325

Simple, effective, works with most data

formats. Free trial download.

So for £325, you get software that runs 3 lines of code!

That's a price WTF if I ever saw one

db2

I don't know if there's similar functionality in postgre, but SQL Server 2005 lets you do this neat little trick: (Lines starting with -- are comments)

--Before
SELECT * FROM deleteduplicates;
--Show the results of attaching row numbers partitioned by id

WITH [numberedduplicates](row_number, id, name) AS (

SELECT

ROW_NUMBER() OVER (PARTITION BY id ORDER BY id),

id,

name

FROM deleteduplicates

)

SELECT * FROM numberedduplicates;
--Delete the rows that aren't the first row for a particular id

WITH [numberedduplicates](row_number, id, name) AS (

SELECT

ROW_NUMBER() OVER (PARTITION BY id ORDER BY id),

id,

name

FROM deleteduplicates

)

DELETE FROM numberedduplicates WHERE row_number > 1;
--After

SELECT * FROM deleteduplicates;

The resulting output:

id name ----------- -------------------------------- 1 bob 1 bob 1 bob 2 jane 2 jane 2 jane 3 mary 4 jeff 4 jeff

(9 row(s) affected)

row_number id name

1 1 bob
2 1 bob
3 1 bob
1 2 jane
2 2 jane
3 2 jane
1 3 mary
1 4 jeff
2 4 jeff

(9 row(s) affected)

(5 row(s) affected)

id name

1 bob
2 jane
3 mary
4 jeff

(4 row(s) affected)

I just found out about this one the other day. :)