Let’s say that we have a heap table without keys, used solely for dumping raw data into. At some point that data is filtered and cleaned up for further use. As time passes and data accumulates, duplicate rows creep in, causing all sorts of issues. So, using SQL, how do we remove the duplicates while keeping exactly one row from each group?
For this example you can use this Postgres scratchpad:

select ctid, * from public."Orders";
Notice that rows 1 and 2, as well as rows 4 and 5, are identical to each other.
We want to keep ONE row of each duplicate group. How?
Internally, each row is stamped with an identifier that the DBMS uses for its own purposes; PostgreSQL exposes it as the system column CTID. The CTID is not a constant value: it changes dynamically as operations modify the table, so it cannot be used to identify rows in the long term. For our purposes, however, it is sufficient.
So each row carries a value that, for the duration of our work, uniquely identifies it, much like an auto-increment primary key would.
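To see the idea concretely without a Postgres server at hand, here is a minimal sketch using SQLite, whose implicit rowid is roughly analogous to Postgres's ctid (the table and column names below just mirror the article's example): even a keyless table carries a hidden per-row identifier.

```python
# Illustration: SQLite's rowid as a stand-in for Postgres's ctid.
# A table created without any keys still exposes a hidden row identifier.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute('CREATE TABLE Orders (Order_ID INTEGER, Order_Date TEXT)')  # no keys
cur.executemany('INSERT INTO Orders VALUES (?, ?)',
                [(1, '2024-01-01'), (1, '2024-01-01'), (2, '2024-01-02')])

# Selecting rowid alongside * reveals the hidden identifier.
rows = cur.execute('SELECT rowid, * FROM Orders').fetchall()
for row in rows:
    print(row)  # e.g. (1, 1, '2024-01-01') -- rowid, then the data columns
```

Unlike Postgres's ctid (a page/offset pair), SQLite's rowid is a plain integer, but both give us a handle on otherwise indistinguishable duplicate rows.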
Getting hold of this key allows us to identify the candidates for deletion:
select ctid, *
from public."Orders" as a
where a.ctid > ANY (
    select b.ctid
    from public."Orders" as b
    where a."Order_ID" = b."Order_ID"
      and a."Order_Date" = b."Order_Date"
);
Now turn the query into a DELETE — drop the select list, changing "select ctid,* from" to "delete from" — and you're going to end up with one row per duplicate group: the one with the smallest ctid.
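The whole technique can be exercised end to end in SQLite, again using rowid in place of ctid (a sketch, with made-up sample data; SQLite lacks the > ANY construct, so the equivalent correlated MIN subquery is used — the logic is the same: delete every row whose identifier is greater than the smallest one in its duplicate group):

```python
# End-to-end sketch of the dedup technique, with SQLite's rowid
# standing in for Postgres's ctid.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute('CREATE TABLE Orders (Order_ID INTEGER, Order_Date TEXT)')  # heap-like: no keys
cur.executemany('INSERT INTO Orders VALUES (?, ?)',
                [(1, '2024-01-01'), (1, '2024-01-01'),   # duplicate pair
                 (2, '2024-01-02'),
                 (3, '2024-01-03'), (3, '2024-01-03')])  # duplicate pair

# Delete every row whose rowid exceeds the smallest rowid in its group --
# the same idea as the article's ctid > ANY (...) query in Postgres.
cur.execute('''
    DELETE FROM Orders
    WHERE rowid > (SELECT MIN(b.rowid)
                   FROM Orders AS b
                   WHERE Orders.Order_ID  = b.Order_ID
                     AND Orders.Order_Date = b.Order_Date)
''')
conn.commit()

rows = cur.execute('SELECT Order_ID, Order_Date FROM Orders ORDER BY Order_ID').fetchall()
print(rows)  # exactly one row per duplicate group remains
```

Each group's survivor is the row with the smallest identifier, so the result is deterministic rather than random.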
SQL Workshop - Removing Duplicate Rows