Remove Duplicate Key Rows

DBAs and developers often have to deal with duplicate rows. Duplicates can be inserted when a primary or unique key is disabled before data is loaded, or when a direct-path load bypasses constraint checking.

Let's take the following table as an example.

CREATE TABLE my_table (
   my_id      NUMBER(5),
   my_serial  NUMBER(10),
   my_name    VARCHAR2(40),
   my_address VARCHAR2(60));

where the primary key is (my_id, my_serial).
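To reproduce the situation the article describes, you can disable the key and load the same row twice. This is a hypothetical setup sketch; the constraint name pk_my_table and the sample values are assumptions for illustration:

```sql
-- Hypothetical setup: disable the primary key, then insert a duplicate.
-- The constraint name pk_my_table is an assumption.
ALTER TABLE my_table DISABLE CONSTRAINT pk_my_table;

INSERT INTO my_table VALUES (1, 100, 'Smith', '1 Main St');
INSERT INTO my_table VALUES (1, 100, 'Smith', '1 Main St');  -- duplicate key
COMMIT;
```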

The most common SQL statement used to remove duplicate rows is:

DELETE FROM my_table
WHERE rowid NOT IN (SELECT MIN(rowid)
                    FROM my_table
                    GROUP BY my_id, my_serial);

Here the GROUP BY clause contains the primary key columns, or whichever columns identify each row as unique. The subquery keeps the row with the smallest rowid in each group and deletes the rest.
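On Oracle 8i and later, an analytic-function alternative does the same work; this is a sketch, not part of the original article:

```sql
-- Keep the row with the smallest rowid per key; delete the rest.
DELETE FROM my_table
WHERE rowid IN (
   SELECT rid
   FROM (SELECT rowid AS rid,
                ROW_NUMBER() OVER (PARTITION BY my_id, my_serial
                                   ORDER BY rowid) AS rn
         FROM my_table)
   WHERE rn > 1);
```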

Another way to remove duplicate primary key rows is to write the duplicates to a table, and then remove those rows from the source table. The EXCEPTIONS INTO clause is a valid option only when validating a constraint (the ENABLE clause). So if the table does not have a primary key, to use this option you may have to create the primary key with the DISABLE clause and then enable it with the EXCEPTIONS INTO clause. The EXCEPTIONS INTO clause writes the rowid of every violating row to the table specified; the default table is named EXCEPTIONS. You may create this table by running the script utlexcpt.sql from $ORACLE_HOME/rdbms/admin.
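For reference, utlexcpt.sql creates a table with roughly this structure (check the script shipped with your release):

```sql
-- Default EXCEPTIONS table, as created by utlexcpt.sql
CREATE TABLE exceptions (
   row_id     ROWID,
   owner      VARCHAR2(30),
   table_name VARCHAR2(30),
   constraint VARCHAR2(30));
```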

So, to write the duplicate rows to an exceptions table for our example:

1. Create the exceptions table using the script utlexcpt.sql
2. Enable the primary key, writing the rowids of the duplicate rows to the exceptions table. Note that the ALTER TABLE statement itself fails with an error because duplicates exist, but the exceptions table is still populated.

ALTER TABLE my_table ENABLE PRIMARY KEY
EXCEPTIONS INTO exceptions;

3. To view the duplicate rows, you may use this SQL:

SELECT *
FROM my_table
WHERE EXISTS
   (SELECT 1 FROM exceptions
    WHERE exceptions.row_id = my_table.rowid);

4. You may then use the row_id values in the exceptions table to remove the duplicate rows.
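One way to do step 4 is sketched below. Note that the exceptions table records every violating row, including the one you want to keep, so the delete must preserve one row per key:

```sql
-- Delete all but one row of each duplicate set recorded in EXCEPTIONS.
DELETE FROM my_table t
WHERE t.rowid IN (SELECT e.row_id FROM exceptions e)
  AND t.rowid NOT IN
      (SELECT MIN(t2.rowid)
       FROM my_table t2
       WHERE t2.rowid IN (SELECT e2.row_id FROM exceptions e2)
       GROUP BY t2.my_id, t2.my_serial);
```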

Now, let me describe a situation I face often. I have a table with more than 100 million rows, and 10-20 million rows are added to it every week through batch jobs using SQL direct-path load. If a job fails for some reason (out of space in the index tablespace, or the same data file is loaded again), the data is still loaded into the table because direct load is used, but the indexes are left in DIRECT LOAD (unusable) state. Here is what I do to eliminate the duplicate rows and enable the primary key.

1. Disable the primary key - this drops the index associated with the PK.

2. Create a temporary table (call it my_table_dup) and load into it the records from the recent jobs.

3. Create a non-unique index on my_table on the same columns as the primary key (this speeds up the deletes in the next step).
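For example (the index name is an assumption):

```sql
-- Non-unique index on the PK columns to support the deletes below
CREATE INDEX my_table_tmp_idx ON my_table (my_id, my_serial);
```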

4. Run the following procedure to remove the duplicates. Here I assume there is only one duplicate row for each duplicated PK value.

rem Delete duplicate rows from my_table, driven by the keys
rem loaded into my_table_dup

declare
   -- Only the key columns are needed to locate the duplicates
   cursor c1 is
      select distinct my_id, my_serial from my_table_dup;
begin
   for r1 in c1 loop
      -- Delete the row with the smaller rowid; the HAVING clause
      -- makes this a no-op when the key is not actually duplicated.
      delete from my_table
      where rowid = (select min(rowid)
                     from my_table
                     where my_id = r1.my_id
                       and my_serial = r1.my_serial
                     group by my_id, my_serial
                     having count(*) > 1);
      commit;
   end loop;
end;
/

5. Once the procedure completes, drop the temporary index created in step 3 and enable the primary key (don't forget to specify the storage parameters and tablespace for the index).
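The final cleanup might look like this; the index name, tablespace, and storage values are assumptions for illustration:

```sql
-- Drop the temporary non-unique index, then rebuild the PK index
-- with explicit tablespace and storage (values are placeholders).
DROP INDEX my_table_tmp_idx;

ALTER TABLE my_table ENABLE PRIMARY KEY
USING INDEX TABLESPACE indx_ts
STORAGE (INITIAL 100M NEXT 100M PCTINCREASE 0);
```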


Biju Thomas is Oracle7.3 OCP, Oracle8 OCP, 
Oracle8i OCP and Oracle9i OCA/OCP Certified DBA

Questions/Comments? Write to webmaster@bijoos.com. © The scripts, tips and articles appearing on BijooS.com cannot be reproduced elsewhere without the prior permission from the webmaster.