Removing Duplicate Rows (Duplicate Keys)
DBAs and developers often have to deal with duplicate rows. Duplicate rows may be inserted when a primary or unique key is disabled while data is inserted, or during a direct load.
Let's take the following table as an example, where the primary key is (my_id, my_serial):

create table my_table (
   my_id      number (5),
   my_serial  number (10),
   my_name    varchar2 (40),
   my_address varchar2 (60)
);
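For instance, duplicates can slip in while the constraint is disabled. The following sketch uses the example table above (the inserted values are made up for illustration):

```sql
-- Disable the primary key constraint on my_table:
ALTER TABLE my_table DISABLE PRIMARY KEY;

-- With the constraint disabled, nothing stops a repeated key:
INSERT INTO my_table VALUES (1, 100, 'Smith', '12 Main St');
INSERT INTO my_table VALUES (1, 100, 'Smith', '12 Main St');  -- duplicate key
COMMIT;
```

Re-enabling the primary key will now fail until the duplicate is removed.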
The most common and widely used SQL to remove duplicate rows is:

DELETE FROM my_table
 WHERE rowid NOT IN (SELECT MIN(rowid)
                       FROM my_table
                      GROUP BY my_id, my_serial);

Here the GROUP BY clause contains the primary key columns, or whichever columns identify each row as unique. For each key, the row with the smallest rowid is kept and all other copies are deleted.
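Before running the delete, it is often worth checking how many keys are actually duplicated. A quick count such as the following (again using the example table) lists each duplicated key and its number of copies:

```sql
-- List each duplicated key and how many copies of it exist:
SELECT my_id, my_serial, COUNT(*) AS copies
  FROM my_table
 GROUP BY my_id, my_serial
HAVING COUNT(*) > 1;
```

If this query returns no rows, the table has no duplicates and the delete is unnecessary.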
Another way to remove duplicate primary key rows is to write the duplicates to a table, and then remove those rows from the table. The EXCEPTIONS INTO clause is a valid option only when validating a constraint (the ENABLE clause). So if the table does not have a primary key, to use this option you may have to create the primary key with the DISABLE clause and then enable it with the EXCEPTIONS INTO clause. The EXCEPTIONS INTO clause writes the rowid of every duplicate row to the specified table; the default table is named EXCEPTIONS. You may create this table by running the script utlexcpt.sql from $ORACLE_HOME/rdbms/admin.
So, to write the duplicate rows to an exceptions table for our example we do:
1. Create the exceptions table using the script utlexcpt.sql.
2. Enable the primary key, writing the duplicates to the exceptions table:

   ALTER TABLE my_table ENABLE PRIMARY KEY EXCEPTIONS INTO exceptions;
3. To view the duplicate rows you may use the SQL:

   SELECT *
     FROM my_table
    WHERE EXISTS (SELECT 1
                    FROM exceptions
                   WHERE exceptions.row_id = my_table.rowid);
4. You may then use the row_id values in the exceptions table to remove the duplicate rows.
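A sketch of step 4, assuming the default EXCEPTIONS table. Note that EXCEPTIONS records the rowid of every row that violated the constraint, i.e. all copies of a duplicated key, so the delete must still keep one copy per key:

```sql
-- Delete the flagged rows, but keep the lowest-rowid copy of each key:
DELETE FROM my_table
 WHERE rowid IN (SELECT row_id FROM exceptions)
   AND rowid NOT IN (SELECT MIN(rowid)
                       FROM my_table
                      GROUP BY my_id, my_serial);

-- Clear the exceptions table; the constraint should now enable cleanly:
TRUNCATE TABLE exceptions;
ALTER TABLE my_table ENABLE PRIMARY KEY;
```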
Now, let me give you a situation which I face often. I have a table with more than 100 million rows, and 10-20 million rows are added to it every week through batch jobs using SQL direct load. If a job fails for some reason (out of space in the index tablespace, or the same data file is loaded twice), then because direct load is used, all the data is still loaded into the table, but the indexes are left in DIRECT LOAD state. Here is what I do to eliminate the duplicate rows and re-enable the primary key.
1. Disable the primary key - this drops the index associated with the PK.
2. Create a temporary table (we call it my_table_dup) and load it with the records from the recent jobs.
3. Create a non-unique index on my_table on the same columns as the primary key.
4. Run the following procedure to remove the duplicates. Here I assume that there is only one duplicate row for each duplicated PK value.
rem  to delete dup rows from my_table
declare
   cursor c1 is select * from my_table_dup;
begin
   for r1 in c1 loop
      delete from my_table
       where rowid = (select min(rowid)
                        from my_table
                       where my_id = r1.my_id
                         and my_serial = r1.my_serial
                       group by my_id, my_serial
                       having count(*) > 1);
      commit;
   end loop;
end;
/
5. Once this procedure completes, drop the temporary index created in step 3 and enable the primary key (don't forget to specify the storage parameters and tablespace).
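A sketch of step 5. The index name, tablespace, and storage figures below are only placeholders; substitute the values appropriate for your environment:

```sql
-- Drop the non-unique index from step 3 (name assumed):
DROP INDEX my_table_tmp_idx;

-- Re-enable the PK, placing its index explicitly:
ALTER TABLE my_table ENABLE PRIMARY KEY
  USING INDEX TABLESPACE my_index_ts        -- placeholder tablespace
  STORAGE (INITIAL 100M NEXT 50M);          -- placeholder storage parameters
```

Specifying USING INDEX here matters because enabling the constraint rebuilds the index; without it, the index lands in the default tablespace with default storage.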