Thursday, May 12, 2005

the Doppelganger

Data scrubbing is a fact of database life. Whenever you import and normalize data you do run the risk of duplicate data imports as well. This can be especially true if your source database had few restrictions on preventing duplicate data entry. In one particular table I had many rows with duplicate data down to the entrytime field; everything was identical except for the PKID. I tried the typical Cursor script found in various websites, but that would Yield an estimate of 14hrs while the test server processed all 16k rows! I could not very well do that to the production server, on my quest I tried several while loops and eventually came up with this select statement which for my situation (2 duplicate rows for 1 real row of data) First thing I did was dump out all the duplicate rows PK's and FK's into a temp table called #Dup, I hit upon this select statment and I sucessfully narrowed down 16k rows down to 8k of unique data, and was able to delete the duplicate FK's based on the PKid.


SELECT A.PKID
FROM #Dup AS A
INNER JOIN (SELECT FKID FROM #Dup GROUP BY FKID) B
ON A.FKID = B.FKID
AND A.PKID IN (SELECT TOP 1 PKID FROM #DUP C WHERE A.FKID = C.FKID)

The Select Statement finds one of the Duplicate tables, you can use this statement to delete only these rows leaving a unique record behind.

No comments:

Post a Comment