Simple duplicates, that is to say, 100% identical records, can be easily resolved by using a DISTINCT.
When sourcing from single, well designed systems (e.g. ERP, CRM, billing systems etc), this would rarely happen as the data models and table designs will usually prevent such things from happening.
You may see this more when joining data from System A with data from System B - especially when different grains of data exist in one of the systems or the relationship is established through an inferred key of sorts.
You may also see this when reading data from CSV files, Excel sheets, Access database or any other “desktop driven, human operated” data sources - believe it or not, this is still where a lot of key data is found, even in multi-billion dollar high-tech enterprises.
When you are talking about duplicates in the sense that two records share the same business (natural) key but they have different attributes or values … then I would flag this during development and work with the business to identify the cause of this problem, methods of detection and resolution and (ideally) having the originating system cleaned up.
While it is easy to design an ETL solution that automatically addresses this issue, it may seriously undermine the value of your BI solution if you do not engage the the business owner of this data or the appointed SME.
After all, the technical problem is black and white - the table design does not allow multiple records with the same unique key.
However, from a business sense, the problem is not black and white at all. One record could be wrong and one record could be right - but how would the ETL know which of these to ‘pass’ and which to ‘fail’? Even more complicated, BOTH records could be completely valid and this would force a rethink of the target data model to capture this requirement and still maintain referential integrity. Or perhaps BOTH records should be considered wrong and put aside for reporting / alerting.
As Jim already said… designing ETL processes is an ART, it is not just “banging out a few lines of code” by some “coder”, far removed from the business. In my book, a good ETL developer is part data architect, part business analyst, a born sceptic and evidence driven.
ErikR (BOB member since 2007-01-10)