Unique Constraint Error

system · January 16, 2013, 9:14pm

Hi BOB Experts,

Please help me to resolve the following issue.

I got the Unique Constraint Error because of duplicate records. Is there any workaround or transformation where ETL log display the record ID where Unique constraint trigger?

As you know in SQL Procedure we can display the record ID if duplicate record try to insert in the target table.

Thank you very much in advance for your help!

amirjamal95 (BOB member since 2012-10-31)

system · January 16, 2013, 10:33pm

This error is usually a database-level error and is triggered when loading data into the target database.

The error log will tell you exactly which data flow and table is at fault.

Go into that data flow and open the properties of the target table (double click.) Turn on the OVERFLOW option, setting either to capture the faulty data or the offending SQL Query into a separate log file on the job server.

If you run the job again, the DF will now not abort when the error occurs, instead it will ‘dump’ the offending record / query into the overflow file and continue writing data into the target table.

The overflow file will then tell you exactly what records were at fault.

ErikR (BOB member since 2007-01-10)

system · January 17, 2013, 4:24pm

Thanks Erik, it was very helpful description. After selected Yes, i can see all duplicate records on log.

Just wondering, Is there any possibility that we can load error file on desktop instead of job server?

Thanks alot.

amirjamal95 (BOB member since 2012-10-31)

system · January 17, 2013, 5:06pm

If you want the error file on your local machine you have to copy it there (from the job server) or you have to specify a different location for the file in the target table.

eganjp (BOB member since 2007-09-12)

system · January 17, 2013, 7:41pm

You could also create a process, as part of our ETL clean up steps, that would look for the existence of overflow files and load them into error tables - making the faulty data centrally available for querying and resolution.

ErikR (BOB member since 2007-01-10)

system · January 17, 2013, 7:41pm

Eganjp,
I tried to give local path in the target table option but do not get any file, Could you please see the following path whether this is right or wrong?

use Overflow File: Yes
File Name: C:\DS_STG_AC_S1_MSF620.txt

Appreciate that.

Thanks.

[/img]

amirjamal95 (BOB member since 2012-10-31)

system · January 17, 2013, 8:13pm

That would be the C:\ drive of the Job Server - not YOUR local C: drive.

Always remember: it is the Job Server that is running the job, not your local Designer. (Unless you have a job server on your PC but that would be unlikely and unwise).

ErikR (BOB member since 2007-01-10)

system · January 18, 2013, 1:50pm

If you want the ETL job to write the file to your local machine (which I think is a really bad idea) then you probably need to specify a UNC path such as \MyDevelopmentMachine\MySharedDrive\MyFileName.dat.

Will the job server have the necessary security rights to write to your local machine? I sure hope it doesn’t. Tthat would be an easy way to breach security for the clients I work with.

eganjp (BOB member since 2007-09-12)

system · January 18, 2013, 4:42pm

Thanks all for such a valuable information.

I have one more questions, is there any workaround or options where if ETL find any dirty data (like unique Constraint) display the dirty data in log or send file to the jobServer and stop/abort the job as well.

In the OverFlow option (target table option), ETL display and send the dirty data file to the job server but did not abort the job. I would like to stop the job and re-run after fix the records.

Appreciate your help.

Thanks.

amirjamal95 (BOB member since 2012-10-31)

system · January 18, 2013, 4:59pm

This is where you have to be the all-knowing, all-seeing ETL Developer and create a well formed Dataflow that involves a significant amount of due diligence. You have to ask yourself, “How bad can the input data be?” and “What should we do about it?”

Add validation transforms to check for valid/missing values. Checking for duplicates before they hit the target table is much more difficult. You may have to write the rows to a staging table first and check for duplicates there before you try to load the rows to the final table. Often times duplicates are a data problem not a programming problem. Someone creates two rows in a dimension table that a duplicates. You load the fact table and code a join to the dimension table. Boom! You just created duplicate rows in the fact table and it wasn’t your fault. In this case you have a data quality/programming issue upstream from your table/job.

eganjp (BOB member since 2007-09-12)

system · January 22, 2013, 10:43pm

Simple duplicates, that is to say, 100% identical records, can be easily resolved by using a DISTINCT.

When sourcing from single, well designed systems (e.g. ERP, CRM, billing systems etc), this would rarely happen as the data models and table designs will usually prevent such things from happening.

You may see this more when joining data from System A with data from System B - especially when different grains of data exist in one of the systems or the relationship is established through an inferred key of sorts.

You may also see this when reading data from CSV files, Excel sheets, Access database or any other “desktop driven, human operated” data sources - believe it or not, this is still where a lot of key data is found, even in multi-billion dollar high-tech enterprises.

When you are talking about duplicates in the sense that two records share the same business (natural) key but they have different attributes or values … then I would flag this during development and work with the business to identify the cause of this problem, methods of detection and resolution and (ideally) having the originating system cleaned up.

While it is easy to design an ETL solution that automatically addresses this issue, it may seriously undermine the value of your BI solution if you do not engage the the business owner of this data or the appointed SME.

After all, the technical problem is black and white - the table design does not allow multiple records with the same unique key.

However, from a business sense, the problem is not black and white at all. One record could be wrong and one record could be right - but how would the ETL know which of these to ‘pass’ and which to ‘fail’? Even more complicated, BOTH records could be completely valid and this would force a rethink of the target data model to capture this requirement and still maintain referential integrity. Or perhaps BOTH records should be considered wrong and put aside for reporting / alerting.

As Jim already said… designing ETL processes is an ART, it is not just “banging out a few lines of code” by some “coder”, far removed from the business. In my book, a good ETL developer is part data architect, part business analyst, a born sceptic and evidence driven.

ErikR (BOB member since 2007-01-10)

system · January 23, 2013, 6:27am

I wrote a rather obsessive blog post on this subject a while back:
http://www.etldoctor.com/2012/09/02/multiple-solutions-to-multiple-records/. Tried to cover all the bases, although I’m sure Jim and Erik have thought of other things…

JeffPrenevost (BOB member since 2010-10-09)

system · January 23, 2013, 2:35pm

Nice blog post Jeff. Have you signed up for the Blogger Link program for BOB? Here’s a link to it: Automatic Blog Links Coming.

With regard to tables without primary keys not making much sense I would say that is very true for early implementations of data warehouses. However, many of our largest tables (both dim and fact) do not have declared primary key columns. That’s not to say that there aren’t available candidate columns available. In fact most of these tables used to have declared primary keys. We simply got rid of them when we switched to partitioned tables and local indexes. We felt confident enough in our ETL process that we chose to not create any primary key or unique indexes. We do still have surrogate keys but without any constraints.

I suspect that our implementation is an exception rather than the rule. If I were working with a client on a new data warehouse with a less than experienced crew I absolutely would have primary keys implemented all over the place.

eganjp (BOB member since 2007-09-12)

system · January 23, 2013, 8:10pm

Jim, I can see that design approach work well and it is effectively the same approach I am using in my EDW designs.

I usually use surrogate keys for both facts and dimensions (SCD2 and otherwise), which are declared primary keys. But as the surrogate keys are created by the ETL and are completely meaningless (as they ought to be) - logically speaking, you no longer have a real primary key based on actual business values. I also do not use foreign key constraints.

I do not believe either are required when the referential integrity is enforced by the ETL process and where the ETL process is the ONLY process that is allowed to modify, add or remove data from the EDW.

In certain cases though, some particular query tools and other metadata driven tools may need the primary and foreign key constraints to provide specific drill or query paths. But in the norm, a well designed data warehouse with an ETL process of equal quality would not need these to ensure the data integrity remains intact.

ErikR (BOB member since 2007-01-10)

system · January 23, 2013, 8:25pm

At least in Oracle you can declare the constraints (PK, FK) and then disable them. This means there is no performance hit but they are still out there for query tools to see.

eganjp (BOB member since 2007-09-12)