I have an existing ETL dataflow in which a source table (S1) is linked to a target table (T1) through a simple Query transform. It has been running for the last three years without any major issue.
The problem is that S1 has two primary key fields (say F1 and F2) while T1 has only one primary key field (F1); F2 is not even mapped.
99% of the time F1 is unique, but in rare cases it is not. Ideally we would add F2 to the target, map it, and include it in the target table's key, but due to some internal issues we cannot do that for at least a month.
To prevent primary key constraint violations until then, is there a way to temporarily ignore the duplicate records being written to the target? I don't care whether the first or the last record gets written. I tried Distinct Rows on the Query transform, but for some reason it is not working and I am still getting the primary key constraint error.
Is there an alternate solution to prevent duplicates? I would like it to be as simple as possible so that getting approval to make this change in production is easy.
Look into the Overflow File feature of the target table. If a row fails to load, it goes into the overflow file and the rest of the load completes.
In the Technical Manual this can be found in the “Processing data with problems” section.
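If I remember right, the settings live on the target table editor's Options tab; the labels below are from memory, so treat them as approximate and verify in your version:

[code]
Target table editor > Options > Error handling    (labels approximate)
  Use overflow file : Yes
  File name         : t1_overflow.txt    <- hypothetical name/path
  File format       : Write data         <- or "Write SQL" to capture the failed statements
[/code]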
It's possible to do this with the Bulk Loader option as well. If you use the FILE method you can set up the control file for Oracle's SQL*Loader to send failed rows to a file.
Also, in the Bulk Loader settings you could set the maximum rejects to a very high value. I'm not sure how high you can go, though.
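In SQL*Loader terms that corresponds to the ERRORS option in the control file, which defaults to a low value (50), so something like this should give plenty of headroom (the number is just an illustration):

[code]
OPTIONS (ERRORS=999999)   -- abort only after 999,999 rejected rows
[/code]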
You could also try adding an additional column using the gen_row_num_by_group function. The syntax will be something like gen_row_num_by_group(F1), which numbers the rows within each F1 group (the input has to be sorted on F1 first so equal values are adjacent). Then filter out all rows where the row-number column is > 1 and you should have removed your duplicates.
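Roughly, the chain of Query transforms would look like this (ROW_NUM and the transform names are just placeholders):

[code]
Query_Sort    : ORDER BY F1                            -- group equal F1 values together
Query_Number  : ROW_NUM = gen_row_num_by_group(F1)     -- new output column
Query_Filter  : WHERE ROW_NUM = 1                      -- keep one row per F1 value
[/code]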
Given that performance is important I would be willing to bet that the overflow file method would perform a lot better than the gen_row_num_by_group method. Plus, sorting 50 million rows so they are grouped together? That would be a resource hog.
The downside of the overflow file method is that the first row “in” wins. If the two rows are exact duplicates then it isn’t an issue. Based on the description it sounds like that isn’t the case but Diana said it doesn’t matter which row makes it in.
I’m sure you’re correct. I just suggested the alternative as I know that some sites aren’t keen on producing overflow files. I had one operations manager ask me to keep as much processing as possible inside Data Services as his operators were ‘a bunch of idiots who couldn’t be trusted to find their way out of the office without help’.
The overflow file does have a performance impact, though depending on the frequency of the error it might not be noticeable.
By default we execute an array insert of 1000 rows at once, and the database will tell us “at least one row failed with error xyz”. So we load the same 1000 rows a second time, this time with array size=1, and log each individual row in the overflow file if it fails. Then we switch back to array size=1000 and try the next batch.
So in the worst case, the performance of the overflow file method is that of a load with array size=1000 plus a load with array size=1.
A load with array size=1 is many times slower than with larger array sizes! Hence the impact depends on how often a packet of 1000 rows throws an error.
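To put rough, purely illustrative numbers on it: say a row costs t to insert with array size=1000 and k·t with array size=1, and a fraction p of the 1000-row packets contains a bad row. Then for N rows:

[code]
total cost ≈ N·t  +  p·(N/1000)·1000·k·t  =  N·t·(1 + p·k)

e.g. p = 1% of packets, k = 20x slowdown  ->  about 1.2x the clean-load time
[/code]

Both k and p there are made-up numbers; the point is only that the penalty scales with the error rate.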
I just thought I'd throw that in, for no particular reason, as it does not change the statement as such. It's one of those minor details that might be important in other cases.
I am on Data Services 4.0 SP3 and went through your recommended “Processing data with problems” section, but could not find an overflow file option for the bulk loader mechanism. Can you please direct me to where to look for it?
Using an “overflow file” with the Oracle bulk loader requires you to use the FILE method, not the API method. You have to create a SQL*Loader control file that specifies where to direct the rows that could not be loaded. See the Oracle documentation on how to construct the control file.
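As a rough sketch (the file names, table, and column list here are hypothetical; see the SQL*Loader documentation for the full syntax), a conventional-path control file that diverts rejected rows looks something like this:

[code]
OPTIONS (ERRORS=999999)      -- keep going past rejected rows
LOAD DATA
INFILE  't1_stage.dat'       -- data file written by the FILE method
BADFILE 't1_rejects.bad'     -- rows the database rejects (e.g. PK violations) land here
APPEND
INTO TABLE T1
FIELDS TERMINATED BY ','
(F1, F3)                     -- column list is illustrative
[/code]

One caveat worth checking: with a direct-path load (DIRECT=TRUE), unique constraint violations are generally not rejected row by row, so the conventional path is the safer choice for this particular trick.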
[quote="DI.net"]Thanks for your reply. I am running Sybase IQ 15.x. For Sybase there is no API option, only bulk loader append.[/quote]