Ignore Duplicate records

I have an existing ETL dataflow in which a source table is linked to a target table using a simple Query transform. It has been running for the last 3 years without any major issue.

Source Table (S1 - 50 Fields) —> QUERY (Mapping 25 Fields) —> Target (T2 - 25 Fields)

The problem is that S1 has a two-field primary key (say F1 and F2), while the target T2 has only a single primary key field (F1). F2 is not even mapped.

99% of the time F1 is unique, but in rare cases it is not. Ideally we would map F2 into the target and add it to the target table's key, but due to some internal issues we cannot do that for at least a month.

To prevent primary key constraint violations until then, is there a way to temporarily ignore the duplicate records being written to the target? I don’t care whether the first or the last record gets written. I tried Distinct Rows on the Query transform, but for some reason it is not working and I still get the primary key constraint error.

Is there an alternative solution to prevent duplicates? I would like it to be as simple as possible so that getting approval to push this change to production is easy.

Thanks in advance…


Diana_baker (BOB member since 2005-03-18)

Is this a DWH environment? If so, implement a late arriving dimension.

If not, try implementing something like a late arriving dimension anyway :wink:


Johannes Vink :netherlands: (BOB member since 2012-03-20)

No, it is not a DWH environment, and the target table is not designed for SCD1 or SCD2.

In fact, this target table has its data deleted each time the job runs, prior to loading from the source.

I am looking for a solution to ignore the duplicates without any major design changes.

Thanks…


Diana_baker (BOB member since 2005-03-18)

Look into the overflow file feature on the target table. If a row fails to load, it goes into the overflow file and the rest of the load completes.

In the Technical Manual this can be found in the “Processing data with problems” section.

It’s possible to do this with the bulk loader option as well. If you use the FILE method you can set up the control file for Oracle’s SQL*Loader to send failed rows to a file.

Also in the Bulk Loader you could set the maximum rejects to a very high value. I’m not sure how high you can go though.
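As a rough sketch, a conventional-path SQL*Loader control file for the FILE method would look something like the one below. This is untested; the file names and the short column list are just placeholders for your staging file and 25 mapped fields.

-- hypothetical control file; adjust names and paths to your environment
OPTIONS (ERRORS=1000000, DIRECT=FALSE)
LOAD DATA
INFILE  't2_extract.dat'
BADFILE 't2_rejects.bad'
APPEND
INTO TABLE T2
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(F1, F3, F4)

The BADFILE collects the rows Oracle rejected (e.g. an ORA-00001 unique constraint violation) and ERRORS sets the maximum number of rejects before SQL*Loader aborts. If I remember right, a direct path load (DIRECT=TRUE) does not reject unique key violations row by row, it just leaves the index in an unusable state, so stick with the conventional path for this.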


eganjp :us: (BOB member since 2007-09-12)

You could also try adding an extra column using the gen_row_num_by_group function. The expression would be something like ‘gen_row_num_by_group(F1)’, grouping on the target key so the numbering restarts for each distinct F1 (grouping on both F1 and F2 would not flag anything, since that combination is unique in the source). Then filter out all rows where the row number column is > 1 and you should have removed your duplicates.
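Something like this in the dataflow (untested, and the two-query layout plus the sort are just how I would wire it up, not something from your existing design):

Query 1 (Q_NUMBER):
  ORDER BY : S1.F1                                   # sort so duplicate F1 values are contiguous
  output   : the 25 mapped columns, plus
             ROW_NUM = gen_row_num_by_group(S1.F1)   # restarts at 1 for each distinct F1

Query 2 (Q_DEDUP):
  WHERE    : Q_NUMBER.ROW_NUM = 1                    # keep only the first row per F1

Just don't map the ROW_NUM column to the target, so the target stays at 25 fields.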


Nemesis :australia: (BOB member since 2004-06-09)

What if the duplicate row is already in the database?


eganjp :us: (BOB member since 2007-09-12)

I don’t think it can be. Diana has specified that the target table data is deleted before the table is loaded.


Nemesis :australia: (BOB member since 2004-06-09)

Given that performance is important I would be willing to bet that the overflow file method would perform a lot better than the gen_row_num_by_group method. Plus, sorting 50 million rows so they are grouped together? That would be a resource hog.

The downside of the overflow file method is that the first row “in” wins. If the two rows are exact duplicates then it isn’t an issue. Based on the description it sounds like that isn’t the case but Diana said it doesn’t matter which row makes it in.


eganjp :us: (BOB member since 2007-09-12)

I’m sure you’re correct. I just suggested the alternative as I know that some sites aren’t keen on producing overflow files. I had one operations manager ask me to keep as much processing as possible inside Data Services as his operators were ‘a bunch of idiots who couldn’t be trusted to find their way out of the office without help’.


Nemesis :australia: (BOB member since 2004-06-09)

The overflow file does have a performance impact, though depending on the frequency of the error it might not be noticeable.

By default we execute an array insert of 1000 rows at once and the database tells us “at least one row failed with error xyz”. So we load the same 1000 rows a second time, this time with array size = 1, and log each individual row in the overflow file if it fails. Then we switch back to array size = 1000 and try the next batch.

So in the worst case, the performance of the overflow file load is that of a load with array size = 1000 plus a load with array size = 1.

Loading with array size = 1 is many times slower than loading with a larger array size! Hence it depends on how often a packet of 1000 rows throws an error.
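To put some rough numbers on it (my own notation, nothing official):

B       = number of 1000-row packets in the load
k       = packets that contain at least one failing row
t_batch = time to insert one packet with array size = 1000
t_row   = time to insert the same packet row by row (array size = 1)

total load time  ≈  B * t_batch  +  k * t_row

With k = 0 there is no overhead at all; with a bad row in every packet (k = B) you pay the full row-by-row time on top of the batch time, which is the worst case described above.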


Werner Daehn :de: (BOB member since 2004-12-17)

Good to know Werner! I was wondering how that happened but I was too lazy to trace it. :sleeping:


eganjp :us: (BOB member since 2007-09-12)

I just thought I’d throw that in for no particular reason, as it does not change the statement as such. One of those minor details that might be important in other cases.


Werner Daehn :de: (BOB member since 2004-12-17)

Thank you eganjp, nemesis and Werner for your replies.


Diana_baker (BOB member since 2005-03-18)

You’re welcome. Let us know what your final solution is.


eganjp :us: (BOB member since 2007-09-12)

Jim,

I am on Data Services 4.0 SP3 and went through the “Processing data with problems” section you recommended, but could not find the use overflow file option with the bulk loader mechanism. Can you please point me to where the overflow file option is when using the bulk loader mechanism?

Thanks


DI.net (BOB member since 2007-07-26)

Using an “overflow file” with the Oracle bulk loader requires you to use the FILE method, not the API method. You have to create a SQL*Loader control file that specifies where to direct the rows that could not be loaded. See the Oracle documentation on how to construct the control file.


eganjp :us: (BOB member since 2007-09-12)

Jim,

Thanks for your reply. I am running Sybase IQ 15.x; for Sybase there is no API option, only bulk loader append.

Any thoughts on using an overflow file with Sybase IQ?

Thanks


DI.net (BOB member since 2007-07-26)

[quote=“DI.net”]Thanks for your reply. I am running Sybase IQ 15.x; for Sybase there is no API option, only bulk loader append.[/quote]

I would look into the command line loader for IQ.
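I haven't done this on IQ myself, so treat the following purely as a starting point and check the exact syntax against the IQ 15 LOAD TABLE documentation. The paths, the limit and the three-column list are placeholders:

LOAD TABLE T2 (F1, F3, F4)
FROM '/staging/t2_extract.dat'
DELIMITED BY ','
ROW DELIMITED BY '\n'
QUOTES OFF
ESCAPES OFF
IGNORE CONSTRAINT UNIQUE 1000000
MESSAGE LOG '/staging/t2_messages.log'
ROW LOG '/staging/t2_rows.log'
ONLY LOG UNIQUE
NOTIFY 100000

The idea is that IGNORE CONSTRAINT UNIQUE tells IQ to skip rows that violate a unique/primary key constraint (up to the given limit) instead of rolling back the whole load, and the MESSAGE LOG / ROW LOG pair writes the skipped rows and the reason to files you can review afterwards, which is roughly the overflow file behaviour.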


eganjp :us: (BOB member since 2007-09-12)

I changed the settings in the template table options by setting Use overflow file to Yes, and it created a file name in the File Name option.

But after execution it is not capturing any duplicate records… Are we required to change other settings or write code as well?

Please explain in detail.


Parijatam (BOB member since 2016-09-08)