Working with Table Comparison

Hi,
I have a Type 1 Oracle target, and my DF looks like this:
Delta -> TC -> KeyGen -> MapOper -> Trgt.
Now the problem is, my target table has 35 million records and the delta brings in about 5k records daily. I have created the DF, and in the test run the job starts and never ends.
I have tried all 3 comparison methods in TC and nothing worked. What would be the best option to select in TC to compare against the target?
Or am I doing it the wrong way?

Thanks In Advance


Praveen Reddy :india: (BOB member since 2006-12-26)

I don’t use TC for huge datasets. Instead, I outer join the target table with the source table and fetch the target’s primary/surrogate key column. Based on the target key, I determine whether a record should be inserted or updated.

pseudo logic (a SQL sketch of this follows below)

  1. Drag in the tables (src and tgt)
  2. Outer join them in the Query step and fetch the required columns from the source plus the surrogate key column from the tgt table
  3. Add a Case transform to route the records:
    If tgt.surrogate_key is null, then key_gen and insert into the table;
    else
    if tgt.surrogate_key is not null, then update.
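
A minimal SQL sketch of that routing logic (table and column names here are hypothetical; in DI the outer join would be a Query transform and the routing a Case transform):

 -- left outer join the delta to the target on the natural key
 select s.*,
        t.surrogate_key as tgt_key
 from   SRC_DELTA s
 left outer join TGT_TABLE t
   on t.natural_key = s.natural_key;

 -- tgt_key is null     -> new record:      generate a key and insert
 -- tgt_key is not null -> existing record: update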

BTW, nothing is wrong in your design as far as I know. Find the exact bottleneck: at which step is the job spending its time?
Also check whether you have the proper indexes in place.


gssharma :india: (BOB member since 2006-10-30)

When I use Table Comparison, my flow is:

Delta -> TC -> MapOper -> Query -> KeyGen -> Trgt.

I am bouncing 5,000 records an hour into a table with over 5 million records. I make sure to base the compare on the RecID and two other key columns (not set as the PK, but known to be distinct record columns).

A couple of my co-workers even take the MapOp to another level, splitting it into two:

Delta -> TC -> |-> MapOp1 (update) -> UpdateQry -> Target
               |-> MapOp2 (insert) -> InsertQry -> KeyGen -> Target

M


skyygirl (BOB member since 2008-02-06)

Thank you guys…

Sudheer: yeah, I was thinking the same as you mentioned: do the join, and handle the inserts and the updates separately.

But let me spend some more time on TC to see where it's going wrong…


Praveen Reddy :india: (BOB member since 2006-12-26)

For all the people doing the map operations: why? Is it to turn UPDATEs into INSERTs (for a slowly-changing-dimension kind of scenario)? The original poster said it was a Type 1 SCD, which means no history preservation, just updates. A map operation should be unnecessary.

Adding 5000 rows to a 5M-row table, in my experience, should take far less than an hour; even with a Table Comparison doing a row-by-row select, it should be more like 5 minutes. If it's slow, the most likely problem is insufficient indexes on the target to support the table comparison lookup.

Outer joining the target with the source can be a good approach, particularly if the source and target are in the same datastore, or a linked datastore, so that DI can push the SQL down. Otherwise it's less compelling (and not really much faster than a table comparison).


dnewton :us: (BOB member since 2004-01-30)

I agree with the outer join method. You also have more control over how to map each column in your update and insert data set.

If you really want to use Table Comparison, try sorted input, or check that your indexes are in place and (on Oracle) that statistics are up to date.
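
On Oracle, refreshing the optimizer statistics is a one-liner in SQL*Plus (schema and table names here are placeholders):

 -- gather fresh statistics so the optimizer actually uses the index
 exec DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DW_OWNER', tabname => 'TARGET_TABLE_A');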


George :cyprus: (BOB member since 2003-06-27)

When you say check the indexes, I have one more question.

Let's go through an example.

I have a target table containing 5 million records, I created indexes on the compare columns, and I apply Table Comparison (TC) on it. The TC detects 1 million inserts or updates. How does it work now? Does it take less time to do the table comparison, but more time than usual to do the inserts and updates because of the indexes on the key columns?

Can someone explain?


Praveen Reddy :india: (BOB member since 2006-12-26)

Meaning: whatever columns are in the lower-left pane of the Table Comparison are the columns that will be used in the row-by-row select against the target.

So for TARGET_TABLE_A, if you had COL1 and COL2 as the logical key for the lookup (in the lower left), DI will issue SQL like this for each input row of the table compare:


 select * from TARGET_TABLE_A
 where COL1 = 'BLAH'
 and COL2 = 'BLAHBLAH'

To make this work fast in the database, you should have a composite index on:

TARGET_TABLE_A(COL1,COL2)
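
In Oracle DDL that would be something like the following (the index name is just illustrative):

 -- composite index covering the table-compare lookup columns
 create index IX_TTA_COMPARE on TARGET_TABLE_A (COL1, COL2);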


dnewton :us: (BOB member since 2004-01-30)

The step about creating the index was clear. My question is about the next step.

If we do inserts or updates on a table that has indexes, each one takes a little longer because the indexes also have to be maintained, so the total dataflow completion time increases.

Correct?


Praveen Reddy :india: (BOB member since 2006-12-26)

As always with indexes, the answer is "it depends". The trade-off is between a full-table scan on every comparison (which is what you'd get with no indexes) and the extra overhead of maintaining one more index on every insert or update.

Generally, adding the one index to support the table comparison improves performance far beyond the extra cost of its maintenance by the db.


dnewton :us: (BOB member since 2004-01-30)

Try adding a script to disable the index before the load and re-enable it afterwards, then see what works best (see the sketch below). Like a lot of things, it's trial and error to some extent, and all part of performance testing.
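
A minimal sketch of that on Oracle (the index name is carried over from the hypothetical example above; in a DI job these statements would typically go into sql() calls in script objects before and after the dataflow):

 -- before the load: mark the index unusable so DML skips its maintenance
 -- (relies on skip_unusable_indexes, which defaults to true in Oracle 10g+)
 alter index IX_TTA_COMPARE unusable;

 -- ...run the load...

 -- after the load: rebuild the index
 alter index IX_TTA_COMPARE rebuild;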


George :cyprus: (BOB member since 2003-06-27)