In our dataflow we're using an SQL function in one of the column mappings of a query transform to call an Oracle sequence: SQL('DATASTORE', 'select key_sequence.nextval from dual'). We need it to generate a unique identifier for each record.
My feeling is that this is very slow and a performance bottleneck (processing overhead from communicating with Oracle every time the SQL function executes?).
Can you confirm this, or does someone have an idea for improvement?
- Improving the speed of SQL functions
- Other ways of calling Oracle sequences from DI (problem: sequences cannot be imported into a datastore, and Oracle does not allow sequences to be used in views, which could otherwise be used in DI).
The sql() function performs a connect and disconnect at every single call, and it cannot be optimized internally in any way, e.g. no array processing. The function is meant to be used in scripts, where the connection overhead does not matter because it is only called once anyway.
A far better method is to call a stored procedure to retrieve the sequence number, something like:
create or replace procedure myseq(seqno out number) is
begin
select seq.nextval into seqno from dual;
end;
You import that procedure into your datastore and can then use it anywhere you like: in a script, a query mapping, a query function call, …
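A quick sanity check before importing it into the datastore can be done directly from SQL*Plus; this assumes the sequence "seq" referenced by the procedure already exists:

-- bind a host variable, call the procedure, and print the returned sequence value
variable seqno number
exec myseq(:seqno)
print seqno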
Note that DI provides a “key_generation” transform that will generate sequence numbers.
However, we recently learned that Key_Generation is really only appropriate for data warehousing, where only one dataflow would ever be updating a target table – never two or more jobs concurrently. Key_Generation only gets the max(id) value from the database once, and then increments it in memory. If two dataflows are touching the same target, you will get a primary key violation.
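In other words, all Key_Generation runs against the target is a single max() query at dataflow start; the table and column names below are only illustrative:

-- executed once when the dataflow starts
select nvl(max(key_id), 0) from target_table;
-- every subsequent key is this value incremented in memory, so a second concurrent
-- dataflow that starts from the same max() will hand out the same keys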
We are thinking about different directions.
First, we could support key ranges, i.e. a configurable minimum number for the Key_Generation transform. Pro: often needed when loading different partitions of one table, and it would work for all databases at once.
Second, we could support sequences. Pro: the same partition can be loaded simultaneously; con: the implementation would be database-specific.
Maybe you can also try this: use an environment variable to store the sequence number. You can easily write a user function that reads and increments the variable. It should be accessible to multiple processes and much faster than the database. You would only have to initialize it with the proper value, but that would only take a single select…
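That one-time initialization could be as simple as a single query against the target table (table and column names are placeholders only):

-- seed the variable with the next free key value before the dataflow starts
select nvl(max(key_id), 0) + 1 from target_table;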
This would work only if DI is the only process writing to your target table. If other applications are doing so too, you could still get a conflict. Or if you have multiple DI servers, etc.
In addition to all the good feedback about how to call sequences, it is worth highlighting that calling a database stored procedure, as opposed to using a sql() call through a DI custom function, allows for push-down operations, which may speed up straightforward inserts considerably.
Another possible solution would be to let the database handle key allocation by writing a trigger that allocates the next key from a sequence on insert - the dataflow can just set the key to null and the trigger will do the rest. This allows for cases where other processes can insert into the table at the same time.
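A minimal sketch of such a trigger, assuming the sequence from the original post and illustrative table/column names (adjust to your own schema):

-- fires before each insert and fills the key from the sequence when none was supplied
create or replace trigger trg_target_table_key
before insert on target_table
for each row
begin
  if :new.key_id is null then
    select key_sequence.nextval into :new.key_id from dual;
  end if;
end;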
These are very good thoughts. Especially writing a stored function in the source database and adding that to the pushed-down SQL is a nice idea. If, yes if, that can be accomplished: often the sequence will be created in the target, while the pushed-down SQL reads from the source.
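Where the function can live in the database that the pushed-down SQL actually runs against, a sketch could look like this (function and sequence names are examples only); unlike the procedure above, a function can be referenced directly in a pushed-down select list:

-- wraps the sequence so that plain SQL (and hence pushed-down DI SQL) can call it
create or replace function next_key return number is
  v_key number;
begin
  select key_sequence.nextval into v_key from dual;
  return v_key;
end;
-- a pushed-down statement could then read, for example:
-- insert into target_table (key_id, col1) select next_key, s.col1 from source_table s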
A trigger is always overhead. Little, but you can feel it. I guess it is still better than calling a stored procedure from DI, though. But it has its limitations as well, e.g. if the API bulk loader is used, all triggers are disabled.
We’ve done a similar thing to this - although instead of an environment variable, we store the next key value to use in a table on the database.
So our DI process is,
write the ready-to-load records to a staging table (without the ‘correct’ surrogate key)
count the records in the staging table
call a stored procedure to increment the key value stored in the database table by the number of records we are about to add, and to return the next key value to use (a sketch of such a procedure follows after this list)
use the key value returned, along with gen_row_num(), to generate the keys within DI and load the target table
This means that we only have to access the database once to get our key values to use, but also ensures that we will get unique keys. If other non-DI processes need to add records to the table, they can use the same stored procedure to get their key values.
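As referenced in the list above, a sketch of what such a procedure could look like, assuming a one-row-per-table counter table (all object names here are illustrative, not the poster's actual schema):

-- reserves a block of p_count keys and returns the first key of the reserved range
create or replace procedure reserve_keys(p_count in number, p_first_key out number) is
begin
  -- lock the counter row so concurrent callers each get their own range
  select next_key into p_first_key
  from key_counter
  where table_name = 'TARGET_TABLE'
  for update;
  update key_counter
  set next_key = next_key + p_count
  where table_name = 'TARGET_TABLE';
  commit;
end;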
Good point about the API bulk loader. Otherwise I can see triggers being one of the fastest ways of achieving an auto-incrementing key whilst ensuring fool-proof avoidance of duplicate key issues.
The sequence residing on the target, however, should not cause a problem: if, as in most cases, the source table sits on a different server and/or database instance, you can still achieve complete push-down of the select-from-source plus insert-into-target operation, as long as you do not need any DI-specific functionality (such as DI functions, DI local or global variables, DI memory tables, etc.). All you need to do is use 'linked datastores' (an option under the datastore). You can tell whether linking your datastores has achieved push-down of the operation by viewing the 'optimised SQL': an insert will show up as 'insert … select …', whereas prior to using linked datastores you would simply see the 'select …' part of the operation, which means the result is fed through DI prior to the insert. In my personal experience, the more you can push down to the database the better.
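Purely for illustration, the difference in the 'optimised SQL' might look roughly like this (table, column and database-link names are hypothetical):

-- without linked datastores the optimised SQL shows only the source-side query;
-- DI then performs the inserts itself:
select s.col1, s.col2 from source_table s;
-- with linked datastores the whole load collapses into one statement that the
-- target database executes, reading the source over the link (src_link):
insert into target_table (col1, col2)
select s.col1, s.col2 from source_table@src_link s;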
I also tested key generation vs. stored procedure vs. sql() call speed. Results:
If push-down cannot be achieved:
- Fastest: DI key generation
- Ever so slightly slower: stored procedure
- Very slow (4 to 5 times): sql() call
If push-down can be achieved:
- Fastest: stored procedure
- Ever so slightly slower: DI key generation
- Very slow: sql() call
We faced a similar issue using sql() calls to generate the next value of an Oracle sequence. We replaced that with $globalvariable + gen_row_num() to generate unique numbers. However, surprisingly, there is no improvement in the performance of the dataflow.
It might be because we were using lookups against huge tables in the dataflow, so we replaced the lookups with joins against those tables. But still no improvement. The only thing to note is that our source is a CSV file and not a table, so there is no push-down happening when selecting from the source.
You mention lookups against huge tables. I'd recommend staging your CSV unmodified in a table prior to performing any cleansing/validation. Past that point you should be able to design the required ETL from staging to target as push-down operations, unless you rely on BODI functionality. That is, if your lookups are DI lookups, write some PL/SQL functions to do the same, and revert to using Oracle sequences through any of the options mentioned earlier. Hope it helps.
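As one example of moving a DI lookup into the database so it can be pushed down, a small PL/SQL function along these lines could be called from the select (all names are made up for illustration):

-- returns the surrogate key for a natural key, or null when no match is found
create or replace function lookup_customer_key(p_customer_code in varchar2) return number is
  v_key number;
begin
  select customer_key into v_key
  from dim_customer
  where customer_code = p_customer_code;
  return v_key;
exception
  when no_data_found then
    return null;
end;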