Things you should never do - a collection and discussion

Can we collect common design mistakes in this thread here? I would then document those in the Wiki, just looking for ideas.
Please post your opinions about such a thread here and if you disagree with what others said.

I have noticed that users make typical mistakes which are hard to change later. Changing a lookup from no_cache to pre_load_cache (or vice versa depending on the case) is an example of an optimization which can be identified and changed quickly. But other design decision would require lot of work.


Werner Daehn :de: (BOB member since 2004-12-17)

Approach: Process one row (one batch of rows) at the time.
Solution: Run the dataflow frequently, e.g. in a while loop.

Problem with that: The overhead of starting and stopping a dataflow is quite large, the dataflow has to connect to all sources and targets, process activation, etc. That might be as quick as 10 seconds, but if you would process one row per dataflow the throughput is 0.1 rows/sec instead of the 50’000 rows/sec the dataflow itself can process.


Werner Daehn :de: (BOB member since 2004-12-17)

Approach: We want full control of all the intermediate steps of the ETL flow.
Solution: We load all the data into multiple staging layers, first we copy the source changes into the replication layer, from there data is harmonized and loaded into the quality layer, from there …

Problem with that: The amount of I/O is doubled by each layer. You are losing parallel processing.

What is faster:
Source -> Transform1 -> Loader_Stage1/Reader_Stage1 -> Transform2 -> Loader_Stage2/Reader_Stage2 -> Transform3 -> Loader_Stage3/Reader_Stage3 -> Transform4 -> Target
Source -> Transform1 -> Transform2 -> Transform3 -> Transform4 -> Target

If the source is 100MB data, the first version would have 800MB worth of I/O. The second version just 200MB I/O - one read of 100MB, one write of 100MB. Overhead I/O = Source_Volume * 2 * number_of_layers.
The other problem is, in the second version all transformations work in parallel. While Transform4 is processing the row 1, Transform3 is busy with row 2 already, … reader is reading row 5 already. (Actually there are arrays and buffers inbetween). In the first version you have to wait for the first dataflow to process the entire data and only then the dataflow for transform2 can be started.


Werner Daehn :de: (BOB member since 2004-12-17)

Good topic and i’m sure there’s loads!

The first one to spring to mind, and it’s one I’ve often seen, is a team starts developing a solution and dive straight into the coding and haven’t setup any datastore configurations or system configurations and then have a tonne of rework to do when they port their code from dev to test.

AL


agulland :uk: (BOB member since 2004-03-17)

Approach: Only join one table at the time and filter the amount of data to be joined first.
Solution: A dataflow like

Source -> Query -> Join_Query1 -> Join_Query2 -> Join_Query3 -> Target
                         ^              ^              ^
                         |              |              |
                  Filter_Query1  Filter_Query2   Filter_Query3
                         ^              ^              ^
                         |              |              |
                  LookupTable1   LookupTable2    LookupTable3

Problem with that: Often these kind of flows cannot be executed efficiently. Depending on the details the optimizer might be able to rewrite all into one large SQL statement pushed down to the source database. But often not. In that case the data coming from Filter_QueryX will be cached in memory. Which still is okay. But what will happen in a delta load, e.g. the Source has just one changed row outputted by the object called “Query”? Then we cache all potential lookup candidates, we may end up caching million of rows for each lookup just to do one single lookup.

Better to rewrite above into a

Source -> Query -> Join_Query1 -> Join_Query2 -> Join_Query3 -> Target
                         ^              ^              ^
                         |              |              |
                  LookupTable1   LookupTable2    LookupTable3

where the Join_QueryX contains the filter clause. The DataServices optimizer will still pushdown the filter condition into the database when reading the lookup tables, so no change in that regards. But now we have the option to say cache the lookup results (for initial load) or lookup the one row in the database via a single row select.

Even better - if applicable - would be not joining the data but using the lookup_ext() function as a new_function_call instead. Same result but more control about the cache setting.

And if all happens to be in the same database, join all data in the first query if possible. That should be a golden rule: Join first, transform later. Not possible always but at least it should be the goal.


Werner Daehn :de: (BOB member since 2004-12-17)

Actually, I am not a big fan of this approach. Mine is more difficult to setup though. Want it to get it into the product to make it easier.

The biggest downside of datastore configurations for dev-test-prod migration is that the dev system has the settings for prod. So if I want to screw up, I could load the prod system with my job. Granted, in dev the connection parameters for prod might be set to something invalid.


Werner Daehn :de: (BOB member since 2004-12-17)

Approach: DataServices is so limited when it comes to the sql syntax supported in Queries, I do not want to read all data from a table, I want to use variables to read just a subset
Solution: I use a SQL Transform everywhere

Problem with that: A SQL Transform has lots of limitations, no impact lineage information, no partitioned readers, hard to read & modify the SQL statements typed, no full pushdown in the sense of an insert…select…, etc.

And actually it is not needed. DataServices has an optimizer which will try to rearrange the objects in a way that makes sense from a performance perspective. So if a Source table is connected to a query and the query contains a where clause, an other join setting, a variable/parameter,… the optimizer will translate all of that into the database specific syntax and push it down to the database. So the result will be exactly the same thing as you would manually write. Sometimes even better.

And only for those few cases where the database has a capability but DataServices does not, or the SQL string needs to be dynamic (e.g. different table names), only then the workaround of using a SQL Transform makes some sense. Even then I would try other approaches like using database views instead.


Werner Daehn :de: (BOB member since 2004-12-17)

Approach: I’ve Added my Job to the Central Repository and I’ve made many updates so some of the original dataflows are no longer used.
Solution: I delete the dataflows from the Central Repository

Problem with that: I can’t successfully retrieve historical versions of the job…and the Central Repository can get corrupted.

Real Solution: Don’t delete objects from the Central Repository


Nemesis :australia: (BOB member since 2004-06-09)

Yes, it’s not perfect but its better than hard coding.

A separate config file (text, xml or whatever) would be better. If i’m building some java app then I would normally externalise connection strings and other config items in an external file - something similar in data services would be good!


agulland :uk: (BOB member since 2004-03-17)

[quote="wdaehn

Source -> Query -> Join_Query1 -> Join_Query2 -> Join_Query3 -> Target
                         ^              ^              ^
                         |              |              |
                  LookupTable1   LookupTable2    LookupTable3

where the Join_QueryX contains the filter clause. The DataServices optimizer will still pushdown the filter condition into the database when reading the lookup tables, so no change in that regards. [/quote]

As a general point I try to reduce the number of query transforms in my DFs to the minimum necessary. So in the above example I would try to combine all the Join_Queries into one if at all possible. This makes the DF far more maintainable: I’ve lost count of the number of times I’ve had to go through a DF where a new column from Source was required on Target, and you have to add that column to a long string of lookup joins where one or two joins would do the same job.


dastocks (BOB member since 2006-12-11)

Have you seen the ‘Propagate Column From’ option that became available from version 3.2? I think this would simplify the task of adding columns to multiple transforms. I prefer to keep query transforms relatively straight forward. So I tend to have multiple simple transforms rather than a single complicated one.


Nemesis :australia: (BOB member since 2004-06-09)

Approach:A Dataflow generates lot of rows and it needs to get the data into the target table faster.
Solution: Turn on Bulk Loader

Problem with that: Bulk Loader doesn’t handle updates

Real Solution:

  1. Don’t use Bulk Loader, increase the number of loaders
  2. Split the deletes/updates and inserts into separate branches of the Dataflow each with it’s own target table. The inserts use Bulk Loader, the deletes/updates don’t. Be sure to use transaction control and set the order in which the tables are loaded so that you don’t try to send deletes/updates at the same time the Bulk Loader is running (and likely has the table locked).
  3. ??? I’m sure there are a number of other solutions but the above two are the “simple” ones I could come up with.

eganjp :us: (BOB member since 2007-09-12)

Hi,

Well i had faced this problem and thought to share this with you guys.

Whenevev i try join output of two sql transforms in a query transform it goes for a cartesian product.

Regards,
Praveen


tikkanipraveen :india: (BOB member since 2008-09-23)

Converting date, datetime datatypes:

What format does a date datatype have, dd/mm/yyyy? dd.mm.yyyy? yyyy-dd-mm??
Answer: None. A date/datetime datatype has no format at all. Only if you use implicit conversion from date to char or vice versa, then you use the database session default format which can be different for each database, even each client connecting to the same database.

Example: print(sysdate());
If you click on validate all you will see you have an implicit conversion from datetime to varchar, as you didn’t specify a to_char() yourself, the DataServices default format will be used, which is equivalent to you writing
to_char(sysdate(), ‘YYYY.MM.DD HH24:MI:SS’)

Example: to_date(sysdate(), ‘dd-mm-yyyy’);
So you convert a datetime datatype (sysdate output) to a datetime with the to_date function??? Well, validate that as well and you will see the sysdate() gets converted to a string implicitly. So what you are actually writing is:
to_date(to_char(sysdate(), ‘YYYY.MM.DD HH24:MI:SS’), ‘dd-mm-yyyy’);
And as the format strings do not match, you will get an error.


Werner Daehn :de: (BOB member since 2004-12-17)

Approach: I want to use a date inside the sql() function
Solution: sql(‘DataStore’, 'DELETE FROM table WHERE date_col = [$start_date] ');

Problem with that: is actually multi-fold.
First is the parameter substitution. In all strings DataServices does see, it will check for either [] or {} combinations. Each text inside these brackets will be put into the expression parser and replaced by its value. So the string ‘1+1=[1+1]’ will result in the string ‘1+1=2’. And with {} extra quotes will be put around. Hence in above example, the string
DELETE FROM table WHERE date_col = 2011.12.31
will be sent to the database. And the database will raise an error saying that 2011.12.31 is a weird number it does not understand. So the least we have to do is using {} brackets. Then the string sent to the database will be
DELETE FROM table WHERE date_col = ‘2011.12.31’

If we are lucky, the database does recognize this string as a date, in most cases it will say things like “I expect a date in the format DD/MM/YYYY”. Either way, we do not want to rely on luck, we should define the format.

Note, the to_char is a DataServices function, it is used inside the parameter substitution bracket {} or []. The to_date() function is part of the string sent to the database, so it is a database function. If you use a different database, check the database syntax.

In Oracle for example the way to convert a string to date is by using the to_date() function of Oracle. Note: There is a to_date() function in DS as well, but that one is not used!

sql(‘DataStore’, 'DELETE FROM table WHERE date_col = to_date({$start_date}, ‘YYYY.MM.DD’) ');

Now the database will receive the string
DELETE FROM table WHERE date_col = to_date(‘2011.12.31’, ‘YYYY.MM.DD’)
and hopefully understand what we want.

The last bit is, the {$start_date} forces DataServices to convert a date into a string. As we haven’t specified any format, it will use the internal conversion rules, which use the format YYYY.MM.DD HH24:MI:SS always. And we get a conversion warning. So it would be safer to use the DataServices(!! we are inside the expression!!) to_char() function.

sql(‘DataStore’, 'DELETE FROM table WHERE date_col = to_date( { to_char($start_date, ‘YYYY.MM.DD’)} , ‘YYYY.MM.DD’) ');


Werner Daehn :de: (BOB member since 2004-12-17)

@agulland
Adding to your point, it is also important create a database alias. If the database owner name is different across the Dev/Integration/QA/Prod envs then it is very important to create an alias and then assign the owner names to the alias created in the Configuration settings of a data store.


mrunals (BOB member since 2006-12-27)

yes, very good point mrunals. And then don’t hard code alias name in SQL, rather do something like,

# Get databasename and schema name
$vDatabaseName = db_database_name('DS')
$vSchemaName = db_owner('DS','ADMIN');

# Execute SQL
sql('DS','select * from [$vDatabaseName].[vSchemaName].MY_TABLE ');

This isn’t always necessary and sometimes using two datastores (for each alias) would be a more manageable solution.

AL


agulland :uk: (BOB member since 2004-03-17)

There are cases where there are specifications that the bodi job has to trigger another job(cmd command on scripts) only on a particular host/ trigger mail if the job runs on a specific server.

In those cases, the conditions for host name is mentioned on bodi scripts.
I have witnessed in years older jobs that those host names were hard coded in scripts like below,

If host_name=‘ukshutts01’
(

)

The jobs were huge and when I drilled , it went on and on and on.
In future, if we need to have change in the host name , it would be so difficult .
Hence it is better to have the host names onto a global variable and mention them on the initialise script.

If change needs to made, it can just be made just oncce.

I am sharing this piece of info as I am currently facing the task of drilling jobs and modifying the new host names. Its really hectic.

Hence better to use global variables as and when they are required.


rajnia1 (BOB member since 2011-11-17)

yes good point on host names. i guess coding like this is because we want to send the mail when in production rather than dev, test? So suitable code would be,

if current_system_configuration() = "production" then
   send mail
else
   print message

or have other global variables that control email vs print message that can be set at runtime

AL


agulland :uk: (BOB member since 2004-03-17)

Agree to this a million times…

The problem always starts when the output of SQL transform is getting Inner/Outer joined into another Query.
The optimizer didn’t knew what was coming from SQL transform and hence it produces Cartesian products of no. of rows even if the final no. of rows expected are in hundreds.

10 Query transforms to achieve the output of one SQL transform should always be the preferred way of designing jobs :slight_smile: :slight_smile: :slight_smile:


ds41user :india: (BOB member since 2007-09-03)