Approach: We want full control of all the intermediate steps of the ETL flow.
Solution: We load the data through multiple staging layers: first we copy the source changes into the replication layer, from there the data is harmonized and loaded into the quality layer, from there …
Problem with that: Every staging layer adds one extra write and one extra read of the full data volume, and you lose the parallel processing of the transforms.
What is faster:
Source -> Transform1 -> Loader_Stage1/Reader_Stage1 -> Transform2 -> Loader_Stage2/Reader_Stage2 -> Transform3 -> Loader_Stage3/Reader_Stage3 -> Transform4 -> Target
Source -> Transform1 -> Transform2 -> Transform3 -> Transform4 -> Target
If the source is 100MB of data, the first version causes 800MB worth of I/O. The second version just 200MB of I/O - one read of 100MB, one write of 100MB. Overhead I/O = Source_Volume * 2 * number_of_staging_layers.
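The arithmetic above can be sketched as a small helper (the function name is illustrative, not part of any product API):

```python
def total_io_mb(source_mb, staging_layers):
    """Total I/O volume for a flow with the given number of staging layers.

    Base cost: one read of the source plus one write of the target.
    Each staging layer adds one full write and one full read on top.
    """
    base = source_mb * 2
    overhead = source_mb * 2 * staging_layers
    return base + overhead

print(total_io_mb(100, 3))  # first version, 3 staging layers -> 800
print(total_io_mb(100, 0))  # second version, no staging      -> 200
```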
The other problem: only in the second version do all transformations work in parallel. While Transform4 is processing row 1, Transform3 is already busy with row 2, … and the reader is already reading row 5. (Actually there are arrays and buffers in between.) In the first version you have to wait for the first dataflow to process the entire data set, and only then can the dataflow for Transform2 be started.
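The streaming behavior of the second version can be sketched with chained generators; the transform names mirror the diagram above but the row contents are made up for illustration:

```python
def read_source(n):
    """Reader: yields rows one at a time instead of materializing a stage."""
    for i in range(n):
        yield {"row": i}

def transform1(rows):
    # Each transform pulls one row, processes it, and hands it on,
    # so downstream transforms never wait for the whole data set.
    for r in rows:
        r["t1"] = True
        yield r

def transform2(rows):
    for r in rows:
        r["t2"] = True
        yield r

def run_pipeline(n):
    # Source -> Transform1 -> Transform2 -> Target, with no staging I/O.
    target = []
    for row in transform2(transform1(read_source(n))):
        target.append(row)
    return target
```

In a real engine the hand-off happens through buffers between threads rather than Python generators, but the principle is the same: rows flow through all transforms concurrently instead of being parked on disk between layers.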
Werner Daehn (BOB member since 2004-12-17)