How long is your nightly ETL job?

I am wondering – even if this is anecdotal, not scientific – how long everyone’s data warehouse/mart loads are taking.

Our staging->Warehouse->Marts load runs every 12 hours and takes ~5 hours. We have 300+ dataflows across all of these sources and targets. Some dataflows are very quick (seconds); others take 5-30 minutes. Even if they average just 1 minute each, we’re populating 300 tables, so that’s 5 hours right there. It’s hard to tune any of them to get from 1 minute to 30 seconds, for example.

I don’t know what the total Gb of data moved every night is, but it’s not that large. Most of our struggles come when the source systems don’t give good indications of what data has changed, so you have to scan them completely.
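For what it’s worth, when a source can’t tell you what changed, one common workaround is to compare row hashes against the copy kept in persistent staging, so only new or changed rows move downstream. A minimal Python sketch of the idea — the table, columns, and values here are made up purely for illustration:

```python
import hashlib

def row_hash(row):
    """Stable hash of a row's non-key values (key assumed in column 0)."""
    joined = "|".join(str(v) for v in row[1:])
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def detect_changes(source_rows, staged_hashes):
    """Split source rows into new rows and changed rows by comparing
    against hashes retained from the previous load."""
    inserts, updates = [], []
    for row in source_rows:
        key, h = row[0], row_hash(row)
        if key not in staged_hashes:
            inserts.append(row)
        elif staged_hashes[key] != h:
            updates.append(row)
    return inserts, updates

# Hypothetical rows: (id, customer, amount)
staged = {
    1: row_hash((1, "acme", 100)),    # unchanged in today's source
    2: row_hash((2, "globex", 999)),  # amount has since changed
}
source = [(1, "acme", 100), (2, "globex", 200), (3, "initech", 50)]
inserts, updates = detect_changes(source, staged)
# inserts -> [(3, "initech", 50)], updates -> [(2, "globex", 200)]
```

You still pay for the full source scan, but downstream dataflows only touch the delta.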

We’re running on 64-bit 16Gb RAM (both on DI and SQL boxes) with Win2003 w/ SQL 2005 and a SAN.

How about you? I just don’t know what’s reasonable to expect.


dnewton :us: (BOB member since 2004-01-30)

Doug,

Some thoughts about this: answer these questions for me first.

  1. are your source data flows performing any cross-system joins (heterogeneous joins)?
  2. what dictates a full source-table scan versus an indexed read? Is it consistent, or are there times when indexes can be used?
  3. How many others share the NETWORK between the servers? Is the network between servers VPN or does it contain “client” traffic?
  4. the disks: are they exclusive to the SQLServer component, or are they shared across multiple users?
  5. what does the configuration of the target look like?

And a few more, but those can wait.

Cheers,
Dan L


dlinstedt (BOB member since 2007-06-25)

(My intent with the post was just to get a general idea of what’s reasonable or not, in terms of performance. My only reference point is our own environment. I have no way of knowing if we’re really good, average, or really bad…)

#1 - generally no, we stage everything up locally to the ETL server before loading the warehouse/marts.

#2 - Not sure I understand the question, but the data staged up is indexed to support later ETL steps.

#3 - There are other networks on the switch – it’s all fiber backbone in the computer room, and the machines in question are all on the same subnet. Network utilization on the two servers (per Windows at least) is generally quite low, almost nonexistent (< 5%).

#4 - This is probably our biggest problem - the SAN is shared with transactional apps (other SQL servers). I don’t have enough visibility to know whether the LUNs or spindles or whatever they’re called are shared or dedicated to us.

#5 - The target… database server, disk, etc.?

My two main concerns are overall performance that stays resilient as source data volumes increase (say, at quarter-end), and predictability. Generally our job takes 5 hours. Inexplicably, it took just over 4 hours a few days ago; last night it took 6 hours. Volumes in all cases were about the same. I’m betting #4 is probably the issue (contention from other resources hitting the SAN).

FWIW, I think the disk throughput is somewhere around 125Mb/sec, and with DI, the best we can do is maybe 10K rows per second under optimal conditions (which generally don’t exist!).
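Just to sanity-check those figures against the 5-hour window, here’s the back-of-envelope arithmetic (the per-table row count is an assumed number, purely illustrative):

```python
# Rough throughput check using the figures above (row count is assumed).
rows_per_sec = 10_000          # quoted best-case DI throughput
tables = 300
avg_rows_per_table = 100_000   # hypothetical average

total_rows = tables * avg_rows_per_table
hours = total_rows / rows_per_sec / 3600
# 30M rows at 10K rows/sec is ~0.83 hours of pure row movement,
# so most of a 5-hour window would be going to scans, lookups,
# and contention rather than raw data transfer.
```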


dnewton :us: (BOB member since 2004-01-30)

Hi,

Just curious about your comment.

“Even on average, if they take 1 minute, we’re populating 300 tables, so that’s 5 hours alone”

From what I gather, your jobs are all running back to back (in sequence)? Or are you simply using that as an example?

Also, what tuning strategies (database and/or workflow design) have you already explored?

Brandon.


juz_b (BOB member since 2007-09-24)

The dataflows are generally running sequentially. We’re just now playing with running them in parallel, but there’s not a linear improvement there, since the dataflows are often competing for the same underlying hardware resources. Perhaps a 30% improvement at best by running more dataflows in parallel, at least on our hardware.
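That ~30% ceiling is roughly what Amdahl’s law predicts if you treat the contended disk I/O as a serial fraction. The 40% I/O share below is an assumption for illustration, not a measured number:

```python
def speedup(parallel_fraction, workers):
    """Amdahl's law: the non-parallelizable fraction caps total speedup."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

# If ~40% of dataflow time is contended disk I/O that can't overlap,
# adding concurrent dataflows quickly stops paying off:
gains = {n: round(speedup(0.6, n), 2) for n in (2, 4, 8)}
# {2: 1.43, 4: 1.82, 8: 2.11} -- diminishing returns past a few workers
```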

Yes, we’ve done extensive tuning of the dataflows and the database layer.


dnewton :us: (BOB member since 2004-01-30)

Hi,

Are both your source and target on the SAN?

Was wondering if you have already attempted the following:

  1. Replacing the staging tables with flatfiles where appropriate.

  2. Eliminating unnecessary indexes.

  3. Merging some smaller transforms where possible.

  4. Updating statistics on the highly volatile tables.

Just a thought…


juz_b (BOB member since 2007-09-24)

Thanks for the ideas.

Both the source and target are indeed on the (same) SAN. (Although these sources are not the actual transaction systems, they are replicas.)

  1. I know flat files are pretty popular in the Oracle warehousing world, but I’m not sure why. In our own informal tests, DI can read just as quickly from a well-tuned SQL database as it can from a file system. Also, our staging tables are persistent (meaning they contain full source data, not just “last night’s data”). They are indexed appropriately so we can detect changes and load incrementally from there, usually.

  2. While we’ve removed redundant indexes, we haven’t done a good job about dropping & re-adding indexes on either side of the dataflow, mostly because it’s a royal pain to do so. There’s more we can do, here.
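For what it’s worth, the drop-and-recreate dance is scriptable. Here’s a minimal sketch using Python’s built-in sqlite3 just to show the shape; on SQL Server the equivalent DROP INDEX / CREATE INDEX statements would go in the job’s pre- and post-load steps (table and index names here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage_orders (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX ix_stage_orders_id ON stage_orders (id)")

def bulk_load(conn, rows):
    # Drop the index so inserts skip per-row index maintenance...
    conn.execute("DROP INDEX IF EXISTS ix_stage_orders_id")
    conn.executemany("INSERT INTO stage_orders VALUES (?, ?)", rows)
    # ...then rebuild it once at the end, usually cheaper than
    # maintaining it through a large insert.
    conn.execute("CREATE INDEX ix_stage_orders_id ON stage_orders (id)")
    conn.commit()

bulk_load(conn, [(i, i * 1.5) for i in range(1000)])
count = conn.execute("SELECT COUNT(*) FROM stage_orders").fetchone()[0]
# count -> 1000, and the index is back in place for downstream steps
```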

  3. Do you mean transforms within a single dataflow, or… ?

  4. We have auto-update statistics turned on. We’ve discussed turning them off, which would then require steps to update them at the start and/or end of the ETL process.


dnewton :us: (BOB member since 2004-01-30)

Should we rename this thread into “We tune dnewton’s system”? :mrgreen:


Werner Daehn :de: (BOB member since 2004-12-17)

I’d much prefer the title “Give me free consulting, please” :lol:

Seriously, why hasn’t anyone piped up to say what their numbers are? :frowning: Everybody’s got their warehouse/mart loads using DI, just come clean and tell us how long it’s taking…


dnewton :us: (BOB member since 2004-01-30)

…and then this person should run my benchmark job on their hardware… :cuss:


Werner Daehn :de: (BOB member since 2004-12-17)

Happy to give the numbers…but they aren’t going to mean much because of:

Data warehouse size - I do an hourly full refresh of one DW that takes 40 minutes…but it has fewer than 1M records. I have another DI job which does an incremental (delta) load to a DW (in the region of 100,000 possible updates to a DW with 25M rows). This takes around 12 mins. I have yet another full DW rebuild that takes 76 mins (and runs every few hours), etc. etc.

Complexity - Some of the loads are simple (grab the latest data using a datestamp on the source system, do a quick table comparison and, zap, it’s there); others aren’t (complex lookups, integration of multiple data sources, etc.).

Geography - some of the overnight jobs take data from a server in the same rack; others load data from servers located all over the world.

Database sources - Mix of SQL2000, 2005, Oracle, Access, text files, etc.

Hardware - I have one job server with 8Gb of memory and one with only 2Gb…and it can make a difference.

System contention/loading - Multiple DW loads taking place concurrently has an effect on the performance of an individual DW load (ditto source-system extracts, where I can’t control how much load a source system server may already be under).

Plus, there’s stuff like table, record and field size, index strategy, network topology, etc.
…the list goes on! :twisted:


gpn01 :uk: (BOB member since 2004-11-30)

I apologize if my responses appeared to be unsolicited advice. I was merely trying to understand why dnewton thinks his system is not optimized. Of course I understand there is always room for improvement, since there are many ways to tune a system without upgrading existing hardware (which is usually the most cost-prohibitive route). But I agree that would have been more appropriate under a different title.

It should have been obvious to me that both dnewton and wdaehn are well regarded for their knowledge and contributions in this forum. Again, I apologize for my ignorance.


juz_b (BOB member since 2007-09-24)

No, no, please don’t apologize – I’m not offended in any way. If I weren’t looking for input and advice, I wouldn’t have posted. :smiley: Even if I’ve already tried some of the things you mention, perhaps this thread will be useful to others now or in the future.

Werner is just poking me for fun. :slight_smile:

Doug


dnewton :us: (BOB member since 2004-01-30)

I am not :mrgreen:

Okay, maybe a little bit.

Okay, you got me. I did.


Werner Daehn :de: (BOB member since 2004-12-17)