Rows per commit

Hi,

Can someone please tell me what the ideal value of ‘rows per commit’ should be?

My staging table is around 50,000-60,000 rows and the target is around 14 million.

I tried changing the rows per commit from 1000 to 2500, and it made a difference, but only on the first run :?

Will changing the rows per commit from 1000 (currently) to another value (besides 2500) help me out?

Thanks in advance!


coolvibs123 (BOB member since 2011-02-18)

https://wiki.sdn.sap.com/wiki/display/BOBJ/The+impact+of+the+CommitSize+(Oracle)

https://wiki.sdn.sap.com/wiki/display/BOBJ/The+impact+of+the+CommitSize+(SQL+Server)


Werner Daehn :de: (BOB member since 2004-12-17)

Thanks for the reply Werner,

I had one more question: I have around 15 jobs running simultaneously on my server, so if I increase the Degree of Parallelism of my DF to 2 or 4, will it make a difference?

I can see the difference in the development environment. However, I am not sure the difference will show in the production environment, as 15 jobs run at the same time. Please correct me if I'm wrong.

Thanks!


coolvibs123 (BOB member since 2011-02-18)

Too many variables. It could be anything.
The CPU scales nicely: it does not really matter whether 50 or 500 parallel threads do all the work (assuming all threads can keep the available CPUs busy, of course). Memory access is a different story. If you read the same memory addresses with 4 threads, the data will reside in the L1 or L2 cache of the local CPU. If too many processes are active, the OS has to swap program and data in and out of the cache and hence go to main memory. That is still fast, but memory runs at 1333 MHz while the CPUs and their L1 caches run at 2.5 GHz. And what is more important, all CPUs have to wait until it is their turn to access the one shared memory, whereas the cache is local to each CPU.

Network is bad, Disk is even worse.
A disk has a sequential read performance of about 100 MByte/sec. Not that bad, so one process can read a single file at 100 MByte/sec. If you read two files in parallel, however, the disk head has to jump back and forth between the locations of file 1 and file 2. This “jump” is the average seek time, roughly 10 ms.
So let's assume you read one byte of file 1, then one byte of file 2, then move back to file 1 and read another byte. You would manage just 100 seeks per second, and since each seek reads only one byte, that is a transfer rate of 100 byte/sec! Obviously the disk tries to cache more data, including data you haven't requested yet but that is nearby anyhow, and the filesystem driver rearranges the read requests so that nearby data is read at once before moving to the other position, etc.
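A rough back-of-the-envelope with the same assumed numbers (100 MByte/sec sequential, 10 ms average seek; the 64 KB chunk size is just a made-up example):

    1 byte per seek:  1 byte / 0.010 s                       = 100 byte/sec
    64 KB per seek:   64 KB / (0.010 s + 64 KB / 100 MB/s)   = roughly 6 MByte/sec

The bigger the chunk read per seek, the closer you get back to the sequential rate, which is exactly what the disk cache and the filesystem driver are trying to achieve.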

But all of that boils down to the same statement: no idea, it could be anything.

https://wiki.sdn.sap.com/wiki/display/BOBJ/Identify+a+Bottleneck+in+a+Dataflow


Werner Daehn :de: (BOB member since 2004-12-17)

Before blindly flipping switches and turning knobs to try and improve performance you should determine WHY your job is slow. Find where the bottlenecks are before you try to “fix” things. An infinite number of degrees of parallelism in your ETL isn’t going to help one bit if the source database server doesn’t produce a result set in a timely manner.


eganjp :us: (BOB member since 2007-09-12)

Jim/Werner,

Thanks for your response.

I have figured out where the bottleneck is: it's in the ORDER BY clause and the Table Comparison (row-by-row) transform. I mentioned the size of the tables in my first post on this topic.

Since the ORDER BY and TC are the time-consuming steps, I tried changing the DOP and also tried the "Run as a separate process" option. Like I mentioned earlier, it does make a difference in the development environment (a huge difference). However, I doubt whether it'll make a difference in production.

I also tried changing the rows per commit.
Are there any other ways I can improve the performance of this job?

Thanks.


coolvibs123 (BOB member since 2011-02-18)

The row-by-row comparison can be very slow if you have a very large number of rows in the input schema and/or the comparison table is not correctly indexed.

In your case the number of rows involved puts you close to switching to sorted input instead of row-by-row. Sorted input works well when both the input and the comparison table are large. It also depends on the makeup of the data in the input schema: if the input data is all over the place, sorted input may not be a good fit. By "all over the place" I mean spread across the whole comparison table. If the input schema data is skewed toward the end of the comparison table (or perhaps a narrowly defined section of it), then it is a good fit, because the cursor constructed by the Table Comparison transform will include a WHERE clause that starts the cursor toward the end of the comparison table. That way the cursor isn't scanning past millions of rows that don't match the primary key column values of the input schema.

The first column in the primary key list in the table comparison is the only one used in the WHERE clause of the sorted input cursor. So choose the order of your primary key columns very carefully.
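To make that concrete, here is a rough sketch of the shape of the cursor the sorted-input comparison opens. This is illustrative only, not the actual SQL that DS generates, and the names target_table, CUST_ID and ORDER_ID are made up:

    -- assumes CUST_ID is the FIRST column in the TC primary key list and
    -- :min_input_cust_id is the smallest CUST_ID present in the input schema
    SELECT CUST_ID, ORDER_ID, STATUS, AMOUNT
    FROM   target_table
    WHERE  CUST_ID >= :min_input_cust_id    -- only the first key column is used here
    ORDER  BY CUST_ID, ORDER_ID;

If the input keys sit in a narrow, high range of CUST_ID, that WHERE clause lets the cursor skip the millions of rows below it; if they are spread across the whole table, the cursor cannot skip anything.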


eganjp :us: (BOB member since 2007-09-12)

Order by and Table Comparison.

First question, why do you do the order by?

Table Comparison: you have 60,000 rows as input and 14 million rows in the target. So row-by-row is definitely a good idea; sorted input might be as well, but let's forget about that for now.

Row-by-row executes a select * from comparison_table where key=xyz for every input row. So which columns does the TC list as input primary keys, and does the comparison table have a matching index?
Maybe it is not the TC but the table loader: does the target table have a primary key column, really marked as primary key in DS and backed by an index?
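For example (hypothetical names): if the TC lists CUST_ID and ORDER_ID as its input primary keys, every input row triggers a lookup roughly like the one below, and the comparison table needs a matching index so each lookup is not a scan of 14 million rows:

    -- per-row lookup issued by the row-by-row comparison (illustrative)
    SELECT * FROM comparison_table WHERE CUST_ID = :cust_id AND ORDER_ID = :order_id;

    -- matching composite index on the comparison table
    CREATE INDEX comparison_table_tc_idx ON comparison_table (CUST_ID, ORDER_ID);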


Werner Daehn :de: (BOB member since 2004-12-17)

Hi,
I come across the warning below whenever a LONG column is used, which makes the loading process slow. Is there any way to overcome this?

Warning: Rows per commit for table loader has been reset to 1 because table contains a LONG column.

Thanks in advance. :slight_smile:


arunkumarmaddy (BOB member since 2011-06-13)

Nope.

Whenever a table contains a LONG column, DI will override the value of Rows per Commit in the target table options.

Are you loading that column actively?


ganeshxp :us: (BOB member since 2008-07-17)

We changed that LONG -> arraysize=1 behavior about a year ago.


Werner Daehn :de: (BOB member since 2004-12-17)

We have done testing that shows SQL Server is more than capable of handling commit scopes well beyond the 1000-row limit (relative to row width, of course). Using 800-byte-wide rows as a baseline, we compared a DS job to a T-SQL job and found that, with proper sizing of the transaction log and tempdb, the T-SQL job could easily handle 100,000+ row commits without blowing anything out, and in a fraction of the time compared to the DS job.

Control of commit scope is necessary with large conversions (you can't simply push down an operation when you're dealing with 40 million rows). Thus, we're left with two options:

  1. Design a while loop so that the operations can be pushed down entirely to the DB, but in defined increments based on row counters.
  2. Insert a Map transform, which will then allow DS to force commit points according to the fetch and commit limits. Unfortunately, these seem to be based on an upper limit of what DS can handle in terms of data "pipe" size.

I don't like #1, but it's probably the better option. The limits set by the system for fetch and commit sizes are simply too low for a heavy conversion, and #1 still avoids sending the entire operation (40 million rows) down to the DB at once, which would blow out your txn logs or tempdb (either that, or you size them to ridiculous proportions). Working on one of these right now, in fact.
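For what it's worth, here is a minimal T-SQL sketch of the kind of loop I mean in #1, with made-up table, column and batch-size values (in practice the loop can also sit in a DS script that parameterizes the pushed-down SQL):

    DECLARE @batch  INT = 100000;            -- rows per pushed-down increment
    DECLARE @min_id BIGINT, @max_id BIGINT;

    SELECT @min_id = MIN(src_id), @max_id = MAX(src_id)
    FROM   dbo.source_table;

    WHILE @min_id <= @max_id
    BEGIN
        INSERT INTO dbo.target_table (col1, col2)
        SELECT col1, col2
        FROM   dbo.source_table
        WHERE  src_id >= @min_id
          AND  src_id <  @min_id + @batch;   -- one increment per iteration; with autocommit,
                                             -- the txn log only ever holds one batch at a time
        SET @min_id = @min_id + @batch;
    END;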

I have struggled mightily with this over the past few months; any additional insight from the experts is greatly appreciated. It would be nice to see whether SAP has any future plans for making the engine more intelligent in terms of understanding DB constraints (calculating the size of the log and tempdb, for instance) and automatically setting commit points for jobs. Obviously there are additional factors, but some baseline calculation for partitioning pushdown operations seems like it would be feasible.


CILBOBuser (BOB member since 2010-12-09)

If you turn on Bulk Loader it will prevent the operation from being fully pushed down to the database.


eganjp :us: (BOB member since 2007-09-12)