BusinessObjects Board

Job fails when using collected statistics

Hi All,

We recently noticed that the statistics collected for our main job run were very out of date. This was causing an occasional issue with a particular dataflow, which was being run IN MEMORY based on the old statistics and failing because it exceeded the IN MEMORY limits.

So I initially ran the full job with “Collect Statistics for Optimization” ticked - which runs everything using PAGEABLE cache to check how much memory the dataflows need. After this the job was run again with “Collect Statistics for Optimization” unchecked and “Use Collected Statistics” checked.

This ran fine until a particular dataflow…

(12.2) 06-23-14 10:59:10 (3472:4832) DATAFLOW: Data flow <stg_REDACTED_df> is started.
(12.2) 06-23-14 10:59:10 (3472:4832) DATAFLOW: Cache statistics determined that data flow <stg_REDACTED_2_df> uses <9> caches with a total size of <1088847201> bytes. This is less than(or equal to) the virtual memory <1333788672> bytes available for caches. Statistics is switching the cache type to IN MEMORY.
(12.2) 06-23-14 10:59:10 (3472:4832) DATAFLOW: Data flow <stg_REDACTED_2_df> using IN MEMORY Cache.

It then bombs out after the al_engine.exe process eats its way up to ~2GB (32-bit OS, so this fails). NB: when collecting statistics and using PAGEABLE cache, the al_engine.exe process gets to ~800MB.

This is with no changes to the input data. So why does the statistics collection think IN MEMORY is best when the dataflow then goes over the limit and fails?
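
To lay the log's numbers alongside the 32-bit limit (rough figures only - the comparison below is just my reading of the trace message, not the documented optimizer logic):

```python
# Numbers taken from the trace log above; the decision rule is my reading
# of the message, not the documented optimizer algorithm.
estimated_cache_bytes = 1_088_847_201   # "total size of <1088847201> bytes"
available_vm_bytes    = 1_333_788_672   # "virtual memory <1333788672> bytes available"
process_limit_bytes   = 2 * 1024**3     # ~2 GB user address space for a 32-bit al_engine.exe

# The check the log describes: estimate <= available, so it switches to IN MEMORY.
print(estimated_cache_bytes <= available_vm_bytes)    # True -> IN MEMORY

# But the cache estimate is only part of the process footprint; the whole
# al_engine.exe (caches + transforms + overhead) still has to stay under ~2 GB,
# and that is the ceiling it hits.
print(round(estimated_cache_bytes / 1024**3, 2))      # ~1.01 GB estimated cache
print(round(process_limit_bytes / 1024**3, 2))        # 2.0 GB process ceiling
```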

DS version is fairly old (12.2.1.2) - we have an upgrade to 4.2 planned now that BI4 is in place.


DanDensley :uk: (BOB member since 2009-05-12)

Don’t know why it’s deciding that IN MEMORY is best, but you can force the DF to run as pageable - right-click the DF and the option is available in its properties. Personally I prefer to force everything to be IN MEMORY unless I know there may be memory issues. It’s always possible to work around memory issues, even if it means splitting a dataflow and using an intermediate staging table.


dastocks (BOB member since 2006-12-11)

@dastocks
Thanks for the reply. The problem with the option on the dataflow is that it is ignored if you run the job with “Use statistics for optimization” enabled. So we would need to go through the whole job (which is sizeable) and specify how each dataflow should be run - and then maintain that - rather than using the more automated process of collecting and using statistics.

When run collecting statistics, everything uses PAGEABLE CACHE and runs not all that slowly, to be honest - we’d still like to try to use these features to run optimally though.


DanDensley :uk: (BOB member since 2009-05-12)

I had a client that was running 12.1 and saw some very strange behavior with statistics. We found that deleting the statistics for that Dataflow solved the problem.


eganjp :us: (BOB member since 2007-09-12)

@eganjp
Thanks. What was the process you followed for deleting the statistics for the dataflow? Directly from AL_CACHE/AL_CACHE_STATS tables?

I have found that if I alter the dataflow in a way that prompts a save but doesn’t change it functionally, then the next time it runs with Use Statistics it uses PAGEABLE CACHE, as it has no stats. If we collect stats again in a subsequent run, the problem reappears.


DanDensley :uk: (BOB member since 2009-05-12)

That makes sense. Any time a new version of a Dataflow is created, the old statistics are invalidated. This was a production issue, so tweaking the Dataflow using Designer was not an option. We did delete statistics from those two tables, but only for that specific Dataflow.
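
For anyone who needs to do the same, here is a minimal sketch of that cleanup. The DF_NAME filter column is a placeholder I’ve made up - check the actual AL_CACHE / AL_CACHE_STATS columns in your repository (they may key on an object id instead) and back both tables up before deleting anything:

```python
# Sketch only: remove cache statistics for one dataflow from the DS repository.
# DF_NAME is a made-up filter column - verify the real AL_CACHE / AL_CACHE_STATS
# schema in your repository and take a backup of both tables before running
# anything like this.
import pyodbc  # assumes an ODBC DSN pointing at the repository database

DATAFLOW = "stg_REDACTED_2_df"

conn = pyodbc.connect("DSN=DS_REPO;UID=repo_user;PWD=repo_pwd")
cur = conn.cursor()

for table in ("AL_CACHE", "AL_CACHE_STATS"):
    cur.execute(f"DELETE FROM {table} WHERE DF_NAME = ?", DATAFLOW)
    print(table, cur.rowcount, "rows deleted")

conn.commit()
conn.close()
```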


eganjp :us: (BOB member since 2007-09-12)

Sorry, my post was incorrect: the dataflow options are ‘Pageable’ or ‘In-Memory’, and ‘Pageable’ is the default. If you specify ‘Pageable’, the statistics are used to decide whether it should use pageable or in-memory caching - and heaven help you if the statistics are wrong. Specify ‘In-Memory’ and it will always use in-memory caching, which means it all dies if you run out of memory. If you do your testing properly you will find this out before the dataflow gets into production, and usually you can modify the dataflow to work using the in-memory cache (I come from the days before DS offered the pageable cache, when everything had to run in-memory).

What I don’t like about pageable is a) it’s the default and b) you can’t be sure that’s what will be used at run-time. I therefore try to make sure all my production code has

a) in-memory specified for dataflows unless I know that the pageable cache will be required.
b) statistics collection / use collected statistics disabled at job level.

That way I know what’s going to happen.

nothing that can’t be fixed with a global search/replace on an ATL file …
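
Something along these lines (rough sketch only - the attribute text I’m matching on is an assumption, so copy the exact strings from one of your own .atl exports first):

```python
# Sketch: flip every pageable dataflow cache setting to in-memory in an
# exported .atl before re-importing it. The attribute strings below are
# assumptions - open a real export and copy the exact text from there.
from pathlib import Path

src = Path("job_export.atl")
dst = Path("job_export_inmemory.atl")

text = src.read_text(encoding="utf-8", errors="replace")

old = "Cache_type = 'pageable_cache'"      # hypothetical attribute text
new = "Cache_type = 'in_memory_cache'"     # hypothetical attribute text

print(f"{text.count(old)} dataflow cache settings to flip")
dst.write_text(text.replace(old, new), encoding="utf-8")
```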

I used to develop/maintain a near-real-time data warehouse that used Oracle CDC and a continuously looping DS job - the whole thing contained dataflows that would run several hundred times during the execution of the job. Wherever possible, dataflows were written (and set) to run using in-memory cache, because letting the environment choose was a recipe for disaster. We had to take similar precautions with database statistics.


dastocks (BOB member since 2006-12-11)

dastocks, my only disagreement with your approach is that your Dataflows end up allocating more memory than they need to. Without statistics the Job Server will end up allocating the maximum amount of memory for a Dataflow. If you had statistics for a Dataflow that uses little or no memory, then the Job Server would allocate significantly less memory.


eganjp :us: (BOB member since 2007-09-12)