Collect statistics for fast job execution

Dear BO experts,

Recently we migrated some jobs, and I noticed that a job which runs in 40 minutes in the Test environment takes almost 3 hours in the PRO environment.

I checked the logs, and the difference is that in VPR (our Test environment) we are using collected statistics, while in PRO we are not…

Test env. log:
Cache statistics determined that data flow <DF_XXXXX> uses 9 caches with a total size of 106378869 bytes, which is less
than (or equal to) 3739222016 bytes available for caches in virtual memory. Data flow will use IN MEMORY cache type.

PRO env. log:
Cache statistics for data flow <DF_XXXXX> are not available to be used for optimization. You must collect statistics
before optimization can be done.

Does anybody know how to activate these statistics?

Thank you!
:nopity:


Victor_BCN :es: (BOB member since 2009-09-15)

When you run the job you can enable the “Collect statistics for optimization” checkbox. Then on the next run of the job, turn off that checkbox and turn on the “Use collected statistics” checkbox.
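To give a rough idea of what the engine does with those statistics, here is a minimal Python sketch (purely illustrative, not Data Services internals; the numbers are the ones from your Test log above). Once the optimizer knows the total cache size of a data flow, it can compare it against the memory available and pick the in-memory cache instead of the slower pageable cache:

    # Illustration only (not Data Services code); numbers taken from the Test log above.
    total_cache_bytes = 106_378_869          # combined size of the 9 caches in DF_XXXXX
    memory_available_bytes = 3_739_222_016   # virtual memory available for caches

    if total_cache_bytes <= memory_available_bytes:
        cache_type = "IN MEMORY"   # everything fits in memory, fastest option
    else:
        cache_type = "PAGEABLE"    # spills to disk, typically much slower

    print(f"Data flow will use {cache_type} cache type.")

Without collected statistics the optimizer cannot make that comparison, which is exactly what the PRO log is complaining about.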

While doing the above will help (certainly on memory utilization) I doubt that it is your only problem. Be sure you have the DBA collect statistics within the source/target databases for your PRO environments. I’ve seen some nasty performance issues in new environments where the database statistics had not been gathered.
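If it helps, this is the kind of thing the DBA would run for an Oracle source. It is only a sketch: the connection details and schema name below are placeholders, and for other databases the equivalent would be UPDATE STATISTICS, ANALYZE or similar.

    # Placeholder sketch: gather optimizer statistics on an Oracle source schema.
    # Adjust user/password/dsn/schema to your environment, or ask the DBA to run
    # the equivalent DBMS_STATS call directly in the database.
    import oracledb  # pip install oracledb

    conn = oracledb.connect(user="etl_user", password="***", dsn="prodhost/PRODB")
    with conn.cursor() as cur:
        # Gathers table, column and index statistics for every object in the schema
        cur.callproc("DBMS_STATS.GATHER_SCHEMA_STATS", ["STAGE_SCHEMA"])
    conn.close()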


eganjp :us: (BOB member since 2007-09-12)

Hello,

It’s a daily job scheduled at 7 am; my idea is the following:

  1. I will deactivate the current daily schedule
  2. I will create a new schedule for this job, marking “Collect statistics” and unmarking “Use collected statistics”
  3. Tomorrow morning, this new schedule will run and collect the statistics
  4. I will then delete this new schedule
  5. I will reactivate the initial schedule with “Use collected statistics” marked

What do you think?

Thanks!


Victor_BCN :es: (BOB member since 2009-09-15)

Yes, I will check the DB statistics as well.

Thanks!


Victor_BCN :es: (BOB member since 2009-09-15)

I think you should keep both schedules. You should periodically update the gathered statistics.

Any time a Dataflow is changed in a repository, the existing statistics are invalidated and you will have to regather them.

Here is a blog post I did on this subject: http://www.prokarma.com/blog/2014/09/03/painlessly-changing-etl-jobs-gather-statistics


eganjp :us: (BOB member since 2007-09-12)

Thanks for the link!

So, as I understand it, the statistics are tied to each Dataflow.

When I run a job that contains several Dataflows with “Collect statistics” activated, it will collect statistics for all of the Dataflows.

On the next run of the job, I will unmark “Collect…” and mark “Use collected statistics”.

Is that how it works?

Thanks, regards
:wave:


Victor_BCN :es: (BOB member since 2009-09-15)

Yes, you have it correct.


eganjp :us: (BOB member since 2007-09-12)

Wow, it worked perfectly!

the “collect statistics” schedule took 2 hours

But when the “use collected statistics” schedule ran the next day, it took the expected 40 minutes.

Thanks a lot! Cool to learn something new :slight_smile:

Ciao!
:wave:


Victor_BCN :es: (BOB member since 2009-09-15)