Thank you for your answer! Unfortunately, re-importing the objects didn’t resolve the issue.
The error appears as soon as I tick one of the checkboxes to generate statistics in the Job Execution properties AND the data flow uses in-memory caching somewhere inside (I tested it with the Table Comparison transform and the lookup_ext() function).
If either of these two conditions is not met, the error does not appear. I am still testing, and so far no other parameter of the environment seems to affect this behaviour. I will continue to test and experiment further, e.g.:
Testing the behaviour on a repository that was created in 4.2, not upgraded to it.
Testing the behaviour with different types of source/target databases.
Testing the behaviour when the cached table and the target table are different tables.
I suspect that our DSConfig.txt file may not be fully or properly configured.
To help with troubleshooting, I tried to find information on WHAT exactly happens when we enable statistics generation for job execution, but unfortunately I didn’t find any useful resources.
Go into the DSConfig.txt file on the job server and change the Global_DOP from 2 to 1. In DS 3.2 the default was 1. With a number of my clients I have found that this small default change between 3.x and 4.x can create a large number of problems, and switching back to 1 fixes many of them.
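For reference, the setting lives in the engine section of DSConfig.txt on the job server. A minimal excerpt of what to look for (assuming the stock 4.x layout; surrounding keys omitted):

    [AL_Engine]
    Global_DOP=1

You may need to restart the job server so that new executions pick up the change.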
Often, if Dataflows are not designed with DOP > 1 in mind, there can be issues. One example is where the initial Query transform performs an ORDER BY to support the Sorted Input option of the Table Comparison: if other Query transforms sit between that initial Query and the Table Comparison, then with DOP > 1 those intermediate transforms run in parallel streams and the rows can arrive at the Table Comparison unsorted.
Alternatively, just change the Dataflow’s properties to force DOP = 1.
After enabling “Transform Tracing” I can see that during job execution Data Services performs the following operation once for every row in the cached comparison table:
In my case the target tables had 60k rows, so the operation was performed 60k times.
The log of the failed run (with the “generate statistics” options enabled) looks like this:
The log of the healthy run (with the “generate statistics” options disabled) looks like this:
The first part of the logs (I enabled memory tracing as well) looks exactly the same in both runs.
What are the volumes of these tables? I’ve had some nasty crashes with one development job myself until I realised one lookup actually had its caching option enabled even though the lookup table was very large - our poor little development DS 4.1 box ran out of RAM … and crashed.
(I could reproduce this problem every single time - it really was the caching option in that lookup, against that table - changing it to No Cache resolved the issue immediately.)
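For anyone auditing their own lookups: the cache behaviour is the second element of the lookup_ext() specification, so it is easy to spot in the ATL or in the Query transform’s mapping. A rough sketch (the datastore, table and column names here are made up for illustration):

    lookup_ext([DS_DEV.DBO.CUSTOMER, 'PRE_LOAD_CACHE', 'MAX'],
               [CUST_NAME], [NULL],
               [CUST_ID, '=', Query.CUST_ID])

'PRE_LOAD_CACHE' reads the whole lookup table into memory up front; 'NO_CACHE' is the “no cache” setting mentioned above, and 'DEMAND_LOAD_CACHE' caches only the rows actually requested.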
How did you upgrade your code from 3.2 to 4.2? Did you perform a repository upgrade (the right way), or did you export an ATL from 3.2 and import it directly into your new DS 4.2 testing environment (the wrong way)?
The problem reproduces on very small tables (3 columns, a few records).
As long as we do not tick those two checkboxes in the Job Execution properties, the problem does not appear and everything works fine - by everything I mean “Caching”, “Bulk load”, all transforms, etc.
Only two things need to be enabled to reproduce the problem: “caching” (in any form, it does not have to be the TC transform) and “Generate statistics” … nothing else (bulk load, database type, target and source objects) affects the behaviour.
It would be really great to understand what exactly this “Generate statistics” does under the hood, i.e.:
where DS stores the generated statistics.
what objects and structures DS generates statistics on.
how exactly DS uses the generated statistics.
which processes/scripts/jobs perform the statistics generation, at which points of the job’s execution, and how.
The generated statistics are held in the repository tables AL_CACHE and AL_CACHE_STATS. If some knucklehead locked those tables down such that you can’t write to them, then that could be a problem - but perhaps not with the symptoms you’ve described.
When the Dataflow runs, it stores metadata about the execution in AL_STATISTICS. This is different from gathering statistics, or from using statistics already gathered.
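A quick sanity check, assuming you can connect to the repository database as the user the job server uses (table names as per the above; I’m deliberately not naming columns, since the layout varies by version):

    -- confirm the job server's repository user can at least read the tables
    SELECT COUNT(*) FROM AL_CACHE;
    SELECT COUNT(*) FROM AL_CACHE_STATS;
    SELECT COUNT(*) FROM AL_STATISTICS;

If any of these fail with a permission error, the lock-down theory above is worth chasing before anything else.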
Is it possible that your job server is horribly starved for memory?
On a 64-bit O/S it’s pretty easy to exhaust 8 GB of memory. The job server is pretty good about allocating memory based on how much is available, but I would still configure more than just 8 GB. I run 32 GB on my desktop PC.
When you say that you recreated the problem in a new DS 4.2 repository - did you also create a new data flow to test with, or did you import one of the existing (DS 3.2) data flows into that new repository?
I think it would be good to isolate whether this is a problem with migrating DS 3.2 objects to DS 4.2 or a bug in DS 4.2 that also occurs with data flows built from scratch.
Look at the Display Optimized SQL in the Dataflow, both with and without the Merge transform. I suspect that the Merge is forcing a different approach to how the Dataflow is using the database.
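For comparison: with a full pushdown, Display Optimized SQL typically shows a single INSERT … SELECT, meaning the load runs entirely inside the database; without pushdown you only see a plain SELECT against the source and the engine does the rest. Illustrative shapes only (table and column names are invented):

    -- full pushdown: the whole load is executed by the database
    INSERT INTO TGT_TABLE (COL1, COL2)
    SELECT COL1, COL2 FROM SRC_TABLE;

    -- no pushdown: DS pulls the rows back and processes them in the engine
    SELECT COL1, COL2 FROM SRC_TABLE;

If the Merge were changing the pushdown, you would expect these two shapes to differ between the two versions of the Dataflow.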
I’ve double-checked, and I get the same optimized SQL whether I use the Merge transform or not. The only difference is that the job actually runs with the Merge transform in place (doing nothing but passing the data through) and doesn’t run without it.
In fact it doesn’t even get into the DF - I just get the Insufficient Memory error.