BusinessObjects Board

How to check the run time for each workflow in a job?

Hello,

I have a few jobs where I need to do some performance optimization.

These jobs contain a large number of workflows (somewhere around 25-30 each, and each workflow has a script and a couple of data flows in it),
so how can I find out how much time each workflow takes to run within a job?

Once I have shortlisted the most time-consuming workflows, I can focus on modifying them for performance optimization.

Any suggestions or alternative ways to do this?


Dvn (BOB member since 2012-11-24)

The performance monitor in the Data Services Management Console will provide you with detailed execution stats for each data flow.

It only shows data flows, as these are the only components that actually “do” something in terms of data loading, transformations, etc. The exception is scripts in work flows, of course, but these should never do any heavy data massaging that takes a significant amount of time.

However, you can also query the repository database (in particular the AL_HISTORY table), where all the metadata is contained.

The table is not really set up for easy querying - but here’s a query I am using against our repository databases to extract additional stats for monitoring purposes etc.

Please bear in mind that our repository databases are on SQL Server 2008 - you may have to tweak the syntax to make it work on other RDBMSs.

As the AL_HISTORY table also contains the stats of many internal data flows and jobs etc (a lot of BODS processes, like profiling, are actually internal BODS jobs!), you probably want to filter by your job name to ensure you extract the relevant stats.

You can change the query to retrieve the stats by object type - it currently filters on work flows, but you can change this to data flows etc. I originally wrote the query to get the results per data flow, not per work flow, but I changed the object type to suit your needs as described in your post.

It’s not perfect but it’s all I have at the moment :slight_smile:


SELECT dfstats.OBJECT_KEY,    
       dfstats.KEY2,
       dfstats.DATAFLOW_NAME,       
       dfstats.START_TIME,
       dfstats.END_TIME,
       dfstats.EXECUTION_TIME,
       h.START_TIME as JOB_START_TIME,
       DATEDIFF(ss, H.START_TIME, dfstats.START_TIME) as JOB_LAPSETIME_SEC,
       dfstats.OBJECT_ID,
       dfstats.PARENT_ID

FROM
-- AL_STATISTICS stores one NAME/VALUE pair per row, so each stat is pulled out via a correlated subquery
(select  df.OBJECT_KEY, 
        df.KEY2, 
        df.value as DATAFLOW_NAME,
        (select z.value from AL_STATISTICS z where z.OBJECT_KEY = df.OBJECT_KEY AND z.KEY2 = df.KEY2 and z.NAME = 'START_TIME') as START_TIME,
        (select z.value from AL_STATISTICS z where z.OBJECT_KEY = df.OBJECT_KEY AND z.KEY2 = df.KEY2 and z.NAME = 'END_TIME') as END_TIME,
        (select z.value from AL_STATISTICS z where z.OBJECT_KEY = df.OBJECT_KEY AND z.KEY2 = df.KEY2 and z.NAME = 'EXECUTION_TIME') as EXECUTION_TIME,
        (select z.value from AL_STATISTICS z where z.OBJECT_KEY = df.OBJECT_KEY AND z.KEY2 = df.KEY2 and z.NAME = 'OBJECT_ID') as OBJECT_ID,
        (select z.value from AL_STATISTICS z where z.OBJECT_KEY = df.OBJECT_KEY AND z.KEY2 = df.KEY2 and z.NAME = 'PARENT_ID') as PARENT_ID        
from AL_STATISTICS df inner join 
                            (select OBJECT_KEY, KEY2 
                            from AL_STATISTICS
                            -- replace YOUR_JOB_NAME with the name of your job
                            WHERE OBJECT_KEY = (select MAX(OBJECT_KEY)-1 from AL_HISTORY WHERE SERVICE = 'YOUR_JOB_NAME')
                            AND NAME = 'OBJECT_TYPE'
                            AND VALUE = 'Workflow'
                            GROUP BY OBJECT_KEY, KEY2) obj
                            
                            on df.OBJECT_KEY = obj.OBJECT_KEY
                            and df.KEY2 = obj.KEY2
where df.[NAME] = 'OBJECT_NAME' ) dfstats inner join AL_HISTORY h
                                  on dfstats.OBJECT_KEY = h.OBJECT_KEY

ORDER BY START_TIME, KEY2

ErikR :new_zealand: (BOB member since 2007-01-10)

Hi Erik,

Thanks for this knowledge sharing.

I will surely try this out & get back to you.

Thanks again for this :slight_smile:

Wish you HAPPY NEW YEAR 2013 !!


Dvn (BOB member since 2012-11-24)

Here is the query I use to find slow running objects. The slowest objects are found at the top of the list. Note that this is for SQL Server. Let me know if you need it for Oracle.

SELECT AL_LANG.NAME,
       AL_HISTORY.START_TIME,
       AL_HISTORY.END_TIME,
       AL_HISTORY.EXECUTION_TIME,
       ObjName.VALUE as Object_Name,
       ObjType.VALUE as Object_Type,
       cast(ExecTime.VALUE as numeric(10,0)) as Object_Execution_Time
FROM AL_LANG
     JOIN AL_HISTORY 
        ON (AL_HISTORY.INST_EXEC_KEY = AL_LANG.OBJECT_KEY)
     JOIN AL_STATISTICS ExecTime
        ON (ExecTime.OBJECT_KEY = AL_HISTORY.OBJECT_KEY AND
            ExecTime.NAME       = 'EXECUTION_TIME')
     JOIN AL_STATISTICS ObjType
        ON (ObjType.OBJECT_KEY = ExecTime.OBJECT_KEY AND
            ObjType.KEY2       = ExecTime.KEY2       AND
            ObjType.NAME       = 'OBJECT_TYPE'       AND 
            ObjType.VALUE     IN ('Workflow', 'Dataflow'))
     JOIN AL_STATISTICS ObjName
        ON (ObjName.OBJECT_KEY = ExecTime.OBJECT_KEY AND
            ObjName.KEY2       = ExecTime.KEY2       AND
            ObjName.NAME       = 'OBJECT_NAME')
WHERE AL_LANG.NAME = '<enter job name here>'
  AND AL_LANG.OBJECT_TYPE = 0 -- To help use the index and narrow the scope
  AND AL_HISTORY.OBJECT_KEY = (SELECT MAX(SUBQ.OBJECT_KEY)
                               FROM AL_HISTORY SUBQ
                               WHERE SUBQ.INST_EXEC_KEY = AL_LANG.OBJECT_KEY
                                 and has_error = 0  and status = 'D' AND END_TIME IS NOT NULL)
  AND AL_LANG.VERSION = (select max(subq.version) 
                         from AL_LANG subq 
                         where subq.OBJECT_TYPE = AL_LANG.OBJECT_TYPE 
                           and subq.GUID        = AL_LANG.GUID)
ORDER BY cast(ExecTime.VALUE as numeric(10,0)) DESC;

eganjp :us: (BOB member since 2007-09-12)

Another approach I used when AL_STATISTICS wasn’t available was to write a DI job that extracted the start and end times of objects from the trace logs and stored them in a database table for further analysis. This was mainly used for a CDC job that ran continuously in a loop. One feature of the job was that it truncated AL_STATISTICS at the start of the loop because otherwise the table grew unacceptably large and performance would deteriorate as a result.
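
A minimal sketch of what the landing table for those parsed trace-log timings could look like - the table and column names here are illustrative only, not the actual table from that job:

-- Hypothetical landing table for object start/end times parsed from the trace logs.
-- Names and data types are assumptions; adjust them to your own standards.
CREATE TABLE DW_TRACE_LOG_TIMINGS (
    JOB_NAME      varchar(128) NOT NULL,
    OBJECT_NAME   varchar(128) NOT NULL,  -- work flow or data flow name taken from the log
    OBJECT_TYPE   varchar(20)  NOT NULL,  -- e.g. 'Workflow' or 'Dataflow'
    START_TIME    datetime     NOT NULL,
    END_TIME      datetime     NULL,
    DURATION_SEC  AS DATEDIFF(ss, START_TIME, END_TIME)  -- computed, in seconds
);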


dastocks (BOB member since 2006-12-11)

That’s an excellent idea. I’m working with a client whose AL_STATISTICS table has 21,895,278 rows. It’s too big, and there doesn’t seem to be any automatic archival/trimming going on.


eganjp :us: (BOB member since 2007-09-12)

True, but it gets a bit more complicated when a single BODS (Production) environment is used by 20-odd different jobs (all completely different ETL/data integration/data interface jobs).

A single truncate of the AL_STATISTICS table would probably not be recommended in that case, I think?


ErikR :new_zealand: (BOB member since 2007-01-10)

On the newest version of BODS, AL_STATISTICS can grow up to 2 GB of storage. It makes me want to laugh or cry, because SAP still says that a repository schema is only around 200 MB or so…

The only way is to clean AL_STATISTICS…


Johannes Vink :netherlands: (BOB member since 2012-03-20)

I’m starting to warm to the idea of truncating that table. I might copy off the last 50 job executions to a temp table, truncate AL_STATISTICS, and then copy the rows from the temp table back into it.
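
A rough sketch of that idea in SQL Server syntax - only run it while no jobs are executing, test it on a copy of the repository first, and note that I’m assuming AL_STATISTICS has no identity column (list the columns explicitly if it does):

-- Keep the statistics rows belonging to the 50 most recent executions in AL_HISTORY.
SELECT s.*
INTO   #keep_stats
FROM   AL_STATISTICS s
WHERE  s.OBJECT_KEY IN (SELECT TOP (50) h.OBJECT_KEY
                        FROM   AL_HISTORY h
                        ORDER  BY h.OBJECT_KEY DESC);

-- Empty the table, then put the archived rows back.
TRUNCATE TABLE AL_STATISTICS;

INSERT INTO AL_STATISTICS
SELECT * FROM #keep_stats;

DROP TABLE #keep_stats;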


eganjp :us: (BOB member since 2007-09-12)

Look at the possibility of joining AL_HISTORY and AL_STATISTICS together :wink:


Johannes Vink :netherlands: (BOB member since 2012-03-20)

While checking our AL_STATISTICS tables, I noticed that the table was relatively small in our main ETL repository - it only holds 700,000 records, as opposed to the 21 million mentioned above.

When I looked into the schedules, I noticed that they had been scheduled to use the collected statistics but the collection options were turned OFF. This explains why the table isn’t growing, which in a way is a good thing.

But what exact benefits are we missing out on by not having “Collect statistics for optimization” and “Collect statistics for monitoring” ticked?


ErikR :new_zealand: (BOB member since 2007-01-10)

You’re thinking of al_cache. Al_statistics is different. My understanding is that al_cache is associated with the use/collect statistics checkboxes.


eganjp :us: (BOB member since 2007-09-12)

I know this thread is quite old now, but someone may be interested. This is how I do it.

I have a Custom Function as follows:

# Function Name:	FN_InsertProcessLog_Entry
# Created By:		Generali UKB (RJS)
#
# Date:				14/Aug/2013      
# -----------------------------------
# Parameters
#		Input
#			$P_I_ProcessName					varchar(50)
#			$P_I_StartFinish					varchar(1)
#			$P_I_JOB_ID							int
#			$P_I_RUN_ID							int
#			$P_I_SYS_ID							int
#
#		Output
#			Return								int
#
#		Variables
#			$LV_SQL								varchar(1024)		
# -----------------------------------
# Description
#
# Inserts a record into the DW_PROCESS_LOG table for the specific execution and step.
#          
# -----------------------------------          
#
# Version Notes:
# --------------
# 1.0 - September 2013 - RJS.
#
# 1.1 - November 2013 - RJS
#       Added $P_I_SYS_ID


# Remove any ' from the entry text
$P_I_ProcessName  = replace_substr($P_I_ProcessName ,'\'','');

if($P_I_StartFinish = 'S')
begin
	# 'S' = start: insert a new log row. In the string below, {} substitutes the value quoted, [] substitutes it unquoted.
	$LV_SQL = 'insert into DW_PROCESS_LOG (PROCESS_ACTION, PROCESS_JOB_ID, PROCESS_EXECUTION_ID, PROCESS_START, PROCESS_SYS_ID) VALUES ({$P_I_ProcessName}, [$P_I_JOB_ID], [$P_I_RUN_ID], sysdatetime(), [$P_I_SYS_ID])';
end
else begin
	# Otherwise (normally 'F' = finish): stamp the end time and duration (in seconds) on the matching start row.
	$LV_SQL = 'update DW_PROCESS_LOG set PROCESS_END = sysdatetime(), PROCESS_DURATION = DATEDIFF(S, PROCESS_START, sysdatetime()) where PROCESS_ACTION = {$P_I_ProcessName} and PROCESS_JOB_ID = [$P_I_JOB_ID] and PROCESS_EXECUTION_ID = [$P_I_RUN_ID] and PROCESS_SYS_ID = [$P_I_SYS_ID]';
end

sql('DS_EIM_ADMIN', $LV_SQL );

return 1;

I then place scripts at the beginning and end of the parts of the process I want to monitor (this can be a single DF, but it must then be in its own WF - the approach won’t differentiate between a load of parallel DFs) with a simple function call:

FN_InsertProcessLog_Entry('Clean','S',$G_Job_ID,$GV_RUN_KEY,$GV_PROCESS_SYS_ID);

FN_InsertProcessLog_Entry('Clean ACQ to SRC','S',$G_Job_ID,$GV_RUN_KEY,$GV_PROCESS_SYS_ID);

each is then followed by an associated call at its Finish point:

FN_InsertProcessLog_Entry('Clean','F',$G_Job_ID,$GV_RUN_KEY,$GV_PROCESS_SYS_ID);

FN_InsertProcessLog_Entry('Clean ACQ to SRC','F',$G_Job_ID,$GV_RUN_KEY,$GV_PROCESS_SYS_ID);

It’s easy and it does exactly what it promises.
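
For reference, here is a sketch of a DW_PROCESS_LOG table that would satisfy the insert and update statements built by the function - the data types are assumptions, so adjust them to your own standards:

-- Hypothetical definition matching the INSERT/UPDATE in the custom function above.
CREATE TABLE DW_PROCESS_LOG (
    PROCESS_ACTION       varchar(50) NOT NULL,  -- e.g. 'Clean', 'Clean ACQ to SRC'
    PROCESS_JOB_ID       int         NOT NULL,
    PROCESS_EXECUTION_ID int         NOT NULL,
    PROCESS_SYS_ID       int         NOT NULL,
    PROCESS_START        datetime2   NOT NULL,
    PROCESS_END          datetime2   NULL,
    PROCESS_DURATION     int         NULL       -- seconds, populated by the 'F' call
);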


scrimsrj (BOB member since 2013-03-20)

Thx for the function.
Do you have some more info about the variables:

$GV_RUN_KEY
$GV_PROCESS_SYS_ID

Thx in advance


vandaelei (BOB member since 2015-03-06)

They are just standard variables that identify the Job and the particular Run ID. They are normally set at the start of a job.

They allow the ‘F’ call to find the matching ‘S’ record to update it with the finishing time and duration and also make it easy to organise and report on the accumulated data.


scrimsrj (BOB member since 2013-03-20)

Is there a standard DS function to fill these variables?

I’ve used $G_Job_ID = job_run_id( );
But I can’t find functions for the other variables?


vandaelei (BOB member since 2015-03-06)

Your question is not very clear: you defined a global variable for some purpose and want a function for populating it. It looks like $G_Job_ID may be an identifier for a job. The job_run_id() function (from memory) returns a number that identifies the execution of a job from a given repository.

If you are working in a shared environment, it is better to maintain an independent job control table and assign your own job execution id that is independent of the repository.

The quick answer to your question is probably no. Most projects requiring logging of job execution beyond the functionality provided out of the box define their own suite of tables and functions to do this.
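
As an illustration only, such a job control table and the run-id assignment could look something like this (the names are mine, not from any SAP-delivered schema):

-- Hypothetical job control table, independent of the DS repository.
CREATE TABLE ETL_JOB_RUN (
    RUN_ID     int IDENTITY(1,1) PRIMARY KEY,   -- your own execution id
    JOB_NAME   varchar(128) NOT NULL,
    DS_RUN_ID  int          NULL,               -- value of job_run_id(), kept for cross-reference
    START_TIME datetime2    NOT NULL DEFAULT sysdatetime(),
    END_TIME   datetime2    NULL,
    STATUS     varchar(10)  NOT NULL DEFAULT 'RUNNING'
);

-- A start-up script in the job could then insert a row and read back the new id, e.g.:
-- INSERT INTO ETL_JOB_RUN (JOB_NAME, DS_RUN_ID) VALUES ('MY_JOB', 12345);
-- SELECT SCOPE_IDENTITY();   -- assign this value to your own run variable (e.g. $GV_RUN_KEY)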


dastocks (BOB member since 2006-12-11)