We have a job that runs weekly on weekends. Two of its dataflows took 4.6 and 7.8 hours to process 94785525 and 16506039 records respectively, although the job normally completes in under 20 minutes.
Can you please advise what the cause might be, and which logs I should look at for a detailed analysis?
In one EIMAdaptiveProcessingServer log, I see some error lines like:
Fatal: A server with type pjs is not found in the CMS BODS01.comp.local:6400 and the cluster @BODS.comp.local:6400 with service DS.RFCService. Such a server may be down or it may be invalidated by the administrator. (FWM 01014)
org.apache.axis2.AxisFault: A server with type pjs is not found in the CMS BODS.comp.local:6400 and cluster @BODS.comp.local:6400 with service DS.RFCService. Such a server may be down or it may be invalidated by the administrator. (FWM 01014)
But these don't explain why the job took so long to complete.
I usually blame the DBA for running memory/cpu/network intensive operations on the weekend.
Seriously, check for that issue FIRST. It’s the easiest thing to troubleshoot. A database backup running at the same time as an ETL job can certainly cause performance problems.
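On SQL Server you don't even have to ask the DBA: msdb records every backup, so you can check directly whether one overlapped the slow run. A minimal sketch, assuming pyodbc, a trusted connection, and a hypothetical server name and job window (replace both with your own):

```python
# Sketch: list backups that overlapped the slow job window.
# Server name and window times are assumptions -- substitute your own.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=BODSDB.comp.local;DATABASE=msdb;Trusted_Connection=yes"
)

sql = """
SELECT database_name, type, backup_start_date, backup_finish_date
FROM msdb.dbo.backupset
WHERE backup_start_date < ?    -- job end
  AND backup_finish_date > ?   -- job start
ORDER BY backup_start_date
"""
# Assumed job window for the slow weekend run.
for row in conn.cursor().execute(sql, "2014-06-08 12:00", "2014-06-08 04:00"):
    # type: D = full, I = differential, L = log backup
    print(row.database_name, row.type,
          row.backup_start_date, row.backup_finish_date)
```

If that returns rows inside the job window, you have your first suspect.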
The database logs (SQL Server 2012) are huge; I will continue to explore them.
There is also an entry I could see in the Event Viewer:
Database access error. Reason: [Microsoft][ODBC SQL Server Driver][SQL Server] Transaction (Process ID 64) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. (FWB 00090)
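That deadlock is worth chasing on its own: since SQL Server 2008, the built-in system_health Extended Events session records a deadlock graph for every deadlock, so you can retrieve the report for that Process ID 64 collision after the fact. A sketch under the same assumptions as above (hypothetical server name):

```python
# Sketch: read xml_deadlock_report events from the system_health session files.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=BODSDB.comp.local;DATABASE=master;Trusted_Connection=yes"
)

sql = """
SELECT CAST(event_data AS xml).value('(event/@timestamp)[1]', 'datetime2') AS event_time,
       CAST(event_data AS xml).query('(event/data/value/deadlock)[1]') AS deadlock_graph
FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
WHERE object_name = 'xml_deadlock_report'
ORDER BY event_time
"""
for row in conn.cursor().execute(sql):
    print(row.event_time)
    # The graph names the sessions, statements, and locks involved.
    print(row.deadlock_graph)
```

The deadlock graph tells you exactly which two statements collided, which is far more useful than the generic FWB 00090 wrapper.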
Often a database looks like it is running slow, when what is really happening is that sessions are waiting for locks to be resolved.
I’ve seen this in Oracle shops where a developer opens SQL Developer, runs a query and lets it sit there (causing a read lock). Then in another session someone tries to truncate the table. The truncate will just sit there waiting for the read lock to be released. That’s the simple description.
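On SQL Server you can watch that kind of blocking live while the job is crawling, via the dynamic management views. A sketch, again assuming pyodbc and a hypothetical server name:

```python
# Sketch: show which sessions are blocked, who is blocking them,
# and what they are running. Execute while the job is slow.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=BODSDB.comp.local;DATABASE=master;Trusted_Connection=yes"
)

sql = """
SELECT r.session_id,
       r.blocking_session_id,   -- the session holding the lock
       r.wait_type,
       r.wait_time,             -- ms spent waiting so far
       t.text AS running_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
"""
for row in conn.cursor().execute(sql):
    print(f"session {row.session_id} blocked by {row.blocking_session_id} "
          f"({row.wait_type}, {row.wait_time} ms): {row.running_sql}")
```

If your dataflows show up here with long wait_time values, the job is not slow, it is stuck behind someone else's lock.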
I have since learned that this error had been happening for quite some time, and there was no database backup activity on the weekends when the job took a long time to complete.
Check with the sysadmin group to see if an O/S backup was being done. Have you looked at the timing of the error? Does it always happen at about the same time?
Late reply, but root cause analysis is nearly impossible with incomplete data. Are you running monitoring (like Nagios) on the DB and DS servers? That would at least give you a chance.