Jobs not running and have exit code status 3

system · November 1, 2012, 11:39am

Hello,

I’ve recently run into some issues with my Data Services 12.2 installation on a Windows 2008 R2 machine. I trigger all my Data Services jobs via bat files run from a separate scheduler (System Center Orchestrator, which used to be Opalis). For the most part this works very well, but recently some jobs have started to complete with an exit code of 3. I know the job itself is configured correctly, because all I have to do is re-trigger it and it completes with an exit code of 0 (which is what I expect). When I look in the Data Services Management Console for the run that returned exit code 3, it doesn’t appear. It’s as if the job was never actually triggered. When I rerun the job manually (again from the bat file) and get the exit code of 0, the job appears in the Management Console with all the trace information.

Does anyone know what the exit code 3 means? Am I looking at some kind of race condition that is keeping my jobs from running? Any and all help would be greatly appreciated. Thank you so much.

-NifflerX

NifflerX (BOB member since 2009-08-09)

system · November 1, 2012, 4:41pm

So you are referring to your scheduler’s exit code 3?

If that is a DS exit code, then where do you see it?

Please go to %LINK_DIR%\log\, open the file AL_RWJobLauncherLog.txt, and see what command was registered at the time the job was triggered…

ganeshxp (BOB member since 2008-07-17)

system · November 1, 2012, 4:51pm

Hi,

Our scheduler calls a bat file that triggers the Data Services job, and then returns whatever exit code the bat file returns. That’s where I’m getting the exit code of 3 from.

In the log file, the entry for the job that failed looks different from the entries for jobs that succeed:

Failed:

11_01_2012 01:50:24	CRWJobLauncherApp::InitInstance called.
11_01_2012 01:50:24	Launching Job and waiting for completion.  INET ADDR <inet:AICHIXFER:3500>, GUID <359c8fae_4676_4816_b8d8_045eaafc8d6f>. (BODI-1250132)

Success:

11_01_2012 01:51:30	CRWJobLauncherApp::InitInstance called.
11_01_2012 01:51:30	Launching Job and waiting for completion.  INET ADDR <inet:AICHIXFER:3500>, GUID <f2fdeddf_ced7_40fe_9f2a_604acb770d0d>. (BODI-1250132)
11_01_2012 01:52:00	*** RWJL_EXIT called.
11_01_2012 01:52:00	*** Job completed successfully. Exit code = <0>, GUID = <f2fdeddf_ced7_40fe_9f2a_604acb770d0d>. (BODI-1250136)

In the failed job it looks like the call never goes through, so nothing is returned. This is consistent with what I see in the Administrator, where it doesn’t appear as though the job ran. But I can’t explain why that would be, because if I rerun it from my scheduler the job completes without an issue.

Any ideas what would be preventing the Data Services job from running only at that given point? I do have other Data Services jobs being called around the same time as this job, which is why I thought there might be a conflict, but I’ve never seen that behavior before so I’m not sure that is even possible. Thank you again for all your help.

-NifflerX

NifflerX (BOB member since 2009-08-09)

system · November 1, 2012, 7:11pm

This has been seen before but I don’t think there was a solution. The Job Launcher executable doesn’t seem to “hook up” right and doesn’t run the job.

When you manually rerun the job, are you doing it from the O/S where the Job Server resides, from the 3rd party scheduler, or from the DS Management Console?

eganjp (BOB member since 2007-09-12)

system · November 1, 2012, 7:16pm

Hi,

I’m rerunning the job from the 3rd party job scheduler, which resides on the same server as the Data Services installation.

It’s too bad that there isn’t a known solution. As a workaround for the jobs that fail each night, I’ve put in a 2nd attempt that waits 10 minutes or so after the first failure. This 2nd attempt often works, for reasons I don’t really understand. It feels like a hack that doesn’t actually solve the problem.
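The retry workaround described above could be sketched roughly as follows. This is only an illustration in Python, not what the poster actually runs: the function name `run_with_retry`, its parameters, and the choice to retry only on exit code 3 are assumptions. The `runner` and `sleep` arguments exist only so the logic can be exercised without a real bat file or a real 10-minute wait.

```python
import subprocess
import time


def run_with_retry(command, retries=1, delay_seconds=600, retry_codes=(3,),
                   runner=None, sleep=time.sleep):
    """Run `command` and return its exit code.

    If the exit code is in `retry_codes`, wait `delay_seconds` and run the
    command again, up to `retries` additional times. Any other exit code
    (including 0) is returned immediately.
    """
    if runner is None:
        # Default: run the command through the shell, as a scheduler would
        # run a bat file, and hand back the process exit code.
        def runner(cmd):
            return subprocess.run(cmd, shell=True).returncode

    code = runner(command)
    attempts_left = retries
    while code in retry_codes and attempts_left > 0:
        sleep(delay_seconds)      # give whatever blocked the launch time to clear
        attempts_left -= 1
        code = runner(command)
    return code
```

A scheduler step would then call something like `run_with_retry(r"C:\jobs\my_job.bat")` (a hypothetical path) and pass the returned code back as its own exit status. This only automates the hack; it does not explain why the first launch fails.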

Are there any other log files I could take a look at, or is there an open bug report with SAP on this issue that I could add my name to? Thanks again for all the help.

-NifflerX

NifflerX (BOB member since 2009-08-09)

system · November 1, 2012, 7:26pm

Had the same problem with BODS 4.0 SP2. Scheduled via the Windows scheduler, random jobs failed with exit code 3.

Scheduled via the Management Console (which under the hood uses the Windows scheduler), I never had a problem.

Extremely annoying.

Never found the cause.

Johannes Vink (BOB member since 2012-03-20)

system · November 1, 2012, 7:28pm

This can’t easily go in as a bug.

I don’t know what that exit code is…

How many jobs get submitted at the same time?

ganeshxp (BOB member since 2008-07-17)

system · November 1, 2012, 7:32pm

I’m not positive but it wouldn’t be more than 5 at any one time, and usually it’s closer to 2 or 3. I’ve tried spacing out the calls so that they aren’t called at the same time with some limited success.

Bummer that others have seen this error when using 3rd party schedulers. I can’t use the management console scheduler because the timing is based on things outside of the realm of Data Services, like when we get files from an outside source or a database update that takes a varied amount of time.

Do you think it could be the multiple jobs at, or near, the same time? I can try spacing jobs out even further, if people think that might be the cause. Thanks again.

-NifflerX

NifflerX (BOB member since 2009-08-09)

system · November 1, 2012, 7:36pm

No, no. I have had tons of jobs submitted at the same time from schedulers like Autosys and Maestro…

I even wonder about the network connectivity… But we can see that the launcher log message is different when the job hasn’t run. Can you file a case with SAP to find out what that exit code is, and what the message from the launcher means?

Let me dig through my log to see if I have a similar message. I am now using CPS, and last week I had an issue where the job completed successfully but CPS reported that it failed…

Can you submit a ticket with SAP?

ganeshxp (BOB member since 2008-07-17)

system · November 1, 2012, 7:44pm

I’ve opened a ticket with SAP, and will update the thread if they come back with anything. In the meantime if anyone has any other ideas I’m happy to give them a shot. Thanks again for all the help so far.

-NifflerX

NifflerX (BOB member since 2009-08-09)

system · November 1, 2012, 8:32pm

5 shouldn’t be a problem. I routinely run that many. But it depends on how the jobs are designed. A job that makes heavy use of the CPU/memory isn’t going to have the same impact as a job that gets it all done on the database. That said, the question that has to be asked is, “Is this a Windows or a Data Services problem?” It’s a rhetorical question, I’m not looking to you for an answer, but if you have any tools that monitor CPU and memory it might help to look at them to see if something was hogging the O/S at the time.

eganjp (BOB member since 2007-09-12)

system · November 1, 2012, 8:35pm

I’ll try to set up some monitoring tools to see if anything looks abnormal, but all of these jobs are pretty small and very quick. They are usually done in under 10-15 seconds. The jobs are mostly basic ETL jobs that take files from outside sources, load them to our database, scrub the data and load it to our production systems. If I come up with anything I’ll post back. Thanks again.

-NifflerX

NifflerX (BOB member since 2009-08-09)

system · November 1, 2012, 8:42pm

If that’s the case then it could be an issue on the database server where your repository resides. Too many connections? Locks? Probably not but it’s something else to check.

eganjp (BOB member since 2007-09-12)

system · November 1, 2012, 9:05pm

Thanks again for all the ideas. I’ll take a look at that as well, as there are a fair number of database jobs going on when these Data Services jobs are failing. I’d still expect to see something in the log files, but if the Data Services job can’t connect to the database server at all, that might explain it. Thanks again.

-NifflerX

NifflerX (BOB member since 2009-08-09)

system · November 1, 2012, 9:18pm

It gets even more odd when your scheduler somehow runs the same job twice. We have no idea how that is happening either.

eganjp (BOB member since 2007-09-12)

system · November 1, 2012, 10:55pm

NifflerX:

Hi,

Failed:

11_01_2012 01:50:24	CRWJobLauncherApp::InitInstance called.
11_01_2012 01:50:24	Launching Job and waiting for completion.  INET ADDR <inet:AICHIXFER:3500>, GUID <359c8fae_4676_4816_b8d8_045eaafc8d6f>. (BODI-1250132)

Success:

11_01_2012 01:51:30	CRWJobLauncherApp::InitInstance called.
11_01_2012 01:51:30	Launching Job and waiting for completion.  INET ADDR <inet:AICHIXFER:3500>, GUID <f2fdeddf_ced7_40fe_9f2a_604acb770d0d>. (BODI-1250132)
11_01_2012 01:52:00	*** RWJL_EXIT called.
11_01_2012 01:52:00	*** Job completed successfully. Exit code = <0>, GUID = <f2fdeddf_ced7_40fe_9f2a_604acb770d0d>. (BODI-1250136)

-NifflerX

Just comparing this piece of log with mine. Interesting that you get the job’s GUID in there… I wonder if that changed in 4.0? My message in the launcher log is completely different…

Also, the logging does not happen strictly one job at a time. In one stretch you can see ‘n’ job initiations, and then their completions at different points. So it may be just an accident that the entries line up that way for your failed and successful runs.

Trace through the log to see whether there is an end entry for the same GUID.
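Tracing each launch to its matching end entry can be done by hand, but a small script makes it easier when many jobs interleave. The following is a minimal Python sketch, assuming the log lines have the shape of the excerpts quoted in this thread (a "Launching Job ... GUID <...>" line at start, and an "Exit code = <n>, GUID = <...>" line at the end). Whether failed jobs log an exit line in the same shape is an assumption, and the function name `unfinished_launches` is made up for illustration. Because the same GUID can show up in more than one run, each exit line is paired with the earliest still-open launch for that GUID.

```python
import re

# Patterns modelled on the launcher log excerpts quoted in this thread.
STAMP = re.compile(r"^(\d{2}_\d{2}_\d{4} \d{2}:\d{2}:\d{2})")
LAUNCH = re.compile(r"Launching Job and waiting for completion\..*GUID <([0-9a-f_]+)>")
EXIT = re.compile(r"Exit code = <(-?\d+)>, GUID = <([0-9a-f_]+)>")


def unfinished_launches(lines):
    """Return (timestamp, guid) for every launch with no later exit line."""
    pending = []  # launches still waiting for a matching exit line
    for line in lines:
        launch = LAUNCH.search(line)
        if launch:
            stamp = STAMP.match(line)
            pending.append((stamp.group(1) if stamp else "", launch.group(1)))
            continue
        finished = EXIT.search(line)
        if finished:
            guid = finished.group(2)
            # Close the earliest open launch that carries this GUID.
            for index, (_, open_guid) in enumerate(pending):
                if open_guid == guid:
                    del pending[index]
                    break
    return pending
```

Feeding it the lines of AL_RWJobLauncherLog.txt (for example via `open(path).readlines()`) would list the launches that never reported back, which are the candidates for the exit code 3 runs.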

Now this brings in another confusion. (Before the confusion: did you verify that the GUID of the failure you posted belongs to the job that didn’t get initiated?)
If so, and there is an initiate entry in the launcher log, then your scheduler tool has done its job of pinging DS and asking it to run. So something is odd. Explain this to SAP Support…

I will go and look at the file in PROD, where we run a massive number of jobs…

ganeshxp (BOB member since 2008-07-17)

system · November 2, 2012, 12:08am

I do. The job name and job GUID are the same in 2 repos on the same job server. A dev and a QA repo, for example, that you deployed your code to.

Then suddenly you run a job in dev… and it starts to run in dev and prod.

Renaming the job, compacting the repo (pre BODS 4.) or re-importing the atl solved it.

Took me months to figure out.

Please don’t tell me that there are other unknown causes.

NifflerX: could you please check your Windows event logs? I had crashes of the job launcher listed there, but no real clue as to why they happened.

Interestingly enough, my BODS 4.0 job launcher log shows this, by the way (scheduled via BODS scheduling):

11_01_2012 05:07:02	CRWJobLauncherApp::InitInstance called.
11_01_2012 05:07:04	
11_01_2012 05:19:34	*** RWJL_EXIT called.

Interestingly enough, I also have these errors:

11_01_2012 19:56:00	CRWJobLauncherApp::InitInstance called.
Enterprise authentication could not log you on. Please make sure your logon information is correct. (FWB 00008)
11_01_2012 19:56:02	*** RWJL_EXIT called.

Oops. I need to check some schedules I guess :x

Johannes Vink (BOB member since 2012-03-20)

system · November 2, 2012, 12:20am

That is how the Job Launcher shows the Log in 4.0

Johannes, are you using 4.0 version too?

Enterprise authentication failure occurs when the Login in the CMS Connection in DS Admin Console is wrong!

ganeshxp (BOB member since 2008-07-17)

system · November 2, 2012, 6:40am

BODS 4.0 indeed. And not all schedules fail. I had some developers who thought that the password key file was complicated. Don’t know why. And that is now giving problems…

Johannes Vink (BOB member since 2012-03-20)

system · November 2, 2012, 1:47pm

Hi all,

So the Windows logs yielded two interesting pieces of information. The first is the following error, which appeared in the SYSTEM log:

Application popup: Microsoft Visual C++ Runtime Library : Runtime Error!

Program: ...S\BUSINESSOBJECTS DATA SERVICES\bin\AL_RWJobLauncher.exe


This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

The second is that the job was running under a user account that I didn’t expect. We recently migrated from one domain to another. Because of this we have an account in each domain that the scheduled jobs run under. The job that failed this morning was running under the account in the old domain. So while that old domain account did have permissions to run the job, it’s possible there was some issue going across the trust between domain controllers. I’ve updated the job to run under the new domain user and will check it out next time it runs.

As far as looking at the GUIDs, I was actually using the timestamps to determine which logs go with which job. The 3rd party scheduler logs the time at which the call to Data Services occurs, and I then used that time to look through the logs. I’m not entirely sure why the GUIDs don’t stay the same between runs, but perhaps each run gets its own GUID?
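Matching launcher log entries to a scheduler trigger time, as described above, could be sketched like this in Python. It assumes the launcher timestamps are in month_day_year order followed by a 24-hour time, which is consistent with the excerpts in this thread but not confirmed by any documentation. The function name `lines_near` and the 60-second default window are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed layout of the launcher timestamp, e.g. "11_02_2012 01:51:57".
STAMP_FORMAT = "%m_%d_%Y %H:%M:%S"


def lines_near(lines, trigger_time, window_seconds=60):
    """Return the log lines stamped within `window_seconds` of `trigger_time`."""
    window = timedelta(seconds=window_seconds)
    matches = []
    for line in lines:
        try:
            # The timestamp occupies the first 19 characters of a log line.
            stamp = datetime.strptime(line[:19], STAMP_FORMAT)
        except ValueError:
            continue  # line has no leading timestamp (e.g. a bare error message)
        if abs(stamp - trigger_time) <= window:
            matches.append(line.rstrip("\n"))
    return matches
```

For example, `lines_near(log_lines, datetime(2012, 11, 2, 1, 52, 0))` would pick out the entries logged within a minute of a 01:52 trigger. Timestamps alone can be ambiguous when several jobs start in the same minute, so it is worth cross-checking against the GUID as well.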

Here are the logs from this morning: first the failure, and then the success after I triggered the job myself.

Failure:

11_02_2012 01:51:57	CRWJobLauncherApp::InitInstance called.
11_02_2012 01:51:57	Launching Job and waiting for completion.  INET ADDR <inet:AICHIXFER:3500>, GUID <1f4b41fb_b3c2_49b8_9ae4_bcb4332faca8>. (BODI-1250132)

Success:


11_02_2012 08:22:07	CRWJobLauncherApp::InitInstance called.
11_02_2012 08:22:07	Launching Job and waiting for completion.  INET ADDR <inet:AICHIXFER:3500>, GUID <f2fdeddf_ced7_40fe_9f2a_604acb770d0d>. (BODI-1250132)
11_02_2012 08:22:37	*** RWJL_EXIT called.
11_02_2012 08:22:37	*** Job completed successfully. Exit code = <0>, GUID = <f2fdeddf_ced7_40fe_9f2a_604acb770d0d>. (BODI-1250136)

I haven’t heard anything back from SAP, but will update this thread when I do. I will also update if changing the user to the new domain has any appreciable effect. Thanks again for all the help.

-NifflerX

NifflerX (BOB member since 2009-08-09)