Jobserver Crashing During SQL Server Backup

loopi · May 24, 2021, 8:22pm

SAP BusinessObjects BI Platform 4.2 Support Pack 7 Version: 14.2.7.3069
We just changed our DBMS from Oracle to SQL Server 2017 (version 14.0.3356.20) running on Windows Server 2016 Datacenter. This problem is new since the DBMS change. We use Rubrik as our backup solution. While backups run, our job server fails about 10 minutes into the backup, which normally completes in 30 minutes. As a result, no scheduled instances fire and all reports delivered by the job server fail. The Rubrik backup performs I/O reads constantly over 100 MB/sec during the backup; normal read/write activity is 150 KB/sec. Any suggestions as to why the heavy disk activity kills my job server?

Thank you!

JohnBClark · May 24, 2021, 8:46pm

Have you tried tracing your job server to see what errors it records?
You might want to trace your CMS server as well if you suspect the DBMS, since the CMS handles all the database communication.

Our company has used Rubrik in the past (not sure if they still are or not) and we have always been on SQL Server for our DBMS. We’ve been on VMs since BI4.1, about 5 years, so it’s been a while since we were in a possible use case for Rubrik. I wouldn’t think a change in the DBMS would have any impact on this. It just doesn’t sound right.

Your post brings a couple of questions to mind:

Have you been using Rubrik for a while or is it new?
Was the backup that was running a backup of your Business Objects servers or of the database server?
Are you running a single stack of Business Objects or a clustered environment?

loopi · May 25, 2021, 1:25pm

Thank you for your response. I am not a full-time sys admin and do this work in my “spare time”, so I may not be versed in all the lingo. The CMS log had this error occurring at about the time the job server begins to struggle.

|5001d11e-ef92-db94-b946-269c76b69dad|2021 05 24 11:11:44:701|-0600|Error| |>>|A| |cms_SERVERNAME.CMS| 4736|6736|| |1|0|1|0|BIPSDK.InfoStore:query|SERVERNAME:5308:538.158218:76|BIPSDK.InfoStore:query|SERVERNAME:5308:538.158218:76|cms_SERVERNAME.CMS.queryEx3|localhost:4736:6736.1897865:1|CkVyn.TSIEkrodpCVjIrNMU3c08c|||||||||||Scopes were closed out-of-order. TraceLog workflow data may have been corrupted.

Was using Rubrik prior to DBMS change with no known issues.
Rubrik is on the SQL server, with the backup being written to another VM.
Non-clustered environment.

Thank you again for your time
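
For anyone who wants to pull the timestamp, severity and message out of a trace line like the one quoted above, here is a minimal Python sketch. The field positions are an assumption inferred only from that single line (pipe-delimited, timestamp in the second populated field, message in the last field); they are not taken from SAP documentation, and the function name `parse_trace_line` is hypothetical.

```python
from datetime import datetime

# The trace line quoted in the post, split across string literals for readability.
SAMPLE = (
    "|5001d11e-ef92-db94-b946-269c76b69dad|2021 05 24 11:11:44:701|-0600|Error| |>>|A| |"
    "cms_SERVERNAME.CMS| 4736|6736|| |1|0|1|0|BIPSDK.InfoStore:query|"
    "SERVERNAME:5308:538.158218:76|BIPSDK.InfoStore:query|SERVERNAME:5308:538.158218:76|"
    "cms_SERVERNAME.CMS.queryEx3|localhost:4736:6736.1897865:1|CkVyn.TSIEkrodpCVjIrNMU3c08c"
    "|||||||||||Scopes were closed out-of-order. TraceLog workflow data may have been corrupted."
)


def parse_trace_line(line: str) -> dict:
    """Split one pipe-delimited trace line into a few named fields.

    Assumption: the layout matches the sample line above. A message that
    itself contains a pipe character would be truncated by this sketch.
    """
    fields = line.strip().split("|")
    # fields[0] is empty because the line starts with a pipe.
    return {
        "id": fields[1],
        "timestamp": datetime.strptime(fields[2], "%Y %m %d %H:%M:%S:%f"),
        "utc_offset": fields[3],
        "severity": fields[4],
        "component": fields[9],
        "message": fields[-1],
    }


if __name__ == "__main__":
    record = parse_trace_line(SAMPLE)
    print(record["timestamp"], record["severity"], record["message"])
```

Running this over every line of the CMS and job server trace files and keeping only the `Error` entries would make it easier to see what was logged in the minutes before the job server goes down.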

JohnBClark · May 25, 2021, 8:03pm

It looks like there may be some type of connection issue to the database, but I’m not sure. If you have support from SAP, you can try searching their knowledge base for the phrase

or you can open a case with them.

loopi · May 26, 2021, 12:42pm

Performed a detailed test yesterday and have now confirmed that the job server failure occurs only during our Rubrik backup, when disk activity averages nearly 100 MB/sec of reads for extended periods. We have one DB that takes close to half an hour to complete; the rest complete in less than 4 minutes. Very strange. The job server fails every time after about 12 minutes of the heavy disk activity caused by the backup!
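
One way to make this kind of check repeatable is to compare each failure time against the backup window and see how far into the backup it falls. The sketch below is only an illustration: the function name `minutes_into_backup` and all of the timestamps are hypothetical, not values from the actual test.

```python
from datetime import datetime, timedelta
from typing import Optional


def minutes_into_backup(
    backup_start: datetime, backup_end: datetime, event: datetime
) -> Optional[float]:
    """Return how many minutes into the backup window the event occurred,
    or None if the event falls outside the window."""
    if backup_start <= event <= backup_end:
        return (event - backup_start).total_seconds() / 60
    return None


if __name__ == "__main__":
    # Hypothetical backup window: starts at 11:00 and runs for 30 minutes.
    start = datetime(2021, 5, 25, 11, 0)
    end = start + timedelta(minutes=30)

    # Hypothetical job server failure times taken from trace logs.
    failures = [datetime(2021, 5, 25, 11, 12), datetime(2021, 5, 25, 14, 5)]

    for failure in failures:
        print(failure, minutes_into_backup(start, end, failure))
```

If every real failure lands inside the window at roughly the same offset, that supports the backup as the trigger; failures outside the window would point somewhere else.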

JohnBClark · May 26, 2021, 2:18pm

Is your database on the same server with your Business Objects install? If so, that would make more sense if Rubrik were the issue.

loopi · May 26, 2021, 7:32pm

Database and BO are on two different servers (VMs). The SQL database is the only application running on the DB machine, and BO is the only application running on its VM. Both machines are Server 2017 and all data drives are SSD on both VMs. Plenty of disk space is available, so paging should not be an issue. The SQL machine has 8 cores and 32 GB RAM. Our database would be considered quite small as databases go!

JohnBClark · May 26, 2021, 7:44pm

I don’t see how running a backup on the database server would increase the I/O to the BO server. Something odd is going on. Have you brought this to the attention of whoever manages the Rubrik backups? It could be something with a setting there. I wonder if Rubrik is somehow locking files so that BO can’t update the database.

loopi · May 26, 2021, 8:27pm

BO is connected to the SQL Server. During peak Rubrik activity (100 MB/sec), BO is making calls to the database to run live reports. Threads must be getting stacked up, the report requests become overwhelming, and sooner or later the job server fails.