CMS Stopped after loss of connectivity with db

First off, I have to say this isn’t exactly a new topic, as it’s been covered here before:

I first saw this issue with XI (release one) on UNIX and am now seeing it pretty frequently with XIR2 on Windows boxes. When the CMS loses connectivity with the repository database (Oracle 10g in our case), it tries to restart itself a few times and then just stops and never restarts. The startup type for the Central Management Server service is Automatic, and our Recovery options are set so that the service should be restarted after the first, second, and every subsequent failure. Our "Reset fail count after" setting is 0 days and our "Restart service after" setting is 0 minutes.

I’ve opened a case with BO because I think it’s unacceptable that the service doesn’t restart properly. Am I being unreasonable here? If left with no other choice, we can work around it, but I’d prefer that BO addresses the root cause of the issue before we try a Band-Aid approach to resolve the issue.

I’m interested in seeing:

  • how common this issue is
  • if it seems to be Oracle-specific
  • what people are doing to fix it.

So, if people could answer:

  1. Is this problem happening to you?
  2. What database type (Oracle, MySQL, SQL Server, etc.) and version?
  3. What are you doing (if anything) to respond to this issue (tweaking recovery settings, create a batch script, 3rd party monitoring tools, waiting for BO to fix it, etc.)

Thanks,

DJ


DJ06482 :us: (BOB member since 2002-11-22)

Yes, I have the same problem. It happened once when the DBA stopped the database for an exceptional maintenance weekend. On Monday, the database was back but the CMS was stopped.
We use Sybase (12.5) as the CMS database.
No solution right now. I have to discuss this with the boss :smiley:


jerryf :fr: (BOB member since 2006-02-13)

Not directly relevant, but interesting background: I inherited a Crystal Enterprise 8.5 server last year that displays this very behavior. I can tell you it’s pretty annoying to be jumping on every Sunday to see if things are running, and it’s more annoying to have the box admins go deaf when asked about allowing me to put my own monitoring in place. I’m pushing for a move to R2 this summer, and while it’s not what I’m putting in the PowerPoints, it’s been a big motivator for me. Sad to hear it hasn’t improved years after we first saw it.


Cris E :us: (BOB member since 2005-03-18)

I have this same problem…

We have Oracle 10g, using BOE XI R1 SP1.

It usually happens when the CMS database goes down for the weekly cold backup.

I have test reports that run every morning and send me an email as confirmation that the server is up.

I could attach a service alert to have me paged when this service (cms.exe) is down, but so far I am resisting this idea.

Thanks
Ram


ramkrish (BOB member since 2004-08-04)

Also get the same problem:

SQLServer 2000
Windows 2003
BO XI R2

I was hoping this would be fixed in XI R2, but it wasn’t. Now I’m hoping it’s fixed in SP1, but it doesn’t sound like it is.

I will also log a call with BO, as I think this doesn’t exactly fit with their view that XI has automatic failover and no single point of failure.


gbnz :new_zealand: (BOB member since 2005-07-26)

These test reports won’t run if the CMS is down, correct? So the absence of an email notification in the AM is your clue that something is wrong?

DJ


DJ06482 :us: (BOB member since 2002-11-22)

I just got off the phone with BO Tech Support. They basically said that because the failure was on the database side, it’s not their problem. After they said that I pointed out that the CMS is set to automatically restart after every failure (via the Windows Service -> Recovery Settings). However, it seems that when the CMS finally stops running, it’s not logged as a failure, which means the “Restart the Service upon Subsequent Failures” recovery setting isn’t effective.

They suggested scheduling a shutdown of the CMS before the database backup every week, and scheduling it to start up when that process is complete. Although that solution may solve some headaches, I see two scenarios in which it wouldn’t give us any benefit:

  1. Backup takes longer than the window provided (we can add some buffer time in, but if the backup takes significantly longer than normal, we’ll run into our original problem)
  2. Non-scheduled connectivity issue (database server goes down unexpectedly for whatever reason, network goes down, etc.)
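
One way to soften the first scenario is to start the CMS only once the database answers again, rather than at a fixed clock time. The following is a minimal sketch under stated assumptions, not a tested production script: the service name, host, and port are placeholders you would supply, and a successful TCP connection only shows that the database listener is up, not that the repository is fully open. It relies on the standard Windows `net start` command.

```python
import socket
import subprocess
import time


def db_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to the database listener succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_service_when_db_is_back(service, host, port, poll_seconds=60,
                                  max_attempts=120, probe=db_is_reachable,
                                  run=subprocess.run, sleep=time.sleep):
    """Poll the database, then start the Windows service once it answers.

    Returns True if `net start` was issued and exited with code 0, and
    False if the database never became reachable within max_attempts
    polls or the start command reported an error (for example because
    the service was already running).
    """
    for _ in range(max_attempts):
        if probe(host, port):
            result = run(["net", "start", service])
            return result.returncode == 0
        sleep(poll_seconds)
    return False
```

Scheduled to run at the end of the expected backup window, a call such as `start_service_when_db_is_back("<your CMS service name>", "<db host>", 1521)` would simply keep waiting if the backup overruns (1521 is the usual Oracle listener port; adjust for your database).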

I’ve tested out the HealthMonitor utility that was suggested earlier, and it works perfectly (sends me an email when the CMS service goes down). However, I know that the Back Office group responsible for our servers would never allow us to install it (it’s freeware), and more importantly, that tool can’t bring the CMS back up on its own. Manual intervention (or a script during non-business hours) is required to start the CMS after it has stopped.

My concern with this issue is that I only find out if the CMS has stopped if I check for the CMS service every morning when I come in, or when I notice that something fails (like a login). Unfortunately, by that time, the damage is already done. For instance, if we have jobs scheduled to run before business hours in the morning, they won’t run. The whole point of the scheduler is to run things during off-peak hours, but if the CMS is down, we can’t do that.
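
For the unscheduled case, a small watchdog that polls the service state and restarts it would close that gap. This is only a sketch, assuming an English-language Windows where `sc query` prints a `STATE : <n> <NAME>` line; the service name is a placeholder, and the five-minute interval is an arbitrary choice.

```python
import re
import subprocess
import time

# Matches the state line of `sc query`, e.g. "STATE : 1  STOPPED".
STATE_PATTERN = re.compile(r"STATE\s*:\s*\d+\s+(\w+)")


def parse_service_state(sc_output):
    """Extract the state word (e.g. RUNNING, STOPPED) from `sc query` output."""
    match = STATE_PATTERN.search(sc_output)
    return match.group(1) if match else None


def check_and_restart(service, run=subprocess.run):
    """Start the service if `sc query` reports it as STOPPED.

    Returns the state observed before any restart attempt.
    """
    query = run(["sc", "query", service], capture_output=True, text=True)
    state = parse_service_state(query.stdout)
    if state == "STOPPED":
        run(["net", "start", service], capture_output=True, text=True)
    return state


def watch(service, interval_seconds=300, sleep=time.sleep):
    """Check the service forever at a fixed interval."""
    while True:
        check_and_restart(service)
        sleep(interval_seconds)
```

If the database is still down when the restart fires, the CMS will presumably stop again, and the next pass of the loop will simply retry.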

I suggested two things:

  1. Create an enhancement request for the CMS service to keep trying to restart itself after it loses database connectivity.
  2. Email the PSO and see what solutions they’ve put in place for clients (scripts).

They’ve been touting BI as “Mission Critical” for a few years now. For me, it’s completely unacceptable that the CMS can’t recover properly from a database backup w/o using outside utilities or creating and scheduling scripts.

DJ


DJ06482 :us: (BOB member since 2002-11-22)

At one of my deployments, we ended up buying a third-party tool to check and restart the CMS and other clustered services when a failure occurs. At another customer, I used NET START and NET STOP in a scheduled batch file that did the same thing. There are various tools and approaches floating around. Anyway, XI does not act well in this arena. If you need 100% availability, you’re going to have to get creative till bobj fixes this. Same thing goes for life-cycle support. My two pet nags. :reallymad:

Ang.


angelsd1 :us: (BOB member since 2005-10-21)

If you don’t mind my asking, what third-party tool did you use?

I’m currently doing some research around .bat file options, and am especially interested in using “eventtriggers.exe”. I’m thinking that when I see that specific “CMS has stopped” event, I can have it start the CMS server.
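
As a rough sketch of the eventtriggers idea, the helper below only builds and runs an `eventtriggers /create` command that fires `net start` when a given event ID shows up in the Application log. The trigger name, the event ID, and the service name are all placeholders: I don’t know the actual ID of the “CMS has stopped” event, so that would have to be looked up in Event Viewer first. Service names containing spaces would also need extra quoting inside the `/tk` argument.

```python
import subprocess


def build_trigger_command(trigger_name, event_id, service, log="application"):
    """Build an `eventtriggers /create` command that runs `net start`.

    event_id and service are placeholders to be filled in from your own
    Event Viewer entries and service configuration.
    """
    return [
        "eventtriggers", "/create",
        "/tr", trigger_name,          # name of the new trigger
        "/l", log,                    # event log to watch
        "/eid", str(event_id),        # event ID that fires the trigger
        "/tk", f"net start {service}",  # command to run when it fires
    ]


def register_trigger(trigger_name, event_id, service, run=subprocess.run):
    """Create the trigger; returns True if eventtriggers exited with code 0."""
    command = build_trigger_command(trigger_name, event_id, service)
    return run(command).returncode == 0
```

This only covers the case where the stop is actually written to the event log, and a restart attempted while the database is still down will presumably fail again.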

Having to work around something as basic as handling a database outage is disappointing for an Enterprise-level software product.

DJ


DJ06482 :us: (BOB member since 2002-11-22)

Business Objects has put this on the list of enhancement requests. The ER number is:

ADAPT00595957

DJ


DJ06482 :us: (BOB member since 2002-11-22)

I’ve read all these posts and I can’t understand what the problem seems to be.
I’m running XI R2 SP1 with the standard MySQL 4.xxx database and have never seen any connectivity problems. If you say that connectivity is the problem, then maybe you should talk with your network administrator!
To back up my database I use MySQL Administrator.


tannx :estonia: (BOB member since 2006-02-20)

tannx - So if you have XIR2 up and running, and you then shut down the MySQL database (the one holding the repository) for a good half-hour, the CMS doesn’t shut down?

BTW, the same problem exists with Crystal Enterprise 10’s CMS against MS SQL Server. So it’s a long-standing CMS problem. I disagree with BO’s assertion that it’s a problem not with their product but with the database connectivity layer.


dnewton :us: (BOB member since 2004-01-30)

Same problem here: XI R2 with Oracle 10g. We moved all repositories to a dedicated repository database server which is never taken down (hot backup). We’ll also put some kind of monitoring on all CMS services and restart them if necessary (third-party tool: ActiveXperts Network Monitor).


JornGH (BOB member since 2003-02-21)

What’s the point of stopping the database and keeping the CMS running?


tannx :estonia: (BOB member since 2006-02-20)

Because sometimes it is inevitable - particularly if the repository database and XIR2 are on different machines (which they really should be). Systems go down, it’s a fact of life – the software should be resilient where possible.


dnewton :us: (BOB member since 2004-01-30)

We never intend for our DBs to be down. In our case, we have cold-backups that run every so often on the DB servers. However, there can be other causes for a loss of connectivity with a database, network-related issues topping that list.

My opinion is that the CMS should be able to recover from these types of issues. I hate to bring it up, but I never had a problem with this in BO 2.7 or 6.x :reallymad: IMHO, this behavior is a glaring flaw for a “world-class” enterprise product.

DJ


DJ06482 :us: (BOB member since 2002-11-22)

Have you entered a case with BO support, and asked them to enter a bug for it? If so, post the ADAPT number here and we can all individually ask to be associated with it too, to give it more weight.


dnewton :us: (BOB member since 2004-01-30)

I’ve entered a case, the ADAPT number is buried in the middle of the first page in this thread:

ADAPT00595957

dnewton, you’re absolutely right, the more people that request the enhancement, the more likely that it happens.

Thanks,

DJ


DJ06482 :us: (BOB member since 2002-11-22)

Hi, same problems here.

XI R2 SP1
Windows 2003 (clustered with 6 machines: 2 web, 4 app)
Oracle 9i

Any time the database connection is lost, the CMS goes down and often does not recover.

Also, if the time on the boxes differs by more than 5 minutes, the CMS goes down.

Nice to know I’m not alone.
Rob.


madrob101 :uk: (BOB member since 2005-11-10)

You have to set the time on your servers to be the same, otherwise the CMSes will not start. This is not a bug :wink:


Sheshachala5 :india: (BOB member since 2004-01-09)