First off, I have to say this isn’t exactly a new topic, as it’s been covered here before:
I first saw this issue with XI (release one) on UNIX and am now seeing it pretty frequently on XIR2 on Windows boxes. When the CMS loses connectivity with the repository database (Oracle 10g in our case), it tries to restart itself a few times and then just stops and never restarts. The startup type for the Central Management Server service is Automatic, and our Recovery options are such that the service should be restarted after each failure (and subsequent failure). Our reset fail count setting is 0 days and our service is set to be restarted every 0 minutes.
I’ve opened a case with BO because I think it’s unacceptable that the service doesn’t restart properly. Am I being unreasonable here? If left with no other choice, we can work around it, but I’d prefer that BO addresses the root cause of the issue before we try a Band-Aid approach to resolve the issue.
I’m interested in seeing:
how common this issue is
if it seems to be Oracle-specific
what people are doing to fix it.
So, if people could answer:
Is this problem happening to you?
What database type (Oracle, MySQL, SQL Server, etc.) and version?
What are you doing (if anything) to respond to this issue (tweaking recovery settings, creating a batch script, using 3rd-party monitoring tools, waiting for BO to fix it, etc.)?
Yes, I have the same problem. It happened once when the DBA stopped the DB for exceptional weekend maintenance. On Monday, the DB was back but the CMS was stopped.
Using Sybase as CMS DB (12.5).
No solution right now. I have to discuss this with the boss.
Not directly relevant, but interesting background: I inherited a Crystal Enterprise 8.5 server last year that displays this very behavior. I can tell you it’s pretty annoying to be jumping on every Sunday to see if things are running, and it’s more annoying to have the box admins go deaf when asked about allowing me to put my own monitoring in place. I’m pushing for a move to R2 this summer, and while it’s not what I’m putting in the PowerPoints, it’s been a big motivator for me. Sad to hear it hasn’t improved years after we first saw it.
I just got off the phone with BO Tech Support. They basically said that because the failure was on the database side, it’s not their problem. After they said that I pointed out that the CMS is set to automatically restart after every failure (via the Windows Service -> Recovery Settings). However, it seems that when the CMS finally stops running, it’s not logged as a failure, which means the “Restart the Service upon Subsequent Failures” recovery setting isn’t effective.
They suggested scheduling a shutdown of the CMS before the database backup every week, and scheduling it to start up when that process is complete. Although that solution may solve some headaches, I see two scenarios where it wouldn’t give us any benefit:
Backup takes longer than the window provided (we can add some buffer time in, but if the backup takes significantly longer than normal, we’ll run into our original problem)
Non-scheduled connectivity issue (database server goes down unexpectedly for whatever reason, network goes down, etc.)
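One way to soften the first scenario is to have the scheduled start-up wait until the database actually answers, instead of starting the CMS at a fixed time. The following is only a minimal sketch under assumptions: the host and port are placeholders you would supply yourself, and a successful TCP connection only shows that something is listening on that port, not that the repository database is open and usable.

```python
import socket
import time


def db_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed.
        return False


def wait_for_db(host, port, attempts=60, delay=30.0, sleep=time.sleep):
    """Poll the port up to `attempts` times, pausing `delay` seconds between tries.

    Returns True as soon as the port answers, False if every attempt fails.
    """
    for attempt in range(attempts):
        if db_reachable(host, port):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A scheduled job could call wait_for_db with the repository host and listener port and only start the CMS once it returns True. This does nothing for the second scenario (an unexpected outage in the middle of the day), which still needs some kind of periodic check.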
I’ve tested out the HealthMonitor utility that was suggested earlier, and it works perfectly (sends me an email when the CMS service goes down). However, I know that the Back Office group responsible for our servers would never allow us to install it (it’s freeware), and more importantly, that tool can’t bring the CMS back up on its own. Manual intervention (or a script during non-business hours) is required to start the CMS after it has stopped.
My concern with this issue is that I only find out if the CMS has stopped if I check for the CMS service every morning when I come in, or when I notice that something fails (like a login). Unfortunately, by that time, the damage is already done. For instance, if we have jobs scheduled to run before business hours in the morning, they won’t run. The whole point of the scheduler is to run things during off-peak hours, but if the CMS is down, we can’t do that.
I suggested two things:
Create an enhancement request for the CMS service to keep trying to restart itself after it loses database connectivity.
Email the PSO and see what solutions they’ve put in place for clients (scripts).
They’ve been touting BI as “Mission Critical” for a few years now. For me, it’s completely unacceptable that the CMS can’t recover properly from a database backup w/o using outside utilities or creating and scheduling scripts.
At one of my deployments, we ended up buying a third party tool to check and restart the CMS and other clustered services when a failure occurs. At another customer, I used the NET START and NET STOP in a scheduled batch file that did that thing. There are various tools and ways floating around. Anyway, XI does not act well in this arena. If you need 100% availability, you’re going to have to get creative till bobj fixes this. Same thing goes for life-cycle support. My two pet nags.
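For anyone wanting to try the scheduled-script route, here is a rough sketch of a watchdog that checks the service state and issues NET START only when the service is stopped. It assumes a Windows box with the standard sc and net commands, English-language sc output, and a service name that you look up yourself (the name in the example is a placeholder, not the real CMS service name). Treat it as an illustration, not a tested production script.

```python
import subprocess

# Placeholder only: look up the actual CMS service name on your server.
SERVICE_NAME = "ExampleCMSService"


def parse_state(sc_output):
    """Pull the state word (RUNNING, STOPPED, ...) out of `sc query` output."""
    for line in sc_output.splitlines():
        if "STATE" in line and ":" in line:
            # Typical line: "        STATE              : 4  RUNNING"
            parts = line.split(":", 1)[1].split()
            if len(parts) >= 2:
                return parts[1]
    return None


def ensure_running(service, run=subprocess.run):
    """Start the service if `sc query` reports it as STOPPED.

    Returns True if a start was attempted, False otherwise.
    """
    result = run(["sc", "query", service], capture_output=True, text=True)
    if parse_state(result.stdout) == "STOPPED":
        run(["net", "start", service])
        return True
    return False


if __name__ == "__main__":
    ensure_running(SERVICE_NAME)
```

Run from the Windows scheduler every few minutes, something like this would bring the CMS back after it gives up, although the start will simply fail again if the database is still unreachable at that moment.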
If you don’t mind my asking, what third-party tool did you use?
I’m currently doing some research around .bat file options, and am especially interested in using “eventtriggers.exe”. I’m thinking that when I see that specific “CMS has stopped” event, I can have it start the CMS server.
Having to work around something as basic as handling a database outage is disappointing for an Enterprise-level software product.
I’ve read all these posts and I can’t understand what the problem seems to be.
I’m running XI_R2_SP1 with the standard MySQL 4.xxx database. I’ve never seen any connectivity problems. If you say that connectivity is the problem, then maybe you should talk with your network administrator!
To back up my database I use MySQL Administrator.
tannx - So if you have XIR2 up and running, and you then shut down the MySQL database (the one holding the repository) for a good half-hour, the CMS doesn’t shut down?
BTW, the same problem exists with Crystal Enterprise 10’s CMS against MS SQL Server. So it’s a long-standing CMS problem. I disagree with BO’s assertion that it’s a problem not with their product but with the database connectivity layer.
Same problem here; XI R2 with Oracle 10g. We moved all repositories to a dedicated repository database server which is never taken down (hot-backup). We’ll also put some kind of monitoring on all CMS services and restart them if necessary (third party tool - activexperts network monitor).
Because sometimes it is inevitable - particularly if the repository database and XIR2 are on different machines (which they really should be). Systems go down, it’s a fact of life – the software should be resilient where possible.
We never intend for our DBs to be down. In our case, we have cold-backups that run every so often on the DB servers. However, there can be other causes for a loss of connectivity with a database, network-related issues topping that list.
My opinion is that the CMS should be able to recover from these types of issues. I hate to bring it up, but I never had a problem with this in BO 2.7 or 6.x. IMHO, this behavior is a glaring flaw for a “world-class” enterprise product.
Have you entered a case with BO support, and asked them to enter a bug for it? If so, post the ADAPT number here and we can all individually ask to be associated with it too, to give it more weight.