Can't get Primary Node services to start completely

Hello,

I’m having an urgent issue where I cannot get the Primary Node in a 5-node cluster (W2K3) to start completely.

I have ensured when trying to start it that all WebI services on the 4 Secondary Nodes are completely off (including killing WIProcessManager and itnode_daemon).

I have also tried toggling the Cluster Service between Auto and Manual, and starting it manually.

After a reboot, whether I first start the Cluster Service, and then use the “Start Server” BO Start Menu Option, or it Auto starts, or I use the winotify tray icon, the behaviour is as follows:

  1. Icon starts flashing
  2. Cluster Service is started (as shown in W2K3 Services Manager)
  3. Icon does not stop flashing
  4. Attempting a login through InfoView results in a “Service Not Available”
  5. Attempting a login through Administration Console returns a “The Business Objects system is not started. (ADC 00071)” error.

I have verified that Supervisor on this box can connect fine via the bomain.key located in the high-level LocData folder. The same BOMain.key (named “bomain.key”) is located in the nodes\host\cluster\locdata folder. I have also checked bomain.key with the wmainkey utility. (Note: this is a 5.1 repository that has recently been successfully migrated to 6.5.1 for use with this cluster.)

Finally, analysis with wasfadm shows the following:
With Other Nodes Disabled


Status  Node  N_ON  N_OFF  Module  M_ON  M_OFF  PROC
Booting 1     1     0      14      14    0      8

With Other Nodes Enabled (Auto-Starting)


Status  Node  N_ON  N_OFF  Module  M_ON  M_OFF  PROC
Booting 5     5     0      xx      xx    0      x

Investigation through Windows Task Manager shows the following ORB-related processes starting successfully:
itconfig_rep
itlocator
itnaming
itnode_daemon

The following WebI processes are also running:
WIAdminServer
WILoginServer
winotify
WIProcessManager
wiqt
wiqt
wiService
WISessionManager
WISiteLog
WIStorageManager

I have verified simple settings in the Configuration Tool (Primary Node, Port Range, etc), and yesterday, I blew away the BO Install, deleted all related folders and registry keys, and re-installed it, with the same result.

The environment for all 5 boxes is:
-Windows Server 2003
-DB2 Connect 8.2

I have SEEN the Secondary Nodes come up, although I’m not sure if it is repeatable. I think they came up when the Primary Node was off, and their services started. I have not seen the Primary Node complete the WebI “Booting” process.

Any help is greatly appreciated as I’m under an “end of the weekend” deadline to have this running!

Thanks,

Dan Lelovic

[Edited, removed the word “urgent” from topic title. Please, respect that all members of BOB are volunteers. Thank you, Andreas.]


dl_toronto :canada: (BOB member since 2002-08-28)

I’ve experienced the same, and the quickest way around this is to delete the application/web virtual directories (myinfoview, admin, etc.).
Uninstall IIS, re-install IIS, and then recreate the virtual directories. It takes about 40 minutes.
i.e.
Use the Add/Remove Programs feature to uncheck IIS and run it, check it again to install and run it, and then recreate WebI.


MikeD :south_africa: (BOB member since 2002-06-18)

I will give that a shot tomorrow and post back with feedback.

Thanks!

Dan Lelovic


dl_toronto :canada: (BOB member since 2002-08-28)

If you leave the ORB alone you won’t have to back anything up, and I think your secondary nodes should still be able to attach back OK.
One thing to consider is that if you have auto start enabled and a shutdown did not complete cleanly before a reboot, you will continue to have issues starting BO again.
Don’t trust the icon and always wait 5-10 minutes after BO has shut down before rebooting. I turned off auto start on my IIS servers and rather start BO manually once I’m sure the boxes are up & running.


MikeD :south_africa: (BOB member since 2002-06-18)

The notify icons are notoriously bad indicators of the cluster status. Your best resource is the WIProcessManager_boot_jtrace.log file in the node logs directory. Open it with some non-locking text editor like Textpad and watch it as you start up WebI. You will get a pair of lines for every process, like this:

WIStorageManager booting.
WIStorageManager OK. 

When you see a line at the end that says

Node xyz is operational.

, then everything has started up fine. If you don’t get that line a few minutes after the initial start-up, look through lines you did get. Chances are there is a process with a “… booting.” entry without an “… OK.” entry. This is a good indicator of where your problem lies. You may be able to continue the startup if you kill that process from task manager. Usually a new process will be started immediately. Then check the log file again and see if new rows start coming in. I would also turn tracing on for the process that is causing the problem and root out the problem, because you don’t want to babysit WebI like this every time it restarts.
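If eyeballing the file gets tedious, the pairing check described above can be scripted. This is only a minimal sketch: the function name `stuck_services` and the regular expression are my own, and it assumes each status line ends in "<process> booting" or "<process> OK", optionally with a "service " prefix or a trailing period. Adjust the pattern to whatever your log actually contains.

```python
import re

# Assumed line ending: "<name> booting" or "<name> OK", with an optional
# "service " prefix and an optional trailing period.
EVENT = re.compile(r"(?:service\s+)?(\w+)\s+(booting|OK)\.?\s*$")


def stuck_services(lines):
    """Return processes that logged 'booting' with no later 'OK' entry."""
    pending = []
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue  # not a status line (e.g. "Node xyz is operational.")
        name, state = match.groups()
        if state == "booting":
            if name not in pending:
                pending.append(name)
        elif name in pending:
            pending.remove(name)
    return pending


# Illustrative input only, not a real log.
sample = [
    "WIStorageManager booting.",
    "WIStorageManager OK.",
    "WIAdminServer booting.",
]
print(stuck_services(sample))  # ['WIAdminServer']
```

For a real run you would pass the lines of the log file instead of `sample`, for example `stuck_services(open(path))` with `path` pointing at your own node logs directory.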


Ke6n Swindlehurst :us: (BOB member since 2003-09-16)

Quick update while I hack away at this…

It appears that the hangup is related to the booting of the WIAdminServer process. You can see in the trace file below that it is hanging, and then when I execute the “Server Stop” menu choice, it finally comes back, but during the shutdown process.


...
2005/08/01 21:16:07.953|<=|||3396|12| |||||||||||||||service WISessionManager OK
2005/08/01 21:16:07.953|<=|||3396|12| |||||||||||||||service wiqt booting
2005/08/01 21:16:07.968|<=|||3396|12| |||||||||||||||service WIAdminServer booting
2005/08/01 21:16:08.906|<=|||3396|1| |||||||||||||||service wiqt OK
2005/08/01 21:28:34.140|<=|||3396|0| |||||||||||||||service WIAdminServer OK
2005/08/01 21:28:34.140|<=|||3396|1| |||||||||||||||service wiqt OK
2005/08/01 21:28:35.140|<=|||3396|0| |||||||||||||||service WISessionManager OK
2005/08/01 21:28:35.140|<=|||3396|0| |||||||||||||||service WIStorageManager OK
...

That sounds pretty consistent with what MikeD is saying. (based perhaps on the Admin application’s tie-in to IIS).

I have not yet tried the IIS/virtual directory stuff as I am a little fuzzy on it (specifically, am I using the Config Tool to remove the configuration for Infoview and the Admin web applications? After that, am I uninstalling IIS from the OS??)

Will post back later – thanks again,

Dan.


dl_toronto :canada: (BOB member since 2002-08-28)

Did you enable tracing for WIAdminServer? You might get a helpful error code there. I would try killing the first instance of WIAdminServer that appears in Task Manager after a minute of waiting. Sometimes it just gets stuck.

I’ve never had to reinstall IIS for WebI, but yes those steps sound accurate. You get to IIS from the Control Panel: Add/Remove Programs -> Windows Components. That part’s pretty easy. Got server CD’s lying around, just in case?


Ke6n Swindlehurst :us: (BOB member since 2003-09-16)

Thanks again for your assistance. If it comes down to blowing IIS away, I think I will have to defer to a group here. Reason being, things are pretty locked down, and although I have permissions (system-wise) to do this, business-wise, it could be a bad move! :frowning:

I will enable a trace for WIAdminServer and see what I’m getting.

Thanks!

Dan Lelovic


dl_toronto :canada: (BOB member since 2002-08-28)

Small update…

Left for dinner, came back 2 hours later. Primary Node showed that it was fully booted (winotify icon solid, and WIProcessManager logfile indicated node was operational). Looks like there was nearly a 2 hour break in the log entries, and then the primary node’s logfile indicated “WIAdminServer OK” followed by everything else, then the operational message.
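A break like that can be found without scrolling through the file. Here is a rough sketch, assuming every entry starts with a timestamp in the form seen in the trace excerpts in this thread (`YYYY/MM/DD HH:MM:SS.mmm` followed by a pipe); the helper name `largest_gap` is made up for illustration, and lines that do not parse are skipped.

```python
from datetime import datetime


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest pause between
    consecutive timestamped lines, or None if fewer than two lines parse."""
    best = None
    prev_time = prev_line = None
    for line in lines:
        stamp = line.split("|", 1)[0].strip()
        try:
            when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S.%f")
        except ValueError:
            continue  # no leading timestamp on this line
        if prev_time is not None:
            gap = when - prev_time
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_time, prev_line = when, line
    return best


# Lines copied from the earlier WIProcessManager excerpt in this thread.
sample = [
    "2005/08/01 21:16:07.968|<=|||3396|12| |||||||||||||||service WIAdminServer booting",
    "2005/08/01 21:16:08.906|<=|||3396|1| |||||||||||||||service wiqt OK",
    "2005/08/01 21:28:34.140|<=|||3396|0| |||||||||||||||service WIAdminServer OK",
]
gap, before, after = largest_gap(sample)
print(gap)    # 0:12:25.234000
print(after)  # the entry that finally arrived after the pause
```

The entry printed after the gap names whichever process the node was waiting on.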

Looked at the secondary nodes, and they all had flashing winotify icons… Tried to login on Primary node, and did not have any luck. (Page not found, etc).

I guess when the primary comes up, it instructs all secondaries with the same cluster name and CORBA port range to come up as well? I had them all set to manual… but they were still trying to come up.

Dan.


dl_toronto :canada: (BOB member since 2002-08-28)

My suggestion was not as technically sound as the rest, but it actually resolved our issue. Tracing and locating the issue IS the recommended approach, but my experience with this issue points to a bad shutdown that just messes with the next startup, and I was not patient enough for tracing etc.
I found it quicker to just quit this never-ending loop by resetting back to a base point.
The IIS uninstall/re-install is pretty standard, and then the WebI rebuilds start from a clean slate.
Waiting long periods was tried before this process and sometimes worked, but the problem re-appeared, with WebI just bombing and not starting again.
I then lost patience and did the IIS thing and never had any more problems, along with ensuring that the WebI shutdown has a decent period before the box reboots!

I recall a techsupport post on this but can’t recall their suggestion.

The secondary nodes should not auto start unless someone has defined some dependency process to do this?


MikeD :south_africa: (BOB member since 2002-06-18)

Good info, Mike.

As soon as I am able, I will take the IIS uninstall/reinstall approach as an attempt. (this is not a production system, so I am not that worried – yet).

On the technical side, here is the trace info.

WIProcessManager Log:


2005/08/02 02:50:04.873|<=|||540|1| |||||||||||||||service WISiteLog booting
2005/08/02 02:50:04.873|<=|||540|1| |||||||||||||||service ConnectionServer booting
2005/08/02 02:50:05.122|<=|||540|15| |||||||||||||||service WISiteLog OK
2005/08/02 02:50:05.247|<=|||540|1| |||||||||||||||service ConnectionServer OK
2005/08/02 02:50:05.247|<=|||540|1| |||||||||||||||service WILoginServer booting
2005/08/02 02:51:06.110|<=|||540|12| |||||||||||||||service WILoginServer OK
2005/08/02 02:51:06.110|<=|||540|12| |||||||||||||||service WIStorageManager booting
2005/08/02 02:51:06.110|<=|||540|12| |||||||||||||||service WISessionManager booting
2005/08/02 02:51:06.282|<=|||540|6| |||||||||||||||service WIStorageManager OK
2005/08/02 02:51:06.547|<=|||540|4| |||||||||||||||service WISessionManager OK
2005/08/02 02:51:06.547|<=|||540|4| |||||||||||||||service wiqt booting
2005/08/02 02:51:06.547|<=|||540|4| |||||||||||||||service WIAdminServer booting
2005/08/02 02:51:07.279|<=|||540|1| |||||||||||||||service wiqt OK

While it’s waiting on WIAdminServer to boot, here is what that module trace is waiting on:

WIAdminServer Trace:


2005/08/02 02:51:06.796|==| | | 1696|2280| |||||||||||||||ActorMgr::ActorMgtJob::activate
2005/08/02 02:51:06.796|==| | | 1696|2280| |||||||||||||||ActorMgr::ActorMgtJob::refresh
2005/08/02 02:51:06.796|==| | | 1696|2280| |||||||||||||||ActorMgr::ActorMgtJob::reset
2005/08/02 02:51:06.796|<<| | | 1696|2280| |||||||||||||||actormgt reset job factory to null
2005/08/02 02:51:06.796|==| | | 1696|2280| |||||||||||||||Adminsrv::ResMgtJob::init()
2005/08/02 02:51:06.796|<=| | | 1696|2280| |||||||||||||||Adminsrv::ResMgtJob::reset()
2005/08/02 02:51:06.796|==| | | 1696|2280| |||||||||||||||Adminsrv::ResMgtJob::loadGenPars()

^^^ And this is where it waits. I’ll know tomorrow what comes after. I am wondering whether this loadGenPars() call hits the repository for something, and whether this freshly migrated repository (from a much-used production 5.1 repository) has errors of some sort. I guess my next steps to resolve this will be:

  1. IIS Uninstall/Re-install (when available)
  2. Log case with technical details with BO
  3. Perform middleware trace on Primary Node to see if DB holdups are involved (repository issues?)

Dan.


dl_toronto :canada: (BOB member since 2002-08-28)

Dan,
This does not match my experience. I think you should shut down every webi process in the cluster, then reboot all the secondary nodes - keep the services set to manual. Then reboot the primary and start WebI, either auto or manual. I get the feeling something is still running on those secondary nodes.

Otherwise, I believe WIAdminServer only provides two functions for the whole deployment: the Admin Console and Supervisor Over the Web. If you want to verify the rest of your setup, InfoView will run without it. You can test this by editing \config\localnode.xml: find the line referring to WiAdminServer and change the enabled property to “false”. When WebI starts up, it won’t bother with WIAdminServer at all.
-Kevin


Ke6n Swindlehurst :us: (BOB member since 2003-09-16)

Again, great info Kevin. Thanks!

I believe the nodes were autostarting because although I had set the Service to Manual, I had not unchecked the “Automatically start…” option in the Cluster/Service preferences in the Config tool. Fixing that changed that behaviour.

Put 2 + 2 together about that Adminsrv::ResMgtJob::loadGenPars() call, and realized that it is directly related to the HUGE amount of data hanging around in OBJ_M_GENPAR. I guess the WIAdminServer process reads this table for some reason, and this was taking nearly 2 full hours! :blue: By disabling the loading of WIAdminServer (Thanks Kevin 8) ), I was able to get the node to load within a couple of minutes.

I realized the role OBJ_M_GENPAR was playing in part by creating a blank 6.5.1 repository and pointing the cluster to it. I was able to get the cluster started fine on this repository (including WIAdminServer), but still had “500 - Internal Server Error” messages when trying to connect to InfoView. I resolved this as well by:
-Removing Infoview and Admin reference in Config Tool
-Removing all related branches in IIS Manager
-Using config tool to recreate Infoview and Admin web applications
-Tested, still had failure
-Went back to IIS Manager, and ensured script executing was enabled (it was not, thus the ASP scripts could not execute)
-Re-tested, and voila.

So now I’m running a HUGE scan/repair/compact operation against my migrated repository, but am confident that the solution will be 100% soon enough. 8)

Mike/Kevin: Thanks for your help… I’ll make a final posting later confirming the resolution.

Regards,

Dan Lelovic


dl_toronto :canada: (BOB member since 2002-08-28)

Just wanted to confirm for future thread searchers… the resolutions above fixed the issues.

Thanks,

Dan Lelovic


dl_toronto :canada: (BOB member since 2002-08-28)

That reminds me, there are some Officially suggested indexes for OBJ_M_GENPAR you might consider, if size is a problem down the road:

  1. M_GENPAR_N_USERID
  2. M_GENPAR_N_USERID, M_GENPAR_N_TYPE

They didn’t come standard on 6.1b -can’t remember about 6.5.1.
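For anyone who wants to see the shape of the DDL, here is a small sketch. Treat all of it as assumption: the index names are invented, the column types are guesses, I am reading the second item as a composite index on both columns, and SQLite is only an in-memory stand-in so the statements can be tried without touching a real repository. Check with your DBA and BO support before running anything like this against the actual repository database.

```python
import sqlite3

# Hypothetical index names; only the table and column names come from the list above.
DDL = [
    "CREATE INDEX IDX_GENPAR_USERID "
    "ON OBJ_M_GENPAR (M_GENPAR_N_USERID)",
    "CREATE INDEX IDX_GENPAR_USERID_TYPE "
    "ON OBJ_M_GENPAR (M_GENPAR_N_USERID, M_GENPAR_N_TYPE)",
]

# In-memory stand-in table with assumed column types.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OBJ_M_GENPAR ("
    "M_GENPAR_N_USERID INTEGER, M_GENPAR_N_TYPE INTEGER)"
)
for statement in DDL:
    conn.execute(statement)

# Confirm both indexes exist on the stand-in table.
index_names = sorted(
    row[1] for row in conn.execute("PRAGMA index_list('OBJ_M_GENPAR')")
)
print(index_names)
```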
-Kevin


Ke6n Swindlehurst :us: (BOB member since 2003-09-16)

Interesting you should mention those, Kevin. I wonder why old BCA-related rows in GENPAR were not deleted by the “Purge” function in the old 5.1.6 BCA Console, or through 5.1.6’s Scan and Repair process, both of which my client performed very frequently. You would think when the related DS_PENDING_JOB rows were removed, that the related rows in OBJ_M_GENPAR would have also been removed. Is that a known bug in 5.1.x that the 6.5.1 scan and repair now corrects?

Thanks,

Dan Lelovic


dl_toronto :canada: (BOB member since 2002-08-28)

Dan,
There are a lot of things I would have expected from software this expensive. On the other hand, unmet expectations help get me a paycheck every two weeks…

That wouldn’t surprise me, but I’ve never been curious enough to look it up.
-K


Ke6n Swindlehurst :us: (BOB member since 2003-09-16)

Hi All,

I have found the solution, and it’s working fine for me (BO 6.5 with Windows 2003 Server).
Please locate the admisapi.dll file under install directory/nodes/cluster /cluster name/IIS/1/wiadmin/bin, add it to the Web Service Extensions list, and click Allow.

This works fine if you are using IIS v6.0.

Once this is done, please restart your server and log in to the WebI Admin Console, and everything should work fine.

Thanks ,


Dhruva (BOB member since 2006-11-14)