BusinessObjects Board

Unexpected Logout in Infoview

We have a BusinessObjects XI R2 SP3 installation.

Currently it is configured with two Apache web servers running the mod_jk connector, set up with a worker.properties file for active/passive failover.

We also have two application servers that are configured in a BO cluster.

The problem that we are having is that when one web server is down, users are being kicked back to the logon page at random points in their sessions. Sometimes it is 5 minutes, and sometimes it can be 50 minutes. However, if both web servers are running, the issue cannot be duplicated. This leads me to believe that there is an issue with the session. I have verified that the session timeout is functioning properly, and BO doesn’t think that it is affecting this particular issue.

All outside web requests come in on port 443 to a Cisco CSS, are encrypted/decrypted on a Cisco SCA, and are then forwarded to the Apache servers on port 80. The CSS provides the load balancing via source IP.


aheeter (BOB member since 2008-04-30)

I’d need more information to confirm my suspicions and explain how to resolve this, but I think you’re a victim of load balancing that is not complete.

How have you set up the CMS and XI for load balancing?

I’ve used Cisco to redirect between passive and active XI clusters on Apache/Tomcat instances and it’s worked well - you just have to ensure that the entire server setup is aligned with the BO installation(s).


MikeD :south_africa: (BOB member since 2002-06-18)

Currently I’m using the mod_jk Jakarta connector from the Apache web servers. Below is the configuration for the worker.properties file.

#The Advanced router LB
worker.list=router

#Define a worker using ajp13
worker.boe1.port=8009
worker.boe1.host=Server1
worker.boe1.type=ajp13
worker.boe1.lbfactor=1
worker.boe1.redirect=boe2

#Define another worker using ajp13
worker.boe2.port=8009
worker.boe2.host=Server2
worker.boe2.type=ajp13
worker.boe2.lbfactor=1
worker.boe2.activation=disabled

#Define the LB worker
worker.router.type=lb
worker.router.balance_workers=boe1,boe2

#Define keepalive
worker.router.socket_keepalive=1

Here is the httpd.conf entry:
JkWorkersFile conf/workers.properties
JkLogFile logs/mod_jk.log
JkLogLevel info
JkMount /* router
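For completeness: the sticky routing an lb worker performs assumes each Tomcat instance advertises a route that matches its worker name, via the jvmRoute attribute in server.xml. A sketch of what that would look like on Server1 (assuming the default Catalina engine; not verified against our install):

```xml
<!-- conf/server.xml on Server1 (sketch): jvmRoute matches the
     mod_jk worker name so the lb worker can pin sessions to it -->
<Engine name="Catalina" defaultHost="localhost" jvmRoute="boe1">
  <!-- Host and valve definitions unchanged -->
</Engine>
```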


aheeter (BOB member since 2008-04-30)

I was referring to the BO installation - how many server groups and CMSs do you have?

Note: If you have one CMS and have not delineated your servers per group, the CMS will attempt load balancing regardless of your web and app server setup.

I used one CMS instance and two server groups, and let the load balancing be handled by the CMS by stopping or starting a particular server group via a Perl script - which was in turn triggered by a failed restart of a server or by the Cisco redirector.


MikeD :south_africa: (BOB member since 2002-06-18)

We have two Apache web servers and two CMS servers that are clustered. There are no server groups defined.


aheeter (BOB member since 2008-04-30)

Your clustered CMS is the probable cause: it is issuing requests to the application servers that you thought you were managing via the web server routing.


MikeD :south_africa: (BOB member since 2002-06-18)

Hey Mike,
Thanks for your guidance in this, but is it possible for you to elaborate a bit on your previous statement about the probable cause being the clustered CMS?

I have had trace logging running on the CMS servers and they are not picking up any errors.


aheeter (BOB member since 2008-04-30)

I suspect that the additional servers added to your cluster are not being recognised in the CMS/CMC as being non-functional.
I.e. regardless of where your web server tries to route the requests, the internal BO load balancing is managed via the CMS.

The fact that you have sessions getting stranded, or users being logged off at odd intervals, is due to the CMS routing those requests to the server you are trying to avoid.
I.e. it doesn’t matter that your Cisco/web setup has determined that the one server is not to be used, as the CMS doesn’t seem to know that it cannot balance the load across to this server.
So the requests that fail are the ones being issued by the CMS to that server in its attempt to spread the load.

Somehow the registered BO services/servers on your alternate app server are still listed as being in an operable state in the CMS.

Or

Due to the alternate server not being shut down correctly - via the server DISABLE then STOP process - you have sessions that are stranded at various points of their processing.
Remember that you don’t know where a user is in the processing chain - the thread/event may be hopping from a Report Server to a Page Server on different machines - and depending on the processing time, it could be a while before it tries to move back or forward to the next processing step, then fails when the server it tries to contact to continue the session is not available.
Thus the different times in your failures - they relate to different activities in the step-by-step processing journey.


MikeD :south_africa: (BOB member since 2002-06-18)

Hey Mike,
In doing some investigating, I was able to determine that the web.xml file located under the \businessobjects\enterprise115\desktoplaunch directory has a setting that lists the “preferred CMS server,” so to speak. Both of our application servers were pointed to the same box. So for instance, on the APP01 server it was pointed to the APP01 CMS, and the APP02 box was also pointed to the APP01 CMS.

I spoke with BO and went over our configuration, and was able to instead list the cluster name in that parameter and then list the individual CMS servers in the cluster. From what I can tell, this appears to be doing a much better job of distributing the load. Not to mention that, looking through the previous config, I’m not even sure we were allowing the CMS cluster to perform the load balancing in any respect.

We are testing the new configuration today and I should have some results by mid-day as to whether this was part of the problem.

Also, I was able to find a setting in that same web.xml (session.timeout.opendoc.redirecturl) that allows you to redirect an invalid session to a given page. I made a change to this for testing purposes to redirect to an error page, so that we will be able to rule out whether or not the session is actually being dropped.
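For anyone searching later, a parameter of that kind would normally appear in web.xml as a context-param; a sketch (the URL below is just a placeholder, not our real page):

```xml
<!-- sketch: send invalid sessions to a custom error page -->
<context-param>
  <param-name>session.timeout.opendoc.redirecturl</param-name>
  <param-value>/errorPages/sessionExpired.html</param-value>
</context-param>
```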


aheeter (BOB member since 2008-04-30)

Hey Mike,
You were right in your assumption that the problem was on the CMS and in how it was configured. Thanks for the direction. We are still testing, but we have not seen the error since I made the configuration changes. Below is a description of what was performed.

I added the following lines to the web.xml file located in the
\businessobjects\enterprise115\desktoplaunch\WEB-INF directory.

For our setup, we have a cluster (@BOEPROD) and two CMS servers, fronted by two Apache servers on Windows using the Jakarta connector.

Under the original web.xml, the section was as follows:

cms.default [b]SERVERNAME:6400[/b]

Both of our CMS servers had the same SERVERNAME in that particular file, and I believe that is what was causing the issue.

I changed the file to instead point to the cluster name and then listed the members of the cluster. This is allowing BO to do all of the data flow between the servers, as you had suggested.

cms.default @BOEPROD
cms.clusters @BOEPROD
cms.clusters.BOEPROD [b]CMS_SERVER1:6400, CMS_SERVER2:6400[/b]
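For reference, those name/value pairs live in web.xml as context-params; the changed section looks roughly like this (a sketch using the names above):

```xml
<!-- sketch: point desktoplaunch at the cluster, then list its members -->
<context-param>
  <param-name>cms.default</param-name>
  <param-value>@BOEPROD</param-value>
</context-param>
<context-param>
  <param-name>cms.clusters</param-name>
  <param-value>@BOEPROD</param-value>
</context-param>
<context-param>
  <param-name>cms.clusters.BOEPROD</param-name>
  <param-value>CMS_SERVER1:6400, CMS_SERVER2:6400</param-value>
</context-param>
```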

This seems to have corrected the issue. Thanks for your help


aheeter (BOB member since 2008-04-30)

Hey Mike,
My previous suspicions that the problem was fixed were incorrect. However, in doing the full gamut of testing, I have been able to determine that the issue is being caused by the worker.properties/Tomcat configuration.

Just to review: we have two Apache front-end servers that pass traffic to the two CMS servers. If I set up the worker.properties file so that there is only one worker in a non-load-balanced configuration, I’m not able to reproduce the error.

On Web01 the worker.properties contains the following:

worker.list=boe1
worker.boe1.port=8009
worker.boe1.host=CMS01
worker.boe1.type=ajp13
worker.boe1.socket_keepalive=true

On Web02

worker.list=boe1
worker.boe1.port=8009
worker.boe1.host=CMS02
worker.boe1.type=ajp13
worker.boe1.socket_keepalive=true

***Now here is the configuration that is producing the kickout.

On WEB01

#The Advanced router LB
worker.list=router

#Define a worker using ajp13
worker.boe1.port=8009
worker.boe1.host=CMS01
worker.boe1.type=ajp13
worker.boe1.lbfactor=1

#preferred failover node for boe1
worker.boe1.redirect=boe2

#Define another worker using ajp13
worker.boe2.port=8009
worker.boe2.host=CMS02
worker.boe2.type=ajp13
worker.boe2.lbfactor=1
worker.boe2.socket_keepalive=1
worker.boe2.socket_timeout=60

#disable boe2 except for failover
worker.boe2.activation=disabled

#Define the LB worker
worker.router.type=lb
worker.router.balance_workers=boe1,boe2

#Define keepalive
#worker.router.socket_keepalive=1

On WEB02

#The Advanced router LB
worker.list=router

#Define a worker using ajp13
worker.boe1.port=8009
worker.boe1.host=CMS02
worker.boe1.type=ajp13
worker.boe1.lbfactor=1

#preferred failover node for boe1
worker.boe1.redirect=boe2

#Define another worker using ajp13
worker.boe2.port=8009
worker.boe2.host=CMS01
worker.boe2.type=ajp13
worker.boe2.lbfactor=1
worker.boe2.socket_keepalive=1
worker.boe2.socket_timeout=60

#disable boe2 except for failover
worker.boe2.activation=disabled

#Define the LB worker
worker.router.type=lb
worker.router.balance_workers=boe1,boe2

#Define keepalive
#worker.router.socket_keepalive=1
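One avenue I have not ruled out is session stickiness. The lb worker pins a session to a worker using the route suffix on the JSESSIONID cookie, so if the routes do not line up, requests can land on the wrong Tomcat mid-session. A sketch of the lines I may try adding (these are standard mod_jk properties, but the fix is untested in our environment):

```properties
#Untested idea: make the lb worker explicitly sticky and give each
#worker a route matching the jvmRoute in that Tomcat's server.xml
worker.router.sticky_session=1
worker.boe1.route=boe1
worker.boe2.route=boe2
```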

When this configuration is in place, we are kicked out of the application at random times, as described previously in the post. Any ideas as to why this is happening?


aheeter (BOB member since 2008-04-30)

We determined the cause of this particular issue.

When running the ad-hoc reporting tool, the application acts independently of the InfoView application. What this means is that if you are working in the ad-hoc tool for longer than the scheduled timeout for the InfoView product, then when you click back into InfoView (favorites, public folders, etc.), the app kicks you out to the logon page because the InfoView app has timed out.

Our workaround was to increase the InfoView timeout to 60 minutes and set the ad-hoc timeout to 20 minutes.
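For reference, the InfoView side of that change is the standard servlet session timeout (in minutes) in web.xml; a sketch, assuming the desktoplaunch web.xml:

```xml
<!-- sketch: raise the InfoView session timeout to 60 minutes -->
<session-config>
  <session-timeout>60</session-timeout>
</session-config>
```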


aheeter (BOB member since 2008-04-30)