Hi,
Note: this is not a question; our problem is already solved. This post is meant to help others who may run into the same problem.
This week we had a problem on our BO XI servers (running on Windows in a cluster). One of the symptoms was the (almost infamous) “Transport error: Communication failure” message. First we searched the internet for possible causes, and we found a lot of them:
- Firewall issues
- NIC issues (especially with 2 NICs in one server, like we have)
- Routing issues
- Port-setting issues
- Typos
After checking, double checking and triple checking, everything seemed ok.
Symptoms:
1: When trying to log in to the CMC using :6400, we got the “Transport error: Communication failure” error.
2: When trying to connect to @:6400, we got the following error: “Unable to connect to cluster @:6400 to retrieve CMS member list. Locally cached member list not present. Logon cannot continue.”
3: In some situations (we don’t know exactly which) we got an error saying that an (old) BO server wasn’t reachable. That was correct: the server was ‘gone’ / turned off. What we didn’t understand is why the cluster didn’t just ignore the (dead) server…
4: In some situations (we don’t know exactly which) we got an error saying that an existing server was reachable but had no CMS running. That was not right: the server was indeed running, but so was the CMS, and telnet to port 6400 worked.
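For anyone who wants to repeat the telnet check from symptom 4 against several servers at once, here is a minimal Python sketch. The function names and the hostnames in the comment are only examples I made up for illustration, and 6400 is simply the port we were using. Keep in mind that a successful connection only tells you something is listening on the port; in our case the port was reachable while the cluster information was still broken.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # Roughly what "telnet host port" checks: can we open a TCP connection?
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and hostnames that do not resolve.
        return False


def check_cluster(hosts, port: int = 6400) -> dict:
    """Check the same port on every host and return a {host: reachable} mapping."""
    return {host: port_is_open(host, port) for host in hosts}


# Example call (hypothetical hostnames, replace with your own servers):
#   check_cluster(["bo1", "bo5"])
```

If a server you have already removed shows up as unreachable here, that by itself is expected; the interesting question is why the cluster still knows about it.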
This is how we got to our problem:
We have a cluster of BO servers. We started with one server (not really a cluster) and later added 2 more BO servers (for now I will call them bo1, bo2 and bo3).
After a while, we removed bo3 from the cluster (formatted / reinstalled) because we needed it for another job.
Then, after some more time, we had a spare server and turned that one into a BO server (bo4).
The servers have the following hardware:
bo1 - octacore
bo2 - dualcore
bo3 (at this point assigned to another task) - octacore
bo4 - dualcore
Then we discovered that the 2 dualcores were degrading the performance of the bo1 (octacore) server (because of how the cluster load balancing worked).
Also, bo3 (octacore) was available again. So we turned off bo2 and bo4, and formatted and reinstalled Windows on bo3 once more. It got its old hostname and BO name back and was added to the cluster again.
Almost immediately after this, we got reports from users claiming their BO reports contained no data. After some troubleshooting we found out that the (renewed) bo3 was responsible for the reports with no data. Even after checking all settings, we couldn’t solve the problem.
The only possible cause we could think of was that bo3 (and the name bo3) had existed in the cluster before.
So again, we formatted and reinstalled the server, this time naming it bo5 (both DNS name and BO name), and added it to the cluster.
Now, for no apparent reason, both remaining servers (bo1 and bo5) failed to work, with the symptoms described earlier in this post.
After contacting BO support, it turned out our cluster information was the problem (it was more or less corrupted). With help from BO support we ‘cleaned’ our cluster info (removed old / unnecessary entries) using the ‘-serverconsole’ option and voilà, our cluster came back to life.
So, as you see, this post is not a question (our problem is already solved), but because this board has helped us in the past and has given us a lot of suggestions, I wanted to give something back by sharing this information.
Michiel.Korthuis (BOB member since 2009-09-18)