The search process on BOB has been carefully monitored and tuned to avoid overloading our server. There are trade-offs, of course. Each time a post is entered or updated, the words in that post are indexed. Common words like “the” and “when” and “over” are stopped, meaning they are not included in the index and cannot be found by searching. Short words of 3 letters or fewer are also stopped. Unfortunately this means that we can’t search on version numbers, because the dot “.” is treated as a word delimiter. So the version 6.5.1 is treated as three words: 6, 5, and 1, and since all three are too short to pass the length limit, they are not included in the search index.
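To make the version number problem concrete, here is a rough sketch of the kind of filtering described above. It is not the actual phpBB code; the stopword set, the `MIN_LENGTH` value, and the function name `index_terms` are assumptions made up for illustration.

```python
import re

# Hypothetical stopword list; the board's real list is much longer.
STOPWORDS = {"the", "when", "over"}
# Assumed from the description above: words of 3 letters or fewer are dropped.
MIN_LENGTH = 4


def index_terms(text):
    """Return the words from a post that would survive into the index."""
    # Anything that is not a letter or digit, including ".", separates words.
    words = re.split(r"[^a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= MIN_LENGTH and w not in STOPWORDS]


print(index_terms("When upgrading to version 6.5.1 the prompt failed"))
# prints ['upgrading', 'version', 'prompt', 'failed']
```

The digits 6, 5, and 1 never reach the index, so a search for 6.5.1 has nothing to match against.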
Our index does a fantastic job of finding specific words, but it cannot find words in phrases since each word is indexed individually.
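A small sketch shows why a word-by-word index loses phrases. This is a simplified model, not the board's real index; the two sample posts and the function names are invented for the example.

```python
from collections import defaultdict


def build_index(posts):
    """Map each word to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in text.lower().split():
            index[word].add(post_id)
    return index


def search_all(index, query):
    """Posts containing every query word, in any order and any position."""
    sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*sets) if sets else set()


# Made-up sample posts for illustration only.
posts = {
    1: "merged dimension returns wrong totals",
    2: "dimension values merged after the totals were wrong",
}
index = build_index(posts)
matches = search_all(index, "merged dimension")
print(sorted(matches))
```

Both posts come back, even though only post 1 contains the words side by side. The index knows which posts hold each word but keeps no record of word order.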
We have looked at several different alternatives for searching, and have implemented the Google search because it was easy to install. Google indexes our board quite well, but not 100% of it. The Google search allows you to find material that you would not otherwise be able to find with our standard search. But Google can’t tell the difference between a post discussing version 6.5 and a person participating in the topic with a signature that mentions version 6.5.
The updated version of phpBB (version 3) has the same configuration for search, so an upgrade doesn’t solve our issue.
There are two additional options that have been considered in the past. First, there are 3rd party open source search engines that we can add to our site, one of which is called Sphinx Search. They have implemented this at phpbb.com with some success and some issues, so it’s not a slam-dunk as far as a fix. Second, we can investigate a standard MySQL database feature called “full text” indexing. We are going to create this index this weekend and run some load tests during the following week. If testing seems promising, we will create an input form and get a limited number of people to test it out for functionality. Depending on those results we may go forward with this alternative search. A “full text” index will allow searching on phrases and should include version numbers as well.
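For anyone curious what the full text option looks like in practice, here is a sketch of the two MySQL statements involved, built as strings in Python. The table and column names (`posts_text`, `post_text`, `post_id`) and the helper function names are assumptions for illustration, not our actual schema or code. Nothing here connects to a database, and whether short tokens such as the digits in a version number get indexed depends on server settings that the load tests would need to confirm.

```python
def create_fulltext_index_sql(table, column):
    """Build the statement that adds a full text index to one column."""
    # Table and column are assumed to be trusted, hard-coded names.
    return f"ALTER TABLE {table} ADD FULLTEXT INDEX ft_{column} ({column})"


def phrase_search_sql(table, column, phrase):
    """Build a parameterized exact-phrase query and its parameter tuple."""
    # Double quotes inside a BOOLEAN MODE search string mark an exact phrase,
    # so any quotes typed by the user are removed before wrapping.
    cleaned = phrase.replace('"', " ").strip()
    sql = (
        f"SELECT post_id FROM {table} "
        f"WHERE MATCH ({column}) AGAINST (%s IN BOOLEAN MODE)"
    )
    return sql, (f'"{cleaned}"',)


print(create_fulltext_index_sql("posts_text", "post_text"))
sql, params = phrase_search_sql("posts_text", "post_text", "merged dimension")
print(sql)
print(params)
```

The `%s` placeholder is the parameter style used by the common Python MySQL drivers, so the phrase is passed separately from the statement instead of being pasted into it.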
Full text searches come with their own built-in set of stopwords, listed here:
I assume, but have not yet verified, that we can add more words to this list in order to tune the index.
There is an announcement in the top box on our page that lets folks know that BOB will be offline next Saturday (a week from today) to rebuild our search index. That is for the standard search we have now. We’re going to drop some of the stopwords, meaning they will be indexed in the search once again instead of being dropped from your search terms. We still will not allow product names or extremely common words to be search terms. If you want to find a universe solution, limit your search to the Semantic Layer forum. If you have a Web Intelligence question, then limit your search to the proper sub-forum from Building Reports. The word “universe” appears in almost 13% of the total posts on the board. The word “report” appears in 28% of the posts on BOB. You’re going to have to come up with better search terms than that to find something useful on the board.
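The percentages above are simply the share of posts that contain a word at least once. A minimal sketch of that calculation follows; the sample posts are invented and far too few to mean anything, and `post_share` is a hypothetical helper, not how the real figures were produced.

```python
def post_share(posts, word):
    """Fraction of posts that contain `word` at least once."""
    if not posts:
        return 0.0
    word = word.lower()
    hits = sum(1 for text in posts if word in text.lower().split())
    return hits / len(posts)


# Made-up sample posts for illustration only.
sample_posts = [
    "universe design question",
    "report shows wrong totals",
    "scheduled report fails",
    "prompt syntax in a report",
]
print(f"{post_share(sample_posts, 'report'):.0%}")
# prints 75%
```

A word that shows up in a large share of posts does very little to narrow a search, which is why such words stay off the list of allowed search terms.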
If anyone has questions or concerns about this process, please feel free to post them here. Any thoughts or input are welcome as well.
If you have been doing searches and noticed that some of your words are “stopped” and would like to submit them for inclusion in our index, do please make those suggestions in this topic as well.
Dave Rathbun (BOB member since 2002-06-06)