New Search Options Under Consideration

The search process on BOB has been carefully monitored and tuned to avoid overloading our server. There are trade-offs, of course. Each time a post is entered or updated, the words in that post are indexed. Common words like “the” and “when” and “over” are stopped, meaning they are not included in the index and cannot be found by searching. Short words of 3 letters or fewer are also stopped. Unfortunately this means that we can’t search on version numbers, because the dot “.” is treated as a word delimiter. So the version 6.5.1 is treated as three words: 6, 5, and 1, and since each of those is a single character, none of them is included in the search index.
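
To make the version-number problem concrete, here is a rough sketch of that kind of word filtering. It is not the board's actual indexing code: the stopword set is a tiny hypothetical sample, and the names `STOPWORDS`, `MIN_LENGTH`, and `index_terms` are made up for illustration.

```python
import re

# Hypothetical sample only; the board's real stopword list is much longer.
STOPWORDS = {"the", "when", "over"}
# Words of 3 letters or fewer are dropped, so the shortest indexed word has 4.
MIN_LENGTH = 4


def index_terms(text):
    """Return the words from `text` that would survive into the index."""
    # Any run of characters other than letters and digits separates words,
    # so the dots in "6.5.1" split it into "6", "5" and "1".
    words = re.split(r"[^a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= MIN_LENGTH and w not in STOPWORDS]


print(index_terms("Upgrade to version 6.5.1 failed over the weekend"))
# prints ['upgrade', 'version', 'failed', 'weekend']
```

Nothing from "6.5.1" reaches the index, so a search on the version number has nothing to match.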

Our index does a fantastic job of finding specific words, but it cannot find exact phrases, since each word is indexed individually.
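
A simplified illustration of why a word-by-word index cannot answer phrase searches follows. This is an assumption about the general technique (a plain inverted index), not the actual phpBB schema, and the posts and function names are invented.

```python
from collections import defaultdict


def build_index(posts):
    """Map each word to the set of post ids containing it (no positions kept)."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in text.lower().split():
            index[word].add(post_id)
    return index


def search_all_words(index, query):
    """Return the sorted ids of posts containing every query word, in any order."""
    sets = [index.get(word, set()) for word in query.lower().split()]
    return sorted(set.intersection(*sets)) if sets else []


# Invented sample posts.
posts = {
    1: "change the prompt order in the report",
    2: "order the report by the first prompt",
}
index = build_index(posts)
print(search_all_words(index, "prompt order"))  # prints [1, 2]
```

Both posts match, even though only the first contains the words side by side. The index records which posts hold each word, not where the word sits in the post.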

We have looked at several different alternatives for searching, and have implemented the Google search because it was easy to install. Google indexes our board quite well, but not completely. The Google search allows you to find material that you would not otherwise be able to find with our standard search. But Google can’t tell the difference between a post discussing version 6.5 and a person participating in the topic with a signature that mentions version 6.5.

The updated version of phpBB (version 3) has the same configuration for search, so an upgrade doesn’t solve our issue.

There are two additional options that have been considered in the past. First, there are 3rd party open source search engines that we can add to our site, one of which is called Sphinx Search. They have implemented this at phpbb.com with some success and some issues, so it’s not a slam-dunk as far as a fix. Second, we can investigate a standard MySQL database feature called “full text” indexing. We are going to create this index this weekend and run some load tests during the following week. If testing seems promising, we will create an input form and get a limited number of people to test it out for functionality. Depending on those results we may go forward with this alternative search. A “full text” index will allow searching on phrases and should include version numbers as well.

Full text searches come with their own built-in set of stopwords, listed here:

I assume but have not verified yet that we can add more words to this list in order to tune the index.

There is an announcement in the top box on our page that lets folks know that BOB will be offline next Saturday (a week from today) to rebuild our search index. That is for the standard search we have now. We’re going to drop some of the stopwords, meaning they will be indexed in the search once again instead of being dropped from your search terms. We still will not allow product names or extremely common words to be search terms. If you want to find a universe solution, limit your search to the Semantic Layer forum. If you have a Web Intelligence question, then limit your search to the proper sub-forum from Building Reports. The word “universe” appears in almost 13% of the total posts on the board. The word “report” appears in 28% of the posts on BOB. You’re going to have to come up with better search terms than that to find something useful on the board. :slight_smile:
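
For anyone curious how percentages like those can be worked out, here is a minimal sketch. The sample posts and the 25% cut-off are invented for illustration and are not the board's real data or threshold.

```python
from collections import Counter


def common_words(posts, threshold):
    """Return {word: share of posts} for words found in more than `threshold` of posts."""
    counts = Counter()
    for text in posts:
        # Use a set so a word is counted once per post, however often it repeats.
        counts.update(set(text.lower().split()))
    total = len(posts)
    return {word: n / total for word, n in counts.items() if n / total > threshold}


# Invented sample posts.
posts = [
    "report shows wrong totals",
    "universe join problem",
    "schedule report to email",
    "report prompt order",
    "universe contexts and aliases",
]
print(common_words(posts, 0.25))
```

Here "report" is in 3 of the 5 sample posts and "universe" in 2 of 5, so both clear the cut-off. A word that matches that large a share of posts does very little to narrow a search.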

If anyone has questions or concerns about this process, please feel free to post them here. Any thoughts or input are welcome as well.

If you have been doing searches and noticed that some of your words are “stopped” and would like to submit them for inclusion in our index, do please make those suggestions in this topic as well.


Dave Rathbun :us: (BOB member since 2002-06-06)

Further research:

A full text index can be made on a combination of fields. We can make one that indexes the post text, the post subject, or a combination of the two.
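
As a sketch of what those combinations might look like, the helper below only builds the SQL text and does not connect to a database. The table name `posts_text`, the column names, and the index names are placeholders rather than our real schema, and `%s` stands for a value the database driver would bind at run time.

```python
def fulltext_index_sql(table, columns, index_name):
    """Build the MySQL statement that adds a FULLTEXT index over `columns`."""
    # Identifiers are interpolated directly, so pass trusted constants only.
    return f"ALTER TABLE {table} ADD FULLTEXT INDEX {index_name} ({', '.join(columns)})"


def fulltext_search_sql(table, columns):
    """Build a search query; MATCH should list the same columns as the index."""
    return f"SELECT post_id FROM {table} WHERE MATCH ({', '.join(columns)}) AGAINST (%s)"


# Hypothetical table and column names, one entry per combination.
options = {
    "ft_text": ["post_text"],
    "ft_subject": ["post_subject"],
    "ft_both": ["post_subject", "post_text"],
}
for name, cols in options.items():
    print(fulltext_index_sql("posts_text", cols, name))
print(fulltext_search_sql("posts_text", options["ft_both"]))
```

The column list in `MATCH` is meant to line up with the column list of one of the indexes, which is why the choice of combination matters up front.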

Creating the index locks the posts table for writing. We’re experimenting with the indexing process today, so if you are reading the board it should be fine, but if you’re posting there may be times when your post times out due to the table lock.

MySQL full text has an even longer “short” word length than phpBB does. By default, any word four letters or less is ignored. This can be altered, but at the expense of having a much larger index and therefore slower searches.

By default the MySQL full text results are sorted by a relevance engine. That means that if the algorithm determines an older post is more relevant than a newer post, it will be listed first. This is a good thing. :slight_smile:
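
To show what sorting by relevance rather than by date means, here is a toy ranking that uses a simple TF-IDF score. This is a stand-in for illustration only and is not MySQL's actual relevance formula; the posts and ids are invented, with lower ids standing for older posts.

```python
import math


def rank_by_relevance(query, posts):
    """Return ids of matching posts, highest toy TF-IDF score first."""
    tokenized = {pid: text.lower().split() for pid, text in posts.items()}
    total = len(tokenized)
    scores = {}
    for pid, words in tokenized.items():
        score = 0.0
        for term in query.lower().split():
            containing = sum(1 for w in tokenized.values() if term in w)
            if containing == 0 or not words:
                continue
            tf = words.count(term) / len(words)           # weight of the term in this post
            idf = math.log(total / containing) + 1.0      # rarer terms count for more
            score += tf * idf
        scores[pid] = score
    matches = [pid for pid, score in scores.items() if score > 0]
    return sorted(matches, key=lambda pid: scores[pid], reverse=True)


# Invented sample posts; lower id means older post.
posts = {
    101: "aggregate awareness aggregate tables explained",
    205: "question about aggregate functions in a report",
    310: "printing problem",
}
print(rank_by_relevance("aggregate", posts))  # prints [101, 205]
```

The older post 101 comes first because the search term makes up a larger share of its text.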


Dave Rathbun :us: (BOB member since 2002-06-06)

If you can alter the default StopWords list for MySQL, I would suggest removing “everyone” from the list, as I may want to narrow my search results to topics discussing the Everyone group in the CMC forum.


MichaelWelter :vatican_city: (BOB member since 2002-08-08)

From what I have read so far, stopwords are not used during an exact phrase search. So if you include “everyone group” in quotes in your search it should still work.


Dave Rathbun :us: (BOB member since 2002-06-06)

Oh, that would be awesome.


MichaelWelter :vatican_city: (BOB member since 2002-08-08)