Recently I’ve been playing around with MySQL full-text searching, including integrating it as Clagnut’s search engine.
Full-text searching was introduced in MySQL version 3.23.23 and is a doddle to use. Assuming you wish to search a table already populated with text, you first need to create a
fulltext index for the table. Let’s assume your table has this simple structure:
id | int | PRI | auto_increment | title | char(255) | | | body | text | | |
You can create a
fulltext index on the
body field at the MySQL command line with:
ALTER TABLE some_tbl ADD FULLTEXT (body)
However if you wish to search the
body fields together, you will need to create a joint index for the two fields (creating indexes for each field separately won’t work). The order in which you specify the fields of the index determines their weighting in the results. Here we say that the title is weighted more heavily than the body:
ALTER TABLE some_tbl ADD FULLTEXT (title,body)
Now your index is created (it will continue to be updated every time you perform an insert or update), your table is ready to be searched. All you need is a simple
SELECT statement like this in which we search the table for ‘search engine’:
SELECT * FROM some_tbl WHERE MATCH (title,body) AGAINST ('search engine')
This returns matching rows in order of relevance (highest first). See the MySQL full-text documentation for more details, including Boolean searches.
While MySQL searching is quick and easy to implement it does have a few drawbacks. Its definition of a word is any sequence of characters consisting of letters, digits, ‘’’, and ‘_’ over three letters long. This is generally fine, except when encountering possessives. A search for
Richard will not match a record containing the word
The three letter limit is also a problem. Presumably designed to weed out common words such as
and, its means that abbreviations such as CSS, W3C and WAI are not indexed. So a search for
CSS specifications will find all documents containing
specifications but no documents containing
I’ve left the possessives problem alone – I guess I could get my code to search additionally for a possessive of every word in the search term, but this seems like overkill.
In attempting to address the 3-letter problem I’ve ended up writing some extra functionality to supplement clagnut’s search results. In addition to returning matching posts, every search also lists related categories based the posts returned and keywords associated with categories. So, by making
WAI a keyword for the accessibility category, a search for
WAI will not yield any post results, but it will yield a category result.
I’ve also implemented full-text searching to come up with related posts. I just use a simple full-text search as described earlier, where the search term is the entire post that I’m trying to relate posts to. I do a bit of pre-processing to append all the comments to the post and strip out the HTML while retaining any URLs. The results are mostly about right – go to a blog post and look for ‘Possibly related posts’. A few other folks, including Simon Willison and Adam Kalsey have also use this technique.