Clagnut

Menu

Full-text searching

§ Blogging · PHP/MySQL

Recently I’ve been playing around with MySQL full-text searching, including integrating it as Clagnut’s search engine.

Easy implementation

Full-text searching was introduced in MySQL version 3.23.23 and is a doddle to use. Assuming you wish to search a table already populated with text, you first need to create a fulltext index for the table. Let’s assume your table has this simple structure:

id    | int       | PRI | auto_increment |
title | char(255) |     |                |
body  | text      |     |                |

You can create a fulltext index on the body field at the MySQL command line with:

ALTER TABLE some_tbl ADD FULLTEXT (body)

However if you wish to search the title and body fields together, you will need to create a joint index for the two fields (creating indexes for each field separately won’t work). The order in which you specify the fields of the index determines their weighting in the results. Here we say that the title is weighted more heavily than the body:

ALTER TABLE some_tbl ADD FULLTEXT (title,body)

Now your index is created (it will continue to be updated every time you perform an insert or update), your table is ready to be searched. All you need is a simple SELECT statement like this in which we search the table for ‘search engine’:

SELECT * FROM some_tbl WHERE MATCH (title,body)
 AGAINST ('search engine')

This returns matching rows in order of relevance (highest first). See the MySQL full-text documentation for more details, including Boolean searches.

Some limitations

While MySQL searching is quick and easy to implement it does have a few drawbacks. Its definition of a word is any sequence of characters consisting of letters, digits, ‘’’, and ‘_’ over three letters long. This is generally fine, except when encountering possessives. A search for Richard will not match a record containing the word Richard’s.

The three letter limit is also a problem. Presumably designed to weed out common words such as the and and, its means that abbreviations such as CSS, W3C and WAI are not indexed. So a search for CSS specifications will find all documents containing specifications but no documents containing CSS.

I’ve left the possessives problem alone – I guess I could get my code to search additionally for a possessive of every word in the search term, but this seems like overkill.

In attempting to address the 3-letter problem I’ve ended up writing some extra functionality to supplement clagnut’s search results. In addition to returning matching posts, every search also lists related categories based the posts returned and keywords associated with categories. So, by making WAI a keyword for the accessibility category, a search for WAI will not yield any post results, but it will yield a category result.

Related posts

I’ve also implemented full-text searching to come up with related posts. I just use a simple full-text search as described earlier, where the search term is the entire post that I’m trying to relate posts to. I do a bit of pre-processing to append all the comments to the post and strip out the HTML while retaining any URLs. The results are mostly about right – go to a blog post and look for ‘Possibly related posts’. A few other folks, including Simon Willison and Adam Kalsey have also use this technique.