¶ Recently I’ve been playing around with MySQL full-text searching, including integrating it as Clagnut’s search engine.
Easy implementation
Full-text searching was introduced in MySQL version 3.23.23 and is a doddle to use. Assuming you wish to search a table already populated with text, you first need to create a fulltext index for the table. Let’s assume your table has this simple structure:
id | int | PRI | auto_increment |
title | char(255) | | |
body | text | | |
You can create a fulltext index on the body field at the MySQL command line with:
ALTER TABLE some_tbl ADD FULLTEXT (body)
However if you wish to search the title and body fields together, you will need to create a joint index for the two fields (creating indexes for each field separately won’t work). The order in which you specify the fields of the index determines their weighting in the results. Here we say that the title is weighted more heavily than the body:
ALTER TABLE some_tbl ADD FULLTEXT (title,body)
Now your index is created (it will continue to be updated every time you perform an insert or update), your table is ready to be searched. All you need is a simple SELECT statement like this in which we search the table for ‘search engine’:
SELECT * FROM some_tbl WHERE MATCH (title,body)
AGAINST ('search engine')
This returns matching rows in order of relevance (highest first). See the MySQL full-text documentation for more details, including Boolean searches.
Some limitations
While MySQL searching is quick and easy to implement it does have a few drawbacks. Its definition of a word is any sequence of characters consisting of letters, digits, ‘’’, and ‘_’ over three letters long. This is generally fine, except when encountering possessives. A search for Richard
will not match a record containing the word Richard’s
.
The three letter limit is also a problem. Presumably designed to weed out common words such as the
and and
, its means that abbreviations such as CSS, W3C and WAI are not indexed. So a search for CSS specifications
will find all documents containing specifications
but no documents containing CSS
.
I’ve left the possessives problem alone – I guess I could get my code to search additionally for a possessive of every word in the search term, but this seems like overkill.
In attempting to address the 3-letter problem I’ve ended up writing some extra functionality to supplement clagnut’s search results. In addition to returning matching posts, every search also lists related categories based the posts returned and keywords associated with categories. So, by making WAI
a keyword for the accessibility category, a search for WAI
will not yield any post results, but it will yield a category result.
Related posts
I’ve also implemented full-text searching to come up with related posts. I just use a simple full-text search as described earlier, where the search term is the entire post that I’m trying to relate posts to. I do a bit of pre-processing to append all the comments to the post and strip out the HTML while retaining any URLs. The results are mostly about right – go to a blog post and look for ‘Possibly related posts’. A few other folks, including Simon Willison and Adam Kalsey have also use this technique.




Comments
1
The minimum length of words to be indexed is defined by the MySQL variable
ft_min_word_len. See section 4.6.8.4 SHOW VARIABLES. (This variable is only available from MySQL version 4.0.) The default value is four characters. Change it to the value you prefer, and rebuild yourFULLTEXTindexes. For example, if you want three-character words to be searchable, you can set this variable by putting the following lines in an option file:[mysqld]ft_min_word_len=3
Then restart the server and rebuild your
FULLTEXTindexes.This is from http://www.mysql.com/doc/en/Fulltext_Fine-tuning.html
2
Tim Bray has released a series of essays all about full-text searching… looks good and probably worth a read. It begins with explaining the concepts, touches on Internationalization issues, and touches on some more advanced topics like Result Ranking. The way it is laid out, you can easily cherry pick the sections relevant to you.
http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC
Add your comment
Comments are now closed on this post. If you have more to say please contact me directly.