Published in Brighton, UK

Clagnut

Full-text searching

Recently I’ve been playing around with MySQL full-text searching, including integrating it as Clagnut’s search engine.

Easy implementation

Full-text searching was introduced in MySQL version 3.23.23 and is a doddle to use. Assuming you wish to search a table already populated with text, you first need to create a fulltext index for the table. Let’s assume your table has this simple structure:

id    | int       | PRI | auto_increment |
title | char(255) |     |                |
body  | text      |     |                |

You can create a fulltext index on the body field at the MySQL command line with:

ALTER TABLE some_tbl ADD FULLTEXT (body)

However if you wish to search the title and body fields together, you will need to create a joint index for the two fields (creating indexes for each field separately won’t work). The order in which you specify the fields of the index determines their weighting in the results. Here we say that the title is weighted more heavily than the body:

ALTER TABLE some_tbl ADD FULLTEXT (title,body)

Now your index is created (it will continue to be updated every time you perform an insert or update), your table is ready to be searched. All you need is a simple SELECT statement like this in which we search the table for ‘search engine’:

SELECT * FROM some_tbl WHERE MATCH (title,body)
 AGAINST ('search engine')

This returns matching rows in order of relevance (highest first). See the MySQL full-text documentation for more details, including Boolean searches.

Some limitations

While MySQL searching is quick and easy to implement it does have a few drawbacks. Its definition of a word is any sequence of characters consisting of letters, digits, ‘’’, and ‘_’ over three letters long. This is generally fine, except when encountering possessives. A search for Richard will not match a record containing the word Richard’s.

The three letter limit is also a problem. Presumably designed to weed out common words such as the and and, its means that abbreviations such as CSS, W3C and WAI are not indexed. So a search for CSS specifications will find all documents containing specifications but no documents containing CSS.

I’ve left the possessives problem alone – I guess I could get my code to search additionally for a possessive of every word in the search term, but this seems like overkill.

In attempting to address the 3-letter problem I’ve ended up writing some extra functionality to supplement clagnut’s search results. In addition to returning matching posts, every search also lists related categories based the posts returned and keywords associated with categories. So, by making WAI a keyword for the accessibility category, a search for WAI will not yield any post results, but it will yield a category result.

Related posts

I’ve also implemented full-text searching to come up with related posts. I just use a simple full-text search as described earlier, where the search term is the entire post that I’m trying to relate posts to. I do a bit of pre-processing to append all the comments to the post and strip out the HTML while retaining any URLs. The results are mostly about right – go to a blog post and look for ‘Possibly related posts’. A few other folks, including Simon Willison and Adam Kalsey have also use this technique.

27 November 2003

§ Blogging · PHP/MySQL

2 comments

Next

Previous

Related posts

Keywords

Machine tags

Comments

  1. 1

    The minimum length of words to be indexed is defined by the MySQL variable ft_min_word_len . See section 4.6.8.4 SHOW VARIABLES. (This variable is only available from MySQL version 4.0.) The default value is four characters. Change it to the value you prefer, and rebuild your FULLTEXT indexes. For example, if you want three-character words to be searchable, you can set this variable by putting the following lines in an option file:

    [mysqld]
    ft_min_word_len=3

    Then restart the server and rebuild your FULLTEXT indexes.

    This is from http://www.mysql.com/doc/en/Fulltext_Fine-tuning.html

    Andrew
    3 Dec 2003
    09:59 GMT
  2. 2

    Tim Bray has released a series of essays all about full-text searching… looks good and probably worth a read. It begins with explaining the concepts, touches on Internationalization issues, and touches on some more advanced topics like Result Ranking. The way it is laid out, you can easily cherry pick the sections relevant to you.

    http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC

    Tim
    19 Dec 2003
    11:20 GMT

Add your comment

Comments are now closed on this post. If you have more to say please contact me directly.

Outside interest

Top Referrers

mobile comment