Using Fulltext Indexes in MySQL - Part 2, Boolean searches

Wednesday Feb 19th 2003 by Ian Gilfillan
Share:

Part 1 of this article looked at the fulltext index, and how to search on it using an ordinary MATCH() AGAINST(). Even more powerful, (although only available on the newer MySQL version 4), is the ability to do a boolean search. Part 2 of this article examines the possibilities.

Part 1 of this article looked at the fulltext index, and how to search on it using an ordinary MATCH() AGAINST(). Even more powerful, (although only available on the newer MySQL version 4), is the ability to do a boolean search. Part 2 of this article examines the possibilities.

You'll use the same table you used in Part 1. The full list of records is:

mysql> SELECT * FROM fulltext_sample;
+---------------------------+
| copy                      |
+---------------------------+
| It appears good from here |
| The here and the past     |
| Why are we here           |
| An all-out alert          |
| All you need is love      |
| A good alert              |
+---------------------------+
To perform a boolean search, the IN BOOLEAN MODE modifier is used. The following query demonstrates the new syntax.

mysql> SELECT * FROM fulltext_sample 
WHERE MATCH (copy) AGAINST ('love' IN BOOLEAN MODE);
+----------------------+
| copy                 |
+----------------------+
| All you need is love |
+----------------------+

An important difference between ordinary MATCH() AGAINST() queries and IN BOOLEAN MODE searches is that the latter does not apply the 50% limit (so the word can appear in more than half the rows).

There are a number of special operators you'll need to know in order to make the most of the boolean search. If no symbol is specified, the appearance of this word causes the relevance of the row to be higher, similar to an ordinary MATCH() AGAINST().

Boolean Search Operators

+The word is mandatory in all rows returned.
-The word cannot appear in any row returned.
<The word that follows has a lower relevance than other words, although rows containing it will still match
>The word that follows has a higher relevance than other words.
()Used to group words into subexpressions.
~The word following contributes negatively to the relevance of the row (which is different to the '-' operator, which specifically excludes the word, or the '<' operator, which still causes the word to contribute positively to the relevance of the row.
*The wildcard, indicating zero or more characters. It can only appear at the end of a word.
"Anything enclosed in the double quotes is taken as a whole (so you can match phrases, for example).

Examples

Let's see some of this action. The first example returns all rows containing the word 'here', but not the word 'past'.
mysql> SELECT copy FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('+here -past' IN BOOLEAN MODE);
+---------------------------+
| copy                      |
+---------------------------+
| It appears good from here |
| Why are we here           |
+---------------------------+
Compare the difference between the next two examples:
mysql> SELECT copy FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('here past' IN BOOLEAN MODE);
+---------------------------+
| copy                      |
+---------------------------+
| It appears good from here |
| The here and the past     |
| Why are we here           |
+---------------------------+

mysql> SELECT copy FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('here +past'IN BOOLEAN MODE);
+-----------------------+
| copy                  |
+-----------------------+
| The here and the past |
+-----------------------+

The first works in a similar way to an ordinary MATCH() AGAINST(), albeit without the 50% threshold. The word 'past' is mandatory in the second search (by default a word is optional), so the other two records are not returned.

The next example demonstrates a common mistake.

mysql> SELECT copy FROM fulltext_sample 
WHERE MATCH(copy) AGAINST ('+are here' IN BOOLEAN MODE);
+---------------------------+
| copy                      |
+---------------------------+
| It appears good from here |
| The here and the past     |
| Why are we here           |
+---------------------------+

The results may appear surprising compared the previous example, but since 'are' contains three or less letters, it is excluded for purposes of the search and is not mandatory.

The next two examples demonstrate a powerful enhancement for searching purposes:

mysql> SELECT copy FROM fulltext_sample 
WHERE MATCH(copy) AGAINST ('aler' IN BOOLEAN MODE);
Empty set (0.01 sec)

mysql> SELECT copy FROM fulltext_sample 
WHERE MATCH(copy) AGAINST ('aler*' IN BOOLEAN MODE);
+------------------+
| copy             |
+------------------+
| An all-out alert |
| A good alert     |
+------------------+

By default only whole words are matched, unless the '*' operator is used.

The '<' and '>' symbols are less commonly used, but they allow a great degree of control for relevance. In the following examples, we return the relevance indicator to demonstrate the difference between the queries.

mysql> SELECT copy, MATCH(copy) 
AGAINST ('appears good alert' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('appears good alert' IN BOOLEAN MODE);
+---------------------------+------+
| copy                      | m    |
+---------------------------+------+
| It appears good from here |    2 |
| An all-out alert          |    1 |
| A good alert              |    2 |
+---------------------------+------+

mysql> SELECT copy, MATCH(copy) 
AGAINST ('appears <good alert' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('appears <good alert' IN BOOLEAN MODE);
+---------------------------+------------------+
| copy                      | m                |
+---------------------------+------------------+
| It appears good from here |  1.6666667461395 |
| An all-out alert          |                1 |
| A good alert              |  1.6666667461395 |
+---------------------------+------------------+

The '<' operator decreases the relevance of the word 'good', in this case by approximately 0.33.

mysql> SELECT copy, MATCH(copy) 
AGAINST ('appears good >alert' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('appears good >alert' IN BOOLEAN MODE);
+---------------------------+------+
| copy                      | m    |
+---------------------------+------+
| It appears good from here |    2 |
| An all-out alert          |  1.5 |
| A good alert              |  2.5 |
+---------------------------+------+

The '>' operator increases the relevance of the word 'alert', in this case by 0.5.

The parentheses group words into a subexpression. In the following example, the '+' symbol applies to the group of words, so that at least one of 'appears' or 'past' must appear.

mysql> SELECT copy, MATCH(copy) 
AGAINST ('+(appears past)' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('(appears past)' IN BOOLEAN MODE);
+---------------------------+------+
| copy                      | m    |
+---------------------------+------+
| It appears good from here |    1 |
| The here and the past     |    1 |
+---------------------------+------+

You can also apply the operators to words in the subexpression, as follows:

mysql> SELECT copy, MATCH(copy) 
AGAINST ('+(>appears <past)' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('+(>appears <past)' IN BOOLEAN MODE);
+---------------------------+------------------+
| copy                      | m                |
+---------------------------+------------------+
| It appears good from here |              1.5 |
| The here and the past     | 0.66666668653488 |
+---------------------------+------------------+
The '~' operator contributes negatively to the relevance, but does not bar the word from appearing, as the '-' operator does. It also does not do the same as the '<' operator, which is to make a reduced, yet still positive, difference to the relevance. These subtle difference allows for powerfully tuned searches, but can cause confusion. Look at the difference between the following. First, the word 'here' is compulsory, and the word 'past' optional:
mysql> SELECT copy, MATCH(copy) 
AGAINST ('+here past' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('+here past' IN BOOLEAN MODE);
+---------------------------+-----------------+
| copy                      | m               |
+---------------------------+-----------------+
| It appears good from here |               1 |
| The here and the past     | 1.3333333730698 |
| Why are we here           |               1 |
+---------------------------+-----------------+
Next, a reduced relevance for the word 'past'. The same results are returned, but the record with both words ('The here and the past') has less weighting, though still more than the other records.
mysql> SELECT copy, MATCH(copy) 
AGAINST ('+here <past' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('+here <past' IN BOOLEAN MODE);
+---------------------------+-----------------+
| copy                      | m               |
+---------------------------+-----------------+
| It appears good from here |               1 |
| The here and the past     | 1.2222222089767 |
| Why are we here           |               1 |
+---------------------------+-----------------+
Take note of the relevance in the next example:
mysql> SELECT copy, MATCH(copy) 
AGAINST ('+here ~past' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('+here ~past' IN BOOLEAN MODE);
+---------------------------+------------------+
| copy                      | m                |
+---------------------------+------------------+
| It appears good from here |                1 |
| The here and the past     | 0.83333331346512 |
| Why are we here           |                1 |
+---------------------------+------------------+
The important difference is that the relevance is negatively affected, and is now lower than the other two records. This means if you asked for only '~past', you'd get nothing back, as the relevance would be lower than 0 (and therefore not returned)
mysql> SELECT copy, MATCH(copy) 
AGAINST ('~past' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('~past' IN BOOLEAN MODE);
Empty set (0.00 sec)
Finally, '-' simply removes the second record from the result set, as we've seen before.
mysql> SELECT copy, MATCH(copy) 
AGAINST ('+here -past' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('+here -past' IN BOOLEAN MODE);
+---------------------------+------+
| copy                      | m    |
+---------------------------+------+
| It appears good from here |    1 |
| Why are we here           |    1 |
+---------------------------+------+
The final operator, the double quotes ("), is useful to group phrases together. So, if you were interested in returning 'good from here' but none of those words if they did not appear in the phrase, you'd use:
mysql> SELECT copy, MATCH(copy) 
AGAINST ('"good from here"' IN BOOLEAN MODE) 
AS m FROM fulltext_sample WHERE MATCH(copy) 
AGAINST ('"good from here"' IN BOOLEAN MODE);
+---------------------------+------+
| copy                      | m    |
+---------------------------+------+
| It appears good from here |    1 |
+---------------------------+------+

For many applications, boolean searches are one of MySQL 4's most useful features. With careful use, you can save yourself substantial of development time, and add many useful enhancements. Good luck!

» See All Articles by Columnist Ian Gilfillan

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved