Data searches that leverage existing indexes

Monday Dec 28th 2009 by Marcin Policht

Recent installments of our SQL Server 2005 Express Edition series have been discussing its implementation of Full Text Indexing. This article focuses on data searches, which leverage existing indexes, taking into account such features as noise words and thesaurus files.

In the recent installments of our series dedicated to the most prominent features of SQL Server 2005 Express Edition, we have been discussing its implementation of Full Text Indexing. In the course of our discussion, we have covered its basic characteristics and stepped through its implementation, describing tasks required to create a file-system based catalog, as well as to define and populate full-text indexes. In this article, we will focus on data searches, which leverage existing indexes, taking into account such features as noise words and thesaurus files.

As a reminder, our sample index was created on Production.Document table in the AdventureWorks database. It covers DocumentSummary and Document columns (respectively of nvarchar(max) and varbinary(max) data types), with the FileExtension column defining the data format of the latter and PK_Document_DocumentID serving as the underlying unique, single column, non-null key. We also explicitly set the LANGUAGE parameter to 1033 (US English) and enabled AUTO updates with CHANGE_TRACKING. As the result, the full index population should be automatically taking place (which you can verify by examining a value of its crawl_end_date in sys.fulltext_indexes system view). With these constructs in place, let's examine the types of full text index-based queries we can execute against both columns.

In order to facilitate functionality not available in standard character-based queries, Full-Text Indexing technology introduces a set of new syntax structures, which consist of FREETEXT, CONTAINS, FREETEXTTABLE and CONTAINSTABLE. The first two are predicates returning a single boolean value (TRUE or FALSE), indicating whether a row includes a term being searched for and are typically used as part of a SELECT statement to provide a listing of such rows. The second pair represents functions that rely on the same search criteria, but yield an outcome consisting of an identifier of each returned row along with its rating (which designates a degree to which criteria have been met).To illustrate these concepts, let's run a series of equivalent searches against the same dataset using each of these constructs. To make our results easier to analyze, we will target the DocumentSummary column, whose content you can display by executing the SELECT DocumentSummary FROM Product.Document statement.

FREETEXT is intended for queries of full-text indexed character-based, varbinary(max), image, and xml based columns (which makes it suitable for searches of both Document and DocumentSummary). Its syntax is relatively straightforward and consists of three parameters. The first one designates a target (one or more columns, identified either via their names or through a wildcard notation). The second specifies a search string (of nvarchar data type), and the third, optional one allows you to identify a specific language, whose rules should be applied when carrying out the query (if this parameter is not explicitly stated, the language assigned to the full-text indexed column is used instead). This approach automatically takes advantage of logic incorporated into full-text indexing, such as the use of inflections (yielding approximate, rather than exact, matches - unless a search string is enclosed in double quotes) as well as thesaurus-based substitutions and replacements. To demonstrate this behavior, we could add an <expansion> element to the tsENU.xml thesaurus file that includes <sub> elements for bike and bicycle, effectively making it possible to carry out a query for each DocumentDescription entry that contains either one of them by executing the following statement:

SELECT DocumentSummary
	FROM Production.Document
	WHERE FREETEXT (DocumentSummary, N'bike', LANGUAGE 1033)

Another common query involves searching for entries, which contain any of the words in the phrase you specify (for example, with 'repair components' as the search string, you should be able to locate all rows with 'repair' or 'component' (including their inflections).

The FREETEXTTABLE function is syntactically and functionally similar to FREETEXT (it also is subject to the same search criteria), however instead of evaluating to a boolean TRUE or FALSE, it returns a table consisting of two columns. The first one identifies the KEY (a unique, non-null, single column that a full-text index references) for each row where an approximate match to a search string has been found, and the other stores an integer RANK indicating degree of similarity. The function expects several parameters, in particular, the name of a table where indexed columns reside, column names you want to query, and a search string. It also gives you an option to designate a desired language (affecting inflections and thesaurus-based expansions and replacements) and set the maximum number of returned rows. For example, the following query will yield all rows in the DocumentSummary column, which contain the word bike (including its inflections and synonyms defined in the thesaurus file), along with their respective rank and key values. (If you want to limit the result set to an arbitrary number of matches only and automatically order them according to their ranking, specify that number as the last parameter of the FREETEXTTABLE function):

SELECT f.*, d.DocumentSummary
	FROM FREETEXTTABLE(Production.Document, DocumentSummary, N'bike', LANGUAGE 1033) f
	INNER JOIN Production.Document d	
		ON f.[KEY] = d.DocumentID

The process used to calculate ranking for a query result set is relatively involved. This results, at least to some extent, from a hierarchical index structure, where statistics are evaluated on each level when forming higher-level indexes though merging of its subordinates. Such process is not deterministic, which implies that search results might vary as indexes change, but in general, rank calculations take into consideration (besides the hit count) such criteria as language used for tokenization, total number of rows indexed, column sizes, or the largest offset of the indexed term in a row. (For more details regarding this subject, refer to Understanding Ranking article in the SQL Server 2005 Books Online). In its simplistic form, ranking can be demonstrated by invoking the FREETEXTTABLE function targeting words that appear in indexed rows with different frequency. For example, we will get distinctly different results when querying the DocumentSummary column for the presence of seats, reflectors, and pedals terms (since their respective hit counts are 1, 2, and 3) by running the following:

SELECT f.*, d.DocumentSummary
	FROM FREETEXTTABLE(Production.Document, DocumentSummary, 
		N'seats pedals reflectors', LANGUAGE 1033) f
	INNER JOIN Production.Document d	
		ON f.[KEY] = d.DocumentID

In a similar manner, we could target the Document column of varbinary(max) data type, since it has been included in our full-text index. This would involve simply providing its name along with desired search criteria as part of the FREETEXTTABLE function. For example, we could apply the equivalent query by specifying FREETEXTTABLE(Production.Document, Document, N'seats pedals reflectors', LANGUAGE 1033) f as part of the FROM clause (of course, the resulting ranking would be different, reflecting relevance-based ordering of seats, pedals, and reflectors terms in Word documents stored in the Document column).

FREETEXT and FREETEXTTABLE are relatively easy to use, but they also lack flexibility, which might be needed in more demanding scenarios. This gap is filled by the CONTAINS predicate and the corresponding CONTAINSTABLE function. In the next article of this series, we will examine their characteristics and provide examples demonstrating their use.

» See All Articles by Columnist Marcin Policht

Mobile Site | Full Site