Full Text Index Management in SQL Server 2005 Express Edition

Monday Dec 14th 2009 by Marcin Policht

Recent installments of our series dedicated to the most important features of SQL Server 2005 Express Edition have focused on its Full Text Indexing capabilities. This article covers other full text indexing management methods.

In the recent installment of our series dedicated to the most important features of SQL Server 2005 Express Edition, we have been focusing on its Full Text Indexing capabilities. So far, we have covered its basic characteristics as well as described the steps involved in implementing file-based catalog and the indexes hosted by it (the latter was accomplished by leveraging CREATE FULLTEXT INDEX and ALTER FULLTEXT INDEX statements). In this article, we will cover other full text index management methods, leaving discussion about a unique syntax of index-based searches until our next encounter.

SQL Server 2005 Express Edition provides an alternative approach to creating or modifying full text indexes from the one we have presented earlier, which involves the use of sp_fulltext_table stored procedure. (You should, however, keep in mind that this stored procedure is being deprecated in favor of CREATE, ALTER, and DROP FULLTEXT INDEX constructs), whose behavior is controlled by its @action parameter (the other three identify, respectively, names of the indexed table, full-text catalog hosting it, and a unique, non-null column on which the full-text index is based). Depending on its value, you can use it to:

  • create metadata for a new full-text index (with @action='Create'). Note that in this case, all parameters are mandatory. It is important to note that no index-based searches can be conducted until you add a column or columns to be indexed and populate the index. The first of these tasks is typically accomplished by invoking sp_fulltext_column stored procedure, which takes as its arguments the names of a target table (@tabname), a column to be indexed (@colname), the desired action to be carried out ( @action set to 'add'), language code (@language), and a column identifying the document type when referencing image or varbinary(max) data types. (For example, in our case the resulting syntax would take the form sp_fulltext_column @tabname='Production.Document', @colname='Document', @action='add', @language='1033', @type_colname='FileExtension'). In order to determine the list of data formats (and their version) supported by your SQL Server installation, examine the output of SELECT * FROM sys.fulltext_document_types query. Carrying out the second task can be done by taking advantage of other action types of the sp_fulltext_table stored procedure.
  • drop an existing full-text index (with @action='Drop'). The only required parameter is the table name (since there is a limit of a single full-text index per table).
  • activate or deactivate an existing full text index (respectively, with @action='Activate' and @action='Deactivate'). The current state can be viewed by examining the is_enabled column in the sys.fulltext_indexes system view. Activation is typically not necessary, since it takes place automatically when the first column is added to an index, however, it might be needed if that index has been previously deactivated. Deactivation allows you to prevent an index from being used without the need for dropping it.
  • start or stop change tracking (respectively, with @action='start_change_tracking' and @action='stop_change_tracking'). The former sets the change tracking state to MANUAL and keeps tab of all future changes to both indexed and key columns. The latter, deletes the record of these changes and resets the change tracking state to OFF. Change tracking is beneficial when indexed data is moderately dynamic.
  • update an index (with @action='update_index'). The update is triggered immediately following invocation of the stored procedure and takes advantage of the change tracking mechanism (which must be enabled to successfully execute this statement).
  • start or stop background update (respectively, with @action='start_background_updateindex' and @action='stop_background_updateindex'). This initiates or terminates change tracking and background index population. The first of these options sets the change tracking to AUTO, while the latter, reverts it to MANUAL.
  • start or stop full or incremental population of an index (respectively with @action='start_full', @action='start_incremental', and @action='stopl'), without regard for change tracking mechanism. Note that the ability to initiate incremental population is dependent on the presence of a timestamp column in the indexed table. Full population (with change tracking disabled) is typically a preferred mechanism if changes to indexed columns are frequent and widespread. On the other hand, an incremental approach (without change tracking) is recommended when the volume of modifications is low.

As we have mentioned earlier, the algorithm that is used to populate full-text indexes takes into consideration additional unique factors, not included in the standard SQL Server indexing methodology. Among them, the one that most profoundly impacts the efficiency of searches involves the use of noise words. This term designates common, primarily supportive words (such as pronouns, prepositions, conjunctions, or interjections), which tend to occur so frequently and without particular distinguishing significance that their indexing would be essentially meaningless, contributing only to bloated catalog size and decreased performance. Noise words are language specific and are stored in a file named noiseXXX.txt, where XXX designates a three letter language code (for example, noiseENU.txt is intended for US English entries), residing in Program Files\Microsoft SQL Server\MSSQL.x\MSSQL\FTData folder, where x designates an integer assigned to a local SQL Server 2005 Express Edition instance. (This location can be identified by checking the value of the NoiseFile registry entry of the REG_SZ datatype under HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.x\MSSearch\Language\eng key).

Sometimes it might make sense to alter the default content of the noise words file (for example, in order to avoid indexing terms customary to your business, which tend to appear more frequently across your data). In order to accomplish this, simply add new (or, if appropriate, remove any present) words by editing the text file. Keep in mind, however, that such changes have to be followed by full repopulation of existing indexes. This can be accomplished on a per index-basis (by leveraging the methods described above) or en masse for all indexes contained within a given catalog by invoking the ALTER FULLTEXT CATALOG statement with REBUILD parameter. In addition, note that the same set of noise words is shared across all catalogs associated with the same instance of SQL Server 2005 Express Edition (this restriction has been eliminated in SQL Server 2008 through introduction of stop lists).

Another factor that affects full-text search behavior is the use of thesaurus. This feature, relying on an XML formatted document named tsXXX.xml (where XXX designates a three letter language code) along with its tsGLOBAL.xml counterpart (both residing in the same Program Files\Microsoft SQL Server\MSSQL.x\MSSQL\FTData folder as the noise word files) gives you the ability to define synonyms, which in turn facilitate more intelligent queries, giving you the ability to locate phrases with an equivalent meaning but different wording. In order to customize its content, you need to first familiarize yourself with its syntax. By default, the majority of it is commented out. Before removing the comments, decide whether thesaurus-based searches should be accent sensitive or not (this involves setting the value of <diacritics_sensitive> element to either 1 or 0). Next, consider that synonyms are specified by employing either <expansion> or <replacement> sets. The first of them associates similar terms (enclosed within <sub> elements) with each other, such that searches yield content that includes any of them. For example, by implementing the sample tsENU.xml file, which contains <sub> elements within <expansion> for Internet Explorer, IE, and IE5, a search for any one of them would return every row that contains at least one of them (as long as the query syntax indicates that thesaurus should be used). <replacement> on the other hand, uses a value within the <sub> element to substitute every pattern specified within the <pat> elements (effectively, only rows containing the former are included in the result set). For example, if you refer to the same sample file tsENU.xml, you will find there a single <sub> element with Windows 2000 assigned to it, along with two <pat> elements, set to NT5 and W2K. This implies that any query for either of these two values would actually return rows containing their substitute. Note that, as with the noise word modifications, any changes to thesaurus files do not take effect until you fully rebuild all existing indexes. In addition, keep in mind that such changes should be saved in Unicode when using a text editor.

In our next article of this series, we will conclude our discussion of Full-Text Index technology implemented in SQL Server 2005 Express Edition by presenting syntax of search queries and providing a few examples demonstrating their power and flexibility.

» See All Articles by Columnist Marcin Policht

Mobile Site | Full Site