SQL Server 2005 Integration Services - Part 36 - Term Extraction and Term Lookup

Friday Nov 10th 2006 by Marcin Policht

Part 36 of this series covers two SSIS Data Flow transformations known as Term Extraction and Term Lookup.

As we have demonstrated in the recent installments of our series, in addition to handling standard tasks associated with data processing (available also in legacy Data Transformation Services), SQL Server 2005 Integration Services offers some new, unconventional capabilities. Fuzzy Lookup and Grouping, covered in my two latest articles, help dealing with potentially erroneous data (by identifying duplicates and performing approximate matches). Equally innovative functionality is delivered by two other SSIS Data Flow transformations known as Term Extraction and Term Lookup, which we will be the topic of today's discussion.

Development of computing technologies (accelerated by the advent of the Internet, which ushered in an era of worldwide communication) provided access to a variety of data sources on an unprecedented scale. This, however, also led to information overload, making its management increasingly complex and time consuming. One way to address this issue involves the use of a discipline of computer science called data mining, whose primary goal is to formulate methodologies for extracting relevant data from massive data sets based on identifying its patterns (in essence, turning unorganized data into meaningful information). While the majority of techniques necessary for implementing this approach in SQL Server environment are available within SQL Server 2005 Analysis Services, SSIS offers text mining capabilities in the form of Term Extraction and Term Lookup Data Flow transformations, which can be used to handle such diverse data sources as RSS feeds or Web Services.

The purpose of Term Extraction is to generate a dictionary of terms (that will be subsequently used to categorize multiple data sets), based on samples that are considered to be representatives of each category. This is accomplished by identifying repeating nouns or noun phrases within text read via input column of the transformation. Entire input content is broken into individual words (through the process called tokenization that takes into consideration relevant punctuation marks and special characters), which are then normalized (which results in articles and pronouns being discarded), tagged, and stemmed (yielding individual, singular, non-capitalized generic nouns, properly capitalized proper nouns, numbers, and adjectives that are part of noun phrases, but ignoring other parts of speech). Note that the analyzed text must be written in English and has to be delivered to transformation as DT_WSTR or DT_NTEXT data type (depending on circumstances, this might require introducing the Data Conversion component prior to passing text for term extraction). The process of extraction can be customized (using entries on the Advanced tab of the Term Extraction Transformation Editor) by specifying one of three types of terms to search for (nouns, noun phrases - which includes in the scope of analysis adjectives and proper nouns, as well as numbers - or both nouns and noun phrases) and whether to consider or disregard their case (with the default of context-insensitive term extraction). It is also possible to set the number of times each term needs to be repeated in order to be added to the dictionary (the default is 2) and the maximum length of each term (with the default of 12 tokens), which is applicable only when noun phrases are part of the search. There are two rating systems that can be used within the transformation (controlled by options in the Score type section on the Advanced tab of Term Extraction Transformation Editor). The first one (Frequency) is simply the count of occurrences of a term in input text. The second one is a value known as TFIDF (an acronym for "term frequency–inverse document frequency"), calculated according to the formula (Frequency of the term) * log((Number of input rows)/(Number of rows containing the term)), which helps determine uniqueness of a specific term across all input rows. A higher TFIDF value indicates that the term appears only in few of them and therefore, it is more relevant for proper categorization (term that is part of every input row does not help with selection process). Another customizable option is the ability to specify a database table containing generic words that are supposed to be ignored (effectively excluding them from the dictionary). Note that you are responsible for populating this table based on expected content of your data sources.

To better understand factors involved in implementing Term Extraction, let's consider a simple example that will use one of our earlier articles on SQL Server 2005 Part 1 - Security (Authentication) as the basis for forming a term dictionary. Start by copying the content of the article to a text file and store it in an arbitrary location. When saving it in Notepad, set the Encoding option to Unicode (this will eliminate the need for adding Data Conversion transformation to alter its data type to DT_NTEXT, which, as we have mentioned earlier, is required by the Term Extraction component). Next, launch the Business Intelligence Development Studio and initiate a new Integration Services project. Within its Designer interface, create a Data Flow task. Switch to the Data Flow tab and drag Flat File Source from the Toolbox onto it. Select Edit... from the context sensitive menu of Flat File Source and click on the New... command button within its Editor window to create a new Connection Manager. Assign a descriptive name to it in the Flat File Connection Manager Editor dialog box, click on the Browse... command button, and point to the newly created text file. Ensure that English is selected in the Locale entry and mark on the Unicode checkbox next to it. Switch to the Advanced section and ensure that the DataType property of our input column is set to "Unicode text stream", with the default column delimiter {CR}{LF} (which will generate a separate row for each paragraph from the source article). Confirm all choices and return to the Data Flow area. Now it is time to add the Term Extraction transformation by dragging its icon from the Toolbox. After you connected its input with the output of Flat File Source, right-click on it and select the Edit... option from the context sensitive menu. This will display the Term Extraction Transformation Editor. On the first tab, labeled Term Extraction, simply confirm that our input column is selected and take a note of output columns names (Term and Score). For the time being, do not modify entries on the Exclusion tab, since this requires having a table with terms that are considered generic (which we will determine in the next step). Review the Advanced tab and verify that it contains the customization options presented in the previous paragraph (leave them unchanged with their defaults).

To capture the outcome, create a Terms table in an arbitrary database with two columns - Term of nvarchar(128) type and Score of type float. Add an OLE DB Destination and Flat File Destination (connected, respectively, to standard and error outputs of our Term Extraction transformation) to your Data Flow, with the first one based on OLE DB Connection Manager pointing to the database hosting Terms table and the second based on Flat File Connection Manager (note that since the error output includes a copy of an entry in Input column that caused a failure, its Connection Manager needs to have the Unicode option turned on). Once you execute your package, the Terms table will be populated with rows containing each extracted term and frequency of its occurrence in our source file. As you can imagine, some of the extracted words are fairly generic and not related in any particular way to the subject of the article, such as, for example, value, default, number, name, character, combination, need, part, note, etc. (note that choice of these terms depends largely on your specific business scenario). To eliminate them from the scope of analysis, copy them to another database table and point to it on the Exclusion tab of the Term Extraction Transformation Editor. Just as with Fuzzy Lookup and Fuzzy Grouping transformations, you might want to consider testing Term Extraction with different data sources that contain text relevant to your data mining criteria, reviewing results after each run, in order to come up with an optimum dictionary of terms and set of exclusions (those might need to be periodically updated so emerging trends are taken into account).

Once you have established your final list of terms, you are ready to test the Term Lookup process. As its input, you can use another one of our articles, dealing with SQL Server 2005 Security authorization. Start by creating a new Integration Services project with a Data Flow task (or simply add to the existing project the second Data Flow task with a precedence constraint enforcing dependency on successful completion of the first one). Add a Flat File Source and its Connection Manager (with the Unicode option turned on) pointing to a Unicode-encoded text file containing the second article. Next, drag the Term Lookup transformation icon from the Toolbox onto the Data Flow area, connect output of the Flat File Source to its input, and select the Edit... option from its context sensitive menu. On the Reference Table tab, point to the Terms table populated by the Term Extraction transformation. Review the content of the Term Lookup tab and verify that the input column is compared against the Term column of our Reference Table (Terms). Finally, note that the Advanced tab gives you an option to perform case-sensitive term lookup (leave it in its default, turned off state). If you decide to store the results in text files, ensure that both standard and error Flat File Destination Connection Managers are configured for Unicode and English locale. Execution of this Data Flow task will generate output divided into three columns, with the first one listing a dictionary term, the second its frequency for an individual input row, and the third one the input row itself (note that each input row might appear in the output multiple times, depending on how many dictionary terms appear in it). At this point, any additional actions depend primarily on the specific goals of your data mining procedure and might involve the use of other SSIS transformations and components or applying procedures available in SQL Server 2005 Analysis Services.

» See All Articles by Columnist Marcin Policht

Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved