Continuing the topic of data cleansing through applying approximate match algorithms, which we started in the most recent article of our series dedicated to SQL Server 2005 Integration Services, we will now present another Data Flow task in this category called Fuzzy Grouping. Unlike previously discussed fuzzy lookups, grouping does not rely on a reference table for comparison, but instead matches input rows against each other.
However, before we focus on the new transformation, we need first to clean up after our latest exercise. As you might recall, in order to expedite subsequent executions of a sample package we created, we suggested storing and maintaining a match index that serves as a basis for fuzzy lookups. This is accomplished by setting up triggers on a reference table (which keeps the table and the index based on that table consistent, replicating every change from the first one to the second). While getting rid of the match index is straightforward and involves simply deleting the appropriate table (hosting the index) in the same database where the reference table resides, it should be preceded by the removal of match-specific triggers. The recommended way of handling this is by running the sp_FuzzyLookupTableMaintenanceUnInstall stored procedure, with the match index name as its sole parameter.
With the HumanResources.Employee table back to its original state, let's turn our attention to fuzzy grouping. As the name indicates, its primary purpose is grouping of records according to their common characteristics, which, in turn, can be used in order to eliminate duplicates. This process starts by breaking down string values of arbitrarily chosen columns for each input row into individual substrings (by applying the same, delimiter-based parsing that is used for fuzzy lookups). These substrings are then indexed and used for comparisons across entire data sets, allowing for identification of rows that likely (according to match types and similarity thresholds) represent the same data (and therefore are prime candidates for de-duplication). Such rows are grouped (by having assigned a common numeric identifier) and, for each group, the algorithm designates one of its rows (known as canonical row) as the representative, containing data that is considered as 100% accurate (selection and grouping leverage fuzzy lookup methodology using temporary tables to generate a match index and perform necessary comparisons). To analyze this process in more detail and become familiar with its implementation, we will review its impact on the following sample data set:
adventure-works\alex0;ANayberg;production;M adventure-dorks\alex0;ANayberg;production;M adventure-works\flex0;ANayberg;production;M adventure-works\kevin6;KBrown;marketing;M adventure-corks\kevin6;KBrown;marketing;M adventuwe-wowks\wobewto0;RTamburello;production;M adventure-works\roberto0;RPamburello;production;M adventure-works\roberto0;RTamburello;production;M adventure-works\ruth0;REllerbrock;production;F adventure-works\tooth0;REllerbrock;production;F adventure-works\rush0;REllerbrock;production;F
Save this listing into a text file, which we will use as the data source for our package. Launch Business Intelligence Development Studio and initiate a new Integration Services project. From its designer interface, create a Data Flow task. Switch to the Data Flow tab and drag onto it a Flat File Source Toolbox icon. Using its context sensitive menu, launch Flat File Source Editor. In the Connection Manager section, click on the New... command button to display the Flat File Connection Manager window. Type in a descriptive name in the top text box and point to the file you saved earlier after clicking on the Browse... command button. Switch to the Columns section to verify that the Preview window accurately reflects the file content. Using entries in the Advanced section, assign descriptive names to each column and narrow down their width to reflect the size of data they will contain (set OutputColumnWidth of Gender to 1). Once you have completed the configuration, click on OK to close the Editor window and return to the Designer interface.
Next, locate the Fuzzy Grouping icon in the toolbox, drop it into Data Flow task area, and connect the output of the Flat File Source with its input. Select Edit... option from its context sensitive menu to display the Editor window. Using New... command button on the first tab, create OLE DB Connection Manager pointing to a SQL Server 2005 database where the temporary object created by the fuzzy grouping procedure will be located (you will need to provide server and database names as well as appropriate authentication method in the Configure OLE DB Connection Manager dialog box). Note that with larger input data sets, size occupied by these objects will be significant, so it is recommended that you use a non-production system for this purpose. Once this is completed, switch to the Columns tab. From here, designate which input columns will be taken into account when performing comparisons (determined by the state of the first column of checkboxes in the Available Input Columns listing) and generating output of the transformation (based on the state of checkboxes in the column labeled Pass Through of the same listing). In our case, we will exclude Description from the fuzzy grouping algorithm (by clearing the checkbox in the first column) but will keep it in our result set (by keeping the Pass Through column checkbox turned on). In the lower portion of the Columns tab (underneath the Available Input Columns listing), you can specify settings for each input column participating in the fuzzy grouping operation that are more granular, including the following:
- Output Alias - name assigned to an output column corresponding to the given input column,
- Group Output Alias - name of the output column containing the canonical value for each row (identified by fuzzy grouping as the correct value for a group that this row is part of),
- Match Type - method used to compare values in the current column across all rows. This can be either Exact (which ensures that only identical values are viewed as duplicates and which you should apply if you are certain about their accuracy) or Fuzzy (which uses described earlier approximation algorithm to deliver grouping results).
- Minimum Similarity - number between 0 and 1 (applicable when Fuzzy Match Type is used) that determines how closely (in terms of fuzzy matching calculations) values within the given column for all rows need to be to each other in order to be considered part of the same group (typically, the higher this value is, the larger number of small groups you will end up with).
- Similarity Output Alias - name of the output column containing the similarity rating calculated by comparing value in the given column for each row against value of the same column in canonical row (within the same group). There is one such Alias for each column that is used for fuzzy grouping.
- Numerals - setting that designates relevance of leading or trailing digits in the string of characters contained within the given column (it can be assigned to Neither, Leading, Trailing, or LeadingAndTrailing, depending on which ones are regarded as significant).
- Comparison Flags - collection of options that affect details of the string matching process for a given column, including such criteria as case sensitivity (Ignore case), differences between Japanese character sets (Ignore kana type), use of diacritics, such as tilde (Ignore nonspacing characters) or symbols, such as punctuation (Ignore symbols), presence of single- and double-byte characters (Ignore character width), and sorting order for punctuation characters (Sort punctuation as symbols).
In our case, we will leave the defaults for all of the existing entries (UserName, LoginName, and Gender input columns) including their preassigned labels, Fuzzy match type, as well as Numerals and Similarity settings. The initial run will give you an indication how well the fuzzy grouping process performs with existing data. Depending on the results (and your ability to identify incorrect data values), you can adjust relevant parameters affecting number and size of groups as well as canonical value for each.
Switch to the Advanced tab of the Editor window, where you can set the component
level similarity threshold (with a slider), designate token delimiters
(including space, carriage return, tab, line feed, and
as well as customize names of three additional output columns:
- _key_in - a consecutive integer assigned to each row processed by the Fuzzy Grouping transformation,
- _key_out - an integer identifying _key_in value of canonical row for the group to which the current row has been assigned. For example, in our sample data, the first three rows, once processed and directed to the transformation output, will have _key_in values of 1, 2, and 3, respectively. As expected, the first row would be designated as canonical; hence, its _key_in value (i.e. 1) would appear in _key_out column for the second and third (as well as the first) output rows.
- _score - a number between 0 and 1 representing similarity between the current and canonical rows (with 1 designating exact match).
As in our previous example, we will capture results into a text file. To accomplish this, drag the Flat File Destination from the Toolbox onto Data Flow task area and connect output of the Fuzzy Grouping transformation to its input. Next, select Edit from its context sensitive menu to display the Editor window. Create a new Flat File connection manager pointing to a delimited text file and accept its settings preassigned based on the format of Fuzzy Grouping output columns (including _key_in, _key_out, and _score).
During package execution, you should expect an initial delay introduced when input data is buffered and temporary objects necessary for fuzzy matching are created. For larger data sets this will likely have significant implications on memory and space utilization so, as mentioned before, you should consider offloading processing (along with the input data) to a non-production SQL Server or at least setting upper limit on the memory used by Fuzzy Grouping with its MaxMemoryUsage custom property (on Component Properties tab of its Advanced Editor window).