SQL Server 2008 and 2008 R2 Integration Services - Sort Transformation

Wednesday Jan 25th 2012 by Marcin Policht

SQL Server 2008 R2 Integration Services offer a wide range of pre-built components, which deliver generic functionality commonly required when performing ETL tasks. While most of them are fairly straightforward to deploy in their basic form, there are caveats that should be taken into account as part of their implementation. Read on to learn more.

SQL Server 2008 R2 Integration Services offer a wide range of pre-built components, which deliver generic functionality commonly required when performing extraction, transformation, and loading tasks. While most of them are fairly straightforward to deploy in their basic form, typically there are refinements or caveats that should be taken into account as part of their implementation. This principle becomes quite evident when considering use cases of Sort transformation, which is the subject of this article.

Not surprisingly, the primary purpose of this Data Flow component is the ordering of input rows according to standard .NET Framework-based sorting techniques. Its scope typically involves multiple columns, with each assigned a unique non-zero integer representing its precedence (the one with the smallest absolute value greater than zero is applied first) and direction (positive or negative values designate, respectively, ascending or descending order), leaving 0 as the identifier of excluded columns. The actual sorting algorithm can be fine-tuned in order to accommodate various comparison options applicable to string data type (as described in SQL Server 2008 R2 Books Online). As a side benefit, the transformation also gives you the option to remove duplicate rows - although you should keep in mind that matches are determined exclusively based on the values of columns included in the sorting process (which might lead to removal of rows that are not identical).

The primary disadvantage of the Sort transformation is its negative impact on performance, resulting from its fully blocking, asynchronous nature. Effectively, introducing it in a data flow requires that all of the input rows to be sorted are loaded into memory before any of them can be passed to its output. (This becomes especially problematic when dealing with large data sets containing millions of records). Due to this shortcoming, you might want to consider avoiding Sort transformation by leveraging alternative methods that provide equivalent functionality.

In particular, when extracting data stored in SQL Server databases, you can take advantage of its native sorting functionality, which tends to be significantly more efficient, thanks to its auxiliary optimization mechanisms (such as indexing). For example, you can employ OLE DB Source with the SQL Command containing an ORDER BY clause that delivers the desired input records in an already sorted sequence (the same principle applies to ADO.Net source adapter).

Note, however, that if you decide to use this approach, it might be required in some cases (e.g. when combining multiple data sources using Merge transform) to indicate that the resulting data is already sorted. In the case of OLE DB Source, this is accomplished by setting the IsSorted property to True via the Common Properties section of OLE DB Source Output node on the Input and Output Properties tab of its Advanced Editor interface. In addition, you need to identify the columns included in the ORDER BY clause by assigning appropriate numbers to their SortKeyPosition property (accessible via the Common Properties section of the corresponding entries in the Output Columns folder), where positive and negative values represent ascending and descending order respectively.

If your intention is to sort columns of GUID (or uniqueidentifier in the SQL Server nomenclature) data type, you should keep in mind that the outcome is likely to differ depending on the method you choose. In particular, Sort transformation carries out its operations in a fairly straightforward manner (digits 0-9 precede letters A-F), while the ORDER BY clause relies on a rather convoluted approach that carries out comparison within boundaries defined by byte groups (sections delimited by dashes in the most common GUID notation) in a somewhat unexpected (discussed in more details on one of msdn blogs) sequence (which reflects two distinct approaches to sorting GUIDs implemented in the .NET Framework and SQL Server platform).

Ultimately, there are numerous scenarios where it might be justified to employ Sort transformation (e.g. due to a limited number of records) or where workarounds are not applicable (for example, when dealing with data originating from a Flat File Source). We will step through a sample process illustrating one of these cases. For the sake of simplicity, let's assume that we have an inventory of a few paintings we collected over the years, including information about their creators, year of completion, and estimated price (in mln USD) in the semicolon delimited format stored in a text file (including the column titles):

Portrait of Adele Bloch-Bauer I;Gustav;Klimt;1907;$150
Portrait of Dr. Gachet;Vinny;van Gogh;1890;$144
Irises;Vinny;van Gogh;1889;$105
Eight Elvises;Andy;Warhol;1963;$104

To implement our sample package, launch Business Intelligence Development Studio and create a new project based on the Integration Services template. Switch to the Data Flow tab and activate the hyperlink displayed in the center of the main window stating 'No Data Flow task have been added to this package'. Click here to add a new Data Flow task. With the Data Flow task in place and active, drag the Flat File Source component from the Toolbox onto the Designer interface. Display its Editor window by selecting the Edit... item from its context-sensitive menu. Click on the New... command button next to the Flat file connection manager listbox. In the resulting Editor window, type in the Connection manager name and point to the file containing our sample data. Enable the Column names in the first data row checkbox. Switch to the Columns section and ensure that Semicolon {;} is selected as the Column delimiter. Verify that the appropriate content appears in the Preview rows listbox and click twice on OK to confirm your choices.

Next, drag the Sort icon from the Toolbox onto the Designer window. Position it directly underneath the Flat File Source component, and extend the Data Flow Path so the two of them become connected. Display the Editor window of the newly created transformation. Its really straightforward interface allows you to designate which columns you want to sort by, along with their Output Alias, Sort Type (ascending or descending), and sort order (represented by a unique integer assigned to each). Note that you also have an option to Remove rows with duplicate sort values.

Advanced Editor (available from the corresponding entry in the context sensitive menu of the Sort transformation) offers more flexibility in regard to customizing its configuration. For starters, it allows you to limit the number of concurrent threads used during runtime (by changing the value of the MaximumThreads entry appearing on the Component Properties from its default of -1, which does not impose any restrictions). In addition, by switching to the Input and Output Properties tab, you gain access to Sort Output columns, where you can alter the sorting behavior by setting ComparisonFlags (Ignore case, Ignore Kana type, Ignore nonspacing characters, Ignore character width, Ignore symbols, and Sort punctuation as symbols), or alter such properties as CodePage, Datatype, or Length (all of them are relevant when comparing string data).

Finally, to complete the setup of our sample package, add OLE DB Destination to the Designer window, and extend the output of Sort transformation such that both of them are connected. Use the context sensitive menu to display the OLE DB Destination Editor window and create a new OLE DB connection manager pointing to a target database (we will use tempdb for the purpose of our demonstration). In the Data access mode listbox select the Table or view entry. Click on the New... command button to create a new table, adjust the default T-SQL statement appearing in the Create Table dialog box to match your requirements, and complete your changes by clicking on two subsequent OK command buttons. Switch to the Mappings section, verify that input and destination columns are mapped as expected, and close the Editor window.

We will use a Data viewer in order to facilitate a quick preview of sorting results. To accomplish this, right click on Data Flow Path between the Sort transformation and OLE DB Destination and select Data Viewers from its context sensitive menu. In the resulting Data Flow Path Editor click on the Add... command button in the Data Viewers section and select Grid from the list of available types. Once you return to the Designer window, an icon containing a pair of eyeglasses should appear right next to the arrow representing the output of our Sort transformation. Executing the package will display the Data Viewer window with sample data sorted in the specified order.

See all articles by Marcin Policht

Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved