SQL Server 2005 Integration Services – Lookup Transformation – Part 33

Continuing the overview of Data Flow components in SQL Server 2005
Integration Services, which we initiated in the previous installment
of this series, we will now focus on the Lookup transformation.
Its basic purpose is to combine data from two sources by matching one or more
of their fields (commonly used when populating fact tables in Data
Warehousing), although there are numerous other scenarios in which such
functionality comes in handy, such as producing output containing rows for
which no match exists. Obviously, the majority of these actions can be
fairly easily accomplished with standard T-SQL statements containing a JOIN
clause; however, there are situations where this might not be preferred or even
possible (e.g. incorporating a lookup into a data flow involving disparate
sources, or referencing values that are available only at runtime).
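
To make the parallel concrete, here is a minimal T-SQL sketch of what the
transformation accomplishes, borrowing the AdventureWorks tables featured later
in this article and assuming (purely for illustration) that the incoming rows
have first been staged in a hypothetical dbo.StagedLogins table:

-- dbo.StagedLogins is a hypothetical staging table standing in for the
-- transformation's input; in a genuine data flow, no such staging is needed.
-- The LEFT JOIN reproduces the lookup: matched rows pick up the lookup
-- columns, while mismatched rows simply end up with NULLs in their place.
SELECT S.Login, S.Computer, S.AppID, C.FirstName, C.LastName
FROM dbo.StagedLogins AS S
LEFT JOIN HumanResources.Employee AS E
ON S.Login = E.LoginID
LEFT JOIN Person.Contact AS C
ON E.ContactID = C.ContactID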

The Lookup transformation populates its output by equi-joining rows being
fed through its single input with a reference (lookup) dataset, formed using a
table, view, or (potentially parameterized) T-SQL statement (leveraging an OLE DB
connection manager). The join is based on the comparison between values of
designated fields in each input row and lookup keys in the reference
dataset. Depending on whether a matching row is found, this operation either
generates standard output by combining both data sources or triggers an error
condition (details of this behavior are dictated by a number of configuration
settings that we will be discussing shortly).

Following this short introduction, let’s analyze an example demonstrating the
functionality of this transformation and examine its characteristics more
closely. We will emulate a situation where an inventory of Windows user
accounts (including the names of computers belonging to their owners, as well
as their logon names for a third-party application) needs to be correlated with
Human Resources records stored in the AdventureWorks database, in order to
determine the first and last name of each account owner. The sample data will
have the following format (with individual fields separated by semicolons):


adventure-works\kevin0;usny-wks001;KBrown
adventure-works\roberto0;usny-wks002;RTamburello
adventure-works\marcin1;usny-wks666;MPolicht
adventure-works\ruth0;usny-wks004;REllerbrock

The Windows account data is stored in the LoginID column of the HumanResources.Employee
table, which, once joined with the Person.Contact table (via the ContactID column),
can be used to determine each person’s first and last names (by examining the content
of the FirstName and LastName columns). While the first two and the last
rows in our sample contain valid references to AdventureWorks data, the third
one will produce a mismatch.
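
You can confirm this ahead of time with a quick query against AdventureWorks
(its WHERE clause simply enumerates our four sample logins); only three rows
should come back, with the adventure-works\marcin1 entry absent:

SELECT E.LoginID, C.FirstName, C.LastName
FROM HumanResources.Employee AS E
INNER JOIN Person.Contact AS C
ON E.ContactID = C.ContactID
WHERE E.LoginID IN (N'adventure-works\kevin0', N'adventure-works\roberto0',
N'adventure-works\marcin1', N'adventure-works\ruth0')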

Start by storing the sample data in a text file (we will call it
AWLogins.txt). Next, launch Business Intelligence Development Studio and
initiate a new project of the Integration Services type. In the Designer interface
of its package, create a Data Flow task. Using the context sensitive menu within the
Connection Managers area, bring up the Flat File Connection Manager Editor
window. Name it according to its purpose (AWLogins) and point to the newly
created text file using the Browse… button next to the File name text box.
Confirm that its content is displayed properly in the Columns section (with the
semicolon listed as a delimiter). Switch to the Advanced section and provide
meaningful names for each column (such as Login, Computer, and AppID). In
addition, we will need to ensure that the Integration Services data type for
the first column will map properly to the SQL Server data type for the LoginID
column, by changing its default string [DT_STR] to Unicode string [DT_WSTR]
(for more information regarding this topic, refer to SQL Server 2005 Books
Online). Verify the outcome by examining the content of the
Preview section and close the Editor window. Once you are back in the Designer
interface, drag a Flat File Source icon from the Toolbox onto the Data Flow area,
configure it by setting its Flat File Connection Manager entry to the newly
created AWLogins and click on the OK button to make the change effective.
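
The reason for this adjustment becomes apparent once you check the data type of
the target column; since LoginID is defined as nvarchar, it corresponds to the
Unicode string [DT_WSTR] type in Integration Services. You can verify this with
a simple query against the standard INFORMATION_SCHEMA views:

-- Confirms that HumanResources.Employee.LoginID is nvarchar,
-- which Integration Services treats as Unicode string [DT_WSTR].
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'HumanResources'
AND TABLE_NAME = 'Employee'
AND COLUMN_NAME = 'LoginID'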

Now it is time to add the Lookup transformation icon from the Toolbox,
connect its input to the output of the Flat File Source, and select Edit from
its context sensitive menu to bring up its Editor window. On the Reference
Table tab, you need to provide a connection manager (create a new one that
references the SQL Server 2005 AdventureWorks database) and specify a table, view,
or SQL query that will define the content of the lookup recordset. (As mentioned
in one of our earlier articles dealing with the topic of SSIS performance, it is
recommended to construct SQL queries that narrow down the range of target
columns.) In our case, enter the following statement in the "Use results of an SQL
query" text box (to verify its correctness, click on the Parse Query command
button or check the outcome of its execution with Preview…):


SELECT E.ContactID, E.LoginID, C.FirstName, C.LastName
FROM HumanResources.Employee AS E
INNER JOIN Person.Contact AS C
ON E.ContactID = C.ContactID

Switch to the Columns tab where you will see a graphical representation of
two data sets in the form of two rectangles labeled Available Input Columns and
Available Lookup Columns. Drag the entry for Login from the first one and drop
it on the LoginID entry of the second (without the earlier adjustment of data
type to Unicode string [DT_WSTR], this operation would generate an error).
Underneath, you can define which lookup columns you want to include in the
output. Select FirstName and LastName (as indicated earlier) with <add as
new column> as the Lookup Operation for both.

Click on the "Configure Error Output…" command button to display
the identically labeled dialog box. From here, you can define the behavior of
the transformation for a variety of error conditions. In our case, you should
see three entries here. The first one (with Lookup Output in the Input or Output
column) defines the outcome of the mismatch in the lookup operation and can be
configured to Ignore failure (copying mismatched rows to standard output with NULLs
instead of lookup values), Redirect row (copying mismatched rows to error
output with error code instead of lookup values), or Fail component. The
remaining two entries for the LastName and FirstName columns exist due to the
potential for truncation errors (which can be addressed by applying the same
three actions – Ignore failure, Redirect row, or Fail component). We will test
the Redirect row behavior for the Lookup operation, so ensure that this action
is selected in the Error column for the first row. Close the Configure Error
Output dialog box and switch to the Advanced tab, containing the settings
affecting memory utilization.

By default, the Lookup transformation stores the referenced recordset in memory,
in order to optimize lookup operations. This is reflected by the value of its CacheType
property, which is set to Full (you can verify this in its Properties window or
the Component Properties tab of its Advanced Editor window) as well as by the cleared
"Enable memory restrictions" check box on the Advanced tab of its
Editor window. This way, the entire lookup dataset is preloaded into memory, which, in
most cases, speeds up processing, but might also negatively impact other
applications (especially for large data sets). If this happens to be the case,
you can also set the CacheType property to Partial (causing each row from the
lookup recordset to be loaded into memory the first time it is referenced and
remain there until cache size reaches the limit specified on the Advanced tab
of the Editor window) or None (which eliminates caching altogether and is
equivalent to turning on the "Enable memory restrictions" check box on the
Advanced tab of the Editor window and keeping "Enable caching"
cleared). None should be used with small reference recordsets, while Partial is
geared towards scenarios where a large number of duplicate lookup key values in
input data is expected.
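
Note that enabling memory restrictions also exposes the option to modify the SQL
statement used by the transformation, turning the lookup into a parameterized
query executed against SQL Server (with the question mark serving as a
placeholder for the lookup key taken from each input row). For our example, the
generated statement takes roughly the following form:

select * from
(SELECT E.ContactID, E.LoginID, C.FirstName, C.LastName
FROM HumanResources.Employee AS E
INNER JOIN Person.Contact AS C
ON E.ContactID = C.ContactID) as refTable
where [refTable].[LoginID] = ?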

It is important to realize that, depending on your choice of caching method,
you might experience different outcomes when comparing the same sets of data.
This surprising behavior results from the fact that with Full caching, all of the
reference data is loaded into memory and compared within the SSIS environment,
which takes case and accent sensitivity into account. On the other hand, with the
None or Partial settings, the comparison is handled by SQL Server, with collation
and sensitivity settings typically defined at the database level (and commonly
less strict).
You can easily test this by altering one of the Login entries in our example
(e.g. changing "adventure-works\ruth0" to
"adventure-works\Ruth0") and switching between Full and None/Partial
caching methods. To address this issue, convert the input data into the desired
case with the Character Map transformation and apply the T-SQL UPPER or LOWER
functions when generating the reference recordset.
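
For instance, a case-insensitive variant of our reference query (intended to be
paired with a Character Map transformation that converts the Login input column
to uppercase) might look as follows:

-- Uppercasing the lookup key on the reference side; combined with an
-- uppercased input column, this makes the Full-cache comparison case-proof.
SELECT E.ContactID, UPPER(E.LoginID) AS LoginID, C.FirstName, C.LastName
FROM HumanResources.Employee AS E
INNER JOIN Person.Contact AS C
ON E.ContactID = C.ContactID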

In order to capture both standard and error outputs, set up two Flat File
Destinations. After you connect each to the appropriate output of the Lookup
transformation, configure their Flat File Connection Managers, pointing to two
distinct files (e.g. AWLoginsOut.txt and AWLoginsErr.txt). Once you execute the
package, the first one should contain the following entries:


adventure-works\kevin0,usny-wks001,KBrown,Kevin,Brown
adventure-works\roberto0,usny-wks002,RTamburello,Roberto,Tamburello
adventure-works\ruth0,usny-wks004,REllerbrock,Ruth,Ellerbrock

while the second one will consist of a single line with the
input row values followed by an error code indicating the cause of the failure. If you
decide to apply the Ignore failure setting (in the Configure Error Output dialog
box of Lookup Transformation Editor), the standard output (and AWLoginsOut.txt
file) would contain four rows, with the mismatched one ending with two NULLs
(instead of FirstName and LastName values).
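
As a closing note, the split performed by the Redirect row option corresponds to
the following pair of queries (reusing the hypothetical dbo.StagedLogins staging
table introduced at the beginning of this article):

-- Matched rows (equivalent of the standard output)
SELECT S.Login, S.Computer, S.AppID, C.FirstName, C.LastName
FROM dbo.StagedLogins AS S
INNER JOIN HumanResources.Employee AS E
ON S.Login = E.LoginID
INNER JOIN Person.Contact AS C
ON E.ContactID = C.ContactID

-- Mismatched rows (equivalent of the error output)
SELECT S.Login, S.Computer, S.AppID
FROM dbo.StagedLogins AS S
LEFT JOIN HumanResources.Employee AS E
ON S.Login = E.LoginID
WHERE E.LoginID IS NULL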
