The last article of this series discussed how to get DB2 data into Microsoft Excel 2007 worksheets, some nice formatting capabilities in Excel 2007 and how to apply them to your DB2 data. This article examines some additional advanced features, and analysis capabilities that are part of Excel 2007.
In my last article, I showed you how to get DB2 data into
your Microsoft Excel 2007 (Excel 2007) worksheets. In addition, I showed you some
nice formatting capabilities in Excel 2007 and how to apply them to your DB2
data. In this article, I want to show you some additional advanced features, and
some analysis capabilities, that are part of Excel 2007 as well. You should
note that these features arent exclusive to DB2; youve got the data in Excel
2007, if you read the first
article in this series, but I think this is a good way to round out the
capabilities you can apply to your DB2 data with this wonderful tool.
Ready, set, go....
When you import data from the DB2STAFFTABLE connection object,
your Excel 2007 worksheet should look similar to this:
Excel 2007 has a useful feature called data bars.
Data bars use a gradient fill for each cell; this fill correlates the length of
the shading to the value in the cell relative to all the values within the
column. For example, a longer bar represents a higher value. You can access
this feature from the Home panel of the ribbon by selecting Conditional
Formatting. For example, to show each employees salary as it relates to
other employees, select the entire Salary column, and then click Home>Conditional Formatting>Data Bars>Green Data Bar:
You can see the benefits of adding data bar formatting to
your data. Very quickly, we can tell in the previous figure that a salary of
$98,505.40 is amongst the highest salaries paid to any employee in the company,
while $23,369.80 is among the lowest:
Note: Press Ctrl+Z
to undo the formatting if you want to work with each feature in this
article from the base data; alternatively, you can simply apply features on
You can click More Rules if you want to customize the
gradient formatting. The New Formatting Rule window opens.
Another new feature for data formatting in Excel 2007 is color
scales. Color scales display a two- or three-color gradient shading in a
selected range of data. The shade of color represents a hierarchal tier for the
value in the cell.
For example, to apply the default Red>Yellow>Green
hierarchy (where Red is the lowest stratification of values and Green is the
highest) select the entire Salary column, then click Home>Conditional Formatting>Color Scales>Green Yellow Red Color Scale:
Again, you can click More Rules if you want to customize
the gradient formatting:
Another new feature for data formatting in Excel 2007 is icon
sets. Icon sets use icons to represent data stratifications in the same
manner that color scales do.
These are the built-in icons that are part of Excel 2007 (and
more are available on the Web):
From the previous list of icons you can see that not all
icons are suitable for all types of data. In addition, you can see that different
icon sets contain different numbers of tiers for the data.
For example, to apply cellular signal strength icons to all
the salaries, select the entire Salary column, and then click Home>Conditional Formatting>Icon Sets>4 Ratings:
The highlighted icon set in the previous figure is called 4
Ratings (you can see the name of any icon set by hovering over it). It uses
4 icons to break all the data values in the salary column into quartiles and
uses an icon to represent them accordingly.
Just as you can customize the rules for all the other
formatting options shown so far in this article, you can select More Rules to
customize this data formatter too:
In the previous figures, you can see that there is quite a
range of values for the SALARY column and weve been able to easily categorize
or identify them using various graphical, icon, or color formatting. What
happens if you are working with a very large amount of data, however, and you
want to exclude certain ranges? One option would be to generate a query such
that the data returned to Excel 2007 was not returned to the worksheet. Excel
2007 offers an alternative if you dont have the luxury of receiving data that
matches the exact ranges you want to investigate. (You can also use the data
validation feature to force entered data to be chosen from a drop-down list of
values you want to specify.)
For example, lets assume you received data but didnt want
to analyze salary data that was less than $25,000 or more than $90,000 since
these ranges are considered outliner data and under the discretion of
the Human Resources department.
The first step in data validation is to set the rules to exclude
outlying data. To configure a set of data validation rules in Excel 2007,
perform the following steps:
the Salary column and click Data>Data
Data Validation window opens. Use this window to configure the outlying data
rules, or value-check entered values, and so on. For this example, set Allow
to Decimal, Minimum to 25,000, and Maximum to 90,000:
Note: If the values
you want to use as a lower or upper boundary are part of the data set, you can
use the cell selector () to
select the value directly from the worksheet.
The Input Message and Error
Alert tabs are used for parameter entry validation; however, this
capability is outside the scope of this article.
After you define validation rules
for your data, nothing happens to your data until you specifically call the
data validation function that will interact with your rules and validate the
the data you want validated by clicking Data>Data
Data. When you select this option, you can see that Excel 2007 applies the
data validation rules set up in Step 2 to the data in the worksheet, as shown
Notice in the previous figure that
there are two red circles (you can configure this color) around the outlying
data as defined by our rules.
the data validation rules by selecting Data>Data Validation>Clear
Removing duplicate data
Sometimes your data may contain duplicate values and you
dont want these to be used as part of your data analysis. You can address this
programmatically at the database tier using SQL (shown below); however, doing
so requires using a database query from within Excel or working with the
database administrator (DBA) to can a query as a view or something similar.
In previous versions of Excel, removing duplicate data once
the data was inserted into your Excel worksheet was a very manual task. In
fact, depending on how the data was delivered to staff workstations, a DBA had
to sometimes write custom scripts to clean up the data.
In Excel 2007, things get a lot easier: If you want to
remove duplicate values for the DEPT column in the STAFF table, perform the
the column from which you want to remove duplicates (in this case, DEPT):
Tip: You dont have
to select the column to remove duplicate values from since you can specify the
column directly in Step 2; however, I recommend that clients use this approach
because it keeps them visually focused on the task at hand and what data they
are going to remove. After all, this data is likely to be used in some sort of
analysis process or decision so you want to minimize the chance of data errors
and this is a good way to do that.
the Design ribbon (its under the Table Tools menu that appears
when the table object has focus in Excel) and click Remove Duplicates:
When the Remove Duplicates window
opens, select only the columns from which you want to remove the duplicate
values. In this simple example, I just want to ensure that only distinct values
are returned for the STAFF table as a function of the department. In other
words, there can only be one row of data for each department; therefore, in the
previous figure, I selected DEPT.
You can also see in the previous
figure that I left the My Data has headers check box select which is
the default. This is the reason why my tables column headings are shown in the
Columns section of the Remove Duplicates window. If you dont select
this check box, youll have to work with your data using generic headings, as
shown below, which greatly increases the chances of error; for this reason, I
recommend always using this option:
OK. Any rows that have repeating DEPT values are removed from the data
set and a pop-up window lets you know just how many rows were removed and how
many remain. For our example, the data set now looks like this:
If you select more than one column
in the Remove Duplicates window, Excel 2007 will treat them as a composite duplication
key. Quite simply, this means that its the combination of the selected columns
that determine if the value is unique or not. For example, a composite
duplication key of the NAME and DEPT columns wouldnt remove any rows in our
example since there are no two employees with the same name in the same
However, if you selected DEPT and JOB
as the composite duplication key, only the first employee in the data set that had
the same job in the same department would remain in the data set, as shown
One thing I dont like with this
feature is that you cant select multiple columns that arent adjacent in the
data set. For example, if I highlighted the DEPT and the YEARS columns and
wanted to use both as a composite duplication key, I would receive the
following error message:
The workaround is simple: reorder
the columns. The best way to reorder columns in Excel so that you can create a
composite duplication key is to copy and paste one of the columns that will be
used for the composite duplication key and move it to its new location, and then
delete the original column. Alternatively, you could write an SQL statement in
a database query and make that your data servers connection object, or base
the connection on a view that you create that projects (reorders) the columns
in the required order.
Wrapping it up...
In this article, I showed you some fancy features that are
new in Excel 2007. Whats more, you can use these features to seamlessly work with
your DB2 data. In my next article, I'll show you how you can use a wizard to build a query that retrieves a subset of
data from your connection object. This may be an easier approach to remove duplicate data our outliner values.
Again, it all depends on how the data is delivered to you, so I want to give you a number of tools that you can use to solve issues as they arise.
See All Articles by Columnist Paul C. Zikopoulos
About the Author
Paul C. Zikopoulos, BA, MBA, is an
award-winning writer and speaker with the IBM Database Competitive Technology
team. He has more than ten years of experience with DB2 and has written more
than one hundred articles and several books about it. Paul has co-authored the
books: DB2 9 New Features, Information on Demand, IBM DB2 Version 9: New
Features, DB2 Version 8: The Official Guide, DB2: The Complete Reference, DB2
Fundamentals Certification for Dummies, DB2 for Dummies, and A DBA's Guide to
Databases on Linux. Paul is a DB2 Certified Advanced Technical Expert (DRDA and
Cluster/EEE) and a DB2 Certified Solutions Expert (Business Intelligence and
Database Administration). In his spare time, he enjoys all sorts of sporting
activities, running with his dog Chachi, and trying to figure out the world
according to Chloë his new daughter. You can reach him at: mailto:firstname.lastname@example.org.
IBM and DB2 are registered trademarks of International
Business Machines Corporation in the United States, other countries, or both.
Microsoft is a trademark of Microsoft Corporation in the United States, other countries, or both.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks
or service marks of others.
Copyright International Business Machines Corporation, 2007.
All rights reserved.
The opinions, solutions, and advice in this article are from
the authors experiences and are not intended to represent official
communication from IBM or an endorsement of any products listed within. Neither
the author nor IBM is liable for any of the contents in this article. The
accuracy of the information in this article is based on the authors knowledge
at the time of writing.