DB2 9 and Microsoft Excel 2007 Part 2: Working with the DB2 Data...

Tuesday Jul 24th 2007 by Paul Zikopoulos

The last article of this series discussed how to get DB2 data into Microsoft Excel 2007 worksheets, some nice formatting capabilities in Excel 2007 and how to apply them to your DB2 data. This article examines some additional advanced features, and analysis capabilities that are part of Excel 2007.

In my last article, I showed you how to get DB2 data into your Microsoft Excel 2007 (Excel 2007) worksheets. In addition, I showed you some nice formatting capabilities in Excel 2007 and how to apply them to your DB2 data. In this article, I want to show you some additional advanced features, and some analysis capabilities, that are part of Excel 2007 as well. You should note that these features aren’t exclusive to DB2; you’ve got the data in Excel 2007, if you read the first article in this series, but I think this is a good way to “round out” the capabilities you can apply to your DB2 data with this wonderful tool.

Note: This article assumes that you leverage the DB2STAFFTABLE connection object you created as part of the”DB2 9 and Microsoft Excel 2007 Part 1: Getting the Data...” article, and that you have imported the data from this object into your worksheet as shown below:

Click for larger image

Ready, set, go....

When you import data from the DB2STAFFTABLE connection object, your Excel 2007 worksheet should look similar to this:

Data bars

Excel 2007 has a useful feature called data bars. Data bars use a gradient fill for each cell; this fill correlates the length of the shading to the value in the cell relative to all the values within the column. For example, a longer bar represents a higher value. You can access this feature from the Home panel of the ribbon by selecting Conditional Formatting. For example, to show each employee’s salary as it relates to other employees, select the entire Salary column, and then click Home>Conditional Formatting>Data Bars>Green Data Bar:

You can see the benefits of adding data bar formatting to your data. Very quickly, we can tell in the previous figure that a salary of $98,505.40 is amongst the highest salaries paid to any employee in the company, while $23,369.80 is among the lowest:

Note: Press Ctrl+Z to undo the formatting if you want to work with each feature in this article from the base data; alternatively, you can simply apply features on other features.

You can click More Rules if you want to customize the gradient formatting. The New Formatting Rule window opens.


Color scales

Another new feature for data formatting in Excel 2007 is color scales. Color scales display a two- or three-color gradient shading in a selected range of data. The shade of color represents a hierarchal tier for the value in the cell.

For example, to apply the default Red>Yellow>Green hierarchy (where Red is the lowest stratification of values and Green is the highest) select the entire Salary column, then click Home>Conditional Formatting>Color Scales>Green – Yellow – Red Color Scale:

Again, you can click More Rules if you want to customize the gradient formatting:

Icon sets

Another new feature for data formatting in Excel 2007 is icon sets. Icon sets use icons to represent data stratifications in the same manner that color scales do.

These are the built-in icons that are part of Excel 2007 (and more are available on the Web):

From the previous list of icons you can see that not all icons are suitable for all types of data. In addition, you can see that different icon sets contain different numbers of tiers for the data.

For example, to apply “cellular signal strength” icons to all the salaries, select the entire Salary column, and then click Home>Conditional Formatting>Icon Sets>4 Ratings:

The highlighted icon set in the previous figure is called 4 Ratings (you can see the name of any icon set by hovering over it). It uses 4 icons to break all the data values in the salary column into quartiles and uses an icon to represent them accordingly.

Just as you can customize the rules for all the other formatting options shown so far in this article, you can select More Rules to customize this data formatter too:

Data validation

In the previous figures, you can see that there is quite a range of values for the SALARY column and we’ve been able to easily categorize or identify them using various graphical, icon, or color formatting. What happens if you are working with a very large amount of data, however, and you want to exclude certain ranges? One option would be to generate a query such that the data returned to Excel 2007 was not returned to the worksheet. Excel 2007 offers an alternative if you don’t have the luxury of receiving data that matches the exact ranges you want to investigate. (You can also use the data validation feature to force entered data to be chosen from a drop-down list of values you want to specify.)

For example, let’s assume you received data but didn’t want to analyze salary data that was less than $25,000 or more than $90,000 since these ranges are considered outliner data and under the discretion of the Human Resources department.

The first step in data validation is to set the rules to exclude outlying data. To configure a set of data validation rules in Excel 2007, perform the following steps:

1.  Select the Salary column and click Data>Data Validation:

Click for larger image

2.  The Data Validation window opens. Use this window to configure the outlying data rules, or value-check entered values, and so on. For this example, set Allow to Decimal, Minimum to 25,000, and Maximum to 90,000:

Note: If the values you want to use as a lower or upper boundary are part of the data set, you can use the cell selector () to select the value directly from the worksheet.

The Input Message and Error Alert tabs are used for parameter entry validation; however, this capability is outside the scope of this article.

After you define validation rules for your data, nothing happens to your data until you specifically call the data validation function that will interact with your rules and validate the data.

3.  Highlight the data you want validated by clicking Data>Data Validation>Circle Invalid Data. When you select this option, you can see that Excel 2007 applies the data validation rules set up in Step 2 to the data in the worksheet, as shown below:

Notice in the previous figure that there are two red circles (you can configure this color) around the outlying data as defined by our rules.

4.  Clear the data validation rules by selecting Data>Data Validation>Clear Validation Circles.

Removing duplicate data

Sometimes your data may contain duplicate values and you don’t want these to be used as part of your data analysis. You can address this programmatically at the database tier using SQL (shown below); however, doing so requires using a database query from within Excel or working with the database administrator (DBA) to can a query as a view or something similar.

In previous versions of Excel, removing duplicate data once the data was inserted into your Excel worksheet was a very manual task. In fact, depending on how the data was delivered to staff workstations, a DBA had to sometimes write custom scripts to clean up the data.

In Excel 2007, things get a lot easier: If you want to remove duplicate values for the DEPT column in the STAFF table, perform the following steps:

1.  Select the column from which you want to remove duplicates (in this case, DEPT):

Tip: You don’t have to select the column to remove duplicate values from since you can specify the column directly in Step 2; however, I recommend that clients use this approach because it keeps them visually focused on the task at hand and what data they are going to remove. After all, this data is likely to be used in some sort of analysis process or decision so you want to minimize the chance of data errors and this is a good way to do that.

2.  Select the Design ribbon (it’s under the Table Tools menu that appears when the table object has focus in Excel) and click Remove Duplicates:

When the Remove Duplicates window opens, select only the columns from which you want to remove the duplicate values. In this simple example, I just want to ensure that only distinct values are returned for the STAFF table as a function of the department. In other words, there can only be one row of data for each department; therefore, in the previous figure, I selected DEPT.

You can also see in the previous figure that I left the My Data has headers check box select which is the default. This is the reason why my table’s column headings are shown in the Columns section of the Remove Duplicates window. If you don’t select this check box, you’ll have to work with your data using generic headings, as shown below, which greatly increases the chances of error; for this reason, I recommend always using this option:

3.  Click OK. Any rows that have repeating DEPT values are removed from the data set and a pop-up window lets you know just how many rows were removed and how many remain. For our example, the data set now looks like this:

If you select more than one column in the Remove Duplicates window, Excel 2007 will treat them as a composite duplication key. Quite simply, this means that it’s the combination of the selected columns that determine if the value is unique or not. For example, a composite duplication key of the NAME and DEPT columns wouldn’t remove any rows in our example since there are no two employees with the same name in the same department:

However, if you selected DEPT and JOB as the composite duplication key, only the first employee in the data set that had the same job in the same department would remain in the data set, as shown below:

Click for larger image

One thing I don’t like with this feature is that you can’t select multiple columns that aren’t adjacent in the data set. For example, if I highlighted the DEPT and the YEARS columns and wanted to use both as a composite duplication key, I would receive the following error message:

The workaround is simple: reorder the columns. The best way to reorder columns in Excel so that you can create a composite duplication key is to copy and paste one of the columns that will be used for the composite duplication key and move it to its new location, and then delete the original column. Alternatively, you could write an SQL statement in a database query and make that your data server’s connection object, or base the connection on a view that you create that projects (reorders) the columns in the required order.

Wrapping it up...

In this article, I showed you some fancy features that are new in Excel 2007. What’s more, you can use these features to seamlessly work with your DB2 data. In my next article, I'll show you how you can use a wizard to build a query that retrieves a subset of data from your connection object. This may be an easier approach to remove duplicate data our outliner values. Again, it all depends on how the data is delivered to you, so I want to give you a number of tools that you can use to solve issues as they arise.

» See All Articles by Columnist Paul C. Zikopoulos

About the Author

Paul C. Zikopoulos, BA, MBA, is an award-winning writer and speaker with the IBM Database Competitive Technology team. He has more than ten years of experience with DB2 and has written more than one hundred articles and several books about it. Paul has co-authored the books: DB2 9 New Features, Information on Demand, IBM DB2 Version 9: New Features, DB2 Version 8: The Official Guide, DB2: The Complete Reference, DB2 Fundamentals Certification for Dummies, DB2 for Dummies, and A DBA's Guide to Databases on Linux. Paul is a DB2 Certified Advanced Technical Expert (DRDA and Cluster/EEE) and a DB2 Certified Solutions Expert (Business Intelligence and Database Administration). In his spare time, he enjoys all sorts of sporting activities, running with his dog Chachi, and trying to figure out the world according to Chloë – his new daughter. You can reach him at: mailto:paulz_ibm@msn.com.


IBM and DB2 are registered trademarks of International Business Machines Corporation in the United States, other countries, or both.

Microsoft is a trademark of Microsoft Corporation in the United States, other countries, or both.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Copyright International Business Machines Corporation, 2007. All rights reserved.


The opinions, solutions, and advice in this article are from the author’s experiences and are not intended to represent official communication from IBM or an endorsement of any products listed within. Neither the author nor IBM is liable for any of the contents in this article. The accuracy of the information in this article is based on the author’s knowledge at the time of writing.

Mobile Site | Full Site