Normalizing an Existing Database

Tuesday Mar 6th 2001 by Bruce Szabo

A follow-up to the earlier article 'Modifying an Existing Program,' this article examines the process of converting an existing database into a normalized design.

Introduction

Last week a Swynk author, David Faour, posted an article on database normalization. To make the database analysis for this project more interesting, David has been asked to comment on the normalization of this database. While this is not a threaded discussion, his remarks should provide some additional insight and give you an outside perspective on the strengths and weaknesses of the design. David's recent article on database normalization is a good introduction to this one.

The Normalization Process

This is the second article in a series on the process of redesigning a program. As mentioned in the previous article (Modifying an Existing Program (Overview)), the starting point in normalizing the current database was to analyze the existing program and database. Because the program is old and has been updated many times, there is no programmer available to discuss design issues. The new database design will therefore be based on user input, the existing program, and the existing database.

The original user interface was built in Visual FoxPro and consisted of 17 tabs. The tabs organized the information by common characteristics but gave the interface a cluttered appearance, and in many cases a single information category had to span more than one tab to fit all the relevant fields. The interface updated a FoxPro database. That database is not in first normal form: it has many repeating address fields across multiple tables, and the tables are not limited to data relevant to one area. In some cases a single table contains data from many categories.
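The repeating-field problem can be sketched in miniature. The column names below are hypothetical (the article does not list the old FoxPro fields), and SQLite stands in for the original backend; the point is only the contrast between repeating address columns and a first-normal-form row-per-address design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical sketch of the old, unnormalized layout: repeating address
# columns fix the number of addresses a firm can have.
cur.execute("""
    CREATE TABLE FirmFlat (
        FirmName      TEXT,
        BillAddress   TEXT,
        ShipAddress   TEXT,
        ReturnAddress TEXT   -- a fourth address kind would mean a new column
    )
""")

# First-normal-form alternative: one row per firm/address pairing.
cur.execute("CREATE TABLE Firm (FirmID INTEGER PRIMARY KEY, FirmName TEXT)")
cur.execute("""
    CREATE TABLE Address (
        FirmID INTEGER REFERENCES Firm(FirmID),
        Usage  TEXT,    -- 'billing', 'returns', ...
        Street TEXT
    )
""")
cur.execute("INSERT INTO Firm VALUES (1, 'Acme Books')")
cur.executemany("INSERT INTO Address VALUES (1, ?, ?)",
                [("billing", "1 Main St"), ("returns", "2 Dock Rd")])

# A new address kind is just another row, not a schema change.
rows = cur.execute(
    "SELECT Usage, Street FROM Address WHERE FirmID = 1 ORDER BY Usage"
).fetchall()
print(rows)
```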

The primary table in the new database is the firm table (tblFirm). This table contains the unique information for each firm. Although a firm's address would normally live in the same table, in this situation it became clear that each firm could be linked to a number of different addresses, depending on the use, or role, of each address. The role table became the basis for organizing the addresses, contacts, phone numbers, email addresses, and web addresses. A table was created for each of these categories (tblContact, tblAddresses, tblEmail, tblWeb, tblNumbers), and each was linked to the firm table (tblFirm) and the role table (tblRoleLookup). A diagram of the entire database depicts these relationships; they appear on the left side of the figure.
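A minimal sketch of this arrangement, using the table names from the article but guessed column names (the article does not list the columns), with SQLite standing in for the actual backend:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE tblFirm       (FirmID INTEGER PRIMARY KEY, FirmName TEXT);
    CREATE TABLE tblRoleLookup (RoleID INTEGER PRIMARY KEY, RoleDesc TEXT);
    -- Each category table (addresses shown here) links a firm and a role.
    CREATE TABLE tblAddresses (
        AddressID INTEGER PRIMARY KEY,
        FirmID    INTEGER REFERENCES tblFirm(FirmID),
        RoleID    INTEGER REFERENCES tblRoleLookup(RoleID),
        Street    TEXT, City TEXT, State TEXT, Zip TEXT
    );
""")
cur.execute("INSERT INTO tblFirm VALUES (1, 'Acme Books')")
cur.executemany("INSERT INTO tblRoleLookup VALUES (?, ?)",
                [(1, 'Orders'), (2, 'Returns')])
cur.executemany(
    "INSERT INTO tblAddresses VALUES (?, 1, ?, ?, 'Springfield', 'IL', '62701')",
    [(1, 1, '1 Main St'), (2, 2, '2 Dock Rd')])

# All addresses for a firm, labeled by the role each address plays:
rows = cur.execute("""
    SELECT r.RoleDesc, a.Street
    FROM tblAddresses a
    JOIN tblRoleLookup r ON r.RoleID = a.RoleID
    WHERE a.FirmID = 1
    ORDER BY r.RoleID
""").fetchall()
print(rows)
```

The same FirmID/RoleID pair of foreign keys would appear in tblContact, tblEmail, tblWeb, and tblNumbers.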

The next type of information was options and policies. There are three options tables; order, copyright, and returns (tblOrder, tblCopyRight, tblReturns). The three tables organize the option fields by category, and each has a one-to-one relationship with the firm table (tblFirm). A policy table exists for discounts, returns, and copyrights (tblDiscountPolicies, tblReturnPolicies, tblCopyRightPolicies); each has a many-to-one relationship with the firm table (tblFirm). This many-to-one relationship dictated a separate table for this type of information, instead of adding the fields to the corresponding options table. The added benefit is that if a new policy needs to be added, it affects neither the table structure nor the program structure.
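The distinction between the two relationship kinds can be shown schematically. Column names here are illustrative, not from the article; in SQLite a one-to-one relationship can be enforced by making the foreign key the primary key, so at most one options row exists per firm, while the policy table allows any number of rows per firm.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE tblFirm (FirmID INTEGER PRIMARY KEY, FirmName TEXT);
    -- One-to-one: FirmID is the primary key, so one options row per firm.
    CREATE TABLE tblReturns (
        FirmID INTEGER PRIMARY KEY REFERENCES tblFirm(FirmID),
        AcceptsReturns INTEGER
    );
    -- Many-to-one: any number of policy rows may point at one firm.
    CREATE TABLE tblReturnPolicies (
        PolicyID   INTEGER PRIMARY KEY,
        FirmID     INTEGER REFERENCES tblFirm(FirmID),
        PolicyText TEXT
    );
""")
cur.execute("INSERT INTO tblFirm VALUES (1, 'Acme Books')")
cur.execute("INSERT INTO tblReturns VALUES (1, 1)")
cur.executemany(
    "INSERT INTO tblReturnPolicies (FirmID, PolicyText) VALUES (1, ?)",
    [("90-day window",), ("Original packaging required",)])

# A new policy is a new row, not a new column:
n = cur.execute(
    "SELECT COUNT(*) FROM tblReturnPolicies WHERE FirmID = 1").fetchone()[0]
print(n)
```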

Of the remaining tables, tblOrderLanguage, tblOrderCode, and tblOrderClassification are join tables that relate order information to tblLanguageType, tblCodeType, and tblClassificationType, respectively. The join tables implement the many-to-many relationships between the orders table and the other three tables. The tblOrderISBNImprint table has a many-to-one relationship with tblOrder and holds the ISBN information for each firm.
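One of these join tables can be sketched as follows, again with guessed column names and SQLite as a stand-in. The composite primary key on the join table is an assumption; it prevents the same order/language pairing from being recorded twice.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE tblOrder        (OrderID INTEGER PRIMARY KEY);
    CREATE TABLE tblLanguageType (LanguageID INTEGER PRIMARY KEY,
                                  Language TEXT);
    -- Join table: each row pairs one order with one language.
    CREATE TABLE tblOrderLanguage (
        OrderID    INTEGER REFERENCES tblOrder(OrderID),
        LanguageID INTEGER REFERENCES tblLanguageType(LanguageID),
        PRIMARY KEY (OrderID, LanguageID)
    );
""")
cur.executemany("INSERT INTO tblOrder VALUES (?)", [(1,), (2,)])
cur.executemany("INSERT INTO tblLanguageType VALUES (?, ?)",
                [(1, 'English'), (2, 'Spanish')])
# Many-to-many: order 1 has two languages; Spanish appears on two orders.
cur.executemany("INSERT INTO tblOrderLanguage VALUES (?, ?)",
                [(1, 1), (1, 2), (2, 2)])

langs = [r[0] for r in cur.execute("""
    SELECT lt.Language
    FROM tblOrderLanguage ol
    JOIN tblLanguageType lt ON lt.LanguageID = ol.LanguageID
    WHERE ol.OrderID = 1
    ORDER BY lt.LanguageID
""")]
print(langs)
```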

The last set of tables is used by the program itself. The tblChangeTracking and tblModification tables record when, where, and by whom the data was modified. This can be used to resolve any questions about when data was updated and who updated it.

Although the data is normalized, it is not in third normal form. A third normal form database is the ideal, but in some situations denormalization can be used to help optimize queries against the database. The following are some of the areas where the data is not fully normalized, along with the reasons for the denormalization. The true test of whether denormalization makes sense lies in the use and maintenance of not only the database but also the program the database supports.

Denormalization

During the design process it became clear the database could be further normalized. Some candidate areas for further normalization were the addresses, policies, and options. Further denormalization could also have occurred in other areas. The following sections cover areas where each of these two processes could have been applied.

Further Normalization

Under the current database design it is possible to have the same address, contact, email, web, and phone information for each role in a given firm. This would cause a lot of duplicate data. Knowing, however, that in most cases not all of this information is repeated for each role led to the conclusion that the database could be left denormalized. For example, there may be a returns department address but no contact for the returns department. There could, however, be a customer service agent, or even multiple agents, for whom perhaps only a toll-free phone number is kept. In this case a role is created for each agent; the agent's name is kept in the contact table while the toll-free phone number is stored in the number table. Address information could be added to the database, but it may not be required by the process the database supports.

The policies could also have all been stored in one table, with a descriptive lookup type for each policy. For programming purposes, however, it seemed easier to have three policy tables rather than one table containing all the policies. When reporting, the data can be kept together simply by adding or removing tables from the query, and in the event all policies are needed, a union query will combine the tables.
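Such a union query might look like the following sketch. The policy text and the two-column shape of the tables are assumptions for illustration; the literal type column added to each SELECT preserves, in the combined result, which table each row came from.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE tblDiscountPolicies  (FirmID INTEGER, PolicyText TEXT);
    CREATE TABLE tblReturnPolicies    (FirmID INTEGER, PolicyText TEXT);
    CREATE TABLE tblCopyRightPolicies (FirmID INTEGER, PolicyText TEXT);
""")
cur.execute("INSERT INTO tblDiscountPolicies VALUES (1, '10% over 50 units')")
cur.execute("INSERT INTO tblReturnPolicies VALUES (1, '90-day window')")
cur.execute("INSERT INTO tblCopyRightPolicies "
            "VALUES (1, 'Permissions via publisher')")

# UNION ALL stitches the three per-category tables into one result set.
rows = cur.execute("""
    SELECT 'discount' AS PolicyType, PolicyText
        FROM tblDiscountPolicies WHERE FirmID = 1
    UNION ALL
    SELECT 'return', PolicyText
        FROM tblReturnPolicies WHERE FirmID = 1
    UNION ALL
    SELECT 'copyright', PolicyText
        FROM tblCopyRightPolicies WHERE FirmID = 1
""").fetchall()
print(rows)
```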

The options tables could potentially have been combined into a more generic structure, similar to how the policies might have been further normalized. It made less sense to do so, however, as the option data falls nicely into the current tables.

Further Denormalization

The address, contact, email, web, and phone information could have been further denormalized by combining these tables into one. One of the major problems with this denormalization would have been the phone numbers: if another phone number type were added, it would have required adding a field to the table. Combining the other four tables may have presented less severe problems, but it would have resulted in a number of blank fields in each record.
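The phone-number point is the same one-column-per-kind trap as the old address fields. In the kept design the kind of number is data rather than a column name, so a new kind needs no schema change; the column names below are, again, guesses rather than the article's actual layout.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Row-per-number design: the number's kind is a value, not a column.
cur.execute("""
    CREATE TABLE tblNumbers (
        FirmID     INTEGER,
        NumberType TEXT,   -- 'voice', 'fax', 'toll-free', ...
        Number     TEXT
    )
""")
cur.executemany("INSERT INTO tblNumbers VALUES (1, ?, ?)",
                [("voice", "217-555-0100"), ("fax", "217-555-0101")])

# A brand-new kind of number needs no ALTER TABLE, just a new row:
cur.execute("INSERT INTO tblNumbers VALUES (1, 'toll-free', '800-555-0102')")

count = cur.execute(
    "SELECT COUNT(*) FROM tblNumbers WHERE FirmID = 1").fetchone()[0]
print(count)
```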

The policies could have been added to their respective options tables, but this again would have meant modifying the table structure to accommodate new policies. Trying to avoid modifying table structure while accommodating future data led to the current table design.

Conclusions

When normalizing and denormalizing, one finds there are pros and cons to both approaches. While programming, it becomes evident why many databases are denormalized: in some cases the original programming for a denormalized database is easier. Developing programs that make use of normalized data, however, leads to code that is reusable in many more situations. This fact alone is enough to fight through the normalization and denormalization battle.

For another perspective, check out David's comments on the normalization of this database, found on the next page of this article.

Another Point of View

One of the Swynk authors, David Faour, volunteered to comment on the process I had undertaken. I sent David a copy of the first two pages of this article and the database layouts. Below are some of his comments, taken from an email he sent to me; you can read some of David's other articles here. With only a few grammatical corrections, here are his comments.

David's Comments

It looks pretty good in general. However, I do have some comments on tblAddresses, tblContact, and tblEmail.

In tblAddresses, I am not sure what you are trying to capture in Address1, Address2, and Address3. Are you doing it to split the address into three input boxes on a form? Also, a street address can be further decomposed into street number, street name, street type (Ave, Blvd, St, Rd), and apartment/suite number. This depends on what your lowest atomic level is.

City, state, zip, and country should technically be lookups (foreign keys) to a tblCity, tblState, tblZip, and tblCountry. Suppose you have free text typed into the database, and somebody accidentally enters FK for the state Florida instead of FL, a simple typo because K and L sit together on the keyboard. When you do a DELETE ... WHERE state = 'FL', the FK row is still there. In addition, a SELECT ... WHERE state = 'FL' will miss the row. The same applies to cities, countries, and zip codes.
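David's point can be demonstrated in a few lines. Below, a hypothetical state lookup table with a foreign key constraint makes the 'FK' typo impossible to store at all; note that SQLite, used here purely as a stand-in, only enforces foreign keys after the `PRAGMA foreign_keys = ON` statement.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection
cur = con.cursor()
cur.executescript("""
    CREATE TABLE tblState (StateCode TEXT PRIMARY KEY);
    CREATE TABLE tblAddresses (
        AddressID INTEGER PRIMARY KEY,
        City  TEXT,
        State TEXT REFERENCES tblState(StateCode)
    );
""")
cur.execute("INSERT INTO tblState VALUES ('FL')")
cur.execute("INSERT INTO tblAddresses (City, State) "
            "VALUES ('Miami', 'FL')")          # valid state: accepted

fk_rejected = False
try:
    # 'FK' is the typo from the text: K sits next to L on the keyboard.
    cur.execute("INSERT INTO tblAddresses (City, State) "
                "VALUES ('Tampa', 'FK')")
except sqlite3.IntegrityError:
    fk_rejected = True                          # the typo never gets stored
print(fk_rejected)
```

With free-text state columns there is no such constraint, and the bad row silently escapes every `WHERE state = 'FL'` query.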

Also, I'm not sure that firmID and the lookup ID are appropriate in this table... it seems that address should be a table with its own addressID, and there might be a cross-reference table of firmAddresses (firmID, addressID), because a company might have many locations...

In tblContact, the title should also not be free text; it should be a lookup to tblTitle (ID, description). This is because someone whose title is "Dr." may have their title mistakenly input as "Fr.", and you would have the same situation as above with the states.

Lastly, with tblEmail, if you wish, you could decompose the email address into three parts: name, server name, and server type (.com/.net/.org/.gov/etc.). The server name would be AOL, MSN, etc., and the name of course would be yourname@. If you go with this setup, one might say that the server type should also be a lookup to the valid server types, and that a server type table should be created.

Keep in mind that these comments are directed toward a very strict, formal interpretation of 3NF; of course, you will make decisions to denormalize based upon business requirements.
