A followup to the earlier article 'Modifying an Existing Program,' this article examines the process involved in converting an existing database into a normalized representation.
Last week a Swynk
author posted an article on database normalization. To make the database analysis for this project more
interesting David Faour
has been asked to comment on the normalization of the database. Without getting into a threaded discussion
it should provide some additional insight. This should allow you to get an opinion on the
strengths and weaknesses of the design from an outside perspective. David's
on database normalization is a good introduction to this article.
The Normalization Process
This is the second article in the series on the process of redesigning a program.
As mentioned in the previous article (Modifying an Existing Program (Overview)) the
starting point in normalizing the current
database was to analyze the existing program and database. Because the program is old and has
been updated many times
there is no programmer available to discuss design issues. The new database design will be based
on user input, the existing program, and the existing database. The
original user interface
was built using visual FoxPro and consisted of 17 tabs. The tabs allowed the information to be organized based
on common characteristics but gave a cluttered appearence to the user interface.
In addition, in many cases information categories had
to use more than one tab to fit all the relevant information. The interface updated a
FoxPro database. The database is not first normal with
the many repeating address fields in the multiple tables. In addition the
tables do not contain data relavant to
one area. In some cases the tables contain data from many categories.
The primary table in the new database is the firm table (tblFrim). This table contains the
unique information for each firm. Although the address of the firm would normally be in the same table in this
situation it became clear that each firm could potentially be linked to a number of different addresses. The
addresses a firm could be linked to was based on the use of the address or role. The role table became the basis
for organizing the addresses, contacts, phone numbers, email addresses and web addresses. A table was created for
each of these categories (tblContact, tblAddresses,tblEmail,tblWeb,tblNumbers) and the tables were linked
to the firm table (tblFirm) and the role table (tblRoleLookup).
A diagram of the entire database depicts these relationships they are on the left
side of the figure.
The next type of information were options and policies. There were three options tables; order, copyright and returns
The three tables allowed organized fields by category. Each of these three option tables
had a one to one relationship with the firm table(tblfirm). A policy table exists for discounts, returns and copyrights
(tblDiscountPolicies, tblReturnPolicies, tblCopyRightPolicies). Each table has a many to one relationship with the firm table (tblFirm).
This many to one relationship dictated a separate table for this type of information, instead of adding the information to
the appropriate options table. The added benefit is if a
different policy needs to be added it will not affect table structure nor program structure.
Of the remaining tables tblOrderLanguage, tblOrderCode, tblOrderClassifiction are join tables that relate information
from orders to tblLanguageType, tblCodeType, and tblClassificationType, respectively. The join tables allow the many
to many relationships between the orders table the other three tables. The thlOrderISBNImprint has a many to one
relationship with tblOrder and is used to hold ISBN information for each firm.
The last set of tables is going to be used by the program. The tblChangeTracking and tblModification tables hold
information about when, where and who made modifications to the data. This can be used to resolve any questions
about when data was updated and who updated the data.
Although the data is normalized it is not in third normal form. A third normal database is the ideal in
some situations denormalization can be used to help optomize queries on the database. The following are some of
the areas where the data is not normalized and the reasons for denormalization. The true test to whether
denormalization makes sense is in the use and maintenance of not only the database but also the program the
In the design process it became clear the database could be further normalized. Some areas for normalization were
the addresses, policies, and options. Also, denormalization could have occured in other areas. The following sections
will cover areas where these two processes could have occurred.
Under the current database design it is possible to have the same address, contact, email, web and phone information
for each role in a given firm. This would cause a lot of duplicate data. Knowing in most cases, however that not
all this information is repeated for each role lead to the conclusion the database could be denormalized. For example,
There may be a returns department address but there is no contact for the returns department. There could, however,
be a customer service agent or even multiple agents. For the agents it is possible only a toll free phone number will
be kept. In this case a role is created for agent1 or multiple agents and their name is kept in the contact
table while the toll free phone number is stored in the number table. In this case address information could
be added to the database but it may not be required by the process the database supports.
The policies could also have all been stored in one table. Each policy could have had a descriptive lookup type. It
seemed as though it would be easier to have three policy tables for programming purposes rather than one table containing
all the policies. When reporting information the data will be able to be kept together by adding or removing tables
from the query. In the event all policies are needed a union query will allow the tables to be combined.
The options tables could potentially have been combined to form a more generic classification similar to how policies
may have been further normalized. It probably made less sense to further normalize the options as the
data seems to fall nicely into the current tables.
The address, contact, email, web and phone information could have been further denormalized. It would have been possble to
combine these tables into one table. One of the major problems with this denormalization would have been phone numbers.
If another phone number type was added it would have required adding a field to the table. Combining the other four
tables may have presented less severe problems but would have resulted in a number of blank fields for each record.
The policies could have been added to the respective options table but this again would have meant modifing table
structure to accomodate new policies. Trying to avoid modifing table structure while accomodating future data lead
to the current table design.
One finds when normalizing and denormalizing there are pros and cons to both approaches. While programming it becomes
evident why many databases are denormalized. In some cases the original programming for a denormalized database is
easier. Developing programs that make use of normalized data, however, leads to code that is reusable in many more
situations. This fact alone is enough to fight through the normalization and denormalization battle.
For another perspective, check out David's comments on the normalization of this database, found on the next page of this article.
Another Point of View
One of the Swynk author's (David Faour) volunteered to comment on the process I had undertaken.
I sent David a copy of the
first two pages of this article and the database layouts. Below are some of his comments.
One can read some of Davids'
other articles here. I have taken Davids' comments
from an email he sent to me. With only a few grammatical corrections here are his comments.
It looks pretty good in general. However, I do have some comments on
tblAddresses, tblContact, and tblEmail.
In tblAddresses, I am not sure what you are trying to capture in Address1,
Address2, Address3. Are you doing it to split into three input boxes on a
form? Also street address can be further decomposed into street number,
street type(ave,blvd,st,rd), and Apt/Suite number. This depends on what your
lowest atomic level is.
City, state, zip, and country should technically be lookups(foreign key) to
a tblCity, tblState, tblZip, and tblCountry.
Suppose you have free text typed into the DB, and somebody accidentally
uses FK for the state Florida instead of FL, just a simply typo because k &
l are together on the keyboard. when you do a delete from where state=FL,
the FK row is still there. In addition, a select from where state=FL will miss the
row; the same thing applies to cities, countries, and zip
Also, i'm not sure that firmID and lookup ID are appropriate in this
table... it seems that address should be a table with its own addressID and
there might be a cross reference table of firmAddresses (firmID,addressID)
because a company might have many locations...
In the tblContact, the title should also not be a free text, and should be a
lookup to tblTitle(id, description)
this is because someone who's title is "Dr." may have their title mistakenly
input as "Fr." and you would have the same situation as above with states.
Lastly, with tblEmail if you wish, you could decompose email address into 3
parts name, server name, servertype(.com/.net/.org/.gov/etc).
Server name would be AOL, MSN, etc and name of course would be yourname@.
Now, if you go for this setup, one might say that the servertype should also
be a lookup to the valid servertypes and that a servertype table should be
Keep in mind that these comments are directed towards a very strict, formal
interpretation of 3NF, and of course you will make decisions to
denormalize based upon business requirements