A Case for (Occasional) Denormalization of MySQL Data

Denormalized (sung to the tune of Dehumanized by Symphony X)

It’s denormalized
Don’t you realize
It’s denormalized
And optimized
You forbade me
You got angry
Denormalized!

Normalization is the process of applying a series of rules to ensure that your database achieves an optimal structure. Each normal form is a progression of these rules, with each successive normal form achieving a better database design than the one before it.  The de facto standard, called Third Normal Form, or 3NF, is considered so crucial to proper database design that normalization rules are often treated like dogma rather than mere guidelines.  Such slavish adherence to the rules leads to over-engineering, the evidence of which includes inflexible schemas and cumbersome data retrieval.  Therefore, I would like to suggest that there are some instances where denormalization may be called for.

I understand that pointing out any shortcomings of normalization is akin to insulting one’s religion, but I hope that you’ll hear me out before submitting those angry emails and/or comments.  There are many paradigms that don’t work one hundred percent of the time.  Take Extreme Programming (XP) (is he really going there?!). It has proven itself to be an excellent methodology on small, agile projects where bureaucracy is not too pervasive.  In other environments, such as government, the layers of red tape and the sheer number of stakeholders have made XP difficult, if not nearly impossible, to pull off. The lesson to take away is that the context in which a system is used has a sizeable impact on its success.

Two Cases for Denormalization!

There are at least a couple of very good reasons to denormalize one’s data.  The two most accepted reasons are:

  • To improve query performance
  • To help maintain historical information

The rest of this article will go on to describe how to achieve the above goals using denormalization.

Improving Query Performance

I reported on one scenario where denormalization would help in my Clearing a Path through the 3NF Join Jungle article.  In it, I raised the following point about normalization, which bears repeating here:

Optimizing a database for storage automatically makes it unoptimized for retrieval.

Denormalization is also referred to as “controlled redundancy”.  I like that term because it underscores the notion that DBAs who do this are well aware of what they are doing: they know what the normalization rules are, they know why the rules exist, they know what the consequences of breaking each rule are, and they can tell you exactly why they want to break a particular one.

Any time you have a frequently run SELECT query that involves data spread amongst many tables, you should consider that data a candidate for denormalization.  A good place to start is reporting.  Reporting jobs are the ones that can take a heck of a long time to run and hog system resources in the process.

Within those queries, there are many linked data entities that can be amalgamated in order to speed up retrieval.  For instance, people and their associated addresses are commonly stored in separate tables because a person can have more than one address and the latter also tend to change over time.  However, storing the two in the same table can markedly improve query execution time. 
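To make that concrete, here’s a rough sketch of the difference, using hypothetical person, address, and person_denorm tables of my own invention:

```sql
-- Normalized: every report query pays for the join
SELECT p.given_name, p.surname, a.street, a.city, a.postal_code
  FROM person p
  JOIN address a ON a.person_id = p.person_id
 WHERE a.city = 'Ottawa';

-- Denormalized: the same data lives in a single row, so no join is needed
SELECT given_name, surname, street, city, postal_code
  FROM person_denorm
 WHERE city = 'Ottawa';
```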

If you’re not keen on denormalizing your base tables, consider creating a separate one just for reporting purposes.  In my department, we made so many changes to the tables that we wound up creating a whole new database dedicated solely to report generation!  And I’ll tell you what, it was pretty fast!  As a bonus, the main data input database reaped the benefits of no longer being hampered by the long-running report queries.
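If you do go the separate-table (or separate-database) route, something along these lines will get you started.  The table and column names are illustrative only, and in practice you would rebuild or refresh the reporting table on a schedule:

```sql
-- Build the reporting table from the normalized source tables
CREATE TABLE report_customer_address AS
SELECT p.person_id,
       p.given_name,
       p.surname,
       a.street,
       a.city,
       a.postal_code
  FROM person p
  JOIN address a ON a.person_id = p.person_id;

-- Index the columns that the reports actually filter on
CREATE INDEX idx_report_city ON report_customer_address (city);
```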

Derivable Data

Although there is a pervasive rule of database storage that thou shalt never store calculable values, it may be worth risking a trip to database purgatory so that your SELECT queries are spared from performing processor-intensive calculations every time they run.  Just be careful: any time the underlying values that make up the calculated value change, you must also update the stored derived data, or inconsistent and/or stale data will result.
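As a sketch of what keeping derived data in sync might look like, the trigger below maintains a stored order_total column whenever a line item is added.  The orders and order_items tables are hypothetical, and a real implementation would need matching UPDATE and DELETE triggers as well:

```sql
-- Keep the denormalized total on the parent order in step with its line items
CREATE TRIGGER trg_order_items_after_insert
AFTER INSERT ON order_items
FOR EACH ROW
  UPDATE orders
     SET order_total = order_total + (NEW.quantity * NEW.unit_price)
   WHERE order_id = NEW.order_id;
```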

Maintaining Historical Information

There are many reasons that history needs to be preserved. One of the most common is when a woman changes her name in marriage.  Without a history, there would be no way to know that the two names refer to the same person.

Usually, tracking the history of an entity means a lot of datetime fields, as the following example shows.  It lists the records for a woman who is a big believer in marriage and in adopting the surname of her husband.  The Social Security Number is enough to identify an individual because it is unique to each person, but it doesn’t tell us anything about her (or his; men have also been known to change their names) naming history:

Customer Table

| RecID | Given | Surname   | DateCreated | DateEnd   | Description                      | Social Security Number |
|-------|-------|-----------|-------------|-----------|----------------------------------|------------------------|
| 99    | Nancy | Burger    | 01-Jan-04   |           | Prior to Wedding                 | 987-65-4320            |
| 100   | Nancy | Burger    | 01-Jan-04   | 01-Jun-04 | Ends Maiden Name                 | 987-65-4320            |
| 101   | Nancy | Lepzowitz | 01-Jun-04   |           | Adopts New Surname               | 987-65-4320            |
| 102   | Nancy | Lepzowitz | 01-Jun-04   | 12-Dec-05 | Ends Marriage                    | 987-65-4320            |
| 103   | Nancy | Morgan    | 12-Dec-05   |           | Remarries and adopts new Surname | 987-65-4320            |
| 104   | Nancy | Morgan    | 12-Dec-05   | 06-Jun-06 | Ends Marriage                    | 987-65-4320            |
| 105   | Nancy | Summers   | 06-Jun-06   |           | Remarries and adopts new Surname | 987-65-4320            |
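For reference, a customer table along these lines would hold the records shown above.  The column types are my own guesses; a NULL DateEnd marks the record that is currently in effect:

```sql
CREATE TABLE customer (
  RecID       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  Given       VARCHAR(50)  NOT NULL,
  Surname     VARCHAR(50)  NOT NULL,
  DateCreated DATE         NOT NULL,
  DateEnd     DATE         NULL,             -- NULL means this name is still current
  Description VARCHAR(100) NULL,
  SSN         CHAR(11)     NOT NULL,         -- e.g. '987-65-4320'
  KEY idx_customer_ssn (SSN)
);
```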

Now if we had an orders table, we would typically include a foreign key to the customer table, such as the SSN in this case.  However, the SSN alone tells us nothing about the customer’s name at the time of the order unless we do some kind of cross-checking against the order date.  The solution is to include the customer’s name, as it stands when the order is placed, in the orders table itself so that we can always refer to the correct name.
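Here’s one way that might look, continuing with the hypothetical customer table sketched above.  The orders table carries a copy of the name that was current when the order was placed:

```sql
CREATE TABLE orders (
  OrderID    INT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
  SSN        CHAR(11)      NOT NULL,          -- still links back to the customer
  OrderDate  DATE          NOT NULL,
  Given      VARCHAR(50)   NOT NULL,          -- copied from customer at insert time
  Surname    VARCHAR(50)   NOT NULL,
  OrderTotal DECIMAL(10,2) NOT NULL
);

-- Copy the customer's current name when the order is recorded
INSERT INTO orders (SSN, OrderDate, Given, Surname, OrderTotal)
SELECT c.SSN, CURDATE(), c.Given, c.Surname, 149.95
  FROM customer c
 WHERE c.SSN = '987-65-4320'
   AND c.DateEnd IS NULL;    -- the open-ended record holds the current name
```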

Conclusion

Remember that denormalization is not the opposite of normalization. Just as getting to 3NF is an iterative process on the road to a complete design, so is denormalization. The best approach is to work through the analysis and design to get to 1NF, 2NF, and 3NF first. Then, depending on the application and several other factors, denormalization is an optional additional step in the design process.

Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.
