Microsoft Access 2010 has a collection of wizards to lead you step-by-step through each process involved in developing and using a production-grade database application. ' Exploring Relational Database Theory and Practice ' is extracted from ' Microsoft Access 2010 In Depth', published by Que.
Moving from Spreadsheets to Databases
Word processing and spreadsheet
applications were the engines that drove the fledgling personal computer
market. In the early PC days, WordPerfect and Lotus 1-2-3 dominated the
productivity software business. Today, most office workers use Microsoft Word
and Excel on a daily basis. Its probably a safe bet that more data is stored
in Excel spreadsheets than in all the worlds databases. Its an equally good
wager that most new Access users have at least intermediate-level spreadsheet
skills, and many qualify as Excel power users.
Excel 2010s Data ribbon offers
elementary database features, such as sorting, filtering, validation, and data
entry forms. You can quickly import and export data in a variety of formats,
including those of database management applications, such as Access. Excels
limitations become apparent as your needs for entering, manipulating, and
reporting data grow beyond the spreadsheets basic row-column metaphor.
Basically, spreadsheets are list managers; its easy to generate a simple name
and address list with Excel. If your needs expand to contact management and
integrating the contact data with other information generated by your
organization, a spreadsheet isnt the optimal approach.
The first problem arises when
your contacts list needs additional rows for multiple persons from a single
company. You must copy or retype all the company information, which generates
redundant data. If the company moves, you must search and replace every entry
for your contacts at the firm with the new address. If you want to record a
history of dealings with a particular individual, you add pairs of date and
text columns for each important contact with the person. Eventually, you find
yourself spending more time navigating the spreadsheets rows and columns than
using the data they contain.
Contact lists are only one
example of problems that arise when attempting to make spreadsheets do the work
of databases. Tracking medical or biological research data, managing consulting
time and billings, organizing concert tours, booking artist engagements, and
myriad other complex processes are far better suited to database than
Moving to a relational database
management system (RDBMS), such as Access, solves data redundancy and
navigation problems and greatly simplifies updating existing information. After
you understand the basic rules of relational database design, Access makes
creating highly efficient databases quick and easy. Access 2010 has a
collection of wizards to lead you step-by-step through each process involved in
developing and using a production-grade database application. Unfortunately, no
Relational Wizard exists to design the underlying database structure for you,
but youll find a wealth of pre-built database templates in the Backstage
pages New tab. (Click the ribbons File tab to open the new Backstage page.)
If your goal is learning
relational database fundamentals, start with Access 2010. Access is by far the
first choice of universities, colleges, trade schools, and computer-training
firms for courses ranging from introductory data management to advanced
client/server database programming. The reason for Accesss popularity as a
training platform is its unique combination of initial ease of use and support
for advanced database application development techniques
Reliving Database History
Databases form the foundation of
world commerce and knowledge distribution. Without databases, there would be no
World Wide Web, automatic teller machines, credit/debit cards, or online
airline reservation systems. Newsgathering organizations, research
institutions, universities, and libraries would be unable to categorize and
selectively disseminate their vast store of current and historical information.
Its difficult to imagine today a world without a network of enormous
databases, many of which probably contain a substantial amount of your personal
data that you dont want to be easily available to others.
Jim Grays article,
Data Management: Past, Present, and Future, which is available as a Microsoft
Word document at http://research.microsoft.com/~gray/DB_History.doc, offers a
more detailed history of data processing systems. Dr. Gray was a senior
researcher and the manager of Microsofts Bay Area Research Center (BARC) until
early 2007, when he became lost at sea while sailing off the California coast
The Early History of Databases
The forerunner of todays
databases consisted of stacks of machine-readable punched cards, which Herman
Hollerith used to record the 1890 U.S. census. Hollerith formed the
Computing-Tabulating-Recording Company, which later became International
Business Machines. From 1900 to the mid-1950s, punched cards were the primary
form of business data storage and retrieval, and IBM was the primary supplier
of equipment to combine and sort (collate) punched cards, and print reports
based on punched-card data.
The development of large
computer-maintained databasesoriginally called databanksis a postWorld War II phenomenon. Mainframes
replaced punched cards with high-capacity magnetic tape drives to store large
amounts of data. The first databases were built on the hierarchical and network
models, which were well suited to the mainframe computers of the 1950s.
Hierarchical databases use parent-child relationships to define data
structures, whose diagrams resemble business organization charts or an inverted
tree with its root at the top of the hierarchy. Network databases allow
relaxation of the rules of hierarchical data structures by defining additional
relationships between data items. Hierarchical and network databases ordinarily
are self-contained and arent easy to link with other external databases over a
Early databases used batch
processing for data entry and retrieval. Keypunch operators typed data from
documents, such as incoming orders. At night, other operators collated the
days batch of punched cards, updated the information stored on magnetic tape,
and produced reports. Many smaller merchants continue to use batch processing
of customers credit-card purchases, despite the availability of terminals that
permit almost instantaneous processing of credit- and debit-card transactions.
Hierarchical databases remain
alive and well in the twenty-first century. For example, data storage for Windows
2000s Active Directory and Microsoft Exchange Server is derived from the
hierarchical version of Accesss original relational Jet databases. The name
Jet comes from the original Access database engine called Joint
The Internets Domain Name System
(DNS) is a collection of hierarchical databases for translating character-based
Internet domain names into numerical Internet Protocol (IP) addresses. The DNS
database is called a distributed database, because its
data is held by a global network of thousands of computers.
The Relational Database Model
Dr. E. F. Codd, an employee of IBM Corporation,
published A Relational Model of Data for Large Shared Databanks in a journal
of the Association for Computing Machinery (ACM) in June 1970. A partial copy
of the paper is available at http://www.acm.org/classics/nov95/. Dr. Codds
specialty was a branch of mathematics called set theory, which includes the
concept of relations. He
defined a relation as a named set of tuples (records or rows) that have attributes (fields or columns). One of the attributes must
contain a unique value to identify each tuple. The common term for relation is
a table whose presentation to the user is similar to that of
Relational databases solve a
serious problem associated with earlier database types. Hierarchical and
network databases define sets of data and explicit links between each data set
as parent-child and owner-member, respectively. To extract information from
these databases, programmers had to know the structure of the entire database.
Complex programs in COBOL or other mainframe computer languages are needed to
navigate through the hierarchy or network and extract information into a format
understandable by users.
Dr. Codds objective was to
simplify the process of extracting formatted information and make adding or
altering data easier by eliminating complex navigational programming. During
the 1970s, Dr. Codd and others developed a comparatively simple language,
Structured Query Language (SQL), for creating, manipulating, and retrieving
relational data. With a few hours of training, ordinary database users could
write SQL statements to define simple information needs and bypass the delays
inherent in the database programming process. SQL, which was first standardized
in 1985, now is the lingua franca of
database programming, and all commercial database products support SQL.
Client/Server and Desktop RDBMSs
In the early database era, the
most common presentation of data took the form of lengthy reports processed by
centralized, high-speed impact printers on fan-folded paper. The next step was
to present data to the user on green-screen video terminals, often having small
printers attached, which were connected to mainframe databases. As use of
personal computers gained momentum, terminal emulator cards enabled PCs to
substitute for mainframe terminals. Mainframe-scale relational databases, such
as IBMs DB2, began to supplement and later replace hierarchical and network
databases, but terminals continued to be the primary means of data entry and
The most widely used SQL standard,
SQL-92, was published by the American National Standards Institute (ANSI) in
1992. Few, if any, commercial relational database management systems (RDBMSs)
today fully conform to the entire SQL-92 standard. The later SQL-99 (also
called SQL3) and SQL-200n specifications add new features that arent germane
to Access databases.
RDBMS competitors have erected an
SQL Tower of Babel by adding nonstandard extensions to the language. For
example, Microsofts Transact-SQL (T-SQL) for SQL Server, which is the subject
of Chapter 27, Moving from Access Queries to Transact-SQL, has many proprietary
keywords and features. Oracle Corporations Oracle:SQL and PL/SQL dialects also
have proprietary SQL extensions.
Oracle, Ingres, Informix,
Sybase, and other software firms developed relational databases for lower-cost
minicomputers, most of which ran various flavors of the UNIX operating system.
Terminals continued to be the primary data entry and display systems for
multiuser UNIX databases.
The next step was the advent of
early PC-based flat-file managers and relational database management systems.
Early flat-file database managers, typified by Jim Buttons PCFile for DOS
(1981) and Claris FileMaker for Macintosh (1988) and Windows (1992), used a
single table to store data and offered few advantages over storing data in a
spreadsheet. The early desktop RDBMSssuch as dBASE, Clipper, FoxBase, and
Paradoxran under DOS and didnt support SQL. These products later became
available in multiuser versions, adopted SQL features, and eventually migrated
to Windows. Access 1.0, which Microsoft introduced in November 1992, rapidly
eclipsed its DOS and Windows competitors by virtue of Accesss combination of
graphical SQL support, versatility, and overall ease of use.
PC-based desktop RDBMSs are
classified as shared-file systems because they store their data in conventional
files that multiple users can share on a network. One of Accesss initial
attractions for users and developers was its capability to store all
application objectsforms, reports, and programming codeand tables for a
database application in a single file, which used the earlier .mdb extension..
FoxPro, dBASE, Clipper, and Paradox require a multitude of individual files to
store application and data objects. Today, almost every multiuser Access application
is divided (split) into a front-end .accdb file, which contains application
objects and links to a back-end database .accdb file that holds the data. Each
user has a copy of the front-end .accdb file and shares connections to a single
back-end .accdb file on a peer Windows workstation or server.
Prior to Access 2000, Jet was
Accesss standard database engine, so the terms Access
database and Jet database were interchangeable.
Microsoft considered SQL Server to be its strategic
RDBMS for Access 2000 and 2003. Strategic means that
SQL Server gets continuing development funds and Jet doesnt. Jet 4.0, which
was included with Access 2003 and is a part of the Windows XP and later
operating systems, is the final version and is headed toward retirement.
Microsofts Access team decided to
enhance Jet 4.0 with the new features described in Chapter 1, Access 2010 for
Access 2007 Users: Whats New, change the file extension from .mdb to .accdb,
and drop all references to Jet. To reflect this change, this edition uses the
terms Access database and SQL Server
database. Unless otherwise noted, SQL
Server refers to all SQL Server 2005 editions except the Compact and
Client/server RDBMSs have an
architecture similar to Accesss front-end/back-end shared-file multiuser configuration.
What differentiates client/server from shared-file architecture is that the
RDBMS on the server handles most of the data-processing activity. The client
front end provides a graphical user interface (GUI) for data entry, display,
and reporting. Only SQL statements and the specific data requested by the user
pass over the network. Client/server databases traditionally run on network operating
systems, such as Windows and UNIX, and are much more robust than shared-file
databases, especially for applications in which many users make simultaneous
additions, changes, and deletions to the database. All commercial data-driven
Web applications use client/server databases.
This book uses the terms field and record when referring to
tables, and columns and rows
when discussing data derived from tables, such as the views and query result
sets described later in this chapter.
Since version 1.0, Access has
had the capability to connect to client/server databases by linking their
tables to an Access database. Linking lets you treat client/server tables
almost as if they were native Access tables. Linking uses Microsofts widely
accepted Open Database Connectivity (ODBC) standard, and Access 2010 includes
an ODBC driver for SQL Server and Oracle databases. You can purchase licenses
for ODBC drivers that support other UNIX or Windows RDBMSs, such as Sybase or Informix,
from the database supplier or third parties. Chapter 19, Linking Access Front
Ends to Access and Client/Server Databases, describes the process of linking
Access and Microsoft SQL Server 2008 databases. Although Chapter 19 uses SQL
Server for its examples, the linking procedure is the same foror at least
similar toother client/server RDBMSs.
Access data projects (ADP) and
the Microsoft SQL Server 2005 Express Edition combine to make Access 2010 a
versatile tool for designing and testing client/server databases and creating
advanced data entry and reporting applications. You can start with a
conventional Access database and later use Accesss Upsizing Wizard to convert
the .mdb file(s) to an .adp file that holds application objects and an SQL
Server 2005 back-end database. Access 2010s Upsizing Wizard has incorporated
many improvements to the Access 2000 and earlier wizard versions, but Access
2010s Wizard is the same as 2007s. Despite the upgraded wizardry, youre
likely to need to make changes to queries to accommodate differences between
Access and SQL Servers SQL dialects.
For an example of differences between
Access and SQL Server SQL syntax that affects the upsizing process, see
Displaying Data with Queries and Views, p. XXX (this chapter).
Defining the Structure of Relational Databases
Relational databases consist of
a collection of self-contained, related tables. Tables typically represent
classes of physical objects, such as customers, sales orders, invoices, checks,
products for sale, or employees. Each member object, such as an invoice, has
its own record in the invoices table. For invoices, the field that uniquely
identifies a record, called a primary key[field],
is a serial invoice number.
Figure 4.1 shows Accesss
Datasheet view of an Invoices table, which is based on the Northwind.mdb sample
databases Orders table. The InvoiceNo field is the primary key. Values in the
OrderID, CustomerID, EmployeeID, and ShipperID fields relate to primary key
values in Northwinds Orders, Customers, Employees, and Shippers tables. A
field that contains values equal to those of primary key values in other tables
is called a foreign key [field].
This simple Invoices table was
created from the Northwind Orders table and doesnt take advantage of Accesss
extended properties, such as the field captions, lookup fields, and
subdatasheets in the Datasheet view of the Orders table.
To learn more about primary keys in
Access tables, see Selecting a Primary Key, p. XXX (Chapter 5).
If you need information about a
particular invoice or set of invoices, open the Invoices table and search for
the invoice(s) by number (InvoiceNo) or another attribute, such as a customer
code (CustomerID), date (ShippedDate), or range of dates. Unlike earlier
database models, the user can access the Invoices table independently of its
related tables. No database navigation programming is needed. A simple, intuitive
SQL statement, SELECT * FROM Invoices, returns all the data in the table. The
asterisk (*) represents a request to display the contents of all fields of the
Removing Data Redundancy with Relationships
The Invoices table of Figure 4.1
is similar to a spreadsheet containing customer billing information. Whats
missing is the customer name and address information. A five-character customer
code (CustomerID) identifies each customer to whom the invoice is directed. The
CustomerID values in the Invoices table match CustomerID values in a modified
version of Northwinds Customers table (see Figure 4.2). Matching a foreign key
with a primary key value often is called a lookup operation. Using a key-based lookup operation eliminates the
need to repeatedly enter name, address, and other customer-specific data in the
Invoices table. In addition, if you change the customers address, the change
applies to all past and future invoices.
Foreign key values in the
Invoices table must match primary key values in the Customers table.
The Invoices table also
connects with other tables, which contain information on orders, sales department
employees, and the products ordered. Connections between fields of related
tables having common values are called relationships (not relations). Figure 4.3 shows Accesss
Relationships window displaying the relationships between the Invoices table
and the other tables of the Northwind sample database.
Accesss Relationships window
displays the relationships between the tables of the Northwind sample database,
plus the added Invoices table. Every relationship between these tables is
one-to-many. The many-to-many relationship between Products and Orders is an
Using derived key values, such as
alphabetic codes for Customer, is no longer in favor among database designers.
Most designers now use automatically generated numerical key valuescalled
Access AutoNumber or SQL Server identity fields. The
Northwind Orders and Products tables, among others, have primary keys that use
the AutoNumber data type. The Employees, Shippers, Products, and Suppliers
tables use AutoNumber keys to identify the persons or objects to which the
tables records refer. Objects that are inherently sequentially numbered, such
as checks, are ideal candidates for an AutoNumber key that corresponds to the
check number, as mentioned in Choosing Primary Key Codes later in this
Another method of generating
unique keys is by use of Globally Unique Identifiers (GUIDs), which also are
called Universally Unique Identifiers (UUIDs). GUIDs are 16-byte computed
binary numbers that are guaranteed to be unique locally and universally; no
other computer in the world will duplicate a GUID. SQL Servers
uniqueidentifier data type is a GUID. Because GUIDs cant represent a property
of an object, such as a check number, GUID keys are called surrogate
keys. You cant select a GUID data type in Accesss Table Design mode.
Relationships come in the
following three flavors:
- One-to-many relationships represent connections between a single
primary key value (the one side) and multiple instances of the same value in
the foreign key field (the many side). One-to-many relationships commonly are
identified by the number 1 and the infinity (∞) symbol, as in Figure 4.3.
All the direct relationships between the tables in Figure 4.3 are one-to-many.
One-to-manyalso called many-to-onerelationships are by far the most common.
- One-to-one relationships connect primary key values in two
tables. You might think that the relationship between the Orders and Invoices
tables could be one-to-one, but an order requires more than one invoice if one
or more items are backordered and then shipped later. One-to-one relationships
- Many-to-many relationships require three tables, one of which is
called a linking table.
The linking table must have two foreign keys, each of which has a many-to-one
relationship with a primary key in two related tables. In the example of Figure
4.3, the Order Details table is the linking table for the many-to-many
relationship between the Orders and Products tables. Many-to-many relationships
also are called indirect relationships.
There are many other indirect
relationships between the tables shown in Figure 4.3. For example, a
many-to-many relationship exists between the Suppliers and Orders tables. In
this case, Products and Order Details act as linking tables between the
Suppliers and Orders tables.
The Relationships window
displays the names of primary key fields in a boldface font. Notice in Figure
4.3 that the OrderID and ProductID field names are preceded by a key symbol.
The OrderID and ProductID fields compose a composite primary key, which uniquely identifies an order line item. You
cant repeat the same combination of OrderID and ProductID; this precaution
makes sense for products that have only one stock-keeping unit (SKU), such as
for Aniseed Syrup, which comes only in a carton of 12 550ml bottles.
Access 2010s multivalue field
feature automatically generates a hidden linking table under the covers.
Access 2007 introduced the multivalued field for compatibility with SharePoint
The Oakmont.accdb sample database file in the
\2010Samples\Oakmont folder of the downloadable code has a structure that
differs from that of Northwind.accdb, but the design principles of the two databases
are similar. OakmontSQL.mdf is an SQL Server 2008 database for use with ADP.
ADP uses a special set of toolscalled the project designer or da Vinci toolset in this bookfor designing and managing SQL Server
databases. The Oakmont files are course enrollment databases for a college.
Figure 4.4 shows the Database Diagram window for the OakmontSQL database. The
SQL Server Diagram window is similar to the Relationships window for Accesss
traditional Access databases. The key and infinity symbols at the ends of each
line represent the one and many sides, respectively, of the one-to-many
relationships between the tables. Access and SQL Server databases store information
on table relationships as an object within the database file.
The SQL Server Database Diagram
window for the OakmontSQL database shows one-to-many relationships between
primary key fields (identified by key symbols) and foreign key fields (infinity
This book uses the Access 2010
and SQL Server 2008 R2 versions of the Northwind and Oakmont sample databases
in almost all examples. The tables of the Oakmont database have many more
records than the Northwind tables. The large number of records in the Oakmont
database makes it better suited than Northwind for predicting the performance
of production Access and SQL Server database applications.
restriction prevents shared use of the Order Details table as an invoice line
items table. If you short-ship an order item on one invoice, you cant add
another record to the Order Details table when you ship the remaining quantity
of the item. Microsoft didnt add an Invoices table for Northwind Traders, probably
because of the complexity of dealing with backorders and drop-shipments.
Conforming to Table Design Rules
Designing tables for relational
databases follows a formalized procedure called normalization. Dr. Codd described the complete normalization process
in his 1972 paper Further Normalization of the Data Base Relational Model.
This paper isnt an easy read; its steeped in the language of set theory and
relational algebra. The sections that follow explain in common English the
application of the normalization process to Accesss Northwind database.
You normalize tables in a series
of steps called normal forms.
Applying the normalization process is necessary to move spreadsheet-style data
to relational tables. You also employ the normalization rules when designing a
new database or analyzing existing databases. In specific cases, however, you
might need to depart from strict adherence to normalization rules to retain a
history of data values that change over time or to improve performance of a
First Normal Form
First normal form
requires tables to be flat and have no repeating or potentially repeating
fields or groups of fields. A flat table is one in which every record has the same number of
fields. In addition, a single field cannot contain multiple data values.
Repeating fields must be moved to a related table. The first normal form is the
most important of the normalization steps. If all your tables dont meet the
rules of first normal form, you are in big trouble.
Northwinds Customers and
Suppliers tables violate the no repeating fields rule. If a customer or supplier
has more than one person involved in the ordering process, which is likely, the
table would need repeating pairs of fields with different names, such as
ContactName2 and ContactTitle2 or the like. To conform the Customers and
Suppliers tables to first normal form, you must create two new
tablesCustPers(sonel) and SuppPers(sonel), for exampleto hold contact
records. Including contact names in the Customers and Suppliers tables also
violates third normal form, which is the subject of the later Third Normal
The ContactName field also
violates the rule against multiple data values in a single field by combining
given and family names. This isnt a serious violation of first normal form,
but its a good database design practice always to identify persons by given
and family names in separate fields. When you create the new CustPers and
SuppPers tables, separate the ContactName field into two fields, such as LastName
and GivenName, which can include initials. You can then use a code similar to
that for CustomerID for the ContactID field. For this example, the ContactID
code is the first character of GivenName and the first four characters of
LastName. Alternatively, you could assign an AutoNumber value to ContactID.
Figure 4.5 shows the first 19 of
the 91 records of the CustPers table generated from the Customers table. The
CustomerID field is required for a many-to-one relationship with the Customers
table. Additional fields, such as Suffix, TitleOfCourtesy, Email(Address),
Phone, and Fax, make the individual contact records more useful for creating
mailing lists and integration with other applications, such as Microsoft
You extract data for records of
the CustPers table from the ContactName and ContactTitle fields of the
Customers table. Separating given and last names simplifies generating a
ContactID code to identify each record.
- For more
information on importing from Excel, see Importing and Linking Spreadsheet
Files, p. XXX (Chapter 8).
- To learn
how to use Access action queries, see Creating Action Queries to Append
Records to a Table, p. XXX (Chapter 13).
Figure 4.6 shows the
Relationships window with the CustPers and SuppPers tables added to the
Northwind database and their many-to-one relationships with the Customers and
Suppliers tables, respectively.
dont need to retype the data to populate the CustPers and SuppPers tables. You
can use Access to import the data from an Excel worksheet or text file, or use
Access action queries (append and update) to handle this chore.
The Relationships window
displays the many-to-one relationships between the Customers and CustPers
tables and the Suppliers and SuppPers tables.
Second Normal Form
Second normal form
requires that data in all non-key fields be fully dependent on the value of a
primary key. The objective of second normal form is to avoid data redundancy in
Only Northwinds Order Details
linking table (see Figure 4.7) has a composite primary key (OrderID +
ProductID). The UnitPrice field appears to violate the second normal form,
because UnitPrice is a field of the Products table. UnitPrice values added to
the Order Details table are dependent on the ProductID component of the
composite primary key and not the OrderID component, so UnitPrice data is not fully dependent on the primary key. On first glance, the UnitPrice
field appears to be redundant data. If you change the unit price of a product,
it would appear that you would need to alter the UnitPrice value in every Order
Details record for the product.
The Order Details linking table has a composite primary key
consisting of the OrderID and ProductID fields.The
Order Details table is an example of a situation in which you must retain what appears to be redundant information to
maintain the integrity of historical data. Prices of products vary over time,
so the price of a particular product is likely to change for orders placed on
different dates. If the price of a product changes between the order and
shipping (invoice) dates, the invoice reflects a different amount than the
order. Despite the Prices are subject to change without notice boilerplate,
customers become incensed if the invoice price is greater than the order price.
Eliminating the UnitPrice field
from the Order Details table and looking up its value from the current price in
the Products table also can cause accounting errors and distortion of
historical reports based on bookings and sales data. Removing the UnitPrice
data also violates the rules for the fifth normal form, explained later in this
Third Normal Form
Third normal form
requires that data in all non-key fields of the table be fully dependent on the
value of the primary key and describe only the object that the table
represents. In other words, make sure that the table doesnt include non-key
fields that relate to some other object or process and includes non-key fields
for descriptive data that isnt contained in another related table.
As mentioned in the First
Normal Form section, including contact information in the Customers and
Products table violates third normal form rules. Contacts are persons, not
customer or supplier organizations, and deserve their own related table that
has attributes related to individuals.
Other examples of a common third
normal form violation are the UnitsInStock and UnitsOnOrder fields of the
Products table (see Figure 4.8). These fields arent fully dependent on the
primary key value, nor do they describe the object; they describe how many of
the product you have now and how many you might have if the supplier decides to
ship your latest order. In a production order entry database, these values vary
over time and must be updated for each sale of the product, each purchase order
issued to the products supplier, and each receipt of the product. Purchases,
receipts, and invoices tables are the most common source of the data on which
the calculations are based.
The Products tables
UnitsInStock and UnitsOnOrder values must be calculated from data in tables
that record purchases, receipts, and shipments of products.
Including UnitsInStock and
UnitsOnOrder fields isnt a serious violation of the normalization rules, and its
not uncommon for product-based tables of order entry databases to include
calculated values. The problem with calculated inventory values is the need to
process a potentially large number of records in other tables to obtain an
accurate current value.
Fourth Normal Form
Fourth normal form
requires that tables not contain fields for two or more independent,
multivalued facts. Loosely translated, this rule requires splitting tables that
consist of lists of independent attributes. The Northwind and Oakmont databases
dont have an example of a fourth normal form violation, so the following is a
If youre designing an order entry
database, make sure to take into account committed inventory. Committed inventory
consists of products in stock or en route from suppliers for which you have
unfulfilled orders. If you decide to include inventory information in a
products table, add a UnitsCommitted field.
One of the objectives of Human
Resources departments is to match employee job skills with job openings. A
multinational organization is likely to require a combination of specific job
skills and language fluency for a particular assignment. A table of job skill
types and levels exists with entries such as JP3 for Java
ProgrammerIntermediate, as well as language/fluency with entries such as TE5
for TeluguVery Fluent. Therefore, the HR department constructs an EmplSkillLang
linking table with the following foreign key fields: EmployeeID, SkillID, and
The problem with the linking table
is that job skills and language fluency are independent facts about an
employee. The ability to speak French has nothing to do with an employees
ability to write Java code. Therefore, the HR department must split (decompose)
the three-field table into two two-field linking tables: EmplSkills and
Fifth Normal Form
Fifth normal form
involves further reducing redundancy by creating multiple two-field tables from
tables that have more than two foreign keys. The classic example is identifying
independent sales agents who sell multiple products or categories of products
for different companies. In this case, you have a table with AgentID,
CompanyID, and ProductID or CategoryID. You can reduce redundancyat the risk
of making the database design overly complexby creating three two-field
tables: AgentCompany, CompanyProduct (or CompanyCategory), and AgentProduct (or
AgentCategory). Database developers seldom attempt to normalize designs to
fifth normal form because doing so requires adding many additional small tables
to the database.
AutoNumber primary key values work
well for serially numbered documents if you dont allow records to be deleted.
Adding a true-false (Boolean) field named Deleted and setting the value to true
is one approach. This technique complicates queries against the tables, so you
might consider moving deleted records to another table. Doing this lets you
write a query to reconstruct all records for audit purposes.
Choosing Primary Key Codes
All Northwind and Oakmont tables use codes for
primary key values, as do almost all production databases. The critical
requirement is that the primary key value is unique to each record in the
table. Following are some tips, many with online resources, to aid in
establishing primary key codes:
- Many types of tablessuch as those for storing
information on sales orders, invoices, purchase orders, and checksare based on
documents that have consecutive serial numbers, which are obvious choices for
unique primary key values. In fact, most database designs begin with collecting
and analyzing the paper forms used by an organization. If the table itself or
programming code generates the consecutive number, make sure that every serial
number is present in the table, even if an order is canceled or voided.
Auditors are very suspicious
of invoice and purchase order registers that skip serial numbers.
- Packaged retail products sold in the United States
have a globally unique 10-digit or longer Uniform Product Code (UPC). The UPC
identifies both the supplier and the products SKU. The Uniform Code Council,
Inc. (http://www.uc-council.org/) assigns supplier and product ID values, which
are combined into linear bar codes for automated identification and data
capture (AIDC). The European Article Number (EAN) is coordinated with the UPC
to prevent duplication. The UPC/EAN code is a much better choice than
Microsofts serially assigned number for the ProductID field.
- Books have 10-digit and 13-digit International
Standard Book Number (ISBN) codes that are unique throughout the world and, in
North America, a UPC. ISBNs include a publisher prefix and book number,
assigned to U.S. publishers by the U.S. ISBN Agency
(http://www.bowker.com/standards/home/isbn/us/isbnus.html). ISBN Group Agencies
assign codes for other countries. Canada has separate agencies for English- and
French-language books. Either a UPC or ISBN field is suitable for the primary
key of a North American books database, but ISBN is preferred if the code is
for books only.
- The North American Industry Classification System
(NAICS, pronounced nakes) is
replacing the U.S. Standard Industrial Classification (SIC) for categorizing
organizations by their type of business. A six-digit primary key code for
18,000 classifications replaces the four-digit SIC code. Five of the six digits
represent codes for classifications common to the United States, Canada, and
Mexico. You can view a text file or purchase a CD-ROM of the NAICS codes and
their SIC counterparts at http://www.naics.com/.
- The U.S. Postal Service offers Address Information
Systems (AIS) files for verifying addresses and corresponding ZIP/ZIP+4 codes.
For more information on these files, go to http://www.usps.com and click the
Address Quality link.
- Social Security Numbers (SSNs) for U.S. residents are
a possible choice for a primary key of an Employees table, but their disclosure
compromises employees privacy. Large numbers of counterfeit Social Security
cards having identical numbers circulate in the United States, making SSN even less
attractive as a primary key field. The Oakmont database uses fictitious
nine-digit SSNs for EmployeeID and StudentID fields. Most organizations assign
each employee a sequential serial number. Sequential EmployeeID numbers can do
double duty as seniority-level indicators.
Specifying a primary key for
tables such as CustPers isnt easy. If you use the five-character code based on
first and last names for the primary key, you encounter the problem with
potential duplication of CustomerID codes discussed earlier. In this case,
however, common last namesJones, Smith, and Anderson, for examplequickly
result in duplicate values. Creating a composite primary key from CustomerID
and ContactID is a potential solution; doing this increases the number of new
contacts you can add for a company before inevitable duplicates occur. In most
cases, its easier to use an AutoNumber key for all ID values.
Figure 4.9 shows the final design of the modified
Northwind database with the added contact details tables. The tables of this
database are included on the accompanying CD-ROM as Nwind04.mdb in the
The final design of the expanded
Northwind database with customer and supplier contact details tables added.
The modified Northwind database
doesnt qualify as a full-fledged customer relationship management (CRM)
system, but the design is sufficiently flexible to serve as the model for a
sales and purchasing database for a small-sized wholesale or retail concern.
Maintaining Data Integrity and Accuracy
When you add, modify, or delete
table data, its important that the additions and changes you make to the data
dont conflict with the normalization rules that you used to create the
database. One of the most vexing problems facing users of large RDBMs is
unclean data. Over time, data entry errors and stray records accumulate to
the point where obtaining accurate historical information from the database
becomes difficult or impossible. Software vendors and database consultants have
created a major-scale data cleansing business to solve the problem. You can
avoid the time and expense of retroactive corrections to your data by taking
advantage of Access and SQL Server features that aid in preventing errors
during the data entry process.
You also must avoid changing the
primary keys of or deleting one of two tables in a one-to-one relationship.
integrity requires strict adherence to a single rule: Each foreign key value in a
related table must correspond with a primary key value in a base (primary)
table. This rule requires that the
following types of modifications to data be prevented:
- Adding a record on the many side of a one-to-many
relationship without the existence of a related record on the one side of the
relationship (for example, adding a record to the Orders table with a
CustomerID value of BOGUS when no such customer record exists in the Customers
- Deleting a record on the one side of a one-to-many
relationship without first deleting all corresponding records on the many side
of the relationship (for example, deleting Around the Horns Customers record
when the Orders table contains records with AROUT as the CustomerID value)
- Changing the value of a primary key field of a base
table on which records in a related base or linking table depend, such as
changing AROUT to ABOUT in the CustomerID field of the Customers table
Keypunch operators kept
their eyes on the source documents, which gave rise to the term heads-down data entry. The term continues in common use to
describe any data entry process in which the operator attention is fully
devoted to adding or editing database records as quickly as possible.
- Changing the value of a foreign key field in a linking
table to a value that doesnt exist in the primary key field of a base table
(for example, changing AROUT to ABOUT in the CustomerID field for OrderID
A record in a related table that
doesnt have a corresponding foreign key value in the primary key of a base
table is called an orphan record.
For example, if the CustomerID value of a record in the Orders table is ABCDE
and no ABCDE value exists in the CustomerID primary key field of the Customers
table, theres no way to determine which customer placed the order.
Access and SQL Server databases
offer the option of automatically enforcing referential integrity when adding
or updating data. Cascading updates and deletions are optional. If you specify
cascading updates, changing the value of a primary key of a table makes the
identical change to the foreign key value in related tables. Cascading
deletions delete all related records with a foreign key that corresponds to the
primary key of a record in a base table that you want to delete.
To learn more about enforcing
referential integrity in Access databases, see Establishing Relationships
Between Tables, p. XXX (Chapter 5) and Cascading Updates and Deletions, p.
XXX (Chapter 5).
Entity Integrity and Indexes
When you add new records to a
base table, entity integrity assures that each primary key value is unique.
Access and SQL Server ensure entity integrity by adding a no-duplicates index
to the field you specify for the primary key. If duplicate values exist when
you attempt to designate a field as the primary key, you receive an error
message. You receive a similar error message if you enter a duplicate primary
key value in the table.
For more information on Access
indexes, see Adding Indexes to Tables, p. XXX (Chapter 5).
Indexes also speed searches of
tables and improve performance when executing SQL statements that return data
from fields of base and related tables.
Data Validation Rules and Check Constraints
Data entry errors are another
major source of unclean data. In the days of punched-card data entry,
keypunch operators typed the data, and verifiers, who usually worked during the
succeeding shift, inserted the cards in a punched-card reader and repeated the
keystrokes from the same source document. This process detected typographical
errors, which the verifier corrected. Keypunch operators had no visual feedback
during data entry, so typos were inevitable; video display terminals didnt arrive
until the mainframe era.
Rekeying data leads to low
productivity, so most data entry applications support data validation rules
designed to detect attempts to enter illegal or unreasonable values in fields.
An example of a validation rule is preventing entry of a shipping date thats
earlier than the order date. The rule is expressed as an inequality: ShipDate
>= OrderDate, which returns False if the rule is violated. Similarly,
UnitPrice > 0 prevents accidentally giving away a line item of an order.
Access tables and fields have a
Validation Rule property that you set to the inequality expression. SQL Server
calls validation rules check constraints.
Both Access and SQL Server have a Validation Text property for which you
specify the text to appear in an error message box when the entry violates the
rule or constraint. Its a more common practice when working with client/server
databases to validate data in the front-end application before sending the
entry to the back-end server. Detecting the error on the server and returning
an error message requires a roundtrip from the client to the server. Server roundtrips generate quite a
bit of network traffic and reduce data entry efficiency. One of the objectives
of client/server front-end design is to minimize server round-tripping.
To learn more about Accesss
validation methods, see Validating Data Entry, p. XXX
A database transaction occurs
when multiple records in one or more tables must be added, deleted, or modified
to complete a data entry operation. Adding an order or invoice that has
multiple line items is an example of a transaction. If an order or invoice has
five line items, but a network or database problem prevents adding one or more
item records, the entire order or invoice is invalid. Maintaining referential
integrity prevents adding line item records without a corresponding order or
invoice record, but missing item records dont violate integrity rules.
As mentioned earlier in the
chapter, fields become columns
and records become rows in a
query. This terminology is an arbitrary convention of this book and not related
to relational database design theory. The reason for the change in terminology
is that a querys rows and columns need notand often do notrepresent data
values stored in the underlying tables. Queries can have columns whose values
are calculated from multiple fields and rows with aggregated data, such as
subtotals and totals.
Transaction processing (TP), also called online transaction
processing (OLTP), solves the missing
line item problem. Requiring TP for order entry, invoice processing, and similar
multirecord operations enforces an all-or-nothing rule. If every individual
update to the tables records occurs, the transaction succeeds (commits); if any update fails, changes made before the
failure occurs are reversed (rolled back). Transaction processing isnt limited to RDBMSs.
Early mainframe databases offered TP and transaction monitors. IBMs Customer
Information and Control System (CICS, pronounced kicks) was one of the first transaction processing and
monitoring systems, and it remains in widespread use today.
Access and SQL Server databases
offer built-in TP features. Access has a Use Transactions property that you set
to Yes to require TP for updates. SQL Server traditionally requires writing
T-SQL statementsBEGIN TRANS, COMMIT TRANS, and ROLLBACK TRANSto manage
transactions, but Access 2010s ADP forms have a new Batch Updates property
that lets you enforce transactions without writing complex T-SQL statements.
For a brief description of the batch
update feature introduced by Access 2007, see Changes to ADP Features, in
Online Appendix B.
Displaying Data with Queries and Views
So far, this chapter has
concentrated on designing relational databases and their tables, and adding or
altering data. SQL SELECT queries return data to Access, but you dont need to
write SQL statements to display data in forms or print reports from the data.
Access has built-in graphical tools to automatically write Access SQL for
Access databases and T-SQL for SQL Server databases. Accesss query tools use a
modern implementation of query-by-example (QBE),
an IBM trademark. QBE is a simple method of specifying the tables and columns
to view, how the data is sorted, and rows to include or exclude.
Linking related tables by their
primary and foreign keys is called joining the tables. Early QBE programs required defining
joins between tables; specifying table relationships automatically defines
joins when you add records from two or more related Access or SQL Server
Figure 4.10 is an example
of Accesss QBE implementation for Access databases, called Query Design View.
You add tables to the queryin this case, Northwinds Customers, Orders, and
Employees tables. As you add the tables, join lines indicate the relationships
between them. You drag the field names for the query columns from the table
lists in the upper pane to the Field row of the lower pane. You also can
specify the name of a calculated column (Salesperson) and the expression to
create the column values ([FirstName] & & [LastName]) in the Field
row. The brackets surrounding FirstName and LastName designate that the values
are field names.
Accesss Query Design view for
Access databases uses graphical QBE to create queries you can store in the
Selecting Ascending or Descending
in the Sort column orders the rows in left-to-right column priority. You can
restrict the display to a particular set of values by adding an expression in
the Criteria column.
Running the query returns the
resultset, part of which is shown by Figure 4.11. You can save the query for
later reuse as a named Access QueryDef(inition) object in the database.
These are the first 16 of the
408 rows of the query resultset returned by executing the query design of
SELECT Customers.CompanyName, Orders.OrderID, Orders.OrderDate,
[FirstName] & & [LastName] AS Salesperson
INNER JOIN (Customers
INNER JOIN Orders
ON Customers.CustomerID = Orders.CustomerID)
ON Employees.EmployeeID = Orders.EmployeeID
ORDER BY Customers.CompanyName, Orders.OrderID;
Its obvious that using QBE is
much simpler than writing SELECT queries to concatenate field values, join
tables, establish row selection criteria, and specify sort order. Accesss QBE
features are powerful; many developers use Access to generate the SQL
statements needed by Visual Basic, C++, and Java programs.
Access QBE automatically converts
the query design of Figure 4.10 into the following Access SQL statement:
The da Vinci QBE tool for
creating T-SQL views is similar to the Access Query Design view, but has an
additional pane to display the T-SQL statement as you generate it. You add
tables to the upper pane and drag field names to the Column cells of the middle
pane. An SQL Server view is the client/server equivalent of an Access QueryDef.
As with Access QueryDefs, you can execute a query on an SQL Server view.
T-SQL uses + rather than & to
concatenate strings, uses a single quote () as the
string delimiter, and requires a numerical instead of a string criterion for
the YEAR function. Heres the T-SQL version of the preceding
Access SQL statement after the SELECT
and WHERE clauses have been tweaked:
modifier is needed to permit an ORDER BY
clause in a view; prior to the addition of the TOP keyword
in SQL Server 7.0, creating sorted views wasnt possible. The da Vinci query
parser adds the TOP
100 PERCENT modifier if an ORDER BY clause is present. However, TOP 100 PERCENT ... ORDER BY doesnt sort SQL Server 2005 views. Replacing 100 PERCENT with a large integer (<= 2147483647) sorts the view.
prefix to table and field names is an abbreviation for database
owner, the default owner for all SQL Server databases you create as a
system administrator. Figure 4.12 shows the design of the T-SQL query generated
by pasting the preceding statement into the da Vinci query pane.
Despite their common ANSI SQL-92
heritage, SQL Server wont execute most Access SQL statements, and vice versa.
Copying the preceding Access SQL statement to the Clipboard and pasting it into
the SQL pane of the query designer for the NorthwindCS sample database doesnt
work. The da Vinci designer does its best to translate the Access SQL flavor
into T-SQL when you paste, but you receive errors when you try to run the
SELECT TOP (2147483647) dbo.Customers.CompanyName,
dbo.Employees.FirstName + +
dbo.Employees.LastName AS Salesperson
INNER JOIN dbo.Customers
INNER JOIN dbo.Orders
ON dbo.Customers.CustomerID = dbo.Orders.CustomerID
ON dbo.Employees.EmployeeID = dbo.Orders.EmployeeID
WHERE (YEAR(dbo.Orders.OrderDate) = 2006)
ORDER BY dbo.Customers.CompanyName, dbo.Orders.OrderID
For more information on the da Vinci
toolset, see Exploring SQL Server Views, in online Chapter 27.
- For detailed instructions on
installing SQL Server Express and NorthwindCS.adp, see Performing SQL Server
Express Setup, p. XXX (Chapter 1), and Exploring the NorthwindCS Sample
Project, in online Chapter 27.
The Datasheet view of the SQL
Server view generated by the preceding SQL statement is identical to the Access
querys Datasheet view shown in Figure 4.11.
Pasting an Access SQL statement into Accesss version of the
da Vinci query design tool and making a few minor changes to the T-SQL
statement results in an SQL Server view equivalent to the Access query of
Microsoft Access 2010 In Depth, Rough Cuts
By Roger Jennings