Of the various constraints possible on relational tables, referential constraints are perhaps the most common ... and most misused. Learn about the advantages and disadvantages of different methods to implement and enforce RI, and issues that must be addressed when implementing DBMS-enforced Referential Integrity.
Of the various constraints possible on relational tables, referential constraints are perhaps the most common ... and most misused.
Referential constraints occur in most business data, and are often documented in data models as one-to-many relationships between entities. For example, let's consider a retail application: one customer has one (or many) orders; one order contains one (or many) products; and so forth.
A database where all such data and their relationships are consistent is said to have referential integrity (RI). How is the data kept consistent? For example, how do we ensure that an order inserted into the ORDER table is for a valid, existing customer? Also, if we must delete a customer, what happens to the customer's outstanding orders?
To summarize the elements of an RI relationship using our retail example:
- The CUSTOMER table is the parent table, and contains rows whose unique identifier (the primary key) is the value in column customer_number.
- The ORDER table is the child table, and contains rows having values for customer_number (the foreign key) that are required to correspond to values in the CUSTOMER table.
- This referential constraint (parent table with primary key, child table with foreign key) also includes a delete rule. This specifies what logical action must be taken when a row from the parent table is deleted. The most common actions: Restrict (do not allow a deletion of a CUSTOMER if any corresponding ORDERs exist); and Cascade (when a CUSTOMER row is deleted, delete all corresponding ORDER rows).
When we implement the data model and its relationships as tables in a relational database (such as DB2 for z/OS) there are several ways to implement and enforce RI. The most common are:
- Application-enforced. Have applications check data values as they are entered into the database. When deleting parent table rows, the application must take the appropriate action with the "orphaned" child rows.
- Database management system (DBMS)-enforced. Have the DBMS automatically enforce valid data values and, based on the delete rule, automatically take the action specified.
Let's continue with our customer/order example.
If we implement application-enforced RI, then any process that inserts a row to the ORDER table must first interrogate the CUSTOMER table and ensure that the customer already exists. Any process that deletes a CUSTOMER must implement the RI delete rule and take action on the corresponding ORDER rows (either prevent the CUSTOMER delete from happening if matching ORDER rows exist, or delete all matching ORDER rows).
With DBMS-enforced RI, processes are free to insert rows to the ORDER table with any customer value they wish; however, the DBMS will prevent any insert having an invalid customer value by checking the parent (CUSTOMER) table itself. Further, the DBMS will automatically implement the delete rule.
Where do we enforce RI?
While these two methods seem to be equivalent, there are several advantages and disadvantages to each.
- Change control
- Application-enforced RI embeds the RI-checking and delete rule logic in applications. How and where do we document this? Changing a relationship now involves changing application code, which also means re-testing, re-certifying, and so forth.
- DBMS-enforced RI is stored in the database definition. Changing it is a relatively simple database change, but if RI is changed, will application logic respond properly to (potentially) new error messages?
- Application-enforced RI means potential multiple invocations of the DBMS software for insert and delete operations. An ORDER row insert must be preceded by a query against the CUSTOMER table. Each of these requires a call to the DBMS, with corresponding parameter setup, error handling, documentation, and testing.
- DBMS-enforced RI automates data validation and delete rule execution.
- Documentation clarity
- Application-enforced RI is embedded in application logic. This makes centralizing the documentation about business rules difficult.
- DBMS-enforced RI requires that all RI relationships and delete rules are included in the database definition. This nicely centralizes RI documentation that can be displayed easily with many modeling tools.
- Code generation
- Many code-generation software tools and query tools depend upon DBMS-enforced RI to determine how related tables should be JOINed. With application-enforced RI, all of these rules must be manually entered into the tool.
- Backup and recovery
- In a recovery situation (for example, a media failure) it is typical that the parent and child tables will be recovered together to the same point-in-time. How does the DBA determine this if we implemented application-enforced RI? The only way is to have complete, consistent, up-to-date documentation on all RI relationships. And, if these change, then application code changes must be accompanied by changes in the DBA's backup and recovery processes.
- With DBMS-enforced RI, the DBA can query the database to determine what RI relationships exist. With DB2 for z/OS, one can run the REPORT RECOVERY utility, which will list related tables for recovery.
Is DBMS-enforced RI a best practice?
While the above might indicate that DBMS-enforced RI is a best practice, there are additional issues that must be addressed when implementing it. These include:
- Data environment
- While it is common to implement RI in operational data, many data warehouse implementations do not use it at all. This is because data inserted to the data warehouse is presumed to be already validated.
- Some databases are created in order to maintain copies of all or most of another database. There are several versions of this: examples include data replication and publish/subscribe applications. These target databases may receive copies of data in haphazard fashion; as a consequence, at any particular moment they may not be consistent. For example, suppose we are replicating customer and order information from database A to database B. When all data is finally received then database B is consistent; however, there may be moments when orders have arrived a few instants prior to their corresponding customers.
- Code tables
- Code tables contain data typically used as abbreviations in other tables. One example is U.S. state code abbreviations. It would be possible to implement RI such that the code table is the parent and tables containing the code column are defined as child tables; however, this is rarely done. By their nature, code tables rarely experience row deletions. In addition, it is common for applications to use code tables for data validation. For example, an application inserting a customer address might use a U.S. state code table to populate a pull-down list for an online user. Since the codes used in this fashion are valid, there is no need to use RI to enforce data consistency.
- Existing data and application logic
- Some existing databases and applications may have already implemented application-enforced RI. In this case, implementing DBMS-enforced RI would not only be redundant but potentially wasteful of resources.
How do we mitigate the disadvantages of DBMS-enforced RI?
In the current version of DB2 for z/OS (Version 8) there is a parameter in RI definition that the DBA can use to ease the restrictions noted above. Here is a typical RI definition example:
... CREATE TABLE ORDER ... FOREIGN KEY Customer_Number REFERENCES CUSTOMER ON DELETE CASCADE
This syntax reads as follows (I've removed other things like the column definitions for clarity): "Table ORDER contains a foreign key column called Customer_Number. The parent table is CUSTOMER. When a customer is deleted from the CUSTOMER table, delete all corresponding rows from the ORDER table."
Now, consider the following slight change:
... CREATE TABLE ORDER ... FOREIGN KEY Customer_Number REFERENCES CUSTOMER ON DELETE CASCADE NOT ENFORCED
This last parameter, "NOT ENFORCED", does several things:
- The RI relationship is still defined in the database, providing a centralized place for RI rule documentation;
- The DB2 REPORT RECOVERY utility will generate recovery information for the tables as parent and child;
- It allows data query and code tools to detect the RI relationship, permitting complete data model and table JOIN documentation, and so forth;
- It will not cause issues in data warehouse or replication environments.
Referential integrity is a property of business data. It manifests itself as business rules defining relationships between entities in a logical model. When a data architect transforms this logical model into a physical database model, they must decide how the RI will be enforced. The most common choices are application-enforced and DBMS-enforced.
DBMS-enforced referential integrity is a best practice. In cases where implementing RI this way may cause problems, DBAs can use the NOT ENFORCED option to ease restrictions.