On a regular OLTP system, the integrity of the data is generally supported in two ways - firstly by logic embedded in the application, and secondly by constraints placed on the logical structure of the database itself. The two may be complementary, as a designer may rely on the latter to support the former, through interpretation of error messages raised by the database.
In a data warehousing environment it is common practice to rely more heavily on the application level to guarantee the data integrity - the application in this instance is the Extraction, Transformation & Loading (ETL) process (or ELT process, depending on your choice of technologies).
Why the Difference?
In the minds of dedicated OLTP professionals, alarm bells are now ringing loudly - except maybe for those with a lifetime of experience on major enterprise-level software, who may never have heard of constraints of any type. So why the difference? Why can a data warehouse be operated successfully and safely without integrity being enforced at the database level?
The simple answer is that the ETL application is itself a reasonably simple application, and the data warehouse is a reasonably simple logical structure. The ETL process may apply complex business cleansing rules, but there are generally only a few paths through which it will modify the database. A fact table might only be touched by a single sub-process inserting new rows of data. Contrast this with the number of different operations that might be performed on a table of an OLTP system - there the database-side constraints are a necessary precaution against application error.
The enforcement of constraints is often easier in the ETL layer because it has special information about the data. For example, if the logical primary key on a data warehouse fact table involves the date on which stored transactions took place, then the ETL process might be able to guarantee that new data will not violate such a key by retrieving from the source data set only those transactions that occurred after the latest one in the fact table. Similarly, the ETL process might detect new values of a column in a fact table and maintain the appropriate dimension table accordingly, so foreign keys would never find a violation of the relationship that they enforce.
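As a sketch of the first technique - with hypothetical table and column names (sales_fact, sales_staging, txn_date and friends are mine, not from any particular schema) - an incremental load might look like this:

```sql
-- Hypothetical incremental load. Only rows newer than anything
-- already in the fact table are inserted, so this path cannot
-- violate a logical primary key that leads on txn_date - without
-- the database ever checking it.
INSERT /*+ APPEND */ INTO sales_fact (txn_date, product_id, store_id, amount)
SELECT s.txn_date, s.product_id, s.store_id, s.amount
FROM   sales_staging s
WHERE  s.txn_date > (SELECT MAX(f.txn_date) FROM sales_fact f);
```

The WHERE clause is doing the work that a unique index would otherwise do at far greater cost.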
Protection against invalid data and data relationships is not the only function of database-side constraints of course. The Oracle optimizer can leverage constraint information to produce more effective execution plans. As a simple example, knowledge that an indexed column is constrained to be NOT NULL enables Oracle to use the index to find all the unique values of that column.
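To illustrate with assumed names (sales_fact and product_id are placeholders), consider a query for the distinct values of an indexed column:

```sql
-- If product_id is declared NOT NULL, the optimizer knows that
-- every row is represented in the b-tree index on that column
-- (single-column b-tree indexes do not store NULL keys), so it
-- can answer from the index alone - for example with an index
-- fast full scan - instead of scanning the whole fact table.
SELECT DISTINCT product_id FROM sales_fact;
```

Without the NOT NULL declaration the index might be missing rows, and the optimizer must fall back to the table.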
So What's Wrong With Constraints?
Given that constraints provide valuable help to the optimizer, what is the problem with making use of them? We must first acknowledge that there are three distinct elements to these constraints.
- information that they provide concerning the intentions of the database designer.
- use of this information by the database. It is used to reject modifications to data that would result in a condition that would violate the intent of the constraint, and it is used by the optimizer to infer what assumptions may be made about the data.
- supporting structures that allow efficient enforcement of the constraints, and the overhead required to maintain them. Primarily these are any indexes associated with Primary, Foreign and Unique Key constraints.
So which of these elements are desirable and undesirable in the context of a data warehouse?
The rejection of invalid modifications to the data is, in this context, itself undesirable, for the reasons previously stated - the ETL process will often render such checking redundant, and the overhead of duplicating the validation is an unnecessary burden on the database.
The supporting structures can be especially unwelcome for two reasons. Firstly, there is their storage requirement, which will probably be significant in comparison to that of the fact table itself, especially if the table's data segments are compressed. However, given my advice in the earlier article that a data warehouse runs out of i/o capacity long before it runs out of storage capacity, this is less significant than the second reason: the burden that index maintenance places on CPU and i/o capacity. This burden is imposed at a time when the system is already busy loading new data, and if the data load process runs very frequently, or continuously, instead of during a comfortable off-hours load window, then the overhead is doubly unwelcome.
Finding a Middle Ground
So hopefully that is sufficient to make the case for a set of constraints that perform only the function of providing information to the optimizer. Oracle provides this functionality in the form of the RELY keyword, which allows a constraint to be created in a disabled, non-validated state - meaning that it checks neither new nor existing data for integrity - while telling the optimizer that it may infer that the data meets the declared constraint definition. We can distinguish between this form of integrity and the regular enforced kind by referring to it as declarative integrity.
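In DDL terms, this is the RELY DISABLE NOVALIDATE combination. A sketch, again with hypothetical table and column names:

```sql
-- A declarative-only primary key: no supporting index is built
-- and no data is checked, but the optimizer may RELY on it.
ALTER TABLE sales_fact
  ADD CONSTRAINT sales_fact_pk
  PRIMARY KEY (txn_date, product_id, store_id)
  RELY DISABLE NOVALIDATE;

-- A declarative foreign key to a dimension table.
ALTER TABLE sales_fact
  ADD CONSTRAINT sales_fact_product_fk
  FOREIGN KEY (product_id) REFERENCES product_dim (product_id)
  RELY DISABLE NOVALIDATE;
```

Note that the referenced dimension table still needs a declared primary or unique key for the foreign key to point at.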
Limitations on Declarative Integrity
- Although it seems redundant to say so, it is important to note that declarative integrity provides no additional information that could not be provided by regular enforced integrity. There is a method for providing additional information on data relationships to the database through the logical objects known as Dimensions, which will be covered in a future article - these are not the tables that we refer to as dimension tables, but they do give the optimizer a better understanding of the data within them.
- Note that if you try to modify a NOT NULL constraint to the RELY state, you will be welcomed with the ORA-25127: RELY not allowed in NOT NULL constraint error. This is less of a problem than it might seem, for three reasons. Firstly, there is no index supporting a NOT NULL constraint, so there is less of an advantage to disabling it. Secondly, a data warehouse will more commonly use bitmap indexes for fact table columns, and in contrast to b-tree indexes, bitmap indexes store null values and the optimizer knows it. Thirdly, the overhead of validating that a column is not null is very much less than that of just about any other constraint type.
- The extent to which the optimizer uses declarative integrity information on primary and foreign key constraints is confined to query rewrite against materialized views. If you are not using this feature then you will find limited benefits in creating such constraints.
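For the optimizer to trust RELY constraints during query rewrite, the QUERY_REWRITE_INTEGRITY parameter must be relaxed from its default of ENFORCED, which ignores unvalidated constraints:

```sql
-- Allow query rewrite to trust declared (RELY) but unenforced
-- constraints; this can also be set at the instance level.
ALTER SESSION SET query_rewrite_integrity = TRUSTED;
```

The trade-off is exactly the one this article argues for: you are asserting that the ETL layer, not the database, is the guarantor of the declared relationships.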
Do not be disheartened by the limitations listed above - given the choice of creating enforced integrity, declarative integrity, or no integrity, I would always go for the declarative. Even if the optimizer is not going to make much use of the information, it is there to provide metadata to ETL applications, front-end applications, and support staff.
It will also stop those uninitiated OLTP-persons from mocking you for not building integrity into your database - with declarative integrity, they will never know the difference.