Information technology must build, support, and tune databases that support business applications such as order entry, accounting, inventory or payroll. For these applications, typical database designs are derived from business rules and application requirements and are modified by enterprise data modeling standards such as normalization and referential integrity enforcement.
Data warehouses are different. They are not a system of record, they are designed for data availability, database recovery has a low priority, and the physical database design is typically non-relational. Consequently the database administrator (DBA) tends not to implement things such as referential integrity, triggers, and regular backups. These differences force the database administrator (DBA) to look at performance tuning the data warehouse in a different way.
DB Design for Availability
The data warehouse (DW) is designed for data availability; recovery considerations are secondary. Further, DW activity occurs in two distinct phases: extract, transform and load (ETL), and business intelligence (BI).
During the ETL phase, data from operational systems is:
- Extracted -- Unloaded and transmitted to the DW
- Transformed -- Surrogate keys are assigned to key fields such account number and customer number; invalid data is replaced by defaults (sometimes called data cleaning)
- Loaded -- Data is placed in tables of the DW
During the BI phase, users and applications query the DW. These queries typically aggregate or subset data by one or more dimensions such as time or geographic area.
The DW database design takes these phases into account using two main strategies:
- Table partitioning -- Tables are designed so that data is separated into different partitions by a dimension value. For example; daily transaction data is placed in a table partition designated by today's date.
- Fact and dimension tables -- Fact tables contain the data which is to be aggregated or summarized; dimension tables contain selection criteria. For example, an Account fact table would contain data balances, while the Location dimension table would contain geographic locations.
Tuning Techniques for the ETL Phase
During the ETL phase large volumes of data are loaded into the DW. The DBA should take this into account during database design.
Key clustering is one technique the DBA uses to ensure that inserted rows do not create hot spots in the data. One common way this happens is when new records are assigned a key based on a sequentially ascending value (such as an Identity column or a Sequence), and the table is clustered on this key. The result is that newly-inserted rows are added together in a section of the table .
The DBA must coordinate such key clustering techniques with index choice, because physical clustering of the table is dependent upon designating one of the indexes as the clustering index. Since adding a surrogate key to a table in order to cluster it necessarily means creating a clustering index for that key, this may affect the total number of indexes for the table.
In addition, the DBA will attempt to use bulk loading methods rather than those using SQL Insert statements. The most common bulk loading method is the DB2 Load utility. The DBA will ensure that the extracted data is pre-sorted in key clustering sequence before passing it to Load.
One potential issue in the DW is access by queries during the ETL process. If possible, the DBA will schedule DW table loading during a time when queries are minimal. If this is not feasible, then the DBA can use one of the following techniques:
- The Shadow Table -- The DBA creates a copy of a table. During ETL, the copy is loaded while the original is available for querying. After the load is finished, the DBA executes a process that switches the names of the tables. The copy now becomes the original, and queries access this version.
- Active / Inactive Partitions -- Tables are designed with multiple partitions. A separate control table contains one row for each of the original table's partitions. These rows contain activity switches that indicate which partitions of the main table are active. The ETL process turns an indicator off; it is now free to load the inactive partition. When done, the indicator is turned on. BI queries are designed to only access active data.
Tuning Techniques for the BI Phase
Users of the data warehouse typically use tools to construct queries rather than writing SQL statements themselves. These tools provide a visualization of warehouse tables along with descriptions of table relationships, usually using some form of entity-relationship diagram.
Data warehouse queries are sometimes described as ad hoc, or invented for a one-time purpose. Thus, DBAs consider tuning a data warehouse as a complex process that must balance the performance of potentially thousands of different queries.
In reality, the database design of the data warehouse results in highly-predictable access paths. Queries will almost always require one or more columns from a fact table, with aggregations requiring joins to one or more dimension tables. This means that the DBA can be quite proactive about tuning SQL queries against the DW using one or more of the following techniques:
- Standard fact table index -- Create an index on the fact table that includes all columns that may be joined to dimension tables. (In relational database designs these would all be foreign key columns.) This standard index will be required in order for the database management system (DBMS) to choose what is called a starjoin access path. This high-performance access path involves pre-joining dimension table rows into a large temporary table before joining to the fact table.
- Additional fact table indexes -- Since fact tables contain the largest part of the data returned by BI queries, these tables may require additional indexes in order to provide alternative access paths to the table. These indexes will usually take the form of one dimension column and one or more additional fact table columns that are accessed frequently.
- Gather rich table and column statistics -- The DBMS uses data distribution statistics in order to determine the lowest cost access path for a query. Most DBAs will have the system gather and store statistics about table cardinality (number of rows) and column statistics about the leading columns of indexes. Rich statistics means gathering additional information, including frequent column values, column value correlations (e.g., columns like ZipCode and City should correlate), and even histogram statistics.
- Regularly reorganize non-clustered indexes -- Each table having one or more indexes has one designated the clustering index. The key columns of this index define the preferred physical order of rows in the table. In a DW the ETL process usually bulk loads new rows into the table in this order, so row clustering is maintained. However, the keys of the remaining indexes (if any) are not in this order. The result: such indexes may become disorganized, leading to poor performance. The DBA should ensure that such indexes are regularly reorganized.
- Consider summary tables for pre-aggregation -- Summary tables (also called materialized query tables in DB2) are populated after fact table data load and before BI querying begins. These tables contain common aggregations of the fact table data such as summations over time, minimums, maximums and averages. This provides BI queries with ready-made results, allowing the DBMS to bypass complex and expensive calculations (perhaps several times in one day).
The nature of the data warehouse requires the DBA to take a different approach to performance tuning. The database design of fact tables and dimension tables leads to a small number of common access paths with the most common being joining a fact table with one or more dimension tables.
Knowing this, the DBA can use the tools I have described to provide ETL processes and BI queries with optimized access paths.