Best practices for transactional systems aren't always relevant in an enterprise data warehouse. Priorities are different, environment ditto, people's knowledge and expertise double ditto. What are the biggest differences? What are the biggest challenges for staff, DBAs, and managers?
The enterprise data warehouse (EDW) is typically a very large collection of historical transaction data. As Bill Inmon defined it, " [A data warehouse is] a subject-oriented, nonvolatile, integrated, time variant collection of data in support of management's decisions." Most warehouse databases are not structured in a normalized design; instead, the database design resembles more of a hub-and-spoke or star schema design (see Wikipedia for more definitions). In addition, designs must take into account very large database (VLDB) considerations such as partitioning schemes, special backup and recovery processes, and capacity planning.
Performance issues in the EDW are more often associated with its bulk data transfer processes. These include table loading, table extracts for consuming systems such as remote data marts, and analytical SQL executed by analysts.
There are significant challenges facing EDW staff in this environment. Here is a summary of what each group faces, and their biggest challenges.
The database administrator usually prioritizes work in this order:
- Data Availability
In an EDW this ordering changes. The data warehouse is usually not a mission-critical system, and is typically not required to be available immediately after a disaster. Indeed, the very size of the warehouse means that backup files may be extremely large, and recovery times prohibitively long.
The highest priority in the EDW is data availability. To this end, the DBA will design and implement processes that minimize data locking, reduce outages during data loads, and implement high-speed data transfer processes such as high performance table loads and unloads. The DBA will also monitor database size and growth, and provide major input into storage capacity planning.
Another significant difference to the DBA is performance tuning. Most of the EDW database designs will involve a typical mix of fact and dimension tables. Standard processes will include table bulk loads and unloads, and SQL access. Most of the SQL access across sets of related tables will follow a standard pattern.
Consequently, the DBA will usually spend far less time doing database design and SQL performance tuning, and instead turn their attention to DB2 system performance tuning. EDW system-level tuning will involve:
- Allocation and use of memory for buffer pools and work files
- Implementing processes to detect, prevent and correct timeouts and deadlocks
- Implementing high-speed data transfer processes, perhaps using third-party software tools
- Controlling the use of parallelism (methods which access multiple partitions of data in parallel)
One exception to the list will be the initial efforts to tune SQL statements that access the data warehouse using a "star schema" access path. This access path takes advantage of the relatively large sizes of fact tables and the correspondingly smaller sizes of dimensions. In DB2 for z/OS, this access path would include joining multiple dimension tables together prior to accessing the fact table. This is in contrast to more common access paths where the largest table is usually accessed first.
Data Warehouse Management
The manager of the EDW will spend more time on customer service to data consumers. This is in contrast to other infrastructure managers, who spend their time dealing with data processing or technical issues. The EDW manager looks 'outward' to the primary customers of the warehouse. The most important question they must answer is this: How can our customers' needs be met in a timely fashion?
The EDW manager's direct reports will have both business knowledge and technical knowledge. One of the major compromises to make is budgeting for staff training. EDW staff will require knowledge of database design, SQL coding and tuning, standard data movement processes such as FTP, and any third-party tools used for design, tuning, or reporting.
Reporting tools can be particularly vexing. They store (or have visibility to) portions of the warehouse database, and include methods for data aggregation, summation, drill-down, and other analytic functions. The tool generates SQL to access the warehouse based on the user's specifications. The result is a complicated tool with a complex interface. Training classes will be lengthy and expensive.
The EDW exists to store time-dependent data for analysis by lines of business. Accordingly, the EDW manager concentrates on maintaining standards, access control, and security of the data warehouse. Less time is spent on evaluating lines of code written or databases implemented; more time is spent on implementing data flow and storage in a way that meets the requirements of the line of business.
EDW Business Analyst
The business analysts interpret the requirements of the enterprise's lines of business and implement the appropriate data structures and data movement processes. Over time they become quite knowledgeable in DB2 database design, SQL, and performance tuning. This may be a boon to the database administrator, who can offload simple SQL analysis tasks to the analysts while concentrating on other, more important things. The business analyst will also acquire some skills in query and reporting tools, if only to test new and changed systems prior to implementation.
In addition to being familiar with the EDW database design, business analysts must become adept at implementing standard processes. These include:
- Bulk data transfer from operational systems into the EDW staging areas
- Data cleaning operations to remove or fix invalid fields and deal with missing data
- Surrogate key algorithms for keying important fields such as account numbers
- Loading data from staging areas into dimension and fact tables
- Coordination and implementation of data extracts send to external systems
Thus, the analysts become quite valuable members of the EDW team. They must respond quickly to business needs, balance requirements against implementation options using their technical knowledge, and build the requisite structures and processes.
The enterprise data warehouse differs from other major systems in its size, structure, and use. Thus, standards, processes and best practices must be changed or customized to meet the demands of the primary customers.
The greatest difference is in the quality of personnel required to manage the EDW environment.
Database administrators will concentrate less on tactical issues such as SQL tuning, turning their attention to strategic matters such as system performance tuning. The DBA's primary goal is no longer recovery or disaster planning, but rather assuring data availability.
Managers will deal less with internal technical details such as database design or SQL coding. Instead, they become more project- and customer-oriented, responding to requests from EDW data consumers for more access to more data for analysis. Employee evaluation will concentrate more on customer service than on process execution.
Business Analysts will require education and expertise in multiple technologies, some of them new and complex.
In a complex, changing environment the most important recommendation is to develop and implement best practices. Create readable process and standards documents that are available to all, and review them on a regular basis. This will increase the quality of communications, provide templates for implementing most processes, and make it easier to keep a complex environment manageable.