A little personal history...
Some six or seven years ago, I was working for a utility company, trying to extract high-level reports from their complex OLTP system. It turned in to a project of fiendish complexity, that essentially tried to answer the question, "Are we making a profit?" from a system holding all the individual income and expenses. Even for a person with a pretty extensive background in extracting information from transactional data it was a daunting task, with many failed experiments, dead-ends encountered, and systems-brought-to-their-knees by full table scans.
What was really needed, although we did not realize it until rather late in the game, was a separate data warehouse or decision support system. However, even with some books by experts on data warehousing in general, and with Oracle-specific technical documentation, making the mental leap from OLTP to this new paradigm was a daunting prospect.
This series of articles is intended to be the reference that we were looking for, to answer both the questions that we had and those that we should have been asking had we suspected that we needed the answers.
OK, at the time, the data warehousing tools at our disposal pretty much amounted to bitmap indexes -- which the DBAs wouldn't let me use -- and the star query -- now pretty much a thing of the past. However, if such features as table compression, materialized views with query rewrite, dimensions, bitmap join-indexes etc. had existed at the time, then I hope that this series would have been our savior.
Focusing on the problems...
I don't intend that these articles be a dumping ground for information that you can look up in the excellent Oracle 9iR2 Data Warehousing Guide, or in the Concepts Manual, or in the SQL Reference. You should be intimately familiar with the latter two if you are to make a success of your databasing efforts, warehousing or otherwise. Rather, they will be a guide that firstly lets you know that this information exists, and secondly that will extract and organize some key information from these and other sources, and refer to them often. It will also offer advice on what features I have found to be the most important, which ones I have found to be rather optional and how to recognize the time to just say "no" to shiny new stuff.
Each article will tend to focus on a particular set of decisions that we have to confront in our data warehousing efforts, and provide information to help you isolate the key issues, identify potential solutions, and defend your decisions to your skeptical co-workers, who peer out over the battlements of their transaction processing systems in no-way prepared for the different priorities of the data warehouse.
A high-level view...
As an introduction to the series I want to talk about a very few high-level differences between OLTP and Data Warehousing systems. For those who have walked the path before there will be few surprises, if any, but these notes will lay the groundwork for many of the comments and much of the advice that follows.
One of the characteristics most often associated with data warehouses is their larger size relative to OLTP systems. In some ways this is rather obvious, but let us just look at two reasons why this may be so.
A data warehouse built using data from an OLTP system will tend to support an historical view of the source system's transactions, hence there may be many "versions" of an inventory record, for example, telling us inventory levels for the past year at a daily, weekly, or monthly level. The data warehouse may therefore contain 365-times, or thirty-six times more low-level records than the OLTP system. This is not always the case though - if you are warehousing retail sales records then the reason that the data warehouse is larger may simply be that the OLTP system is purged of old data on a regular basis. Possibly this is only happening because there is a data warehouse to copy records to, so that they may be purged.
Whatever the reason for the size difference we must remember that having a lot of bytes on disks is not necessarily a problem in itself, and the raw size difference seems overstated to me. The methodologies for storing and accessing data are so different between the two database types that the raw size issue is misleading enough to be almost irrelevant.
This brings us neatly to the second issue, of how data is accessed.
An interesting metaphor describes OLTP databases as "flickering" systems. If you were able to take a virtual look at all the data, laid out in front of you, then the data reads and changes might appear as small, scattered, flickering bursts of light. Individual rows are inserted, individual columns are updated and there is almost a continuous activity level as tens, hundreds or thousands of simultaneous operations. The occasional report, bulk data extraction, or bulk data load might make your eyes water for a second but they are generally kept aside for periods of lower transactional activity.
In contrast, as you watch your data warehouse you see many distinct reporting activities lighting up large contiguous sections of some tables. Every now and then on a regular basis a completely new set of data appears -- maybe every fifteen minutes, maybe once a week, and if you sped up the process, you might notice the appearance of more complex harmonics. For example, every hour the new transactions get loaded, every 24 hours the customer records are updated, and every month the price list is modified.
This data load process can usefully be thought of as the heartbeat of the system-- everyone ignores it until it stops, and if it does stop then you can be sure the alarms will go off, the phone system will light up, and the IT director might even make a personal visit just to say "Hi" and look over your shoulder for a while.
One of the other observations that you might make from your virtual views of the two systems is that there are vast areas of the data warehouse that are never modified. They may be read frequently, though not as often as the most recent data, but once they have been created they sit unchanged. This is sometimes the case with the OLTP system, where permanent invoicing records are being generated, for example, but in general, a much higher proportion of the data are fair game for future changes.
What it all means...
In the data warehousing environment we are handed some distinct advantages over the OLTP system.
- We know when data will arrive, because we control all of the data load processes.
- We probably know how much data there will be.
- We might know just what data is going to be in demand at particular times.
- We know that once we have loaded our data, most of it will not change.
We also have some disadvantages.
- We may have to deal with vast amounts of new data as fast as possible.
- Reporting requirements may change constantly.
In next months installment we'll take a look at disks, datafiles, tablespaces and maybe discuss tables if we're lucky.