Since the building of data warehouses began, the perceived wisdom has been to maintain a separation between the data warehouse and the company's operational systems. Today, while this 'separatist' thinking still dominates among many data warehousing professionals, the approach is being questioned. Evolving wisdom is to consider alternatives to the implementation of a traditional warehouse solution, where operational systems are capable of living in harmony with the data warehouse.
by Duncan Pauly
Data volumes continue to place organisations' IT departments under greater strain
than ever as they look to harness data effectively while overcoming a minefield
of performance issues. The importance of the data warehouse in attaining the
maximum benefit from data has grown exponentially, with its promise to provide
an enterprise-wide view of business activities and increase the company's
profitability through intelligent data handling. Increased sales, more
effective and intelligent marketing, enhanced customer services, and
streamlined business processes - the data warehouse is held in high esteem by
organisations as being capable of paving the way towards these business
benefits.
Since the building of data warehouses began and the term was first coined by
industry heavyweight Bill Inmon, the perceived wisdom has been to maintain a
separation between the data warehouse and the company's operational systems.
Today, while this 'separatist' thinking still dominates among many data
warehousing professionals, the approach is being questioned. A school of
thought is emerging that challenges the division of the data warehouse from
operational systems, citing the fact that the justification for separation is
purely technical - as opposed to being a divide founded on the drivers of the
business. The evolving wisdom, particularly acknowledging the continuous march
of technological progress, is to consider alternatives to the implementation of
a traditional warehouse solution, where operational systems are capable of
living in harmony with the data warehouse.
Why accept a compromise?
Organisations look to data warehouses to provide a complete and immediate
understanding of the enterprise, offering the capability to react quickly to
the marketplace and to out-manoeuvre their competitors. Business drivers
include, for example, increased revenues through more effective marketing and
cross-selling to the existing customer base, again based on a better
understanding of customer
activity and profiles. Similarly, by identifying inefficiencies and areas of
strength, cost reductions can be achieved, while revenues can be increased. The
benefits, in short, more than pay for the implementation of the data warehouse.
Then there are
the technical considerations that translate to business benefits for the
organisation. Today, technology is such that separate systems - with the
concomitant investment and ongoing operational costs - could be unnecessary.
The same is true of compromise approaches, or "halfway houses," as discussed
later, where an intermediate database is used to attempt to furnish intelligent
data rapidly. Not only is this approach very much a poor cousin to the data
warehouse, yielding inferior data results, but it also exposes the business to
greater costs in terms of setting the solution up, maintaining it and then
replacing it in the future.
Disharmony and differing demands
The need to cater
for disparate demands is why the data warehouse has traditionally been
implemented separately from operational systems: they each have different
profiles and make different demands on hardware and applications. Technically,
therefore, the IT department has faced a range of conflicts between
performance, operational and user requirements of the respective systems.
Because an operational system is usually built for a transaction processing
workload, it needs to cater for multiple concurrent short-lived transactions, mixing
queries with updates. The data warehouse, in contrast, supports a smaller user
base and longer-lived queries. For instance, while the data
warehouse typically benefits from a disk configuration optimised for high
transfer rates, the operational system hardware needs to support a higher
volume of individual random disk operations. While it is often a good thing
that hardware and operating systems handle mixed workloads, using resources
effectively, performance issues arise not with the hardware but from the hosted
applications or the database. These often take the form of contention for
resources, or excessive consumption of resources such as rollback images.
The invisible and the visible
Operational systems and data warehouse systems have different optimal schemas.
The former is designed for performance and the constant maintenance of
transactional integrity. Users won't be aware of the
schema - it will be 'hidden', whereas the schema in a data warehouse is likely
to be more visible to users. Why? Because for a data warehouse - in order to
fully exploit its potential - the schema needs to be intuitive to users,
offering the ability to undertake flexible queries without resorting to
multi-way joins. Complex joins are difficult for users and often incur a big
performance hit. In this instance, the usual choice is a de-normalised star schema
built from a central fact table surrounded by dimension tables.
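A minimal sketch of such a star schema, using SQLite from Python purely for illustration (all table and column names here are invented, not taken from the article):

```python
import sqlite3

# Illustrative star schema: a central fact table surrounded by dimension tables.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        revenue     REAL
    );
""")
cur.execute("INSERT INTO dim_date VALUES (1, 2004, 1), (2, 2004, 2)")
cur.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 10, 100.0), (2, 1, 5, 50.0)])

# A typical user query: the fact table joins directly to each intuitive
# dimension, with no multi-way joins across a normalised schema.
rows = cur.execute("""
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
""").fetchall()
print(rows)  # [(2004, 'Hardware', 150.0)]
```

Each query follows the same shape: one join per dimension of interest, which is what makes the schema intuitive to end users.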
Choosing and tuning the indexes
There are issues
with indexes too, as the two systems require different ones. Query flexibility,
the ability to perform index scans efficiently, and good selectivity from a
combination of predicates in a where clause mean that a bit map index is useful
in a data warehouse. For an operational system, however, the poor concurrency
of bit maps makes them unsuitable: contention and performance degradation are
highly likely to occur as multiple sessions attempt to update and query the
same bit map index.
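To see why bit maps combine where-clause predicates so cheaply, consider this toy sketch (a simplification of the idea, not how any real database implements it), in which each distinct column value owns one bit map over row positions:

```python
# Toy bitmap index: one Python integer per distinct value acts as a bit map
# over row positions. Combining predicates is a single bitwise AND.
rows = [
    {"region": "north", "status": "open"},
    {"region": "south", "status": "open"},
    {"region": "north", "status": "closed"},
    {"region": "north", "status": "open"},
]

def build_bitmaps(table, column):
    """Build one bit map per distinct value of the given column."""
    bitmaps = {}
    for i, row in enumerate(table):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

region = build_bitmaps(rows, "region")
status = build_bitmaps(rows, "status")

# WHERE region = 'north' AND status = 'open' -> AND the two bit maps.
hits = region["north"] & status["open"]
matching = [i for i in range(len(rows)) if hits >> i & 1]
print(matching)  # [0, 3]
```

The same structure also shows the concurrency problem: every insert or update must rewrite the shared bit maps, which is exactly where multiple operational sessions would collide.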
As for hash clusters,
they should not be used for full table scans, as often occurs with a data
warehouse, because a hashed organised table is spread over more blocks than an
equivalent heap organised table. Hash organised tables are, in short, for
static data, while B-trees might be used more appropriately for both systems as
they provide good concurrency and flexible queries. However, since they can
become rapidly disk IO bound and can cause excessive database checkpoint
activity, they must be used with care on heavily updated tables of significant
size.
Another justification cited for separate systems is scheduling differences. The
platform for the two systems is likely to be handled differently in terms of
administration and upgrading. The availability requirements of an operational
system may be strict, while those for the data warehouse are less so, thus
giving different scheduling cycles.
Implementation and potential pitfalls
Implementing a data warehouse means facing several potential pitfalls. Aside
from the technical issues, there are many commercial and people-related issues
that typically arise with a major development project. From securing budget for
the project to obtaining resources and skills, there are hurdles at every turn,
including overcoming the politics associated with setting up a new area.
Technically, too, there are many potential pitfalls. Consider integration with
operational systems - a major issue to be addressed. Operational systems will
load the data warehouse, necessitating integration at several levels, from the
fundamental (network, hardware and software) to application
areas - data representation, data semantics and data schema. Integration
challenges must be faced up to with the application areas regardless of where
the data warehouse is, but issues faced in relation to the network, hardware
and software are actually highly likely to result from having chosen to
separate the data warehouse from operational systems.
Then there is the
issue of the trade-off between the much-desired timeliness and completeness. For
any data warehouse to be of benefit to the enterprise, it needs to satisfy both
wide-ranging queries aimed at understanding long-term trends, and queries of
up-to-the-minute data, which target the current short-term position of the
business. It is difficult, if not impossible, to resolve these conflicting
requirements within the constraints of the loading window.
This conflict has
been difficult to resolve, and in order to do so, many organisations have
turned to the typical solution of "the halfway house", as referenced earlier. This
is the implementation of an intermediate database that provides the business
with its volatile and intermediate summary business information but very much
on a restricted query basis. Needless to say, such a solution introduces
further complications to the technical architecture - not to mention
greater implementation and operational costs. If an operational system and a
data warehousing system living in harmony on one server can avoid this one
issue alone, apart from all the rest, surely it is worthy of consideration?
Achieving the best query performance
To achieve optimal performance for queries, the schema used by a data warehouse
will be different to that used in operational systems, and converting from an
operational to a warehouse schema means employing a combination of
de-normalisation and over-normalisation.
De-normalisation combines columns and/or tables to avoid join operations at query time. This is
usually more intuitive for end users and significantly enhances performance.
Over-normalisation partitions tables horizontally and/or vertically to improve
performance. Partitioning opens up the possibility of parallel data loading and
parallel query execution and also provides an effective method for rolling in
and rolling out historical data. This latter aspect can be central to achieving
timely updates to the data warehouse.
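The roll-in/roll-out benefit of horizontal partitioning can be sketched as follows (a toy model, not a real partitioning engine; the month-based scheme is an assumption chosen for illustration):

```python
# Toy horizontal partitioning by month. Each partition holds one month's rows,
# so historical data is rolled out by dropping a whole partition rather than
# by bulk-deleting individual rows.
partitions = {}

def load(batch):
    """Roll in new rows; separate partitions also permit parallel loading."""
    for month, revenue in batch:
        partitions.setdefault(month, []).append(revenue)

def roll_out(month):
    """Drop an entire partition of historical data in one cheap operation."""
    return partitions.pop(month, None)

load([("2004-01", 100.0), ("2004-02", 120.0), ("2004-03", 90.0)])
roll_out("2004-01")        # the oldest month leaves the warehouse
print(sorted(partitions))  # ['2004-02', '2004-03']
```

Because rolling out history never touches the remaining partitions, it can run inside a tight loading window without disturbing current data.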
Aggregation is another possibility. Using this approach to optimise query
performance involves pre-computing sums and counts of data across various
dimensions as the data is loaded, so that queries do not have to do this work
at query evaluation
time. This can substantially improve query performance as a query may only need
to fetch a handful of rows from an aggregation table - and the cost of
performing such calculations is moved from query time to load time. If load
windows are tight, aggregation can become an issue in its own right.
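A minimal sketch of load-time aggregation (the dimension and column names are invented): the summary is maintained as each row arrives, so a query reads a single aggregate row instead of scanning the fact table.

```python
from collections import defaultdict

# Pre-computed aggregation table keyed by a dimension value. Sums and counts
# are maintained as each row is loaded, moving the cost of summarisation from
# query time to load time.
agg = defaultdict(lambda: {"sum": 0.0, "count": 0})

def load_row(region, revenue):
    """Load one fact row and keep the aggregate up to date."""
    agg[region]["sum"] += revenue
    agg[region]["count"] += 1

for region, revenue in [("north", 100.0), ("south", 40.0), ("north", 60.0)]:
    load_row(region, revenue)

# A query over the 'north' dimension now touches one aggregate row.
print(agg["north"])  # {'sum': 160.0, 'count': 2}
```

The flip side the article notes is visible here too: the extra work per loaded row is exactly what squeezes a tight load window.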
The march of progress: database technology
Database technology has moved on since the underlying architecture for data warehouses
was first established. While performance improvements have been achieved for
the separated warehouse, they have been based on unchanged fundamental
principles and assumptions, and have largely ignored the march of technological
progress in database technology. These database enhancements have 'moved the
goal posts' to the degree that new technologies and database features make it quite
reasonable to unite operational and decision support functions - so that they
can at last live in harmony with one another.
We are not
suggesting that there is a nirvana where a complete methodology for
implementing disparate applications on a single database exists. That needs time
to evolve. However, there are some avenues worth exploring should a full data
warehouse not be suitable or worthwhile or where the availability schedule for
the operational system is flexible.
Data replication plays a key role here. One database serving both operational
and data warehouse systems needs to use replication to separate the two
applications, with the operational system using its conventional
transaction-optimised schema, which is then replicated and mapped to a
warehouse-optimised schema on the same database. Using continual incremental
synchronisation, data is replicated
onto the warehousing system. The big question here is, what indexes are now
suited to the warehouse schema? Big tables with dynamic data, such as a fact
table, rule out both B-tree indexes and hashed clusters: at that scale, both
become disk IO bound during index updates, which is likely to seriously impact
the synchronisation process. For small or medium cardinality columns (assuming
that multiple synchronisation processes are not updating the same index) bit
map indexes may be acceptable. But the avoidance of contention between the
synchronisation process and user queries must still be considered.
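The continual incremental synchronisation described here might be sketched like this (a deliberate simplification: the change-capture log and the schema-mapping step are assumptions for illustration, not any specific product's mechanism):

```python
# Sketch of continual incremental synchronisation on one database: changes to
# the transaction-optimised operational tables are captured in a log and
# mapped onto the warehouse-optimised schema in small, frequent batches.
operational_log = []   # change log written by the operational system
warehouse_fact = []    # de-normalised warehouse fact table

def record_order(order_id, customer, amount):
    """Operational write: a short-lived transaction appends to the log."""
    operational_log.append({"id": order_id, "customer": customer, "amount": amount})

def synchronise(last_applied):
    """Apply only the changes newer than the last synchronised position."""
    for change in operational_log[last_applied:]:
        # Map the operational row onto the warehouse schema (here: flattening).
        warehouse_fact.append((change["customer"], change["amount"]))
    return len(operational_log)

record_order(1, "acme", 250.0)
pos = synchronise(0)            # a small incremental batch, not a bulk load
record_order(2, "acme", 100.0)
pos = synchronise(pos)
print(warehouse_fact)  # [('acme', 250.0), ('acme', 100.0)]
```

Because each batch is small, the warehouse stays close to the operational position; the index question in the text is about making those frequent batch updates cheap.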
Third-party indexes: a step toward harmony?
Today, there are
alternatives. Third-party indexes have been developed to integrate with the
database and they can provide the generic functionality of B-trees without
suffering the same performance impact during index updates. In addition to
being fast to update and fast to query, they can be used for fact tables that
need to be continuously synchronised as queries are simultaneously performed.
In looking to bring the data warehouse into a harmonious relationship with the
operational system, these indexes provide a compelling alternative to
conventional B-tree indexes, hash indexes and bit maps. They are key to
attaining good performance with incremental updates.
Newer technologies, such as Adaptive Addressing, provide a more efficient and
flexible alternative to conventional indexing techniques, reducing the
overheads and constraints traditionally associated with indexing. There are
also data aggregation techniques, which can further eliminate some of the
indexing requirements and provide fast and flexible access to data analysis.
So the approach is worth considering: with effective incremental
synchronisation, data in the warehouse is far more up-to-date than would be possible with a
traditional warehouse, which often involves a bulk transfer of data through
file systems and staged rebuilding of indexes at infrequent intervals.
Separate tablespaces mean that data warehouse tables and indexes can reside on
different disks from the operational data, which avoids disk IO contention. And
separate dedicated
rollback resources can be allocated and tuned to meet the different demands of
the small transactions for operational schema and the large transactions for
the warehouse schema. It is also worth considering separate buffer pools to
guarantee that warehouse queries do not flush out any blocks cached for the
operational workload.
The block size
should suit the operational system, which is fortuitous since the operational
system is likely to exist and changing the block size is not an easy option.
Setting an appropriate multi-block scan size will be necessary, however, as the
warehouse queries will require a larger transfer size for full table scans.
What are likely
to stay the same for a co-resident scheme are the decisions made about de-normalising
and over-normalising in a separate warehouse schema, which are taken to
optimise query performance and allow an efficient roll out of historical data.
Optimum implementation of the incremental synchronisation mechanism will demand
some additional changes, and attention must be paid to the synchronisation
process for transactional consistency, as queries are active while the
incremental update is in progress. To ensure that decision support queries run
against a schema optimised for them, they should, of course, be confined to the
warehouse schema and kept away from the operational schema.
One for all?
Ultimately, it must
always be remembered that any organisation needs to objectively assess the
strengths and weaknesses of all approaches based on what the business actually
needs. But what must be borne in mind is that new database technology is
playing a more significant role than before, and that there are ever more
compelling arguments for achieving a state of harmony when implementing a
co-resident schema with incremental updates.
For one thing,
warehouse data is made more relevant and useful due to the timeliness of the
update. Additionally, there is likely to be a reduction in operating costs, as
deploying one database removes the need for a second hardware platform, as
well as for bulk data transfers between database instances and their probable
resulting complications.
In practice, organisations need to cut their own way through the dense
undergrowth of today's data and the data warehouse issue. One system will most
likely give more equitable sharing of the total resources than separated
hardware platforms, but whichever route is taken, more disks, more memory and
more CPUs are probable.
A key point to remember is that the "halfway house" route may well represent a risky
option and is highly likely to result in significantly greater costs to the
business in the long term. It is doubtful whether organisations adopting this
approach will achieve the true business benefits of data warehousing - such as
the rapid access to the type of high quality, reliable and intelligent data
that can be used to drive the business forward.
Another key point
to keep high on the agenda is the impact of advancements. Today, both
traditional and accepted wisdoms are constantly under scrutiny - they are being
tested fiercely by evolving technologies and thinking. It is precisely for this
reason that co-hosting a data warehouse with an operational system must be a
serious consideration for any forward-thinking business.