The staging area tends to be one of the more overlooked components of a data warehouse architecture, and yet it is an integral part of the ETL component design. Learn why it is best to design the staging layer right the first time, enabling support of various ETL processes and related methodology, recoverability and scalability.
In any data warehousing initiative, there are several common components to the architecture. There are the data sources and targets, ETL framework, infrastructure, application layer and the data staging area.
The staging area, in my experience has to be one of the more overlooked and underestimated components of a data warehouse architecture. I think mostly this is due to a lack of understanding as to what exactly it is.
If a quick search is made through a number of websites, many definitions will include the fact the data staging area is simply a temporary workspace used to transform and enrich data before it flows into the operational data store (ODS) and the data warehouse.
This is a good fundamental definition of the data staging area. However, it is so much more. How much more do you ask? Well in reality, the data staging area is an information hub that facilitates the enriching stages that data goes through in order to populate an ODS and/or data warehouse. It is the essential ingredient in the development of an approach and/or methodology for creating a comprehensive data-centric solution for any data warehousing project.
If we really think about this, the data staging area is an integral part of the ETL component design and is the foundation for the ETL architecture.
The Design of the Information Hub
The data staging area has been labeled appropriately and with good reason. With any data warehousing effort, we all know that data will be transformed and consolidated from any number of disparate and heterogeneous sources.
However, the design of a robust and scalable information hub is framed and scoped out by functional and non-functional requirements. Examples of some of these requirements include items such as the following:
- The amount of raw source data to retain after it has been processed through the ETL data lifecycle
- The type of server(s) to house the staging area will be dedicated or shared with other applications and environments (dedicated servers are a proven way to go)
- The acceptable levels of data quality, related baselines and metrics as stated by the Data Governance Board
- Decisions on the data sources that will be federated in and the ones that will be a copy of the sources
- The management of metadata as data sources are brought into the landing zone of the staging area
- The level of security and roles defined for each of the areas with the staging environment
- The masking/scrambling of sensitive data within staging areas
- The identification of recoverable artifacts in the event of disasters, etc.
With these types of requirements, rules and decisions, a scalable and secured framework is firmly in place to facilitate the defined ETL methodology. These data sources go through a number of evolutionary stages in order to build a robust and comprehensive data warehouse and/or ODS. Moreover, as great data architects that we are, we know that these stages must include the following.
This process includes landing the data physically or logically in order to initiate the ETL processing lifecycle. The staging area here could include a series of sequential files, relational or federated data objects. However, the design of intake area or landing zone must enable the subsequent ETL processes, as well as provide direct links and/or integrating points to the metadata repository so that appropriate entries can be made for all data sources landing in the intake area.
Data profiling is the surveying of the source data landscape to gain an understanding of the condition of the data sources. In most profiling efforts, this means generating various reports with any number of metrics, statistics, and counts that reflect the quality of the source data coming in.
Data cleansing is an iterative set of processes that starts and ends with the business rules and standards around acceptable data quality levels from the Data Governance Board (e.g. 95% of the data meets the quality standards). ). This includes investigative jobs to provide additional detail in detecting data patterns and design alternatives for quality enforcement at the attribute, record and aggregate levels and data correction jobs to fill in missing or incomplete data and correct data values. There is also the analysis of reports based on the findings and results of the investigation and data correction jobs to determine if further refinements and/or modifications are to be made.
Data Standardization and Matching
Data standardization and matching is a set of processes that primarily consists of the design and execution, standardizing jobs to create uniformity around specific mandatory data elements. This includes the design and execution of matching and de-duplicating jobs to eliminate duplicate data and create a single version of the truth. It also includes the analysis of reports related to errors and/or exceptions and determines if further refinements or modifications are to be made (if required) and to assess the readiness for data delivery to the data warehouse and ODS.
Transforming data essentially means converting data to conform to a standard established by the Data Governance Board. Examples of data transformations include converting nulls to specific values, gender codes that are disparate to a common set of values or even merging multiple source fields to one data element.
Depending on business requirements, the loading phase can include a total data refresh of the target component or adding new data to the data component in a historical manner. Loading to a staged copy of the target component enables a series of validation exercises. This includes verification of referential integrity, data quality and transformation rules prior to the actual data population of the DW and/or ODS.
Design and Construction
The creation of a staging area will usually start with the typical activities of the design of any data environment. Tasks such as server configuration, alignment of file systems, creating the database instances and related database objects are common elements in the design of any infrastructure dedicate to a data environment.
However, there a number of unique tasks that need to be completed to align the staging area to the ETL methodology discussed in prior sections of this article.
For starters, the data architect and the DBA will need to create separate environments for each stage that the data goes through. This means separate database and file systems that are dedicated to the stage that the ETL lifecycle is in.
For example, a dedicated database instance and related file systems should be created for the data acquisition and profiling stages. The tasks included in these stages are the reading of every data element and record in order to generate detailed statistical information on the source data. This means that processes involved in the profiling effort will be using tremendous amounts of resources related to memory and CPU and should be segregated so that other workloads are not adversely impacted. The design of the database instance must take into consideration the fact that with the use of federated data, there may be implications at the database level that will cause ripple effects on the other data objects within the database instance. Also the file systems allocated to the containers that the database uses should be separate from the file systems used in the data acquisition process so that there are no I/O bottleneck issues.
Then there is the SECURITY component! This is live production data that has highly sensitive information. This data cannot be masked and/or scrambled as this defeats the whole purpose of the ETL process to stage data into the data warehouse or ODS. The raw data must be exposed in order for the ETL to be as effective in integrating, cleansing and standardizing all data from all sources. Therefore, having a robust security framework is an essential ingredient in this configuration. Typically, the data steward and an appointed business analyst should be among the chosen few that have access to some of the sensitive data elements. The ETL developer, DBA and system administrator does not need to see any of it. There is also the prevention of copying data. No one should be allowed to make copies of anything for any purpose. The information hub should be able to satisfy all requests for data access for analysis in a robustly secured environment.
The Information Hub Experience - Tales from the Data Layer
I was assigned to the first data warehouse project at a major healthcare company. It was our first time working with an ETL solution and all that comes with it. We successfully installed the toolset, created the protocols to pull in the data sources and target data warehouse components. However, it was an extremely painfully project. Why? Because whenever the ETL processes aborted or there were hardware failures, there were no clean ways to restart anything! The staging layer was the sum total of several file systems allocated for ETL usage and not much else was in place at the staging area level. In other words, we built a flimsy foundation for the ETL component and we paid dearly for it!
At another time, having grown from that experience, I worked at another client site as part of a team to design and construct a data warehouse environment complete with an ETL solution, etc. This time, I knew I would get it right! I created an information hub that had file systems and a database, tables and views. This database had federated objects and every kind of bell and whistle you could think of. Except that during the data profiling process of the federated objects, the process ran out of temporary space at the source application and aborted. The error message generated was that the database is corrupt and all is lost. Talk about the panic! I had that look in my eyes! Everything ground to a screeching halt while I completed the database recovery.
The lessons here are to design the staging layer to enable support of various ETL processing and related methodology, recoverability and scalability.
A well-designed staging area should enable the ETL approach, processes and services and the facilitation of the data management activities with business analysts, data stewards, (validation of business rules) profiling reports, quality reports and successfully stage the data required to populate the data warehouse and the operational data store. Failure to do that will lead to many sleepless nights, days spent in war rooms and putting the data warehouse project in jeopardy of not meeting milestones and deadlines. I have been on both sides and not being a big fan of the war rooms, I now know better. You should too!