With promises of incredibly fast queries, many IT shops implemented one or more big data applications in combination with high-performance hardware and software suites. Results have been generally good: many installations report incredible decreases in query elapsed times, sometimes by factors of 100 or more. Such improvements translated to faster results from analytical queries as well as a better understanding of corporate data.
However, few IT enterprises followed the appropriate data governance means and methods of ensuring data quality. What ensued was data of dubious quality being loaded into these applications, calling into question the results. In this article we review data quality methods and metrics for loading big data applications.
The Big Data Environment
Medium and large IT enterprises have implemented big data applications with some success. These applications usually store data in a propriety hardware appliance that is optimized for fast analytical queries. In addition, it is common to implement big data applications integrated with the existing data warehouse environment. The reason for this is simple: data warehouses contain the dimensional data (geography, time, organizational hierarchy, product classifications, etc.) that are used for subsetting and aggregation in big data analytical queries.
The data warehouse architecture is static and time-dependent, and supports a heavy query-only workload. With these similarities, it is logical to approach data quality in this dual environment in terms of standard data warehouse flows and service level agreements (SLAs) combined with current warehouse data quality measures.
Typical warehouse performance and data quality measures include:
- Speed and volume of data extract. Data enters the warehouse from operational systems either directly or via transferred extract files. This data is collected daily via a snapshot process at the end of regular business cycles. Metrics include total data volume, CPU and elapsed time used for each data transfer. Performance results are aggregated by transfer method such as FTP, SQL, or sequential file.
- Data transform resources used and errors. Data from operational systems is rarely 'clean'. Some field values may not be known at the time of extract and may be empty or contain special indicators such as all 9s. Other fields may contain invalid data, usually resulting from the formats of extract files not matching current or recently updated operational databases. Finally, some fields may exist in a form not easily understood by analytical users; code fields are a prime example, where single letters or digits may represent different departments, categories, of status information. All of these fields must be edited, corrected or transformed into a form that can be stored efficiently and queried effectively. This process is particularly important in the case of databases, where table column attributes such as DATE or DECIMAL require data to be valid and in a specific range. Metrics include total volume of data transformed, CPU and elapsed time used during the transformation, and the categories of transformation.
- Data staging and keying. In this step, the transformed data is staged into intermediate tables, and surrogate keys are calculated and assigned to key fields. Again, metrics include CPU and elapsed time used. Another important aspect of data quality in this process is that the staging tables form the basis for populating important data warehouse fact and dimension tables. Hence, the data must be correct, valid, and verified at this point prior to the next step.
- Data load. Here, the warehouse load process is usually straightforward. Interestingly, a big data application adds another layer to this process, as the appliance used for query acceleration must be loaded as well. It is the data loaded into the appliance in proprietary format along with the multiple parallel data storage and retrieval mechanisms that permits lightning-fast query execution. Metrics here emphasize elapsed times and SLAs, since this is the final step prior to business analytics and ad hoc querying.
- Data archive. Since warehouse data is time-dependent, over time the data ages and becomes less relevant. For example, it is rare that ten-year-old sales data would be used to estimate current customer product preferences. In addition, as the warehouse grows there is a potential for poorly-written queries to run longer and longer. The same is true for a big data application, despite extremely fast query times. Finally, the bigger the database the longer it takes to back it up or recover it in case of a disaster or outage. To counteract this, the database administrator will usually implement a regular purge process to remove old or stale data. This is usually done in compliance with risk standards, as there may be retention requirements for some data. Metrics include data volume archived and summaries of what current data is of what age.
Measuring Big Data Quality
With metrics in place you now have the ability to view resource usage over time. The next step is to construct and implement regular period monitoring. This is monitoring and gathering data about specific categories of objects and how those measurements change over time. The purpose is to allow you to analyze objects in the context of their dependence on resource constraints. After reviewing the reports you will have a much better idea of those objects that are most affected by resource availability. These are the objects you begin with in your review of data quality.
We now advance to quality metrics. You should review these shown below (and others) and prioritize them for your critical objects.
- Data volume match. Is there a difference between the number of records present in the operational and warehouse systems?
- Data consistency. Do data elements' meanings conflict? For example, an employee cannot have both 20 years of experience and an age of 23.
- Domain integrity. Do data element values match their domain? For example, a column titled Salary should probably not be less than zero, and InterestEarned should only occur for an account with a balance greater than zero.
- Business rule integrity. Do data elements values match business rules? For example, someone not in the Sales department should probably not have a positive value for CommissionEarned.
- Non-duplication. Common in a warehouse environment, entries for the same key on the same date should be unique. Similarly, input data extracted from original systems should not contain duplicates.
- Existence of orphan records. Entities should obey referential integrity constraints. For example, an Order should not exist unless there is a matching Customer for that order.
- Data availability. Data needs to be available for querying and analytics per the service level agreement.
For each of these rules you should implement a filter or preprocessing step to perform data validation before it enters your big data application. One method is to code these rules as SQL statements suitable for running against your staging tables. Alternatively, develop a table-based method of reviewing incoming extract files. Rules and data quality checks can then be added or removed from the table when required.
With basic data transformation and metrics in place, the final step is to distinguish between types of analytics users. Here are the basic categories.
Users running ad hoc queries. These users create their own queries against data tables using structured query language (SQL). They then use an online SQL execution tool to run the query and produce the results in raw data form, which they can then observe directly or download to a spreadsheet program for further analysis. These users know the data tables, have SQL expertise, and use simple tools to refine the results.
Sophisticated report analysts. These consumers typically use a sophisticated reporting tool that displays a pictorial data model of the data. They then manipulate the model by dragging and dropping tables and columns to a report window. The tool then creates the appropriate SQL statements based on the model and other parameters, executes the query, and displays the results. These users know the data, usually do not have SQL expertise, and require some advanced query and statistical reporting techniques.
Data mart consumers. These users have their own highly-specialized business data analytical software. They directly extract business data from the source and store it on a local server. They then use special-purpose software to analyze the data.
Any big data quality solution must take these communities into account, as each of them have differing needs. Report analysts will require accurate and complete data in order for analytics to deliver valid and actionable results, while ad hoc users may only require data that is “good enough” for their initial analyses.
Generally speaking, higher data quality leads to better results. In the big data environment, where database sizes are very large and data volume and velocity are high, it is tempting to allow poor quality data into the application. However, in order to permit users to provide results that add value to the enterprise the big data support staff must implement data quality standards, sufficient data transformation processes and good quality metrics. These will ensure that business analytics users can generate meaningful results and reports that provide value to the enterprise.