Big data applications are here to stay. Most large IT enterprises have implemented one or more large data stores, hybrid hardware/software solutions, and high-speed business analytics packages. The promise of this technology is the ability to quickly and easily analyze large amounts of data and to use that analysis to drive changes to customer-facing systems. Management believes that the analysis and subsequent changes will drive up customer satisfaction, market share, and profits, hopefully at a reasonable cost.
If these solutions give analysts their answers in seconds, rather than hours or days, why worry about performance tuning? The key is that analytics is only one part of the process. IT systems must still acquire data from source systems, transform it, and load it into the big data application. In addition, another factor will come into play: the evolution of big data implementations from specific use cases to mission-critical applications.
In today’s commercial information technology systems, larger and larger volumes of data are being gathered and stored. To be able to capture, store, and analyze this data, most enterprises turn to specialized hardware and software solutions. One common solution category is a special-purpose hardware data store (sometimes called an appliance) coupled with business information analytics software that is tailored to access that appliance. One example is the IBM DB2 Analytics Accelerator (IDAA).
These solutions are expensive. Large data stores require extensive disk and memory arrays. High-performance access requires substantial CPU power coupled with complex data access paths that allow multiple processes to read various sections of the data in parallel. All this technology is focused on one thing: high-speed query execution.
Yet querying large volumes of data is only one part of the picture. Other processes feed the big data application, and the database administrator (DBA) must concentrate their performance tuning efforts here. The most important areas are data transformation and bulk data load.
Acquire, Extract and Transform
One classic process in very large database applications and data warehouses is the initial acquisition of data from source systems and operational data stores. Typically this takes the form of simple data copy operations, database extracts, and inputs from external sources. A big data application may store terabytes of new information daily, which means IT staff may need to investigate methods of speeding up data flows. Some options include data compression, sending file change descriptions rather than full files (sometimes called deltas), and specialized software or hardware solutions.
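The delta approach mentioned above can be sketched in a few lines. This is an illustrative example only, not taken from the article: the snapshot layout (a key-to-payload mapping) and the function name are assumptions.

```python
# Illustrative sketch: computing a "delta" of changed records so that only
# changes, not the full file, are shipped to the big data platform.
# The snapshot layout (key -> payload) is a hypothetical simplification.

def compute_delta(previous, current):
    """Return (inserts, updates, deletes) between two snapshots.

    Each snapshot is a dict mapping record key -> record payload.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = [k for k in previous if k not in current]
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Example: of four original records, one changed, one was deleted,
# and one new record arrived -- only three items need to be shipped.
yesterday = {1: "A", 2: "B", 3: "C", 4: "D"}
today     = {1: "A", 2: "B2", 4: "D", 5: "E"}
ins, upd, dele = compute_delta(yesterday, today)
```

For large daily volumes, shipping only the delta (optionally compressed) can cut transfer time dramatically compared with resending the full file.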
In addition to simply transferring the data from the source system, most data warehouses perform data transforms. Here, the IT staff codes transformation logic to replace missing data with defaults, correct invalid data such as dates that are outside a specified range, verify status code fields, and so forth. Executing this logic requires CPU time, and as data volumes increase the supporting hardware may constrain the speed of this process. IT staff can address this issue by adding more processing power, parallelizing the data flows, or moving some of the transform logic into the source system extract process.
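The transform logic described above can be sketched as a per-row cleaning function. The field names, default value, and valid date range below are illustrative assumptions, not details from the article.

```python
# Hedged sketch of typical transform logic: replace missing data with
# defaults and reject dates outside a specified range. Field names,
# the default, and the date range are invented for illustration.
from datetime import date

DEFAULT_REGION = "UNKNOWN"
MIN_DATE, MAX_DATE = date(2000, 1, 1), date(2030, 12, 31)

def transform(row):
    """Clean one source row; return the cleaned row, or None if unusable."""
    cleaned = dict(row)
    if not cleaned.get("region"):
        cleaned["region"] = DEFAULT_REGION   # fill missing data with a default
    d = cleaned.get("order_date")
    if d is None or not (MIN_DATE <= d <= MAX_DATE):
        return None                          # date outside valid range: reject
    return cleaned

rows = [
    {"region": "EMEA", "order_date": date(2024, 5, 1)},
    {"region": "",     "order_date": date(2024, 5, 2)},   # missing region
    {"region": "APAC", "order_date": date(1890, 1, 1)},   # invalid date
]
cleaned = [r for r in (transform(r) for r in rows) if r is not None]
```

Because the function operates on one row at a time, it parallelizes naturally (for example, with `multiprocessing.Pool.map`), which is one way to address the CPU constraint noted above.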
Bulk Data Load
Big data naturally implies large volumes of data, which must be loaded into the database management system or the special appliance before users can execute analytics. Data load is by nature extremely I/O-intensive. Many vendors of special appliances now provide high-speed load functions based on special sorting, specialized hardware, or methods that continuously feed new data over time rather than loading it in bulk. Each of these options can be tuned based on the vendor's specifications.
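Two common load optimizations, presorting input in key order and committing in batches rather than per row, can be sketched as follows. This is a minimal illustration only; sqlite3 stands in for the target DBMS, and the table, key, and batch size are hypothetical.

```python
# Illustrative only: presort input in clustering-key order and load in
# batches with one commit per batch. sqlite3 stands in for the target
# DBMS; real appliances provide their own tuned bulk-load utilities.
import sqlite3

rows = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]
rows.sort(key=lambda r: r[0])           # load in clustering-key order

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, val TEXT)")

BATCH = 2                               # tune batch size per vendor guidance
for i in range(0, len(rows), BATCH):
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows[i:i + BATCH])
    conn.commit()                       # one commit per batch, not per row

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Loading in key order keeps the data clustered and reduces random I/O; batching commits reduces log overhead. Real bulk-load utilities apply the same ideas at much larger scale.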
Mission-Critical Big Data
Over time big data applications accumulate data and become more valuable. Users become familiar with the data, and with the results of their analytics. The tendency will be for users to issue more and more queries of increasing data scope and complexity. The sequence of events may be something like this:
- Your company implements a big data solution;
- Users use specialized software to create and run high-speed analytical queries;
- Some queries produce results that, when analyzed and acted upon, result in better customer satisfaction, better market share, or higher profits;
- The big data application grows in size;
- Excited analysts submit more queries; some are so useful that they are converted to regular reports;
- The number of valuable reports results in management designating the big data solution and analysis as mission-critical.
The result? As the number of concurrent queries greatly increases, overall response time increases. The big data appliance requires more resources in the form of disk storage, CPUs, and I/O channels. Cost goes up and speed goes down.
The alternative is performance tuning. This usually takes two forms: query performance tuning and resource constraint analysis.
Query Performance Tuning
Big data stored in DB2 tables may require the DBA to minimize the number of indexes on the data. While it is typical to add multiple indexes to a table to improve query performance, for very large tables the indexes will be large as well. Disk storage limitations may prevent the DBA from creating some indexes. In addition, a greater number of indexes slows data insert performance and makes any database recovery process run longer.
Big data housed in a proprietary hardware and software appliance must often be accessed simultaneously with data warehouse tables. (This is typically implemented using SQL join statements.) The DBA must coordinate loads of the big data appliance with data warehouse extract-transform-load (ETL) processes to ensure that all data is available during querying.
A best practice for tuning distributed access performance is resource constraint analysis. The DBA monitors resources such as the disk subsystem and CPU while gathering performance data; even query and job elapsed times can be treated as resources. When the DBA discovers that a resource is constrained, they can then rebalance other resources to compensate.
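The analysis step can be sketched simply: sample each resource over time and flag any that run near their limit. The metric names, sample values, and 90% threshold below are invented for illustration; real monitoring would use the platform's performance tooling.

```python
# Sketch of resource constraint analysis: sample a few metrics over time
# and flag any resource pinned near its limit. Metric names, samples,
# and the threshold are hypothetical examples.

SAMPLES = {                      # percent-busy samples per resource
    "cpu":         [96, 97, 95, 98],
    "disk":        [40, 45, 42, 44],
    "io_channels": [55, 60, 58, 57],
}
THRESHOLD = 90                   # flag resources averaging above this

def constrained(samples, threshold=THRESHOLD):
    """Return the names of resources whose average utilization exceeds the threshold."""
    return [name for name, vals in samples.items()
            if sum(vals) / len(vals) > threshold]

bottlenecks = constrained(SAMPLES)
```

Here the CPU is the constrained resource, so the DBA would look at offloading work from it, for example by pushing transforms upstream or adding parallelism, rather than tuning the already-idle disk subsystem.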
Several resource constraints occur regularly in big data applications; data contention and locking are the most common.
The most common locking problem is SQL statements that lock too much data. An SQL statement that reads a row may lock multiple rows as read-only for the duration of the statement. This behavior is controlled in multiple places, including the statement syntax, the database definition, and the application's use of commit statements.
DBAs should review SQL statement locking behaviors to make sure that the minimal amount of data is locked. Know the lock size of objects, and how applications access the data. Long-running applications may lock data for long periods of time, reducing data availability. Consider row-level locking to minimize the footprint of the SQL, although this may lead to an increase in CPU time used to manage locks.
Application commit logic should also be reviewed. A commit releases locks and allows other processes to access the data; committing too infrequently means locks are held for long periods.
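Commit-interval logic of the kind described above can be sketched as follows. This is a hedged illustration, not DB2 code: sqlite3 stands in for the DBMS, and the table, row counts, and interval of 100 are arbitrary examples.

```python
# Hedged illustration of commit-interval logic: committing every N updates
# releases locks sooner, improving availability for concurrent readers.
# sqlite3 stands in for DB2; the interval and table are arbitrary examples.
import sqlite3

COMMIT_INTERVAL = 100            # tune: smaller = better concurrency,
                                 # larger = less commit overhead

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, bal INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, 0)",
                 [(i,) for i in range(250)])
conn.commit()

updated = 0
for i in range(250):
    conn.execute("UPDATE accounts SET bal = bal + 1 WHERE id = ?", (i,))
    updated += 1
    if updated % COMMIT_INTERVAL == 0:
        conn.commit()            # release locks held by this unit of work
conn.commit()                    # commit the final partial batch

total = conn.execute("SELECT SUM(bal) FROM accounts").fetchone()[0]
```

The tradeoff is the one the article implies: a long-running job that never commits holds its locks until it ends, while committing every row wastes CPU and log I/O. A periodic interval balances the two.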
Additionally, the DBA should review application and utility scheduling. For example, verify that utilities such as Image Copy are not running concurrently with applications that do database updates.
A big data application does not exist in a vacuum. It requires a lot of input data, which must be acquired and transformed. The data must then be loaded into a combination of database tables and a big data appliance. Each of these steps is a point where the DBA can focus on performance tuning.
Three Success Factors for Handling Big Data and Data Warehouse Performance - Dave Beulke 2012