Many enterprises that have installed big data applications have done so using vendor-supplied hybrid hardware / software solutions. These solutions, sometimes called appliances, store huge amounts of data in a massively-parallel storage array and use sophisticated analytical software to return solutions to SQL queries. One of the main selling points of such a configuration is the incredibly fast execution time of queries, sometimes thousands of times faster than if they were executed directly against DB2 tables.
However, with multiple applications accessing enterprise data we now need mechanisms to guarantee that the data is easily available. Here, we discuss a few common techniques to ensure that a big data application does not cause bottlenecks in mission-critical processes.
Availability is not Performance
Performance deals with using resources to get work done quickly and efficiently; availability relates to whether or not all required data is accessible in a timely fashion given the current resource load and data locks and claims.
Consider a long-running query that requires a lot of DB2 memory for sorting. During execution of this query, that memory is not available to other queries. This is a performance consideration. On the other hand, consider a utility that loads a customer table. In most cases, the loaded data is not available to other applications until the load completes successfully.
Loading Data to the Appliance
Most early-generation big data solutions involve using the DB2 load utility to load DB2 tables, followed by execution of a process that loads the new data into the appliance. There is, therefore, a short delay between the time the data is available in DB2 and when it is loaded into the appliance.
In addition, as data is loaded into the DB2 table, portions of the current data may not be available. This can be due to one or more of the following:
Hot spots. A hot spot is a physical place in the database where a lot of data access is happening. The worst situation is when a lot of inserts, updates, deletes, and selects are happening to a set of rows in one location. A flood of inserted rows requires DB2 to find space to store the data, eating up embedded free space. Simultaneous rows updates and selects can lead to deadlocks and timeouts.
Usually, hot spots can be predicted (and perhaps avoided) during database design. Take the case of an account table where new rows are assigned new, ascending account numbers. If the table is clustered by account number, the physical end of the table now becomes a hot spot where all new rows will be inserted.
Another example might be a table keyed by date, where all transactions for a particular date range are stored (i.e., clustered) close together.
Some hot spots can be avoided, or at least mitigated, during database design. The DBA can embed free space for rows to be inserted, and can schedule frequent reorgs to re-distribute rows. It may be possible to physically cluster data in ways that spread new row inserts evenly across the table. The DBA will also coordinate with the data owners to ensure that old or unused data is regularly purged or archived.
Index processing. As the DB2 load utility processes rows, it first completes loading data into the table. Once finished, it begins creating index entries for the new rows. These new entries need to be inserted into one or more of the existing indexes, and this process may cause a lot of movement and changes to index internals. The result: indexes are not available during this phase of the load process.
Stale data purge. Another best practice for table maintenance is to purge old or stale data from tables. This is usually accomplished immediately prior to the load process in order that space freed up by the deleted rows can be reused. Considering that most business tables are related in some way (customer and order, patient and treatment, assembly and part, etc.), purging stale data from a single table may leave related data in other tables in the system. For example, a purge of customers over 10 years old from a customer table may render invalid the orders from those customers in other tables.
Data Resides in Multiple Places
IT staff now realize that most of their big data now resides in at least two places: DB2 tables and the appliance. Indeed, it may well exist in other places as well, including image copy datasets, datamarts containing extracts of the big data, and potentially DB2 log datasets as well. It is common to see a creative enterprise increase data availability by distributing data access. Here is one example.
- Direct large ad hoc analytical queries to the big data appliance;
- Direct smaller ad hoc queries to DB2 tables;
- Have applications requiring bulk data extracts retrieve their data from image copies.
These are not meant to be specific recommendations. I merely wish to point out that there are multiple sources for your data, and that you can increase data availability by making judicious use of the resources at your disposal.
DB2 Availability Issues
For those interested in using DB2 tables as an alternate access path to big data, here are a few best practices for data availability.
Indexes. Remember that:
- Data loading typically happens first to the DB2 table and then to the appliance;
- DB2 table loads take longer when indexes are present;
- Database designs contain indexes for many reasons, include uniqueness enforcement, partitioning, and query performance;
- The appliance has no use for DB2 indexes, loading instead directly from the table.
To maximize data availability, more indexes are preferred. One common tendency with DB2 database designs is to create an index for every column, or set of columns, likely to be searched. This must be balanced against the elongated load time.
Horizontal Partitioning. Horizontal partitioning (or partitioning by key range) is a common method of assigning data rows to separate physical media. Such separation can have many advantages.
Partitioning by date range is common. This allows application processes to access partitions independent of one another. A good example would be an archive or purge process that acts on the data in the oldest partition while new table rows are inserted in another. Another example is creating one partition per day, which allows activity on the current day's data to be physically separated from other processes.
Another byproduct of date range partitioning is that backup and reorganize processes can be done on separate partitions. Consider banking and order entry applications. New account transactions and new orders result in inserts and updates to current data. At the end of the day, the DBA team can execute backups and reorgs of today's data, thus providing good recoverability and good performance. Backups of partitions containing old data are not needed, and can be deferred.
Vertical Partitioning. In this method, the database administrator analyzes the column data in a table to determine if certain columns are updated or queried more or less frequently. The table data is then split into two tables, each with the original table's key. One table contains the frequently-used columns, the second table the remainder. This can greatly increase query performance, since SQL accessing frequently-used columns now accesses only one of the tables, which is now significantly smaller.
There are other criteria for splitting rows. These include:
- Security. Columns containing personal or sensitive data are segregated to a separate table having more strict access requirements.
- Performance. Columns containing numeric fields that are commonly summed or aggregated, such as account balances or on-hand quantities, can be segregated to a smaller table, making calculations faster.
- Recoverability. Some column data may be critical to the enterprise: think account numbers and balances displayed on an ATM. These could go to a separate table that would be backed up more often than other data such as addresses or telephone numbers.
Replication. One issue hampering data availability is when data is accessed by a large number of users from a large number of applications. If these applications' needs for data are diverse in volume and priority, contention may result. To reduce contention, the application designer can implement a data store separate from the operational data and have the required data copied at regular intervals. Many data warehouses are implemented in this fashion with nightly batch loads of the previous day's data. Applications may then query the data warehouse without contending with critical transactions.
In addition, since data warehouse data is static, the database designer can create additional indexes to support data availability and performance. Considerations of data recoverability and update performance mentioned earlier are not as important, since the data warehouse is not a mission-critical function.
A variation of the data warehouse is the datamart. This is a data store separated from the operational data similar to a data warehouse. The datamart is usually smaller, and contains data for only a single business subject area or for only a small number of applications. Many enterprises have a number of datamarts spread across several lines of business. Data is copied from the operational system at specific times determined by a "publish / subscribe" agreement. This means that datamarts subscribe to pre-determined subsets of data that are published by the main data store on a regular basis.
Data availability concerns are often overlooked in a big data environment. However, once performance benefits are realized, the IT staff is usually confronted with contention, locking, and data availability issues. Analyze your current big data load processes to ensure that there are no hot spots and that stale data is purged regularly. Consider directing some processes to the DB2 tables (or even to image copy data sets) rather than to the appliance if query speed is not critical. Finally, review DB2 database design to see if partitioning options or a different indexing scheme would allow the DB2 tables to be a viable data source during times of high usage.