Your big data application needs regular extracts from your production systems. While many best practices exist for big data extract, transform and load (ETL) processes, we sometimes forget that these data-intensive procedures can affect the operational environment’s performance.
Big Data Application Resource Usage
Today’s big data applications are scaling up and out. This involves adding more CPU power, more memory, and more system resources. IT staff are also upgrading the hybrid hardware and software appliances used for big data information storage and execution of business analytics.
As big data applications grow, they also evolve in their implementation of operational data extracts. What was once a simple, usually daily extract of flat files and databases has morphed into high-speed, high-volume data movement that includes unstructured data such as large objects, XML data streams and images.
The result? Operational systems performance has become a constraint on big data extracts. We will present a few tips for monitoring and tuning operational systems in order to keep your big data application happy. Ultimately, the goal is to reduce the elapsed time of the data transfer process.
Before You Begin - Disaster Recovery Considerations
As big data applications matured they became more and more valuable. Today, it is common for many big data analytical reports to be executed regularly in order to provide customer support staff with valuable data for making business decisions. Some of these decisions involve mission-critical systems such as raw materials ordering and delivery, product ordering and shipping, and customer service. This led naturally to including the big data application in disaster recovery (DR) scenarios.
One common option is to duplicate the big data appliance in the DR environment, so that big data applications and their data remain available after a major disaster. This requires expanded data storage and network capacity, as well as a review of hardware and software license requirements. As you review your operational system for changes required to improve performance, don’t forget to include the DR environment in your calculations. You don’t want to be in a situation where operational systems perform acceptably in the standard production environment but run slowly (or not at all!) during a disaster.
Review of the Operational Environment
You should schedule a regular performance review of your operational environment that looks specifically at changes that are or will be required due to current and future big data applications. Some givens are the following:
- More records will appear in current database and file extracts as your production systems naturally grow due to an expanding customer base, more products and services offered, and larger transaction volumes;
- Specific extract files will increase in width as more fields are added. These include additional fields selected for extract, new fields added to production files and databases, and new production applications;
- The time window for running extracts will be constant or may decrease. This may happen because of expanded hours of operation, implementation of off-hours applications (ATMs, kiosks, web sites, etc.), or an increase in the number of extracts running concurrently.
Some of these constraints will be mitigated by increases in the performance of storage and network hardware. Disk storage arrays are now sold with greater memory caches and additional features (such as parallelism) that result in higher speed storage and retrieval. Your performance review should take these things into account.
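The growth factors listed above can be turned into a rough projection of when an extract will no longer fit its time window. The sketch below is illustrative only: the function names, growth rates, row counts, and throughput figure are all hypothetical assumptions, and it assumes elapsed time scales linearly with the number of bytes moved.

```python
def projected_extract_minutes(rows, bytes_per_row, throughput_mb_per_s,
                              row_growth, width_growth, years):
    """Estimate elapsed minutes for one extract after `years` of compound growth.

    Assumes elapsed time is proportional to bytes moved (a simplification).
    """
    future_rows = rows * (1 + row_growth) ** years
    future_width = bytes_per_row * (1 + width_growth) ** years
    total_mb = future_rows * future_width / 1_000_000
    return total_mb / throughput_mb_per_s / 60


def years_until_window_exceeded(window_minutes, max_years=10, **extract):
    """Return the first year in which the projected run exceeds the window.

    Returns None if the extract still fits after `max_years`.
    """
    for year in range(max_years + 1):
        if projected_extract_minutes(years=year, **extract) > window_minutes:
            return year
    return None


# Hypothetical extract: 200 million rows of 500 bytes, read at 50 MB/s,
# with rows growing 20% and record width growing 5% per year.
extract = dict(rows=200_000_000, bytes_per_row=500, throughput_mb_per_s=50,
               row_growth=0.20, width_growth=0.05)

print(round(projected_extract_minutes(years=0, **extract), 1), "minutes today")
print("window exceeded in year:", years_until_window_exceeded(60, **extract))
```

Rerunning the projection with a higher throughput figure gives a quick feel for how much a storage or network upgrade would extend the life of the current window.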
New mainframe components typically bring additional hardware costs and software license charges, which must be weighed against the performance benefits. The IT organization will need to re-budget and perhaps even re-negotiate current licenses and lease agreements. As always, new hardware comes with its own power, footprint, and maintenance requirements.
One aspect of performance review that is sometimes forgotten is best practices. These include documentation of enterprise data in a data model, as well as standard file and database documentation.
Operational Systems Performance Review
Your performance review will concentrate on data volumes and data movement, the so-called speeds and feeds. Every process and job in your ETL applications needs to be timed and reviewed.
Begin by documenting all extract files and processes. They may be varied, and include some or all of the following:
- Jobs that are part of the operational system (e.g. jobs that produce extract files);
- Jobs that are part of the big data application (e.g. a file transfer protocol (FTP) job that is initiated in the big data application to pull data from the source system);
- Jobs that process source data in a separate environment prior to transfer (for example a stand-alone system that pulls data from multiple operational systems, consolidates the information, and then passes the results to your big data application).
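One way to keep such an inventory is to record each job's location and elapsed time every time it runs. The following is a minimal sketch, not a replacement for your scheduler's own statistics; the `JobTiming` class, the `timed_job` helper, and the job names are hypothetical, and the `sleep` calls merely stand in for real work.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class JobTiming:
    name: str
    location: str                      # e.g. "operational", "big data", "staging"
    runs: list = field(default_factory=list)   # elapsed seconds, one per run

    @property
    def average_seconds(self):
        return sum(self.runs) / len(self.runs) if self.runs else 0.0


inventory = {}   # job name -> JobTiming


@contextmanager
def timed_job(name, location):
    """Record the wall-clock time of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        inventory.setdefault(name, JobTiming(name, location)).runs.append(elapsed)


def slowest_jobs(n=3):
    """Return the n jobs with the highest average elapsed time."""
    ranked = sorted(inventory.values(),
                    key=lambda job: job.average_seconds, reverse=True)
    return ranked[:n]


# Hypothetical jobs, one in each of two locations.
with timed_job("customer_extract", "operational"):
    time.sleep(0.05)
with timed_job("ftp_pull", "big data"):
    time.sleep(0.01)

for job in slowest_jobs():
    print(f"{job.name:20s} {job.location:12s} {job.average_seconds:.3f}s")
```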
To reduce the total elapsed time of these applications it may be necessary to execute more of them in parallel. Alternatively, one can convert what is currently a single large extract into two or more smaller extract jobs that run simultaneously. Another possibility is to move the transform logic into the extract itself. For example, instead of one job that extracts from a production file and a second job that cleans the data prior to loading into the big data appliance, combine the two into a single job that passes through the data only once.
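The two ideas, splitting one large extract into smaller concurrent ones and cleaning records during the extract pass, can be sketched together. This is an illustration under stated assumptions: the in-memory list stands in for a production file, the `clean` rule is invented, and a thread pool is used on the assumption that real extracts spend most of their time waiting on I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(record):
    """Hypothetical transform: trim and upper-case the customer code."""
    return {"id": record["id"], "code": record["code"].strip().upper()}


def extract_partition(source, start, stop):
    """Extract and clean one slice of the source in a single pass."""
    return [clean(record) for record in source[start:stop]]


def parallel_extract(source, ranges, max_workers=4):
    """Run one extract per (start, stop) range concurrently.

    Results are concatenated in the order the ranges were given.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extract_partition, source, start, stop)
                   for start, stop in ranges]
        results = []
        for future in futures:
            results.extend(future.result())
    return results


# Stand-in for a production file: ten records with untidy codes.
production_file = [{"id": i, "code": f"  c{i} "} for i in range(10)]

# One large extract converted into two smaller ones that run together.
cleaned = parallel_extract(production_file, [(0, 5), (5, 10)])
print(cleaned[:2])
```

Whether the split actually shortens elapsed time depends on where the bottleneck is; if both smaller jobs contend for the same disk or network path, the gain may be small.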
Big Data Load Performance
When reviewing the big data environment, consider alternatives for data storage. Some of these are:
- Database-only storage. Store operational data in a datamart or in the enterprise data warehouse. This environment is already a familiar one for most business analysts, and it should already be fairly well integrated into your big data application.
- Appliance-only storage. Store operational data only in the big data appliance. This saves database storage capacity, and provides high-speed data access. The only disadvantage is that the data can only be JOINed with other data in the appliance (that is, it cannot be JOINed to data in the data warehouse).
- High-speed appliance load. The latest version of the IBM DB2 Analytics Accelerator (IDAA) appliance has the ability to load data directly from an external system. This avoids the necessity of creating a mainframe file of loadable data and then executing a separate utility for the data load. Instead, data can be sourced from an external system and loaded directly into the appliance.
Another performance consideration in an operational environment is the set of business data users. These consist of regular reporting processes, operational data processing, user and customer transactions, business analytics, and system status and diagnostic tools. Make an effort in your analysis to identify categories of transactions that can be offloaded to non-production environments. These include:
- Staff running diagnostic queries or tests. Ensure that there is a test or development environment with production-like data where these users can operate. These might include business analytics users who are attempting to form structured query language (SQL) queries for current or future analyses, as well as technical users attempting to diagnose problems.
- Sophisticated or data-heavy reports. Determine if any long-running reports can be deferred to a time period that will minimize resource consumption. As a part of this effort, consider that many software license charges are based on average CPU usage over multi-hour periods. Coordinate with your hardware staff to ensure that rescheduling CPU-intensive reports does not adversely affect costs.
- Datamart loading. Many large enterprises designate one or more hardware environments as stand-alone datamarts. These marts are loaded regularly with production data extracts. Review the usage of these datamarts, and determine if they can be loaded at different times. Use this information to schedule the loads so that CPU and network resources are not overworked.
- Enterprise data warehouse (DW) integration. Most big data applications are integrated with one or more data warehouses, because the DW contains many of the dimension and fact tables used by business analytics to explore big data. Since DW ETL processes commonly take up a large part of the nightly batch cycle, consider these jobs in your analysis as potential targets for performance tuning. Some typical ideas are adding indexes to speed data access, implementing high-performance load processing, and introducing data and process parallelism.
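Because license charges may depend on average CPU usage over multi-hour periods, it helps to check the effect of a proposed reschedule on the peak rolling average before making the change. The sketch below assumes a four-hour window and an invented 24-hour usage profile; both are assumptions, so substitute the window and measurements that apply to your own license terms.

```python
def peak_rolling_average(hourly_usage, window_hours=4):
    """Return the highest average over any `window_hours` consecutive hours."""
    if len(hourly_usage) < window_hours:
        raise ValueError("need at least one full window of samples")
    window_sum = sum(hourly_usage[:window_hours])
    peak = window_sum
    for i in range(window_hours, len(hourly_usage)):
        # Slide the window forward by one hour.
        window_sum += hourly_usage[i] - hourly_usage[i - window_hours]
        peak = max(peak, window_sum)
    return peak / window_hours


def move_load(hourly_usage, from_hour, to_hour, amount):
    """Return a copy of the profile with `amount` of usage moved between hours."""
    moved = list(hourly_usage)
    moved[from_hour] -= amount
    moved[to_hour] += amount
    return moved


# Hypothetical profile: quiet night, busy day, moderate evening.
current = [30] * 8 + [70] * 10 + [40] * 6
current[10] = 90     # a CPU-heavy report adds 20 units at 10:00

proposed = move_load(current, from_hour=10, to_hour=2, amount=20)

print("peak before:", peak_rolling_average(current))
print("peak after: ", peak_rolling_average(proposed))
```

The same check guards against the opposite mistake: moving a report or datamart load into a window that is already busy can raise the peak rather than lower it.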
The performance of mission-critical operational systems can make or break your company. Since big data applications usually grow far faster than standard operational databases, plan on reviewing your big data loading processes regularly. This review should include the effects your big data extract and load jobs have on resource usage system-wide. As the number and elapsed times of these jobs grow, resource consumption may begin to affect operational systems performance. With this in mind, track resource usage (CPU, memory, disk storage, network data movement) over time and plan any capacity increases or performance tuning as needed.
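Tracking usage over time becomes more useful when the trend is extrapolated to a planning threshold. As a minimal sketch, the code below fits a straight line to equally spaced samples and estimates how many periods remain before the line reaches a limit. The sample values and the 80 percent threshold are hypothetical, and a linear fit is itself an assumption that real workloads may not follow.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for samples taken at x = 0, 1, 2, ..."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(samples, threshold):
    """Periods after the last sample until the fitted line reaches `threshold`.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line is already at or above the threshold.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(samples) - 1))


# Hypothetical monthly peak CPU utilization, in percent.
monthly_cpu = [50, 52, 54, 56, 58, 60]
print("months until 80% utilization:", periods_until(monthly_cpu, 80))
```

The same function can be applied separately to memory, disk storage, and network data movement, so that the resource likely to run out first is identified.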