Both application designers and database administrators sometimes take the simplistic view that regular backups of application data are sufficient for any recovery needs. The strategy of weekend backups can easily backfire! Backup methods that meet the application’s and enterprise’s needs start with a sound recovery strategy. Further, this strategy must be applied from the beginning, starting with database design.
Database Design for Recovery
The experienced database designer knows that different database designs may have remarkably different recovery times. In order to make a wise choice, the designer must know two things:
- The recovery time objective (RTO) -- During a recovery scenario, how long can the application data (or portions of the data) be unavailable?
- The recovery point objective (RPO) -- During a recovery scenario, to what point must data be recovered? To a specific date/time? To the end of the most recently completed transaction?
Some examples should clarify. Consider a data warehouse application with a database containing several hundred gigabytes of data. Assume that management defines the RTO as “all data from the most recent month must be available within 4 hours.” Depending upon the hardware and software available, it may be impossible to recover the entire database in a few hours. However, if the database is partitioned by month (that is, each partition contains data for a single month), then recovery need be completed only for a single partition in order to satisfy the RTO.
Now consider another data warehouse application that uses a database containing a very large table. This table has twenty indexes, each created to increase query performance for various analytics executed against the data. During recovery, most database management systems would recovery the base table first, and then re-build each of the indexes from the table data. Such rebuilds typically involve reading the entire table, saving all index key data to a file, sorting the file in key order, then creating the index. For our example table with twenty indexes this may take quite some time, even if some of the rebuilds can be run simultaneously.
One alternative in this case would be to take backup copies of the indexes. This could be done by copying the base files. IBM’s DB2 has a COPY INDEX utility that could be used. Recovery of the table and all twenty indexes can now be accomplished with multiple file-to-file copy jobs run in parallel.
Typical database design decisions that affect potential recovery times are as follows:
- Table partitioning. Already mentioned above, data can be physically partitioned into multiple pieces, with different pieces having recovery requirements.
- Critical data separation. For mission-critical applications such as order entry or account maintenance some critical data may be split into exclusive tables. These tables would then have a higher recovery priority. Fields such as account number, account balance and order status might fall into the ‘quick recovery’ category. Other fields such as address or notes fields, might be recovered later.
- Data volatility. Some applications have very high rates of data input or data update. Data warehouses typically have a daily mass insert or mass load phase where data for the previous day from operational systems is loaded. Databases that support order entry applications may have a high rate of update activity, as orders are entered, filled and shipped.
- Capacity and performance planning. As databases grow in size they require more and more resources to support. A data warehouse database may be backed up quickly during its first year of operation; however, as more and more data is added more backup space is required, backups (and recoveries) take longer. Some databases with high volatility may require regular database reorgs to maintain performance.
Application Design for Recovery
It isn’t only the database design that needs to take into account recovery needs. Applications that access the data must also be designed with recovery in mind. Consider a database that supports a customer information application. If part of the database needs to be recovered, can the application still access the remaining data?
The considerations here revolve around application SQL that may lock data. If a poorly written SQL statement locks an entire table, then portions of it cannot be recovered for the duration of the transaction containing that SQL statement.
Here are some common recovery-related issues to take into account during application design:
- Locking. As mentioned above, applications should take care to lock only the data required. Multiple options exist for addressing this issue, including SQL parameters such as WITH UR (indicating that data locks are not to be taken).
- Transaction size. Application designers accessing data with a critical recovery time should keep transactions as short as possible. Only the minimum, necessary and sufficient data should be accessed and updated. Long-running transactions make recovery of the underlying data impossible until they complete. Even worse, a long-running transaction that fails, requires the database management system to back out all changes made by the transaction, thus lengthening the transaction further.
- Balancing performance and recoverability. One common SQL performance technique database designers use is to add indexes. Indexes make additional data access paths available, sometimes with impressive results. Regrettably, the more objects one adds to the database, the longer it may take to recover it.
Infrastructure Design for Recovery
Once the database and application are designed and implemented, the infrastructure team must support them. Such support includes performance monitoring and tuning, regular backups, and database reorgs. Recovery must be considered here as well. Support staff must ensure that jobs such as backups are run at times that support the RTO and RPO. If a database has a very short RTO (“recover all tables within 1 hour”), it may be necessary to run backups more frequently than once a week.
In addition, recovery testing is a necessity. The support team should test the ability of their recovery processes to execute successfully and within the required time.
One interesting consideration is the type or category of recovery scenario. Here are some things to consider:
- The rogue application. Consider a poorly-tested application that is implemented in production and proceeds to ruin much of the data in a database. Recovery now includes not only database recovery to a point in time prior to the application execution, but also some kind of ‘replay’ of valid transactions (assuming the bad application is fixed quickly).
- Disaster recovery. Most IT enterprises have documented and tested disaster recovery plans. These plans include recovery of data, usually some kind of global recovery done by storage management accompanied by local recoveries by database administrators.
- Changing requirements. Some applications graduate from being merely useful to extremely important or even critical. If this involves a major change in the data RTO, designers may need to scramble to meet the new needs.
Data Backup Options for Recovery
The database administrator has several backup methods that can be used to support recovery requirements. Each involves a different combination of backup media such as tape or disk, required resources such as tape drives or CPUs, and of course differing recovery elapsed times. Some of these options include:
- Database unloads. The most basic backup method, the DBA uses SQL or a utility process to unload table rows to files on disk or tape. The process is easy to create, use, and document. Recovery times are usually quite long since all rows must be re-inserted into the tables and indexes completely rebuilt.
- Database full image copies. The most common method. The DBA uses a utility process called COPY to make backup files. These files are typically in an internal format that can be quickly reloaded back into place. Recovery time is quick, although indexes must be rebuilt after a table recovery.
- Hot standby. In this method, every table in the database has a duplicate table that exists in a separate disk array. Activity against the original table is also directed to the standby. In case of failure of the primary table, applications can quickly switch to the secondary table.
- Disk mirroring. A hardware solution implemented outside of the DBMS, one frequently-used by large corporations. Disk arrays are defined in pairs, and any disk activity such as a database update is directed by the disk hardware to both arrays. Should the first array fail, the hardware can quickly switch activity to the second array. Expensive, but recovery time is very quick.
DBAs should ensure they have all of the following:
- A regularly scheduled process for determining (and documenting) the recovery status of all production objects
- Regular measurements of required recovery times for objects belonging to critical applications
- Development of alternative methods of backup and recovery for special situations (such as image copy of indexes, data replication to recovery site, and DASD mirroring)
- Regular development, improvement, and review of data recoverability metrics
Mission-critical applications must be supported by a robust and rapid recovery process. Devising this process requires designers to begin at the beginning: database design. Recovery requirements are an integral part of any application or database solution. Database designers and administrators can ensure that recovery time objectives are met by utilizing the appropriate combination of backup and recovery processes discussed here.
IBM -- Application Recovery Tool for IMS and DB2 Databases A Data Recovery Guide
IBM -- DB2 Recovery Expert for z/OS User Scenarios
IBM -- System z Mean Time to Recovery Best Practices
IBM -- High Availability and Disaster Recovery Options for DB2 on Linux, UNIX, and Windows
IBM -- Backup and Recovery I/O Related Performance Consideration