Database backup copies may not be enough to ensure data recoverability. The DBA must "bake" recoverability into the database design process; for existing systems, the DBA also needs to know whether current backup processes support recovery procedures that will meet application recovery objectives.
The Laws of Database Administration
In order of importance, the laws are recoverability, availability, security, and performance. The easiest way to explain the laws is in the negative sense; that is, what you shouldn't do. In this form, the laws can be stated as "Thou shalt not cause thy …"
- Data to become unrecoverable
- Data to become unavailable
- Data to become insecure
- System to perform poorly.
The first three laws deal with data management, the fourth with system management. These encapsulate the four most important responsibilities of the DBA.
The Order of Importance
Some might quibble with the order of the laws. For example, one manager told me: "In our shop, performance is of highest importance. We have service level agreements that we must meet in order to meet our customers' needs. Performance is more important than backup and recovery."
Performance is certainly an urgent concern, but recoverability remains the most important. To determine what's really important, consider this scenario:
You are a DBA supporting IT systems for a provider of medical and surgical services. Your systems are used daily by doctors, nurses, and technicians to provide services to patients. In some cases, these services (e.g., diagnosis) may involve life-or-death decisions. Your supervisor asks you to give a presentation to upper management (including the CIO, vice president of IT, and major stockholders or owners) of new enhancements your department will make. Your presentation begins as follows:
"Ladies and gentlemen, the DBA team will be implementing some high-performance features in the near future. We have two implementation plans and we'd like your assistance in choosing between the two.
"Plan A will involve performance changes that will result in up to 95 percent of our online transactions finishing in their required service levels." You hear some grumbling from the audience, as they realize this means 5 percent of the online transactions will perform poorly. You continue.
"Plan B will involve performance changes resulting in 100 percent of our online transactions finishing in their required service levels." You hear sighs of relief from your audience, as they are clearly more comfortable with this plan.
"However," you continue, "if we have a major hardware outage, there's a good possibility that up to 5 percent of our data will be missing or invalid."
Your audience now sits in stunned silence. They realize there could be a power outage or other disaster that affects the IT systems. Should this occur, when the systems come back up, every user will know there's a 5 percent chance that test results are missing, diagnoses are incorrect, or that some patient's records will have completely disappeared.
Which plan will your audience adopt? Which was more important to them: recoverability or performance?
In the remainder of this article I will concentrate on the first law: data recoverability.
Law #1: Data Recoverability
Ensure data recoverability. If there's one thing to get right, this is it. While other things (such as performance or security) may seem more urgent, ensuring data recoverability is the database administrator's most important responsibility.
Consider a project to implement a new database to support a critical production application. If a disaster occurs, will the data be available within the agreed-upon Recovery Time Objective (RTO)? If not, in the case of medical and financial data, this may breach contracts with vendors or violate audit guidelines.
Data recoverability is also a major consideration in some legislation and regulatory guidance. Here are some examples that affect financial institutions:
- Expedited Funds Availability (EFA) Act, 1989 requires federally chartered financial institutions to have a demonstrable business continuity plan to ensure prompt availability of funds.
- Federal Financial Institutions Examination Council (FFIEC) Handbook 2003-2004 (Chapter 10) specifies that directors and managers are accountable for organizationwide contingency planning and for "timely resumption of operations in the event of a disaster."
- Basel II, Basel Committee on Banking Supervision, Sound Practices for the Management and Supervision of Operational Risk, 2003 requires that banks establish business continuity and disaster recovery plans to ensure continuous operation and limit losses.
Most IT shops use regularly scheduled standard backup procedures (e.g., DB2 image copies), but few have actually tested the recovery time of these objects and analyzed whether their backup procedures are sufficient (or necessary) for their recovery requirements. One common mistake is forgetting that tablespace recovery using an image copy must also include rebuilds of indexes (which take additional time).
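To see why the index rebuild matters, consider a back-of-the-envelope estimate. The sketch below is illustrative only: the function name, the phase timings, and the 60-minute RTO are all assumptions made up for this example. Real numbers must come from timed recovery tests in your own shop.

```python
def estimate_recovery_minutes(restore_min, log_apply_min, index_rebuild_min):
    """Total elapsed time when the recovery phases run one after another."""
    return restore_min + log_apply_min + index_rebuild_min

# Hypothetical timings for one tablespace (assumed, not measured).
restore_min = 35        # restore from the most recent full image copy
log_apply_min = 15      # apply log records written since that copy
index_rebuild_min = 25  # rebuild the indexes on the recovered tables

rto_min = 60            # assumed Recovery Time Objective, in minutes

without_indexes = estimate_recovery_minutes(restore_min, log_apply_min, 0)
with_indexes = estimate_recovery_minutes(restore_min, log_apply_min, index_rebuild_min)

print(f"Ignoring index rebuilds:  {without_indexes} min, meets RTO: {without_indexes <= rto_min}")
print(f"Including index rebuilds: {with_indexes} min, meets RTO: {with_indexes <= rto_min}")
```

With these assumed figures, the estimate that ignores index rebuilds appears to fit the RTO, while the complete estimate does not.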
The simplest way for the DBA to proceed is to ensure that recovery requirements are documented during the requirements phase of systems design. For existing systems, document the recovery requirements either immediately, during the next phase of maintenance or enhancements, or during the next audit. These documents can then be used to drive one or more internal projects to measure application data recoverability, compare against recovery objectives, and implement improvements.
Best practices for recoverability start with recovery, not with backups. The DBA should not simply implement daily or weekly backups (image copies) as a standard; instead, list possible recovery methods and options, their costs, and their speeds. This can then be used during design or enhancement projects to ensure recovery requirements are met.
Some of these methods include:
- Full image copies
- Incremental image copies (with optional change accumulation to full copies)
- Image copies of indexes (especially large indexes)
- Image copies using the ShrLevel Change option ("fuzzy copies")
- Data replication
- Hot or cold standby copies
- Disk mirroring
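One way to keep such a list usable during design reviews is to record each option with its estimated recovery time and relative cost, then select the cheapest option that meets a given RTO. The following is a minimal sketch; the class name, the option descriptions, and every number in it are hypothetical placeholders, not benchmarks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecoveryOption:
    name: str
    est_recovery_min: int  # assumed elapsed recovery time, in minutes
    relative_cost: int     # assumed cost rank, 1 = cheapest

# Illustrative figures only; replace with measurements from your own tests.
OPTIONS = [
    RecoveryOption("Full image copy plus log apply", 90, 1),
    RecoveryOption("Incremental copies merged to full", 60, 2),
    RecoveryOption("Data replication to recovery site", 15, 3),
    RecoveryOption("Disk mirroring", 5, 4),
]

def cheapest_option_meeting(rto_min, options=OPTIONS):
    """Return the lowest-cost option whose estimated time fits the RTO, or None."""
    candidates = [o for o in options if o.est_recovery_min <= rto_min]
    return min(candidates, key=lambda o: o.relative_cost, default=None)

choice = cheapest_option_meeting(30)
print(choice.name if choice else "No listed option meets the RTO")
```

The value of the exercise is less the selection logic than the table itself: it forces the recovery times and costs to be written down and kept current.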
DBAs should ensure they have all of the following:
- A regularly scheduled process for determining (and documenting) the recovery status of all production objects;
- Regular measurements of required recovery times for objects belonging to critical applications;
- Development of alternative methods of backup and recovery for special situations (such as image copy of indexes, data replication to recovery site, and DASD mirroring);
- Regular development, improvement, and review of data recoverability metrics.
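The first item in this list, a scheduled recovery status check, can be sketched in a few lines. The example below assumes the date of the last full image copy for each object has already been extracted (in DB2 for z/OS that history is recorded in the SYSIBM.SYSCOPY catalog table); here it is a hard-coded dictionary with invented object names, and the seven-day limit is an arbitrary assumption.

```python
from datetime import datetime, timedelta

def stale_objects(last_copy_by_object, now, max_age=timedelta(days=7)):
    """Return names of objects never copied, or whose last copy is older than max_age."""
    stale = []
    for name, last_copy in sorted(last_copy_by_object.items()):
        if last_copy is None or now - last_copy > max_age:
            stale.append(name)
    return stale

# Hypothetical inventory: object name -> timestamp of last full image copy.
now = datetime(2024, 1, 15, 6, 0)
inventory = {
    "PAYROLL.TSEMP": datetime(2024, 1, 14, 23, 0),   # copied last night
    "PAYROLL.TSHIST": datetime(2024, 1, 2, 23, 0),   # copy is 12+ days old
    "CLAIMS.TSNEW": None,                            # never copied
}

flagged = stale_objects(inventory, now)
print(flagged)
```

Run on a schedule, a report like this documents recovery status and highlights new objects that were never added to the backup jobs.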
Automation of the Recovery Process
In general, DBAs should automate reactive or simple reporting processes, freeing themselves for higher-level work. Your first reaction might be, "Wait! I'll automate myself out of a job!" Far from it. Implementing automation makes the DBA more valuable. IT management wants its knowledge workers doing tasks that add value. These might include detailed systems performance tuning, quality control, cost/benefit reviews of potential new applications and projects, and more. Management understands that a DBA spending time on trivial tasks represents a net loss of productivity.
The advantage of automation isn't merely speed; automating tasks helps move the DBA away from reactive tasks such as reporting and analysis toward more proactive functions.
Here's a typical list of processes many DBAs still manually perform:
- Executing an EXPLAIN process for SQL access path analysis
- Generating performance reports such as System Management Facility (SMF) accounting and statistics reports
- Verifying that new tables have columns with names and attributes that follow standard conventions and are compatible with the enterprise data model and data dictionary
- Verifying that access to production data is properly controlled through the correct authority GRANTs
- Monitoring application thread activity for deadlocks and timeouts
- Reviewing console logs and DB2 address space logs for error messages or potential issues.
Each of these tasks can be replaced by an automated reporting or data-gathering process of some kind. With such processes in place, DBAs can schedule data gathering and report generation for later analysis, or guide requestors to the appropriate screens, reports, or jobs. This removes the DBA from the "reactive rut" and frees time for proactive tasks such as projects, architecture, planning, systems tuning, and recovery planning.
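As an example of the last two manual tasks, here is a minimal sketch of a log scan that counts deadlock and timeout messages. It assumes the log has been exported as plain text lines, and it assumes DSNT375I and DSNT376I are the deadlock and timeout message identifiers; verify these against the messages reference for your DB2 version. The sample log lines are invented.

```python
from collections import Counter

# Assumed message identifiers to watch for, mapped to a reporting label.
WATCH_LIST = {
    "DSNT375I": "deadlock",
    "DSNT376I": "timeout",
}

def summarize(log_lines):
    """Count how many log lines contain each watched message identifier."""
    counts = Counter()
    for line in log_lines:
        for msg_id, label in WATCH_LIST.items():
            if msg_id in line:
                counts[label] += 1
    return counts

# Invented sample lines standing in for an exported address space log.
sample_log = [
    "10.15.02 DSNT375I PLAN=PAYPLAN IS DEADLOCKED",
    "10.15.09 DSNT376I PLAN=CLMPLAN IS TIMED OUT",
    "10.16.40 DSNT375I PLAN=PAYPLAN IS DEADLOCKED",
    "10.17.00 SAMPLE INFORMATIONAL MESSAGE",
]

summary = summarize(sample_log)
print(dict(summary))
```

Scheduled daily, a summary like this replaces manual log review with a short report the DBA reads only when the counts change.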
Along with choosing specific tasks to automate, you'll probably need to learn one or more automation tools or languages. REXX is an example of a popular language for online or batch access to DB2 data. There are many examples and ideas for automated processes in articles, presentations, and white papers.
Autonomics in the Recovery Process
As our IT organizations have matured, we've become smarter about our problems. We began to collect problem logs, and analyzed them looking for trends and patterns. We began to recognize frequent problems and devised strategies for automatically dealing with them or preventing them.
We've now reached the next logical step in this progression: engineering processes and process control to make systems and applications self-aware and self-healing. This is called autonomics. Autonomics, ranging from simple scripts to complicated processes, can be applied to applications, systems, or support software. For many DBAs, the idea of a self-healing database inspires visions of the database redesigning itself.
What exactly would a self-healing database heal? One DB2 for z/OS example that comes to mind is real-time statistics. DB2 dynamically generates these statistics as data changes. One use is before REORG utility execution, where they can be queried and the results used to decide whether or not to run the reorg.
These and other examples of DB2 autonomics make it possible to "program in" a manner of self-tuning (or at least self-management) into the DBA's support infrastructure.
The next logical step is implementing autonomics into the recovery process. The DBA can use real-time statistics to decide when to take image copies, or what kind of copies (full or incremental).
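The copy decision described here can be sketched as a simple threshold rule. The function below is a hypothetical illustration, not a DB2 feature: it takes the number of pages updated since the last copy and the number of active pages (in DB2 for z/OS, counters of this kind appear in SYSIBM.SYSTABLESPACESTATS as COPYUPDATEDPAGES and NACTIVE), and the 1 percent and 10 percent thresholds are assumptions to be tuned for each shop.

```python
def choose_copy_type(updated_pages, active_pages,
                     incremental_pct=1.0, full_pct=10.0):
    """Pick 'none', 'incremental', or 'full' from the percent of pages changed."""
    if active_pages <= 0:
        # No active pages reported: nothing to base a decision on.
        return "none"
    changed_pct = 100.0 * updated_pages / active_pages
    if changed_pct >= full_pct:
        return "full"
    if changed_pct >= incremental_pct:
        return "incremental"
    return "none"

# Hypothetical readings for three tablespaces of 1,000 active pages each.
for updated in (5, 50, 200):
    print(updated, "updated pages ->", choose_copy_type(updated, 1000))
```

A scheduled job could apply a rule like this to every production tablespace and generate only the copy jobs that are actually needed, rather than copying everything on a fixed calendar.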
In implementing best practices for database recovery, the DBA needs to move away from reactive tasks, initiate quality measures, and offload basic, repetitive tasks. I hope that you can use this material as you work to become more productive. Remember to quantify and document your results, and then advertise your value.