If a disaster happened to your database server, could you recover from it? This
article discusses the questions a DBA can ask to probe the realm of what we all
hate to think about - the possibility of the untimely demise of a production
database server. Learn how to become a fearless 'master of disaster'.
A few Saturdays ago, I performed a planned viability test of my Oracle 9iR2 hot
standby database. I terminated the transmission of archived redo logs from the
primary site, activated the standby database, and compared results between
primary and standby sites. As expected, row counts, dollar totals, and a few
other measures matched up perfectly. Satisfied, I kicked off the process to
copy RMAN backups from the primary site in preparation for restoring the
standby site to its standby role, and went home until Monday, leaving the newly
activated database running over the weekend.
When I arrived early on Monday to run a few more tests of our applications against the
standby site, I was surprised to discover the instance had crashed. After
investigation, I found out that all of the drives had failed on one of
the standby site's two disk drive arrays. Since that array held drives that
contained datafiles for the system rollback segments, the rollback segment tablespace
was corrupted almost immediately. Further investigation revealed that the disk
array had failed because the array had only one power supply, even though a
second redundant power supply module could have been installed.
Although this was a rather unexpected and reasonably unlikely failure, it could not have
come at a better time. It caused me to review our entire disaster recovery plan
for both the secondary and primary servers. I found out that none of the
production servers had been outfitted with redundant power supplies for the
disk arrays. And some further reevaluation of my disaster recovery scenarios
proved that the loss of one of the arrays would have caused the loss of UNDO
segments on the production database - because it turns out those datafiles
weren't mirrored properly either.
The primary and standby sites are all repaired now, of course, and everything is
copasetic. However, my cautionary tale underlines how a robust disaster
recovery plan can be critical in preventing and surviving a potential disaster.
Develop disaster recovery scenarios.
A good disaster recovery planner isn't afraid to "think
about the unthinkable." This entails developing the common disaster recovery
scenarios that could happen to your database and server.
Based on my
experiences over the past several years as an Oracle DBA, the most serious of
these is media failure. A typical example of preventable media failure
involves under-utilization of RAID-0+1 or RAID-1 redundancy for critical data
files, log groups, and control files. Moreover, as I described in my earlier
tale of woe, it is a good idea to remember those pesky and often-overlooked
UNDO or rollback segments - it may be impossible to restart the database when
those tablespaces are damaged or corrupted due to media failure.
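One way to hunt for this kind of exposure is to list, for each critical file, which disk arrays hold a copy of it, and then ask what disappears if a single array goes dark. The following is a minimal Python sketch under stated assumptions: the file names, the array names, and the INVENTORY structure are hypothetical illustrations, and in practice the list would be built from your own storage layout.

```python
# Hypothetical inventory: each critical item and the disk arrays holding a copy.
# An item mirrored or multiplexed across two arrays survives the loss of either.
INVENTORY = {
    "system01.dbf": {"array1", "array2"},
    "undotbs01.dbf": {"array2"},            # not mirrored to a second array
    "users01.dbf": {"array1"},              # not mirrored to a second array
    "redo log group 1": {"array1", "array2"},
    "control file": {"array1", "array2"},
}


def lost_if_array_fails(inventory, failed_array):
    """Return the items whose only copies live on the failed array."""
    return sorted(
        name
        for name, arrays in inventory.items()
        if arrays == {failed_array}
    )


def single_points_of_failure(inventory):
    """Map each array to the items that would be lost along with it."""
    all_arrays = set().union(*inventory.values())
    report = {}
    for array in sorted(all_arrays):
        lost = lost_if_array_fails(inventory, array)
        if lost:
            report[array] = lost
    return report


if __name__ == "__main__":
    for array, items in single_points_of_failure(INVENTORY).items():
        print(f"{array}: {', '.join(items)}")
```

With the sample inventory, the report flags the UNDO datafile on array2 and the data file on array1, which is the sort of gap that stays invisible until an entire array fails at once.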
Another set of disaster recovery scenarios with serious implications involves the partial
or complete loss of the database server itself. This might include damage
to the software needed to run the Oracle instance - for example, the loss of
critical operating system files - as well as physical damage, such as a failed
power supply, memory, or CPU module.
These disasters can be more difficult to predict, and can be even harder to test,
since realistically a "test to destruction" of the hardware might have to be
performed to simulate some of the failures. However, even with robust modern
service agreements available from major hardware suppliers, it could be hours
or even days before the damaged server is repaired and ready to take the load
of a production database again, so these scenarios should not be ignored.
Now that we have uncovered potential single points of failure and have painted some grim
pictures as to what might happen if those failures occurred, it's time to turn
attention to the methods, practices, and hardware configurations that help
prevent a disaster.
If you are using Oracle's Data Guard facilities to create and maintain
either a logical or physical "hot standby" server site, then you've already got
this angle covered. However, if you do not have an alternate server to which
you could quickly restore your production database, the ability to recover from
a serious hardware disaster will be much more in doubt.
A robust alternative to a standby site is a quality-assurance (QA) database
server. This server should ideally be a close match to the hardware for the
production site to allow evaluation of the next set of application or database
changes about to be released to production. On one occasion before getting our
hot standby server in working order, I was forced to transfer our production
database over to our QA site because we had noticed some "flaky" performance of
the production server. As it turned out, we had guessed right - the production
server's motherboard was facing an imminent failure, and failed shortly after
the transfer of responsibility. Though the QA server had only half the memory
and CPU power of the production site, having a QA server in my "back pocket"
saved the day.
I won't harp on the obvious: The ability to access recent,
consistent backup files and archived redo logs is the key to recovering from
and surviving a disaster. Of course, I am assuming your production database
is running in ARCHIVELOG mode for maximum flexibility for recovery. Moreover,
if the database is running in ARCHIVELOG mode, I am assuming that Recovery
Manager (RMAN) is being used for creating backups and recovery.
As in most
shops, we have designed our production backup scheme to run overnight during
off-peak hours. We have the luxury of a relatively small production database
(330GB) at about 20% utilization, so nightly Incremental Level 0 RMAN backups
only consume about 45-50 GB of disk space, and they are completed in
approximately 4 hours. Still, these nightly backups give me extreme flexibility
in rolling forward from a potential disaster, including incomplete point-in-time
recovery from the RMAN backups.
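The arithmetic behind those figures is worth redoing with your own numbers. The short Python sketch below uses the values quoted above (330 GB, about 20% utilization, 45-50 GB per backup, roughly 4 hours). The function names and the 500 GB backup volume are hypothetical, and the sketch assumes that "20% utilization" means the share of allocated space that actually holds data.

```python
def used_space_gb(allocated_gb, utilization_pct):
    """Space actually holding data, given allocated size and percent used."""
    return allocated_gb * utilization_pct / 100


def backup_rate_gb_per_hour(backup_gb, hours):
    """Average backup throughput over the backup window."""
    return backup_gb / hours


def backups_that_fit(volume_gb, backup_gb):
    """How many complete backup sets fit on a backup volume."""
    return int(volume_gb // backup_gb)


if __name__ == "__main__":
    print(used_space_gb(330, 20))          # data held in the 330 GB database
    print(backup_rate_gb_per_hour(45, 4))  # low end of the quoted backup size
    print(backup_rate_gb_per_hour(50, 4))  # high end of the quoted backup size
    print(backups_that_fit(500, 50))       # hypothetical 500 GB backup volume
```

Plugging in a longer backup window or a larger database shows quickly whether a nightly level 0 backup remains practical or whether a different schedule is needed.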
My Oracle University teacher repeated this in class over and over again: "RMAN
is the best way to back up a database - but it's not the only way." I thank him
every day for the reminder! Even though our RMAN backups are created nightly,
as a second line of defense against data loss I create a full set of exports
every night. If I should need to recover just one table, or a portion of the
table, it is a lot easier to recover it from an export than from a full tablespace
backup. In addition, if a disaster does arise, and my backups are damaged as
well, I have a chance of recovering at least some of the data from an export.
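A nightly export of this kind is usually driven by a small parameter file. The Python sketch below builds one with a date-stamped dump file name so that each night's export is kept separately. FULL, FILE, and LOG are parameters of the original Export utility (exp); the directory, the database name, and the function name are hypothetical, and this is not necessarily how the exports described above are scripted.

```python
from datetime import date


def export_parfile(dump_dir, db_name, run_date):
    """Build the text of a parameter file for a full-database export."""
    stamp = run_date.strftime("%Y%m%d")
    base = f"{dump_dir}/{db_name}_full_{stamp}"
    lines = [
        "FULL=Y",            # export the entire database
        f"FILE={base}.dmp",  # one dump file per night
        f"LOG={base}.log",   # keep the export log beside the dump
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical directory and database name.
    print(export_parfile("/backup/exp", "prod", date.today()), end="")
```

The generated file would be handed to the export utility through its PARFILE parameter. Date-stamped names make it easy to find the export that matches the night a table was last known to be good.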
Alternate media storage of backup files.
While writing to disk media is probably the speediest and
easiest mechanism for backup files retention, in many cases the disk space
required is a luxury. Even though I do have the advantage of sufficient disk
space, I have worked out a scheme of alternate media backups (tape) as
a third line of defense against loss of the database server. In an absolute worst-case
scenario - complete loss of the physical hardware - I still have a guaranteed
method to recover a significant portion of my production database, albeit
limited by the most recent available set of archived redo logs on tape.
A word about alternate media backups: Offsite storage is strongly recommended for
at least some of the backup tapes. We currently send a complete set of backups
off to a remote site once a week for vaulted archival with guaranteed
turnaround of one hour for any particular tape (for a small fee, of course).
If you are having a hard time imagining why you'd ever need offsite storage for backups,
here's a classic Oracle "urban legend" I heard at a recent seminar. A panicked
DBA called Oracle for help because his production server had been destroyed
when a truck backed up through his company's loading dock, which was on the
other side of the server room. Part of the collapsed wall crashed down directly
on top of the production server, destroying it. The DBA had an alternate server
available, and had been backing up his database to tape.
Unfortunately, the backup tapes were stored - you guessed it - on top of the production server.
Test the disaster recovery plan.
Once all the disaster recovery pieces I have discussed
previously are in place, I have found it is important to determine if the
disaster recovery plan will work by actually simulating at least the most
critical disaster scenarios.
After my experiences a few Saturdays ago, I reviewed all the media failure
possibilities, including the loss of one or more datafiles containing SYSTEM,
UNDO/rollback, index, and data segments. Then I constructed scenarios under
which they might fail, and my expected course of action. Finally, I constructed
methods to simulate the failure.
To simulate media failures of the various segment types, for example, I configured a RAID-0
drive on one of our development servers and then restored copies of a test
database so that the appropriate datafiles were installed on that drive. While
our QA manager simulated activity against that datafile by running application
code that accessed that datafile's tablespace, I simply pulled that drive out
of the disk array. I compared the actual results of the simulated failure
against my expectations, and then attempted to restore and recover the damaged datafile
using appropriate RMAN scripts.
I ran into
some unexpected challenges with my initial attempts at RMAN recovery scripts,
since some of the commands to rename and switch datafiles during restoration
are slightly different from those used when restoring from "hot" or "cold"
backups of datafiles and tablespaces. However, I have considered the lessons I
learned during the evaluations of these scenarios to be invaluable, since I now
have working examples of RMAN scripts for each specific scenario.
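For readers who have not yet met the rename-and-switch commands, the Python sketch below renders a generic RMAN RUN block that restores one datafile to a new location. It is not one of the scripts described above. SET NEWNAME, RESTORE, SWITCH, and RECOVER are standard RMAN commands; the datafile number, the target path, and the function name are hypothetical. The sketch assumes a non-SYSTEM datafile that can be taken offline while the database stays open.

```python
def rman_restore_to_new_location(file_no, new_path):
    """Render an RMAN RUN block that restores one datafile to a new path."""
    lines = [
        "RUN {",
        # Take the damaged datafile offline so the rest of the database stays up.
        f"  SQL 'ALTER DATABASE DATAFILE {file_no} OFFLINE';",
        # Tell RMAN to restore to a healthy drive instead of the original path.
        f"  SET NEWNAME FOR DATAFILE {file_no} TO '{new_path}';",
        f"  RESTORE DATAFILE {file_no};",
        # Point the control file at the restored copy.
        f"  SWITCH DATAFILE {file_no};",
        # Apply incremental backups and archived redo to bring the file current.
        f"  RECOVER DATAFILE {file_no};",
        f"  SQL 'ALTER DATABASE DATAFILE {file_no} ONLINE';",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical datafile number and target path.
    print(rman_restore_to_new_location(7, "/u03/oradata/test/users01.dbf"), end="")
```

The order matters: SET NEWNAME must come before RESTORE, and SWITCH must come before RECOVER, so that recovery is applied to the restored copy. Rehearsing a script like this on a test database is the practice the scenarios above are meant to encourage.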
I am now fully confident that in the worst-case scenarios of a partial or
complete media failure of my production databases, I can easily restore and
recover the appropriate datafiles from an RMAN backup set - something I do not
ever want to have to do under the gun with one hand on the manual and one hand
on the keyboard!
is an Oracle DBA for a telecommunications company in Schaumburg, IL. He can be
contacted at firstname.lastname@example.org.