When we were implementing RMAN backups for our databases, we found very few articles or real-world accounts of how someone survived an actual disaster. This article walks through a real disaster we recently had in our company and how we survived it using RMAN backups: the disaster scenario itself, how we recovered our databases, and the lessons we learned along the way.
Disaster Scenario: We have over 100 Oracle databases, ranging from version 8.1.7 to 9.2.0, spread across 45 HP-UX boxes. The Oracle software and the database files on all of these boxes resided on a SAN configured as 7+1 RAID. We lost eight physical drives in the SAN, which turned out to hold 104 logical volumes. Our first step was to find out which boxes and databases were affected.
Understanding and Assessing the Disaster: The first symptom we saw was the load on the UNIX servers climbing very rapidly; top showed loads around 40. As DBAs, we had always relied on the alert log to pick up block errors when the application accessed a corrupt data block, but for some reason the alert logs were not reporting any errors. We started getting calls that the application was hitting block corruption errors, and that is when we realized some databases were affected. The only way to determine which databases were hit was to run dbverify against all of the database files. dbverify is a CPU-intensive process, and it ran for quite a while checking the files. In addition, we rebooted our whole SAN infrastructure to make sure we caught all the disk errors. After running dbverify, we found that six databases needed recovery, as we had lost one full file system on the box. There were also quite a few boxes where we had lost the Oracle software, and we decided to restore it from our backups.
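A sweep like ours can be scripted as a simple loop over each database's files. This is a sketch only: the paths and the 8K block size are examples (BLOCKSIZE must match the database's db_block_size), and FILE, BLOCKSIZE, and LOGFILE are dbverify's standard parameters.

```shell
# Run dbverify against every datafile of one database (paths are examples).
for f in /u02/oradata/PROD/*.dbf
do
  dbv FILE=$f BLOCKSIZE=8192 LOGFILE=/tmp/dbv_`basename $f`.log
done

# dbverify reports a "Total Pages Marked Corrupt" count per file;
# scan the logs to see which datafiles need recovery.
grep "Marked Corrupt" /tmp/dbv_*.log
```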
RMAN Backups: We run a weekly full RMAN backup, either hot or cold, for all of our databases. We use Tivoli Storage Manager as our enterprise backup solution, and the RMAN backups go directly to tape through Tivoli Data Protection for Oracle. Each RMAN backup is identified by its own tag name. The Oracle software was backed up as a regular TSM backup.
Recovery using RMAN backups: We had to work through several different types of recovery to bring back these six databases.
Control File and Data File Recovery: The first database had lost three index datafiles and all of its control files. We restored the control file from the latest backup, then restored the particular datafiles that had been reported as corrupt. The last step was to recover the database. The RECOVER DATABASE command checked for archive logs on disk and, when they were not available, automatically restored the ones it needed. We had developed our own recovery scripts and had made sure to test them against different scenarios, so all we did was run those scripts, sit back, and monitor the log files.
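The scripts for this scenario boil down to an RMAN session along these lines. The connect strings, channel type, and datafile numbers are placeholders, not our actual values:

```sql
# rman target / catalog rman/rman@rcat

startup nomount;
run {
  allocate channel t1 type 'sbt_tape';
  restore controlfile;                  # from the latest backup
  sql 'alter database mount';
  restore datafile 11, 12, 13;          # the three corrupt index datafiles
  recover database;                     # pulls back any missing archive logs
  sql 'alter database open resetlogs';  # required after a restored controlfile
}
```

Opening with RESETLOGS is what makes the post-restore full backup (mentioned in the conclusion) so important.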
Log Based Recovery: The second database had lost one datafile. We thought it would be nice to try a time-based recovery, but for some reason it would not pick up the backup we wanted. We therefore ended up doing a log sequence based recovery: we took the last log sequence number from the alert log before the database became corrupt and specified it in the recovery script. It worked like a charm and we were able to recover the database.
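A sequence-based recovery is driven by SET UNTIL SEQUENCE. One detail worth noting: RMAN stops recovery just before the sequence you name, so to apply the last good log you specify one past it. The numbers below are examples only:

```sql
# Last good sequence in the alert log was 4711, so recover until 4712.
startup mount;
run {
  allocate channel t1 type 'sbt_tape';
  set until sequence 4712 thread 1;
  restore database;
  recover database;
  sql 'alter database open resetlogs';  # incomplete recovery ends in resetlogs
}
```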
Tag Name Based Recovery: The third scenario was an interesting one. The application needed the database restored from the backup prior to our latest one. We thought the easiest option was a time-based recovery, so we gave a time immediately after that backup had completed; for some reason, however, RMAN would restore only the latest backup. The next option was to recover until a log sequence number, but that did not work either: there had been so few log switches between the two backups that the restore would pick up either the backup before or the backup after the one we needed. The only remaining option was to restore using the tag name. Because we use a standard tag name for all of our backups, this took us to the exact backup we wanted.
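Restoring by tag is a FROM TAG clause on the RESTORE command. The tag name here is illustrative; ours followed our naming standard:

```sql
startup mount;
run {
  allocate channel t1 type 'sbt_tape';
  restore database from tag 'FULL_HOT_PROD_WK10';  # hypothetical tag name
  # Without a SET UNTIL clause, RECOVER rolls forward through all available
  # logs; add one if recovery should stop near the time of that backup.
  recover database;
  sql 'alter database open resetlogs';
}
```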
Full Database Recovery: The fourth scenario was a full restore of a 40GB database. Our initial approach was to divide the work into three restores run in parallel — first the system datafiles, second the data datafiles, and third the index datafiles — hoping the restore would complete quickly. After these restores had been running for three hours, we realized that each restore reads through every backup piece to find a particular datafile. We had about seven backup pieces, each 2GB in size, so while one session examined a backup piece, the other sessions sat in a media wait because they needed that same piece. We ended up canceling the three sessions and simply restoring the full database in a single operation, this time increasing the number of channels from two to four. With these changes, we were able to restore the 40GB database and recover it by applying the archive logs in one hour.
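The fix amounts to letting one RESTORE DATABASE command own the parallelism: RMAN spreads the backup pieces across the allocated channels itself, instead of separate sessions contending for the same piece. A sketch with four tape channels (channel type as used with TDP for Oracle):

```sql
startup mount;
run {
  allocate channel t1 type 'sbt_tape';
  allocate channel t2 type 'sbt_tape';
  allocate channel t3 type 'sbt_tape';
  allocate channel t4 type 'sbt_tape';
  restore database;    # one command; RMAN parallelizes across channels
  recover database;    # apply the archive logs
}
```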
Previous Incarnation Recovery: The fifth recovery scenario was another challenge. After restoring and opening the database with RESETLOGS, we realized that it still contained corrupt datafiles. The solution was to go back to a previous incarnation and restore the database from there. We followed the documentation, reset RMAN to the previous incarnation, and started the restore. To our surprise, it worked without any issues: RMAN went to the previous incarnation, restored the datafiles, and recovered the database.
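In outline, the steps look like this. The incarnation key is taken from the LIST output, so the value 2 below is purely illustrative:

```sql
list incarnation of database;      # note the key of the prior incarnation
reset database to incarnation 2;   # key from the LIST output

startup mount;
run {
  allocate channel t1 type 'sbt_tape';
  # A SET UNTIL clause is typically used here so recovery stops
  # before the RESETLOGS point of the abandoned incarnation.
  restore database;
  recover database;
  sql 'alter database open resetlogs';
}
```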
Cataloging Archive Logs: Last but not least, we ran into a scenario where the RMAN catalog knew nothing about archive logs created after the RMAN backup had run. We had to manually catalog all of the archive logs created after the backup. Since only a few archive logs had been created, we were able to catalog them quickly and proceed with the recovery of the database.
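Manual cataloging uses RMAN's CATALOG ARCHIVELOG command, one file at a time on these releases. The paths are examples:

```sql
# Register each on-disk archive log created after the backup ran.
catalog archivelog '/u03/arch/PROD/arch_1_4705.arc';
catalog archivelog '/u03/arch/PROD/arch_1_4706.arc';
catalog archivelog '/u03/arch/PROD/arch_1_4707.arc';
```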
Lessons Learned: Run a periodic resync of the target database with the catalog database, preferably twice a day, so that all newly created archive logs are tracked in the catalog. This applies to databases that are in archivelog mode and generate a large number of archives. Time-based recovery did not work as well as it should: it always restored the latest control file backup instead of the control file that was backed up around the requested time. The next lesson concerned Tivoli Data Protection for Oracle. The version we were using, TDP 2.2.1, would not restore backup pieces that were 2GB in size; the solution was to upgrade to TDP 5.2. We were lucky to have tested that upgrade one week prior to the disaster.
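The resync itself is a single RMAN command; scheduling it twice a day is a matter of a cron job or similar. The connect strings and command-file path here are placeholders:

```sql
# resync.rcv -- run twice daily against each target, e.g. from cron:
#   rman target / catalog rman/rman@rcat cmdfile /dba/rman/resync.rcv
resync catalog;
```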
We Wished We Had: OEM can be used as a GUI tool to restore RMAN backups from disk. However, if the backups go to tape, as in our case, there is no GUI tool available to restore the database; everything has to run through scripts or command-line commands. The second item on the wish list concerns the dbverify program. We had to run dbverify against all six databases on the box to determine which datafiles had been affected. It is a very CPU-intensive process, and we wished it had a parallel option.
Conclusion: We always make sure we have regular backups of our databases, but the real success of those backups depends on how effectively we can restore from them. Develop recovery scripts, and make sure to test and practice the different recovery scenarios; we even documented a restore cookbook covering all of them. After a new incarnation of a database is created, make sure to run a full backup of the database. Also run a full export once a database is recovered: the export reads through all of the database blocks and is a good way to check whether any corruption remains.
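The post-recovery export check can be as simple as the following. The username/password and paths are placeholders; writing the dump to /dev/null forces a read of every table block without keeping the file:

```shell
exp system/manager full=y file=/dev/null log=/tmp/exp_check.log

# Any ORA- or EXP- errors in the log point at remaining corruption.
grep -E "ORA-|EXP-" /tmp/exp_check.log
```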
It was a very tough decision for us to let go of our regular operating system backups and switch to RMAN. Now that we have survived a disaster, we are confident that we made the right decision. Our hats off to RMAN and the way it worked just as it is supposed to.