Don't Just Do Something, Stand There! Avoiding Junior DBA Mistakes

Thursday Apr 15th 2004 by Jim Czuprynski

When an Oracle database suddenly becomes unavailable, the immediate reaction is to do something -- right now! -- to solve the problem, which sometimes makes the problem worse. This article discusses some examples of when it's really best to stop and think before proceeding.

Synopsis. When an Oracle database suddenly becomes unavailable, the immediate reaction is to do something -- right now! -- to solve the problem, which sometimes makes the problem worse. This article discusses some examples of when it's really best to stop and think before proceeding.

Military veterans I have spoken with have often described their combat experiences as days and weeks of monotony and boredom punctuated by minutes and seconds of sheer terror. While I have hardly ever been in physical danger as an Oracle DBA, I have certainly experienced some white-knuckled terror when a database has suddenly become unavailable - usually at a most inopportune time for my clients.

As Oracle DBAs, our immediate reaction is to do something - right now! - to solve the problem. Moreover, it usually does not help when your supervisor is standing behind you asking what's wrong, and how long will it be until the database is available again. While doing something immediately is certainly a noble goal, in many cases it is just the opposite of the proper action. Here are a few examples:

The Slow Reboot

I was called out of a meeting when a fellow IT employee reported that many of our users were having difficulty accessing our company's primary order entry application. Like any good DBA, I made a beeline for the server room to check on the status of the database server. What I found sent my heart racing: the database server had rebooted itself for some reason from a Windows 2000 "blue screen of death" (BSOD) error, and in fact was still rebooting when I had arrived on the scene.

Since I knew I had configured the Oracle database service and instance to restart itself automatically in these circumstances, I headed back to my desk to start checking on the status of the myriad applications that depend on the database. However, after a long five minutes or so - more than enough time for the server to reboot - we were still getting reports of inaccessibility from our user community.

Back to the server room! I immediately checked the database alert log and found an error message indicating that one of the disks storing the database's datafiles could not be located. Now my heart is really pounding, and I am already thinking about which Recovery Manager backups would have to be applied to restore the datafile, whether all the archived redo logs I would need to recover were present on the server... and then I recalled a similar incident when I had been testing the same "cold reboot" disaster recovery scenarios earlier in the year.

Therefore, I just shut down the database instance, restarted the service and attempted to restart the database. Almost magically, everything was fine. Oracle found the datafile on the expected drive, the database restarted just fine, and SMON performed a perfect instance recovery. The database was back online in a few moments, and my heartbeat returned to normal.

It turned out that during the server reboot, the Windows 2000 operating system had just fallen a bit behind in recognizing the "missing" drive, and when it was time to mount the database, Windows told Oracle the drive "wasn't there." Of course, by the time I performed a shutdown and restart, Windows had finished making the disk drive available, and voila! everything was once again copasetic. Our team eventually found the root cause of the BSOD with some help from Microsoft. We traced it down to a Windows driver conflict between the OS and SNMP-based software that is supposed to monitor the server for hardware failures, including (ironically) disk drive failures.

But what if I hadn't resisted my urge to do something right away to fix the problem? I could have made the situation much worse by performing a time-consuming restoration and recovery in the midst of our company's busiest operational period, and possibly corrupted a datafile that was not even damaged.

The Unredundant Redundant Power Supply

Our system administration team found out that our hardware supplier had shipped all media storage cabinets without redundant power supplies - including the four cabinets for our primary production database server. They therefore decided to install redundant power supplies for all our media cabinets. When they received the hardware, they began testing the hot-swap capabilities of the power supplies, and reported that we should be able to simply shut down our databases, swap in the redundant power supplies, and power them up - we did not even need to power down the media cabinets themselves. We scheduled our installation for late on a Friday evening to limit disruption to our customers.

The Friday arrived, the database was shut down and we began swapping in the new power supplies. As we powered up the last array, we noticed the cooling fan for that array's original power supply was running noisily - never a good sign - so the system administrator pulled that power supply to investigate. What we found knocked us off our feet: Three of the four retention bolts for the fan had been sheared away during normal operation! Thankfully, we had ordered spare cooling fans, too, so the system administrator ran to the spares room to get one.

Throughout the process, I had been watching the arrays intently for any signs of problems. While we were waiting for the spare fan to arrive, the fourth array suddenly shut itself down even though the "redundant" power supply was still operating. (We later found out this was a safety feature built into the media cabinet to prevent potential data loss.)

Since this media cabinet contained the mirrored drives for its companion above it, the OS suddenly got very upset, and began to claim that several disk drives were now unavailable to the server - drives that contained half of the database's datafiles. When the second power supply was reinstalled, the drives went into immediate RAID recovery mode, and it appeared that they would be available again for use within four hours.

This was in my early days as an Oracle DBA, so I had never experienced a potential failure like this before. My immediate concern was that the database would be in an inconsistent state or completely shredded, and my immediate reaction was to activate our Oracle 9i DataGuard standby server and pick up the pieces the next morning. However, my system administrator stopped me just in time and explained that the drives would be just fine once they had completed synchronizing themselves with their mirrored "twins." Since we had no pressing need for the database until Saturday morning, I decided to wait out the media recovery period.

Once again, everything was fine. The Windows OS recognized the drives, I was able to restart the database, and still get home in time for a late dinner. However, if I had gone with my original plan to activate the DataGuard standby database, I would have wasted an enormous amount of effort for what was a trivial issue in retrospect.

Why Aren't The Shared Servers Sharing?

My final example deals with what at first appeared to be a catastrophic database failure. Like most American companies, our transaction processing volumes vary throughout the month. Our peak usually occurs during the two days before and after the end of the month when our Accounting department completes its billing cycles for most customers.

Late in the afternoon in the midst of a recent month-end billing cycle, our production database's alert log began to spew out messages relating to failed shared servers, and response time for both OLTP and decision support applications began to degrade seriously. My immediate reaction was to check the hardware in the server room. I found no unexpected problems - all storage media was online, CPU utilization was moderately elevated, and memory utilization was relatively normal.

However, that did not explain the slowdown, so I re-examined the alert and user dump logs. I found that in only five minutes, there had been two deadlocks, both caused by the same pair of users. When we checked these two users' workstations, we found a fatal error message on one of their terminals - one of the users had left it on his screen and simply gone on break without informing anyone of the problem. It turned out that the application he was using had a fatal flaw that had gone unnoticed until then: It was allowing him to post payments for invoices that were still being generated by his teammate.

The end result of the two deadlocks? Oracle brought online several dozen UNDO segments to handle the rollback of the massive transactions that had failed. Since we use automatic UNDO management, Oracle handled this without DBA intervention. However, it took some time to bring all those UNDO segments online, and Oracle diverted almost all of the server's processing power for approximately two minutes to handle this.

So, what caused the shared server failures? We ascertained that our Sales department had just finished a major sales contest, and several dozen salespeople were rabidly attempting to run reports on their month-end sales figures. Unfortunately, I found out later that most of these reports had been written against "live" database tables - the same tables that were affected by the aforementioned deadlock. In addition, the reporting application that the salespeople were running used shared server connections to run these reports.

A final piece of the puzzle fell into place the next day. It turned out a large number of reports had been migrated from our order entry application (which uses dedicated database connections) to their rightful place in our decision-support application (which uses shared server connections), and that migration had happened just before the monthly billing period had commenced. We found out that many of the migrated reports had not been fully tested in shared server mode and contained horrendous query plans that used large hash joins to return result sets. Since Oracle Shared Server manages UGA memory differently than for dedicated connections and usually allocates relatively small memory areas for sorts and hash joins, the reports were taking an enormous time to complete.

In the midst of all this turmoil, one of my colleagues kept insisting that we should just shut down the database server to clear up all these issues and pick up the pieces later. I was able to convince him that while yes, response time is slower, and yes, some of our internal clients are affected, what we really needed to do was to monitor, measure, and selectively investigate for the root cause. Moreover, had I not done so, I would have never gotten the data I needed to forestall the next failure. I am happy to report the end result: We corrected the application logic error that caused the deadlocks, moved some of the reporting application users to dedicated connections instead of shared, and sharpened our measuring tools.

Lessons Learned

I have woven these tales of DBA combat not to frighten or impress anyone, but merely to illustrate what I have learned during my moments of sheer terror. Here is a quick summation:

Know your hardware's capabilities as intimately as possible. In my second example, I would have saved myself an enormous amount of anxiety if I had understood exactly how my hardware worked. If your shop allows it, I strongly recommend getting an understanding of the hardware and its capabilities. Since media failures are the most vexing of any data recovery effort, I tend to concentrate on the media storage side of the equation, but understanding all aspects are just as important to the survivability, recoverability, and restartability of the database server.

Review and test your disaster recovery plan. As often as possible, review your disaster recovery plan, especially when you have either just experienced a combat moment or whenever there have been significant changes to your database - structural, size, and so forth. (You don't have a disaster recovery plan? No problem. See my previous article on Becoming A Master Of Disaster for what you need to know about disaster recovery plans.) And don't forget that a disaster recovery plan that is untested is just a checklist.

Use "failures" to sharpen your failure detection tools - and trust those tools. One of my close friends is a private pilot, and he has been in some tough situations while flying - bad weather, engine problems, and even getting lost late at night. When I asked him what the most important thing to remember during situations, he smiled and said: "Trust your instruments." No matter what tool set you use to monitor your databases - home-grown SQL queries, Oracle Enterprise Manager, or even those expensive third-party packages - be sure that you understand what those "instruments" are telling you, especially during a system slowdown or an impending failure. Above all, resist the urge to pull back on the stick, even if you hear a passenger screaming - the airplane may not be heading for the ground after all!

» See All Articles by Columnist Jim Czuprynski

Mobile Site | Full Site