We are finally getting back to where we should be: designing databases right the first time. There is, however, a small problem: What do we do with current implementations? How do we backfill quality into them?
In the context of database design, what does quality mean?
- Defect prevention during design
- Defect detection and remediation during construction and testing
- Data verification as close to the source as possible
- Stability, reliability, data accessibility, systems scalability
The ideals of quality assurance for IT systems delivery have weakened to something more akin to "mediocrity prevention". Quality is sacrificed so that critical IT functions can be rolled out on time and within budget. This means a shift in focus from development to support.
Many more errors will now be detected after system implementation, thus putting more pressure on database administration support staff. The cost of addressing these errors may not be charged back to systems development, leading to invalid cost estimates during database design.
The result: there will be a major shift in the workload of DBAs. They will no longer have time for software installation and customization, performance monitoring and tuning, data backup and recovery, and other tasks that contribute to quality. Instead, they will spend more and more time fixing errors. Errors that should have been prevented, detected, or corrected during database design.
How Does This Affect the DBA?
The lack of quality in recently delivered systems affects those in technical support positions the most. They are the ones who must deal with system problems in real time. This drives up the cost of problem determination and problem resolution. It also manifests itself in product behaviors that may annoy or even drive away customers.
The most common problems encountered by the DBA after an implementation are either performance-related or contention-related. That is, database processes (such as SQL processing) run too slowly, or one process contends with another for access to data.
Typical fixes by the DBA in this reactive situation include changing or adding indexes, reorganizing tablespaces and indexes, and changing tablespace partitioning or clustering to mitigate contention. Many of these could have been engineered into the database design prior to implementation.
How Do We Introduce Quality?
There are several reactive methods for introducing quality into currently existing databases. Initial steps should concentrate on the following.
First, DBAs should coordinate and collect lists of frequent problems and questions that they encounter. Management can then categorize and analyze these problems, perhaps noting trends. For example, frequent complaints about lack of authority to execute certain functions may indicate an underlying security issue. Frequent errors reporting that a database is unavailable may point to network, space allocation, or server issues.
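As a rough illustration of this first step, the sketch below tallies a collected problem list by category so that the most frequent complaints stand out. The log entries and category names are invented for the example; a real list would come from whatever the DBAs have been recording.

```python
from collections import Counter

# Hypothetical problem log collected by the DBAs: (category, description).
# The categories and descriptions are illustrative assumptions only.
problem_log = [
    ("authorization", "user cannot execute stored procedure"),
    ("availability", "database not reachable"),
    ("authorization", "insufficient privilege on table"),
    ("performance", "query runs too slowly"),
    ("authorization", "cannot bind package"),
    ("availability", "tablespace out of space"),
]


def summarize(log):
    """Return (category, count) pairs, most frequent category first."""
    counts = Counter(category for category, _ in log)
    return counts.most_common()


summary = summarize(problem_log)
for category, count in summary:
    print(f"{category}: {count}")
```

Even a simple count like this gives management something concrete to analyze: a category that dominates the list is a candidate for a root-cause review.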
Next, expand on this by ensuring you are using a good problem tracking system. This helps the support staff focus on tasks and priorities, while providing management with hard data on resource usage. How much time are the support staff spending fixing application issues when they could (or should) be focusing on systems performance, server maintenance, or network connectivity?
Next, use your tools to report the support costs to the development areas. This is not to be used (at least initially) as a chargeback mechanism; instead, you are making support costs more visible to management. IT management must be made aware that fixing application problems after implementation is more costly than fixing them in either the testing or design phases.
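One possible way to make these costs visible is sketched below. It assumes the problem tracking system can export tickets as (application, category, hours) records; the record layout, the category names, and the hourly rate are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical export from a problem tracking system:
# (application, problem category, hours spent by support staff).
tickets = [
    ("billing", "application", 6.0),
    ("billing", "infrastructure", 1.5),
    ("orders", "application", 3.0),
    ("orders", "application", 2.5),
]

HOURLY_RATE = 80.0  # assumed loaded cost per support hour


def support_cost_by_application(records, rate):
    """Total the cost of application-related fixes for each application."""
    hours = defaultdict(float)
    for application, category, spent in records:
        # Only application problems are reported back to development;
        # infrastructure work remains ordinary support workload.
        if category == "application":
            hours[application] += spent
    return {application: total * rate for application, total in hours.items()}


costs = support_cost_by_application(tickets, HOURLY_RATE)
for application, cost in sorted(costs.items()):
    print(f"{application}: {cost:.2f}")
```

The output is a visibility report, not an invoice: it shows each development area what its post-implementation fixes are costing support.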
Reengineering Legacy Systems
Given a legacy system, how does one approach reengineering it to improve quality?
Systems engineering, including most software development methodologies, concentrates on error prevention. This is done by using quality metrics and processes during the early phases. In some cases, quality is such a high priority that the systems are designed in part to be self-correcting. Systems exhibiting such behavior are termed autonomic.
In our case, we already have a system implemented. Here, correcting errors is mostly reactive, and assumes that you can actually detect errors when they happen (or soon thereafter). Since we are unable to re-design the system, we must concentrate on increasing the quality of error detection and reporting.
In one sense, the advent of database management systems and communications systems such as DB2, CICS, and MQSeries has made error detection much easier. Before these systems arrived, the most typical error was one of total application failure, accompanied by a memory dump.
Most of these error situations were fatal (physical I/O errors, dataset full, division by zero, and so forth). However, now even so-called fatal errors are reported back to the application in the form of error codes. For example, a DB2 application encountering a physical I/O error receives a specific SQLCODE and additional status information.
So, increasing the quality of error detection and reporting is relatively straightforward in these cases. Based on your previous compilations of frequently encountered errors and problems:
- Document the most common error codes
- Document the additional status information returned to the application
- Determine how (and to whom) such errors should be reported, and what information is most useful
- Use this as a basis for either updating or designing one or more standard error processing modules
- Embed these modules in your legacy applications.
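The steps above might be sketched as a small standard error module like the one below. This is a minimal illustration, not a definitive design: the catalog holds only a few well-known DB2 SQLCODEs with paraphrased descriptions, and the contact names, the `build_report` function, and the program and status values are hypothetical. Verify each code and its meaning against your own product documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorInfo:
    description: str  # paraphrased meaning of the code
    notify: str       # who should be told (assumed contact groups)


# Illustrative catalog of common error codes. Descriptions are paraphrased
# and the contacts are assumptions; build yours from your own problem lists.
ERROR_CATALOG = {
    -911: ErrorInfo("deadlock or timeout; unit of work rolled back",
                    "application support"),
    -904: ErrorInfo("resource unavailable", "DBA on call"),
    -551: ErrorInfo("authorization failure", "security administration"),
    -803: ErrorInfo("duplicate key on insert or update",
                    "application support"),
}

UNKNOWN = ErrorInfo("undocumented error code", "DBA on call")


def build_report(sqlcode, status, program):
    """Combine the code, the extra status information, and the catalog entry."""
    info = ERROR_CATALOG.get(sqlcode, UNKNOWN)
    return {
        "program": program,
        "sqlcode": sqlcode,
        "status": status,
        "description": info.description,
        "notify": info.notify,
    }


# Hypothetical call from a legacy program after a failed SQL statement.
report = build_report(-904, "resource name: PAYROLL.TS01", "PAYRUN01")
print(report)
```

Keeping the catalog as data rather than logic means that documenting a newly encountered code is a one-line change, and every application that embeds the module picks it up.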
Too simple, you say? Too costly to implement? Compare this cost against the time and resources spent doing problem determination and resolution. Emphasize to your management that this approach uses information you already possess, deals directly with the most common problems, and has the potential to detect future problems as they occur, thus speeding up problem resolution.
With better error detection and correction processes in place, the next step is the expansion of the standard error processing modules to include errors that you have not encountered yet. The complete list of error codes exists in the product documentation. A few hours spent researching possible error codes will result in an expanded list of errors that applications can detect and report.
Too much work, you say? Consider the delays you currently experience: A user encounters a symptom, which must be reported to someone who determines the underlying problem, who must then contact someone to fix the problem. Depending upon how unspecific the symptom is, problem determination may take several hours. For example, what if the customer gets a message "application unavailable"? What does that mean? Who should be told? What do you fix?
Contrast this with a standard error module intercepting a specific error code. For example, the standard error module receives a DB2 code indicating that a database is full. Based upon criteria in the module, it may send an e-mail or page to support staff, or even display a message on a support console. Errors are detected faster, they contain sufficient information to define the problem, and they are routed to someone responsible for a fix.
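The routing described here could look something like the following sketch. The notification functions only append to a list so the example runs on its own; in practice they would call your e-mail, paging, or console facilities. The routing table is an assumed example, and SQLCODE -904 (resource unavailable) stands in for whatever code your system returns when a database is full.

```python
# Stand-ins for real e-mail, pager, and console integrations (assumptions).
sent = []


def send_email(message):
    sent.append(("email", message))


def send_page(message):
    sent.append(("page", message))


def show_on_console(message):
    sent.append(("console", message))


# Hypothetical routing criteria: which channels to use for which code.
ROUTES = {
    -904: [send_page, show_on_console],  # urgent: page support staff
    -911: [send_email],                  # less urgent: e-mail is enough
}
DEFAULT_ROUTE = [show_on_console]


def handle_error(sqlcode, detail):
    """Build a specific message and deliver it on every configured channel."""
    message = f"SQLCODE {sqlcode}: {detail}"
    for channel in ROUTES.get(sqlcode, DEFAULT_ROUTE):
        channel(message)
    return message


handle_error(-904, "table space PAYROLL.TS01 unavailable")
print(sent)
```

The message that reaches support names the code and the affected resource, which is what replaces the vague "application unavailable" symptom.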
Now that's quality.
Postscript: The Future
IT shops are now returning to a process they left long ago: systems analysis and design using software development methodologies and tools that incorporate quality metrics and processes. Given the changes in tools, personnel, and priorities, how can you ensure that you are developing quality systems?
I recommend implementing best practices based on the Capability Maturity Model. This is a method for developing, implementing, and supporting systems based on a process of continuous improvement.
Carnegie Mellon Capability Maturity Model
IBM TechDocs library: Information on autonomics -- "A First Look at Solution Installation for Autonomic Computing", IBM document SG24-7099, available at the
IBM Quality management solutions
American Productivity and Quality Center
American Society for Quality