Oracle: Preventing Corruption Before it's Too Late - Part 2

Thursday Oct 30th 2003 by Marin Komadina
Share:

Marin Komadina discusses Oracle data-block corruption, describing several features of Oracle that may be used at an early stage to detect database corruption. Part 2 of this series explores Oracle soft and bug corruption.

Oracle Soft and Bug Corruption

One block is soft corrupted when a block format is different from the Oracle default block format. Oracle soft data corruption (logical, software) is usually detected while reading some data from the disk to the database buffer cache. In the buffer cache, Oracle kernel investigate block content, reading block information about type, incarnation, version, sequence number, checksum and data block address (DBA) depending of the database settings. The same way, whenever Oracle modifies a data block a health check is performed on the block to check it is fully consistent. Any errors found cause an internal error to be signalled.

Oracle by default will not dig deeply into block content; rather it just does a quick look in the block header. If the header does not conform to standard rules, and the block structure is not regular, then the block is considered corrupt. However, this does not always mean that the block on disk is truly corrupt. That fact needs to be confirmed.

Oracle error indicating soft corruption:

ORA-00600: internal error code, arguments: [3339], [RBA1], [RBA2], [], [], [], [], []

Where RBA1 is the block address reread from the block header and RBA2 is the real physical block address in the database. The Oracle database engine will make a block check for a every block read into the database buffer cache. If the block content is incorrect, an error message will be generated. This type of block Oracle will mark as soft corrupted, changing several bytes in the block header. Oracle will skip the soft corrupted blocks, regardless of readable information they contain. Let's look at several different situations:

ORA-00600: internal error code, arguments: [3339], [0], [15742], [], [], [], [], []

The above error occurs when the calculated DBA (real physical block location) and the block header read DBA do not mach. The reason for these differences can result from an operating system repair attempt after a system crash, or by faulty ASYNC I/O processing.

ORA-00600: internal error code, arguments: [3339], [12222222], [144665742], [], [], [], [], []

This error occurs when both addresses, read and calculated, contain some large numbers. Possible reasons are an incorrect entry in the block header (pointing to non-existent block) due to faulty memory modules, or the block is part of a large database file (greater than 2GB) and the block is written in the wrong place.

The message, "write blocks out of sequence" for files greater than 4.3 GB indicate this kind of corruption. Since Oracle supports only 2GB, the operating system has to translate the address and positioning blocks inside large files to the correct location.

ORA-00600: internal error code, arguments: [3339], [14237], [15742], [], [], [], [], []

In the above example, both block addresses are real; one from the block header and one calculated by the Oracle. The problem is that DBA from the block header has offset from the real, true address in the database. The reason for this could be that the operating system has had a failure, writing in the block header the address of the previous block that was last read into database memory.

ORA-00600: internal error code, arguments: [3339], [14237], [15742], [], [], [], [], []

In the above example, Oracle considered the block it was reading from disk to be soft corrupted since it had a different DBA address in the header than the one requested. The reason for this behaviour may be an extremely high stress load on the system causing the operating system to have a read failure, and thus retrieving the wrong block from the disk.

Errors in file /opt/oracle/admin/ALSY1/bdump/ckpt_5514_alsy1.trc:
ORA-01242: data file suffered media failure:

ORA-01122: database file 13 failed verification check 
ORA-01110: data file 13: '/u04/oradata/ALSY1/cwrepo01.dbf' 
ORA-01251: Unknown File Header Version read for file number 13 

If Oracle fails to verify block header content the operation will finish with an error message indicating an Unknown File Header Version. Again, this can be result of Oracle memory mishandling.

Bug Corruption

Due to the bugs in the Oracle code or perhaps because of imperfect behaviour between Oracle code and the underlying operating system, diverse block corruptions occur. Oracle is fixing bugs with every new version. Unfortunately, new ones appear. Here are just a few examples of Oracle bug corruption:

  • a corrupted database due the auto extended bug

    
    Corrupt block dba: 0x24024cd0 file=5. blocknum=150736.
    found during buffer read on disk type:0. ver.
                  dba:0x00000000 inc:0x00000000 seq:0x00000000 incseq:0x00000000
    Entire contents of block is zero - block never written
    Reread of block=24024cd0 file=9. blocknum=150736. found same corupted Data
    
  • Mishandled block information (in certain conditions), upon reading the database block, which indicates that a good block is corrupted. This is the bug in the Oracle version 8.1.x - 9.x, where the Oracle will raise the error:

    ORA-600 [kcoapl_blkchk][ABN][RFN][INT CODE] 

    Pointing to failure condition during block check. This only happens when block checking is enabled.

  • A problem with a faulty database trigger operation, causing data block corruptions

    
    ORA-01115: IO error reading block from file 6 (block # 14873)
    ORA-01110: data file 6: '/oracle/artist/artist01.dbf'
    ORA-27091: skgfqio: unable to queue I/O
    IBM AIX RISC System/6000 Error: 9: Bad file number
    
  • A bug in the operating system, making the system check, corrupting a good Oracle block and causing a database file to be offline.

    
    SQL> SELECT * FROM ARTIST_HISTORY; 
    ERROR:ORA-01115: IO error reading block from file 12 (block # 2342) 
    ORA-01110: data file 12:'/oracle/artist/artist01.dbf' 
    ORA-27091: skgfqio: unable to queue I/O 
    OSD-04006: ReadFile() failure, unable to read from file 
    O/S-Error: (OS 23) Data error (cyclic redundancy check) 
    
    

Database Recommendations

As stated earlier, corruption typically hits an Oracle database block in the memory or on the hard disk. Memory corruptions are handled with the Oracle kernel and operating system detection mechanism. This is not always a perfect solution; sometimes, good blocks are marked as corrupt. Disk corruptions are the most often seen type of corruption. Once written to the disk, the block is not checked again until it is requested. Oracle will follow block delivery until the block reaches operating system. After that, the operating system will take responsibility for block handling. However, the operating system might not make an "intelligent" block check, since it has no knowledge about the Oracle block structure.

Oracle provides several possibilities to defend against block corruptions:

a.)   Disk protection

DB_BLOCK_CHECKSUM=TRUE

Oracle's main method for detecting Oracle block corruption is with the block checksum. After the block modification, the header bytes checksum is calculated by the database writer. On the next reread of the same bock, the block is validated. In addition to regular data and rollback blocks, the database also manages the redo log blocks. The redo log blocks hold undo information, which is necessary for transaction recovery. The Archive process (ARCH), as a part of Oracle's background process structure, will calculate checksum for the redo log block and compare it with the existing checksum information from the redo block header before making an entry into archive log file.

The checksum information represents the logical block structure. If a block has a standard Oracle block format, then the block is good. The process of checksum validation does not read the actual block content, therefore, there is still the possibility that data inside the block is corrupt.

Setting this parameter prevents a block from being written that does not confirm to the standard Oracle block format. If the block corruption is due to a change of format, that kind of the corruption will be discovered as a media error. The disadvantage of setting this parameter is that some blocks can be retry-able with the good data portion, but due to the header corruption, Oracle will mark the block as soft corrupted. Once a checksum is generated for a block, the block always uses checksums, even if the parameter is later removed.

DB_BLOCK_CHECKING=TRUE 

Setting this instance parameter instructs Oracle to make a detailed logical block check, (not only header), before making any block changes. This block self-consistent checking prevents memory and disk corruptions. By default, the Oracle kernel always executes a detailed logical check for blocks that belong to the system tablespace. If corruption inside of the block content is detected, media error ORA-00600 [6599] or some variation of the media error, depending of the action completed on the block, will result. This condition will suspend database activity until the problem has been resolved. In addition, all data in the corrupted block is lost. There has been much discussion regarding how much detailed block checking will slow down the system. From my experience, the overhead was never more than 3%, even under a heavy load. Oracle says that we can expect from 1% to 10% overhead. Again, it all depends how much insert/delete activity you have on your Oracle database.

b.)   Memory protection

 _DB_BLOCK_CACHE_PROTECT=TRUE

Database block memory protection is enabled by leaving DB_BLOCK_CACHE_PROTECT on default settings (TRUE) and having the DB_BLOCK_CHECKSUM parameter enabled.

Oracle will use a special database algorithm to protect all modified blocks in the database memory. Any uncontrolled database writes inside the database memory area will be detected, protecting the disk from getting in memory corrupted blocks. After every successful disk read, the database block is checked in the database memory. The checksum information from the block header is compared with the memory checksum calculated on the fly. When they do not match, Oracle raises an error indicating a soft corrupted block, which will be reformatted and reused again.

Although in memory corruption detection is helpful, any in the memory corruption will crash the database instance. This parameter is hidden and by default always on.

c.)   Additional parameters

_CHECK_BLOCK_AFTER_CHECKSUM - perform block check after checksum, by default set ON, and activated only when DB_BLOCK_CHECKSUM=TRUE

_DB_ALWAYS_CHECK_SYSTEM_TS - always perform block check and checksum for the system tablespace, by default set to ON

Oracle Hardware Assisted Resilient Data (HARD)

Once the data block leaves Oracle's protected memory area, control is handed to the underlying operating system. Information stored in the Oracle data block will be checked again, but only on the next read. There is no guarantee that data was not changed before actually being written to the disk. Usually, this should not be a problem, since the operating system will use the underlying control mechanisms for data block checking. However, incomplete writes and corrupted blocks still, from time to time, occur on the disk device. For this reason, storage vendors began to provide Oracle validation and checking algorithms at the storage level. Can you imagine one EMC, or XP storage box, which, after writing to disk, reads the Oracle stored block and then does the Oracle checksum check? This whole initiative is called HARD, with the main target to physical protect data. Storage software products, certified according to the HARD standard are able to make:

  • validation of checksum and selected block fields for the datafile blocks
  • validation of checksum and selected block fields for the redo log blocks
  • validation of checksum and selected block fields for the controlfile blocks
  • validation of single block write

With this methodology, it is possible to follow the Oracle block all the way to the physical disk, allowing us to prevent corruptions before they happen.

Conclusion

Corruption cannot be eliminated, but we can make every effort to discover the problem early. Having advanced system configurations increases the chance of having a block corruption on one of interconnected system layers. In addition, there are many more possible block corruption situations, with even more variations of the same errors, than were presented in this article.

» See All Articles by Columnist Marin Komadina

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved