Fast-Start Failover in Oracle 11g Data Guard

Friday Nov 20th 2009 by Jim Czuprynski
Share:

Learn how to set up and control automatic failover using the Fast-Start Failover Observer, so that DBA intervention is no longer required during a disaster recovery scenario.

Synopsis. While Oracle 11g’s Data Guard definitely protects a database when the entire production site is lost via its failover capabilities, it’s still necessary for an Oracle DBA to intervene to complete the failover process. This article – the seventh in this ongoing series – shows how to set up and control automatic failover using the Fast-Start Failover Observer so that DBA intervention is no longer required during a disaster recovery scenario.

The prior article in this series demonstrated how to:

  • Manually initiate a failover operation when the primary database is no longer accessible
  • Reinstate a failed primary database by transforming it back into a physical standby database

This article will show how to:

  • Configure the primary database and one physical standby database for Fast-Start Failover (FSF)
  • Activate, monitor, and relocate the Fast-Start Failover Observer (FSFO)
  • Insure against the loss of a single FSFO via Enterprise Manager Grid Control

Fast-Start Failover: Concepts

Oracle introduced the Fast-Start Failover (FSF) feature set in Release 10gR2, but it’s been enhanced significantly in Oracle 11g to permit much finer-grained control over the conditions under which a FSF would be initiated. Simply put, FSF ensures that under the appropriate circumstances – some of which are mandatory, and some of which are optionally-configured - a failover to the chosen standby target database will occur without DBA intervention. The following occurrences will initiate FSF:

Table 7-1. Fast-Start Failover: Triggering Events

Event Class

Triggering Event

Connectivity Loss

Network connectivity is lost simultaneously between the primary database and:

  • the FSFO itself; and
  • the standby database designated as the FSF target

and the connectivity time lost exceeds the FSF threshold

Database Health Check Failure

A database health check detects any of the following optionally-configured failures:

  • Any datafile has gone offline due to a write error
  • A critical database object has dictionary corruption
  • A control file is permanently destroyed because a disk has failed
  • Log Writer (LGWR) cannot write to any member of a log group due to an I/O failure
  • Archiver (ARCn) cannot archive a redo log because the destination is full or unavailable

Instance Crash (Single Instance)

The primary database’s instance has crashed

Instances Crash (RAC)

All instances for a RAC primary database have crashed

Shutdown Abort on Primary

The primary database is shut down with the SHUTDOWN ABORT command

Fast-Start Failover detects one of these failover situations through the Fast-Start Failover Observer (FSFO). The FSFO leverages the Oracle Call Interface (OCI) architecture to decide when a failover is necessary, which physical standby database should be the target of the failover, and how long to wait until it declares a failover is absolutely necessary. The good news is that I have quite a bit of control over these three directives, all of which are completely configurable using the Data Guard Broker Manager utility (DGMGRL).

Fast-Start Failover: Installation. The FSFO application software is automatically loaded as part of the standard installation when an Oracle 11g database home is created. Either DGMGRL or Oracle EM Grid Control can be used to control the FSFO when a complete database home installation is present. Alternatively, FSFO may be installed by downloading the Oracle 11g Client installation software from otn.oracle.com and then installing just the Oracle Client Administrator on the desired server; however, it’s important to note that when it has been installed on a separate server, the FSFO can only be managed via the DGMGRL utility.

Fast-Start Failover: Basic Configuration. Since it’s certainly possible that more than one physical standby database could exist in a Data Guard configuration, the first thing that I’ll need to establish is which physical standby database should be paired with the primary database in case a fast-start failover is initiated. I’ll do that by setting a value for the FastStartFailoverTarget parameter via the DGMGRL utility. Note that I’ve chosen the primary database as the fast-start failover target for the selected physical standby database as well:

DGMGRL> EDIT DATABASE orcl_primary SET PROPERTY FastStartFailoverTarget = 'orcl_stdby1';

DGMGRL> EDIT DATABASE orcl_stdby1 SET PROPERTY FastStartFailoverTarget = 'orcl_primary';

Next, I’ll establish how long the Fast-Start Failover Observer should wait until it decides that the primary database is unreachable by setting a value of 180 seconds for the FastStartFailoverThreshold parameter:

EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = '180';

Now that the basic fast-start failover configuration is completed, I can confirm its status with the SHOW FAST_START FAILOVER command:

DGMGRL> show fast_start failover

Fast-Start Failover: DISABLED
  Threshold:           90 seconds
  Target:              (none)
  Observer:            orcl_stdby1
  Lag Limit:           30 seconds
  Shutdown Primary:    TRUE
  Auto-reinstate:      TRUE

Configurable Failover Conditions
  Health Conditions:
    Corrupted Controlfile          YES
    Corrupted Dictionary           YES
    Inaccessible Logfile            NO
    Stuck Archiver                  NO
    Datafile Offline               YES

  Oracle Error Conditions:
    (none)

DGMGRL> show database verbose orcl_primary;

Database
  Name:            orcl_primary
  Role:            PRIMARY
  Enabled:         YES
  Intended State:  TRANSPORT-ON
  Instance(s):
    orcl_primary

  Properties:
    DGConnectIdentifier             = 'orcl_primary'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'ASYNC'
    DelayMins                       = '0'
    Binding                         = 'OPTIONAL'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '30'
    RedoCompression                 = 'DISABLE'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'AUTO'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '4'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = ''
    LogFileNameConvert              = ''
    FastStartFailoverTarget         = 'orcl_stdby1'
    StatusReport                    = '(monitor)'
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = '11gPrimary'
    SidName                         = 'orcl_primary'
    StandbyArchiveLocation          = '/u01/app/oracle/flash_recovery_area/ORCL/'
    AlternateLocation               = ''
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = 'log_%s_%t_%r.arc'
    LatestLog                       = '(monitor)'
    TopWaitEvents                   = '(monitor)'

Current status for "orcl_primary":
SUCCESS

DGMGRL> show database verbose orcl_stdby1

Database
  Name:            orcl_stdby1
  Role:            PHYSICAL STANDBY
  Enabled:         YES
  Intended State:  APPLY-ON
  Instance(s):
    orcl_stdby1

  Properties:
    DGConnectIdentifier             = 'orcl_stdby1'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'ASYNC'
    DelayMins                       = '0'
    Binding                         = 'OPTIONAL'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '30'
    RedoCompression                 = 'DISABLE'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'AUTO'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '4'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = ''
    LogFileNameConvert              = '/u01/app/oracle/oradata/orcl/, /u01/app/oracle/oradata/stdby/'
    FastStartFailoverTarget         = 'orcl_primary'
    StatusReport                    = '(monitor)'
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = '11gStdby'
    SidName                         = 'orcl_stdby1'
    StandbyArchiveLocation          = '/u01/app/oracle/flash_recovery_area/STDBY/'
    AlternateLocation               = ''
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = 'log_%s_%t_%r.arc'
    LatestLog                       = '(monitor)'
    TopWaitEvents                   = '(monitor)'

Current status for "orcl_stdby1":
SUCCESS

Configuring Detection of Advanced Failover Conditions

Which physical standby database is the target for Fast-Start Failover and how long to wait until initiating Fast-Start Failover are obviously the most important reactions I want to control in a failover situation. However, Oracle 11g Data Guard Broker also offers the ability to adjust just how much lost redo data is permitted before declaring a failover and what it should do with the primary database after the failover operation. Here’s a complete list of the configurable thresholds, triggering events, and post-failover behaviors in a Fast-Start Failover situation:

Table 7-2. Fast-Start Failover: Configurable Reactions

FSFO Parameter

Default Value

Triggering Event

FastStartFailoverTarget

NULL

Indicates which database the FSFO will choose as its failover target should a Fast-Start Failover occur

FastStartFailoverThreshold

90

Determines how long the FSFO will wait until it triggers a Fast-Start Failover. Note that both the target physical standby and the FSFO must have lost network connectivity to the primary database for this time limit (in seconds) before FSF is triggered

FastStartFailoverLagLimit

30

When the primary database is operating in Maximum Performance mode, this sets a limit for the amount of lost data in seconds before the FSFO should trigger a Fast-Start Failover

FastStartFailoverPmyShutdown

TRUE

Determines whether the FSFO will automatically shut down the primary database after a Fast-Start Failover occurs

FastStartFailoverAutoReinstate

TRUE

Determines whether the FSFO will automatically attempt to reinstantiate the designated primary database as the physical standby as soon as the FSFO detects the primary site is once again available

Advanced Failover Options. In addition to these conditional triggering events and controls, FSFO also offers exquisitely detailed control over which failures of the primary database’s infrastructure would trigger a Fast-Start Failover:

Table 7-3. Fast-Start Failover: Advanced Failover Triggers

Trigger

Active By Default?

Explanation

Datafile Offline

Yes

Tells FSFO to perform a failover whenever a datafile is offline on the primary database

Corrupted Controlfile

Yes

Tells FSFO to perform a failover whenever a corrupted control file is detected on the primary database

Corrupted Dictionary

Yes

Requests FSFO to perform a failover whenever corruption is detected within the primary database’s data dictionary

Inaccessible Logfile

No

Forces FSFO to perform a failover whenever any one online redo log member cannot be accessed on the primary database

Stuck Archiver

No

Tells FSFO to perform a failover whenever archived redo logging cannot proceed on the primary database because there is no more space to write out archived redo logs on that server

These two sets of Fast-Start Failover conditions offer me extremely fine-grained control over when FSFO should trigger a failover automatically. And even if these controls are insufficient for my situation, there’s one other option: I can configure FSFO to trigger a Fast-Start Failover upon receipt of a specific Oracle 11g error message code. For example, if I wanted to force a failover whenever a critical lack of space in a Flashback Data Archive occurred, I could configure FSFO to initiate automatic failover whenever the ORA-55623 error is detected on the primary database with the following command:

DGMGRL> ENABLE FAST_START FAILOVER CONDITION 55623;

Activating the Fast-Start Failover Observer

Now that the configuration of FSFO is complete, all I need to do is enable the configuration via DGMGRL as shown below. Note that I’m also enabling logging of Data Guard Broker activity for the command-line utility so that I can track any unexpected issues related to the FSFO’s performance or configuration:

[oracle@11gStdby ~]$ dgmgrl -logfile 11gStdby1_observer.log
DGMGRL for Linux: Version 11.1.0.6.0 - Production

Copyright (c) 2000, 2005, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/oracle
Connected.
DGMGRL> ENABLE FAST_START FAILOVER;
Enabled.

Finally, it’s time to start up FSFO. Once again, I’ll use DGMGRL to start the Fast-Start Failover Observer process:

DGMGRL> START OBSERVER;

Once the FSFO is started, I can confirm that it’s been activated properly with the SHOW CONFIGURATION and SHOW DATABASE commands:

DGMGRL> show configuration verbose

Configuration
  Name:                MAA_orcl
  Enabled:             YES
  Protection Mode:     MaxPerformance
  Databases:
    orcl_primary - Primary database
    orcl_stdby1  - Physical standby database
                 - Fast-Start Failover target

Fast-Start Failover: ENABLED
  Threshold:           180 seconds
  Target:              orcl_stdby1
  Observer:            11gStdby
  Lag Limit:           30 seconds
  Shutdown Primary:    TRUE
  Auto-reinstate:      TRUE

Current status for "MAA_orcl":
Warning: ORA-16608: one or more databases have warnings

DGMGRL> show database orcl_primary

Database
  Name:            orcl_primary
  Role:            PRIMARY
  Enabled:         YES
  Intended State:  TRANSPORT-ON
  Instance(s):
    orcl_primary

Current status for "orcl_primary":
SUCCESS

DGMGRL> show database orcl_stdby1

Database
  Name:            orcl_stdby1
  Role:            PHYSICAL STANDBY
  Enabled:         YES
  Intended State:  APPLY-ON
  Instance(s):
    orcl_stdby1

Current status for "orcl_stdby1":
SUCCESS

DGMGRL> show fast_start failover

Fast-Start Failover: ENABLED
  Threshold:           180 seconds
  Target:              orcl_stdby1
  Observer:            11gStdby
  Lag Limit:           30 seconds
  Shutdown Primary:    TRUE
  Auto-reinstate:      TRUE

Configurable Failover Conditions
  Health Conditions:
    Corrupted Controlfile          YES
    Corrupted Dictionary           YES
    Inaccessible Logfile            NO
    Stuck Archiver                  NO
    Datafile Offline               YES

  Oracle Error Conditions:
    (none)

Automatic Detection of Failover Conditions: An Example

Now that FSFO is fully configured and is ready to detect a failover situation, I’ll use the same technique I used in the prior article about Data Guard failover to simulate a failure of the primary database: I’ll simply issue the kill -9 <pid> command against its Server Monitor (SMON) background process. Once again, the death of the primary database is almost immediately recorded in its alert log:

. . .
Tue Aug 25 18:54:10 2009
Errors in file /u01/app/oracle/diag/rdbms/orcl_primary/orcl_primary/trace/orcl_primary_pmon_6166.trc:
ORA-00474: SMON process terminated with error
PMON (ospid: 6166): terminating the instance due to error 474
Instance terminated by PMON, pid = 6166
. . .

Just as before, the loss of connectivity to the primary database is reflected within the alert log of the corresponding physical standby databases by its Remote File Server (RFS) background process:

. . .
Tue Aug 25 18:54:49 2009
RFS[2]: Possible network disconnect with primary database
Tue Aug 25 18:54:49 2009
RFS[1]: Possible network disconnect with primary database
Tue Aug 25 18:55:49 2009
. . .

This time, however, there’s a dramatic difference! After approximately three minutes have elapsed, there’s a sudden flurry of activity at the physical standby site as the FSFO automatically detects the failure of the primary database. In Listing 7.1, I’ve captured the alert logs of both databases as well as the Data Guard Broker log entries to show all of the actions that Oracle 11g initiates during a Fast-Start Failover. After the automatic failover is complete, the Data Guard configuration fully reflects the successful actions of the FSFO:

DGMGRL> show configuration verbose

Configuration
  Name:                MAA_orcl
  Enabled:             YES
  Protection Mode:     MaxPerformance
  Databases:
    orcl_stdby1  - Primary database
    orcl_primary - Physical standby database (disabled)
                 - Fast-Start Failover target

Fast-Start Failover: ENABLED
  Threshold:           180 seconds
  Target:              orcl_primary
  Observer:            11gStdby
  Lag Limit:           30 seconds
  Shutdown Primary:    TRUE
  Auto-reinstate:      TRUE

Current status for "MAA_orcl":
Warning: ORA-16608: one or more databases have warnings

DGMGRL> show database verbose orcl_stdby1

Database
  Name:            orcl_stdby1
  OEM Name:        orcl_11gStdby1
  Role:            PRIMARY
  Enabled:         YES
  Intended State:  TRANSPORT-ON
  Instance(s):
    orcl_stdby1

  Properties:
    DGConnectIdentifier             = 'orcl_stdby1'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'ASYNC'
    DelayMins                       = '0'
    Binding                         = 'OPTIONAL'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '30'
    RedoCompression                 = 'DISABLE'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'AUTO'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '4'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = ''
    LogFileNameConvert              = '/u01/app/oracle/oradata/orcl/, /u01/app/oracle/oradata/stdby/'
    FastStartFailoverTarget         = 'orcl_primary'
    StatusReport                    = '(monitor)'
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = '11gStdby'
    SidName                         = 'orcl_stdby1'
    StandbyArchiveLocation          = '/u01/app/oracle/flash_recovery_area/STDBY/'
    AlternateLocation               = ''
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = 'log_%s_%t_%r.arc'
    LatestLog                       = '(monitor)'
    TopWaitEvents                   = '(monitor)'

Current status for "orcl_stdby1":
Warning: ORA-16829: fast-start failover configuration is lagging


DGMGRL> show database verbose orcl_primary

Database
  Name:            orcl_primary
  OEM Name:        orcl_11gPrimary
  Role:            PHYSICAL STANDBY
  Enabled:         NO
  Intended State:  APPLY-ON
  Instance(s):
    orcl_primary

  Properties:
    DGConnectIdentifier             = 'orcl_primary'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'ASYNC'
    DelayMins                       = '0'
    Binding                         = 'OPTIONAL'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '30'
    RedoCompression                 = 'DISABLE'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'AUTO'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '4'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = ''
    LogFileNameConvert              = ''
    FastStartFailoverTarget         = 'orcl_stdby1'
    StatusReport                    = '(monitor)'
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = '11gPrimary'
    SidName                         = 'orcl_primary'
    StandbyArchiveLocation          = '/u01/app/oracle/flash_recovery_area/ORCL/'
    AlternateLocation               = ''
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = 'log_%s_%t_%r.arc'
    LatestLog                       = '(monitor)'
    TopWaitEvents                   = '(monitor)'

Current status for "orcl_primary":
Error: ORA-16661: the standby database needs to be reinstated

Reinstating the Original Primary Database

My previous example of initiating Fast-Start Failover brings to light an interesting situation: What if the primary database was actually completely healthy at the time that FSFO acknowledged the conditions for Fast-Start Failover? Here’s where the brilliance of enabling Flashback Logging on both the primary and physical standby databases really shines through: With a single command, it’s a simple matter to reinstate the original primary database as a physical standby database.

To illustrate, I’ll issue the REINSTATE DATABASE command from a DGMGRL session connected to the new primary database, ORCL_STDBY1, and I’ll designate the original primary database, ORCL_PRIMARY, as the target of the reinstatement:

DGMGRL> reinstate database orcl_primary

Once again, there’s a flurry of activity on the original primary database as Data Guard Broker successfully attempts the reinstatement. I’ve captured the pertinent alert log entries from the ORCL_PRIMARY database in Listing 7.2, and DGMGRL reflects the appropriate Data Guard configuration once the reinstatement has completed:

DGMGRL> show configuration verbose

Configuration
  Name:                MAA_orcl
  Enabled:             YES
  Protection Mode:     MaxPerformance
  Databases:
    orcl_stdby1  - Primary database
    orcl_primary - Physical standby database
                 - Fast-Start Failover target

Fast-Start Failover: ENABLED
  Threshold:           180 seconds
  Target:              orcl_primary
  Observer:            11gStdby
  Lag Limit:           30 seconds
  Shutdown Primary:    TRUE
  Auto-reinstate:      TRUE

Current status for "MAA_orcl":
SUCCESS

Switching Back. Since the original primary database is now successfully restored as part of the Data Guard environment, I’ll request the original primary and physical standby databases to switch roles with the SWITCHOVER command:

DGMGRL> switchover to orcl_primary;
Performing switchover NOW, please wait...
New primary database "orcl_primary" is opening...
Operation requires shutdown of instance "orcl_stdby1" on database "orcl_stdby1"
Shutting down instance "orcl_stdby1"...
ORA-01109: database not open

Database dismounted.
ORACLE instance shut down.
Operation requires startup of instance "orcl_stdby1" on database "orcl_stdby1"
Starting instance "orcl_stdby1"...
ORACLE instance started.
Database mounted.
Switchover succeeded, new primary is "orcl_primary"

DGMGRL> show configuration verbose;

Configuration
  Name:                MAA_orcl
  Enabled:             YES
  Protection Mode:     MaxPerformance
  Databases:
    orcl_primary - Primary database
    orcl_stdby1  - Physical standby database
                 - Fast-Start Failover target

Fast-Start Failover: ENABLED
  Threshold:           180 seconds
  Target:              orcl_stdby1
  Observer:            11gStdby
  Lag Limit:           30 seconds
  Shutdown Primary:    TRUE
  Auto-reinstate:      TRUE

Current status for "MAA_orcl":
SUCCESS

Deactivating Fast-Start Failover

To deactivate Fast-Start Failover, all I need to do is issue the DISABLE FAST_START FAILOVER command from within a DGMGRL session:

DGMGRL> DISABLE FAST_START FAILOVER;
Disabled.

Note that this only disables the possibility of future Fast-Start Failovers until I re-enable the Fast-Start Failover configuration with the ENABLE FAST_START FAILOVER command; all of the Fast-Start Failover configuration details I’ve so carefully constructed are still intact.

Next Steps

The next article in this series will explore how to construct and maintain a Logical Standby database in Oracle 11g, focusing on their usefulness in data warehouse and data mart environments.

References and Additional Reading

While I’m hopeful that I’ve given you a thorough grounding in the technical aspects of the features I’ve discussed in this article, I’m also sure that there may be better documentation available since it’s been published. I therefore strongly suggest that you take a close look at the corresponding Oracle documentation on these features to obtain crystal-clear understanding before attempting to implement them in a production environment. Please note that I’ve drawn upon the following Oracle Database 11g documentation for the deeper technical details of this article:

B28279-02 Oracle Database 11g New Features Guide

B28294-03 Oracle Database 11g Data Guard Concepts and Administration

B28295-03 Oracle Database 11g Data Guard Broker

B28320-01 Oracle Database 11g Reference Guide

Also, this white paper about Fast-Start Failover Best Practices on Oracle Technology Network (OTN) helps clarify this feature set.

» See All Articles by Columnist Jim Czuprynski

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved