Oracle 10gR2 RAC Load Balancing Features

Wednesday Mar 21st 2007 by Jim Czuprynski
Share:

Learn how Oracle 10gR2 takes advantage of client-side connection load balancing, server-side load balancing, and the new Load Balancing Advisor (LBA) features to improve the distribution of workloads across the appropriate nodes in an Oracle 10gR2 (RAC) clustered database environment.

Synopsis. To insure an even workload distribution, a clustered database must employ methods to distribute incoming sessions evenly and effectively across the various components of its cluster. This article explores how Oracle 10gR2 takes advantage of simple client-side connection load balancing, server-side load balancing, and the new Load Balancing Advisor (LBA) features to significantly improve the distribution of workloads across the appropriate nodes in an Oracle 10gR2 Real Application Clusters (RAC) clustered database environment.

Oracle 10g Real Application Clusters (RAC) provides a robust, high-availability (HA) architecture that is scalable to an unprecedented size as business computing requirements increase. Since Oracle has placed the concept of services as the centerpiece of this architecture, it also means that application workloads can be distributed appropriately across the entire clustered database. This virtually eliminates the possibility that any one application will starve because of resource requirements during its peak processing periods. It also guarantees that there is always a bit of “Kentucky windage” in the overall estimates for the required peak capacity of any Oracle 10g database system.

Oracle 10g provides several different methods to support load balancing capabilities for this service-based architecture, including client-side connection load balancing, client-side connection failover, and server-side connection load balancing. Since these methods existed before Oracle 10g, I’ll briefly review them to explain their relative benefits and shortfalls.

Client-Side Load Balancing Methods

Client-Side Connection Load Balancing. This load balancing method has been available since Oracle 8i. When a user session attempts to connect to the database, the database’s listener will assign the session randomly to one of the listed multiple listening endpoints. Listing 1 shows an example of the TNSNAMES.ORA network configuration file entries for an alias named CSLB that uses the LOAD_BALANCING=ON directive to implement this load balancing feature.

While this load balancing method is certainly simplest to implement, it also has an obvious limitation: the listener has no idea if the session has been assigned to an endpoint whose corresponding database server is already overloaded. Moreover, since the listener is essentially picking the connection completely at random, there is no guarantee that the connection chosen will even be available at that time. This may force the session to wait for a considerable length of time – perhaps even minutes, a relative eternity in computing timeframes! – until the operating system indicates to the listener that the connection is unavailable, which causes the user session to fail eventually with an ORA-01034 ORACLE not available error.

Client-Side Connection Failover. Obviously, balancing the overall number of connections is a desirable goal, but what happens if the chosen connection point is unavailable because the database server’s listener is no longer active? To forestall this, Oracle 8i provided the capability to determine if the connection that has been chosen at random is still “alive” and, if not, continue to try the other listed connection points until a live connection is found. This simple method tends to limit the chance of a lost connection, but unfortunately it must rely on TCP/IP timeouts to determine if a connection is alive or dead, and this means that an application may wait several seconds (or even longer!) before it receives a notification that the connection has been terminated.

I’ve laid out the TNSNAMES.ORA entries to activate client-side connection failover in Listing 2. They are almost identical to Listing 1 with the notable exception of one more directive: FAILOVER=ON.

Server-Side Load Balancing

The two previous methods will adequately handle the distribution of user sessions across available resources while helping to guarantee that no session will wait an excessive time to find a currently active address on which to connect. Clearly, a better solution was needed, and Oracle 9i offered one: server-side load balancing. This method divides the connection load evenly between all available listeners by determining the total number of connections on each listener, and then distributing new user session connection requests to the least loaded listener(s) based on the total number of sessions already connected. While a bit more complex to implement because it requires configuration of multiple listeners, it most definitely helps to even out connections across all available listeners in a database system.

To implement server-side load balancing, at least two listeners must be configured. Also, the REMOTE_LISTENERS initialization parameter must be added to the database’s PFILE or SPFILE so that the database knows to search out the value provided in that parameter in the database server’s TNSNAMES.ORA configuration file. When server-side load balancing is activated, each listener that contributes a listening endpoint communicates with the other listener(s) via each database instance’s PMON process. Oracle then determines how many user connections each listener is servicing, and it will distribute any new connection requests so that the load is balanced evenly across all servers. The entries in TNSNAMES.ORA direct the listeners to share information about the relative load connectivity.

As shown in Listing 3, I’ve gathered these required changes to TNSNAMES.ORA. I’ve also created a new PL/SQL package, HR.PKG_LOAD_GENERATOR, that incorporates three different methods to generate user sessions in an attempt to “overload” an Oracle 10gR2 database listener. A simple shell script, RandomLoadGenerator.sh, calls a few SQL command files that in turn make calls to the package’s procedures and thus create a sample workload of approximately 40 connections against an Oracle 10gR2 database that contains the standard sample schemas.

Load Balancing In Oracle 10g Real Application Clusters Environments

These three methods are actually quite effective for distribution of incoming connections evenly across multiple listeners in any single-instance database configuration. An Oracle 10gR2 Real Application Clusters (RAC) clustered database, on the other hand, needs more robust load balancing capabilities because of the nature of that environment.

A RAC clustered database comprises at least two (and usually many more) nodes, each running a separate instance of the clustered database. In addition, a RAC database usually needs to supply a minimum amount of connections and resources to several applications, each with dramatically different resource needs depending on the current business processing cycle(s), so the application load that’s placed on each instance in the clustered database therefore can be dramatically different at different times of the day, week, month, and year. Finally, it’s likely that a RAC clustered database will need to guarantee a minimum cardinality (i.e. a specific number of nodes on which the application needs to run at all times) to one or more mission-critical applications.

RAC Services. Starting in Oracle 8i, an Oracle database could dynamically register a database directly with its corresponding listener(s) based on the settings for the SERVICE_NAMES initialization parameter through the database’s Process Monitor (PMON) background process. To completely support this feature, Oracle strongly suggested that the SERVICE_NAME parameter should be used instead of the original SID parameter in the TNSNAMES.ORA configuration file so that an incoming user session could immediately identify the database instance to which a session intended to connect.

Oracle 10g RAC leverages this service naming feature to distribute application connections efficiently across a RAC clustered database. For example, a clustered database may need to support three different applications, OLTP, DSS, and ADHOC. The OLTP application is the main order entry application for this enterprise computing environment, and therefore it needs a minimum cardinality of two cluster database instances at all times. The DSS application, on the other hand, supports extraction, transformation and load (ETL) operations for the enterprise’s data warehouse, and thus it requires a minimum cardinality of just one instance. Likewise, the ADHOC application supports OLAP and general user query execution against the data warehouse, but it too only requires a minimum cardinality of a single instance.

Oracle 10gR2 RAC: Server-Side Connect-Time Load Balancing

To demonstrate the implementation of load balancing features in a RAC environment, I’ll use a relatively straightforward testing platform: a simple two-node RAC clustered database, RACDB, with two instances, RACDB1 and RACDB2, configured on two nodes (RACLINUX1 and RACLINUX2, respectively). I’ve set up this configuration using two VMWare Virtual Machines, each running CentOS Linux Enterprise Server 3 Release 8 (kernel 2.4.21-40) as the guest configuration.

Listing 4 shows the SRVCTL commands I’ve issued to create and start three new application services, OLTP, DSS, and ADHOC, on the RACDB clustered database. Note that I’ve specified both the RACDB1 and RACDB2 instances as the preferred instances for all three applications. When I execute SRVCTL commands to create these services, Oracle 10g automatically adds these service name values to the SERVICE_NAMES parameter for each instance in the cluster database.

I’ve also set up three application aliases for these applications in the client TNSNAMES.ORA entries configuration files. Note that I’ve specified the SERVICE_NAME parameter as RACDB so that all nodes in the cluster can participate in distributing the load of these applications across the cluster. I’ve also used the two nodes’ virtual IP addresses (raclinux1-vip and raclinux2-vip) as the connection points for these services. This guarantees that if any one of the listeners or instances servicing these applications should fail, ONS will automatically relocate any new connection requests to a new listener alias on another surviving node.

To complete the configuration of server-side connection load balancing for this RAC clustered database, note that I’ve set the *.REMOTE_LISTENERS=RACDB_LISTENERS initialization parameter in the database’s shared SPFILE. I’ve also added a corresponding RACDB_LISTENERS entry to the TNSNAMES.ORA file in the Oracle Home of each node in the cluster. Each database’s PMON process will now automatically register the database with the database’s local listener as well as cross-register the database with the listeners on all other nodes in the cluster. In this mode, the nodes themselves decide which node is least busy, and then will connect the client to that node.

It’s also important to realize that in a RAC environment, the server-side load balancing methodology differs slightly from the methodology used in a single-instance environment because Oracle 10gR2 discriminates whether the incoming connection has been requested as either a dedicated or a shared server connection:

  • If a dedicated session is requested, then the listener will select the instance first on the basis of the node that is least loaded; if all nodes are equally loaded, it will then select the instance that has the least load.
  • For a shared server connection, however, the listener goes one step further. It will also check to see if all of the available instances are equally loaded; if this is true, the listener will place the connection on the least-loaded dispatcher on the selected instance.

Advanced Load Balancing: The Load Balancing Advisor (LBA)

In a RAC environment, it’s not unlikely that one node may become overwhelmed by application requests for resources. For example, let’s assume that a two-node clustered database currently has 100 user sessions connected across both nodes in the cluster, and that the database is using standard server-side load balancing. If there is a sudden “storm” of 200 additional connections, the listeners on the two nodes will simply distribute them evenly across both nodes, 100 to each node, resulting in 150 connections per node. However, it’s possible that node RACDB2 is actually much “busier” than the other node. As a result, node RACDB2 will most likely be completely overwhelmed by those new connections’ resource demands, while node RACDB1 remains relatively underutilized with plenty of additional resources.

The good news is that Oracle 10gR2 now provides an advanced method to overcome this imbalance: the Load Balancing Advisor (LBA). The LBA will calculate how much work each instance in the clustered database has been asked to do. The LBA will then make sure that any new incoming work is routed to the instance(s) that are least busy. This determination takes place on a service-by-service basis across all instances that are providing resources to each service.

The LBA considers the following factors when determining how to perform this “balancing act”:

  • Are there any differences in processing power between nodes?
  • Are there any sessions that are currently blocked from execution because of waits?
  • Have any failures occurred on a node that might block processing from continuing?
  • Are there any services that are competing for resources, and have those services been granted a different priority to complete their tasks at hand?

Not only does this help to balance out the total workload across all instances in the cluster, it also insures that one node won’t be overwhelmed by requests for service by its primary application(s).

As part of its Fast Application Notification (FAN) event transmission features, Oracle 10g RAC uses Oracle Notification Services (ONS) to communicate the status of cluster resources to all participating node applications in the cluster. In Oracle 10gR2, FAN added a new event type, SERVICE_METRIC, whose event payload contains information about the relative workload of each node in the RAC cluster, and it’s this information that the Load Balancing Advisory uses to determine how to route new connections throughout the clustered database’s instances.

Oracle 10gR2 supplies an upgraded DBMS_SERVICE.MODIFY_SERVICE procedure that modifies an existing RAC service so that it can utilize the Load Balance Advisory for service-by-service load balancing. This procedure provides two crucial parameters, GOAL and CLB_GOAL, that interact to determine what methods (if any) Oracle 10gR2 will utilize to perform load balancing. For example, if I set the GOAL parameter to either the GOAL_SERVICE_TIME or the GOAL_THROUGHPUT enumerated constants for a specific RAC service, Oracle 10gR2 will activate the Load Balancing Advisory for load balancing of that service’s incoming connections. Table 1 explains the difference between these two load balancing targets.

Table 1. Oracle 10gR2 Load Balancing Advisory (LBA) Service Goals

LBA Goal

Explanation

GOAL_NONE

Setting DBMS_SERVICE.GOAL to this value disables the LBA.

GOAL_SERVICE_TIME

The LBA calculates a weighted moving average of the total elapsed time for completed work plus the bandwidth that’s available to the service to calculate the service goodness. This goal is ideal for services whose workload may change dramatically over a short period of time, e.g. an application that services a “clicks and mortar” store that provides customer self-service through an internet-based shopping web site.

GOAL_THROUGHPUT

The LBA calculates a weighted moving average of throughput (i.e. the rate at which work is completed) in addition to the bandwidth available to the service to calculate the service goodness. This goal is best suited for long-duration tasks that are typically queued to run serially, e.g. scheduled jobs that handle large batches of transactions.

In addition, the CLB_GOAL parameter provides yet another load balancing method: It tells the LBA how to perform connection load balancing. See Table 2 for a list of its acceptable enumerated constant values and the impact of setting either.

Table 2. Oracle 10gR2 Connection Load Balancing (CLB) Goals

CLB Goal

Explanation

CLB_GOAL_SHORT

The Load Balancing Advisory will be used for connection load balancing only if it is enabled (i.e. set to other than GOAL_NONE). If the LBA has been disabled, connection load balancing will utilize abridged advice determined by CPU utilization.

CLB_GOAL_LONG

Connection load balancing will be determined by first tallying the total number of connections per instance, and then by counting the number of sessions per each service. Oracle recommends using this setting for services whose applications tend to connect for long periods of time (e.g. Oracle Forms). The Load Balancing Advisory can be used in conjunction with this setting as long as the connection pool has been sized to accommodate “gravitation “ within the pool without adding or subtracting connections. Oracle recommends this option as the most efficient design.

Depending on the settings for these two parameters, the LBA uses the metrics shown in Table 3 to make its determination of which instance should receive the next set of incoming connections:

Table 3. Oracle 10gR2 Load Balancing Advisory (LBA) Metrics

LBA Goal

LBA Metric Used

Explanation

LONG

Session Count By Instance

The LBA uses this metric to evenly distribute sessions across the cluster’s nodes. This works best for services that are distributed uniformly across the cluster’s instances, and the nodes have similar capacities.

LONG

Node Run Queue Length

If the service uses only a subset of the RAC instances, and the nodes have dissimilar capacities, then the LBA will attempt to place more sessions on the node with the least load at the time that the connection was created.

SHORT

Goodness By Service

This method incorporates a ranking for the quality of service that the service is experiencing at an instance level, including whether access has been restricted from an instance. It attempts to prevent a listener from routing all connections to the same instance in between updates to the goodness value because each listener will adjust its local goodness rating – computed as a delta value - as connections are distributed across the cluster. This delta represents the average of resource time connections are consuming when they use that service. Also, to prevent a “storm surge” of logins to the least busy instance, Oracle 10gR2 will compute a threshold delta if the goodness rating delta value is too low.

Proof of Concept: LBA in Oracle 10gR2 RAC

To enable my RAC database environment for LBA, I’ll first activate three different levels of load balancing using DBMS_SERVICE.MODIFY_SERVICE against the three new services I just created:

  • The ADHOC service has been set up with no LBA load balancing (GOAL=GOAL_NONE) and no connection load balancing, either (CLB_GOAL=CLB_GOAL_LONG).
  • The DSS service will use the LBA for load balancing with optimized service time as its goal (GOAL=GOAL_SERVICE_TIME and CLB_GOAL=CLB_GOAL_SHORT).
  • Finally, the OLTP service will use the LBA for load balancing with a goal of throughput (GOAL=GOAL_THROUGHPUT and CLB_GOAL=CLB_GOAL_SHORT).

Listing 5 shows the code I’ve used to activate LBA management for these services, as well as a query against the DBA_SERVICES view plus the resulting output that verifies the proper settings for GOAL and CLB_GOAL for these services.

Now that these three services are configured for LBA, I’ll use a combination of shell scripts to place a load on one of the two nodes for this RAC clustered database and then attempt to connect to both nodes in the cluster using the three RAC services. I’ll first execute the ConstantLoadGenerator.sh shell script to create 100 sessions that execute a SQL script that places a moderately heavy load on the database’s CPU, and then I’ll execute shell script RandomLoadGenerator.sh to simulate 40 additional user connections for one of the three services.

How can I determine what the impact will be of running these tests? Oracle 10gR2’s GV$SERVICEMETRIC global view shows each service’s “goodness” on each instance in the RAC clustered database, the service’s “predicted” goodness (or delta), and summary statistics on how much resources the service consumed during the time period. Listing 6 shows the query I’ve constructed against that view; here’s the result of that query before I started my evaluations:

                                        Current Service-Level Metrics
                                           (From GV$SERVICEMETRIC)
 
                                               Pred-               CPU     Elpsd      # 0f
                                               icted              Time      Time      User
                                               Good-               Per       Per     Calls    DBTime
Service                         Inst    Good    ness              Call      Call       Per       Per
Name     Start Time End Time      ID    ness    Incr   Flags     (mus)     (mus)    Second    Second
-------- ---------- ---------- ----- ------- ------- ------- --------- --------- --------- ---------
ADHOC 15:06:41 15:06:48 1 0 1 0 0 0 0 0 15:06:23 15:06:29 2 0 1 0 0 0 0 0 15:05:02 15:06:02 1 0 1 0 0 0 0 0 15:04:37 15:05:37 2 0 1 0 0 0 0 0 DSS 15:06:41 15:06:48 1 100 100 4 0 0 0 0 15:06:23 15:06:29 2 100 100 4 0 0 0 0 15:05:02 15:06:02 1 100 100 4 0 0 0 0 15:04:37 15:05:37 2 100 100 4 0 0 0 0 OLTP 15:06:41 15:06:48 1 0 100 4 0 0 0 0 15:06:23 15:06:29 2 0 100 4 0 0 0 0 15:05:02 15:06:02 1 0 100 4 0 0 0 0 15:04:37 15:05:37 2 0 100 4 0 0 0 0

Here’s the results of the first test against the ADHOC RAC service. While my tests executed, I noticed that Oracle 10gR2 continued to route connections to each node in the cluster regardless of the load on each instance, and the number of connections remained relatively constant on both nodes as new connections arrived. Here are the results of a query against GVSERVICE_METRIC for that test:

>>> Output after ADHOC unit testing:
                                        Current Service-Level Metrics
                                           (From GV$SERVICEMETRIC)
                                               Pred-               CPU     Elpsd      # 0f
                                               icted              Time      Time      User
                                               Good-               Per       Per     Calls    DBTime
Service                         Inst    Good    ness              Call      Call       Per       Per
Name     Start Time End Time      ID    ness    Incr   Flags     (mus)     (mus)    Second    Second
-------- ---------- ---------- ----- ------- ------- ------- --------- --------- --------- ---------
ADHOC 15:10:48 15:10:52 1 34 1 0 6481 69808 65 454 ADHOC 15:10:36 15:10:42 2 26 1 0 1106 3419 92 31 ADHOC 15:09:02 15:10:02 1 34 1 0 0 0 0 0 ADHOC 15:08:38 15:09:38 2 26 1 0 0 0 0 0
DSS 15:10:48 15:10:52 1 100 100 4 0 0 0 0 DSS 15:10:36 15:10:42 2 100 100 4 0 0 0 0 DSS 15:09:02 15:10:02 1 100 100 4 0 0 0 0 DSS 15:08:38 15:09:38 2 100 100 4 0 0 0 0 OLTP 15:10:48 15:10:52 1 0 100 4 0 0 0 0 OLTP 15:10:36 15:10:42 2 0 100 4 0 0 0 0 OLTP 15:09:02 15:10:02 1 0 100 4 0 0 0 0 OLTP 15:08:38 15:09:38 2 0 100 4 0 0 0 0

For the tests I ran against the DSS and OLTP RAC services, however, I noticed that Oracle 10gR2 definitely tended to route connections to the node that was least loaded, and this was reflected in the query results against the GVSERVICE_METRIC global view for these tests:

>>> Output after DSS unit testing:
                                        Current Service-Level Metrics
                                           (From GV$SERVICEMETRIC)
                                               Pred-               CPU     Elpsd      # 0f
                                               icted              Time      Time      User
                                               Good-               Per       Per     Calls    DBTime
Service                         Inst    Good    ness              Call      Call       Per       Per
Name     Start Time End Time      ID    ness    Incr   Flags     (mus)     (mus)    Second    Second
-------- ---------- ---------- ----- ------- ------- ------- --------- --------- --------- ---------
ADHOC    15:12:48   15:12:53       1       0       1       0         0         0         0         0
ADHOC    15:12:37   15:12:42       2       0       1       0         0         0         0         0
ADHOC    15:11:06   15:12:06       2       0       1       0         0         0         0         0
ADHOC    15:11:02   15:12:02       1       0       1       0     49585    322550         2        54
DSS 15:12:48 15:12:53 1 900 9947 0 26051 200992 14 273 DSS 15:12:37 15:12:42 2 2050 9901 0 4301 11800 56 66 DSS 15:11:06 15:12:06 2 2050 9901 0 885 916 2 0 DSS 15:11:02 15:12:02 1 900 9947 0 0 0 0 0
OLTP 15:12:48 15:12:53 1 0 100 4 0 0 0 0 OLTP 15:12:37 15:12:42 2 0 100 4 0 0 0 0 OLTP 15:11:06 15:12:06 2 0 100 4 0 0 0 0 OLTP 15:11:02 15:12:02 1 0 100 4 0 0 0 0
>>> Output after OLTP unit testing:
 
                                        Current Service-Level Metrics
                                           (From GV$SERVICEMETRIC)
 
                                               Pred-               CPU     Elpsd      # 0f
                                               icted              Time      Time      User
                                               Good-               Per       Per     Calls    DBTime
Service                         Inst    Good    ness              Call      Call       Per       Per
Name     Start Time End Time      ID    ness    Incr   Flags     (mus)     (mus)    Second    Second
-------- ---------- ---------- ----- ------- ------- ------- --------- --------- --------- ---------
ADHOC    15:14:48   15:14:53       1       0       1       0         0         0         0         0
ADHOC    15:14:36   15:14:42       2       0       1       0         0         0         0         0
ADHOC    15:13:06   15:14:06       2       0       1       0         0         0         0         0
ADHOC    15:13:02   15:14:02       1       0       1       0         0         0         0         0
DSS      15:14:48   15:14:53       1     100     100       4         0         0         0         0
DSS      15:14:36   15:14:42       2     100     100       4         0         0         0         0
DSS      15:13:06   15:14:06       2     100     100       4         0         0         0         0
DSS      15:13:02   15:14:02       1     100     100       4     59223    173148         0         3
OLTP 15:14:48 15:14:53 1 5200 645 0 7799 53529 46 246 OLTP 15:14:36 15:14:42 2 3800 524 0 5354 16001 55 88 OLTP 15:13:06 15:14:06 2 3800 524 0 0 0 0 0 OLTP 15:13:02 15:14:02 1 5200 645 0 0 0 0 0

Conclusion

With the advent of the Load Balancing Advisory in Oracle 10gR2, it’s now possible to insure that an unexpectedly dramatic increase in a RAC service’s workload does not overwhelm any single node of an Oracle 10gR2 RAC clustered database. Oracle 10gR2 leverages proven, existing RAC components like FAN and ONS event publishing to implement this enhancement. Finally, several new and enhanced data dictionary views provide excellent feedback to measure the efficiency of the Load Balancing Advisory.

Download the SQL scripts and shell scripts for this article.

References and Additional Reading

Even though I’ve hopefully provided enough technical information in this article to encourage you to explore with these features, I also strongly suggest that you first review the corresponding detailed Oracle documentation before proceeding with any experiments. Actual implementation of these features should commence only after a crystal-clear understanding exists. Please note that I’ve drawn upon the following Oracle 10gR2 documentation for the deeper technical details of this article:

B14197-03 Oracle 10gR2 Clusterware and Real Application Clusters Administration and Deployment Guide

B14203-08 Oracle 10gR2 Clusterware and Real Application Clusters Installation Guide for Linux

B14212-02 Oracle 10gR2 Net Services Administrator’s Guide

B14213-01 Oracle 10gR2 Net Services Reference

B14237-02 Oracle 10gR2 Reference

B14258-01 Oracle 10gR2 PL/SQL Packages and Types Reference

B14210-02 Oracle 10gR2 High Availability Overview

B25159-01 Oracle 10gR2 High Availability Best Practices

And these MetaLink documents are also invaluable to obtaining a deeper understanding of how the Load Balancing Advisory has been implemented in Oracle 10gR2:

226880.1 Configuration of Load Balancing and Transparent Application Failover

» See All Articles by Columnist Jim Czuprynski

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved