In this concluding part of the series, we will discuss the Cluster Latency and Overhead alarms, the Global Cache alarms and the ASM alarms.
I am beginning to get emails from people who are attempting to install RAC on their ESX servers for production use. To them, I suggest getting started with tools like SoRAC, and even benchmarking tools like Swingbench, to generate an even load and make that database sweat!
Yes, you guessed it, we will get on to performance testing now!
Hear those alarm bells...
Managing a RAC cluster is a big responsibility. Fortunately, SoRAC comes fitted with alarms to warn us about problems in advance.
Cluster Latency and Overhead Alarms (CLA and COA)
Cluster latency and overhead alarms are primarily related to the high speed interconnect. Compare a single Oracle instance with a RAC cluster: in a single instance, when a user requests data in the form of a data block, Oracle first looks for it on the "conveyor belt" (the buffer cache with its LRU and MRU lists), as I will call it here. If the block is not found in the buffer, Oracle retrieves it directly from disk, causing a disk I/O; not a good thing, as you would expect to tune your database so that the block is already in the buffer. In a RAC cluster, on the other hand, a session that requests a data block first looks in its own buffer cache and, since the copy does not exist there, requests a copy of the block from the other clustered instance(s) via the interconnect (your HSI cable, 1G or even 10G). The block is then transferred to the requesting instance from the holding instance's buffer; if no instance has the block available, the requesting instance has to go to disk to get it.
All this requesting, transferring and waiting leads to what is called a "cluster overhead" and as the technical manual says:
"Cluster Overhead is defined as the amount of time spent waiting for cluster related activities as a percentage of the total time spent waiting for all activities. If this percentage exceeds a threshold of 5%, and assuming there is significant work being done by the cluster, than an overhead alarm is raised. If this percentage exceeds a threshold of 5%, and assuming there is significant work being done by the cluster, then an overhead alarm is raised.
The cluster latency is defined as the average elapsed time from when a requesting instance requests a block of data to the time when the requesting instance receives the block. If the average time exceeds 5 milliseconds per block, then a latency alarm is fired. "
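If you want to sanity-check these figures outside of SoRAC, the underlying numbers live in Oracle's dynamic performance views. The sketch below is only an approximation of the two metrics described above (it assumes a 10g RAC database and its statistic names; the 5% and 5 ms figures are the thresholds quoted from the manual, not fixed Oracle values):

```sql
-- Rough cluster overhead: time waited on the Cluster wait class as a
-- percentage of all non-idle wait time. TIME_WAITED is in centiseconds.
SELECT ROUND(100 * SUM(CASE WHEN wait_class = 'Cluster'
                            THEN time_waited ELSE 0 END)
             / SUM(time_waited), 2) AS cluster_overhead_pct
FROM   v$system_wait_class
WHERE  wait_class <> 'Idle';

-- Rough average global cache latency per received block, in milliseconds
-- (the receive-time statistics are recorded in centiseconds, hence the * 10).
SELECT inst_id,
       ROUND(10 * SUM(CASE WHEN name LIKE 'gc%block receive time'
                           THEN value ELSE 0 END)
             / NULLIF(SUM(CASE WHEN name LIKE 'gc%blocks received'
                               THEN value ELSE 0 END), 0), 2) AS avg_ms_per_block
FROM   gv$sysstat
WHERE  name IN ('gc cr block receive time', 'gc cr blocks received',
                'gc current block receive time', 'gc current blocks received')
GROUP  BY inst_id;
```

SoRAC computes its own averages over time, so treat these as ballpark figures rather than the exact values behind the alarms.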
The following print screen shows how both Latency and Overhead issues are rather prominent.
Moreover, both of these alarms are normally fired at the same time, as high latency often accompanies a cluster overhead. These two alarms can fire in the following scenarios:
- Even if cluster latency is low, an excessive amount of inter-instance block transfer can cause the cluster overhead to be high. This might occur if there are "hot" blocks in an application that are constantly being transferred between instances.
- Cluster latency may be high while the cluster overhead remains bearable, because the application finds its blocks on the local instance and rarely needs to fetch them from other instances. (This is typical of a partitioned scenario, where an OLTP/DSS database ends up finding its partitioned data on its own instance.)
Upon clicking on the Latency alarm, you can go to the drilldown, which will display a lot of information regarding the latency issues:
The top left graph shows the overall cluster latency: the large spike on the right of the graph, when averaged over time, has resulted in the latency alarm being fired. Note that this graph is only available for Oracle 10g clusters.
The ping graph shows the time required to transfer a packet across the HSI (High Speed Interconnect) cable.
All other graphs are a mere breakdown of the latency components such as "Total Latency," "Prepare Time," "Prepare Latency" and "Transfer Latency," all rather self-explanatory.
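If you prefer to see where that time goes at the wait-event level rather than through SoRAC's graphs, something along these lines (again assuming 10g global cache event names) lists the individual "gc" wait events per instance:

```sql
-- Per-instance breakdown of the individual global cache wait events.
-- AVG_MS is the average wait per occurrence, in milliseconds.
SELECT inst_id, event, total_waits,
       ROUND(time_waited_micro / 1000 / NULLIF(total_waits, 0), 2) AS avg_ms
FROM   gv$system_event
WHERE  event LIKE 'gc%block%'
ORDER  BY time_waited_micro DESC;
```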
The same applies when we click on the Overhead alarm: the drilldowns display a breakdown of the overhead issues. As the manual puts it: "The Cluster Overhead graph compares waits due to the RAC interconnect to non-RAC waits. The recent data shows the RAC waits as a significant percentage of the overall waits, resulting in the Overhead alarm being fired."
As you can see on the right, the color coding says it all.
Global Cache Alarm (GCA)
GCA is further split into the Cache Miss Rate alarm, the Corrupt Blocks alarm and the Lost Blocks alarm. So what is a cache miss? When a user requests a data block, the block is cached in the SGA for re-use by other sessions until it is no longer needed and the space is required for other data. A "cache miss" occurs when a data block requested by the user session is not found in the local SGA and has to be obtained from a remote instance, or read from disk.
The Cache Miss Rate metric indicates the percentage of such requests that cannot be satisfied from the local cache, and it directly affects the overall performance of the cluster. When it exceeds a threshold, a Global Cache Local Miss Rate alarm is fired. Possible causes of this alarm are:
- Hot blocks (a "hot" block is a heavily requested one; think of the blocks touched by the top queries in your mission critical application)
- Inappropriately sized local cache (if your local cache does not hold the data, it has to be requested from the global cache).
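As a rough cross-check of the miss rate outside SoRAC, the sketch below relates the blocks shipped in from remote instances to the total number of logical block gets on each instance. This is an approximation using standard 10g statistics, not SoRAC's exact formula:

```sql
-- Approximate global cache "miss" rate: blocks that had to be shipped in
-- from another instance as a percentage of all logical block gets.
SELECT inst_id,
       ROUND(100 * SUM(CASE WHEN name IN ('gc cr blocks received',
                                          'gc current blocks received')
                            THEN value ELSE 0 END)
             / NULLIF(SUM(CASE WHEN name IN ('consistent gets', 'db block gets')
                               THEN value ELSE 0 END), 0), 3) AS remote_block_pct
FROM   gv$sysstat
WHERE  name IN ('gc cr blocks received', 'gc current blocks received',
                'consistent gets', 'db block gets')
GROUP  BY inst_id;
```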
Check out the print screen:
The Miss Type graph gives a breakdown of the misses for Current Block requests (for an Update operation) or for Consistent Block requests (for a Read operation). The other three graphs delve into further detail, such as total misses, current block misses and consistent block misses.
Now it is a totally different story with corrupt blocks. Yes, we do have them too, don't we? SoRAC raises separate alarms for corrupt and lost blocks. As the manual states,
"The Corrupt Blocks alarm is raised when the number of corrupt blocks exceeds a threshold as a proportion of total global cache consistent read requests. The Lost Blocks alarm is similar, but is based on the number of lost blocks. Possible causes of these alarms are:
- The cluster interconnect (which I sometimes call HSI) is overloaded by an excessive number of block transfers.
- Checksum processes have been disabled and can no longer identify corrupted blocks."
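The raw counters behind these two alarms are also visible in GV$SYSSTAT; a quick look such as the one below (10g statistic names assumed) tells you whether any blocks have been lost or flagged as corrupt on each instance:

```sql
-- Lost and corrupt global cache blocks per instance (cumulative since startup).
-- Anything persistently non-zero here usually points at interconnect trouble.
SELECT inst_id, name, value
FROM   gv$sysstat
WHERE  name IN ('gc blocks lost', 'gc blocks corrupt')
ORDER  BY inst_id, name;
```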
Check out this print screen:
As you can see, there are blocks lost (top left).
ASM Alarms
Well, we know by now what ASM is. And yes, we do have alarms related to ASM as well. This print screen shows the total I/O activity on an ASM instance.
SoRAC can raise an ASM Service Time alarm when the average time taken to perform a read on the ASM instance has exceeded a specified threshold value. This indicates that the ASM subsystem is performing poorly. "Service time represents the response time of a disk plus any wait time experienced during disk operations."
As you can see clearly below, the two graphs show I/O throughput and service time overhead.
"ASM service time alarms will typically be associated with alarms for the service time of the shared disk subsystem as a whole. However, because ASM disk groups can be shared across multiple RAC clusters and because a cluster may comprise both ASM and non-ASM storage, it is possible that this alarm might fire independently of the normal disk alarms."
In this last SoRAC article, we have seen that we are working with a highly intelligent tool that displays all of this information graphically. Needless to say, it is a very handy tool for DBAs. So, what are you waiting for? Go ahead and try it out!