Implementing RAC, or any clustered solution, requires planning. Many mission-critical applications are very demanding and are often load balanced at both the front end (application servers) and the back end (database servers). Let's break the planning up into several components, such as network, storage and infrastructure.
Designing the Infrastructure
This is the mother of all foundations. You need to select the platforms, network configurations and storage options carefully. I always advise starting small and smart so that you have a good scale-out plan. Going out and buying the coolest HP EVA 8000 for your RAC might seem like a great idea when the consultant is on board, but your TCA (Total Cost of Acquisition) will shoot through the roof. There are many options to choose from, but we'll look at the basic ones here, assuming that you have already made some fundamental choices:
Storage (SAN, NAS, DAS), Redundant HBA (Host Bus Adapters)
Network: NICs, high speed interconnects, redundant switches, VLANs, NIC teaming
Software/Application stack: OS (Windows, Linux, Solaris, AIX)
Hardware optimization (CPUs: AMD or Intel, Dual socket, Quad Socket)
Let's break it all up and start with storage.
In many cases, storage already resides on a SAN or NAS. Be sure to have two redundant SAN switches, connected to separate power units! We recently had a problem with our SAN where the hosting firm (the same place where Google has its farm hosted) made the mistake of hooking our SANs up to the same power supply. LUNs should all be the same size. I recommend RAID 5 (SAN RAID 5 is allegedly 5 times faster than a poor man's RAID 5). Distribute the LUNs equally across the storage processors, with 2-port HBAs connecting to each node. This way, you protect against a SPOF at the storage processor, HBA and LUN levels. Ensure that the LUNs are visible from all nodes; the equal sizing gives you predictable I/O throughput. Avoid routing traffic over the Inter Switch Link between SAN switches by designing the SAN switch layout appropriately.
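The "same size, spread equally across storage processors" rule can be sketched in a few lines of Python. This is only an illustration under assumed names: the LUN names, the sizes and the storage processor labels `SPA` and `SPB` are hypothetical, and a real array would be configured through the vendor's own tooling rather than a script like this.

```python
from collections import defaultdict

def distribute_luns(luns, storage_processors):
    """Assign LUNs to storage processors round-robin.

    luns: list of (name, size_gb) tuples (hypothetical names and sizes).
    storage_processors: list of storage processor labels (hypothetical).
    Raises ValueError if the LUNs are not all the same size.
    """
    sizes = {size for _, size in luns}
    if len(sizes) > 1:
        raise ValueError(f"LUNs are not equally sized: {sorted(sizes)}")

    assignment = defaultdict(list)
    for index, (name, _) in enumerate(luns):
        # Round-robin keeps the LUN count per processor within one of each other.
        sp = storage_processors[index % len(storage_processors)]
        assignment[sp].append(name)
    return dict(assignment)

if __name__ == "__main__":
    plan = distribute_luns(
        [("lun0", 100), ("lun1", 100), ("lun2", 100), ("lun3", 100)],
        ["SPA", "SPB"],
    )
    for sp, names in plan.items():
        print(sp, names)
```

The size check comes first on purpose: a mixed-size layout is rejected outright instead of being silently balanced by count.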
High Speed Internetwork Connects
Make sure that your high speed interconnect is supported and that you have separate network cards dedicated to it. (You can also team two NICs for the private traffic and two NICs for the public traffic.) In any case, make sure that the following are taken into account:
- Oracle Clusterware uses the interconnect, and its messages are all small packets.
- Oracle database: transfers both large and small blocks across the nodes and is the main user of the HSIs.
- Bandwidth: I have normally seen dedicated CAT5e cabling, but lately it is mostly separate VLANs for private traffic, each assigned its own 1Gbps NICs.
Note that 10Gbps NICs are gearing up for production use, and while we will not deviate from our path, remember that a lot of other standards are emerging in the wake of the expensive SAN fabrics, such as FCoE (Fibre Channel over Ethernet), AoE (ATA over Ethernet) and iSCSI. Getting back to the interconnect traffic across the nodes, we can take a look at the messages. They can be small messages (under 256 bytes) from GES and GCS, cache fusion block messages, or messages from a tuned parallel query against a load-intensive DSS tablespace or a DSS RAC itself (the default block size is 8K, but you may adjust it to a much higher value). Therefore, to do the math, add up the bytes per second generated by the GES+GCS messages, the Parallel Query messages and the Cache Fusion blocks, and then divide that total by the bandwidth available on the interconnect. Normally a 1Gbps backbone handles this fine, but there's no harm in checking the expected values before you start complaining about performance issues.
Where do I get my figures? Simple: check the AWR reports (we will cover them thoroughly when I have my laptop RAC fully configured on my dual core HP laptop). There you will find the Cache Fusion blocks (8K default), the Parallel Query message chunks (8K default) and the GES+GCS messages. After doing the quick math, you will usually find that a typical 1Gbps NIC is sufficient.
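As a rough sketch of that quick math, here it is in Python. The per-second rates in the example are made-up placeholders, not figures from a real AWR report, and the function name and parameters are my own. The calculation also ignores protocol overhead (UDP/IP headers, retransmits), so treat the result as a lower bound on utilization.

```python
def interconnect_utilization(
    ges_gcs_msgs_per_s,
    cf_blocks_per_s,
    pq_msgs_per_s,
    small_msg_bytes=256,   # upper bound for GES/GCS messages
    block_bytes=8192,      # default 8K database block
    pq_msg_bytes=8192,     # 8K default Parallel Query message chunk
    link_gbps=1.0,         # interconnect bandwidth in gigabits per second
):
    """Return the fraction of the interconnect used (0.0 to 1.0 and beyond)."""
    bytes_per_s = (
        ges_gcs_msgs_per_s * small_msg_bytes
        + cf_blocks_per_s * block_bytes
        + pq_msgs_per_s * pq_msg_bytes
    )
    bits_per_s = bytes_per_s * 8
    return bits_per_s / (link_gbps * 1e9)

if __name__ == "__main__":
    # Placeholder rates: replace them with the values from your own AWR report.
    used = interconnect_utilization(
        ges_gcs_msgs_per_s=5000,
        cf_blocks_per_s=2000,
        pq_msgs_per_s=500,
    )
    print(f"Estimated interconnect utilization: {used:.1%}")
```

With these placeholder rates the traffic comes to roughly 174 Mbit/s, well inside a 1Gbps link. If your own numbers push the fraction toward 1.0, it is time to look at teaming or faster NICs.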
We have covered the installation test-beds in our ongoing installation series and have tried to stick to a scalable, virtualized RAC infrastructure, which might even come close to production planning. Choosing an OS is normally done long before you start with RAC; if it hasn't been, choose wisely. The Oracle install locations (CLUSTER_HOME, ORACLE_HOME, etc.) must each reside in their own directory. We checked the mount points and even provided mirroring on separate virtual disks for the OCR, the voting disk and the ASM spfile.
Network configurations can be taken care of in the same way. Switches, VLANs and NICs can be made redundant and load balanced. This ensures that your network stays up should any of those hardware components fail. Then the choice comes down to memory and CPUs. With RAC you are in horizontal scaling mode, so I wouldn't worry too much about that. Also, note that Oracle RAC can be a very virtualization-aware application. Having said this, virtualization at the processor level will provide a greater performance win should you choose a 4 socket quad core node. Imagine 16 cores on a simple DL 585 box, or, if you choose a smaller version such as a DL 385, an 8 core box! Isn't that amazing? Make sure that you stick to some basic procurement policies, like buying the same type of machines with the same specs. This not only helps diagnose problems at the hardware level but also helps propagate updates in a uniform manner.
Capacity planning and scaling out is more than sticking a wet finger in the air. You will certainly need a robust and reliable infrastructure, but you should also note that it is the metrics (we will cover them extensively in our AWR articles) that will help you determine when and where you can use more capacity. That can be at the node level (advisable), meaning adding nodes, or at the per-machine level (CPU, memory, etc.), meaning scaling up each individual machine. Obviously you start looking at the CPU when it stays above 75% for a length of time, and you rely on your instincts to scale out your infrastructure before downtime hits you. But keeping the metrics and practicing on them (with a replica of your production in a virtualized infrastructure for test and development) will help you plan and sustain a highly available RAC successfully.
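The "over 75% for a length of time" rule of thumb can be turned into a simple check. The Python sketch below is an illustration only. The sample values, the function name and the choice of six consecutive samples are my own assumptions, and in practice the readings would come from your monitoring or AWR data.

```python
def sustained_over_threshold(samples, threshold=75.0, min_consecutive=6):
    """Return True if CPU stayed above `threshold` percent for at least
    `min_consecutive` consecutive samples.

    samples: CPU utilization percentages taken at a fixed interval
    (hypothetical data; e.g. one reading every 10 minutes).
    """
    run = 0
    for cpu in samples:
        # Extend the current run, or reset it on the first sample at or below the threshold.
        run = run + 1 if cpu > threshold else 0
        if run >= min_consecutive:
            return True
    return False

if __name__ == "__main__":
    # Made-up readings: a short spike, then a sustained busy period.
    readings = [40, 82, 90, 60, 78, 80, 85, 88, 91, 79, 50]
    if sustained_over_threshold(readings):
        print("CPU has been above 75% for a sustained period: plan to scale out.")
    else:
        print("No sustained CPU pressure detected.")
```

Requiring consecutive samples rather than a single reading keeps short spikes, such as a nightly batch job, from triggering a scale-out decision.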