This is part four of a four-part tutorial. This part discusses how to interpret the most common NT Server and SQL Server Performance Monitor counters.
In the previous three parts of this tutorial, you learned how to use Performance Monitor to collect SQL Server-related performance data, store it in SQL Server, and how to perform trend analysis on it. Now it is time to take a look at how to interpret the data you have collected.
For the purposes of this article, the Performance Monitor data can be in any form, whether it is being viewed with Performance Monitor itself, Microsoft Excel, or any other program. All you need is access to the results of your data. How you interpret your data does not depend on the format in which it is viewed.
While our main focus on this article is on SQL Server performance, we must keep in mind that the operating system (whether it is Windows NT Server 4.0 or Windows 2000) is closely intertwined with SQL Server, especially when it comes to Performance Monitor counters. In fact, when monitoring SQL Server performance, I tend to monitor more operating system performance counters than SQL Server counters. This is because many of the most important performance counters are a part of the operating system, not SQL Server. Keep this in mind as you read about how to interpret the various performance counters later in this article.
As you may know, Windows NT Server 4.0 and Widows 2000 have over 350 performance monitor counters available. SQL Server 7.0 and SQL Server 2000 both have over 100 performance monitor counters available. As might well imagine, the number of potential performance-related counters is overwhelming. Fortunately, the huge number of performance monitor counters available are seldom used. Generally, I monitor about a dozen counters on a regular basis, and only use some of the more obscure ones when researching specific performance-related problems. In this article, I will focus on the key performance counters only.
Focus on the Big Picture
My goal when I monitor SQL Server with Performance Monitor is to get the big picture, not the details. If the big picture indicates that I need to drill down for more detail, that is OK. But until then, I have better things to do with my time. The big picture allows me to see how my SQL Servers are doing, and that is why I only focus on about a dozen Performance Monitor counters, which are discussed below.
Like most people, I use Performance Monitor to help identify potential performance bottlenecks, which are usually categorized into one of these five area:
CPU: SQL Server can't do its job if it has run out of CPU cycles, so monitoring this potential bottleneck is important.
Memory: If you want maximum SQL Server performance, your server must not have a memory bottleneck. Sure, the operating system can page if there is not enough physical RAM in the server, but do you really want to want that long?
I/O: Of all the potential performance bottlenecks, disk I/O is the probably hardest one to correct. And like lack of physical RAM, it can significantly affect SQL Server's performance.
Network: This is generally the least bothersome bottleneck, as most servers today can hardly fill a 100Mbs network connection, no matter how hard they try.
SQL Server Specific: This includes a variety of SQL Server counters that you can monitor to help identify several potential SQL Server performance-related problems.
In the next several sections we will be taking a look at each of the above bottlenecks, and examining the various Performance Monitor counters we can use to help identify them. We will also be taking a look at how to interpret their results, and take a look at how to resolve these various bottlenecks.
CPU Performance Monitor Counters
Measuring the CPU activity of your SQL Server is a key way to identify potential CPU bottlenecks. The Process Object: % Processor Time counter is available for each CPU (instance), and measures the utilization of each individual CPU. While viewing CPU activity for each of the CPUs in your server can be useful, I generally prefer to monitor the total CPU activity for the server, using another counter, described in the next paragraph. If your server has multiple CPUs and you use the above counter to watch each one, you will notice that how busy each CPU is varies considerably. Don't worry about this. While the operating system does its best to evenly spread the processing load over each CPU, the reality is that this is an imperfect process and different CPUs will be busier than others.
The System Object: % Total Processor Time counter measures the average of all the CPUs in your server. This is the key counter to watch for CPU utilization. If the % Total Processor Time counter exceeds 80% for continuous periods (over 10 minutes or so), then you may have a CPU bottleneck on your server. Occasional spikes of 100% are nothing to worry about and are normal for most SQL Servers.
This counter is also useful when performing trend analysis. For example, if you notice that while the % Total Processor Time counter is well within limits now, but you notice that is increasing each month, month after month, this is a good clue that you may eventually run out of CPU cycles on your server. If you recognize this problem now, then you can better plan for the future.
While the % Total Processor Time counter is important, I don't like to rely on just a single counter to let me know if a server has a bottleneck or not. Another valuable indicator of CPU performance is the System Object: Processor Queue Length. If the Processor Queue Length exceeds 2 per CPU for continuous periods (over 10 minutes or so), then you probably have a CPU bottleneck. For example, if you have 4 CPUs in your server, the Processor Queue Length should not exceed a total of 8 for the entire server.
Use both the Processor Queue Length and the % Total Process Time counters together to determine if you have a CPU bottleneck. If both indicators are exceeding their recommended amounts during the same continuous time periods, you can be assured there is a CPU bottleneck.
If the Processor Queue Length regularly exceeds the recommended maximum, but the CPU utilization is not correspondingly as high (which is typical), then consider reducing the SQL Server "max worker threads" configuration setting. It is possible the reason that the Processor Queue Length is high is because there are an excess number of worker threads waiting to take their turn. By reducing the number of "maximum worker threads", what you are doing is forcing thread pooling to kick in (if it hasn't already), or to take greater advantage of thread pooling.
If your SQL Server is experiencing CPU bottlenecks, consider these possible solutions:
Get faster CPUs, or add additional CPUs.
Get CPUs with a larger L2 cache.
Tune your application so that it doesn't have to access the disk as often. For example, add indexes so table scans aren't needed, normalize your database to eliminate redundant data, etc.)
Tune your queries to reduce the CPU load.
Move some of the processing load to another SQL Server.
Consider turning on Windows NT fibers. [7.0, 2000 only]
Be sure that both OLTP and OLAP queries are not being run on the same server. These different database applications should be performed on separate servers.
I/O Performance Counters
If your I/O subsystem is working efficiently, then each time SQL Server wants to write or read data, it can without waiting. But if the load on the server is too great, then reads and writes will have to wait, each taking their turn. This can significantly reduce SQL Server's performance.
The best way to monitor this is to use the PhysicalDisk Object: Avg. Disk Queue Length to monitor each disk array in your server. If the Avg. Disk Queue Length exceeds 2 for continuous periods (over 10 minutes or so) for each disk drive in an array, then you probably have an I/O bottleneck for that array. You will need to calculate this figure because Performance Monitor does not know how many physical drives are in arrays.
For example, if you have an array of 6 physical disks, and the Avg. Disk Queue Length is 10 for a particular array, then the actual Avg. Disk Queue Length for each drive is .83 (10/12=.83), which is well within the recommend 2 per physical disk.
The Physical Disk Object: % Disk Time counter is another handy tool for several reasons. This counter measures how busy a physical array is (not a logical partition or individual disks in an array). It provides a good relative measure of how busy your arrays are, and over a period of time, can be used to determine if I/O needs are your server are increasing, indicating a potential need for more I/O capacity in the near future.
As a rule of thumb, the % Disk Time counter should run less than 90%. If this counter exceeds 90% for continuous periods (over 10 minutes or so), then your SQL Server may be experiencing an I/O bottleneck. If you suspect a physical disk bottleneck, you may also want to monitor the Physical Disk Object: % Disk Read Time counter and the Physical Disk Object: % Disk Write Time counter in order to help determine if the I/O bottleneck is being mostly caused by reads or writes.
Also, this counter is a good indicator of how busy each array on your server is. By monitoring each array, you can tell how well balanced your I/O is over each array. Ideally, you want to distribute the I/O load of SQL Server as evenly as possible over your arrays, and this counter will tell you how successful you have been doing this. If you find that one array is much busier than another array, you may want to consider moving one or more files from the busy array to the less busy array.
Before using these counters for Window NT Server 4.0, be sure to manually turn it on by going to the NT Command Prompt and entering the following: "diskperf -y", and then rebooting your server. This is required to turn on the disk counters on for the first time.
If your SQL Server is experiencing I/O bottlenecks, consider these possible solutions:
Add additional physical RAM so that your server can go to RAM instead of the I/O system to access data.
If you are not already, use RAID level 5 or RAID level 10 for your arrays. RAID level 10 is the fastest RAID level you can choose that supports redundancy.
Add more physical drives to the current arrays. This helps to boost both read and write access times. But don't add more drives to the array than your I/O controller can support.
Replace your current hard drives with faster drives.
Add faster or additional I/O controllers. Consider adding more cache (of possible) to your current controllers.
Tune your application so that it doesn't have to access the disk as often. For example, add indexes so table scans aren't needed, normalize your database to eliminate redundant data, etc.)
Move database or transaction log files from busy arrays to less busy arrays.
Store your databases and transaction log files on a SAN (storage area network).
Used partitioned views and federated servers to distribute workload (2000 only).
Memory Performance Counters
This section assumes that your server is dedicated to SQL Server, and perhaps some related server utilities. If not, and you are having memory-related performance problems on your SQL Server, then your first step should be to move all non-SQL Server-related programs off of the physical server running SQL Server. Once you have done that, then use the following information to look for memory-related bottlenecks.
One of the key counters you should be regularly watching is the Memory Object: Pages/Sec. This measures the number of pages per second that are paged out of memory to disk, or paged into memory from disk. Assuming that SQL Server is the only major application running on your server, then this figure should average nearly zero, except for occasional spikes, which are normal.
Over continuous periods of time (10 minutes or so) the Pages/Sec should ideally be near zero (you will see some activity). If this is not the case, this means that NT Server is having to page data, which it should not be doing. This is because SQL Server does not use NT Server's page file, and since this should be the only application on your server, there should be little paging going on.
If there is regular paging going on, this means that you are running other applications on your server, which is causing NT Server to page, or you have set the SQL Server Max Server Memory configuration setting to some other setting other than "Dynamically configure SQL Server memory". Determine which is the problem and fix it, as this paging is slowing down SQL Server's performance. Ideally, remove the NT applications causing the paging.
If you have changed the SQL Server Max Server Memory configuration to some other value other than "Dynamically configure SQL Server memory", then change it back to this setting. SQL Server should be allowed to take as much RAM as it wants for its own use without having to compete for RAM with other applications.
Another way to double-check to see if your SQL Server has enough physical RAM is to check the Memory Object: Available Bytes counter. This counter can be viewed from Performance Monitor or from the NT Server or Windows 2000 Task Manager (see the Performance tab). This value should be greater than 5MB. If not, then your SQL Server needs more physical RAM. On a server dedicated to SQL Server, SQL Server attempts to maintain from 4-10MB of free physical memory. The remaining physical RAM is used by the operating system and SQL Server. When the amount of available bytes is less than 4MB, most likely SQL Server is also paging (which it shouldn't) and is experiencing a performance hit.
If your SQL Server is experiencing memory bottlenecks, consider these possible solutions:
Add more physical RAM. If you are already at 2GB of physical RAM and are still having a memory problem, consider upgrading to SQL Server 7 Enterprise Edition (which supports up to 3GB), or upgrading to SQL 2000 Enterprise Edition, which in theory can support up to 64GB of physical RAM.
Ensure that SQL Server is the only application running on the server (other than server management utilities).
Remove or disable all unnecessary services.
Ensure SQL Server is running as a member server (not a domain controller).
Configure SQL Server to allocate memory dynamically (the default), don't hard code how much RAM SQL Server is allocated.
Network Performance Counters
One of the best ways to monitor if you have a network bottleneck is to watch the Network Interface Object: Bytes Total/Sec counter. This counter measures the number of bytes that are being sent back and forth between your server and the network. This includes both SQL Server and non-SQL Server network traffic. Assuming your server is a dedicated SQL Server, then the vast majority of the traffic measured by this counter should be from SQL Server.
There is no hard and fast "correct" number for this counter as it measures the actual traffic. To help you decide if your server has a network bottleneck, one way to use this number is to compare it with the maximum traffic supported by the network connection your server is using. Also, this is another important counter to watch over time. It is important to know if your network traffic is increasing regularly. If it is, then you can use this information to help you plan for future hardware needs.
If your SQL Server is experiencing network bottlenecks, consider these possible solutions:
Add faster network cards.
Add additional network cards.
Server network card should be attached to switches.
Network cards should be running in full duplex mode.
Tune your application so that it does not require unnecessary network trips. Do this by returning only the required data and used stored procedures.
Remove all unnecessary network protocols from the server.
Use TCP/IP as the network library on the client and server.
Before you can use the network performance counters, the Network Monitor Agent service must be installed on your server. After installing it, you will have to reboot. Also, don't forget to rerun the latest NT service pack to update the files added during the installation process.
SQL Server Performance Counters
Up to this point we have focused on identifying and resolving key hardware-related bottlenecks. In this section, we will take a look of some of the Performance Monitor counters you can use to identify specific performance issues in SQL Server.
One of the key SQL Server Performance Monitor counters to watch is the SQL Server Buffer Manager Object: Buffer Cache Hit Ratio. This indicates how often SQL Server goes to the buffer, not the hard disk, to get data. In OLTP applications, this ratio should exceed 95% on a regular basis. If it doesn't, then you should consider adding more RAM to your server to increase performance. In OLAP applications, the hit ratio may be much less because of the nature of how OLAP works. In any case, more physical RAM should increase the performance of SQL Server running either OLTP or OLAP applications.
If you want to see how much physical RAM is devoted to SQL Server's data cache, monitor the SQL Server Buffer Manager Object: Cache Size (pages). This number is presented in pages, so you will have to take this number and multiply it by 8K (8,192) to determine the amount of RAM in K that is being used. Generally, this number should almost come close to the total amount of RAM in your computer, assuming you are devoting your server to SQL Server. This number should be close to the total amount of RAM in the server, less the RAM used by NT, SQL Server, and any utilities you have running on the server. If the amount of RAM devoted to the data cache is much smaller than you would expect, then you need to do some investigating to find out why. Perhaps you aren't allowing SQL Server to dynamically allocate RAM. Whatever the cause, you need to find a solution, as the amount of data cache available to SQL Server can significantly affect SQL Server's performance.
Since the number of users using SQL Server affects its performance, you may want to keep an eye on the SQL Server General Statistics Object: User Connections. This shows the number of user connections, not the number of users, that currently are connected to SQL Server. When interpreting this number, keep in mind that a single user can have multiple connections open, and also that multiple people can share a single user connection. Don't make the assumption that this number represents actual users. Instead, use it as a relative measure of how "used" the server is. Watch the number over time to get a feel if your server is being more used, or less used. If usage is going up, then you can use this information to help you better plan future hardware needs.
One cause of excess I/O on a SQL Server is page splitting. Page splitting occurs when an index or data page becomes full, and then is split between the current page and a newly allocated page. While occasional page splitting is normal, excess page splitting can cause performance issues. To find out if you are experiencing a large number of page splits, monitor the SQL Server Access Methods Object: Page Splits/sec. Unfortunately, I don't have a good figure to recommend as to what this maximum number should be for this counter. The goal should be to keep it as low as possible. What you may want to do is watch it over a period of time. If it is increasing, this may indicate that you need to rebuild the indexes on the tables in your databases, and you may want to consider increasing the fillfactor on the indexes when you rebuild them.
If your users are complaining that they have to wait for their transactions to complete, you may want to find out if object locking on the server is contributing to this problem. To do this, use the SQL Server Locks Object: Average Wait Time (ms). You can use this counter to measure the average wait time of a variety of locks, including: database, extent, Key, Page, RID, and table. If you can identify one or more types of locks causing transaction delays, then you will want to investigate further to see if you can identify what specific transactions are causing the locking. The Profiler is the best tool for this detailed analysis.
While table scans are a fact of life, and sometimes faster than index seeks, generally it is better to have fewer table scans than more. To find out how many table scans your server is performing, use the SQL Server Access Methods Object: Full Scans/sec. Note that this counter is for an entire server, not just a single database. One thing you will notice with this counter is that there often appears to a pattern of scans occurring periodically. In many cases, these are table scans SQL Server is performing on a regular basis for internal use. What you want to look for are the random table scans that represent your application. If you see what you consider to be an inordinate number of table scans, then break out the Profiler and Index Tuning Wizard to help you determine exactly what is causing them, and if adding any indexes can help reduce the table scans. Of course, SQL may just be doing its job well, and performing table scans instead of using indexes because it is just plain more efficient.
If you suspect that your backup or restore operations are running at sub-optimal speeds, you can help verify this by using the SQL Server Backup Device Object: Device Throughput Bytes/sec. This counter will give you a good feel for how fast your backups are performing. You will also want to use the Physical Disk Object: Avg. Disk Queue Length counter to help collaborate your suspicions. Most likely, if your are having backup or restore performance issues, it is because of an I/O bottleneck.
Removing Bottlenecks is Only One Part of Performance Tuning
When many people think about performance tuning, they only think about how to remove hardware-related bottlenecks. What they don't know is that most SQL Server-related performance problems are not hardware related, but are mostly caused by poor database and application design. Because of this, they are often disappointed that they can't figure their performance issues by throwing hardware at the problem. Removing hardware bottlenecks is just one part of the much larger subject of SQL Server performance tuning and optimization.