Clustering for Indexes

Thursday Nov 4th 2004 by James Koopmann

Indexes hold the key for speed when accessing data. To ensure that you are getting to the data as fast as possible you should check to make sure the clustering is suitable for the query type.

There is nothing spectacular about using indexes per se. However, on many occasions I have come across a variety of SQL coders that never consider validating that the index they think they are using is efficient or even being used at all. We can all put indexes on the columns that we think will be required to satisfy individual queries, but how do we know if they will ever be used. You see, if the underlying table data is constructed, contains, or is ordered in a particular way, our indexes may never be used. One of the factors around the use of an index is its clustering factor and this is what this article is about.

Clustering defined

The dictionary.com definition of clustering is "A group of the same or similar elements gathered or occurring closely together." This is not too far from what we mean by clustering for an index when relating it to databases. If you were to extract a definition from the Oracle documentation, you would get to a concept of Clustering Factor, which is defined as "the amount of order of the rows in the table based on the values of the index." So what is a good clustering factor as opposed to a bad clustering factor? As the clustering factor approaches and reaches the number of blocks in the table, the index is considered to have good clustering. You can quickly see that if an index block points to a table data block, then all of the pointers in the index are pointing to table data that is very closely related and thus we will get very good response from using the index. On the other hand, poor or bad clustering is very much the opposite. As the clustering factor approaches the number of rows in the table, the worse our usability of the index becomes. This means that the entries in each of the index blocks are in different blocks in the table data blocks, our access to the data could not be sequential, and we loose all ability to scan table data in order of the index. Figure 1 gives a good picture of how good clustering is nicely ordered when accessing data blocks from the index blocks. Figure 2 shows how the index to data blocks is random for bad clustering.

Good clustering

Bad clustering

Effects on performance

1.      db file sequential & scattered reads
Depending on the index access and access to the table data that is signified by your SQL execution plan, if the clustering factor is bad we will end up doing more reads than we want to get to the data to produce the result set needed by the query. These waits on db file reads have huge implications, not only to the current SQL statement but also steal valuable resources from other SQL and taxing our i/o subsystem.

2.      Buffer Busy Waits
Directly related to high db file sequential & scattered reads is the requirement to obtain buffers in the buffer cache. The fact that we are reading many more blocks than we need, and possibly re-reading, we will tax requests for buffers and thus incur possible high wait times for this valuable resource. In addition, if you watch this area and try to tune it, it is like trying to hit a moving target because there tends to be no rhyme or reason to the random access for data.

3.     SQL execution path
Since the optimizer looks at the clustering factor statistic to determine the execution path, if the clustering factor is bad, Oracle will tend to do full table scans for result sets, as it may be less costly to pull in just table data as opposed to reading index and data blocks. All of this drives up the cost of the execution plan and time for the SQL to run.

Structure changes for cluster performance

1.      Order of columns in an index
This part of getting clustering to work is very important. This "typically" means that you should use the columns with the most selectivity first (A primary key is very selective). In addition, you should validate the order of the columns in your index along with the business rules and data heuristics. This is the only way you will know if the index you build will at least satisfy the types of queries requested of it.

2.      Load your data in the order you have set up the columns for the index(es). If you have more than one index on an object, you should revert to the business rules and SLAs around the queries that are most important. You can accomplish this by any method you like but it does require a rebuild (reorganization) of the table. If you have ETL processes, try to pre-sort the data as it will improve performance and limit the complaints latter.

3.      Add data needed in the table to the index. While this is more of a hack than a fix, if you cannot get to reorganize your data in the table this may be your only choice. You may also have a situation where the order in the table is great for a set of indexes but is totally out of order for another. Adding a column or two in these situations is a life saver.

Where to look at clustering

DBA_INDEXES For all indexes in your database
DBA_IND_PARTITIONS For all index partitions in your database
DBA_IND_STATISTICS For showing optimizer statistics for all indexes in your database.
DBA_IND_SUBPARTITIONS For all index subpartitions in your database.

Having a good clustered index on a table reduces the time to execute and resources required to produce result sets. This is important as other queries may starve for those resources. Asking yourself a few questions about the access paths and data required to ask well-defined business questions puts you closer to discovering the true indexes required within your databases. We all know that this is not a perfect world and often we inherit bad designs. By looking at the clustering factor, it only takes a moment, we can quickly gain insight into how the data looks in the objects being queried and thus possible performance issues we need to deal with.

» See All Articles by Columnist James Koopmann

Mobile Site | Full Site