Today's databases (and especially data warehouses) have
become so large, and their query workloads have become so complex, that it is
often impractical and even unnecessary to retrieve all of the data that may be
relevant to a particular query. This article outlines DB2's SQL query sampling
support with the new TABLESAMPLE clause, which has been
added to the FROM clause of the SQL subselect in version 8.1.2.
by Paul C. Zikopoulos and Roman B. Melnyk
Today's databases (and especially data warehouses) have become so large, and their query workloads have become so complex, that it is often impractical and even unnecessary to retrieve all of the data that may be relevant to a particular query. Rather than use all of the data that satisfies some search condition, it might be possible in some cases to arrive at valid conclusions based on analysis of a relatively small sampling of the data. With such an approach, random sampling of a database could greatly increase query performance. In fact, one could reasonably expect the resulting performance increases to be more or less inversely proportional to the sample size.
This article outlines SQL query sampling support in the IBM DB2 Universal Database (DB2 UDB) products as of Version 8.1.2. Sampling is supported in DB2 UDB V8.1.2 with the new TABLESAMPLE clause, which has been added to the FROM clause of the SQL subselect. The TABLESAMPLE clause allows you to specify the percentage of a table that is to be sampled during query processing. The value that you specify here is called the "sampling percentage" or the "sampling rate". For example, if the sampling rate is set to 0.1, then only 1/10 of 1% of the data will be sampled. This would mean that in a 10,000-row table, only 10 rows (approximately) would be accessed.
How could such a small sample produce valid results? There is empirical evidence demonstrating the accuracy of aggregate values (output from column functions such as AVG), even with low sampling rates such as 1% or less, when tables are large enough.
The most common application of query sampling is to queries that use aggregate functions such as AVG, SUM, or COUNT. In such cases, reasonably accurate results can be obtained from a sample of the data.
We can demonstrate this in a simple, yet effective way by running a set of 100 queries against the STAFF table of the SAMPLE database that comes with DB2 UDB. You can try this little exercise yourself; the easiest way to do it is to imbed the queries in a DB2 command script. Each query has the following structure:
db2 select avg(salary) as avg_salary
where n is the sampling rate, whose value ranges from 1 to 100, in increments of 1. Running such queries against a table such as STAFF, which has 35 records, shows that reasonably accurate results can be obtained, even with small tables, if the sampling rate is set high enough. Figure 1 shows that, in this case, that rate is still impressively low, somewhere around 25%.
Figure 1. A linear plot of the results from 100 queries against the STAFF table. The queries are identical except for the sampling rate specified on the TABLESAMPLE clause.
by Paul C. Zikopoulos
and Roman B. Melnyk
The TABLESAMPLE clause in DB2 UDB Version 8.1.2
Figure 2, which shows a partial syntax diagram for the SQL
subselect, includes only those syntax fragments that form a path to the
TABLESAMPLE clause. You can specify one of two different types of sampling,
BERNOULLI or SYSTEM:
Row-level Bernoulli sampling. This method uses a
"sargable" predicate to retrieve a sample of rows from a table. A
sargable predicate is a predicate that can exploit an index to retrieve rows
from a table. If no index is available, the performance associated with this
approach can be relatively poor, because every row must be retrieved and the
sampling predicate applied to it.
Page-level system sampling. In most cases, this
method retrieves a sample of pages (rather than rows) from a table. If a page
is selected, all of the rows in the page are selected. The performance
associated with this approach is excellent, because of the minimal amount of
input/output (I/O) activity that is required. If SYSTEM sampling is specified,
the optimizer determines the most efficient sampling strategy in the given
circumstances; in some cases, this may prove to be Bernoulli sampling.
Figure 2. A partial syntax diagram for the SQL
subselect, showing only those syntax fragments that form a path to the
TABLESAMPLE clause. The highlighted clause in each fragment points to the next
fragment in the diagram.
Let's compare row-level sampling with page-level sampling by
considering the following scenario. Assume that an index (to facilitate
row-level sampling) exists, and that R represents the average number of
rows per page.
In this scenario, for a given sampling rate:
The number of I/Os required for page-level sampling is 1/R
times the number of I/Os required for row-level sampling; page-level sampling
offers significant performance benefits.
The accuracy of aggregate estimates depends on the degree of data
clustering if the rows are clustered on any columns referenced in the query.
For example, suppose the query is computing SUM(x). If the rows
are randomly distributed among pages, the accuracy of page-level sampling will
be similar to the accuracy of row-level sampling. If, however, all of the x-values
within a given page are about the same, but the x-values differ
significantly from page to page, the accuracy of page-level sampling will be
inferior to the accuracy of row-level sampling, because selecting more than one
row per page would not add any new information to the sample. In this clustered
case, it would be better to select many pages, and to pick only one row from
each selected page (which is essentially row-level sampling).
Each execution of the query usually yields a different
sample, and the number of rows in the sample may be slightly different. If you
want the sampling to be repeatable from one execution of the query to the next
(during testing, for example), you can specify the REPEATABLE keyword, followed
by a numeric expression in parentheses, on the TABLESAMPLE clause. The numeric
expression is used to ensure that repeated executions of the query return the
same sample if the data has not been altered in any way.
As you can see, DB2 UDB query sampling support, which allows
you to specify the percentage of a table that is to be sampled during query
processing, can be a very useful way to relieve some of the pressure on heavy
query workloads. Sampling can also be used to obtain a random subset of rows
for auditing purposes. In general, it is good practice to use the TABLESAMPLE
clause with only one table in a query, typically the fact table in a star
schema. If you specify a TABLESAMPLE clause for more than one table, results
could be difficult to interpret. We recommend that you experiment and learn
more about this feature by observing the effects of applying different sampling
rates to your test queries.
About the Authors
Paul C. Zikopoulos, BA, MBA, is with IBM
Canada Ltd. Paul has written numerous magazine articles and books about DB2.
Paul has co-authored the books: DB2 Version 8: The Official Guide, DB2:
The Complete Reference, DB2 Fundamentals Certification for Dummies, DB2
for Dummies, and A DBA's Guide to Databases on Linux. Paul is a DB2
Certified Advanced Technical Expert (DRDA and Cluster/EEE) and a DB2 Certified
Solutions Expert (Business Intelligence and Database Administration). You can
reach him at: paulz_ibm at msn.com.
Roman B. Melnyk, PhD, is with IBM Canada
Ltd., specializing in database administration, DB2 utilities, and SQL. Roman
has written numerous DB2 books, articles, and other related materials. Roman
co-authored DB2 Version 8: The Official Guide, DB2: The Complete
Reference, DB2 Fundamentals Certification for Dummies, and DB2
for Dummies. You can reach him at roman_b_melnyk at hotmail.com.