In this second article about SQL functions, we will look at 11 SQL-related functions commonly used in statistics: count, sum, average, standard deviation, variance and covariance (standard deviation and variance have three each; covariance has two).

Aside from other reasons (mentioned in the last SQL function
article) about why a DBA may need to be familiar with these tools, here is a
reason that pertains directly to what a DBA does: using "real"
statistics for performance tuning or design purposes. Oracle's array of SQL
functions enables a DBA to compute meaningful statistics about almost any set
of input data. Computing the count, sum and average of some item of interest is
very straightforward, but how do you compute the *variability* of that
data?

As a point of clarification, when someone refers to statistics about a set of data, the type of statistics being discussed is that of descriptive statistics or simple data analysis. The counterpart of descriptive statistics is inferential statistics. Inferential statistics typically deal with a sample from a population, and from that sample, we try to infer or answer questions about the entire population. The answers to questions about the population are couched in terms of probability. For our purposes, simple descriptive statistics will suffice because we have all the data.

When looking through the list of SQL functions in the SQL
Reference documentation, you will see "POP" and "SAMP"
appended to covariance, standard deviation and variance-related functions. If
you are dealing with a set of data consisting of more than 30 to 31 elements,
observations, or readings, you can use either one of the *function-name*_POP
or *function-name*_SAMP functions. The "SAMP" functions use a
population correction factor to provide a "better" or unbiased value
of the true population parameter. In simple terms, the denominator of whichever
value is being computed uses n-1 instead of n, where "n" is the
number of data points, readings, observations, and so on.

A quick numerical example shows the practical equivalence of dividing by n-1 versus dividing by n. A large number divided by n-1 is practically the same (for our purposes, anyway) as dividing by n when n is greater than 30. The value of 34.483 (1000/(30-1)) versus that of 33.333 (1000/30) is around a 3% difference. However, when n is much higher, like 100, then the "error" falls to less than 1%. We will see this lack of a difference in the following examples using the sample schemas that ship with Oracle9i.

Logging in as "SH" in the sales history schema, and using the example shown in the SQL Reference guide, you can see there is no practical difference between the sample standard deviation and the population standard deviation.

SQL> SELECT STDDEV_POP (amount_sold) "Pop", 2 STDDEV_SAMP (amount_sold) "Samp" 3 FROM sales; Pop Samp ---------- ---------- 896.355151 896.355592

How many records are in this table? If you do not recall from the last SQL function article, this table has over a million rows of data.

SQL> SELECT count(amount_sold) count 2 from sales; COUNT ---------- 1016271

As an illustrative example of how to use this particular SQL function, the documentation is entirely correct and accurate. As a practical example of using STDDEV_POP and STDDEV_SAMP, the documentation falls short. After reading this article, or already knowing something about statistics, you now know why the values returned are so similar, differing by slightly more than 4/10,000, which is close enough to zero for all practical purposes.

Users of SQL functions should be grateful for these more powerful, aggregate type of functions. If you were restricted to "simpler" functions, the query to compute these standard deviations would look like what is shown below. Note: the "-" is a minus sign, not a continuation symbol, and if you cut and paste the code, you may need to reformat it for SQL*Plus.

select sqrt((sum(power((amount_sold),2)) - (power(sum(amount_sold),2)/count(amount_sold)))/ count(amount_sold)) "Pop" from sales;

and

select sqrt((sum(power((amount_sold),2)) - (power(sum(amount_sold),2)/count(amount_sold)))/ (count(amount_sold)-1) "Samp" from sales;

Some of the parentheses were used to improve readability, and as is readily apparent, the complexity of the queries is quite a bit more involved than using the much simpler STDDEV variants. And as a bonus, the STDDEV functions are faster (just slightly so) than the more computational looking examples.

The VARIANCE, VAR_POP and VAR_SAMP functions are directly related to their standard deviation counterparts, as standard deviation is simply the square root of the variance. STDDEV and VARIANCE are similar in what they return if there is only one element (both return a zero), and STDDEV_SAMP and VAR_SAMP return the same overall values as STDDEV and VARIANCE, with the exception of returning a null if there is only one element.

COUNT, SUM and AVERAGE are probably the most well known SQL functions and there is nothing special about their use. There is one little trick you can use concerning AVERAGE, and that trick has to do with verifying the output given an input when using a regression line (see the previous article on SQL functions). The data point of (X-bar, Y-bar) is a point on the computed regression line even if the value of X-bar is not one of the original observations or input values (or Y-bar is not an actual observed output).

Using the "C" channel from the sales table (there were 20 data pairs), the computed regression line was Y = 16.627808 - 0.0683687(X). X-bar was 65.495 and Y-bar was 12.15. Plugging in 65.495 in the equation, does, in fact, return a value of 12.15 for Y.

Without getting into details about what covariance is, you should know that you have already seen it if you read the previous article. In the regression line example, there was a computed value of -755.885 for the Sxy term. This value was selected using the REGR_SXY function. If you select the COVAR_POP value from the sales table (using the same "where" clause as before):

SQL> SELECT 2 s.channel_id, 3 REGR_SXY(s.quantity_sold, p.prod_list_price) SXY, 4 COVAR_POP(s.quantity_sold, p.prod_list_price) COVAR_POP 5 FROM sales s, products p 6 WHERE s.prod_id=p.prod_id AND 7 p.prod_category='Men' AND 8 s.time_id=to_DATE('10-OCT-2000') 9 AND s.channel_id = 'C' 10 group by s.channel_id; C SXY COVAR_POP - ---------- ---------- C -755.885 -37.79425

Guess what 20 times -37.79425 is? Answer: -755.885, which just happens to be the REGR_SXY value.

The point of demonstrating how some of these SQL functions are related is not to turn you into a statistics wizard, but rather, to help provide more insight into Oracle's analytic and data manipulation capabilities using statistics-related SQL functions. As mentioned before, Oracle is rich with analytic features (and poor in the interface) and knowing more about them further enhances your skills as a DBA. A business analyst or report writer may know the mathematical formula for computing some statistic, but not know how to write the SQL to get it. Being more informed about SQL functions is, as Martha Stewart would put it, "a good thing."

Bonus question: Knowing that the average is computed as the sum of X divided by N, how else could you write the following:

select sqrt((sum(power((amount_sold),2)) - (power(sum(amount_sold),2)/count(amount_sold)))/ count(amount_sold)) "Pop" from sales;

Answer:

select sqrt((sum(power((amount_sold),2)) - (avg(amount_sold)*sum(amount_sold)))/ count(amount_sold)) "Pop" from sales;