Using Oracle's SQL Functions

Thursday Feb 26th 2004 by Steve Callan
Share:

Oracle provides quite an array of functions when it comes to manipulating data via SQL. Of particular interest for this new series, are the functions related to numbers. Learn how this relates to your job as a DBA.

Oracle provides quite an array of functions when it comes to manipulating data via SQL. The Oracle9i SQL Reference guide (for release 2) lists five categories of SQL functions, with each category containing one or more functions within a category. The major functions are single-row, aggregate, analytic, object reference and user defined.

The single-row function category contains the following functions: number, character, datetime, conversion and miscellaneous single row. What are single-row functions? The definition shown in the SQL Reference guide states that single-row "functions return a single result row for every row of a queried table or view. These functions can appear in select lists, WHERE clauses, START WITH and CONNECT BY clauses, and HAVING clauses." Aggregate functions are also quite powerful. These functions "return a single result row based on groups of rows, rather than on single rows." In general, single-row and aggregate functions complement one another.

Of particular interest for this new series are the functions related to numbers. How does this relate to your job as a DBA? Answer: in several ways. First, you may have to support a Decision Support group. Decision support groups frequently use analytic functions and statistical methods to support a business decision. It helps to understand (or at least recognize the name) some of the analytic tools being used. Second, you may be working in a data warehouse type of environment where YOU are the decision support guru. Granted, your job may be to simply massage or extract the data for others to analyze, but you are the SQL expert by virtue of being the DBA. If you are asked to find the regression line for quantity produced versus sales, you will make a better impression by not having that deer in the headlights look come across your face.

A third reason has to do to adding value to your company by saving your company some money. How does that work? Well, companies that need to analyze data often purchase sophisticated tools such as SPSS and SAS. These products are not cheap. What frequently happens is that users of these high-end statistical packages wind up using very few features. Many of these features are the same ones found in Excel. Already you are thinking about how to extract data into a comma-separated value flat file for use in Excel. However, what if you didn't need to do that "export" and could do the same exact thing within Oracle itself? You have already paid for Oracle, and likely have Excel on every PC, but have the overhead of extracting the data and then having to do whatever in Excel. How about skipping the Excel step and performing the data analysis within Oracle? Note: by "Oracle," I mean the RDBMS product and not anything having to do with Oracle Apps. We are talking about using the database as a high-end calculator, so to speak.

One thing that goes hand-in-hand with analysis is some type of pictorial representation of what is being analyzed. Although Oracle does provide other tools for accomplishing this, that falls more into the Forms & Reports developer realm, so we'll say that is a possible solution and leave it at that. So even if there is a heavy chart/graph/histogram/whatever requirement to be met by using Excel (or something else), you can still use Oracle as a backup to check the work performed by someone else using a different tool.

Let's look at the linear regression aggregate function as an example of what Oracle can do as an analysis engine. If you hate math, the next few paragraphs may be painful, but you'll be able to follow along without getting buried in the theorems and formulas.

What is linear regression? Let's say you have an input, like list price of some product your company produces. The lower the price, the more your company sells of that product. Conversely, the higher the price, the fewer your company sells. Over time, you plot selling price versus quantity sold. When you look at the dots or plots on a piece of graph paper, perhaps the arrangement of the dots tends to suggest drawing a straight line, which generally comes close to connecting most of the dots. If the dots are pretty close to the line, then you may have a strong relationship between the X and Y (selling price and quantity produced). In fact, the line may be such a good fit that you can come close to predicting the quantity sold if given the selling price, and vice versa.

On the other hand, the points, when plotted, may look like a hazy cloud on the graph paper - there is no distinct line or trend between the X and Y values. Because there is no line which "fits" or connects the points, there is a weak relationship between X and Y (a list price of 52 may be just as likely to result in a quantity sold of 12, 17, or 18 - there is no or very little predictive power knowing one thing or the other).

The general equation of a line is y = mx + b. In English, that means "Y" is equal to the slope ("m") times "X" plus a constant "b." That constant is actually called the y-intercept (if the input of x = 0, then the output of y = b, or that is where the line crosses the Y axis). The y-intercept may or may not have any real meaning. The slope is important because it shows the rate of change between X and Y, and the sign of the slope reflects the direction of the relationship (a positive slope shows that as X increases, so does Y; a negative slope shows that as X increases, Y decreases).

So, getting back to what Oracle can do for you, with respect to linear regression: quite a bit. The linear regression function, REGR_"component," outputs nine items of interest typically used when evaluating data. A component is selected by specifying the related REGR phrase and the two inputs - expr1 and expr2 - or the independent and dependent variables. I could have said the "X" and the "Y," and I will later, but not for now.

To look at the example Oracle provides, you will need to install the sample schema named SH (for sales history). The sample schemas are referenced in the Oracle9i Sample Schemas guide (http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96539/toc.htm). If you installed the sample schemas during the Oracle9i installation, you will also need to unlock the SH user account (and give it a password you can remember, like "SH"). If you have never seen a million-row table, this is your opportunity. As the user SH, doing a "select count(*) from sales" shows just over a million rows (1016271, to be exact).

The example shown in the SQL Reference guide has five channel_id's, but we will concentrate on the first one (identified as "C"). Here is the select statement and its output as shown in a SQL*Plus session as the user named SH:

SQL> SELECT
  2  s.channel_id,
  3  REGR_SLOPE(s.quantity_sold, p.prod_list_price)  SLOPE ,
  4  REGR_INTERCEPT(s.quantity_sold, p.prod_list_price)  INTCPT ,
  5  REGR_R2(s.quantity_sold, p.prod_list_price)  RSQR ,
  6  REGR_COUNT(s.quantity_sold, p.prod_list_price)  COUNT ,
  7  REGR_AVGX(s.quantity_sold, p.prod_list_price)  AVGLISTP ,
  8  REGR_AVGY(s.quantity_sold, p.prod_list_price)  AVGQSOLD
  9  FROM  sales s, products p
 10  WHERE s.prod_id=p.prod_id AND
 11  p.prod_category='Men'  AND
 12  s.time_id=to_DATE('10-OCT-2000')
 13  GROUP BY s.channel_id;

C      SLOPE     INTCPT       RSQR  COUNT   AVGLISTP   AVGQSOLD
- ---------- ---------- ---------- ------ ---------- ----------
C  -.0683687  16.627808 .051342581     20     65.495      12.15
I .019710295 14.8113924 .001631488     46 51.4804348  15.826087
P -.01247359  12.854546 .017039788     30      81.87 11.8333333
S .006155886 13.9919243 .000898438     83  69.813253 14.4216867
T -.00411314 5.22717214 .008132242     27 82.2444444 4.88888889

Let's add another AND to the WHERE clause to get just the first row:

SQL> SELECT
  2  s.channel_id,
  3  REGR_SLOPE(s.quantity_sold, p.prod_list_price)  SLOPE ,
  4  REGR_INTERCEPT(s.quantity_sold, p.prod_list_price)  INTCPT ,
  5  REGR_R2(s.quantity_sold, p.prod_list_price)  RSQR ,
  6  REGR_COUNT(s.quantity_sold, p.prod_list_price)  COUNT ,
  7  REGR_AVGX(s.quantity_sold, p.prod_list_price)  AVGLISTP ,
  8  REGR_AVGY(s.quantity_sold, p.prod_list_price)  AVGQSOLD
  9  FROM  sales s, products p
 10  WHERE s.prod_id=p.prod_id AND
 11  p.prod_category='Men'  AND
 12  s.time_id=to_DATE('10-OCT-2000')
 13  AND s.channel_id = 'C'
 14  group by s.channel_id;

C      SLOPE     INTCPT       RSQR  COUNT   AVGLISTP   AVGQSOLD
- ---------- ---------- ---------- ------ ---------- ----------
C  -.0683687  16.627808 .051342581     20     65.495      12.15

Overall, there were 206 rows involved, but we are only interested in the 20 rows for "C."



Let's extract the data used to compute the linear regression output:



SQL> select s.quantity_sold, p.prod_list_price
  2  FROM  sales s, products p
  3  WHERE s.prod_id=p.prod_id AND
  4  p.prod_category='Men'  AND
  5  s.time_id=to_DATE('10-OCT-2000')
  6  AND s.channel_id = 'C';


QUANTITY_SOLD PROD_LIST_PRICE
------------- ---------------
           17            53.9
           17            49.9
           22            89.5
            7              45
            3              78
           16            42.5
           12            52.9
            7            69.9
           17              68
            6             135
            1              54
           16            42.5
           25            69.9
           18              54
            1              54
            6             115
            5            59.5
           13            69.9
           22            42.5
           12              64

20 rows selected.


This looks much easier to deal with. In fact, any decent statistics book will show you step-by-step how to determine the linear regression line. You can even take this data and plug it into Excel, which, conveniently, is shown below.



Observation

Product list price

(X)

Quantity sold

(Y)

X-squared

Y-squared

XY

1

53.90

17

2905.21

289

916.30

2

49.90

17

2490.01

289

848.30

3

89.50

22

8010.25

484

1969.00

4

45.00

7

2025.00

49

315.00

5

78.00

3

6084.00

9

234.00

6

42.50

16

1806.25

256

680.00

7

52.90

12

2798.41

144

634.80

8

69.90

7

4886.01

49

489.30

9

68.00

17

4624.00

289

1156.00

10

135.00

6

18225.00

36

810.00

11

54.00

1

2916.00

1

54.00

12

42.50

16

1806.25

256

680.00

13

69.90

25

4886.01

625

1747.50

14

54.00

18

2916.00

324

972.00

15

54.00

1

2916.00

1

54.00

16

115.00

6

13225.00

36

690.00

17

59.50

5

3540.25

25

297.50

18

69.90

13

4886.01

169

908.70

19

42.50

22

1806.25

484

935.00

20

64.00

12

4096.00

144

768.00

SUM

1309.90

243

96847.91

3959

15159.40

AVG

64.495

12.15

The values shown in this table were computed using the formulas listed below, but they could have been calculated using the Analysis Toolpak in Excel.

Component

Calculation

n, number of observations

Count

Sxx, sum of squares related to X

Sxx = ∑x2 - (∑x)2/n

Syy, sum of squares related to Y

Syy = ∑y2 - (∑y)2/n

Sxy, related to interaction between X and Y

Sxy = ∑xy - (∑x)( ∑y) / n

b, the slope

b = Sxy/Sxx

a, the y-intercept

a = y-bar - b(x-bar) ("bar" means average)

Linear regression line

Y = a + bX

R-squared, goodness of fit

R-squared = (Sxy)2/(Sxx)(Syy)

Here is the output for "C" where all the SYY, SXY, and SXX values are shown:

SELECT
s.channel_id,
REGR_SYY(s.quantity_sold, p.prod_list_price)  SYY,
REGR_SXY(s.quantity_sold, p.prod_list_price)  SXY,
REGR_SXX(s.quantity_sold, p.prod_list_price)  SXX
FROM  sales s, products p
WHERE s.prod_id=p.prod_id AND
p.prod_category='Men'  AND
s.time_id=to_DATE('10-OCT-2000')
AND s.channel_id = 'C'
GROUP BY s.channel_id
;

C        SYY        SXY        SXX
- ---------- ---------- ----------
C    1006.55   -755.885 11056.0095

At this point, we have everything we need to complete the linear regression problem. You can read what the equation of the linear regression line is by examining the first or second SQL*Plus outputs shown earlier: Y = 16.627808 - 0.0683687(X).

Hidden in the documentation is the meaning of "expr1" and "expr2." In mathematical terms, one of these is the "X" and one of these is the "Y." The order shown in the "REGR_" line, using "(s.quantity_sold, p.prod_list_price)" really means (Y, X) pairings, not (X, Y) as you would probably expect. We usually expect to see the pairings as (X, Y), domain/range, independent/dependent variables or predictor/response. In regression terminology, I guess Oracle decided to go with the "regression of Y on X," so remember to order the expressions in EXPR_ as Y followed by X.

A picture is worth a thousand words (note to my editor: don't worry, I'm almost done), so here is a picture of the "C" data. The picture shows the scatter plot (the actual X/list price versus the Y/quantity sold), and what the linear regression predictive model shows (the actual X/list price versus what the model would have predicted for Y/quantity sold).

The widely scattered diamond-shaped (blue) points fall into the hazy cloud type of grouping mentioned earlier. The square points (magenta or purple-ish) along a line represent what the regression model would predict for a given list price. As you can readily see, the scatter plot is nowhere near the regression line, but without the picture, we already knew that because of the very low R-squared value (roughly speaking, the goodness of fit; the higher the value, the better the fit). In fact, none of the channel_id's used in the SQL Reference guide example have a good fit (the "best" being the one for "C," but even then, it is very low).

In summary, two things come to mind. First, Oracle can be used to perform data analysis that other tools, expensive or otherwise, can perform. Like any other tool, you have to become familiar with the format of the input and output, so there is some overhead on learning that, regardless of the tool.

Second and perhaps most importantly, in my opinion, is the impression that Oracle has made a colossal blunder by not making these tools more user friendly or surfaced in a GUI interface in their database product line. The Business Intelligence module in Oracle Apps references SQL Analytics. Good job of making that feature pretty obscure and offering practically next to nothing about what is included with it. The number crunching capability in Oracle is already there (for the most part), so why not go the little bit extra and make Oracle competitive with SAS, SPSS, Minitab, or whomever? It would be infinitely harder for SAS to become a competitive relational database management system, but it would be a drop in the bucket for Oracle to become a competitive data analysis tool. If done, the word "STATSPACK" could take on a whole new meaning.

» See All Articles by Columnist Steve Callan

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved