Data Densification, Demystified: Oracle 10g SQL Enhancements, Part 2

Friday Aug 26th 2005 by Jim Czuprynski
Share:

Oracle 10g has extended the already robust features of Structured Query Language (SQL) with a plethora of intriguing new capabilities. This article - the second in this series - reviews the new capabilities Oracle 10g provides that, in concert with analytical functions, permit the creation of "densified" output with a few simple queries.

Synopsis. Oracle 10g has extended the already robust features of Structured Query Language (SQL) with a plethora of intriguing new capabilities. This article - the second in this series - reviews the new capabilities Oracle 10g provides that, in concert with analytical functions, permit the creation of "densified" output with a few simple queries.

The previous article in this series reviewed several new SQL features in Oracle 10g, including improved query and access methods for nested tables, upgrades to the MERGE command, and hierarchical query enhancements. Impressive as these new capabilities are, Oracle 10g's new data modeling features allow any Oracle developer or DBA to produce sophisticated, complex calculations using just the database engine, thus eliminating the need to massage data output via a third-party tool. Many of the new Oracle 10g query features depend upon the analytical functions established in Oracle 9i, so I will first delve into those features before discussing partitioned outer joins and data densification.

Analytical Functions: A Brief Introduction

I have been using SQL since 1985, and in my humble but biased opinion, Oracle's extensions to standard ANSI SQL are superior to those of its competitors. Oracle 9i added a complete suite of analytical functions to provide some impressive reporting and calculation capabilities.

Just about every major system I have worked on has at least one historical table, i.e. a table that stores multiple historical entries to track changes to a particular data domain over time. A good example of an historical table is the JOB_HISTORY table in the HR demo schema, which contains multiple rows for each employee detailing that employee's job assignments over time. A typical reporting requirement usually required against historical tables is the retrieval of either the least or most recent entry- for example, the least recent job assignment for a specific employee based on the employee's start date in that position.

If I were using another DBMS, I would probably have little choice but to construct the following SQL query and utilize a subquery to find the minimum starting date, and then use the result from that subquery to isolate the remainder of the employee's job history information:

SQL> TTITLE 'Earliest Employee Job History Entry (Traditional Method)'
SQL> COL employee_id     FORMAT 9999     HEADING 'EmpID'
SQL> COL full_name       FORMAT A24      HEADING 'Employee Name'
SQL> COL job_id          FORMAT A10      HEADING 'JobID'
SQL> COL start_date      FORMAT A10      HEADING 'Start Date'
SQL> COL end_date        FORMAT A10      HEADING 'End Date'
SQL>
SQL> SELECT
  2      E.employee_id
  3     ,E.last_name || ', ' || E.first_name full_name
  4     ,JH.job_id
  5     ,TO_CHAR(JH.start_date, 'mm-dd-yyyy') start_date
  6     ,TO_CHAR(JH.end_date, 'mm-dd-yyyy') end_date
  7    FROM
  8      hr.employees E
  9     ,hr.job_history JH
 10   WHERE E.employee_id =  JH.employee_id
 11     AND JH.start_date = (
 12      SELECT MIN(start_date)
 13        FROM hr.job_history
 14       WHERE employee_id = 114)
 15     AND JH.employee_id = 114;
            Earliest Employee Job History Entry (Traditional Method)
EmpID Employee Name            JobID      Start Date End Date
----- ------------------------ ---------- ---------- ----------
  114 Raphaely, Den            ST_CLERK   03-24-1998 12-31-1999

As of Oracle 9i, however, I have an alternative to this method: I can use an Oracle analytical function to return just the first entry in the list of job history entries for the specific employee. Briefly, here is how analytical functions perform their tasks:

Execution Order. A SQL statement that uses an analytical function first executes all joins and then processes all WHERE, GROUP BY, and HAVING statements. Oracle then hands off the result set to the analytical function for calculation and processing, and then finally any query-level ORDER BY statements are processed.

Partitioning of the Result Set. If the analytical function's interior query has any GROUP BY statements, Oracle will logically divide the result set into partitions (not to be confused with partitioned tables or indexes) based on the columns specified. A single partition can be as large as all the rows in the result set, or the partitions can be relatively tiny, depending on the chosen columns. An analytical function can operate against aggregated values in each result set partition.

The Current Row and "Windowing." As the analytical function processes each row in the result set partition, that row is marked as the current row. The current row becomes a reference point for any windowing operations. Windowing operations occur whenever an analytical function defines a sliding window to limit the selection of rows within a result set partition. This means that analytical processing can be limited to a range of rows based on a displacement of a specific number of rows from the current row in the result set partition (e.g., plus or minus 3 rows from the current row). Analytical processing can also be based on specified inclusion criteria.

Analytical functions are typically used heavily in decision-support and data warehousing reporting because they are extremely powerful for calculating rankings and percentiles within a result set, as well as calculating moving and cumulative aggregations based on a displacement from a specific row in the result set. It is also possible to calculate lagging and leading values within a partition, or first and last values within a partition. Here is an example of how to return the same results using Oracle 9i's FIRST_VALUE analytical function:

SQL> TTITLE 'Earliest Employee Job History Entry (Analytic Function)'
SQL> COL eid             FORMAT 9999     HEADING 'EmpID'
SQL> COL name            FORMAT A24      HEADING 'Employee Name'
SQL> COL jobid           FORMAT A10      HEADING 'JobID'
SQL> COL begin_date      FORMAT A10      HEADING 'Start Date'
SQL> COL stop_date       FORMAT A10      HEADING 'End Date'
SQL> 
SQL> SELECT
  2       eid
  3      ,NAME
  4      ,jobid
  5      ,begin_date
  6      ,stop_date
  7    FROM (
  8      SELECT
  9          JH.employee_id eid
 10         ,E.last_name || ', ' || E.first_name  NAME
 11         ,JH.job_id jobid
 12         ,JH.START_DATE begin_date
 13         ,JH.end_date stop_date
 14         ,FIRST_VALUE(JH.start_date) OVER (
 15              ORDER BY JH.employee_id, JH.start_date
 16              ROWS BETWEEN UNBOUNDED PRECEDING
                      AND UNBOUNDED FOLLOWING
 17              ) recent_history
 18        FROM
 19          hr.employees E
 20         ,hr.job_history JH
 21       WHERE E.employee_id =  JH.employee_id
 22         AND JH.employee_id IN (114)
 23   )
 24   WHERE begin_date = recent_history
 25   ORDER BY eid;
            Earliest Employee Job History Entry (Analytic Function)
EmpID Employee Name            JobID      Start Date End Date
----- ------------------------ ---------- ---------- ----------
  114 Raphaely, Den            ST_CLERK   03/24/1998 12/31/1999

In this example, note that the FIRST_VALUE analytical function indicates the retrieval of a single result set partition, sorted within Employee ID and Start Date based on its ORDER BY clause. The ROWS directive tells Oracle to use all entries in each result set partition to determine which entry has the first (and therefore earliest) start date in each partition.

Listing 2.1 shows the DML statements I issued to add rows into the original JOB_HISTORY table to demonstrate these queries and the original queries themselves. An expanded query that shows how the PARTITION BY clause could be used within an analytical query to return the first job history entry for each employee is shown in Listing 2.2.

Oracle 10g Enhancements to Analytical Functions

One morning I arrived early at my client site to find a developer running multiple SQL queries, then feverishly transferring the resulting output into a Microsoft Excel spreadsheet. When I asked her the reason for her frantic activity, she told me she needed to perform some complex financial calculations on the results, but did not know how to perform them from within Oracle.

I took an informal poll of the developers I supported, and I found out that many of them were still creating queries to extract raw data from tables and/or views, but then dumping that output to comma-delimited files and then into a spreadsheet so that the raw data could be manipulated via Excel's analytical functions. Some typical requirements these developers were pursuing included these scenarios:

  • A Sales account executive asked for a subset of revenue data to create a series of graphs for her accounts. The graph needed to break out sales within each state in the Midwest and across different products and annual periods. The account executive also specified that any missing values for states or time periods in which no sales took place for a particular product could be filled in with meaningful default values.
  • A user in Marketing needed a report that broke down sales within time periods, product types, and regions, including percentages of the total sales calculated at different control breaks.
  • An Accounting department user needed a report that calculated the eventual value of a company's outstanding cash balance if particular assets were invested at different rates of interest, or over different time periods.

The good news is that Oracle 10g has beefed up analytical function processing so that all these scenarios can be handled by simply running an appropriate query directly against the database. I will cover the first scenario in the following sections, and I will tackle the last two scenarios in my next article when I cover the MODEL clause in depth.

Using Partitioned Outer Joins To "Densify" Data

Oracle 10g offers a new type of join, a partitioned outer join, that not only lets me join disparate sets of data, but also allows me to fill in "gaps" in the result sets because some data might be missing. For example, even though sales of a particular product may exist in most of the states in the Midwest, it is entirely possible that no sales exist in a particular calendar year for that product because of the time period in which it was introduced.

Oracle calls the concept of filling in missing data with partitioned outer joins data densification. To illustrate, I will create a new view, SH.CALENDAR_YEARS, that contains all existing calendar years in the SH.TIMES time periods table and adds in all current calendar years up to and including 2005. I will also create a new view named SH.SALES_MIDWEST_ONLY that will gather only a small subset of sales data from the sales history table, SALES. (I am using a small subset only to delineate the results of data densification; in real life, of course, I would use all the available sales data.)

Next, I will construct a query that uses a partitioned outer join to link together sales data from the SH.SALES_MIDWEST_ONLY view with a larger subset of Middle Western U.S. states to illustrate how data densification can be implemented:

SQL> TTITLE 'Partitioned Outer Join Example #1'
SQL> COL prod            FORMAT A32              HEADING 'Product'
SQL> COL state           FORMAT A4               HEADING 'State'
SQL> COL year            FORMAT 9999             HEADING 'JobID'
SQL> COL tot_sales       FORMAT 999,999,999.99   HEADING 'Total Sales'
SQL>
SQL> SELECT
  2       SLS.prod
  3      ,RGNS.state
  4      ,SLS.year
  5      ,NVL(SLS.sales,0) tot_sales
  6  FROM
  7      (SELECT
  8           MWO.state
  9          ,MWO.prod
 10          ,MWO.year
 11          ,SUM(MWO.sale) sales
 12        FROM
 13            sh.sales_midwest_only MWO
 14       GROUP BY
 15           MWO.state
 16          ,MWO.prod
 17          ,MWO.YEAR
 18      ) SLS
 19      PARTITION BY (SLS.prod)
 20      RIGHT OUTER JOIN
 21      (SELECT
 22          DISTINCT cust_state_province state
 23         FROM sh.customers
 24        WHERE cust_state_province IN ('IA', 'IL','IN','MI','OH','WI')
 25      ) RGNS
 26      ON (RGNS.state = SLS.state)
 27   ORDER BY SLS.prod, SLS.state, SLS.year;
                       Partitioned Outer Join Example #1
Product                          Stat JobID     Total Sales
-------------------------------- ---- ----- ---------------
And 2 Crosscourt Tee Kids        IL    1998          826.00
And 2 Crosscourt Tee Kids        IL    1999        1,187.90
And 2 Crosscourt Tee Kids        IL    2000        2,619.40
And 2 Crosscourt Tee Kids        IN    1998          266.00
And 2 Crosscourt Tee Kids        IN    2000        2,086.00
And 2 Crosscourt Tee Kids        MI    1999           53.20
And 2 Crosscourt Tee Kids        MI    2000        1,209.60
And 2 Crosscourt Tee Kids        WI    1998          686.00
And 2 Crosscourt Tee Kids        WI    1999          504.00
And 2 Crosscourt Tee Kids        WI    2000          378.00
And 2 Crosscourt Tee Kids        IA                     .00
And 2 Crosscourt Tee Kids        OH                     .00
Coin Pocket Twill Cargo Trousers MI    1999          234.00
Coin Pocket Twill Cargo Trousers MI    2000           78.00
Coin Pocket Twill Cargo Trousers WI    1999          722.15
Coin Pocket Twill Cargo Trousers WI    2000          390.00
Coin Pocket Twill Cargo Trousers IA                     .00
Coin Pocket Twill Cargo Trousers IL                     .00
Coin Pocket Twill Cargo Trousers OH                     .00
Coin Pocket Twill Cargo Trousers IN                     .00
Gurfield& Murks Pleated Trousers IL    1998        9,100.00
Gurfield& Murks Pleated Trousers IL    1999       13,825.00
Gurfield& Murks Pleated Trousers IL    2000        2,100.00
Gurfield& Murks Pleated Trousers IN    1999        7,525.00
Gurfield& Murks Pleated Trousers MI    1998       26,040.00
Gurfield& Murks Pleated Trousers MI    1999       12,600.00
Gurfield& Murks Pleated Trousers MI    2000       19,425.00
Gurfield& Murks Pleated Trousers WI    1998          175.00
Gurfield& Murks Pleated Trousers IA                     .00
Gurfield& Murks Pleated Trousers OH                     .00
Kahala Pleated Chino Short       IL    1998          504.00
Kahala Pleated Chino Short       IL    1999        3,738.00
Kahala Pleated Chino Short       IA                     .00
Kahala Pleated Chino Short       IN                     .00
Kahala Pleated Chino Short       MI                     .00
Kahala Pleated Chino Short       OH                     .00
Kahala Pleated Chino Short       WI                     .00
37 rows selected.

Listing 2.3 shows the creation of both of these views, the initial query and this complete result set.

N-Dimensional Data Densification

The previous query does produce zeroed totals for those remaining states (i.e. Ohio and Iowa) which are present in the second subquery but not in the first subquery. That is better than nothing, but unfortunately, data is only present for the years 1999, 2000 and 2001 (the only time periods present in the SALES table's data). However, there is nothing to stop me from employing multiple partitioned outer joins to populate more than one data dimension.

Here is an example of how to accomplish this via one of my favorite, elegant SQL features that Oracle provides -- the WITH clause -- to specify the three component subqueries before invoking the multiple partitioned outer joins. The resulting output of this query will contain a fully "densified" result set:

SQL> TTITLE 'Partitioned Outer Join Densifying Across Two Dimensions'
SQL> COL prod            FORMAT A32              HEADING 'Product'
SQL> COL state           FORMAT A4               HEADING 'State'
SQL> COL year            FORMAT 9999             HEADING 'JobID'
SQL> COL tot_sales       FORMAT 999,999,999.99   HEADING 'Total Sales'
SQL> 
SQL> WITH
  2      -- Gather all Midwest-only sales data
  3      Q1 AS (
  4          SELECT
  5               prod
  6              ,state
  7              ,year
  8              ,SUM(sale) sale
  9            FROM
 10               sh.sales_midwest_only
 11           GROUP BY state, prod, year
 12      ),
 13      -- Gather a list of distinct states
 14      Q2 AS (
 15          SELECT DISTINCT cust_state_province state
 16         FROM sh.customers
 17        WHERE cust_state_province IN ('IA', 'IL','IN','MI','OH','WI')
 18      ),
 19      -- Gather a list of distinct calendar years
 20      Q3 AS (
 21          SELECT calendar_year year
 22            FROM sh.calendar_years
 23      )
 24  SELECT
 25       Q4.prod
 26      ,Q4.state
 27      ,Q3.year
 28      ,NVL(Q4.sale, 0) tot_sales
 29    FROM
 30      (SELECT
 31           Q1.prod
 32          ,Q2.state
 33          ,Q1.year
 34          ,Q1.sale
 35        FROM Q1
 36          PARTITION BY (prod)
 37          RIGHT OUTER JOIN Q2
 38          ON (Q1.state = Q2.state)
 39      ) Q4
 40          PARTITION BY (prod,state)
 41          RIGHT OUTER JOIN Q3
 42          ON (Q4.YEAR = Q3.year)
 43   ORDER BY 1, 2, 3;

A fully documented version of this query as well as its complete result set is shown in Listing 2.4.

Conclusion

The new analytical function enhancements in Oracle 10g give developers and DBAs significantly more horsepower for advanced data modeling -- without the use of third-party software to perform these analyses. These new features make Structured Query Language within Oracle 10g even more robust and flexible. The next article in this series will concentrate on the new MODEL clause, an extremely powerful enhancement that Oracle 10g provides that significantly extends Oracle's robust analytical reporting capabilities.

References and Additional Reading

While there is no substitute for direct experience, reading the manual is not a bad idea, either. I have drawn upon the following Oracle 10g documentation for the deeper technical details of this article:

B10736-01 Oracle Database Data Warehousing Guide, Ch. 21

B10750-01 Oracle Database New Features Guide

B10759-01 SQL Reference

» See All Articles by Columnist Jim Czuprynski

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved