Oracle’s Histograms: Bane or Boon?

Histograms, in Oracle or most any other relational database, may be one of the most misunderstood and as a result, underused features. Tales abound of performance problems that histograms create by changing once-good plans to terribly inefficient and wasteful resource hogs. Honestly when the histogram data accurately reflects the data distribution nothing could be further from the truth. Let’s explore histograms and what they can do for a query.

By default, the Oracle optimizer believes, rightly or wrongly, that your data is evenly distributed across keys. There are occasions when such assumptions are accurate, yet there are also occasions when those assumptions aren’t and I expect that the majority of the cases fall into that second category. Using a simple example let’s show what can happen with and without histograms. In this simple case, a modern feature of Oracle, dynamic sampling, has been turned off to provide a more ‘level’ playing field and to generate plans older versions of Oracle generated for this same example (the example has been adapted from work done with Oracle 9iR2 some years ago; turning off dynamic sampling brings us back, in a limited form, to the behavior of that optimizer version). Let’s ‘dig in’ to how the optimizer looks at your data and how that can be modified using histograms.

As stated earlier the Oracle optimizer bases what it does upon the assumption of your data having a perfectly even distribution across all of the desired keys. Since that is not likely to be an accurate description of the data distribution, the optimizer will get this wrong and may choose a full table scan when an index would be appropriate, or choose an index scan when that actually creates more work for the database. Histograms address this by providing a more accurate image of how the data is distributed. Data ranges are grouped into units called ‘buckets’, and how many of those get generated depends on the data and the type of histogram created. By default, DBMS_STATS computes single-bucket histograms with the lowest endpoint value being the minimum value in the column and the highest endpoint value being the column maximum, so that the rest of the data falls between these two values. Imagine having 10,000 marbles, 9 being unique in color and 9,991 being blue. Now put all of those marbles into one bucket and see how easy it is to find, say, a pink marble. It’s a daunting task. With skewed data such as this a ‘better’ histogram is necessary. The beauty of this example is that even though we have 10,000 marbles we need only 10 ‘buckets’ to accurately describe the data distribution.

Moving our marbles to the database let’s create a table, TEST_TAB, with the following extremely descriptive column names:

Table TEST_TAB
Name                            Null? Type
------------------------------- ----- ---------
A                                     NUMBER(6)
B                                     NUMBER(6)

Now we insert into the table our skewed data set; Column A will contain distinct values from 1 to 10000, while Column B will contain 10 distinct values: 1, 2, 3, 4, 5, 9996, 9997, 9998, 9999 and 10000. The value 5 (our ‘blue’ marble) will occur 9991 times in the data; all other values will occur only once.

The following queries will be run against this data:

(1) select * from test_tab where b=5;

(2) select * from test_tab where b=3;

Let’s see what plans the optimizer has for us:


SQL>
SQL> set autotrace on linesize 132
SQL>
SQL> select * from test_tab where b=5;

         A          B
---------- ----------
      2430          5
      2431          5
      2432          5
      2433          5
      2434          5
      2435          5
      2436          5
      2437          5
      2438          5
      2439          5
      2440          5
...
      9993          5
      9994          5
      9995          5

9991 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3962208483

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |     1 |    26 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_TAB |     1 |    26 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("B"=5)


Statistics
----------------------------------------------------------
          6  recursive calls
          0  db block gets
        694  consistent gets
          0  physical reads
          0  redo size
     212182  bytes sent via SQL*Net to client
       7849  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       9991  rows processed

SQL>
SQL> select * from test_tab where b=3;

         A          B
---------- ----------
         3          3


Execution Plan
----------------------------------------------------------
Plan hash value: 3962208483

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |     1 |    26 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_TAB |     1 |    26 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("B"=3)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         24  consistent gets
          0  physical reads
          0  redo size
        585  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> set autotrace off

Those results should be no surprise as no indexes have been created so the optimizer has no choice but to scan the table. Let’s change that by creating an index on column B:


SQL>
SQL> create index test_tab_idx
  2  on test_tab(b)
  3  tablespace indx;

Index created.

SQL>

Checking USER_TAB_HISTOGRAMS we see there are no histograms on TEST_TAB, but there are also no statistics on that table, either:


SQL>
SQL> select table_name, column_name, endpoint_number, endpoint_value
  2  from user_tab_histograms
  3  where table_name = 'TEST_TAB';

no rows selected

SQL>

We have an index and no histograms, let’s see what the optimizer does:


SQL>
SQL>
SQL> set autotrace on linesize 132
SQL>
SQL> select * from test_tab where b=5;

         A          B
---------- ----------
      2430          5
      2431          5
      2432          5
      2433          5
      2434          5
      2435          5
      2436          5
      2437          5
      2438          5
      2439          5
      2440          5
...
      9993          5
      9994          5
      9995          5

9991 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1487393153

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |    26 |     1   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TEST_TAB     |     1 |    26 |     1   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | TEST_TAB_IDX |     1 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("B"=5)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1371  consistent gets
         20  physical reads
          0  redo size
     244152  bytes sent via SQL*Net to client
       7849  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       9991  rows processed

SQL>
SQL> select * from test_tab where b=3;

         A          B
---------- ----------
         3          3


Execution Plan
----------------------------------------------------------
Plan hash value: 1487393153

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |    26 |     1   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TEST_TAB     |     1 |    26 |     1   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | TEST_TAB_IDX |     1 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("B"=3)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        589  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> set autotrace off

So now both tables are using the index, and that’s not necessarily good as 99+% of the table data is associated with the value 5 in column B. Let’s finally generate some statistics on TEST_TAB and create a frequency histogram while we’re at it:


SQL>
SQL>
SQL> exec dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'TEST_TAB', method_opt => 'for all indexed columns size auto', cascade=>TRUE);

PL/SQL procedure successfully completed.

SQL>
SQL> select table_name, column_name, endpoint_number, endpoint_value
  2  from user_tab_histograms
  3  where table_name = 'TEST_TAB';

TABLE_NAME                     COLUMN_NAME                         ENDPOINT_NUMBER ENDPOINT_VALUE
------------------------------ ----------------------------------- --------------- --------------
TEST_TAB                       B                                                 1              1
TEST_TAB                       B                                                 2              2
TEST_TAB                       B                                                 3              3
TEST_TAB                       B                                                 4              4
TEST_TAB                       B                                              9995              5
TEST_TAB                       B                                              9996           9996
TEST_TAB                       B                                              9997           9997
TEST_TAB                       B                                              9998           9998
TEST_TAB                       B                                              9999           9999
TEST_TAB                       B                                             10000          10000

10 rows selected.

SQL>

Having a limited number of distinct values in the data set allows us to generate such a histogram and it’s very beneficial to the optimizer as it describes the data far more accurately than the blind assumptions the optimizer made at the outset. Let’s run the queries again and see how Oracle performs:


SQL>
SQL>
SQL> set autotrace on
SQL>
SQL> select * from test_tab where b=5;

         A          B
---------- ----------
      2430          5
      2431          5
      2432          5
      2433          5
      2434          5
      2435          5
      2436          5
      2437          5
      2438          5
      2439          5
      2440          5
...
      9993          5
      9994          5
      9995          5

9991 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3962208483

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |  9991 | 69937 |     7   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_TAB |  9991 | 69937 |     7   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("B"=5)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        689  consistent gets
          0  physical reads
          0  redo size
     212182  bytes sent via SQL*Net to client
       7849  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       9991  rows processed

SQL>
SQL> select * from test_tab where b=3;

         A          B
---------- ----------
         3          3


Execution Plan
----------------------------------------------------------
Plan hash value: 1487393153

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |     7 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TEST_TAB     |     1 |     7 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | TEST_TAB_IDX |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("B"=3)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        589  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> set autotrace off
SQL>

Remember that this example was ‘tweaked’ to stop dynamic sampling by Oracle; such a feature can provide accurate plans without histograms for simple data (when dynamic sampling was enabled the “b=5” query used a full table scan in the absence of generated statistics and histograms, but also remember that this is fairly simple data and modern production systems generate and query far more complex data sets; dynamic sampling will likely not be enough to generate ‘good’ execution plans consistently).

Because histograms can be a very important additional step when generating table statistics Oracle has expanded the types available to include, in 12.1.0.2 and later releases, the hybrid histogram. Since I’ve already written about that construct here I won’t repeat myself. What this means is that histograms aren’t ‘tools of the Evil Empire designed to ensnare your data’; when they are understood, and have representative statistics to back them up they can be very beneficial.

See all articles by David Fitzjarrell

David Fitzjarrell
David Fitzjarrell
David Fitzjarrell has more than 20 years of administration experience with various releases of the Oracle DBMS. He has installed the Oracle software on many platforms, including UNIX, Windows and Linux, and monitored and tuned performance in those environments. He is knowledgeable in the traditional tools for performance tuning – the Oracle Wait Interface, Statspack, event 10046 and 10053 traces, tkprof, explain plan and autotrace – and has used these to great advantage at the U.S. Postal Service, American Airlines/SABRE, ConocoPhilips and SiriusXM Radio, among others, to increase throughput and improve the quality of the production system. He has also set up scripts to regularly monitor available space and set thresholds to notify DBAs of impending space shortages before they affect the production environment. These scripts generate data which can also used to trend database growth over time, aiding in capacity planning. He has used RMAN, Streams, RAC and Data Guard in Oracle installations to ensure full recoverability and failover capabilities as well as high availability, and has configured a 'cascading' set of DR databases using the primary DR databases as the source, managing the archivelog transfers manually and montoring, through scripts, the health of these secondary DR databases. He has also used ASM, ASMM and ASSM to improve performance and manage storage and shared memory.

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles