Oracle Optimizer: Moving to and working with CBO - Part 5

Monday Nov 24th 2003 by Amar Kumar Padhi
Share:

This month in Moving to and working with CBO, Amar Kumar Padhi examines the data skewness problem and the use of the Monitoring option for updating statistics on the fly.

We covered the various methods of generating statistics in the last installment; moving forward, let's look at the data skewness problem and the use of the Monitoring option for updating statistics on the fly.

10. Histograms

Histogram tells the optimizer how the data is distributed for a column. This information is used in determining the selectivity of the column for a given query and arriving at an optimal execution plan.

Column statistics in the form of histograms are appropriate for columns whose data distribution deviates from the expected uniform distribution. For uniformly distributed data, the optimizer can do proper costing for executing a particular statement. When data is not uniformly distributed, also know as highly skewed data distribution, the optimizer may not be in a position to accurately estimate the selectivity of a query. This option provides statistics at a very low level and its use would be rare, though it could prove very beneficial in certain scenarios.

Columns not eligible for histogram

Please note that histograms should not be used when any of the following is true.

  1. The column data is uniformly distributed.

    For example, if we have a column in a table that holds around 100 distinct values. If the number of records that each value holds (or a range of values) is somewhat similar, the data is more or less uniformly distributed.

    For instance, the table may have 100000 records, 20% of these are between values 1-15, 15% are between 16-30, 25% are between 31-50 and so on. If we design a data distribution graph (histogram chart) on these figures, the height of each value or range will be more or less balanced.

  2. The column is not at all used in query predicates.

    There is no need for providing histogram statistics on columns that are not being used in query conditions. Histogram statistics are stored in the dictionary and do take up space and analysis time of the optimizer.

  3. All query predicates or criteria for the column use bind variables!

    Yes that's right, here Oracle requires that hard coded values be provided for use of the histogram statistics. Bind variables will not make use of the same.

  4. The column is unique and used only with equality predicates.

Columns eligible for histogram

If the above rules are not satisfying for a column, it could be considered for distribution statistics. For example, if we again consider a column that holds around 100 distinct statuses in a 100000 records table, 80% of the values lie between 15-30 and the remaining 20% is distributed in the other ranges. If we design a data distribution graph (histogram chart) on these figures, the height of few values or ranges will be very high where as others will be very low. This shows an oblique line for the data distribution.

A histogram is good for number columns. For character columns only the first 32 bytes (as of 8.1.7.4) of the string are used for building the histograms, this may sometimes result in incorrect information being created if the size of the column data exceeds this limitation.

User-specific histogram values can also be stored in the dictionary using the DBMS_STATS.PREPARE_COLUMN_VALUES and DBMS_STATS.SET_COLUMN_VALUES routines.

Dictionary tables

Histogram information is stored in the following dictionary tables.

Histogram values for columns in tables:

  DBA_TAB_HISTOGRAMS

  • endpoint_number - End point number
  • endpoint_value - Normalized end point value for the buckets.
  • endpoint_actual_value - Actual data value, only shows non-numeric value for the column.

For partition table histograms values:

  DBA_PART_HISTOGRAMS
  DBA_SUBPART_HISTOGRAMS

For evaluating histograms on indexed columns:

  INDEX_HISTOGRAM

  • repeat_count - Number of times one or more index key is repeated in the table.
  • keys_with_repeat_count - Number of index keys that are repeated.

Other Views that give similar data:

  DBA_TAB_COL_STATISTICS
  DBA_PART_COL_STATISTICS   DBA_SUPPART_COL_STATISTICS

Columns in the above tables are self-explanatory.

Buckets in Histograms

Histogram statistics are stored in the form of buckets. Buckets represent the partitioning of data values, depending on the range. By default, 75 buckets are created. A maximum of 254 buckets can be specified for a column. How many buckets are required for a column will depend on the occurrences of distinct values. The default number of buckets is appropriate, but you will have to experiment with various bucket sizes to find the most suitable size.

If the number of distinct column values is less than the number of buckets specified, the individual column values and the count of these values is directly stored as histogram statistics. If the number of distinct column values is more than the buckets specified, Oracle uses an algorithm to store these values in ranges. If a series of continuous ranges have the same value, they may not be shown in the histogram table to save on space.

You may find columns with one-bucket histograms, these are as good as no histogram statistics and the optimizer ignores them.

Generating Histograms

These statistics could be generated using the following commands:

DBMS_STATS.GATHER_TABLE_STATS with METHOD_OPT => 'FOR COLUMNS <colname> SIZE xxx'

or

ANALYZE command for column statistics with SIZE option.

The SIZE option in the above command specifies the number of buckets to be stored

e.g.: Gather histogram statistics for AM20.ENAME. Since there are 17 distinct values in the column, I can choose to have about 1 to 17 buckets in the histogram. As the frequency of each value is less, an approximate of 5 buckets is chosen initially.

SQL> exec dbms_stats.gather_table_stats(ownname => 'SYS', tabname => 'AM20',-
> method_opt => 'FOR COLUMNS ENAME SIZE 5');

PL/SQL procedure successfully completed.


SQL> SELECT table_name, column_name,num_distinct,num_buckets,last_analyzed,
  2         avg_col_len
  3  from   dba_tab_col_statistics
  4  where  table_name = 'AM20';

TABLE_NAME COLUMN_NAM NUM_DISTINCT NUM_BUCKETS LAST_ANAL AVG_COL_LEN
---------- ---------- ------------ ----------- --------- -----------
AM20       ENAME                17           5 17-OCT-03           6


SQL> select owner, table_name, column_name,
  2         to_char(endpoint_number, '99999.99') EP_NO,
  3         trunc(endpoint_value) EP_VAL,
  4         endpoint_actual_value EP_A_VAL
  5  from   dba_tab_histograms
  6  where  table_name = 'AM20'
  7  and    column_name = 'ENAME'
  8  order by endpoint_number;

OWNER      TABLE_NAME COLUMN_NAM EP_NO         EP_VAL EP_A_VAL
---------- ---------- ---------- --------- ---------- -----------
SYS        AM20       ENAME            .00 3.3888E+35 ADAMS
SYS        AM20       ENAME           1.00 3.6507E+35 FORD
SYS        AM20       ENAME           2.00 4.0632E+35 NARAYAN
SYS        AM20       ENAME           3.00 4.3269E+35 SUNIL
SYS        AM20       ENAME           4.00 4.3788E+35 TURNER
SYS        AM20       ENAME           5.00 4.5305E+35 WARD

6 rows selected.

Again, if you are maintaining histograms for a frequently changing column, it is recommended to update the histograms more frequently to provide accurate statistics to the optimizer.

Histogram on indexed columns

Oracle provides a method of analyzing indexed columns that need to be provided with histogram statistics. This can be done by using the INDEX_HISTOGRAM view.

e.g.: Column ename is provided with an index in table AM20. This will be used to analyze the density of the values. As can be seen below, the INDEX_HISTOGRAM view shows the occurrences of the distinct values. This information can be analyzed to find out if the data is skewed.

SQL> create index am20_ind1 on am20(ename);

Index created.

SQL> analyze index am20_ind1 validate structure;

Index analyzed.

SQL> select * from index_histogram;

        REPEAT_COUNT KEYS_WITH_REPEAT_COUNT
-------------------- ----------------------
                   0                      0
                   1                     12
                   2                      0
                   3                      0
                   4                      1
                   5                      1
                   6                      0
                   7                      0
                   8                      1
                   9                      2
                  10                      0
                  11                      0
                  12                      0
                  13                      0
                  14                      0
                  15                      0

16 rows selected.

Histogram use

The following is a simple example of how the histogram is actually used by the optimizer to pick up an index based on the selectivity of the column present in the predicate.

e.g.: There are 600 records in the table AM20. ENAME data 'SUNIL' has 512 records and 'TURNER' has 9 records. Without the histogram statistics, the index is used to retrieve values for both the queries as shown below.

SQL> select * from am20 where ename = 'TURNER';
...
9 rows selected.

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=1 Card=1 Bytes=11)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (BY INDEX ROWID) OF 'AM20' (Cost=1 Card=6 Bytes=66)
   3    2       INDEX (RANGE SCAN) OF 'AM20_N1' (NON-UNIQUE) (Cost=1 Card=6)


SQL> select * from am20 where ename = 'SUNIL';
...
512 rows selected.

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=1 Card=1 Bytes=11)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (BY INDEX ROWID) OF 'AM20' (Cost=1 Card=6 Bytes=66)
   3    2       INDEX (RANGE SCAN) OF 'AM20_N1' (NON-UNIQUE) (Cost=1 Card=6)

The use of index in retrieving 9 rows is fast, but to retrieve 512 records (85% of the table) with an index is overkill. The optimizer is not equipped with the data distribution statistics to make this decision. Histogram data is provided and the query is re-run as shown below, with the right execution plan this time.

SQL> exec dbms_stats.gather_table_stats(ownname => 'SYS', tabname => 'AM20',-
> method_opt => 'FOR COLUMNS ENAME SIZE 75');

PL/SQL procedure successfully completed.


SQL> select * from am20 where ename = 'TURNER';
...
512 rows selected.

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=1 Card=1 Bytes=10)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'AM20' (Cost=1 Card=9 Bytes=90)



11. DML Monitoring



This option is used to automate the updating of statistics as tables are updated. When this option is enabled for a table, Oracle monitors the DML changes (including truncates) being done on the table and maintains the details in the SGA. Every three hours (or after a shutdown), the SMON process incorporates the information collected in the SGA into the data dictionary.



In Oracle 9i, this time is reduced to 15 minutes. Optionally, the routine DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO could be used.

The collected information can be viewed from the DBA_TAB_MODIFICATIONS table. This will contain statistics for stale tables that had more than 10% changes in the rows. Gathering statistics using DBMS_STATS with the GATHER STALE option will then update statistics for these tables only.

This option can be enabled at table level with the ALTER command or when creating the table. In Oracle 9i, the DBMS_STATS package has been provided with routines to enable the Monitoring option. If you are on 9i, it is recommended that you make use of DBMS_STATS to carry out all of your statistics operations.

alter table am20 monitoring;

e.g.: Monitoring option for table AM01.
SQL> exec dbms_stats.gather_table_stats('ACS', 'AM01')

PL/SQL procedure successfully completed.

SQL> select num_rows, blocks, empty_blocks, avg_space, 
  2         avg_row_len, sample_size, last_analyzed
  3  from   dba_tables
  4  where  table_name = 'AM01';

NUM_ROWS BLOCKS EMPTY_BLOCKS  AVG_SPACE AVG_ROW_LEN SAMPLE_SIZE LAST_ANAL
-------- ------ ------------ ---------- ----------- ----------- ---------
524288    795              0          0           3      524288 18-SEP-03

SQL> alter table am01 monitoring;

Table altered.

SQL> select monitoring from dba_tables where table_name = 'AM01';

MON
---
YES

SQL> select * from dba_tab_modifications;

no rows selected

--DML activities were carried on the table.
--The database was bounced.

SQL> select table_owner, table_name, inserts, updates, 
  2         deletes, timestamp, truncated
  3  from   dba_tab_modifications;

TABLE_OWNER TABLE_NAME INSERTS UPDATES DELETES TIMESTAMP TRU
----------- ---------- ------- ------- ------- --------- ---
ACS               AM01 1028577       0   19999 19-SEP-03 NO

SQL> exec dbms_stats.gather_schema_stats(ownname => 'ACS', -
>  options => 'GATHER STALE');

PL/SQL procedure successfully completed.

SQL> select num_rows, blocks, empty_blocks, avg_space, avg_row_len, 
  2         sample_size,Last_analyzed
  3  from   dba_tables
  4  where  table_name = 'AM01';

NUM_ROWS BLOCKS EMPTY_BLOCKS AVG_SPACE AVG_ROW_LEN SAMPLE_SIZE LAST_ANAL
-------- ------ ------------ --------- ----------- ----------- ---------
 2057154   3112            0         0           3     2057154 19-SEP-03

SQL> select table_owner, table_name, inserts, updates, deletes, 
  2         timestamp, truncated
  3  from dba_tab_modifications;

no rows selected

In general, there is no implied way of knowing the tables that are being changed significantly. Normally, statistics are explicitly generated for the complete database when changes have occurred in only a few tables. Monitoring is a powerful option if your database is in a similar situation.

I have not come across any SGA issues related to enabling of this option. The memory used by this option is very nominal and should not result in any performance issues.

I presently use this option for tables that are frequently being changed and the optimizer results fluctuate with changes in the tables. Evaluate this option for your database.

In the next installment, we will look at Hints that are used to direct the optimizer for choosing execution plans directly from the code.

» See All Articles by Columnist Amar Kumar Padhi

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved