### 9. Generating Statistics

In Part 1 of this series, we discussed the basics of statistics. This month we'll explore statistics in-depth, learning how well we can maintain them.

There are two ways of gathering statistics, COMPUTE and ESTIMATE. COMPUTE results in 100% statistics generation and is accurate. The drawback is that it needs ample time to generate the complete statistics. ESTIMATE generates statistics as per the sample number of rows or percent provided. It is faster as only partial data is considered for arriving at the statistics. The drawback is that accuracy of the statistics depends on the sample size provided. Please note that the optimizer is better placed when accurate or near-accurate statistics are provided. The ESTIMATE option can be used for large systems that cannot afford the time required for generating statistics that are more accurate.

Please note that columns like NUM_ROWS, AVG_SPACE and AVG_ROW_LEN may not show exact data if the statistics are being estimated with a specific sample size. For example, accurate statistics may show you the exact number of rows in a table when the statistics were taken, but estimate statistics would show a derived value that might not be equal to the actual number of rows. Read further to find whether your setup will benefit with the ESTIMATE option.

*How much to estimate?*

If you intend to ESTIMATE statistics because of time constraints, try to arrive at an optimal sample size that will yield excellent results for your database. For example, start with a sample size of 5% and test your application. Increase the percentage to 10-15% and test your application again.

Some setups that I have come across report good throughput and response time for statistics with a sample size of only 5% to 10%. In general, a sample size of 5% to 10% generates adequate statistics! What will matter here is proper testing in your setup.

The accuracy of the statistics will depend on an appropriate sample size. Beyond a certain sample size, the statistics generated will be fairly consistent. For example, if you analyze tables with a sample size of 5% and later analyze the same set of tables with a sample size of 10%, you may find significant changes in the statistics collected. However, if you analyze the same set of tables with a sample size of 15-20%, it may not show a significant difference from the statistics collected at 10% and requires even more time. Thus, an optimal sample size here would be 10% flat.

If the ESTIMATE sample size is greater than 50%, the statistics will be almost similar to that of the COMPUTE option, and so will the time it takes to generate! There is no point estimating beyond 50% (see below example); rather, go for the COMPUTE option if you can afford that much time.

Gathering statistics on tables requires sorting to be done and this takes up resources. Gathering statistics on indexes does not require sorting. Considering this benefit, you may COMPUTE statistics on indexes for accurate data.

You may also consider generating statistics in Parallel, an example is provided in the DBMS_STATS package section below.

Below is an example of arriving at an appropriate ESTIMATE sample size. Please note that this is just for one table. For an application, it would be advisable to try it out at database level and draw conclusions only after proper testing.

*Example*

The table INV.MATERIAL_TRX has around 4.5 million records. Below are statistics at various sample sizes.

Commands used:

dbms_stats.gather_table_stats('INV', 'MTL_MATERIAL_TRX'); --compute dbms_stats.gather_table_stats('INV', 'MTL_MATERIAL_TRX', estimate_percent => 5); dbms_stats.gather_table_stats('INV', 'MTL_MATERIAL_TRX', estimate_percent => 10); dbms_stats.gather_table_stats('INV', 'MTL_MATERIAL_TRX', estimate_percent => 51);

Final statistics:

MODE | NUM_ROWS| BLOCKS|AVG_ROW_LEN|SAMPLE_SIZE|LAST_ANAL|Time taken _______|__________|__________|___________|___________|_________|__________ compute| 4591474| 193230| 251| 4591474|27-JUL-03|2 hr at 5% | 4582460| 193230| 247| 229123|27-JUL-03|8 mts at 10% | 4587520| 193230| 249| 458752|27-JUL-03|17 mts at 20% | 4591635| 193230| 250| 918327|27-JUL-03|32 mts at 51% | 4590890.2| 193230| 250| 2341354|27-JUL-03|1 hr 56 mts

As you can see, the NUM_ROWS difference between full statistics and 5% is only 9014 records. Increasing the sample size to 10% adds 5060 records. Making it 20% adds an additional 4115 records but doubles the time. Bringing it up to 51% results in almost the same number of records as the COMPUTE option, but the time taken is also same. As per my requirement, statistics at a 5% sample size serves my purpose as the variance is less than 1% that of actual value.

*What should the time interval be?*

To determine an efficient statistics-gathering interval, keep a history of statistics generated and evaluate the variations. If the two statistics remain more or less similar, then you may consider increasing the time interval. If the statistics vary considerable then either the sample size is not appropriate (in the case of ESTIMATE) or the time interval is not appropriate. This is easier said than done and may be a tedious process for huge systems.

When there is a 10-20% change in data, the general convention is to generate fresh statistics. You can start of with a general rule of estimating statistics on a weekly basis. If the tables are giving real bad hits because of heavy activity, you may consider using the DML Monitoring option to update statistics every few hours for such tables.

Statistics are not incremental and are regenerated every time. If there is no considerable change in data, there is no advantage in generating statistics too frequently.

You may consider generating statistics for interface tables every time a bulk uploading is carried out. Objects can be grouped and statistics generated at different intervals, for example transaction tables could be analyzed every week, where as, Master tables might be done once a month. Statistics generation may also be done schema wise, depending on the maintenance window available.

*Statistics locks?*

A table that is in the process of being analyzed cannot undergo DDL changes but DML activities can be carried on. Analyzing an index puts a shared lock on the related table; hence, neither DML nor DDL can be performed. Preferably avoid all activities during the statistics generation phase.

Oracle provides more than one way of generating statistics.

- DBMS_UTILITY
- ANALYZE command
- DBMS_DDL
- DBMS_STATS

Following are the options.