23 November, 2009

21 November, 2009

SIZE specification for Column Histograms

When you use the method_opt parameter in DBMS_STATS.GATHER_TABLE_STATS specifying 'FOR [ALL] COLUMNS [column names] SIZE n' remember to lookup the syntax again. I made the mistake of not verifying the syntax for specifying SIZE when I name columns in the FOR ... SIZE clause.

Simply put, if you specify FOR ALL COLUMNS, the SIZE specification of 'n' does apply to every column. But if you specify individual column names and the SIZE is specified last, Oracle applies the SIZE specification only to the last named column while using it's own limit of 75 buckets where you do not specify a value for SIZE (and haven't specified SIZE AUTO). This means that if you have more than 75 (but less than 250) distinct values, Oracle creates a Height Balanced Histogram instead of the Frequency Histogram that you had desired. (Of course, if you have less than 75 distinct values, Oracle does create the right Frequency Histogram with the corresponding number of buckets to capture each distinct value).

Thus, in this demo, the first GATHER_TABLE_STATS run creates the proper histograms for all 5 columns because I specify "FOR ALL COLUMNS SIZE 250".
However, in the second run, when I name each individual column but do not attach a SIZE specification, Oracle trims columns C2_100D_NUMBERS and C4_100D_VARCHARS down to Height Balanced Histograms with 75 buckets only.
In the third run, I attache a SIZE specification to each column name and achieve the desired Histograms.


SQL> col column_name format a30
SQL>
SQL> Drop Table TEST_SIZE_SPEC purge;

Table dropped.

SQL>
SQL> Create Table TEST_SIZE_SPEC (
2 c1_10d_numbers number not null,
3 c2_100d_numbers number not null,
4 c3_10d_varchars varchar2(5) not null,
5 c4_100d_varchars varchar2(5) not null,
6 c5_dummy number not null)
7 /

Table created.

SQL>
SQL> declare
2 i number;
3 begin
4 for i in 1..10000
5 loop
6 insert into TEST_SIZE_SPEC values (
7 mod(i,10), mod(i,100), 'C3_'||mod(i,10),'C4_'||mod(i,100), i
8 );
9 end loop;
10 end;
11 /

PL/SQL procedure successfully completed.

SQL>
SQL> select count(distinct(c1_10d_numbers)) from TEST_SIZE_SPEC;

COUNT(DISTINCT(C1_10D_NUMBERS))
-------------------------------
10

SQL> select count(distinct(c2_100d_numbers)) from TEST_SIZE_SPEC;

COUNT(DISTINCT(C2_100D_NUMBERS))
--------------------------------
100

SQL> select count(distinct(c3_10d_varchars)) from TEST_SIZE_SPEC;

COUNT(DISTINCT(C3_10D_VARCHARS))
--------------------------------
10

SQL> select count(distinct(c4_100d_varchars)) from TEST_SIZE_SPEC;

COUNT(DISTINCT(C4_100D_VARCHARS))
---------------------------------
100

SQL> select count(distinct(c5_dummy)) from TEST_SIZE_SPEC;

COUNT(DISTINCT(C5_DUMMY))
-------------------------
10000

SQL>
SQL>
SQL> REM =========================================================================================================
SQL> REM =========================================================================================================
SQL> -- Gather stats with 100% sampling but SIZE 250
SQL> exec dbms_stats.gather_table_stats('','TEST_SIZE_SPEC',estimate_percent=>100,method_opt=>'FOR ALL COLUMNS SIZE 250');

PL/SQL procedure successfully completed.

SQL>
SQL> -- Review column statistics, histograms
SQL> select column_name, sample_size, num_distinct, histogram, num_buckets
2 from user_tab_columns
3 where table_name = 'TEST_SIZE_SPEC'
4 order by column_name
5 /

COLUMN_NAME SAMPLE_SIZE NUM_DISTINCT HISTOGRAM NUM_BUCKETS
------------------------------ ----------- ------------ --------------- -----------
C1_10D_NUMBERS 10000 10 FREQUENCY 10
C2_100D_NUMBERS 10000 100 FREQUENCY 100
C3_10D_VARCHARS 10000 10 FREQUENCY 10
C4_100D_VARCHARS 10000 100 FREQUENCY 100
C5_DUMMY 10000 10000 HEIGHT BALANCED 250

SQL>
SQL> select column_name, count(*)
2 from user_tab_histograms
3 where table_name = 'TEST_SIZE_SPEC'
4 group by column_name
5 order by column_name
6 /

COLUMN_NAME COUNT(*)
------------------------------ ----------
C1_10D_NUMBERS 10
C2_100D_NUMBERS 100
C3_10D_VARCHARS 10
C4_100D_VARCHARS 100
C5_DUMMY 251

SQL>
SQL> REM =========================================================================================================
SQL> REM =========================================================================================================
SQL> -- Now, reGather statistics but specifying column names
SQL> exec dbms_stats.gather_table_stats('','TEST_SIZE_SPEC',estimate_percent=>100,-
> method_opt=>'FOR COLUMNS C1_10D_NUMBERS, C2_100D_NUMBERS, C3_10D_VARCHARS, C4_100D_VARCHARS, C5_DUMMY SIZE 250');

PL/SQL procedure successfully completed.

SQL>
SQL> -- Review column statistics, histograms
SQL> select column_name, sample_size, num_distinct, histogram, num_buckets
2 from user_tab_columns
3 where table_name = 'TEST_SIZE_SPEC'
4 order by column_name
5 /

COLUMN_NAME SAMPLE_SIZE NUM_DISTINCT HISTOGRAM NUM_BUCKETS
------------------------------ ----------- ------------ --------------- -----------
C1_10D_NUMBERS 10000 10 FREQUENCY 10
C2_100D_NUMBERS 10000 100 HEIGHT BALANCED 75
C3_10D_VARCHARS 10000 10 FREQUENCY 10
C4_100D_VARCHARS 10000 100 HEIGHT BALANCED 75
C5_DUMMY 10000 10000 HEIGHT BALANCED 250

SQL>
SQL> select column_name, count(*)
2 from user_tab_histograms
3 where table_name = 'TEST_SIZE_SPEC'
4 group by column_name
5 order by column_name
6 /

COLUMN_NAME COUNT(*)
------------------------------ ----------
C1_10D_NUMBERS 10
C2_100D_NUMBERS 76
C3_10D_VARCHARS 10
C4_100D_VARCHARS 76
C5_DUMMY 251

SQL>
SQL>
SQL> REM =========================================================================================================
SQL> REM =========================================================================================================
SQL> -- Now, reGather statistics but specifying column names and size each
SQL> exec dbms_stats.gather_table_stats('','TEST_SIZE_SPEC',estimate_percent=>100,-
> method_opt=>'FOR COLUMNS C1_10D_NUMBERS SIZE 250, C2_100D_NUMBERS SIZE 250, C3_10D_VARCHARS SIZE 250, C4_100D_VARCHARS SIZE 250, C5_DUMMY SIZE 250');

PL/SQL procedure successfully completed.

SQL>
SQL> -- Review column statistics, histograms
SQL> select column_name, sample_size, num_distinct, histogram, num_buckets
2 from user_tab_columns
3 where table_name = 'TEST_SIZE_SPEC'
4 order by column_name
5 /

COLUMN_NAME SAMPLE_SIZE NUM_DISTINCT HISTOGRAM NUM_BUCKETS
------------------------------ ----------- ------------ --------------- -----------
C1_10D_NUMBERS 10000 10 FREQUENCY 10
C2_100D_NUMBERS 10000 100 FREQUENCY 100
C3_10D_VARCHARS 10000 10 FREQUENCY 10
C4_100D_VARCHARS 10000 100 FREQUENCY 100
C5_DUMMY 10000 10000 HEIGHT BALANCED 250

SQL>
SQL> select column_name, count(*)
2 from user_tab_histograms
3 where table_name = 'TEST_SIZE_SPEC'
4 group by column_name
5 order by column_name
6 /

COLUMN_NAME COUNT(*)
------------------------------ ----------
C1_10D_NUMBERS 10
C2_100D_NUMBERS 100
C3_10D_VARCHARS 10
C4_100D_VARCHARS 100
C5_DUMMY 251

SQL>



Therefore, be careful when you name individual columns in the method_opt parameter 'FOR COLUMNS ...'. Ensure that you specify the desired SIZE for each column separately (even if the SIZE is to be the same -- i.e. 250 in my case).


UPDATE : As pointed out by Randolf, an alternative command is
"FOR COLUMNS SIZE 250 C1_10D_NUMBERS, C2_100D_NUMBERS, C3_10D_VARCHARS, C4_100D_VARCHARS, C5_DUMMY"
so that the SIZE n specification does not have to be repeated after every column name.
.
.
.

13 November, 2009

Sample Sizes : Table level and Column level

There appears to be some confusion about Sample Size that the "Auto" behaviour in DBMS_STATS determines.

The AUTO_SAMPLE_SIZE for the "ESTIMATE_PERCENT" parameter doesn't mean the size as "SIZE AUTO" for the "METHOD_OPT" parameter.
The former is for Table level statistics.
The latter is for Column Histograms. NULLs are not included in Histograms.

Here is a demonstration :

SQL> drop table Test_Stats_Size purge;

Table dropped.

SQL>
SQL> create table Test_Stats_Size
2 as select object_id col_1, owner col_2, object_name col_3, created col_4
3 from dba_objects
4 where 1=2
5 /

Table created.

SQL> alter table test_stats_size add (padding_col varchar2(80));

Table altered.

SQL> alter table test_stats_size nologging;

Table altered.

SQL>
SQL> insert /*+ APPEND */ into test_stats_size
2 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
3 union all
4 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
5 union all
6 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
7 union all
8 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
9 union all
10 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
11 union all
12 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
13 union all
14 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
15 /

354501 rows created.

SQL>
SQL> commit;

Commit complete.

SQL>
SQL> select count(*) from test_stats_size ;

COUNT(*)
----------
354501

SQL> select count(*) from test_stats_size where col_2 = 'HEMANT';

COUNT(*)
----------
245

SQL>
SQL>
SQL> -- Gather Stats with 100% sampling but Auto Size for columns
SQL> exec dbms_stats.gather_table_stats('','TEST_STATS_SIZE',estimate_percent=>100,method_opt=>'FOR ALL COLUMNS SIZE AUTO');

PL/SQL procedure successfully completed.

SQL>
SQL> select table_name, num_rows, sample_size, sample_size*100/num_rows Sample_Perc
2 from user_tables
3 where table_name = 'TEST_STATS_SIZE'
4 /

TABLE_NAME NUM_ROWS SAMPLE_SIZE SAMPLE_PERC
------------------------------ ---------- ----------- -----------
TEST_STATS_SIZE 354501 354501 100

SQL> REM We see above that Table Level Sampling is 100%
SQL>
SQL> select column_name, num_distinct, sample_size, histogram , num_buckets
2 from user_tab_columns
3 where table_name = 'TEST_STATS_SIZE'
4 order by column_name
5 /

COLUMN_NAME NUM_DISTINCT SAMPLE_SIZE HISTOGRAM NUM_BUCKETS
------------------------------ ------------ ----------- --------------- -----------
COL_1 50 354494 NONE 1
COL_2 28 354501 FREQUENCY 28
COL_3 30311 354501 NONE 1
COL_4 1411 354501 NONE 1
PADDING_COL 354501 354501 NONE 1

SQL>
SQL> REM Was Column Level Sampling 100% ?
SQL> REM What type of Histograms and how many buckets did Oracle create ?
SQL> REM We see above that not necessarily all columns have the same number of rows sampled
SQL> REM Also, not necessarily all columns have a Histogram !
SQL>
SQL> REM Let's see what happens after a couple of queries on col_1
SQL>
SQL> select count(*) from test_stats_size where col_1 is null;

COUNT(*)
----------
7

SQL> select count(distinct(col_1)) from test_stats_size;

COUNT(DISTINCT(COL_1))
----------------------
50

SQL> select count(*) from test_stats_size where col_1 between 10 and 25;

COUNT(*)
----------
113092

SQL>
SQL> -- Gather Stats with 100% sampling but Auto Size for columns
SQL> exec dbms_stats.gather_table_stats('','TEST_STATS_SIZE',estimate_percent=>100,method_opt=>'FOR ALL COLUMNS SIZE AUTO');

PL/SQL procedure successfully completed.

SQL>
SQL> select table_name, num_rows, sample_size, sample_size*100/num_rows Sample_Perc
2 from user_tables
3 where table_name = 'TEST_STATS_SIZE'
4 /

TABLE_NAME NUM_ROWS SAMPLE_SIZE SAMPLE_PERC
------------------------------ ---------- ----------- -----------
TEST_STATS_SIZE 354501 354501 100

SQL> REM We see above that Table Level Sampling is 100%
SQL>
SQL> select column_name, num_distinct, sample_size, histogram , num_buckets
2 from user_tab_columns
3 where table_name = 'TEST_STATS_SIZE'
4 order by column_name
5 /

COLUMN_NAME NUM_DISTINCT SAMPLE_SIZE HISTOGRAM NUM_BUCKETS
------------------------------ ------------ ----------- --------------- -----------
COL_1 50 354494 FREQUENCY 50
COL_2 28 354501 FREQUENCY 28
COL_3 30311 354501 NONE 1
COL_4 1411 354501 NONE 1
PADDING_COL 354501 354501 NONE 1

SQL>
SQL> REM However, Column Level Sampling and Histogram collection for col_1 is now different !
SQL> REM Also notice that the 7 NULLs in col_1 are excluded !
SQL>
SQL>

And why did Oracle now gather statistics on col_1 ? See my earlier postings on COL_USAGE influencing SIZE AUTO.
.
.
.