28 June, 2008

Cardinality Estimates : Dependent Columns

NOTE : I am reposting the text of the case study (ie SQLs , Outputs and REMarks) as the previous posting had one "less than" sign resulting in blogger presenting a "challenging reading". 15-Jul-08

The Oracle Cost Based Optimizer uses table and column statistics to estimate the number of rows that will be fetched for given predicates in a query. This estimate is called the "Cardinality" and is presented as "CARD=" in a 9i Explain Plan or under the "Rows" Column in a 10g Explain Plan.

It is fairly obvious that the Optimizer uses the number of rows in a table and column selectivity (ie, the number of distinct values, and in the presence of histograms, the count of rows for each distinct value) to derive the expected Cardinality. Thus, the general advice to "ensure that you have fresh statistics" or "ensure that you have the right statistics" that is provided to DBAs.

For multi column predicates on the same table, the Optimizer derives the selectivity of the columns as "independent" columns and then multiplies them. Thus, the selectivity of columns (a,b) is selectivity_of_column_a X selectivity_of_column_b.
There are variants and complexities in this formula. The presence of indexes can also help the Optimizer estimate costs of different steps.

However, what the Optimizer, till 10gR2, cannot do is identify inter-column dependencies or correlations. Therefore, if "STATE" is one column and "COUNTRY" is another column in the same table, the Optimizer may well know the number of distinct STATEs and the number of distinct COUNTRYs in the table, but not how many and which STATEs are present for a given COUNTRY.

Here is a simple example with a fairly small table which has correlated columns. The same set of queries are executed twice, first without Histograms and the next round with Histograms on each of the columns. I've put in some remarks to explain how the Cardinality estimates that the Optimizer uses in deriving the Execution Plan can be very wrong when there are multiple columns and, particularly, more so when there exists high correlation between the Columns.


SQL> set pages600
SQL> set linesize132
SQL> set feedback off
SQL> set SQLPrompt ''

col Expected_Cardinality format 99,990 hea 'ExpCrd'
column parent_id_plus_exp format 999
column id_plus_exp format 990
column plan_plus_exp format a90
column object_node_plus_exp format a14
column other_plus_exp format a90
column other_tag_plus_exp format a29


REM To demonstrate how Oracle computes expected cardinality of two columns as being :
REM (Number_of_Rows X Density_of_Column_1 X Density_of_Column_2)
REM Oracle assumes a Uniform Distribution of values

REM The second set of tests are with Histograms and provide equally interesting results.
REM With histograms, Oracle is aware that there isn't a Uniform distribution for single columns but still isn't aware of multi column distributions

REM Only in 11g does Oracle introduce Column Groups for statistics




drop table test_cardinality ;
create table test_cardinality (
2 date_of_birth date,
3 month_of_birth varchar2(12),
4 month_number number,
5 city_of_birth varchar2(18),
6 country_of_birth varchar2(3)
7 )
8 /

alter session set nls_date_format='DD-MON-RR';

insert into test_cardinality values ('01-JAN-60','JANUARY',1,'BOMBAY','IN');
insert into test_cardinality values ('01-FEB-60','FEBRUARY',2,'DELHI','IN');
insert into test_cardinality values ('01-MAR-60','MARCH',3,'CALCUTTA','IN');
insert into test_cardinality values ('01-APR-60','APRIL',4,'MADRAS','IN');
insert into test_cardinality values ('01-MAY-60','MAY',5,'NAGPUR','IN');
insert into test_cardinality values ('01-JUN-60','JUNE',6,'KANPUR','IN');
insert into test_cardinality values ('01-JUL-60','JULY',7,'KOLHAPUR','IN');
insert into test_cardinality values ('01-AUG-60','AUGUST',8,'SOLAPUR','IN');
insert into test_cardinality values ('01-SEP-60','SEPTEMBER',9,'JABALPUR','IN');
insert into test_cardinality values ('01-OCT-60','OCTOBER',10,'BADLAPUR','IN');
insert into test_cardinality values ('01-NOV-60','NOVEMBER',11,'LONDON','UK');
insert into test_cardinality values ('01-DEC-60','DECEMBER',12,'PARIS','FR');

commit;

alter table test_cardinality nologging;
insert /*+ APPEND */ into test_cardinality select * from test_cardinality ;
commit;
insert /*+ APPEND */ into test_cardinality select * from test_cardinality ;
commit;
insert /*+ APPEND */ into test_cardinality select * from test_cardinality ;
commit;
insert /*+ APPEND */ into test_cardinality select * from test_cardinality ;
commit;
insert /*+ APPEND */ into test_cardinality select * from test_cardinality ;
commit;
insert /*+ APPEND */ into test_cardinality select * from test_cardinality ;
commit;
insert /*+ APPEND */ into test_cardinality select * from test_cardinality ;
commit;

REM We now have 128 copies of each row. ie, 128 'JANUARY's, 128 'DELHI's, 128 'UK's. But 1536*10/12 INs !


exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'TEST_CARDINALITY',method_opt=>'FOR ALL COLUMNS SIZE 1',estimate_percent=>100);

select count(*) from test_cardinality;

COUNT(*)
----------
1536
select num_rows from user_tables where table_name = 'TEST_CARDINALITY';

NUM_ROWS
----------
1536
select column_name,sample_size, num_distinct, density from user_tab_columns where table_name = 'TEST_CARDINALITY' order by column_id;

COLUMN_NAME SAMPLE_SIZE NUM_DISTINCT DENSITY
------------------------------ ----------- ------------ ----------
DATE_OF_BIRTH 1536 12 .083333333
MONTH_OF_BIRTH 1536 12 .083333333
MONTH_NUMBER 1536 12 .083333333
CITY_OF_BIRTH 1536 12 .083333333
COUNTRY_OF_BIRTH 1536 3 .333333333

REM The density for the first 4 columns is 1/12 while that for COUNTRY_OF_BIRTH is 1/3 (ie 4/12)

set autotrace on explain;

select count(*), round(1536*(1/12)*(1/12)) Expected_Cardinality from test_cardinality where month_number = 1 and month_of_birth = 'JANUARY';

COUNT(*) ExpCrd
---------- -------
128 11

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 11 6 (0) 00:00:01
1 SORT AGGREGATE 1 11
* 2 TABLE ACCESS FULL TEST_CARDINALITY 11 121 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MONTH_NUMBER"=1 AND "MONTH_OF_BIRTH"='JANUARY')

REM The expected cardinality is 11 rows although the real count is 128

select count(*), round(1536*(1/12)*(1/12)) Expected_Cardinality from test_cardinality where month_number = 1 and month_of_birth = 'DECEMBER';

COUNT(*) ExpCrd
---------- -------
0 11

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 11 6 (0) 00:00:01
1 SORT AGGREGATE 1 11
* 2 TABLE ACCESS FULL TEST_CARDINALITY 11 121 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MONTH_NUMBER"=1 AND "MONTH_OF_BIRTH"='DECEMBER')

REM The Optimizer expected 11 rows. It is not aware of the inter-column dependencies (there are no 'DECEMBER's being month 1 !)

select count(*), round(1536*(1/12)*(1/12)) Expected_Cardinality from test_cardinality where date_of_birth = '01-JAN-60' and month_of_birth = 'DECEMBER';

COUNT(*) ExpCrd
---------- -------
0 11

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 16 6 (0) 00:00:01
1 SORT AGGREGATE 1 16
* 2 TABLE ACCESS FULL TEST_CARDINALITY 11 176 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MONTH_OF_BIRTH"='DECEMBER' AND "DATE_OF_BIRTH"='01-JAN-60')

REM Once again, the Optimizer expected 11 rows.

select count(*), round(1536*(1/12)*(1/12)) Expected_Cardinality from test_cardinality where to_char(date_of_birth,'MON') = 'JAN' and month_of_birth = 'JANUARY';

COUNT(*) ExpCrd
---------- -------
128 11

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 16 6 (0) 00:00:01
1 SORT AGGREGATE 1 16
* 2 TABLE ACCESS FULL TEST_CARDINALITY 1 16 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MONTH_OF_BIRTH"='JANUARY' AND
TO_CHAR(INTERNAL_FUNCTION("DATE_OF_BIRTH"),'MON')='JAN')

REM When a function ("to_char") is applied to a column, Oracle is unable to estimate the cardinality -- it comes up with "1"
rem so, the Density_of_Column_1 X Density_of_Column_2 formula breaks down

select count(*), round(1536*(1/12)*(4/12)) Expected_Cardinality from test_cardinality where city_of_birth = 'DELHI' and country_of_birth='IN';

COUNT(*) ExpCrd
---------- -------
128 43

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 11 6 (0) 00:00:01
1 SORT AGGREGATE 1 11
* 2 TABLE ACCESS FULL TEST_CARDINALITY 43 473 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("CITY_OF_BIRTH"='DELHI' AND "COUNTRY_OF_BIRTH"='IN')

REM Here, the density for country_of_birth is 4/12 so that is used in the calculation.

select count(*), round(1536*(1/12)*(4/12)) Expected_Cardinality from test_cardinality where city_of_birth = 'DELHI' and country_of_birth='UK';

COUNT(*) ExpCrd
---------- -------
0 43

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 11 6 (0) 00:00:01
1 SORT AGGREGATE 1 11
* 2 TABLE ACCESS FULL TEST_CARDINALITY 43 473 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("CITY_OF_BIRTH"='DELHI' AND "COUNTRY_OF_BIRTH"='UK')

REM Oracle is unaware that the intersect between the data sets for 'DELHI' and 'UK' is zero ! It just multiplies the two densities

select count(*), round(1536*(2/12)*(4/12)) Expected_Cardinality from test_cardinality where city_of_birth in ('DELHI','LONDON') and country_of_birth='FR';

COUNT(*) ExpCrd
---------- -------
0 85

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 11 6 (0) 00:00:01
1 SORT AGGREGATE 1 11
* 2 TABLE ACCESS FULL TEST_CARDINALITY 85 935 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("COUNTRY_OF_BIRTH"='FR' AND ("CITY_OF_BIRTH"='DELHI' OR
"CITY_OF_BIRTH"='LONDON'))

REM The "error" (actually, lack of knowledge about relationships between column values) is compounded further in this case !

select count(*), round(1536*(1/12)*(1/12)*(1/12)*(1/12)*(4/12)) Expected_Cardinality from test_cardinality where date_of_birth='01-JAN-60' and month_of_birth='JANUARY' and month_number = 1 and city_of_birth='BOMBAY' and country_of_birth='IN';

COUNT(*) ExpCrd
---------- -------
128 0

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 28 6 (0) 00:00:01
1 SORT AGGREGATE 1 28
* 2 TABLE ACCESS FULL TEST_CARDINALITY 1 28 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MONTH_OF_BIRTH"='JANUARY' AND "MONTH_NUMBER"=1 AND
"CITY_OF_BIRTH"='BOMBAY' AND "COUNTRY_OF_BIRTH"='IN' AND
"DATE_OF_BIRTH"='01-JAN-60')

REM This really makes it worse ! We've taken the exact values for 1 in every 12 rows but the Optimizer multiplies the densities and comes up with less than 1
rem the Optimizer never presents a value of 0, but rounds it up to 1. So the expected cardinality is presented as 1 (which is very wrong)

select count(*), round(1536*(4/12)) Expected_Cardinality from test_cardinality where country_of_birth='SG';

COUNT(*) ExpCrd
---------- -------
0 512

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 3 6 (0) 00:00:01
1 SORT AGGREGATE 1 3
* 2 TABLE ACCESS FULL TEST_CARDINALITY 512 1536 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("COUNTRY_OF_BIRTH"='SG')

REM When we present a value that is not even present, Oracle isn't aware of this (Without histograms, it only knows MIN, MAX, Distinct and Density)
REM The Optimizer still applies the same 4/12 density formula for country of birth and expects 512 'SG' rows !

set autotrace off


REM -----------------------------------------------------------------------------------------------------------
REM -----------------------------------------------------------------------------------------------------------
REM What happens when DO gather Histograms ? Can the Optimizer make use of Histograms on multiple columns ?!
REM -----------------------------------------------------------------------------------------------------------


exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'TEST_CARDINALITY',method_opt=>'FOR ALL COLUMNS SIZE 250',estimate_percent=>100);

select count(*) from test_cardinality;

COUNT(*)
----------
1536
select num_rows from user_tables where table_name = 'TEST_CARDINALITY';

NUM_ROWS
----------
1536
select column_name,sample_size, num_distinct, density from user_tab_columns where table_name = 'TEST_CARDINALITY' order by column_id;

COLUMN_NAME SAMPLE_SIZE NUM_DISTINCT DENSITY
------------------------------ ----------- ------------ ----------
DATE_OF_BIRTH 1536 12 .000325521
MONTH_OF_BIRTH 1536 12 .000325521
MONTH_NUMBER 1536 12 .000325521
CITY_OF_BIRTH 1536 12 .000325521
COUNTRY_OF_BIRTH 1536 3 .000325521

REM With Histograms, the Density is actually presented differently. We should be looking at USER_TAB_HISTOGRAMS for each column's values
REM Now, Oracle does know the exact count of rows for each of the column's range of values.
rem I haven't presented USER_TAB_HISTOGRAMS here

set autotrace on explain;

select count(*), round(1536*(1/12)*(1/12)) Expected_Cardinality from test_cardinality where month_number = 1 and month_of_birth = 'JANUARY';

COUNT(*) ExpCrd
---------- -------
128 11

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 11 6 (0) 00:00:01
1 SORT AGGREGATE 1 11
* 2 TABLE ACCESS FULL TEST_CARDINALITY 11 121 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MONTH_NUMBER"=1 AND "MONTH_OF_BIRTH"='JANUARY')

REM Although I seem to be using 'Density', I am actually using the exact count of rows. There are 1/12 'month_number=1's.
rem The Optimizer still comes up with a cardinality estimate of 11, based on Histograms, not Densities !

select count(*), round(1536*(1/12)*(1/12)) Expected_Cardinality from test_cardinality where month_number = 1 and month_of_birth = 'DECEMBER';

COUNT(*) ExpCrd
---------- -------
0 11

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 11 6 (0) 00:00:01
1 SORT AGGREGATE 1 11
* 2 TABLE ACCESS FULL TEST_CARDINALITY 11 121 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MONTH_NUMBER"=1 AND "MONTH_OF_BIRTH"='DECEMBER')

REM The Histograms say that there the frequency for 'month_number=1' is 1/12 and the frequency for 'DECEMBER' is also 1/12

select count(*), round(1536*(1/12)*(1/12)) Expected_Cardinality from test_cardinality where date_of_birth = '01-JAN-60' and month_of_birth = 'DECEMBER';

COUNT(*) ExpCrd
---------- -------
0 11

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 16 6 (0) 00:00:01
1 SORT AGGREGATE 1 16
* 2 TABLE ACCESS FULL TEST_CARDINALITY 11 176 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MONTH_OF_BIRTH"='DECEMBER' AND "DATE_OF_BIRTH"='01-JAN-60')

REM The Histograms again present frequencies of 1/12 and 1/12

select count(*), round(1536*(1/12)*(1/12)) Expected_Cardinality from test_cardinality where to_char(date_of_birth,'MON') = 'JAN' and month_of_birth = 'JANUARY';

COUNT(*) ExpCrd
---------- -------
128 11

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 16 6 (0) 00:00:01
1 SORT AGGREGATE 1 16
* 2 TABLE ACCESS FULL TEST_CARDINALITY 1 16 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MONTH_OF_BIRTH"='JANUARY' AND
TO_CHAR(INTERNAL_FUNCTION("DATE_OF_BIRTH"),'MON')='JAN')

REM Once again, the usage of a function throws off the Optimizer !

select count(*), round(1536*(1/12)*(10/12)) Expected_Cardinality from test_cardinality where city_of_birth = 'DELHI' and country_of_birth='IN';

COUNT(*) ExpCrd
---------- -------
128 107

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 11 6 (0) 00:00:01
1 SORT AGGREGATE 1 11
* 2 TABLE ACCESS FULL TEST_CARDINALITY 107 1177 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("CITY_OF_BIRTH"='DELHI' AND "COUNTRY_OF_BIRTH"='IN')

REM The Histogram for 'IN' is a frequency of 10/12 *not* the 4/12 that was assumed with uniform distribution. Yet, this is multiplied with 1/12 for 'DELHI'

select count(*), round(1536*(1/12)*(1/12)) Expected_Cardinality from test_cardinality where city_of_birth = 'DELHI' and country_of_birth='UK';

COUNT(*) ExpCrd
---------- -------
0 11

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 11 6 (0) 00:00:01
1 SORT AGGREGATE 1 11
* 2 TABLE ACCESS FULL TEST_CARDINALITY 11 121 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("COUNTRY_OF_BIRTH"='UK' AND "CITY_OF_BIRTH"='DELHI')

REM The expected frequencies for 'DELHI' and 'UK' are 1/12 and 1/12 respectively. No knowledge of the impossibility of the two data sets intersecting.

select count(*), round(1536*(2/12)*(1/12)) Expected_Cardinality from test_cardinality where city_of_birth in ('DELHI','LONDON') and country_of_birth='FR';

COUNT(*) ExpCrd
---------- -------
0 21

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 11 6 (0) 00:00:01
1 SORT AGGREGATE 1 11
* 2 TABLE ACCESS FULL TEST_CARDINALITY 21 231 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("COUNTRY_OF_BIRTH"='FR' AND ("CITY_OF_BIRTH"='DELHI' OR
"CITY_OF_BIRTH"='LONDON'))

REM With two countries, the lack of knowledge makes the cardinality estimate worse !

select count(*), round(1536*(1/12)*(1/12)*(1/12)*(1/12)*(10/12)) Expected_Cardinality from test_cardinality where date_of_birth='01-JAN-60' and month_of_birth='JANUARY' and month_number = 1 and city_of_birth='BOMBAY' and country_of_birth='IN';

COUNT(*) ExpCrd
---------- -------
128 0

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 28 6 (0) 00:00:01
1 SORT AGGREGATE 1 28
* 2 TABLE ACCESS FULL TEST_CARDINALITY 1 28 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MONTH_OF_BIRTH"='JANUARY' AND "MONTH_NUMBER"=1 AND
"CITY_OF_BIRTH"='BOMBAY' AND "COUNTRY_OF_BIRTH"='IN' AND
"DATE_OF_BIRTH"='01-JAN-60')

REM Perfect knowledge about individual columns when multiplied together returns absolutely imperfect knowledge of all the columns !

select count(*), round(1536*(0/12)) Expected_Cardinality from test_cardinality where country_of_birth='SG';

COUNT(*) ExpCrd
---------- -------
0 0

Execution Plan
----------------------------------------------------------
Plan hash value: 3936405985

---------------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------------
0 SELECT STATEMENT 1 3 6 (0) 00:00:01
1 SORT AGGREGATE 1 3
* 2 TABLE ACCESS FULL TEST_CARDINALITY 64 192 6 (0) 00:00:01
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("COUNTRY_OF_BIRTH"='SG')

REM Although the Histogram on country_of_birth shows that there are *NO* rows for 'SG', Oracle cannot assume that there really are no rows.
REM What if I had inserted an 'SG' row after the gather_table_stats ?
REM So the optimizer must compute a Cardinality, how does it come up with 64 ? Any guesses ?

set autotrace off








Thus, you can see, in the simplest of cases, that the Cardinality
estimates can be very wrong. Imagine that this table is scaled up to 1,000 times its size, but the selectivity remains, generally, the same. Then, take this table in a complex multi-table query with join conditions. The incorrect Cardinality estimate could very well result in a poor Execution Plan. Thus, for cases like the "where city_of_birth in ('DELHI','LONDON') and country_of_birth='FR'", although we know that there would be 0 (zero) rows (irrespective of how large the table is), the Optimizer might well expect very many rows, depending on the table size. Where we would prefer that Oracle first filter on this table in a multi-table query, the Optimizer might prefer to execute this filter much later in the query, after needlessly wading through many blocks and rows in other tables in the same query!

2 comments:

Ravishankar said...

Hi Hemant,
Thanks for the post.
This is now addressed in 10.2.0.4 where we can use extended statistics.I did not try this 10.2.0.4 .11gR1 has this fixed. http://www.oracle.com/technology/obe/11gr1_db/perform/multistats/multicolstats.htm

Hemant K Chitale said...

Yes, I haven't tried this in 10.2.0.4 yet.
(although the results I published are in 10.2.0.4 , without defining a column group and extended statistics).