Imagine you have accumulated five years' worth of data in your accounting system, say from 01-jan-2000 to 31-dec-2004, and you decide to run a report spanning all the data in 2003. It seems likely that queries with the predicate "where date_col between 01-jan-2003 and 31-dec-2003" should be using tablescans, since they touch roughly one year out of five, i.e. nearly 20% of the data.
So let's create a table with approximately 100 rows per day for five years, with every thousandth row set to NULL, and verify the optimizer's estimates.
demo@XEPDB1> create table t as
2 select decode( mod(rownum-1,1000),
3 0, to_date(null),
4 to_date('01-jan-2000','dd-mon-yyyy')+ trunc((rownum-1)/100) ) dt
5 from dual
6 connect by level <= ( 1827*100 );
Table created.
demo@XEPDB1> select num_rows,blocks
2 from user_tab_statistics
3 where table_name ='T';
  NUM_ROWS     BLOCKS
---------- ----------
    182700        343
demo@XEPDB1> select num_distinct,num_nulls,num_buckets,sample_size
2 from user_tab_col_statistics
3 where table_name ='T';
NUM_DISTINCT NUM_NULLS NUM_BUCKETS SAMPLE_SIZE
------------ ---------- ----------- -----------
        1827        183           1      182517
demo@XEPDB1> select min(dt), max(dt) from t;
MIN(DT) MAX(DT)
----------- -----------
01-JAN-2000 31-DEC-2004
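As a quick cross-check of the row and NULL counts (a hypothetical query, not part of the original session), something along these lines should report the same figures as the statistics above:

select count(*)             as total_rows,   -- expect 182700
       count(dt)            as dt_not_null,  -- expect 182517 (matches SAMPLE_SIZE)
       count(*) - count(dt) as dt_nulls      -- expect 183 (matches NUM_NULLS)
from t;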
So we have 182,700 rows in the table, of which 183 have NULL values. Now consider a query like the one below, with a simple range-based predicate:
select * from t
where dt between to_date('01-jan-2003','dd-mon-yyyy')
and to_date('31-dec-2003','dd-mon-yyyy');
Here is, roughly, the quick calculation the optimizer does internally for the range-scan selectivity of that query:

Selectivity = (required range / available range) + N/NDV

where the available range is the column's high value minus its low value, and N is the number of closed ends of the range (2 for a BETWEEN predicate).

Selectivity = ( 31-dec-2003 - 01-jan-2003 ) / ( 31-dec-2004 - 01-jan-2000 ) + 2/1827 = 364/1826 + 2/1827 = 0.200437517
Now the cardinality is (num_rows - num_nulls) * selectivity = (182700 - 183) * 0.200437517 = 36583.25
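To see where those numbers come from, here is a small sketch (not part of the original session) that redoes the optimizer's arithmetic in plain SQL, relying on the fact that subtracting two Oracle dates yields a number of days:

select sel,
       round( (182700 - 183) * sel, 2 ) as card        -- (num_rows - num_nulls) * selectivity
from ( select ( to_date('31-dec-2003','dd-mon-yyyy')
              - to_date('01-jan-2003','dd-mon-yyyy') )  -- required range = 364 days
            / ( to_date('31-dec-2004','dd-mon-yyyy')
              - to_date('01-jan-2000','dd-mon-yyyy') )  -- available range = 1826 days
            + 2/1827                                    -- two closed ends / NDV
              as sel
         from dual );
-- sel comes out at ~0.200438 and card at ~36583.25, matching the plan below.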
demo@XEPDB1> set autotrace on exp
demo@XEPDB1> select count(*) from t
2 where dt between to_date('01-jan-2003','dd-mon-yyyy')
3 and to_date('31-dec-2003','dd-mon-yyyy');
COUNT(*)
----------
36463
Execution Plan
----------------------------------------------------------
Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 8 | 98 (4)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 8 | | |
|* 2 | TABLE ACCESS FULL| T | 36583 | 285K| 98 (4)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DT">=TO_DATE(' 2003-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2003-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
The estimate reported in the execution plan (36,583 rows) is close to the actual count (36,463), so what could possibly go wrong with real-world Oracle date columns?
As with many database-independent applications, if we decide to store some fake future value (like 31-dec-4000) instead of NULL, the optimizer's estimates will be affected heavily. So let's update all the NULL values in the example above to that future value, 31-dec-4000 (i.e. we choose to store a dummy date rather than NULL):
demo@XEPDB1> update t set
2 dt = to_date('31-dec-4000','dd-mon-yyyy')
3 where dt is null ;
183 rows updated.
demo@XEPDB1> begin
2 dbms_stats.gather_table_stats(
3 user,'T',no_invalidate=>false,
4 method_opt=>'for all columns size 1');
5 end;
6 /
PL/SQL procedure successfully completed.
demo@XEPDB1> select min(dt), max(dt) from t;
MIN(DT) MAX(DT)
----------- -----------
01-JAN-2000 31-DEC-4000
The data distribution then changes like this:
demo@XEPDB1> select num_distinct,num_nulls,num_buckets,sample_size
2 from user_tab_col_statistics
3 where table_name ='T';
NUM_DISTINCT NUM_NULLS NUM_BUCKETS SAMPLE_SIZE
------------ ---------- ----------- -----------
        1828          0           1      182700
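Just to make it obvious that the 183 former NULLs are now one extreme value sitting far beyond the real data, a quick check (again a hypothetical query, not part of the original session) would be:

select count(*) as fake_dates        -- expect 183: the rows that used to be NULL
from t
where dt = to_date('31-dec-4000','dd-mon-yyyy');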
Re-executing the same query with the same simple range predicate against this new data distribution, the range-scan selectivity calculation becomes:

Selectivity = (required range / available range) + N/NDV

Selectivity = ( 31-dec-2003 - 01-jan-2003 ) / ( 31-dec-4000 - 01-jan-2000 ) + 2/1828 = 364/730850 + 2/1828 = 0.001592
Now the cardinality is num_rows * selectivity = 182700 * 0.001592 = 290.85, which the plan rounds up to 291.
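The same arithmetic sketch as before, adjusted for the new high value and NDV (again a hypothetical verification query, not the original session):

select sel,
       round( 182700 * sel, 2 ) as card              -- num_nulls is now 0
from ( select ( to_date('31-dec-2003','dd-mon-yyyy')
              - to_date('01-jan-2003','dd-mon-yyyy') )  -- required range is still 364 days
            / ( to_date('31-dec-4000','dd-mon-yyyy')
              - to_date('01-jan-2000','dd-mon-yyyy') )  -- available range is now 730850 days
            + 2/1828                                    -- two closed ends / new NDV
              as sel
         from dual );
-- sel drops to ~0.0015921 and card to ~290.9, which the plan below reports as 291.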
demo@XEPDB1> set autotrace on exp
demo@XEPDB1> select count(*) from t
2 where dt between to_date('01-jan-2003','dd-mon-yyyy')
3 and to_date('31-dec-2003','dd-mon-yyyy');
COUNT(*)
----------
36463
Execution Plan
----------------------------------------------------------
Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 8 | 98 (4)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 8 | | |
|* 2 | TABLE ACCESS FULL| T | 291 | 2328 | 98 (4)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DT"<=TO_DATE(' 2003-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT">=TO_DATE(' 2003-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
With this misleading selectivity (which translates into a dramatically incorrect cardinality), it is not surprising if the optimizer picks the wrong path through the data, and it takes just one row with such an unfortunate default value to make the statistics look like rubbish.
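If the fake 31-dec-4000 values cannot simply be turned back into NULLs, one way to soften the damage (a sketch, not something demonstrated above) is to let the optimizer see the skew by gathering a histogram on DT instead of the basic "size 1" statistics used in this demo:

begin
  dbms_stats.gather_table_stats(
    user, 'T',
    no_invalidate => false,
    method_opt    => 'for columns dt size 254' );  -- histogram on DT instead of size 1
end;
/

With a histogram in place the optimizer no longer has to assume that DT values are spread evenly between 01-jan-2000 and 31-dec-4000, so the estimate for the 2003 range should land much closer to the 36,463 actual rows. The cleaner fix, of course, is to keep genuinely unknown dates as NULL in the first place.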