Imagine you have accumulated five years' worth of data in your accounting system, say from 01-jan-2000 to 31-dec-2004, and you decide to run a report spanning all the data in 2003. It seems likely that queries with the predicate "where date_col between 01-jan-2003 and 31-dec-2003" should be using tablescans, since they touch roughly one year out of five, i.e. nearly 20% of the data.
So let's create a table with approximately 100 rows per day for five years, with every thousandth row set to NULL, and verify the optimizer's estimates.
demo@XEPDB1> create table t as
2 select decode( mod(rownum-1,1000),
3 0, to_date(null),
4 to_date('01-jan-2000','dd-mon-yyyy')+ trunc((rownum-1)/100) ) dt
5 from dual
6 connect by level <= ( 1827*100 );
Table created.
demo@XEPDB1> select num_rows,blocks
2 from user_tab_statistics
3 where table_name ='T';
  NUM_ROWS     BLOCKS
---------- ----------
    182700        343
demo@XEPDB1> select num_distinct,num_nulls,num_buckets,sample_size
2 from user_tab_col_statistics
3 where table_name ='T';
NUM_DISTINCT NUM_NULLS NUM_BUCKETS SAMPLE_SIZE
------------ ---------- ----------- -----------
        1827        183           1      182517
demo@XEPDB1> select min(dt), max(dt) from t;
MIN(DT) MAX(DT)
----------- -----------
01-JAN-2000 31-DEC-2004
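As a quick cross-check of the row and NULL counts (a hypothetical query, not part of the original session), something along these lines should report the same figures as the statistics above:

select count(*)             as total_rows,   -- expect 182700
       count(dt)            as dt_not_null,  -- expect 182517 (matches SAMPLE_SIZE)
       count(*) - count(dt) as dt_nulls      -- expect 183 (matches NUM_NULLS)
from t;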
So we have 182,700 rows in the table, of which 183 have NULL values. Now consider a query like the one below, with a simple range-based predicate:
select * from t
where dt between to_date('01-jan-2003','dd-mon-yyyy')
and to_date('31-dec-2003','dd-mon-yyyy');
Here is, roughly, the quick calculation the optimizer does internally for the range-scan selectivity of that query:

Selectivity = (required range / available range) + N/NDV

where the available range is the column's high value minus its low value, and N is the number of closed ends of the range (2 for a BETWEEN predicate).

Selectivity = ( 31-dec-2003 - 01-jan-2003 ) / ( 31-dec-2004 - 01-jan-2000 ) + 2/1827 = 364/1826 + 2/1827 = 0.200437517
Now the cardinality is (num_rows - num_nulls) * selectivity = (182700 - 183) * 0.200437517 = 36583.25
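To see where those numbers come from, here is a small sketch (not part of the original session) that redoes the optimizer's arithmetic in plain SQL, relying on the fact that subtracting two Oracle dates yields a number of days:

select sel,
       round( (182700 - 183) * sel, 2 ) as card        -- (num_rows - num_nulls) * selectivity
from ( select ( to_date('31-dec-2003','dd-mon-yyyy')
              - to_date('01-jan-2003','dd-mon-yyyy') )  -- required range = 364 days
            / ( to_date('31-dec-2004','dd-mon-yyyy')
              - to_date('01-jan-2000','dd-mon-yyyy') )  -- available range = 1826 days
            + 2/1827                                    -- two closed ends / NDV
              as sel
         from dual );
-- sel comes out at ~0.200438 and card at ~36583.25, matching the plan below.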
demo@XEPDB1> set autotrace on exp
demo@XEPDB1> select count(*) from t
2 where dt between to_date('01-jan-2003','dd-mon-yyyy')
3 and to_date('31-dec-2003','dd-mon-yyyy');
COUNT(*)
----------
36463
Execution Plan
----------------------------------------------------------
Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 8 | 98 (4)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 8 | | |
|* 2 | TABLE ACCESS FULL| T | 36583 | 285K| 98 (4)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DT">=TO_DATE(' 2003-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2003-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
The estimate reported in the execution plan (36,583 rows) is close to the actual count (36,463), so what could possibly go wrong with real-world Oracle date columns?
As with many database-independent applications, if we decide to store some fake future value (like 31-dec-4000) instead of NULL, the optimizer's estimates will be affected heavily. So let's update all the NULL values in the example above to that future value, 31-dec-4000 (i.e. we choose to store a dummy date rather than NULL):
demo@XEPDB1> update t set
2 dt = to_date('31-dec-4000','dd-mon-yyyy')
3 where dt is null ;
183 rows updated.
demo@XEPDB1> begin
2 dbms_stats.gather_table_stats(
3 user,'T',no_invalidate=>false,
4 method_opt=>'for all columns size 1');
5 end;
6 /
PL/SQL procedure successfully completed.
demo@XEPDB1> select min(dt), max(dt) from t;
MIN(DT) MAX(DT)
----------- -----------
01-JAN-2000 31-DEC-4000
The data distribution then changes like this:
demo@XEPDB1> select num_distinct,num_nulls,num_buckets,sample_size
2 from user_tab_col_statistics
3 where table_name ='T';
NUM_DISTINCT NUM_NULLS NUM_BUCKETS SAMPLE_SIZE
------------ ---------- ----------- -----------
        1828          0           1      182700
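Just to make it obvious that the 183 former NULLs are now one extreme value sitting far beyond the real data, a quick check (again a hypothetical query, not part of the original session) would be:

select count(*) as fake_dates        -- expect 183: the rows that used to be NULL
from t
where dt = to_date('31-dec-4000','dd-mon-yyyy');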
Re-executing the same query with the same simple range predicate against this new data distribution, the range-scan selectivity calculation becomes:

Selectivity = (required range / available range) + N/NDV

Selectivity = ( 31-dec-2003 - 01-jan-2003 ) / ( 31-dec-4000 - 01-jan-2000 ) + 2/1828 = 364/730850 + 2/1828 = 0.001592
Now the cardinality is num_rows * selectivity = 182700 * 0.001592 = 290.85, which the plan rounds up to 291.
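The same arithmetic sketch as before, adjusted for the new high value and NDV (again a hypothetical verification query, not the original session):

select sel,
       round( 182700 * sel, 2 ) as card              -- num_nulls is now 0
from ( select ( to_date('31-dec-2003','dd-mon-yyyy')
              - to_date('01-jan-2003','dd-mon-yyyy') )  -- required range is still 364 days
            / ( to_date('31-dec-4000','dd-mon-yyyy')
              - to_date('01-jan-2000','dd-mon-yyyy') )  -- available range is now 730850 days
            + 2/1828                                    -- two closed ends / new NDV
              as sel
         from dual );
-- sel drops to ~0.0015921 and card to ~290.9, which the plan below reports as 291.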
demo@XEPDB1> set autotrace on exp
demo@XEPDB1> select count(*) from t
2 where dt between to_date('01-jan-2003','dd-mon-yyyy')
3 and to_date('31-dec-2003','dd-mon-yyyy');
COUNT(*)
----------
36463
Execution Plan
----------------------------------------------------------
Plan hash value: 2966233522
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 8 | 98 (4)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 8 | | |
|* 2 | TABLE ACCESS FULL| T | 291 | 2328 | 98 (4)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DT"<=TO_DATE(' 2003-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT">=TO_DATE(' 2003-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
With this misleading selectivity (which translates into a dramatically incorrect cardinality), it is not surprising if the optimizer picks the wrong path through the data, and it takes just one row with such an unfortunate default value to make the statistics look like rubbish.
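If the fake 31-dec-4000 values cannot simply be turned back into NULLs, one way to soften the damage (a sketch, not something demonstrated above) is to let the optimizer see the skew by gathering a histogram on DT instead of the basic "size 1" statistics used in this demo:

begin
  dbms_stats.gather_table_stats(
    user, 'T',
    no_invalidate => false,
    method_opt    => 'for columns dt size 254' );  -- histogram on DT instead of size 1
end;
/

With a histogram in place the optimizer no longer has to assume that DT values are spread evenly between 01-jan-2000 and 31-dec-4000, so the estimate for the 2003 range should land much closer to the 36,463 actual rows. The cleaner fix, of course, is to keep genuinely unknown dates as NULL in the first place.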