Wednesday, June 18, 2014

Outlier Values

Outlier values are basically values that sit way way outside the standard range of a column’s normal value range

Data can be a funny thing and sometimes there are values that are naturally “exceptional”. However, very commonly, outlier values are used by applications to represent bizarre default values, to avoid confusion with legitimate values

Usually, these weird outlier values are used to avoid nulls values, as nulls can be problematic and cannot be indexed

However, outlier values while (maybe) solving one problem, can introduce some very significant problems in return.

Firstly, the CBO “hates” outlier values as it potentially totally screws up the CBO’s selectivity calculations. The selectivity of a range scan is basically calculated by the CBO to be the number of values in the range of interest divided by the full range of possible values (IE. the max value minus the min value). Therefore if this calculation is invalidated by a massive and disprotionate “hole” in the full range of possible values, the CBO can get things horribly wrong.

rajesh@ORA10GR2> create table t as
  2  select a.*,
  3     case when mod(rownum,100) <=50 then last_ddl_time
  4             end as load_date
  5  from all_objects a ;

Table created.

rajesh@ORA10GR2> create index t_ind on t(load_date) nologging;

Index created.

rajesh@ORA10GR2>
rajesh@ORA10GR2> select count(*), count(distinct load_date),
  2     count(load_date), max(load_date),
  3     min(load_date) from t ;

  COUNT(*) COUNT(DISTINCTLOAD_DATE) COUNT(LOAD_DATE) MAX(LOAD_DATE)          MIN(LOAD_DATE)
---------- ------------------------ ---------------- ----------------------- -----------------------
     58391                     4189            29783 18-JUN-2014 01-OCT-2002

1 row selected.

rajesh@ORA10GR2>
rajesh@ORA10GR2> select count(*) from t
  2  where load_date between
  3  to_date('01-jan-2005','dd-mon-yyyy')
  4  and to_date('31-dec-2005','dd-mon-yyyy') ;

  COUNT(*)
----------
     25001

1 row selected.

rajesh@ORA10GR2> exec dbms_stats.gather_table_stats(user,'T');

PL/SQL procedure successfully completed.

rajesh@ORA10GR2> set autotrace traceonly explain statistics
rajesh@ORA10GR2>
rajesh@ORA10GR2>
rajesh@ORA10GR2> select * from t
  2  where load_date between
  3  to_date('01-jan-2005','dd-mon-yyyy')
  4  and to_date('31-dec-2005','dd-mon-yyyy') ;

25001 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      | 24581 |  2352K|   193   (2)| 00:00:03 |
|*  1 |  TABLE ACCESS FULL| T    | 24581 |  2352K|   193   (2)| 00:00:03 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("LOAD_DATE"<=TO_DATE('2005-12-31 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss') AND "LOAD_DATE">=TO_DATE('2005-01-01 00:00:00',
              'yyyy-mm-dd hh24:mi:ss'))


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1016  consistent gets
          0  physical reads
          0  redo size
    1044322  bytes sent via SQL*Net to client
       2172  bytes received via SQL*Net from client
        168  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      25001  rows processed

rajesh@ORA10GR2> set autotrace off
rajesh@ORA10GR2>

So far the plan looks good. Now, see what happens when we default all NULLs to some big junk values.

rajesh@ORA10GR2> update t
  2  set load_date = to_date('01-jan-9999','dd-mon-yyyy')
  3  where load_date is null;

28608 rows updated.

rajesh@ORA10GR2> commit;

Commit complete.

rajesh@ORA10GR2> exec dbms_stats.gather_table_stats(user,'T');

PL/SQL procedure successfully completed.

rajesh@ORA10GR2> set autotrace traceonly explain statistics
rajesh@ORA10GR2> select * from t
  2  where load_date between
  3  to_date('01-jan-2005','dd-mon-yyyy')
  4  and to_date('31-dec-2005','dd-mon-yyyy') ;

25001 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1376202287

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    72 |  7344 |     6   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |    72 |  7344 |     6   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T_IND |    72 |       |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("LOAD_DATE">=TO_DATE('2005-01-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss') AND "LOAD_DATE"<=TO_DATE('2005-12-31 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss'))


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1687  consistent gets
          0  physical reads
          0  redo size
    1028883  bytes sent via SQL*Net to client
       2172  bytes received via SQL*Net from client
        168  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      25001  rows processed

rajesh@ORA10GR2>

The estimated cardinality deviates from the actual cardinality and the net result is 40% increase in logical IO (which greatly affects the scalability of application).  The expected plan in this case would be Full scan (since this query would get about half the volume of data)

rajesh@ORA10GR2> select /*+ full(t) */ * from t
  2  where load_date between
  3  to_date('01-jan-2005','dd-mon-yyyy')
  4  and to_date('31-dec-2005','dd-mon-yyyy') ;

25001 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    72 |  7344 |   192   (2)| 00:00:03 |
|*  1 |  TABLE ACCESS FULL| T    |    72 |  7344 |   192   (2)| 00:00:03 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("LOAD_DATE"<=TO_DATE('2005-12-31 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss') AND "LOAD_DATE">=TO_DATE('2005-01-01 00:00:00',
              'yyyy-mm-dd hh24:mi:ss'))


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1016  consistent gets
          0  physical reads
          0  redo size
    1044322  bytes sent via SQL*Net to client
       2172  bytes received via SQL*Net from client
        168  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      25001  rows processed

rajesh@ORA10GR2> set autotrace off
rajesh@ORA10GR2>

Indexes also don’t like these kinds of outlier values, as they prevent the efficient 90-10 splits.

rajesh@ORA10GR2>
rajesh@ORA10GR2> create table t(x date,y varchar2(40));

Table created.

rajesh@ORA10GR2> alter table t
  2  add constraint t_pk
  3  primary key(x);

Table altered.

Now populate the table with monotonically increasing values for date column.

rajesh@ORA10GR2> insert into t
  2  select sysdate+rownum,
  3        object_name
  4  from all_objects ;

58392 rows created.

rajesh@ORA10GR2> commit;

Commit complete.

rajesh@ORA10GR2>

Let’s see how compact and efficient our indexes is on this primary key

rajesh@ORA10GR2> exec show_space('T_PK',user,'INDEX');

Unformatted Blocks .....................               0
FS1 Blocks (0-25)  .....................               0
FS2 Blocks (25-50) .....................               3
FS3 Blocks (50-75) .....................               0
FS4 Blocks (75-100).....................               0
Full Blocks        .....................             131
Total Blocks............................             144
Total Bytes.............................       1,179,648
Total MBytes............................               1
Unused Blocks...........................               0
Unused Bytes............................               0
Last Used Ext FileId....................               8
Last Used Ext BlockId...................         145,561
Last Used Block.........................              16

PL/SQL procedure successfully completed.
rajesh@ORA10GR2> analyze index t_pk validate structure;

Index analyzed.

rajesh@ORA10GR2>
rajesh@ORA10GR2> select lf_blks,btree_space,pct_used
  2  from index_stats;

   LF_BLKS BTREE_SPACE   PCT_USED
---------- ----------- ----------
       132     1064032         99

1 row selected.

rajesh@ORA10GR2>

As expected the index is entirely used with no free space as Oracle has been generating 90-10 splits.

Now let’s repeat the demo, but this time introduce junk value before inserting all the datas

rajesh@ORA10GR2> truncate table t;

Table truncated.

rajesh@ORA10GR2> insert into t values(to_date('31-dec-9999','dd-mon-yyyy'), 'HelloWorld');

1 row created.

rajesh@ORA10GR2> commit;

Commit complete.

rajesh@ORA10GR2> insert into t
  2  select sysdate+rownum,
  3        object_name
  4  from all_objects ;

58392 rows created.

rajesh@ORA10GR2> commit;

Commit complete.

rajesh@ORA10GR2> exec show_space('T_PK',user,'INDEX');
Unformatted Blocks .....................               0
FS1 Blocks (0-25)  .....................               0
FS2 Blocks (25-50) .....................               3
FS3 Blocks (50-75) .....................               0
FS4 Blocks (75-100).....................               0
Full Blocks        .....................             257
Total Blocks............................             272
Total Bytes.............................       2,228,224
Total MBytes............................               2
Unused Blocks...........................               0
Unused Bytes............................               0
Last Used Ext FileId....................               8
Last Used Ext BlockId...................         146,185
Last Used Block.........................              16

PL/SQL procedure successfully completed.

rajesh@ORA10GR2> analyze index t_pk validate structure;

Index analyzed.

rajesh@ORA10GR2> select lf_blks,btree_space,pct_used
  2  from index_stats;

   LF_BLKS BTREE_SPACE   PCT_USED
---------- ----------- ----------
       258     2072032         51

1 row selected.

rajesh@ORA10GR2>

Where previously we had a fully utilized index structure, now we are only using 51% of the index.
The other 49% is totally wasted and redundant as this space cannot be used be subsequent inserts,

Unless subsequent deletes totally empty an index leaf block and the block is placed again on the freelist