Outlier values are basically values that sit way way outside
the standard range of a column’s normal value range
Data can be a funny thing and sometimes there are values
that are naturally “exceptional”. However, very commonly, outlier values are
used by applications to represent bizarre default values, to avoid confusion
with legitimate values
Usually, these weird outlier values are used to avoid nulls
values, as nulls can be problematic and cannot be indexed
However, outlier values while (maybe) solving one problem,
can introduce some very significant problems in return.
Firstly, the CBO “hates” outlier values as it potentially
totally screws up the CBO’s selectivity calculations. The selectivity of a
range scan is basically calculated by the CBO to be the number of values in the
range of interest divided by the full range of possible values (IE. the max
value minus the min value). Therefore if this calculation is invalidated by a
massive and disprotionate “hole” in the full range of possible values, the CBO
can get things horribly wrong.
rajesh@ORA10GR2> create table t as
2 select a.*,
3 case when mod(rownum,100) <=50 then
last_ddl_time
4 end as load_date
5 from all_objects a ;
Table created.
rajesh@ORA10GR2> create index t_ind on t(load_date)
nologging;
Index created.
rajesh@ORA10GR2>
rajesh@ORA10GR2> select count(*), count(distinct load_date),
2 count(load_date), max(load_date),
3 min(load_date) from t ;
COUNT(*)
COUNT(DISTINCTLOAD_DATE) COUNT(LOAD_DATE) MAX(LOAD_DATE) MIN(LOAD_DATE)
---------- ------------------------ ----------------
----------------------- -----------------------
58391 4189 29783 18-JUN-2014 01-OCT-2002
1 row selected.
rajesh@ORA10GR2>
rajesh@ORA10GR2> select count(*) from t
2 where load_date between
3 to_date('01-jan-2005','dd-mon-yyyy')
4 and to_date('31-dec-2005','dd-mon-yyyy') ;
COUNT(*)
----------
25001
1 row selected.
rajesh@ORA10GR2> exec
dbms_stats.gather_table_stats(user,'T');
PL/SQL procedure successfully completed.
rajesh@ORA10GR2> set autotrace traceonly explain statistics
rajesh@ORA10GR2>
rajesh@ORA10GR2>
rajesh@ORA10GR2> select * from t
2 where load_date between
3 to_date('01-jan-2005','dd-mon-yyyy')
4 and to_date('31-dec-2005','dd-mon-yyyy') ;
25001 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | | 24581 |
2352K| 193 (2)| 00:00:03 |
|* 1 | TABLE ACCESS FULL| T |
24581 | 2352K| 193
(2)| 00:00:03 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 -
filter("LOAD_DATE"<=TO_DATE('2005-12-31 00:00:00', 'yyyy-mm-dd
hh24:mi:ss')
AND "LOAD_DATE">=TO_DATE('2005-01-01 00:00:00',
'yyyy-mm-dd
hh24:mi:ss'))
Statistics
----------------------------------------------------------
1 recursive calls
0
db block gets
1016 consistent gets
0 physical reads
0 redo size
1044322 bytes sent via SQL*Net to client
2172 bytes received via SQL*Net from client
168 SQL*Net roundtrips to/from client
0
sorts (memory)
0 sorts (disk)
25001 rows processed
rajesh@ORA10GR2> set autotrace off
rajesh@ORA10GR2>
So far the plan looks good. Now, see what happens when we default all
NULLs to some big junk values.
rajesh@ORA10GR2> update t
2 set load_date =
to_date('01-jan-9999','dd-mon-yyyy')
3 where load_date is null;
28608 rows updated.
rajesh@ORA10GR2> commit;
Commit complete.
rajesh@ORA10GR2> exec
dbms_stats.gather_table_stats(user,'T');
PL/SQL procedure successfully completed.
rajesh@ORA10GR2> set autotrace traceonly explain statistics
rajesh@ORA10GR2> select * from t
2 where load_date between
3 to_date('01-jan-2005','dd-mon-yyyy')
4 and to_date('31-dec-2005','dd-mon-yyyy') ;
25001 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1376202287
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows
| Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | |
72 | 7344 |
6 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| T |
72 | 7344 |
6 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T_IND | 72
| | 2
(0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 -
access("LOAD_DATE">=TO_DATE('2005-01-01 00:00:00', 'yyyy-mm-dd
hh24:mi:ss')
AND "LOAD_DATE"<=TO_DATE('2005-12-31 00:00:00', 'yyyy-mm-dd
hh24:mi:ss'))
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
1687 consistent gets
0 physical reads
0 redo size
1028883 bytes sent via SQL*Net to client
2172 bytes received via SQL*Net from client
168 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
25001 rows processed
rajesh@ORA10GR2>
The estimated cardinality deviates from the actual cardinality and the net
result is 40% increase in logical IO (which greatly affects the scalability of application). The expected plan in this case would be Full
scan (since this query would get about half the volume of data)
rajesh@ORA10GR2> select /*+ full(t) */ * from t
2 where load_date between
3 to_date('01-jan-2005','dd-mon-yyyy')
4 and to_date('31-dec-2005','dd-mon-yyyy') ;
25001 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | |
72 | 7344 | 192
(2)| 00:00:03 |
|* 1 | TABLE ACCESS FULL| T |
72 | 7344 | 192
(2)| 00:00:03 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 -
filter("LOAD_DATE"<=TO_DATE('2005-12-31 00:00:00', 'yyyy-mm-dd
hh24:mi:ss')
AND "LOAD_DATE">=TO_DATE('2005-01-01 00:00:00',
'yyyy-mm-dd
hh24:mi:ss'))
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
1016 consistent gets
0 physical reads
0 redo size
1044322 bytes sent via SQL*Net to client
2172 bytes received via SQL*Net from client
168 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
25001 rows processed
rajesh@ORA10GR2> set autotrace off
rajesh@ORA10GR2>
Indexes also don’t like these kinds of outlier values, as they prevent
the efficient 90-10 splits.
rajesh@ORA10GR2>
rajesh@ORA10GR2> create table t(x date,y varchar2(40));
Table created.
rajesh@ORA10GR2> alter table t
2 add constraint t_pk
3 primary key(x);
Table altered.
Now populate the table with monotonically increasing values for date
column.
rajesh@ORA10GR2> insert into t
2 select sysdate+rownum,
3 object_name
4 from all_objects ;
58392 rows created.
rajesh@ORA10GR2> commit;
Commit complete.
rajesh@ORA10GR2>
Let’s see how compact and efficient our indexes is on this primary key
rajesh@ORA10GR2> exec show_space('T_PK',user,'INDEX');
Unformatted Blocks ..................... 0
FS1 Blocks (0-25)
.....................
0
FS2 Blocks (25-50) ..................... 3
FS3 Blocks (50-75) ..................... 0
FS4 Blocks (75-100)..................... 0
Full Blocks
.....................
131
Total Blocks............................ 144
Total Bytes............................. 1,179,648
Total
MBytes............................
1
Unused Blocks........................... 0
Unused Bytes............................ 0
Last Used Ext FileId.................... 8
Last Used Ext BlockId................... 145,561
Last Used Block......................... 16
PL/SQL procedure successfully completed.
rajesh@ORA10GR2> analyze index t_pk validate structure;
Index analyzed.
rajesh@ORA10GR2>
rajesh@ORA10GR2> select lf_blks,btree_space,pct_used
2 from index_stats;
LF_BLKS
BTREE_SPACE PCT_USED
---------- ----------- ----------
132 1064032 99
1 row selected.
rajesh@ORA10GR2>
As expected the index is entirely used with no free space as Oracle has
been generating 90-10 splits.
Now let’s repeat the demo, but this time introduce junk value before
inserting all the datas
rajesh@ORA10GR2> truncate table t;
Table truncated.
rajesh@ORA10GR2> insert into t values(to_date('31-dec-9999','dd-mon-yyyy'),
'HelloWorld');
1 row created.
rajesh@ORA10GR2> commit;
Commit complete.
rajesh@ORA10GR2> insert into t
2 select sysdate+rownum,
3 object_name
4 from all_objects ;
58392 rows created.
rajesh@ORA10GR2> commit;
Commit complete.
rajesh@ORA10GR2> exec show_space('T_PK',user,'INDEX');
Unformatted Blocks ..................... 0
FS1 Blocks (0-25)
.....................
0
FS2 Blocks (25-50) ..................... 3
FS3 Blocks (50-75) ..................... 0
FS4 Blocks (75-100)..................... 0
Full Blocks
.....................
257
Total Blocks............................ 272
Total Bytes............................. 2,228,224
Total
MBytes............................
2
Unused Blocks........................... 0
Unused Bytes............................ 0
Last Used Ext FileId.................... 8
Last Used Ext BlockId................... 146,185
Last Used Block......................... 16
PL/SQL procedure successfully completed.
rajesh@ORA10GR2> analyze index t_pk validate structure;
Index analyzed.
rajesh@ORA10GR2> select lf_blks,btree_space,pct_used
2 from index_stats;
LF_BLKS
BTREE_SPACE PCT_USED
---------- ----------- ----------
258 2072032 51
1 row selected.
rajesh@ORA10GR2>
Where previously we had a fully utilized index structure, now we are
only using 51% of the index.
The other 49% is totally wasted and redundant as this space cannot be
used be subsequent inserts,
Unless subsequent deletes totally empty an index leaf block and the
block is placed again on the freelist
No comments:
Post a Comment