Where there is a high degree of skew in the column distribution, called a non-uniform distribution of data, histograms should lead to a better estimation of selectivity. This should produce plans that are more likely to be optimal. The histogram approach provides an efficient and compact way to represent data distributions.
When building histograms the information it stores is interpreted differently depending on whether the number of buckets requested is less than the number distinct values or if it is the same.
rajesh@9IR2> create table t
2 as
3 select rownum as a,
4 case when rownum <=9 then rownum
5 else 99 end as b
6 from dual
7 connect by level <=10000;
Table created.
Elapsed: 00:00:00.03
rajesh@9IR2>
The column 'a' has unique values from 1 to 10000.
The column 'b' has values from 1 to 9 occurs only once and '99' occurs 9991 times. (massively skewed)
select * from t where b=99;
select * from t where b=3;
Both the above queries use FULL TABLE SCAN as there is no other access path available.
rajesh@9IR2> set autotrace traceonly explain;
rajesh@9IR2> select * from t where b=99;
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (FULL) OF 'T'
rajesh@9IR2> select * from t where b=3;
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (FULL) OF 'T'
rajesh@9IR2>
rajesh@9IR2> set autotrace off;
lets create Index on this massively skewed column.
rajesh@9IR2> create index t_ind on t(b) nologging;
Index created.
Elapsed: 00:00:00.03
rajesh@9IR2>
rajesh@9IR2> select table_name, column_name, nullable, num_buckets, num_distinct
2 from user_tab_columns
3 where table_name ='T'
4 /
TABLE_NAME COLUMN_NAME N NUM_BUCKETS NUM_DISTINCT
-------------------- -------------------- - ----------- ------------
T A Y
T B Y
Elapsed: 00:00:00.01
rajesh@9IR2>
With index present query(2) should use INDEX RANGE SCAN and FULL TABLE SCAN for query(1).
rajesh@9IR2> set autotrace traceonly explain;
rajesh@9IR2> select * from t where b=99;
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T'
2 1 INDEX (RANGE SCAN) OF 'T_IND' (NON-UNIQUE)
rajesh@9IR2> select * from t where b=3;
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T'
2 1 INDEX (RANGE SCAN) OF 'T_IND' (NON-UNIQUE)
rajesh@9IR2>
rajesh@9IR2> set autotrace off;
Both do an INDEX RANGE SCAN to get the ROWID to do a lookup in the table 'T'.
Analyzing the table using DBMS_STATS API shows the following information in data dictionary.
rajesh@9IR2> exec dbms_stats.gather_table_stats(user,'T');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.04
rajesh@9IR2>
rajesh@9IR2> select table_name, column_name, nullable, num_buckets, num_distinct
2 from user_tab_columns
3 where table_name ='T'
4 /
TABLE_NAME COLUMN_NAME N NUM_BUCKETS NUM_DISTINCT
-------------------- -------------------- - ----------- ------------
T A Y 1 10000
T B Y 1 10
Elapsed: 00:00:00.31
rajesh@9IR2>
for Histograms it is,
rajesh@9IR2> SELECT table_name, column_name, endpoint_number, endpoint_value
2 FROM user_tab_histograms
3 WHERE table_name ='T'
4 order by column_name,endpoint_number
5 /
TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
-------------------- -------------------- --------------- --------------
T A 0 1
T A 1 10000
T B 0 1
T B 1 99
Elapsed: 00:00:00.04
rajesh@9IR2>
Analyze has created 1 bucket ( since the DEFAULT value for method_opt parameter is 'FOR ALL COLUMNS SIZE 1' ) for each column, so all values for the column are in same bucket. ENDPOINT_NUMBER represents Bucket Number and ENDPOINT_VALUE represents last column value present in the bucket.
Now both query ( query(1) & query(2) ) will do a FULL TABLE SCAN.
rajesh@9IR2> set autotrace traceonly explain;
rajesh@9IR2> select * from t where b=99;
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=1000 Bytes=6000)
1 0 TABLE ACCESS (FULL) OF 'T' (Cost=3 Card=1000 Bytes=6000)
rajesh@9IR2> select * from t where b=3;
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=1000 Bytes=6000)
1 0 TABLE ACCESS (FULL) OF 'T' (Cost=3 Card=1000 Bytes=6000)
rajesh@9IR2>
rajesh@9IR2> set autotrace off;
So, the fact that you have statistics about the table and columns does not help the optimizer to distinguish between how many of each value we have. The reason it does a FULL TABLE SCAN is because there is a 1 BUCKET histogram and any value selected for should be in that bucket.
Now we need to create Histograms (based on the Skewed data) that will be really help full to the optimizer to come up with Optimal plan. ( SIZE can be given as SIZE 10 if you know exact data values or AUTO - Will helps Oracle determines the columns to collect histograms based on data distribution and the workload of the columns )
rajesh@9IR2> EXEC dbms_stats.gather_table_stats(ownname=>user,tabname=>'T',method_opt=>'FOR ALL COLUMNS SIZE AUTO');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.03
rajesh@9IR2>
rajesh@9IR2>
rajesh@9IR2> select table_name, column_name, nullable, num_buckets, num_distinct
2 from user_tab_columns
3 where table_name ='T'
4 /
TABLE_NAME COLUMN_NAME N NUM_BUCKETS NUM_DISTINCT
-------------------- -------------------- - ----------- ------------
T A Y 1 10000
T B Y 9 10
Elapsed: 00:00:00.01
rajesh@9IR2>
rajesh@9IR2> SELECT table_name, column_name, endpoint_number, endpoint_value
2 FROM user_tab_histograms
3 WHERE table_name ='T'
4 order by column_name,endpoint_number
5 /
TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
-------------------- -------------------- --------------- --------------
T A 0 1
T A 1 10000
T B 1 1
T B 2 2
T B 3 3
T B 4 4
T B 5 5
T B 6 6
T B 7 7
T B 8 8
T B 9 9
T B 10000 99
12 rows selected.
Elapsed: 00:00:00.37
rajesh@9IR2>
Analyze has created 1 bucket for column 'A' having all 10000 unique values in one bucket and 9 buckets for column 'B'.
The ENDPOINT_VALUE shows the column value and the ENDPOINT_NUMBER shows the cumulative number of rows.
For example, for ENDPOINT_VALUE 2 for column 'B' , it has an ENDPOINT_NUMBER 2, the previous ENDPOINT_NUMBER is 1, hence the number of rows with value 2 is 1.
Simillarly ENDPOINT_VALUE 99. Its ENDPOINT_NUMBER is 10000, The previous bucket ENDPOINT_NUMBER is 9, so 10000 - 9 = 9991 rows containing value of '99'.
Now the Query(1) will do a FULL TABLE SCAN and Query(2) will do a INDEX RANGE SCAN like below.
rajesh@9IR2> set autotrace traceonly explain;
rajesh@9IR2> select * from t where b=99;
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=9991 Bytes=59946)
1 0 TABLE ACCESS (FULL) OF 'T' (Cost=3 Card=9991 Bytes=59946)
rajesh@9IR2> select * from t where b=3;
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=6)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=2 Card=1 Bytes=6)
2 1 INDEX (RANGE SCAN) OF 'T_IND' (NON-UNIQUE) (Cost=1 Card=1)
rajesh@9IR2>
rajesh@9IR2>
This is fine, if we have a low distinct values, but there can be tables with huge number of distinct values, you dont want to create a bucket for each value.
rajesh@9IR2> EXEC dbms_stats.gather_table_stats(ownname=>user,tabname=>'T',method_opt=>'FOR ALL COLUMNS SIZE 5');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.14
rajesh@9IR2>
rajesh@9IR2>
rajesh@9IR2> select table_name, column_name, nullable, num_buckets, num_distinct
2 from user_tab_columns
3 where table_name ='T'
4 /
TABLE_NAME COLUMN_NAME N NUM_BUCKETS NUM_DISTINCT
-------------------- -------------------- - ----------- ------------
T A Y 5 10000
T B Y 1 10
Elapsed: 00:00:00.03
rajesh@9IR2>
rajesh@9IR2>
rajesh@9IR2> SELECT table_name, column_name, endpoint_number, endpoint_value
2 FROM user_tab_histograms
3 WHERE table_name ='T'
4 order by column_name,endpoint_number
5 /
TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
-------------------- -------------------- --------------- --------------
T A 0 1
T A 1 2000
T A 2 4000
T A 3 6000
T A 4 8000
T A 5 10000
T B 0 1
T B 5 99
8 rows selected.
Elapsed: 00:00:00.04
rajesh@9IR2>
rajesh@9IR2>
Now we have 5 buckets for column 'A' where data ( 10000 / 5 = 2000 ) gets equially distributed.
rajesh@9IR2> select 10000 / 5 from dual;
10000/5
----------
2000
For column 'B', bucket '0' holds the lowest value. you cannot see the buckets 1 to bucket 4 so as to save space.
rajesh@9IR2> set autotrace traceonly explain;
rajesh@9IR2> select * from t where b=99;
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=9000 Bytes=54000)
1 0 TABLE ACCESS (FULL) OF 'T' (Cost=3 Card=9000 Bytes=54000)
rajesh@9IR2> select * from t where b=3;
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=6)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=2 Card=1 Bytes=6)
2 1 INDEX (RANGE SCAN) OF 'T_IND' (NON-UNIQUE) (Cost=1 Card=1)
rajesh@9IR2>
rajesh@9IR2> set autotrace off;
Oracle 10g Optimizer is pretty good in handling Histograms. Repeating the same snippet in 10GR2 database I see ( since the DEFAULT value for method_opt parameter is 'FOR ALL COLUMNS SIZE AUTO' )
(For demonstration i disable dynamic sampling at session level, to show how thing keep going )
rajesh@10GR2> alter session set optimizer_dynamic_sampling=0;
Session altered.
Elapsed: 00:00:00.01
rajesh@10GR2>
rajesh@10GR2> create table t
2 as
3 select rownum as a,
4 case when rownum <=9 then rownum
5 else 99 end as b
6 from dual
7 connect by level <=10000;
Table created.
Elapsed: 00:00:00.35
rajesh@10GR2> set autotrace traceonly explain;
rajesh@10GR2> select * from t where b=99;
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 17 | 442 | 6 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 17 | 442 | 6 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("B"=99)
rajesh@10GR2> select * from t where b=3;
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 17 | 442 | 6 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 17 | 442 | 6 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("B"=3)
rajesh@10GR2>
Both the above queries would use a FULL TABLE SCAN as there is no other access method available.
rajesh@10GR2> create index t_ind on t(b) nologging;
Index created.
Elapsed: 00:00:00.12
rajesh@10GR2>
rajesh@10GR2> set autotrace off;
rajesh@10GR2>
rajesh@10GR2> select table_name, column_name, nullable, num_buckets, num_distinct
2 from user_tab_columns
3 where table_name ='T'
4 /
TABLE_NAME COLUMN_NAME N NUM_BUCKETS NUM_DISTINCT
------------------------------ ------------------------------ - ----------- ------------
T A Y
T B Y
Elapsed: 00:00:00.04
rajesh@10GR2>
rajesh@10GR2> set autotrace traceonly explain;
rajesh@10GR2> select * from t where b=99;
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 1376202287
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 17 | 442 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| T | 17 | 442 | 3 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T_IND | 7 | | 2 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("B"=99)
rajesh@10GR2> select * from t where b=3;
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 1376202287
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 17 | 442 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| T | 17 | 442 | 3 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T_IND | 7 | | 2 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("B"=3)
rajesh@10GR2>
rajesh@10GR2> SELECT table_name, column_name, endpoint_number, endpoint_value
2 FROM user_tab_histograms
3 WHERE table_name ='T'
4 /
no rows selected
Elapsed: 00:00:00.01
rajesh@10GR2>
rajesh@10GR2>
'method_opt' parameter in gather_table_stats method has default value of FOR ALL COLUMNS SIZE AUTO - this helps Oracle determines the columns to collect histograms based on data distribution.
rajesh@10GR2> exec dbms_stats.gather_table_stats(user,'T');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.37
rajesh@10GR2>
rajesh@10GR2>
rajesh@10GR2> select table_name, column_name, nullable, num_buckets, num_distinct
2 from user_tab_columns
3 where table_name ='T'
4 /
TABLE_NAME COLUMN_NAME N NUM_BUCKETS NUM_DISTINCT
-------------------- -------------------- - ----------- ------------
T A Y 1 10000
T B Y 10 10
Elapsed: 00:00:00.01
rajesh@10GR2>
rajesh@10GR2>
rajesh@10GR2> SELECT table_name, column_name, endpoint_number, endpoint_value
2 FROM user_tab_histograms
3 WHERE table_name ='T'
4 order by column_name,endpoint_number
5 /
TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
-------------------- -------------------- --------------- --------------
T A 0 1
T A 1 10000
T B 1 1
T B 2 2
T B 3 3
T B 4 4
T B 5 5
T B 6 6
T B 7 7
T B 8 8
T B 9 9
T B 10000 99
12 rows selected.
Elapsed: 00:00:00.07
rajesh@10GR2>
Now the query(1) does Full Table scan (due to massively skewed data on column 'b') and query(1) does an Index Range scan.
rajesh@10GR2> set autotrace traceonly explain;
rajesh@10GR2> select * from t where b=99;
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9991 | 59946 | 6 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 9991 | 59946 | 6 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("B"=99)
rajesh@10GR2> select * from t where b=3;
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 1376202287
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| T | 1 | 6 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T_IND | 1 | | 1 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("B"=3)
rajesh@10GR2>
rajesh@10GR2>
No comments:
Post a Comment