The clustering
factor (CF) is the important index related statistics, with the efficiency
of index performing range scans very much is depended on CF of the index. If
the data in the table is very well clustered in relation to the index, then
index range scan has to visit few table blocks to obtain necessary data. If the
data is effectively randomized and not well clustered in relation to the index,
then index index range scan has to visit more table blocks to obtain necessary
data. The CBO will be less included to use such an index as a result depending
on the overall selectivity of the query.
It is very similar story for Exadata storage indexes (SI) as well. The
better the table data is clustered in relation to the SI’s, the more efficient
and effective the SI’s are likely to be in relation to being able eliminate
accessing storage region that can’t possibly contain data of interest. By
having the data more clustered (or ordered) in relation to specific SI, the
min/max ranges associated with SI are more likely to be able to determine area
of table where data can’t exists.
c##rajesh@PDB1>
create table t
2 as
3
select rownum as id, trunc( rownum/100000 )+1 as album_id,
4
mod( rownum,100000 ) artist_id,
5
ceil( dbms_random.value(0,5) )*2 as format_id,
6 trunc(sysdate - ceil(
dbms_random.value(0,10000) )) as relase_date,
7
ceil( dbms_random.value(0,500000) ) as total_sales,
8
rpad('some really big string',80) data
9 from
big_table
10
where rownum <= 10000000;
Table created.
c##rajesh@PDB1>
create index t_album_id_idx on t( album_id ) nologging;
Index created.
c##rajesh@PDB1>
create index t_total_sales_idx on t( total_sales ) nologging;
Index created.
c##rajesh@PDB1>
col index_name format a20
c##rajesh@PDB1>
select i.index_name, i.clustering_factor,i.num_rows,t.blocks
2 from
user_indexes i,
3
user_tables t
4
where i.table_name = t.table_name
5 and
t.table_name ='T';
INDEX_NAME CLUSTERING_FACTOR NUM_ROWS
BLOCKS
--------------------
----------------- ---------- ----------
T_ALBUM_ID_IDX 161259 10000000
161903
T_TOTAL_SALES_IDX 9999399 10000000
161903
c##rajesh@PDB1>
The following query is on the album_id column that has an excellent CF.
c##rajesh@PDB1>
@script03a.sql
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------
SQL_ID 1cd1u6mdr6ap4, child number 1
-------------------------------------
select * from t
where album_id = 42
Plan hash
value: 4221737049
----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time
| Buffers |
----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 100K|00:00:00.05 | 3123 |
| 1 |
TABLE ACCESS BY INDEX ROWID BATCHED| T | 1 |
99010 | 100K|00:00:00.05 | 3123 |
|* 2 |
INDEX RANGE SCAN
| T_ALBUM_ID_IDX | 1 | 99010 |
100K|00:00:00.02 | 863 |
----------------------------------------------------------------------------------------------------------------
Predicate
Information (identified by operation id):
---------------------------------------------------
2 - access("ALBUM_ID"=42)
The index range scan access path is selected by CBO to retrieve 100K
rows, with just 3000 consistent gets. However if we run the same query on
Exadata platform for storage index to kick in
c##rajesh@PDB1>
alter index t_album_id_idx invisible;
Index altered.
c##rajesh@PDB1>
select s.name,m.value
2 from
v$statname s ,
3
v$mystat m
4
where s.statistic# = m.statistic#
5 and
s.name in ('cell physical IO bytes saved by storage index',
6
'cell physical IO interconnect bytes returned by smart scan' );
NAME
VALUE
------------------------------------------------------------
----------------
cell physical
IO bytes saved by storage index 0
cell physical
IO interconnect bytes returned by smart scan 0
c##rajesh@PDB1>
@script03a.sql
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------
SQL_ID 1cd1u6mdr6ap4, child number 1
-------------------------------------
select * from t
where album_id = 42
Plan hash
value: 1601196873
-----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |
A-Rows | A-Time | Buffers | Reads |
-----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT |
| 1 | |
100K|00:00:00.23 | 161K| 161K|
|* 1 |
TABLE ACCESS STORAGE FULL| T
| 1 | 99010 |
100K|00:00:00.23 | 161K| 161K|
-----------------------------------------------------------------------------------------------------
Predicate
Information (identified by operation id):
---------------------------------------------------
1 - storage("ALBUM_ID"=42)
filter("ALBUM_ID"=42)
19 rows
selected.
c##rajesh@PDB1>
select s.name,m.value
2 from
v$statname s ,
3
v$mystat m
4
where s.statistic# = m.statistic#
5 and
s.name in ('cell physical IO bytes saved by storage index',
6
'cell physical IO interconnect bytes returned by smart scan' );
NAME
VALUE
------------------------------------------------------------
----------------
cell
physical IO bytes saved by storage index 1306501120
cell physical
IO interconnect bytes returned by smart scan 2245248
c##rajesh@PDB1>
We notice that full scan is being performed; storage indexes are
kicking in here and by saving 1306 MB. Storage indexes based on album_id column
are therefore very effective at eliminating the access of un-necessary data.
If we now run a query based on “total_sales” column in which the data
is randomly distributed all over the place and so the associated index have a
very poor CF:
c##rajesh@PDB1>
@script03b.sql
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------
SQL_ID 5qpmchf457gjc, child number 0
-------------------------------------
select * from t
where total_sales between 42 and 142
Plan hash
value: 3620091251
-------------------------------------------------------------------------------------------------------------------
| Id | Operation
|
Name | Starts | E-Rows |
A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1975 |00:00:00.01 | 1996 |
| 1 |
TABLE ACCESS BY INDEX ROWID BATCHED| T | 1 |
2039 | 1975 |00:00:00.01 | 1996 |
|* 2 |
INDEX RANGE SCAN
| T_TOTAL_SALES_IDX | 1 |
2039 | 1975 |00:00:00.01 | 21 |
-------------------------------------------------------------------------------------------------------------------
Predicate
Information (identified by operation id):
---------------------------------------------------
2 - access("TOTAL_SALES">=42
AND "TOTAL_SALES"<=142)
We noticed that only 1975 rows returned and 1996 consistent gets have
been performed (practically 1 for each row returned) this is somewhat more than
3123 consistent gets from pervious example that returned 100K rows.
If we run the same query on Exadata environment then
c##rajesh@PDB1>
alter index t_total_sales_idx invisible;
Index altered.
c##rajesh@PDB1>
select s.name,m.value
2 from
v$statname s ,
3
v$mystat m
4
where s.statistic# = m.statistic#
5 and
s.name in ('cell physical IO bytes saved by storage index',
6
'cell physical IO interconnect bytes returned by smart scan' );
NAME
VALUE
------------------------------------------------------------
----------------
cell physical
IO bytes saved by storage index 0
cell physical
IO interconnect bytes returned by smart scan 0
c##rajesh@PDB1>
@script03b.sql
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------
SQL_ID 5qpmchf457gjc, child number 0
-------------------------------------
select * from t
where total_sales between 42 and 142
Plan hash
value: 1601196873
-----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |
A-Rows | A-Time | Buffers | Reads |
-----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT |
| 1 | |
1975 |00:00:00.09 | 161K| 161K|
|* 1 |
TABLE ACCESS STORAGE FULL| T
| 1 | 2039 |
1975 |00:00:00.09 | 161K| 161K|
-----------------------------------------------------------------------------------------------------
Predicate
Information (identified by operation id):
---------------------------------------------------
1 - storage(("TOTAL_SALES"<=142
AND "TOTAL_SALES">=42))
filter(("TOTAL_SALES"<=142
AND "TOTAL_SALES">=42))
19 rows
selected.
c##rajesh@PDB1>
select s.name,m.value
2 from
v$statname s ,
3
v$mystat m
4
where s.statistic# = m.statistic#
5 and
s.name in ('cell physical IO bytes saved by storage index',
6
'cell physical IO interconnect bytes returned by smart scan' );
NAME
VALUE
------------------------------------------------------------
----------------
cell physical
IO bytes saved by storage index 122060800
cell physical
IO interconnect bytes returned by smart scan 1661880
c##rajesh@PDB1>
we noticed that amount of Physical IO saved by storage index has
significantly reduced from previous execution (just 122MB down to 1306 MB), the
poor clustering of data in relation to “total_sales” column has effectively
neutralized the effectiveness of the associated SI on “total_sales” column.