THE RAJESH BLOG: Top Frequency histogram in 12c

Top Frequency histogram was newly introduced in 12c and it is a variation on a Frequency histogram that ignores non-popular values that are statistically insignificant. A top frequency histogram can produce better histogram for highly popular values.

If a small number of values occupies most of the rows, then creating a Frequency histogram on that small set of values is useful even when the NDV (Number of Distinct Values) is greater than the requested number of histogram buckets. To create a better quality histogram for popular values, the optimizer ignores the non-popular values and creates a frequency histogram.

The database creates a Top frequency histogram, when the following criteria are met.

NDV is greater than n, where n is the requested number of buckets (default 254)
The percentage of rows occupied by Top-frequent values is greater than or equal to the threshold p where p is (1-(1/n) * 100).
The estimate_percent parameter in dbms_stats gathering procedure should be auto_sample_size (set to default)

rajesh@ORA12C> create table t as select * from all_objects;

Table created.

rajesh@ORA12C> select num_rows from user_tables where table_name ='T';

NUM_ROWS

----------

90215

1 row selected.

rajesh@ORA12C> select column_name,num_distinct,density,histogram

2 fromuser_tab_col_statistics

3 wheretable_name ='T'

4 andcolumn_name ='OWNER';

COLUMN_NAM NUM_DISTINCT DENSITY HISTOGRAM

---------- ------------ ---------- ---------------

OWNER 33 .03030303 NONE

1 row selected.

rajesh@ORA12C> select owner,count(*)

2 from t

3 group by owner

4 order by 2 desc;

OWNER COUNT(*)

-------------------- ----------

SYS 41882

PUBLIC 37076

ORDSYS 3157

APEX_040200 3025

MDSYS 1647

SYSTEM 589

XDB 401

CTXSYS 401

WMSYS 371

SH 300

DVSYS 292

ORDDATA 274

LBACSYS 237

OE 112

GSMADMIN_INTERNAL 104

DBSNMP 55

IX 48

RAJESH 38

HR 34

OLAPSYS 25

DEMO 25

OJVMSYS 22

DVF 19

FLOWS_FILES 11

ORDPLUGINS 10

PM 10

AUDSYS 9

OUTLN 8

SI_INFORMTN_SCHEMA 8

BI 8

SCOTT 6

ORACLE_OCM 6

APPQOSSYS 5

33 rows selected.

rajesh@ORA12C> begin

2 dbms_stats.gather_table_stats

3 (ownname=>user,

4 tabname=>'T',

5 method_opt=>'for columns owner size 10’);

6 end;

7 /

PL/SQL procedure successfully completed.

rajesh@ORA12C>

So during stats gathering we requested for histogram on the column OWNER with bucket size of 10.

OWNER column has NDV = 33 and the requested bucket size is 10. So NDV > N (where N is the number of buckets requested)
The Top 10 values of the column OWNER corresponds to 98% rows in the table, which is greater than the arrived threshold value is ( (1 – (1/10))*100) = 90%, so obviously the rows occupied by Top-N dominates here.
The default sample size is being used for this histogram generation

Since it matches all the above three conditions, we end up with Top-Frequency histogram.

rajesh@ORA12C> select column_name,num_distinct,density,histogram

2 fromuser_tab_col_statistics

3 wheretable_name ='T'

4 andcolumn_name ='OWNER';

COLUMN_NAM NUM_DISTINCT DENSITY HISTOGRAM

---------- ------------ ---------- ---------------

OWNER 33 5.5423E-06 TOP-FREQUENCY

1 row selected.

rajesh@ORA12C>

rajesh@ORA12C> column endpoint_actual_value format a15

rajesh@ORA12C> select endpoint_number,endpoint_actual_value

2 fromuser_tab_histograms

3 wheretable_name ='T'

4 andcolumn_name ='OWNER';

ENDPOINT_NUMBER ENDPOINT_ACTUAL

--------------- ---------------

3025 APEX_040200

3426 CTXSYS

5073 MDSYS

8230 ORDSYS

45306 PUBLIC

45606 SH

87488 SYS

88077 SYSTEM

88448 WMSYS

88849 XDB

10 rows selected.

As per the documentation, the cardinality algorithm for popular values were.

Cardinality = (Num_of_rows *Num_of_endpoints_spanned) / Total_num_of_endpoints.

rajesh@ORA12C> select count(*) from t where owner ='MDSYS';

Execution Plan

----------------------------------------------------------

Plan hash value: 2966233522

---------------------------------------------------------------------------

---------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | 6 | 480 (13)| 00:00:01 |

| 1 | SORT AGGREGATE | | 1 | 6 | | |

|* 2 | TABLE ACCESS FULL| T | 1647 | 9882 | 480 (13)| 00:00:01 |

---------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - filter("OWNER"='MDSYS')

So taking the value of OWNER = MDSYS into consideration and applying the above cardinality formula produces this result.

Cardinality = ((5073 - 3426) * 90215)/88847 = 1672 (that pretty much matches up with the estimated cardinality)

But in case of non-popular values, the formula from the documentation is

Cardinality = Num_of_rows * Density, applying this formula to our data sets produces the result as 0.5 – rounded off to the nearestinteger the estimated cardinality should be one.

rajesh@ORA12C> select num_distinct,density, density*90215 the_guess

2 fromuser_tab_col_statistics

3 wheretable_name ='T'

4 andcolumn_name ='OWNER' ;

NUM_DISTINCT DENSITY THE_GUESS

------------ ---------- ----------

33 5.5423E-06 .5

1 row selected.

rajesh@ORA12C>

But the explain plan for the non-popular values reported the estimated cardinality as 59, which seems far away from our estimated calculations.

rajesh@ORA12C> select count(*) from t where owner ='SCOTT';

Execution Plan

----------------------------------------------------------

Plan hash value: 2966233522

---------------------------------------------------------------------------

---------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | 6 | 480 (13)| 00:00:01 |

| 1 | SORT AGGREGATE | | 1 | 6 | | |

|* 2 | TABLE ACCESS FULL| T | 59 | 354 | 480 (13)| 00:00:01 |

---------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - filter("OWNER"='SCOTT')

So looking into 10053 trace, showed that rather than using density from user_tab_col_statistics, optimizer used a different values called “NewDensity” to produce the cardinality.

=====================================

Access path analysis for T

***************************************

SINGLE TABLE ACCESS PATH

Single Table Cardinality Estimation for T[T]

SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE

Column (#1):

NewDensity:0.000658, OldDensity:0.000006 BktCnt:88849.000000, PopBktCnt:88849.000000, PopValCnt:10, NDV:33

Column (#1): OWNER(VARCHAR2)

AvgLen: 6 NDV: 33 Nulls: 0 Density: 0.000658

Histogram: Top-Freq #Bkts: 88849 UncompBkts: 88849 EndPtVals: 10 ActualVal: yes

Table: T Alias: T

Card: Original: 90215.000000 Rounded: 59 Computed: 59.391304 Non Adjusted: 59.391304

So applying the formula as Cardinality = NewDensity * Num_of_rows = 0.000658 * 90215 = 59.3 (rounded to the nearest integer produced the value as 59) – that matches up with the estimated cardinality shown up in the explain plan.

But how does the optimizer arrived this NewDensity value shown up in 10053 Trace?

NewDensity = sum_of_unpopular_value_occurances/ (unpopular NDV * Num_of_rows )

Applying this to our data set resulted in this.

NewDensity = (292 + 274 + 237 + 112 + 104 + 55 + 48 + 38 + 34 + 25 + 25 + 22 + 19 + 11 + 10 + 10 + 9 + 8 + 8 + 8 + 6 + 6 + 5 ) / ( 23*90215) = 0.000658

Tuesday, June 28, 2016

Top Frequency histogram in 12c

No comments:

Post a Comment