Tuesday, June 28, 2016

Top Frequency histogram in 12c

Top Frequency histogram was newly introduced in 12c and it is a variation on a Frequency histogram that ignores non-popular values that are statistically insignificant.  A top frequency histogram can produce better histogram for highly popular values.

If a small number of values occupies most of the rows, then creating a Frequency histogram on that small set of values is useful even when the NDV (Number of Distinct Values) is greater than the requested number of histogram buckets. To create a better quality histogram for popular values, the optimizer ignores the non-popular values and creates a frequency histogram.

The database creates a Top frequency histogram, when the following criteria are met.
  • NDV is greater than n, where n is the requested number of buckets (default 254)
  • The percentage of rows occupied by Top-frequent values is greater than or equal to the threshold p where p is (1-(1/n) * 100).
  • The estimate_percent parameter in dbms_stats gathering procedure should be auto_sample_size (set to default)

rajesh@ORA12C> create table t as select * from all_objects;

Table created.

rajesh@ORA12C> select num_rows from user_tables where table_name ='T';

  NUM_ROWS
----------
     90215

1 row selected.

rajesh@ORA12C> select column_name,num_distinct,density,histogram
2  fromuser_tab_col_statistics
3  wheretable_name ='T'
4  andcolumn_name ='OWNER';

COLUMN_NAM NUM_DISTINCT    DENSITY HISTOGRAM
---------- ------------ ---------- ---------------
OWNER                33  .03030303 NONE

1 row selected.

rajesh@ORA12C> select owner,count(*)
2  from t
3  group by owner
4  order by 2 desc;

OWNER                  COUNT(*)
-------------------- ----------
SYS                       41882
PUBLIC                    37076
ORDSYS                     3157
APEX_040200                3025
MDSYS                      1647
SYSTEM                      589
XDB                         401
CTXSYS                      401
WMSYS                       371
SH                          300
DVSYS                       292
ORDDATA                     274
LBACSYS                     237
OE                          112
GSMADMIN_INTERNAL           104
DBSNMP                       55
IX                           48
RAJESH                       38
HR                           34
OLAPSYS                      25
DEMO                         25
OJVMSYS                      22
DVF                          19
FLOWS_FILES                  11
ORDPLUGINS                   10
PM                           10
AUDSYS                        9
OUTLN                         8
SI_INFORMTN_SCHEMA            8
BI                            8
SCOTT                         6
ORACLE_OCM                    6
APPQOSSYS                     5

33 rows selected.

rajesh@ORA12C> begin
  2     dbms_stats.gather_table_stats
  3             (ownname=>user,
  4             tabname=>'T',
  5             method_opt=>'for columns owner size 10’);
  6  end;
  7  /

PL/SQL procedure successfully completed.

rajesh@ORA12C>

So during stats gathering we requested for histogram on the column OWNER with bucket size of 10.

  • OWNER column has NDV = 33 and the requested bucket size is 10. So NDV > N (where N is the number of buckets requested)
  • The Top 10 values of the column OWNER corresponds to 98% rows in the table, which is greater than the arrived threshold value is ( (1 – (1/10))*100) =  90%, so obviously the rows occupied by Top-N dominates here.
  • The default sample size is being used for this histogram generation

Since it matches all the above three conditions, we end up with Top-Frequency histogram.

rajesh@ORA12C> select column_name,num_distinct,density,histogram
2  fromuser_tab_col_statistics
3  wheretable_name ='T'
4  andcolumn_name ='OWNER';

COLUMN_NAM NUM_DISTINCT    DENSITY HISTOGRAM
---------- ------------ ---------- ---------------
OWNER                33 5.5423E-06 TOP-FREQUENCY

1 row selected.

rajesh@ORA12C>
rajesh@ORA12C> column endpoint_actual_value format a15
rajesh@ORA12C> select endpoint_number,endpoint_actual_value
2  fromuser_tab_histograms
3  wheretable_name ='T'
4  andcolumn_name ='OWNER';

ENDPOINT_NUMBER ENDPOINT_ACTUAL
--------------- ---------------
           3025 APEX_040200
           3426 CTXSYS
           5073 MDSYS
           8230 ORDSYS
          45306 PUBLIC
          45606 SH
          87488 SYS
          88077 SYSTEM
          88448 WMSYS
          88849 XDB

10 rows selected.

As per the documentation, the cardinality algorithm for popular values were.

Cardinality = (Num_of_rows  *Num_of_endpoints_spanned) /  Total_num_of_endpoints.

rajesh@ORA12C> select count(*) from t where owner ='MDSYS';

Execution Plan
----------------------------------------------------------
Plan hash value: 2966233522

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     6 |   480  (13)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |  1647 |  9882 |   480  (13)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OWNER"='MDSYS')

So taking the value of OWNER = MDSYS into consideration and applying the above cardinality formula produces this result.

Cardinality = ((5073 - 3426) * 90215)/88847 = 1672 (that pretty much matches up with the estimated cardinality)

But in case of non-popular values, the formula from the documentation is

Cardinality = Num_of_rows * Density, applying this formula to our data sets produces the result as 0.5 – rounded off to the nearestinteger the estimated cardinality should be one.

rajesh@ORA12C> select num_distinct,density, density*90215 the_guess
2  fromuser_tab_col_statistics
3  wheretable_name ='T'
4  andcolumn_name ='OWNER' ;

NUM_DISTINCT    DENSITY  THE_GUESS
------------ ---------- ----------
          33 5.5423E-06         .5

1 row selected.

rajesh@ORA12C>

But the explain plan for the non-popular values reported the estimated cardinality as 59, which seems far away from our estimated calculations.

rajesh@ORA12C> select count(*) from t where owner ='SCOTT';

Execution Plan
----------------------------------------------------------
Plan hash value: 2966233522

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     6 |   480  (13)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |    59 |   354 |   480  (13)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OWNER"='SCOTT')

So looking into 10053 trace, showed that rather than using density from user_tab_col_statistics, optimizer used a different values called “NewDensity” to produce the cardinality.

=====================================
Access path analysis for T
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T[T]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Column (#1):
NewDensity:0.000658, OldDensity:0.000006 BktCnt:88849.000000, PopBktCnt:88849.000000, PopValCnt:10, NDV:33
  Column (#1): OWNER(VARCHAR2)
AvgLen: 6 NDV: 33 Nulls: 0 Density: 0.000658
    Histogram: Top-Freq  #Bkts: 88849  UncompBkts: 88849  EndPtVals: 10  ActualVal: yes
  Table: T  Alias: T
    Card: Original: 90215.000000  Rounded: 59  Computed: 59.391304  Non Adjusted: 59.391304

So applying the formula as Cardinality = NewDensity * Num_of_rows = 0.000658 * 90215 = 59.3 (rounded to the nearest integer produced the value as 59) – that matches up with the estimated cardinality shown up in the explain plan.

But how does the optimizer arrived this NewDensity value shown up in 10053 Trace?

NewDensity = sum_of_unpopular_value_occurances/  (unpopular NDV * Num_of_rows )

Applying this to our data set resulted in this.

NewDensity = (292 + 274 + 237 + 112 + 104 + 55 + 48 + 38 + 34 + 25 + 25 + 22 + 19 + 11 + 10 + 10 + 9 + 8 + 8 + 8 + 6 + 6 + 5 )  / ( 23*90215)  = 0.000658


No comments:

Post a Comment