THE RAJESH BLOG: Histograms on Date, Varchar and Numeric columns

Not, entirely learnt newly today but revisited some good old stuffs. Its about how to translate a histogram into a report of underlying datas (sometimes to see how histograms matches real datas). Lets see how this to be handled on dates, numeric and Varchar2 columns.

rajesh@10GR2>
rajesh@10GR2> create table t
2 nologging
3 as
4 select object_id,ntile(10) over(order by object_id) as ntile
5 from all_objects;

Table created.

Elapsed: 00:00:01.90
rajesh@10GR2>
rajesh@10GR2> begin
2     dbms_stats.gather_table_stats(
3             ownname =>user,
4             tabname=>'T',
5             method_opt=>'for all columns size 10',
6             estimate_percent=>100
7             );
8 end;
9 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.64
rajesh@10GR2>

Now table 'T' will have only values from 1 to 10 in the column NTILE. so provide the values SIZE 10 in gathering histograms. Now looking at user_tab_histograms will represent below values.

rajesh@10GR2> select table_name, column_name, endpoint_number, endpoint_value
2 from user_tab_histograms
3 where table_name ='T'
4 and column_name ='NTILE'
5 order by column_name,endpoint_value;

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER                        ENDPOINT_VALUE
---------- ---------- --------------- -------------------------------------
T          NTILE                 5675                                     1
T          NTILE                11349                                     2
T          NTILE                17023                                     3
T          NTILE                22697                                     4
T          NTILE                28371                                     5
T          NTILE                34045                                     6
T          NTILE                39719                                     7
T          NTILE                45393                                     8
T          NTILE                51067                                     9
T          NTILE                56741                                    10

10 rows selected.

Elapsed: 00:00:00.54
rajesh@10GR2>

The ENDPOINT_VALUE value in user_tab_histograms holds the list of values available in column NTILE. the ENDPOINT_NUMBER is cumulative frequency for appearance of each ENDPOINT_VALUE's.

rajesh@10GR2> select ntile,count(*)
2 from t
3 group by ntile;

     NTILE   COUNT(*)
---------- ----------
         1       5675
         6       5674
         2       5674
         4       5674
         5       5674
         8       5674
         3       5674
         7       5674
         9       5674
        10       5674

10 rows selected.

Elapsed: 00:00:00.28
rajesh@10GR2>
rajesh@10GR2>

This show that we have count of 5674 (approx) for each value of NTILE. lets see how we can match this using Histogram buckets (from below query).

rajesh@10GR2> select table_name, column_name, endpoint_number, endpoint_value,
2             endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_value) as actual_valu
3 from user_tab_histograms
4 where table_name ='T'
5 and column_name ='NTILE'
6 order by column_name,endpoint_value;

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ACTUAL_VALUES
---------- ---------- --------------- -------------- -------------
T          NTILE                 5675              1          5675
T          NTILE                11349              2          5674
T          NTILE                17023              3          5674
T          NTILE                22697              4          5674
T          NTILE                28371              5          5674
T          NTILE                34045              6          5674
T          NTILE                39719              7          5674
T          NTILE                45393              8          5674
T          NTILE                51067              9          5674
T          NTILE                56741             10          5674

10 rows selected.

Elapsed: 00:00:00.04
rajesh@10GR2>
rajesh@10GR2>

This code is for Numeric columns, for Date columns we need to get day part and fraction of day, multiplies the fraction by the number of seconds in a day, converts both bits to character and then uses the ‘Julian’ and ‘seconds’ conversion format.

rajesh@10GR2> create table t
2 as
3 select sysdate as dt, sysdate - ntile(10) over(order by object_id) as ntile
4 from all_objects;

Table created.

Elapsed: 00:00:02.12
rajesh@10GR2>
rajesh@10GR2> begin
2     dbms_stats.gather_table_stats(
3             ownname =>user,
4             tabname=>'T',
5             method_opt=>'for all columns size 10',
6             estimate_percent=>100
7             );
8 end;
9 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.39
rajesh@10GR2>

rajesh@10GR2>
rajesh@10GR2> select ntile,count(*)
2 from t
3 group by ntile
4 order by 1;

NTILE       COUNT(*)
--------- ----------
11-JAN-11       5674
12-JAN-11       5674
13-JAN-11       5674
14-JAN-11       5674
15-JAN-11       5674
16-JAN-11       5674
17-JAN-11       5674
18-JAN-11       5674
19-JAN-11       5674
20-JAN-11       5675

10 rows selected.

Elapsed: 00:00:00.07
rajesh@10GR2>
rajesh@10GR2>

Now, reconstructing and validating data's from Histogram buckets will be like below.

rajesh@10GR2> SELECT table_name,
2    column_name,
3    endpoint_number,
4    endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_value) AS COUNT,
5    to_date(floor(endpoint_value)
6    ||'.'
7    || TO_CHAR(86400 * MOD(endpoint_value, 1), 'FM999999999'),'J.sssss')endpoint_value
8 FROM user_tab_histograms
9 WHERE table_name ='T'
10 AND column_name ='NTILE'
11 ORDER BY column_name,endpoint_value;

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER      COUNT ENDPOINT_
---------- ---------- --------------- ---------- ---------
T          NTILE                 5674       5674 11-JAN-11
T          NTILE                11348       5674 12-JAN-11
T          NTILE                17022       5674 13-JAN-11
T          NTILE                22696       5674 14-JAN-11
T          NTILE                28370       5674 15-JAN-11
T          NTILE                34044       5674 16-JAN-11
T          NTILE                39718       5674 17-JAN-11
T          NTILE                45392       5674 18-JAN-11
T          NTILE                51066       5674 19-JAN-11
T          NTILE                56741       5675 20-JAN-11

10 rows selected.

Elapsed: 00:00:00.07
rajesh@10GR2>

Now for character based columns it involves.

Take the first six bytes of the string
View this as a hexadecimal number, and convert to decimal
Round to 15 significant digits and store as the endpoint_value
If duplicate rows appear, store the first 32 bytes of each string as the endpoint_actual_value

rajesh@10GR2> create table t
2 as
3 select empno, ename as OBJECT_NAME
4 from emp
5 where rownum <= 10;

Table created.

Elapsed: 00:00:00.39
rajesh@10GR2>
rajesh@10GR2>
rajesh@10GR2> begin
2     dbms_stats.gather_table_stats(
3             ownname =>user,
4             tabname=>'T',
5             method_opt=>'for all columns size 10',
6             estimate_percent=>100
7             );
8 end;
9 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.14
rajesh@10GR2>
rajesh@10GR2>
rajesh@10GR2> SELECT endpoint_number,
2    endpoint_number - NVL(prev_endpoint,0) frequency,
3    hex_values,
4    chr(to_number(SUBSTR(hex_values, 2,2),'XX'))
5    || chr(to_number(SUBSTR(hex_values, 4,2),'XX'))
6    || chr(to_number(SUBSTR(hex_values, 6,2),'XX'))
7    || chr(to_number(SUBSTR(hex_values, 8,2),'XX'))
8    || chr(to_number(SUBSTR(hex_values,10,2),'XX'))
9    || chr(to_number(SUBSTR(hex_values,12,2),'XX'))
10    || chr(to_number(SUBSTR(hex_values,14,2),'XX'))
11    || chr(to_number(SUBSTR(hex_values,16,2),'XX'))
12    || chr(to_number(SUBSTR(hex_values,18,2),'XX'))
13    || chr(to_number(SUBSTR(hex_values,20,2),'XX'))
14    || chr(to_number(SUBSTR(hex_values,22,2),'XX'))
15    || chr(to_number(SUBSTR(hex_values,24,2),'XX'))
16    as column_values
17 FROM
18    (SELECT endpoint_number,
19      lag(endpoint_number,1) over( order by endpoint_number ) prev_endpoint,
20      TO_CHAR(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_values,
21      endpoint_actual_value
22    FROM user_tab_histograms
23    WHERE table_name = 'T'
24    AND column_name = 'OBJECT_NAME'
25    )
26 ORDER BY endpoint_number ;

ENDPOINT_NUMBER FREQUENCY HEX_VALUES                      COLUMN_VALUE
--------------- ---------- ------------------------------- ------------
              1          1 414C4C454DFFEFD17E20CECE400000 ALLEM n-~ ++
              2          1 424C414B44FFF5632009023BE00000 BLAKD )c     ?;
              3          1 434C41524B001321ED8AF356E00000 CLARK ?!fè=V
              4          1 4A4F4E4552FFEC7EDB45D5C0800000 JONER 8~¦E++
              5          1 4B494E4700001D9911EA0EFCE00000 KING ?Ö?O?n
              6          1 4D415254494DFF74618D0FA4400000 MARTIM taì¤ñ
              7          1 53434F54540016FED29C0B71200000 SCOTT ?¦-£?q
              8          1 534D49544800059420ACC8DF600000 SMITH ?ö ¼+¯
              9          1 5455524E455201F1D23C139C800000 TURNER?±-<?£
             10          1 57415243FFFFF52B667FFCC9400000 WARC  )+f¦n+

10 rows selected.

Elapsed: 00:00:00.75
rajesh@10GR2>

But due to some Hexadecimal value mismatch, I feel column  COLUMN_VALUE from the above output provides some special characters along with actual data.

Friday, January 21, 2011

Histograms on Date, Varchar and Numeric columns

No comments:

Post a Comment