Friday, January 21, 2011

Histograms on Date, Varchar and Numeric columns

Not, entirely learnt newly today but revisited some good old stuffs. Its about how to translate a histogram into a report of underlying datas (sometimes to see how histograms matches real datas). Lets see how this to be handled on dates, numeric and Varchar2 columns.

rajesh@10GR2>
rajesh@10GR2> create table t
  2  nologging
  3  as
  4  select object_id,ntile(10) over(order by object_id) as ntile
  5  from all_objects;

Table created.

Elapsed: 00:00:01.90
rajesh@10GR2>
rajesh@10GR2> begin
  2     dbms_stats.gather_table_stats(
  3             ownname =>user,
  4             tabname=>'T',
  5             method_opt=>'for all columns size 10',
  6             estimate_percent=>100
  7             );
  8  end;
  9  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.64
rajesh@10GR2>

Now table 'T' will have only values from 1 to 10 in the column NTILE. so provide the values SIZE 10 in gathering histograms. Now looking at user_tab_histograms will represent below values.

rajesh@10GR2> select table_name, column_name, endpoint_number, endpoint_value
  2  from user_tab_histograms
  3  where table_name ='T'
  4  and column_name ='NTILE'
  5  order by column_name,endpoint_value;

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER                        ENDPOINT_VALUE
---------- ---------- --------------- -------------------------------------
T          NTILE                 5675                                     1
T          NTILE                11349                                     2
T          NTILE                17023                                     3
T          NTILE                22697                                     4
T          NTILE                28371                                     5
T          NTILE                34045                                     6
T          NTILE                39719                                     7
T          NTILE                45393                                     8
T          NTILE                51067                                     9
T          NTILE                56741                                    10

10 rows selected.

Elapsed: 00:00:00.54
rajesh@10GR2>

The ENDPOINT_VALUE value in user_tab_histograms holds the list of values available in column NTILE. the ENDPOINT_NUMBER is cumulative frequency for appearance of each  ENDPOINT_VALUE's.

rajesh@10GR2> select ntile,count(*)
  2  from t
  3  group by ntile;

     NTILE   COUNT(*)
---------- ----------
         1       5675
         6       5674
         2       5674
         4       5674
         5       5674
         8       5674
         3       5674
         7       5674
         9       5674
        10       5674

10 rows selected.

Elapsed: 00:00:00.28
rajesh@10GR2>
rajesh@10GR2>

This show that we have count of 5674 (approx) for each value of NTILE. lets see how we can match this using Histogram buckets (from below query).

rajesh@10GR2> select table_name, column_name, endpoint_number, endpoint_value,
  2             endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_value) as actual_valu
  3  from user_tab_histograms
  4  where table_name ='T'
  5  and column_name ='NTILE'
  6  order by column_name,endpoint_value;

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ACTUAL_VALUES
---------- ---------- --------------- -------------- -------------
T          NTILE                 5675              1          5675
T          NTILE                11349              2          5674
T          NTILE                17023              3          5674
T          NTILE                22697              4          5674
T          NTILE                28371              5          5674
T          NTILE                34045              6          5674
T          NTILE                39719              7          5674
T          NTILE                45393              8          5674
T          NTILE                51067              9          5674
T          NTILE                56741             10          5674

10 rows selected.

Elapsed: 00:00:00.04
rajesh@10GR2>
rajesh@10GR2>

This code is for Numeric columns, for Date columns we need to get day part and fraction of day, multiplies the fraction by the number of seconds in a day, converts both bits to character and then uses the ‘Julian’ and ‘seconds’ conversion format.

rajesh@10GR2> create table t
  2  as
  3  select sysdate as dt, sysdate - ntile(10) over(order by object_id) as ntile
  4  from all_objects;

Table created.

Elapsed: 00:00:02.12
rajesh@10GR2>
rajesh@10GR2> begin
  2     dbms_stats.gather_table_stats(
  3             ownname =>user,
  4             tabname=>'T',
  5             method_opt=>'for all columns size 10',
  6             estimate_percent=>100
  7             );
  8  end;
  9  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.39
rajesh@10GR2>

rajesh@10GR2>
rajesh@10GR2> select ntile,count(*)
  2  from t
  3  group by ntile
  4  order by 1;

NTILE       COUNT(*)
--------- ----------
11-JAN-11       5674
12-JAN-11       5674
13-JAN-11       5674
14-JAN-11       5674
15-JAN-11       5674
16-JAN-11       5674
17-JAN-11       5674
18-JAN-11       5674
19-JAN-11       5674
20-JAN-11       5675

10 rows selected.

Elapsed: 00:00:00.07
rajesh@10GR2>
rajesh@10GR2>

Now, reconstructing and validating data's from Histogram buckets will be like below.

rajesh@10GR2> SELECT table_name,
  2    column_name,
  3    endpoint_number,
  4    endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_value) AS COUNT,
  5    to_date(floor(endpoint_value)
  6    ||'.'
  7    || TO_CHAR(86400 * MOD(endpoint_value, 1), 'FM999999999'),'J.sssss')endpoint_value
  8  FROM user_tab_histograms
  9  WHERE table_name ='T'
 10  AND column_name  ='NTILE'
 11  ORDER BY column_name,endpoint_value;

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER      COUNT ENDPOINT_
---------- ---------- --------------- ---------- ---------
T          NTILE                 5674       5674 11-JAN-11
T          NTILE                11348       5674 12-JAN-11
T          NTILE                17022       5674 13-JAN-11
T          NTILE                22696       5674 14-JAN-11
T          NTILE                28370       5674 15-JAN-11
T          NTILE                34044       5674 16-JAN-11
T          NTILE                39718       5674 17-JAN-11
T          NTILE                45392       5674 18-JAN-11
T          NTILE                51066       5674 19-JAN-11
T          NTILE                56741       5675 20-JAN-11

10 rows selected.

Elapsed: 00:00:00.07
rajesh@10GR2>

Now for character based columns it involves.
  • Take the first six bytes of the string
  • View this as a hexadecimal number, and convert to decimal
  • Round to 15 significant digits and store as the endpoint_value
  • If duplicate rows appear, store the first 32 bytes of each string as the endpoint_actual_value

rajesh@10GR2> create table t
  2  as
  3  select empno, ename as OBJECT_NAME
  4  from emp
  5  where rownum <= 10;

Table created.

Elapsed: 00:00:00.39
rajesh@10GR2>
rajesh@10GR2>
rajesh@10GR2> begin
  2     dbms_stats.gather_table_stats(
  3             ownname =>user,
  4             tabname=>'T',
  5             method_opt=>'for all columns size 10',
  6             estimate_percent=>100
  7             );
  8  end;
  9  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.14
rajesh@10GR2>
rajesh@10GR2>
rajesh@10GR2> SELECT endpoint_number,
  2    endpoint_number - NVL(prev_endpoint,0) frequency,
  3    hex_values,
  4    chr(to_number(SUBSTR(hex_values, 2,2),'XX'))
  5    || chr(to_number(SUBSTR(hex_values, 4,2),'XX'))
  6    || chr(to_number(SUBSTR(hex_values, 6,2),'XX'))
  7    || chr(to_number(SUBSTR(hex_values, 8,2),'XX'))
  8    || chr(to_number(SUBSTR(hex_values,10,2),'XX'))
  9    || chr(to_number(SUBSTR(hex_values,12,2),'XX'))
 10    || chr(to_number(SUBSTR(hex_values,14,2),'XX'))
 11    || chr(to_number(SUBSTR(hex_values,16,2),'XX'))
 12    || chr(to_number(SUBSTR(hex_values,18,2),'XX'))
 13    || chr(to_number(SUBSTR(hex_values,20,2),'XX'))
 14    || chr(to_number(SUBSTR(hex_values,22,2),'XX'))
 15    || chr(to_number(SUBSTR(hex_values,24,2),'XX'))
 16    as column_values
 17  FROM
 18    (SELECT endpoint_number,
 19      lag(endpoint_number,1) over( order by endpoint_number ) prev_endpoint,
 20      TO_CHAR(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_values,
 21      endpoint_actual_value
 22    FROM user_tab_histograms
 23    WHERE table_name  = 'T'
 24    AND column_name = 'OBJECT_NAME'
 25    )
 26  ORDER BY endpoint_number ;

ENDPOINT_NUMBER  FREQUENCY HEX_VALUES                      COLUMN_VALUE
--------------- ---------- ------------------------------- ------------
              1          1  414C4C454DFFEFD17E20CECE400000 ALLEM n-~ ++
              2          1  424C414B44FFF5632009023BE00000 BLAKD )c     ?;
              3          1  434C41524B001321ED8AF356E00000 CLARK ?!fè=V
              4          1  4A4F4E4552FFEC7EDB45D5C0800000 JONER 8~¦E++
              5          1  4B494E4700001D9911EA0EFCE00000 KING  ?Ö?O?n
              6          1  4D415254494DFF74618D0FA4400000 MARTIM taì¤ñ
              7          1  53434F54540016FED29C0B71200000 SCOTT ?¦-£?q
              8          1  534D49544800059420ACC8DF600000 SMITH ?ö ¼+¯
              9          1  5455524E455201F1D23C139C800000 TURNER?±-<?£
             10          1  57415243FFFFF52B667FFCC9400000 WARC  )+f¦n+

10 rows selected.

Elapsed: 00:00:00.75
rajesh@10GR2>

But due to some Hexadecimal value mismatch, I feel column  COLUMN_VALUE from the above output provides some special characters along with actual data.

No comments:

Post a Comment