After discussion
about Parquet files a lot, it is time to come back to Oracle database and how
to easily access and work with parquet files as tables in Oracle database.
Accessing External data from within the database provides business benefits and
parquet files are optimal
storage format due to its reduced size compared to other formats. Now it is
the time to see how to access external parquet files and work with its data
using the database as a processing engine.
First we need
to make the data accessible to the database, External tables are the way to go,
since the parquet files are self-describing and contains schema information,
defining an External tables should be straight forward.
demo-user@ATP19C> variable
uri varchar2(100)
demo-user@ATP19C> exec :uri := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/';
PL/SQL procedure successfully
completed.
demo-user@ATP19C>
demo-user@ATP19C> select object_name, bytes
2 from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri );
OBJECT_NAME BYTES
----------------------------------- ----------
java.exe_1_20251015T143444892671Z.p 36017158
arquet
demo-user@ATP19C> begin
2 dbms_cloud.create_external_table(
3 table_name =>'MY_DEMO_EXT'
4 , credential_name =>'MY_DEMO_CRED'
5 , file_uri_list => :uri||'java.exe_1_20251015T143444892671Z.parquet'
6 , format => json_object('type' value 'parquet', 'schema' value 'first') );
7 end;
8 /
PL/SQL procedure successfully
completed.
demo-user@ATP19C> set
linesize 71
demo-user@ATP19C> desc my_demo_ext
Name Null? Type
----------------------------------- -------- -------------------------
OWNER VARCHAR2(32767)
OBJECT_NAME VARCHAR2(32767)
SUBOBJECT_NAME VARCHAR2(32767)
OBJECT_ID NUMBER(38,20)
DATA_OBJECT_ID NUMBER(38,20)
OBJECT_TYPE VARCHAR2(32767)
CREATED DATE
LAST_DDL_TIME DATE
TIMESTAMP VARCHAR2(32767)
STATUS VARCHAR2(32767)
TEMPORARY VARCHAR2(32767)
GENERATED VARCHAR2(32767)
SECONDARY VARCHAR2(32767)
NAMESPACE NUMBER(38,20)
EDITION_NAME VARCHAR2(32767)
SHARING VARCHAR2(32767)
EDITIONABLE VARCHAR2(32767)
ORACLE_MAINTAINED VARCHAR2(32767)
APPLICATION VARCHAR2(32767)
DEFAULT_COLLATION VARCHAR2(32767)
DUPLICATED VARCHAR2(32767)
SHARDED VARCHAR2(32767)
CREATED_APPID NUMBER(38,20)
CREATED_VSNID NUMBER(38,20)
MODIFIED_APPID NUMBER(38,20)
MODIFIED_VSNID NUMBER(38,20)
ID NUMBER(38,20)
demo-user@ATP19C> select
count(*) from my_demo_ext;
COUNT(*)
------------
1,000,000
demo-user@ATP19C>
if you got
access to the Database action or the web UI page for Autonomous database then
we can make use of the Data Load options under Data Studio
in a user-friendly manner to load data.
After creating
these external tables, you can use it like any other in Oracle database.
Also in certain
cases where you don’t have privilege to create table, we can leverage
inline-External table feature to access parquet files from object storage. To
do that we need obtain information (like the number of columns and its data
types) about Parquet files
The easiest way
to get insights to Parquet files it to make use of Python based Parquet-tools , when you
have Python 3 in place then it is super easy to install, just run the following
command and it’s ready to use.
C:\Users\Rajeshwaran
Jeyabal>pip3 install parquet-tools
Defaulting to user installation because normal site-packages is not writeable
Collecting parquet-tools
Downloading parquet_tools-0.2.16-py3-none-any.whl.metadata (3.8 kB)
Collecting boto3<2.0.0,>=1.34.11 (from parquet-tools)
Downloading boto3-1.40.52-py3-none-any.whl.metadata (6.7 kB)
…………….
…………….
…………….
…………….
Successfully installed
boto3-1.40.52 botocore-1.40.52 colorama-0.4.6 halo-0.0.31 jmespath-1.0.1
log_symbols-0.0.14 numpy-2.3.3 pandas-2.3.3 parquet-tools-0.2.16 pyarrow-21.0.0
python-dateutil-2.9.0.post0 pytz-2025.2 s3transfer-0.14.0 six-1.17.0
spinners-0.0.24 tabulate-0.9.0 termcolor-3.1.0 thrift-0.16.0 tzdata-2025.2
urllib3-2.5.0
Now you got a
simple and powerful tool, just run the “inspect” command to learn how the file
is created and what compression is used, how many columns are there and how
many row groups are there.
C:\Users\Rajeshwaran
Jeyabal>parquet-tools.exe inspect "C:\Users\Rajeshwaran
Jeyabal\Downloads\BUCKET02_java.exe_1_20251015T143444892671Z.parquet"
############ file meta data
############
created_by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
num_columns: 27
num_rows: 1000000
num_row_groups: 174
format_version: 1.0
serialized_size: 419719
############ Columns
############
OWNER
OBJECT_NAME
SUBOBJECT_NAME
OBJECT_ID
DATA_OBJECT_ID
OBJECT_TYPE
CREATED
LAST_DDL_TIME
TIMESTAMP
STATUS
TEMPORARY
GENERATED
SECONDARY
NAMESPACE
EDITION_NAME
SHARING
EDITIONABLE
ORACLE_MAINTAINED
APPLICATION
DEFAULT_COLLATION
DUPLICATED
SHARDED
CREATED_APPID
CREATED_VSNID
MODIFIED_APPID
MODIFIED_VSNID
ID
############ Column(OWNER)
############
name: OWNER
path: OWNER
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 95%)
############
Column(OBJECT_NAME) ############
name: OBJECT_NAME
path: OBJECT_NAME
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 65%)
…………………..
…………………..
…………………..
…………………..
############ Column(ID)
############
name: ID
path: ID
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=38, scale=20)
converted_type (legacy): DECIMAL
compression: SNAPPY (space_saved: 41%)
Using the above
list of columns in Parquet files, we can create an Inline-External table like
this
demo-user@ATP19C> with rws
as (
2 select *
3 from external(
4 ( OWNER VARCHAR2(4000)
5 , OBJECT_NAME VARCHAR2(4000)
6 , SUBOBJECT_NAME VARCHAR2(4000)
7 , OBJECT_ID NUMBER(38,20)
8 , DATA_OBJECT_ID NUMBER(38,20)
9 , OBJECT_TYPE VARCHAR2(4000)
10 , CREATED DATE
11 , LAST_DDL_TIME DATE
12 , TIMESTAMP VARCHAR2(4000)
13 , STATUS VARCHAR2(4000)
14 , TEMPORARY VARCHAR2(4000)
15 , GENERATED VARCHAR2(4000)
16 , SECONDARY VARCHAR2(4000)
17 , NAMESPACE NUMBER(38,20)
18 , EDITION_NAME VARCHAR2(4000)
19 , SHARING VARCHAR2(4000)
20 , EDITIONABLE VARCHAR2(4000)
21 , ORACLE_MAINTAINED VARCHAR2(4000)
22 , APPLICATION VARCHAR2(4000)
23 , DEFAULT_COLLATION VARCHAR2(4000)
24 , DUPLICATED VARCHAR2(4000)
25 , SHARDED VARCHAR2(4000)
26 , CREATED_APPID NUMBER(38,20)
27 , CREATED_VSNID NUMBER(38,20)
28 , MODIFIED_APPID NUMBER(38,20)
29 , MODIFIED_VSNID NUMBER(38,20)
30 , ID NUMBER(38,20) )
31 type oracle_bigdata
32 access parameters(
33 com.oracle.bigdata.fileformat = parquet
34 com.oracle.bigdata.credential.name = MY_DEMO_CRED
35 com.oracle.bigdata.removequotes = true
36 com.oracle.bigdata.dateformat = 'YYYY-MM-DD hh24:mi:ss'
37 )
38 location ('https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/java.exe_1_20251015T143444892671Z.parquet')
39 ) )
40 select trunc(created,'Y'), count(*)
41 from rws
42 group by trunc(created,'Y');
TRUNC(CREAT COUNT(*)
----------- ------------
01-JAN-2022 142,042
01-JAN-2023 57,636
01-JAN-2024 271,683
01-JAN-2021 92,060
01-JAN-2020 436,579
demo-user@ATP19C>
demo-user@ATP19C> exec :uri := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/';
demo-user@ATP19C> select object_name, bytes
2 from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri );
----------------------------------- ----------
java.exe_1_20251015T143444892671Z.p 36017158
arquet
2 dbms_cloud.create_external_table(
3 table_name =>'MY_DEMO_EXT'
4 , credential_name =>'MY_DEMO_CRED'
5 , file_uri_list => :uri||'java.exe_1_20251015T143444892671Z.parquet'
6 , format => json_object('type' value 'parquet', 'schema' value 'first') );
7 end;
8 /
demo-user@ATP19C> desc my_demo_ext
Name Null? Type
----------------------------------- -------- -------------------------
OWNER VARCHAR2(32767)
OBJECT_NAME VARCHAR2(32767)
SUBOBJECT_NAME VARCHAR2(32767)
OBJECT_ID NUMBER(38,20)
DATA_OBJECT_ID NUMBER(38,20)
OBJECT_TYPE VARCHAR2(32767)
CREATED DATE
LAST_DDL_TIME DATE
TIMESTAMP VARCHAR2(32767)
STATUS VARCHAR2(32767)
TEMPORARY VARCHAR2(32767)
GENERATED VARCHAR2(32767)
SECONDARY VARCHAR2(32767)
NAMESPACE NUMBER(38,20)
EDITION_NAME VARCHAR2(32767)
SHARING VARCHAR2(32767)
EDITIONABLE VARCHAR2(32767)
ORACLE_MAINTAINED VARCHAR2(32767)
APPLICATION VARCHAR2(32767)
DEFAULT_COLLATION VARCHAR2(32767)
DUPLICATED VARCHAR2(32767)
SHARDED VARCHAR2(32767)
CREATED_APPID NUMBER(38,20)
CREATED_VSNID NUMBER(38,20)
MODIFIED_APPID NUMBER(38,20)
MODIFIED_VSNID NUMBER(38,20)
ID NUMBER(38,20)
------------
1,000,000
Defaulting to user installation because normal site-packages is not writeable
Collecting parquet-tools
Downloading parquet_tools-0.2.16-py3-none-any.whl.metadata (3.8 kB)
Collecting boto3<2.0.0,>=1.34.11 (from parquet-tools)
Downloading boto3-1.40.52-py3-none-any.whl.metadata (6.7 kB)
…………….
…………….
…………….
…………….
created_by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
num_columns: 27
num_rows: 1000000
num_row_groups: 174
format_version: 1.0
serialized_size: 419719
OWNER
OBJECT_NAME
SUBOBJECT_NAME
OBJECT_ID
DATA_OBJECT_ID
OBJECT_TYPE
CREATED
LAST_DDL_TIME
TIMESTAMP
STATUS
TEMPORARY
GENERATED
SECONDARY
NAMESPACE
EDITION_NAME
SHARING
EDITIONABLE
ORACLE_MAINTAINED
APPLICATION
DEFAULT_COLLATION
DUPLICATED
SHARDED
CREATED_APPID
CREATED_VSNID
MODIFIED_APPID
MODIFIED_VSNID
ID
name: OWNER
path: OWNER
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 95%)
name: OBJECT_NAME
path: OBJECT_NAME
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 65%)
…………………..
…………………..
…………………..
name: ID
path: ID
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=38, scale=20)
converted_type (legacy): DECIMAL
compression: SNAPPY (space_saved: 41%)
2 select *
3 from external(
4 ( OWNER VARCHAR2(4000)
5 , OBJECT_NAME VARCHAR2(4000)
6 , SUBOBJECT_NAME VARCHAR2(4000)
7 , OBJECT_ID NUMBER(38,20)
8 , DATA_OBJECT_ID NUMBER(38,20)
9 , OBJECT_TYPE VARCHAR2(4000)
10 , CREATED DATE
11 , LAST_DDL_TIME DATE
12 , TIMESTAMP VARCHAR2(4000)
13 , STATUS VARCHAR2(4000)
14 , TEMPORARY VARCHAR2(4000)
15 , GENERATED VARCHAR2(4000)
16 , SECONDARY VARCHAR2(4000)
17 , NAMESPACE NUMBER(38,20)
18 , EDITION_NAME VARCHAR2(4000)
19 , SHARING VARCHAR2(4000)
20 , EDITIONABLE VARCHAR2(4000)
21 , ORACLE_MAINTAINED VARCHAR2(4000)
22 , APPLICATION VARCHAR2(4000)
23 , DEFAULT_COLLATION VARCHAR2(4000)
24 , DUPLICATED VARCHAR2(4000)
25 , SHARDED VARCHAR2(4000)
26 , CREATED_APPID NUMBER(38,20)
27 , CREATED_VSNID NUMBER(38,20)
28 , MODIFIED_APPID NUMBER(38,20)
29 , MODIFIED_VSNID NUMBER(38,20)
30 , ID NUMBER(38,20) )
31 type oracle_bigdata
32 access parameters(
33 com.oracle.bigdata.fileformat = parquet
34 com.oracle.bigdata.credential.name = MY_DEMO_CRED
35 com.oracle.bigdata.removequotes = true
36 com.oracle.bigdata.dateformat = 'YYYY-MM-DD hh24:mi:ss'
37 )
38 location ('https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/java.exe_1_20251015T143444892671Z.parquet')
39 ) )
40 select trunc(created,'Y'), count(*)
41 from rws
42 group by trunc(created,'Y');
----------- ------------
01-JAN-2022 142,042
01-JAN-2023 57,636
01-JAN-2024 271,683
01-JAN-2021 92,060
01-JAN-2020 436,579
No comments:
Post a Comment