Tuesday, October 21, 2025

Parquet Files - Part II


After discussion about Parquet files a lot, it is time to come back to Oracle database and how to easily access and work with parquet files as tables in Oracle database. Accessing External data from within the database provides business benefits and parquet files are optimal storage format due to its reduced size compared to other formats. Now it is the time to see how to access external parquet files and work with its data using the database as a processing engine.
 
First we need to make the data accessible to the database, External tables are the way to go, since the parquet files are self-describing and contains schema information, defining an External tables should be straight forward.

 
demo-user@ATP19C> variable uri varchar2(100)
demo-user@ATP19C> exec :uri := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/';
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C>
demo-user@ATP19C> select object_name, bytes
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri );
 
OBJECT_NAME                              BYTES
----------------------------------- ----------
java.exe_1_20251015T143444892671Z.p   36017158
arquet
 
demo-user@ATP19C> begin
  2     dbms_cloud.create_external_table(
  3             table_name =>'MY_DEMO_EXT'
  4             , credential_name =>'MY_DEMO_CRED'
  5             , file_uri_list => :uri||'java.exe_1_20251015T143444892671Z.parquet'
  6             , format => json_object('type' value 'parquet', 'schema' value 'first') );
  7  end;
  8  /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> set linesize 71
demo-user@ATP19C> desc my_demo_ext
 Name                                Null?    Type
 ----------------------------------- -------- -------------------------
 OWNER                                        VARCHAR2(32767)
 OBJECT_NAME                                  VARCHAR2(32767)
 SUBOBJECT_NAME                               VARCHAR2(32767)
 OBJECT_ID                                    NUMBER(38,20)
 DATA_OBJECT_ID                               NUMBER(38,20)
 OBJECT_TYPE                                  VARCHAR2(32767)
 CREATED                                      DATE
 LAST_DDL_TIME                                DATE
 TIMESTAMP                                    VARCHAR2(32767)
 STATUS                                       VARCHAR2(32767)
 TEMPORARY                                    VARCHAR2(32767)
 GENERATED                                    VARCHAR2(32767)
 SECONDARY                                    VARCHAR2(32767)
 NAMESPACE                                    NUMBER(38,20)
 EDITION_NAME                                 VARCHAR2(32767)
 SHARING                                      VARCHAR2(32767)
 EDITIONABLE                                  VARCHAR2(32767)
 ORACLE_MAINTAINED                            VARCHAR2(32767)
 APPLICATION                                  VARCHAR2(32767)
 DEFAULT_COLLATION                            VARCHAR2(32767)
 DUPLICATED                                   VARCHAR2(32767)
 SHARDED                                      VARCHAR2(32767)
 CREATED_APPID                                NUMBER(38,20)
 CREATED_VSNID                                NUMBER(38,20)
 MODIFIED_APPID                               NUMBER(38,20)
 MODIFIED_VSNID                               NUMBER(38,20)
 ID                                           NUMBER(38,20)
 
demo-user@ATP19C> select count(*) from my_demo_ext;
 
    COUNT(*)
------------
   1,000,000
 
demo-user@ATP19C>
 
 
if you got access to the Database action or the web UI page for Autonomous database then we can make use of the Data Load options under Data Studio in a user-friendly manner to load data.
 


 
 


 
 
After creating these external tables, you can use it like any other in Oracle database.
 
Also in certain cases where you don’t have privilege to create table, we can leverage inline-External table feature to access parquet files from object storage. To do that we need obtain information (like the number of columns and its data types) about Parquet files
 
The easiest way to get insights to Parquet files it to make use of Python based Parquet-tools , when you have Python 3 in place then it is super easy to install, just run the following command and it’s ready to use.

 
C:\Users\Rajeshwaran Jeyabal>pip3 install parquet-tools
Defaulting to user installation because normal site-packages is not writeable
Collecting parquet-tools
  Downloading parquet_tools-0.2.16-py3-none-any.whl.metadata (3.8 kB)
Collecting boto3<2.0.0,>=1.34.11 (from parquet-tools)
  Downloading boto3-1.40.52-py3-none-any.whl.metadata (6.7 kB)
…………….
…………….
…………….
…………….
 
Successfully installed boto3-1.40.52 botocore-1.40.52 colorama-0.4.6 halo-0.0.31 jmespath-1.0.1 log_symbols-0.0.14 numpy-2.3.3 pandas-2.3.3 parquet-tools-0.2.16 pyarrow-21.0.0 python-dateutil-2.9.0.post0 pytz-2025.2 s3transfer-0.14.0 six-1.17.0 spinners-0.0.24 tabulate-0.9.0 termcolor-3.1.0 thrift-0.16.0 tzdata-2025.2 urllib3-2.5.0
 
Now you got a simple and powerful tool, just run the “inspect” command to learn how the file is created and what compression is used, how many columns are there and how many row groups are there.
 
C:\Users\Rajeshwaran Jeyabal>parquet-tools.exe inspect "C:\Users\Rajeshwaran Jeyabal\Downloads\BUCKET02_java.exe_1_20251015T143444892671Z.parquet"
 
############ file meta data ############
created_by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
num_columns: 27
num_rows: 1000000
num_row_groups: 174
format_version: 1.0
serialized_size: 419719
 
 
############ Columns ############
OWNER
OBJECT_NAME
SUBOBJECT_NAME
OBJECT_ID
DATA_OBJECT_ID
OBJECT_TYPE
CREATED
LAST_DDL_TIME
TIMESTAMP
STATUS
TEMPORARY
GENERATED
SECONDARY
NAMESPACE
EDITION_NAME
SHARING
EDITIONABLE
ORACLE_MAINTAINED
APPLICATION
DEFAULT_COLLATION
DUPLICATED
SHARDED
CREATED_APPID
CREATED_VSNID
MODIFIED_APPID
MODIFIED_VSNID
ID
 
############ Column(OWNER) ############
name: OWNER
path: OWNER
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 95%)
 
############ Column(OBJECT_NAME) ############
name: OBJECT_NAME
path: OBJECT_NAME
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 65%)
 
…………………..
…………………..
…………………..
…………………..
 
 
############ Column(ID) ############
name: ID
path: ID
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=38, scale=20)
converted_type (legacy): DECIMAL
compression: SNAPPY (space_saved: 41%)
 
 
Using the above list of columns in Parquet files, we can create an Inline-External table like this
 
demo-user@ATP19C> with rws as (
  2  select *
  3  from external(
  4      ( OWNER    VARCHAR2(4000)
  5     , OBJECT_NAME    VARCHAR2(4000)
  6     , SUBOBJECT_NAME    VARCHAR2(4000)
  7     , OBJECT_ID    NUMBER(38,20)
  8     , DATA_OBJECT_ID    NUMBER(38,20)
  9     , OBJECT_TYPE    VARCHAR2(4000)
 10     , CREATED    DATE
 11     , LAST_DDL_TIME    DATE
 12     , TIMESTAMP    VARCHAR2(4000)
 13     , STATUS    VARCHAR2(4000)
 14     , TEMPORARY    VARCHAR2(4000)
 15     , GENERATED    VARCHAR2(4000)
 16     , SECONDARY    VARCHAR2(4000)
 17     , NAMESPACE    NUMBER(38,20)
 18     , EDITION_NAME    VARCHAR2(4000)
 19     , SHARING    VARCHAR2(4000)
 20     , EDITIONABLE    VARCHAR2(4000)
 21     , ORACLE_MAINTAINED    VARCHAR2(4000)
 22     , APPLICATION    VARCHAR2(4000)
 23     , DEFAULT_COLLATION    VARCHAR2(4000)
 24     , DUPLICATED    VARCHAR2(4000)
 25     , SHARDED    VARCHAR2(4000)
 26     , CREATED_APPID    NUMBER(38,20)
 27     , CREATED_VSNID    NUMBER(38,20)
 28     , MODIFIED_APPID    NUMBER(38,20)
 29     , MODIFIED_VSNID    NUMBER(38,20)
 30     , ID    NUMBER(38,20) )
 31      type oracle_bigdata
 32      access parameters(
 33          com.oracle.bigdata.fileformat = parquet
 34          com.oracle.bigdata.credential.name = MY_DEMO_CRED
 35          com.oracle.bigdata.removequotes = true
 36          com.oracle.bigdata.dateformat = 'YYYY-MM-DD hh24:mi:ss'
 37          )
 38  location ('https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/java.exe_1_20251015T143444892671Z.parquet')
 39  ) )
 40  select trunc(created,'Y'), count(*)
 41  from rws
 42  group by trunc(created,'Y');
 
TRUNC(CREAT     COUNT(*)
----------- ------------
01-JAN-2022      142,042
01-JAN-2023       57,636
01-JAN-2024      271,683
01-JAN-2021       92,060
01-JAN-2020      436,579
 
demo-user@ATP19C>