Tuesday, October 21, 2025

Parquet Files - Part II


After discussing Parquet files at length, it is time to come back to the Oracle database and see how easily we can access and work with Parquet files as tables. Accessing external data from within the database provides real business benefits, and Parquet is an optimal storage format thanks to its reduced size compared to other formats. Now it is time to see how to access external Parquet files and work with their data using the database as the processing engine.
 
First we need to make the data accessible to the database, and external tables are the way to go. Since Parquet files are self-describing and carry their schema information, defining an external table is straightforward: the 'schema' value 'first' format option below tells DBMS_CLOUD.CREATE_EXTERNAL_TABLE to derive the column list from the first file.

 
demo-user@ATP19C> variable uri varchar2(100)
demo-user@ATP19C> exec :uri := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/';
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C>
demo-user@ATP19C> select object_name, bytes
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri );
 
OBJECT_NAME                                      BYTES
------------------------------------------  ----------
java.exe_1_20251015T143444892671Z.parquet     36017158
 
demo-user@ATP19C> begin
  2     dbms_cloud.create_external_table(
  3             table_name =>'MY_DEMO_EXT'
  4             , credential_name =>'MY_DEMO_CRED'
  5             , file_uri_list => :uri||'java.exe_1_20251015T143444892671Z.parquet'
  6             , format => json_object('type' value 'parquet', 'schema' value 'first') );
  7  end;
  8  /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> set linesize 71
demo-user@ATP19C> desc my_demo_ext
 Name                                Null?    Type
 ----------------------------------- -------- -------------------------
 OWNER                                        VARCHAR2(32767)
 OBJECT_NAME                                  VARCHAR2(32767)
 SUBOBJECT_NAME                               VARCHAR2(32767)
 OBJECT_ID                                    NUMBER(38,20)
 DATA_OBJECT_ID                               NUMBER(38,20)
 OBJECT_TYPE                                  VARCHAR2(32767)
 CREATED                                      DATE
 LAST_DDL_TIME                                DATE
 TIMESTAMP                                    VARCHAR2(32767)
 STATUS                                       VARCHAR2(32767)
 TEMPORARY                                    VARCHAR2(32767)
 GENERATED                                    VARCHAR2(32767)
 SECONDARY                                    VARCHAR2(32767)
 NAMESPACE                                    NUMBER(38,20)
 EDITION_NAME                                 VARCHAR2(32767)
 SHARING                                      VARCHAR2(32767)
 EDITIONABLE                                  VARCHAR2(32767)
 ORACLE_MAINTAINED                            VARCHAR2(32767)
 APPLICATION                                  VARCHAR2(32767)
 DEFAULT_COLLATION                            VARCHAR2(32767)
 DUPLICATED                                   VARCHAR2(32767)
 SHARDED                                      VARCHAR2(32767)
 CREATED_APPID                                NUMBER(38,20)
 CREATED_VSNID                                NUMBER(38,20)
 MODIFIED_APPID                               NUMBER(38,20)
 MODIFIED_VSNID                               NUMBER(38,20)
 ID                                           NUMBER(38,20)
 
demo-user@ATP19C> select count(*) from my_demo_ext;
 
    COUNT(*)
------------
   1,000,000
 
demo-user@ATP19C>
 
 
If you have access to Database Actions, the web UI for Autonomous Database, you can also make use of the Data Load option under Data Studio to load the data in a user-friendly manner.
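 
If you prefer to script that load instead, DBMS_CLOUD.COPY_DATA can pull the Parquet file straight into a regular table. A minimal sketch, assuming a pre-created target table MY_DEMO_TAB (a hypothetical name) whose columns match the file:
 
begin
    dbms_cloud.copy_data(
        table_name        => 'MY_DEMO_TAB'   -- hypothetical; must already exist
        , credential_name => 'MY_DEMO_CRED'
        , file_uri_list   => :uri||'java.exe_1_20251015T143444892671Z.parquet'
        , format          => json_object('type' value 'parquet') );
end;
/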
 
After creating these external tables, you can use them like any other table in the Oracle database.
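 
For example, a quick aggregation straight off the Parquet file (a minimal sketch; the results naturally depend on what is in your file):
 
-- the external table behaves like any other table in the query;
-- the scan runs against the Parquet file in object storage
select object_type, count(*)
from my_demo_ext
where status = 'VALID'
group by object_type
order by count(*) desc
fetch first 5 rows only;
 
You can join it to local tables or use it as the source of a CTAS or INSERT ... SELECT load just the same.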
 
Also, in certain cases where you don't have the privilege to create tables, we can leverage the inline external table feature to access Parquet files from object storage. To do that we need to obtain information about the Parquet files, such as the number of columns and their data types.
 
The easiest way to get insights into Parquet files is to make use of the Python-based parquet-tools. When you have Python 3 in place it is super easy to install: just run the following command and it's ready to use.

 
C:\Users\Rajeshwaran Jeyabal>pip3 install parquet-tools
Defaulting to user installation because normal site-packages is not writeable
Collecting parquet-tools
  Downloading parquet_tools-0.2.16-py3-none-any.whl.metadata (3.8 kB)
Collecting boto3<2.0.0,>=1.34.11 (from parquet-tools)
  Downloading boto3-1.40.52-py3-none-any.whl.metadata (6.7 kB)
…………….
…………….
…………….
…………….
 
Successfully installed boto3-1.40.52 botocore-1.40.52 colorama-0.4.6 halo-0.0.31 jmespath-1.0.1 log_symbols-0.0.14 numpy-2.3.3 pandas-2.3.3 parquet-tools-0.2.16 pyarrow-21.0.0 python-dateutil-2.9.0.post0 pytz-2025.2 s3transfer-0.14.0 six-1.17.0 spinners-0.0.24 tabulate-0.9.0 termcolor-3.1.0 thrift-0.16.0 tzdata-2025.2 urllib3-2.5.0
 
Now you have a simple and powerful tool. Just run the "inspect" command to learn how the file was created, what compression is used, how many columns there are, and how many row groups there are.
 
C:\Users\Rajeshwaran Jeyabal>parquet-tools.exe inspect "C:\Users\Rajeshwaran Jeyabal\Downloads\BUCKET02_java.exe_1_20251015T143444892671Z.parquet"
 
############ file meta data ############
created_by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
num_columns: 27
num_rows: 1000000
num_row_groups: 174
format_version: 1.0
serialized_size: 419719
 
 
############ Columns ############
OWNER
OBJECT_NAME
SUBOBJECT_NAME
OBJECT_ID
DATA_OBJECT_ID
OBJECT_TYPE
CREATED
LAST_DDL_TIME
TIMESTAMP
STATUS
TEMPORARY
GENERATED
SECONDARY
NAMESPACE
EDITION_NAME
SHARING
EDITIONABLE
ORACLE_MAINTAINED
APPLICATION
DEFAULT_COLLATION
DUPLICATED
SHARDED
CREATED_APPID
CREATED_VSNID
MODIFIED_APPID
MODIFIED_VSNID
ID
 
############ Column(OWNER) ############
name: OWNER
path: OWNER
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 95%)
 
############ Column(OBJECT_NAME) ############
name: OBJECT_NAME
path: OBJECT_NAME
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 65%)
 
…………………..
…………………..
…………………..
…………………..
 
 
############ Column(ID) ############
name: ID
path: ID
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=38, scale=20)
converted_type (legacy): DECIMAL
compression: SNAPPY (space_saved: 41%)
 
 
Using the above list of columns from the Parquet file, we can create an inline external table like this:
 
demo-user@ATP19C> with rws as (
  2  select *
  3  from external(
  4      ( OWNER    VARCHAR2(4000)
  5     , OBJECT_NAME    VARCHAR2(4000)
  6     , SUBOBJECT_NAME    VARCHAR2(4000)
  7     , OBJECT_ID    NUMBER(38,20)
  8     , DATA_OBJECT_ID    NUMBER(38,20)
  9     , OBJECT_TYPE    VARCHAR2(4000)
 10     , CREATED    DATE
 11     , LAST_DDL_TIME    DATE
 12     , TIMESTAMP    VARCHAR2(4000)
 13     , STATUS    VARCHAR2(4000)
 14     , TEMPORARY    VARCHAR2(4000)
 15     , GENERATED    VARCHAR2(4000)
 16     , SECONDARY    VARCHAR2(4000)
 17     , NAMESPACE    NUMBER(38,20)
 18     , EDITION_NAME    VARCHAR2(4000)
 19     , SHARING    VARCHAR2(4000)
 20     , EDITIONABLE    VARCHAR2(4000)
 21     , ORACLE_MAINTAINED    VARCHAR2(4000)
 22     , APPLICATION    VARCHAR2(4000)
 23     , DEFAULT_COLLATION    VARCHAR2(4000)
 24     , DUPLICATED    VARCHAR2(4000)
 25     , SHARDED    VARCHAR2(4000)
 26     , CREATED_APPID    NUMBER(38,20)
 27     , CREATED_VSNID    NUMBER(38,20)
 28     , MODIFIED_APPID    NUMBER(38,20)
 29     , MODIFIED_VSNID    NUMBER(38,20)
 30     , ID    NUMBER(38,20) )
 31      type oracle_bigdata
 32      access parameters(
 33          com.oracle.bigdata.fileformat = parquet
 34          com.oracle.bigdata.credential.name = MY_DEMO_CRED
 35          com.oracle.bigdata.removequotes = true
 36          com.oracle.bigdata.dateformat = 'YYYY-MM-DD hh24:mi:ss'
 37          )
 38  location ('https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/java.exe_1_20251015T143444892671Z.parquet')
 39  ) )
 40  select trunc(created,'Y'), count(*)
 41  from rws
 42  group by trunc(created,'Y');
 
TRUNC(CREAT     COUNT(*)
----------- ------------
01-JAN-2022      142,042
01-JAN-2023       57,636
01-JAN-2024      271,683
01-JAN-2021       92,060
01-JAN-2020      436,579
 
demo-user@ATP19C>
 
 

Friday, October 17, 2025

Parquet Files - Part I

 
Apache Parquet is a popular columnar file format that stores data efficiently: it employs column-wise compression, chooses encoding and compression strategies based on the data type, and stores metadata alongside the data. Generally, the less data you have to read and transfer, the better your query performance and the lower your running costs. And obviously, the smaller the data footprint, the smaller the cost of storing it. Parquet helps us here quite a bit.
 
Let's see why Parquet offers such an increase in performance when processing external data.
Parquet files are organized in columns, which means only the columns listed in the SQL statement need to be read by the compute layer, much in the spirit of an Oracle storage index avoiding I/O the query does not need.

If you need to read a single column from a Parquet file, you only read the corresponding column chunks from all row groups.
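 
To make that concrete, here is a hedged sketch against a Parquet-backed external table (same placeholder style as the query further down): a statement that references a single column only needs that column's chunks from each row group.
 
-- only the col1 column chunks are read from every row group;
-- the other columns in the file are never transferred
select col1, count(*)
from <your_table_name>
group by col1;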
 
All the technical details of how Parquet files are organized are covered in the open-source Parquet format specification, but in short: each file is divided into chunks called row groups, and each row group carries metadata (such as min/max values) for every column. Row groups are a way to implement hybrid row/column storage: rows are grouped into row groups, and within a row group the data is stored column by column, so a single row group contains the data of all columns for a subset of the rows. This is pretty close to what Oracle HCC and storage indexes do. The query engine can use this metadata to skip reading parts of the file based on the SQL predicates.

For example, the following query
 
select * from <your_table_name> where col1 = 1;
 
will read only the row groups whose min/max metadata for col1 can contain the value 1 and skip all the others.
 
Parquet is also a self-describing file format: metadata is stitched together with the data, meaning that with only the file in hand a user can define a table; nothing extra is required. Everything is compressed by default, and since it is a columnar file format we get good compression ratios.
 
Parquet is an open, widely adopted format. It has been available for quite some time and has become the de facto standard for storing data in a columnar format outside your database engine. So whenever your data does not require the security and robustness of being stored, managed, and protected by your database, and you have a choice, you should consider storing it in Parquet. In this blog post we will compare the size of the output files produced by other open formats, such as CSV, against the Parquet file sizes.
 
We have a table with about 160 MB of data.
 
demo-user@ATP19C> exec show_space('BIG_TABLE');
Unformatted Blocks .....................               0
FS1 Blocks (0-25) ......................               0
FS2 Blocks (25-50) .....................               0
FS3 Blocks (50-75) .....................               0
FS4 Blocks (75-100).....................              18
Full Blocks ............................          20,526
Total Blocks............................          20,920
Total Bytes.............................     171,376,640
Total MBytes............................             163
Unused Blocks...........................               0
Unused Bytes............................               0
Last Used Ext FileId....................           2,731
Last Used Ext BlockId...................         575,872
Last Used Block.........................              24
 
PL/SQL procedure successfully completed.
 
 
When exported to object storage as CSV:
 
demo-user@ATP19C> variable uri varchar2(100)
demo-user@ATP19C> exec :uri := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/';
 
PL/SQL procedure successfully completed.
demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'csv')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED'
  3      , :uri )
  4* where object_name like '%.csv'    ;
 
COUNT(*) SIZE_MB
________ _______
      17  165.37
 
The total file size was close to 165 MB. When exported as JSON, it was close to 577 MB:
 
demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'json')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED'
  3      , :uri )
  4* where object_name like '%.json'   ;
 
COUNT(*) SIZE_MB
________ _______
      58     577
 
With XML it was close to 426 MB:
 
demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'xml')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri )
  3* where object_name like '%.xml' ;
 
COUNT(*) SIZE_MB
________ _______
      43  426.02
 
Whereas with Parquet it was just around 34 MB:
 

demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'parquet')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri )
  3* where object_name like '%.parquet' ;
 
COUNT(*) SIZE_MB
________ _______
       1     34.35
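 
To recap, the same ~163 MB table came out of DBMS_CLOUD.EXPORT_DATA as:
 
Format      Files   Size (MB)
--------    -----   ---------
CSV            17      165.37
JSON           58      577.00
XML            43      426.02
Parquet         1       34.35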
 
In the next blog post we will see how to read these Parquet files from object storage into the Oracle database using Autonomous Database features.