Tuesday, October 21, 2025

Parquet Files - Part II


After discussing Parquet files at length, it is time to come back to the Oracle database and see how easily we can access and work with Parquet files as tables. Accessing external data from within the database provides real business benefits, and Parquet is an optimal storage format thanks to its reduced size compared to other formats. Now it is time to see how to access external Parquet files and work with their data using the database as the processing engine.
 
First we need to make the data accessible to the database, and external tables are the way to go. Since Parquet files are self-describing and carry their schema information, defining an external table is straightforward: the 'schema' value 'first' format option below tells DBMS_CLOUD.CREATE_EXTERNAL_TABLE to derive the column list from the first file.

 
demo-user@ATP19C> variable uri varchar2(100)
demo-user@ATP19C> exec :uri := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/';
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C>
demo-user@ATP19C> select object_name, bytes
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri );
 
OBJECT_NAME                                      BYTES
------------------------------------------  ----------
java.exe_1_20251015T143444892671Z.parquet     36017158
 
demo-user@ATP19C> begin
  2     dbms_cloud.create_external_table(
  3             table_name =>'MY_DEMO_EXT'
  4             , credential_name =>'MY_DEMO_CRED'
  5             , file_uri_list => :uri||'java.exe_1_20251015T143444892671Z.parquet'
  6             , format => json_object('type' value 'parquet', 'schema' value 'first') );
  7  end;
  8  /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> set linesize 71
demo-user@ATP19C> desc my_demo_ext
 Name                                Null?    Type
 ----------------------------------- -------- -------------------------
 OWNER                                        VARCHAR2(32767)
 OBJECT_NAME                                  VARCHAR2(32767)
 SUBOBJECT_NAME                               VARCHAR2(32767)
 OBJECT_ID                                    NUMBER(38,20)
 DATA_OBJECT_ID                               NUMBER(38,20)
 OBJECT_TYPE                                  VARCHAR2(32767)
 CREATED                                      DATE
 LAST_DDL_TIME                                DATE
 TIMESTAMP                                    VARCHAR2(32767)
 STATUS                                       VARCHAR2(32767)
 TEMPORARY                                    VARCHAR2(32767)
 GENERATED                                    VARCHAR2(32767)
 SECONDARY                                    VARCHAR2(32767)
 NAMESPACE                                    NUMBER(38,20)
 EDITION_NAME                                 VARCHAR2(32767)
 SHARING                                      VARCHAR2(32767)
 EDITIONABLE                                  VARCHAR2(32767)
 ORACLE_MAINTAINED                            VARCHAR2(32767)
 APPLICATION                                  VARCHAR2(32767)
 DEFAULT_COLLATION                            VARCHAR2(32767)
 DUPLICATED                                   VARCHAR2(32767)
 SHARDED                                      VARCHAR2(32767)
 CREATED_APPID                                NUMBER(38,20)
 CREATED_VSNID                                NUMBER(38,20)
 MODIFIED_APPID                               NUMBER(38,20)
 MODIFIED_VSNID                               NUMBER(38,20)
 ID                                           NUMBER(38,20)
 
demo-user@ATP19C> select count(*) from my_demo_ext;
 
    COUNT(*)
------------
   1,000,000
 
demo-user@ATP19C>
 
 
If you have access to Database Actions, the web UI for Autonomous Database, you can also make use of the Data Load option under Data Studio to load the data in a user-friendly manner.
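 
If you prefer to script that load instead, DBMS_CLOUD.COPY_DATA can pull the Parquet file straight into a regular table. A minimal sketch, assuming a pre-created target table MY_DEMO_TAB (a hypothetical name) whose columns match the file:
 
begin
    dbms_cloud.copy_data(
        table_name        => 'MY_DEMO_TAB'   -- hypothetical; must already exist
        , credential_name => 'MY_DEMO_CRED'
        , file_uri_list   => :uri||'java.exe_1_20251015T143444892671Z.parquet'
        , format          => json_object('type' value 'parquet') );
end;
/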
 
After creating these external tables, you can use them like any other table in the Oracle database.
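 
For example, a quick aggregation straight off the Parquet file (a minimal sketch; the results naturally depend on what is in your file):
 
-- the external table behaves like any other table in the query;
-- the scan runs against the Parquet file in object storage
select object_type, count(*)
from my_demo_ext
where status = 'VALID'
group by object_type
order by count(*) desc
fetch first 5 rows only;
 
You can join it to local tables or use it as the source of a CTAS or INSERT ... SELECT load just the same.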
 
Also, in certain cases where you don't have the privilege to create tables, we can leverage the inline external table feature to access Parquet files from object storage. To do that we need to obtain information about the Parquet files, such as the number of columns and their data types.
 
The easiest way to get insights into Parquet files is to make use of the Python-based parquet-tools. When you have Python 3 in place it is super easy to install: just run the following command and it's ready to use.

 
C:\Users\Rajeshwaran Jeyabal>pip3 install parquet-tools
Defaulting to user installation because normal site-packages is not writeable
Collecting parquet-tools
  Downloading parquet_tools-0.2.16-py3-none-any.whl.metadata (3.8 kB)
Collecting boto3<2.0.0,>=1.34.11 (from parquet-tools)
  Downloading boto3-1.40.52-py3-none-any.whl.metadata (6.7 kB)
…………….
…………….
…………….
…………….
 
Successfully installed boto3-1.40.52 botocore-1.40.52 colorama-0.4.6 halo-0.0.31 jmespath-1.0.1 log_symbols-0.0.14 numpy-2.3.3 pandas-2.3.3 parquet-tools-0.2.16 pyarrow-21.0.0 python-dateutil-2.9.0.post0 pytz-2025.2 s3transfer-0.14.0 six-1.17.0 spinners-0.0.24 tabulate-0.9.0 termcolor-3.1.0 thrift-0.16.0 tzdata-2025.2 urllib3-2.5.0
 
Now you have a simple and powerful tool. Just run the "inspect" command to learn how the file was created, what compression is used, how many columns there are, and how many row groups there are.
 
C:\Users\Rajeshwaran Jeyabal>parquet-tools.exe inspect "C:\Users\Rajeshwaran Jeyabal\Downloads\BUCKET02_java.exe_1_20251015T143444892671Z.parquet"
 
############ file meta data ############
created_by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
num_columns: 27
num_rows: 1000000
num_row_groups: 174
format_version: 1.0
serialized_size: 419719
 
 
############ Columns ############
OWNER
OBJECT_NAME
SUBOBJECT_NAME
OBJECT_ID
DATA_OBJECT_ID
OBJECT_TYPE
CREATED
LAST_DDL_TIME
TIMESTAMP
STATUS
TEMPORARY
GENERATED
SECONDARY
NAMESPACE
EDITION_NAME
SHARING
EDITIONABLE
ORACLE_MAINTAINED
APPLICATION
DEFAULT_COLLATION
DUPLICATED
SHARDED
CREATED_APPID
CREATED_VSNID
MODIFIED_APPID
MODIFIED_VSNID
ID
 
############ Column(OWNER) ############
name: OWNER
path: OWNER
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 95%)
 
############ Column(OBJECT_NAME) ############
name: OBJECT_NAME
path: OBJECT_NAME
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 65%)
 
…………………..
…………………..
…………………..
…………………..
 
 
############ Column(ID) ############
name: ID
path: ID
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=38, scale=20)
converted_type (legacy): DECIMAL
compression: SNAPPY (space_saved: 41%)
 
 
Using the above list of columns from the Parquet file, we can create an inline external table like this:
 
demo-user@ATP19C> with rws as (
  2  select *
  3  from external(
  4      ( OWNER    VARCHAR2(4000)
  5     , OBJECT_NAME    VARCHAR2(4000)
  6     , SUBOBJECT_NAME    VARCHAR2(4000)
  7     , OBJECT_ID    NUMBER(38,20)
  8     , DATA_OBJECT_ID    NUMBER(38,20)
  9     , OBJECT_TYPE    VARCHAR2(4000)
 10     , CREATED    DATE
 11     , LAST_DDL_TIME    DATE
 12     , TIMESTAMP    VARCHAR2(4000)
 13     , STATUS    VARCHAR2(4000)
 14     , TEMPORARY    VARCHAR2(4000)
 15     , GENERATED    VARCHAR2(4000)
 16     , SECONDARY    VARCHAR2(4000)
 17     , NAMESPACE    NUMBER(38,20)
 18     , EDITION_NAME    VARCHAR2(4000)
 19     , SHARING    VARCHAR2(4000)
 20     , EDITIONABLE    VARCHAR2(4000)
 21     , ORACLE_MAINTAINED    VARCHAR2(4000)
 22     , APPLICATION    VARCHAR2(4000)
 23     , DEFAULT_COLLATION    VARCHAR2(4000)
 24     , DUPLICATED    VARCHAR2(4000)
 25     , SHARDED    VARCHAR2(4000)
 26     , CREATED_APPID    NUMBER(38,20)
 27     , CREATED_VSNID    NUMBER(38,20)
 28     , MODIFIED_APPID    NUMBER(38,20)
 29     , MODIFIED_VSNID    NUMBER(38,20)
 30     , ID    NUMBER(38,20) )
 31      type oracle_bigdata
 32      access parameters(
 33          com.oracle.bigdata.fileformat = parquet
 34          com.oracle.bigdata.credential.name = MY_DEMO_CRED
 35          com.oracle.bigdata.removequotes = true
 36          com.oracle.bigdata.dateformat = 'YYYY-MM-DD hh24:mi:ss'
 37          )
 38  location ('https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/java.exe_1_20251015T143444892671Z.parquet')
 39  ) )
 40  select trunc(created,'Y'), count(*)
 41  from rws
 42  group by trunc(created,'Y');
 
TRUNC(CREAT     COUNT(*)
----------- ------------
01-JAN-2022      142,042
01-JAN-2023       57,636
01-JAN-2024      271,683
01-JAN-2021       92,060
01-JAN-2020      436,579
 
demo-user@ATP19C>
 
 

Friday, October 17, 2025

Parquet Files - Part I

 
Apache Parquet is a popular columnar file format that stores data efficiently: it employs column-wise compression, chooses encoding and compression strategies based on the data type, and stores metadata alongside the data. Generally, the less data you have to read and transfer, the better your query performance and the lower your running costs. And obviously, the smaller the data footprint, the smaller the cost of storing it. Parquet helps us here quite a bit.
 
Let's see why Parquet offers such an increase in performance when processing external data.
Parquet files are organized in columns, which means only the columns listed in the SQL statement need to be read by the compute layer, much in the spirit of an Oracle storage index avoiding I/O the query does not need.

If you need to read a single column from a Parquet file, you only read the corresponding column chunks from all row groups.
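 
To make that concrete, here is a hedged sketch against a Parquet-backed external table (same placeholder style as the query further down): a statement that references a single column only needs that column's chunks from each row group.
 
-- only the col1 column chunks are read from every row group;
-- the other columns in the file are never transferred
select col1, count(*)
from <your_table_name>
group by col1;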
 
All the technical details of how Parquet files are organized are covered in the open-source Parquet format specification, but in short: each file is divided into chunks called row groups, and each row group carries metadata (such as min/max values) for every column. Row groups are a way to implement hybrid row/column storage: rows are grouped into row groups, and within a row group the data is stored column by column, so a single row group contains the data of all columns for a subset of the rows. This is pretty close to what Oracle HCC and storage indexes do. The query engine can use this metadata to skip reading parts of the file based on the SQL predicates.

For example, the following query
 
select * from <your_table_name> where col1 = 1;
 
will read only the row groups whose min/max metadata for col1 can contain the value 1 and skip all the others.
 
Parquet is also a self-describing file format: metadata is stitched together with the data, meaning that with only the file in hand a user can define a table; nothing extra is required. Everything is compressed by default, and since it is a columnar file format we get good compression ratios.
 
Parquet is an open, widely adopted format. It has been available for quite some time and has become the de facto standard for storing data in a columnar format outside your database engine. So whenever your data does not require the security and robustness of being stored, managed, and protected by your database, and you have a choice, you should consider storing it in Parquet. In this blog post we will compare the size of the output files produced by other open formats, such as CSV, against the Parquet file sizes.
 
We have a table with about 160 MB of data.
 
demo-user@ATP19C> exec show_space('BIG_TABLE');
Unformatted Blocks .....................               0
FS1 Blocks (0-25) ......................               0
FS2 Blocks (25-50) .....................               0
FS3 Blocks (50-75) .....................               0
FS4 Blocks (75-100).....................              18
Full Blocks ............................          20,526
Total Blocks............................          20,920
Total Bytes.............................     171,376,640
Total MBytes............................             163
Unused Blocks...........................               0
Unused Bytes............................               0
Last Used Ext FileId....................           2,731
Last Used Ext BlockId...................         575,872
Last Used Block.........................              24
 
PL/SQL procedure successfully completed.
 
 
When exported to object storage as CSV:
 
demo-user@ATP19C> variable uri varchar2(100)
demo-user@ATP19C> exec :uri := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/';
 
PL/SQL procedure successfully completed.
demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'csv')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED'
  3      , :uri )
  4* where object_name like '%.csv'    ;
 
COUNT(*) SIZE_MB
________ _______
      17  165.37
 
The total file size was close to 165 MB. When exported as JSON, it was close to 577 MB:
 
demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'json')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED'
  3      , :uri )
  4* where object_name like '%.json'   ;
 
COUNT(*) SIZE_MB
________ _______
      58     577
 
With XML it was close to 426 MB:
 
demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'xml')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri )
  3* where object_name like '%.xml' ;
 
COUNT(*) SIZE_MB
________ _______
      43  426.02
 
Whereas with Parquet it was just around 34 MB:
 

demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'parquet')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri )
  3* where object_name like '%.parquet' ;
 
COUNT(*) SIZE_MB
________ _______
       1     34.35
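 
To recap, the same ~163 MB table came out of DBMS_CLOUD.EXPORT_DATA as:
 
Format      Files   Size (MB)
--------    -----   ---------
CSV            17      165.37
JSON           58      577.00
XML            43      426.02
Parquet         1       34.35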
 
In the next blog post we will see how to read these Parquet files from object storage into the Oracle database using Autonomous Database features.