Friday, October 17, 2025

Parquet Files - Part I

 
Apache Parquet is a popular columnar file format that stores data efficiently by employing column-wise compression and encoding strategies chosen per data type, and it stores metadata alongside the data. Generally, the less data you have to read and transfer, the better your query performance and the lower your running costs. And obviously, the smaller the data footprint, the smaller the cost of storing it. Parquet helps us here quite a bit.
 
Let's see why Parquet improves the performance of processing external data.
Parquet files are organized by column, which means the compute layer only needs to read the columns referenced in the SQL statement, similar in spirit to what Oracle storage indexes achieve.

 



If you need to read a single column from a Parquet file, you only read that column's chunk from each row group.
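As a minimal sketch, assuming an external table named BIG_TABLE_EXT is already mapped over the Parquet files (the name is made up for illustration; we sketch how to create such a table a little further below), a single-column projection like this only needs the col1 column chunks, while a select * would force every column chunk to be read:

select col1 from big_table_ext;  -- only col1's column chunks are fetched from each row group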
 
All the technical details of how Parquet files are organized are covered in the open-source Parquet format specification, but in short: each file is divided into chunks called row groups, and each row group carries metadata (such as the min/max value) for every column. Row groups are a way to implement hybrid row/column storage: rows are grouped into row groups, and within a row group the data is stored column by column, so a single row group contains data for all columns of a small number of rows. This is pretty close to what Oracle HCC and storage indexes do. A query engine can use this metadata to skip reading parts of the file based on the SQL predicates.

 

 
 

For example, the following query
 
Select * from <your_table_name> where col1 = 1;
 
will read only the row group whose min/max range for col1 can contain the value 1 (because of the predicate col1 = 1) and will skip the other row group.
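To make the skipping concrete, here is a purely illustrative set of footer statistics (the numbers are invented):

Row group 1: col1 min = 1,  max = 50
Row group 2: col1 min = 51, max = 100

The value 1 can only occur inside row group 1's min/max range, so the reader fetches row group 1 and prunes row group 2 without touching its data pages.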
 
Parquet is also a self-describing file format: the metadata travels together with the data. That means that, given just a file, a user can define a table over it; nothing extra is required. Everything is compressed by default, and since it is a columnar file format we get good compression ratios.
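As a hedged sketch of that self-describing property: DBMS_CLOUD.CREATE_EXTERNAL_TABLE can define a table straight over a Parquet file and derive the column list from the file's own metadata via 'schema' value 'first'. The table name BIG_TABLE_EXT is hypothetical; the credential and bucket URI are the ones used later in this post, and this assumes the Parquet files are already sitting in that bucket:

begin
    dbms_cloud.create_external_table(
        table_name => 'BIG_TABLE_EXT'   -- hypothetical name for this sketch
        , credential_name => 'MY_DEMO_CRED'
        , file_uri_list => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/*.parquet'
        , format => json_object('type' value 'parquet', 'schema' value 'first')
    );
end;
/

After that, a simple describe of BIG_TABLE_EXT shows column names and data types taken straight from the Parquet metadata; no column list had to be typed in.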
 
Parquet is an open, widely adopted format. It has been available for quite some time and has become the de facto standard for storing data in columnar form outside your database engine. So whenever your data does not require the security and robustness of being stored, managed, and protected by your database, and you have a choice, you should consider storing it in Parquet rather than in other open formats like CSV. In this blog post we will compare the size of the output files produced by each file format against the Parquet file sizes.
 
Got a table with about 160 MB of data (show_space below is the well-known user-defined space-reporting utility, not a built-in package).
 
demo-user@ATP19C> exec show_space('BIG_TABLE');
Unformatted Blocks .....................               0
FS1 Blocks (0-25) ......................               0
FS2 Blocks (25-50) .....................               0
FS3 Blocks (50-75) .....................               0
FS4 Blocks (75-100).....................              18
Full Blocks ............................          20,526
Total Blocks............................          20,920
Total Bytes.............................     171,376,640
Total MBytes............................             163
Unused Blocks...........................               0
Unused Bytes............................               0
Last Used Ext FileId....................           2,731
Last Used Ext BlockId...................         575,872
Last Used Block.........................              24
 
PL/SQL procedure successfully completed.
 
 
When exported to Object Storage as CSV:
 
demo-user@ATP19C> variable uri varchar2(100)
demo-user@ATP19C> exec :uri := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/';
 
PL/SQL procedure successfully completed.
demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'csv')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED'
  3      , :uri )
  4* where object_name like '%.csv'    ;
 
COUNT(*) SIZE_MB
________ _______
      17  165.37
 
The total CSV file size was close to 165 MB. When exported as JSON, it was close to 577 MB:
 
demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'json')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED'
  3      , :uri )
  4* where object_name like '%.json'   ;
 
COUNT(*) SIZE_MB
________ _______
      58     577
 
The XML export was close to 426 MB:
 
demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'xml')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri )
  3* where object_name like '%.xml' ;
 
COUNT(*) SIZE_MB
________ _______
      43  426.02
 
Whereas with Parquet it was just 34 MB:
 

demo-user@ATP19C> begin
  2      dbms_cloud.export_data(
  3          credential_name =>'MY_DEMO_CRED'
  4          , file_uri_list => :uri
  5          , query => ' select * from big_Table '
  6          , format => json_object('type' value 'parquet')
  7      ) ;
  8  end;
  9* /
 
PL/SQL procedure successfully completed.
 
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
  2  from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri )
  3* where object_name like '%.parquet' ;
 
COUNT(*) SIZE_MB
________ _______
       1   34.35
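Putting those numbers side by side for the same ~163 MB table:

Format      Files    Size (MB)
CSV            17       165.37
JSON           58       577.00
XML            43       426.02
Parquet         1        34.35

That makes the Parquet output close to 5x smaller than CSV and more than 16x smaller than JSON.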
 
In the next blog post we will look at how to read these Parquet files from Object Storage into the Oracle database using Autonomous Database features.
 
 
