Apache Parquet is a popular columnar file format that stores data efficiently: it applies column-wise compression, picks encoding and compression strategies based on each column's data type, and stores metadata alongside the data. Generally, the less data you have to read and transfer, the better your query performance and the lower your running costs. And obviously, the smaller the data footprint, the cheaper it is to store. Parquet helps us here quite a bit.
Parquet files are organized by column, which means the compute layer only has to read the columns listed in the SQL statement, quite similar to an Oracle storage index. If you need to read a single column from a Parquet file, you read only the corresponding column chunks from all row groups.
All the technical details of how Parquet files are organized are covered in the open-source Parquet format specification; in short, each file is divided into chunks (called row groups), and for every row group the file stores metadata for each column (such as min/max values). Row groups are a way to implement hybrid row/column storage: rows are grouped into row groups, and within a row group the data is stored column by column, so a single row group holds the data of all columns for a small number of rows. This is pretty close to what Oracle HCC and storage indexes do. A query engine can use this metadata and skip reading parts of the file based on SQL predicates.
For example, the following query

select * from <your_table_name> where col1 = 1;

will read only the row group whose col1 min/max range contains 1 (because of the predicate col1 = 1) and skip the other row groups.
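The row-group pruning described above can be sketched in plain Python. This is an illustration of the idea, not a real Parquet reader; the row groups and their min/max statistics below are hypothetical sample data standing in for what a reader would find in the file footer.

```python
# Each row group carries per-column min/max statistics, as in a Parquet footer.
# These two row groups are made-up sample data for illustration.
row_groups = [
    {"col1": {"min": 1, "max": 100}},    # row group 0
    {"col1": {"min": 101, "max": 200}},  # row group 1
]

def groups_to_read(row_groups, column, value):
    """Return the indexes of row groups that may contain rows where column = value."""
    keep = []
    for i, stats in enumerate(row_groups):
        s = stats[column]
        # Skip the row group entirely when the predicate value falls
        # outside its [min, max] range -- no data needs to be read from it.
        if s["min"] <= value <= s["max"]:
            keep.append(i)
    return keep

# WHERE col1 = 1 -> only row group 0 can match, row group 1 is skipped.
print(groups_to_read(row_groups, "col1", 1))  # [0]
```

The same check generalizes to range predicates (overlap test against the min/max interval) and is exactly why selective filters read only a fraction of the file.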
Parquet is also a self-describing file format: the metadata travels together with the data, so with only the file in hand a user can define a table over it, nothing extra required. Everything is compressed by default, and since it is a columnar file format we get good compression ratios.
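A rough way to see why a columnar layout compresses well: serialize the same rows row-by-row and column-by-column and compress both. This toy sketch uses JSON and zlib rather than Parquet's actual encodings, and the table contents are made up, but the effect is the same: keeping one column's (similar) values contiguous gives the codec much more to work with.

```python
import json
import random
import zlib

# Made-up sample table: 10,000 rows with a sequential id, a low-cardinality
# status column, and a small cyclic amount column.
random.seed(42)
rows = [
    {"id": i, "status": random.choice(["OPEN", "CLOSED"]), "amount": i % 10}
    for i in range(10_000)
]

# Row-wise layout: values of different columns interleaved, row after row.
row_wise = json.dumps(rows).encode()

# Column-wise layout: each column's values stored contiguously, as in Parquet.
col_wise = json.dumps(
    {col: [r[col] for r in rows] for col in ("id", "status", "amount")}
).encode()

row_size = len(zlib.compress(row_wise))
col_size = len(zlib.compress(col_wise))
print(f"row-wise compressed: {row_size} bytes, column-wise compressed: {col_size} bytes")
```

Real Parquet does far better still, because per-column encodings (dictionary, run-length, bit-packing) are chosen before the general-purpose codec even runs.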
Parquet is an open, widely adopted format. It has been available for quite some time and has become the de facto standard for storing data in a columnar format outside your database engine. So whenever your data does not require the security and robustness of being stored, managed, and protected by your database and you have a choice, you should consider storing it in Parquet rather than in other open formats like CSV. In this blog post we will compare the size of the output files produced by each file format against the Parquet file size.
We start with a table holding about 160 MB of data:
demo-user@ATP19C> exec show_space('BIG_TABLE');
Unformatted Blocks ..................... 0
FS1 Blocks (0-25) ...................... 0
FS2 Blocks (25-50) ..................... 0
FS3 Blocks (50-75) ..................... 0
FS4 Blocks (75-100)..................... 18
Full Blocks ............................ 20,526
Total Blocks............................ 20,920
Total Bytes............................. 171,376,640
Total MBytes............................ 163
Unused Blocks........................... 0
Unused Bytes............................ 0
Last Used Ext FileId.................... 2,731
Last Used Ext BlockId................... 575,872
Last Used Block......................... 24
PL/SQL procedure successfully completed.
When exported to Object Storage as CSV:
demo-user@ATP19C> variable uri varchar2(100)
demo-user@ATP19C> exec :uri := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcglquusbz6/b/MY_DEMO_BUCKET/o/BUCKET02/';
PL/SQL procedure successfully completed.
demo-user@ATP19C> begin
2 dbms_cloud.export_data(
3 credential_name =>'MY_DEMO_CRED'
4 , file_uri_list => :uri
5 , query => ' select * from big_Table '
6 , format => json_object('type' value 'csv')
7 ) ;
8 end;
9* /
PL/SQL procedure successfully completed.
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
2 from dbms_cloud.list_objects( 'MY_DEMO_CRED'
3 , :uri )
4* where object_name like '%.csv' ;
COUNT(*) SIZE_MB
________ _______
17 165.37
The total file size was close to 165 MB. When exported as JSON it was close to 577 MB:
demo-user@ATP19C> begin
2 dbms_cloud.export_data(
3 credential_name =>'MY_DEMO_CRED'
4 , file_uri_list => :uri
5 , query => ' select * from big_Table '
6 , format => json_object('type' value 'json')
7 ) ;
8 end;
9* /
PL/SQL procedure successfully completed.
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
2 from dbms_cloud.list_objects( 'MY_DEMO_CRED'
3 , :uri )
4* where object_name like '%.json' ;
COUNT(*) SIZE_MB
________ _______
58 577
The XML export was close to 426 MB:
demo-user@ATP19C> begin
2 dbms_cloud.export_data(
3 credential_name =>'MY_DEMO_CRED'
4 , file_uri_list => :uri
5 , query => ' select * from big_Table '
6 , format => json_object('type' value 'xml')
7 ) ;
8 end;
9* /
PL/SQL procedure successfully completed.
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
2 from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri )
3* where object_name like '%.xml' ;
COUNT(*) SIZE_MB
________ _______
43 426.02
Whereas with Parquet it was just about 34 MB:
demo-user@ATP19C> begin
2 dbms_cloud.export_data(
3 credential_name =>'MY_DEMO_CRED'
4 , file_uri_list => :uri
5 , query => ' select * from big_Table '
6 , format => json_object('type' value 'parquet')
7 ) ;
8 end;
9* /
PL/SQL procedure successfully completed.
demo-user@ATP19C> select count(*), round(sum(bytes)/1024/1024,2) size_mb
2 from dbms_cloud.list_objects( 'MY_DEMO_CRED', :uri )
3* where object_name like '%.parquet' ;
COUNT(*) SIZE_MB
________ _______
1 34.35
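Putting the numbers reported above side by side, relative to the roughly 163 MB the table occupies inside the database:

```python
# Export sizes observed in the demo above, in MB.
table_mb = 163
sizes_mb = {"csv": 165.37, "json": 577.0, "xml": 426.02, "parquet": 34.35}

# Print each format's size and its ratio to the in-database segment size.
for fmt, mb in sorted(sizes_mb.items(), key=lambda kv: kv[1]):
    print(f"{fmt:8s} {mb:8.2f} MB   {mb / table_mb:5.2f}x the in-database size")
```

Parquet comes out at roughly a fifth of the table's in-database size, while JSON and XML are several times larger than the table itself.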
In the next blog post we will look at how to read these Parquet files from Object Storage into Oracle Database using Autonomous Database features.