Got an interesting question from a fellow co-worked about how to read a
specific CSV file from a zipped contents without unzipping it. However with
Oracle 11g and above it is possible through using pre-processor utility added
up in External table family.
One can think of the preprocessor as like a unix pipe style option.
The preprocessor program converts the data to a record format supported
by the access driver and then writes the converted record data to standard
output (stdout), which the access driver reads as input.
Here is a step-by-step approach to use pre-processor utility in
external table to read a specific CSV file from a zipped contents.
Step#1 – created three CSV
files
[oracle@en701c202vm02
TEMP]$ pwd
/oracleexabkp/TEMP
[oracle@en701c202vm02
TEMP]$ ls -ltr
total
20
-rwxrwxrwx
1 oracle oinstall 132 Jul 11 04:42 read_zip1.sh
-rw-r--r--
1 oracle oinstall 143 Jul 11 05:38 emp1_header.csv
-rw-r--r--
1 oracle oinstall 871 Jul 11 05:38 emp1.csv
-rw-r--r--
1 oracle oinstall 286 Jul 11 05:38 dept.csv
Step#2 – Here is the row
count for three CSV files.
[oracle@en701c202vm02
TEMP]$ wc -l *.csv
6 dept.csv
14 emp1.csv
3 emp1_header.csv
23 total
Step#3 – put them together
into a ZIP file.
[oracle@en701c202vm02
TEMP]$ zip e1.zip emp1.csv
adding: emp1.csv (deflated 61%)
[oracle@en701c202vm02
TEMP]$ zip e1.zip dept.csv
adding: dept.csv (deflated 61%)
[oracle@en701c202vm02
TEMP]$ zip e1.zip emp1_header.csv
adding: emp1_header.csv (deflated 27%)
Step#4 – now verify all
three files are available inside the ZIP file.
[oracle@en701c202vm02
TEMP]$ unzip -Z1 e1.zip
emp1.csv
dept.csv
emp1_header.csv
Step#5 – now the goal is
the process only emp1.csv file out of all those three files inside
the zip, so did a small shell script like this.
[oracle@en701c202vm02
TEMP]$ cat read_zip1.sh
#!/bin/bash
x="$(/usr/bin/unzip
-Z1 $1 | /bin/egrep -i 'emp[0-9]+.csv')"
echo
$x
#
/usr/bin/unzip -c $x
/usr/bin/unzip
-p $1 $x
Running up that shell script
with a sample zip file, produced an output like this
[oracle@en701c202vm02
TEMP]$ sh read_zip1.sh e1.zip
emp1.csv
7369,"SMITH","CLERK",7902,17-DEC-1980
12:00:00 AM,800,,20
7499,"ALLEN","SALESMAN",7698,20-FEB-1981
12:00:00 AM,1600,300,30
7521,"WARD","SALESMAN",7698,22-FEB-1981
12:00:00 AM,1250,500,30
7566,"JONES","MANAGER",7839,02-APR-1981
12:00:00 AM,2975,,20
7654,"MARTIN","SALESMAN",7698,28-SEP-1981
12:00:00 AM,1250,1400,30
7698,"BLAKE","MANAGER",7839,01-MAY-1981
12:00:00 AM,2850,,30
7782,"CLARK","MANAGER",7839,09-JUN-1981
12:00:00 AM,2450,,10
7788,"SCOTT","ANALYST",7566,09-DEC-1982
12:00:00 AM,3000,,20
7839,"KING","PRESIDENT",,17-NOV-1981
12:00:00 AM,5000,,10
7844,"TURNER","SALESMAN",7698,08-SEP-1981
12:00:00 AM,1500,0,30
7876,"ADAMS","CLERK",7788,12-JAN-1983
12:00:00 AM,1100,,20
7900,"JAMES","CLERK",7698,03-DEC-1981
12:00:00 AM,950,,30
7902,"FORD","ANALYST",7566,03-DEC-1981
12:00:00 AM,3000,,20
7934,"MILLER","CLERK",7782,23-JAN-1982
12:00:00 AM,1300,,10
[oracle@en701c202vm02
TEMP]$
Step#6 – now on the
database side, going to create an External table like this.
We are going to tell the
ZIP file to this external table from the LOCATION section with PREPROCESSOR
component pointing to Shell script created in the Step#5, highlighted below.
demo@PDB1>
select directory_path
2 from
all_directories
3
where directory_name ='DIR2' ;
DIRECTORY_PATH
--------------------
/oracleexabkp/TEMP
demo@PDB1>
select grantor,privilege
2 from
user_tab_privs
3
where table_name ='DIR2'
4 and
grantor = user ;
GRANTOR PRIVILEGE
----------
----------
C##RAJESH EXECUTE
C##RAJESH READ
C##RAJESH WRITE
demo@PDB1>
CREATE TABLE emp_ext
2 (
3
EMPNO varchar(30),
4
ENAME VARCHAR2(10),
5
JOB varchar(30),
6
MGR varchar(30),
7
HIREDATE varchar(30),
8
SAL varchar(30),
9
COMM varchar(30),
10 DEPTNO varchar(30)
11 )
12 ORGANIZATION external
13 (
14 TYPE oracle_loader
15 DEFAULT DIRECTORY DIR2
16 ACCESS PARAMETERS
17 (
18 RECORDS DELIMITED BY NEWLINE
19 PREPROCESSOR DIR2:'read_zip1.sh'
20 BADFILE 'data.bad'
21 LOGFILE 'log.txt'
22 READSIZE 1048576
23 FIELDS TERMINATED BY ","
OPTIONALLY ENCLOSED BY '"' LDRTRIM
24 REJECT ROWS WITH ALL NULL FIELDS
25 (
26 EMPNO char(255) TERMINATED BY
"," OPTIONALLY ENCLOSED BY '"',
27 ENAME char(255) TERMINATED BY
"," OPTIONALLY ENCLOSED BY '"',
28 JOB
char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
29 MGR
char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
30 HIREDATE char(255) TERMINATED BY
"," OPTIONALLY ENCLOSED BY '"',
31 SAL
char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
32 COMM
char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
33 DEPTNO char(255) TERMINATED BY
"," OPTIONALLY ENCLOSED BY '"'
34 )
35 )
36 location
37 (
38 'e1.zip'
39 )
40 )REJECT LIMIT UNLIMITED
41 /
Table
created.
Step#7 – post that we can
query the external table, that would just process the data from emp1.csv within
the e1.zip (note: the count matches with the count identified in the Step#2)
demo@PDB1>
select count(*) from emp_ext ;
COUNT(*)
----------
14
Or even using External
table - query time modifications feature introduced in 12.2 database, like
this.
demo@PDB1>
select count(*) from emp_ext
2
external modify(
3 location('e1.zip') ) ;
COUNT(*)
----------
14
All these are done without
unzip emp.csv from e1.zip file.
[oracle@en701c202vm02
TEMP]$ pwd
/oracleexabkp/TEMP
[oracle@en701c202vm02
TEMP]$ ls -ltr emp1.csv
ls: cannot access emp1.csv: No such
file or directory
[oracle@en701c202vm02
TEMP]$ unzip -Z1 e1.zip
emp1.csv
dept.csv
emp1_header.csv
[oracle@en701c202vm02
TEMP]$
Hi Rajesh,
ReplyDeleteI was trying to use the shell script pattern here into my logic. And getting one issue.
So i have a zip file named : say : BPA.zip.
ZIP has 3 CSV files inside it as:- A1.csv , B1.csv , C1.csv.
I Wrote 03 preprocessor shells for these which look like :-
___________________________________________________________________________________________
#!/bin/bash
x="$(/usr/bin/unzip -Z1 $1 | /bin/grep -i '')"
# /usr/bin/unzip -c $x
/usr/bin/unzip -p $1 $x
Aboce CSV_FILE_NAME is A1.csv , B1.csv and C1.csv : 03 different .bat file for each csv.
___________________________________________________________________________________________
Issue is :For example if i create the ZIP folder with data in only A1.csv & B1.csv.
And, there is no data in C1.csv file.
Now when I am querying the external table for C1, it is somehow showing the data of A1.csv and B1.csv together in C1's external table.
When ideally C1's external table should have been empty as the corresponding C1.csv is empty.
I am doing something incorrect in the way shell script if reading the ZIP. Can you please help.
My Database is Oracle Cloud Infrastructure Database.