Wednesday, July 17, 2019

Read a zipped contents without unzip

Got an interesting question from a fellow co-worked about how to read a specific CSV file from a zipped contents without unzipping it. However with Oracle 11g and above it is possible through using pre-processor utility added up in External table family.
 
One can think of the preprocessor as like a unix pipe style option.
 
The preprocessor program converts the data to a record format supported by the access driver and then writes the converted record data to standard output (stdout), which the access driver reads as input.
 
Here is a step-by-step approach to use pre-processor utility in external table to read a specific CSV file from a zipped contents.
 
Step#1 – created three CSV files
 
[oracle@en701c202vm02 TEMP]$ pwd
/oracleexabkp/TEMP
[oracle@en701c202vm02 TEMP]$ ls -ltr
total 20
-rwxrwxrwx 1 oracle oinstall  132 Jul 11 04:42 read_zip1.sh
-rw-r--r-- 1 oracle oinstall  143 Jul 11 05:38 emp1_header.csv
-rw-r--r-- 1 oracle oinstall  871 Jul 11 05:38 emp1.csv
-rw-r--r-- 1 oracle oinstall  286 Jul 11 05:38 dept.csv
 
Step#2 – Here is the row count for three CSV files.
 
[oracle@en701c202vm02 TEMP]$ wc -l *.csv
   6 dept.csv
  14 emp1.csv
   3 emp1_header.csv
  23 total
 
Step#3 – put them together into a ZIP file.
 
[oracle@en701c202vm02 TEMP]$ zip e1.zip emp1.csv
  adding: emp1.csv (deflated 61%)
[oracle@en701c202vm02 TEMP]$ zip e1.zip dept.csv
  adding: dept.csv (deflated 61%)
[oracle@en701c202vm02 TEMP]$ zip e1.zip emp1_header.csv
  adding: emp1_header.csv (deflated 27%)
 
Step#4 – now verify all three files are available inside the ZIP file.
 
[oracle@en701c202vm02 TEMP]$ unzip -Z1 e1.zip
emp1.csv
dept.csv
emp1_header.csv
 
Step#5 – now the goal is the process only emp1.csv file out of all those three files inside the zip, so did a small shell script like this.
 
[oracle@en701c202vm02 TEMP]$ cat read_zip1.sh
#!/bin/bash
x="$(/usr/bin/unzip -Z1 $1 | /bin/egrep -i 'emp[0-9]+.csv')"
echo $x
# /usr/bin/unzip -c $x
/usr/bin/unzip -p $1 $x
 
Running up that shell script with a sample zip file, produced an output like this
 
[oracle@en701c202vm02 TEMP]$ sh read_zip1.sh e1.zip
emp1.csv
7369,"SMITH","CLERK",7902,17-DEC-1980 12:00:00 AM,800,,20
7499,"ALLEN","SALESMAN",7698,20-FEB-1981 12:00:00 AM,1600,300,30
7521,"WARD","SALESMAN",7698,22-FEB-1981 12:00:00 AM,1250,500,30
7566,"JONES","MANAGER",7839,02-APR-1981 12:00:00 AM,2975,,20
7654,"MARTIN","SALESMAN",7698,28-SEP-1981 12:00:00 AM,1250,1400,30
7698,"BLAKE","MANAGER",7839,01-MAY-1981 12:00:00 AM,2850,,30
7782,"CLARK","MANAGER",7839,09-JUN-1981 12:00:00 AM,2450,,10
7788,"SCOTT","ANALYST",7566,09-DEC-1982 12:00:00 AM,3000,,20
7839,"KING","PRESIDENT",,17-NOV-1981 12:00:00 AM,5000,,10
7844,"TURNER","SALESMAN",7698,08-SEP-1981 12:00:00 AM,1500,0,30
7876,"ADAMS","CLERK",7788,12-JAN-1983 12:00:00 AM,1100,,20
7900,"JAMES","CLERK",7698,03-DEC-1981 12:00:00 AM,950,,30
7902,"FORD","ANALYST",7566,03-DEC-1981 12:00:00 AM,3000,,20
7934,"MILLER","CLERK",7782,23-JAN-1982 12:00:00 AM,1300,,10
[oracle@en701c202vm02 TEMP]$
 
Step#6 – now on the database side, going to create an External table like this.
 
We are going to tell the ZIP file to this external table from the LOCATION section with PREPROCESSOR component pointing to Shell script created in the Step#5, highlighted below.
 
demo@PDB1> select directory_path
  2  from all_directories
  3  where directory_name ='DIR2' ;
 
DIRECTORY_PATH
--------------------
/oracleexabkp/TEMP
 
demo@PDB1> select grantor,privilege
  2  from user_tab_privs
  3  where table_name ='DIR2'
  4  and grantor = user ;
 
GRANTOR    PRIVILEGE
---------- ----------
C##RAJESH  EXECUTE
C##RAJESH  READ
C##RAJESH  WRITE
 
demo@PDB1> CREATE TABLE emp_ext
  2  (
  3    EMPNO varchar(30),
  4    ENAME VARCHAR2(10),
  5    JOB varchar(30),
  6    MGR varchar(30),
  7    HIREDATE varchar(30),
  8    SAL varchar(30),
  9    COMM varchar(30),
10    DEPTNO varchar(30)
11  )
12  ORGANIZATION external
13  (
14    TYPE oracle_loader
15    DEFAULT DIRECTORY DIR2
16    ACCESS PARAMETERS
17    (
18      RECORDS DELIMITED BY NEWLINE
19      PREPROCESSOR DIR2:'read_zip1.sh'
20      BADFILE 'data.bad'
21      LOGFILE 'log.txt'
22      READSIZE 1048576
23      FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' LDRTRIM
24      REJECT ROWS WITH ALL NULL FIELDS
25      (
26        EMPNO char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
27        ENAME char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
28        JOB   char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
29        MGR   char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
30        HIREDATE char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
31        SAL    char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
32        COMM   char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
33        DEPTNO char(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
34      )
35    )
36    location
37    (
38      'e1.zip'
39    )
40  )REJECT LIMIT UNLIMITED
41   /
 
Table created.
 
 
Step#7 – post that we can query the external table, that would just process the data from emp1.csv within the e1.zip (note: the count matches with the count identified in the Step#2)
 
demo@PDB1> select count(*) from emp_ext ;
 
  COUNT(*)
----------
        14
 
Or even using External table - query time modifications feature introduced in 12.2 database, like this.
 
demo@PDB1> select count(*) from emp_ext
 2  external modify(
  3      location('e1.zip') ) ;
 
  COUNT(*)
----------
        14
 
All these are done without unzip emp.csv from e1.zip file.
 
[oracle@en701c202vm02 TEMP]$ pwd
/oracleexabkp/TEMP
[oracle@en701c202vm02 TEMP]$ ls -ltr emp1.csv
ls: cannot access emp1.csv: No such file or directory
[oracle@en701c202vm02 TEMP]$ unzip -Z1 e1.zip
emp1.csv
dept.csv
emp1_header.csv
[oracle@en701c202vm02 TEMP]$
 

1 comment:

  1. Hi Rajesh,

    I was trying to use the shell script pattern here into my logic. And getting one issue.

    So i have a zip file named : say : BPA.zip.
    ZIP has 3 CSV files inside it as:- A1.csv , B1.csv , C1.csv.


    I Wrote 03 preprocessor shells for these which look like :-
    ___________________________________________________________________________________________
    #!/bin/bash
    x="$(/usr/bin/unzip -Z1 $1 | /bin/grep -i '')"
    # /usr/bin/unzip -c $x
    /usr/bin/unzip -p $1 $x

    Aboce CSV_FILE_NAME is A1.csv , B1.csv and C1.csv : 03 different .bat file for each csv.
    ___________________________________________________________________________________________

    Issue is :For example if i create the ZIP folder with data in only A1.csv & B1.csv.
    And, there is no data in C1.csv file.
    Now when I am querying the external table for C1, it is somehow showing the data of A1.csv and B1.csv together in C1's external table.
    When ideally C1's external table should have been empty as the corresponding C1.csv is empty.

    I am doing something incorrect in the way shell script if reading the ZIP. Can you please help.
    My Database is Oracle Cloud Infrastructure Database.

    ReplyDelete