How can I load a PDF file using DI

I wish to load a PDF into a database using DI. All the file format options seem to relate to text files. How can I specifiy that it is binary data? I want to load the whole file into a BLOB column in Oracle 9i.

Any ideas?


dataintegrator.net (BOB member since 2006-01-12)

Connecting to a PDF file??
Not possible with DI I 'm afraid.


dataintegrator.net (BOB member since 2006-01-12)

Right, DI does not support blobs at this time.
You have another solution for doing that:
First of all you need such procedure onto your Oracle Schema :

–BEGIN–

/---------------------------------------/
/* Read a blob from a file, and write /
/
it to the database. /
/
---------------------------------------*/

create or replace directory image_dir as ‘/apps/temp/images’ ;

create or replace procedure proc_imp_jpg
(fname1 in varchar2, image_id1 in numeric) is

file1 bfile ;
lblob blob ;
len int ;
e_blob blob ;

begin

file1 := bfilename(‘IMAGE_DIR’,fname1);

e_blob := empty_blob();

insert into image_test (image_id, image_data)
values (image_id1, e_blob )
returning image_data into lblob ;

dbms_lob.fileopen(file1);

len := dbms_lob.getlength(file1) ;

dbms_lob.loadfromfile(lblob,file1,len);

dbms_lob.filecloseall();

commit;

exception
when others then
begin
dbms_output.put_line(sqlerrm);
dbms_lob.filecloseall();
commit;
end;

end ;
/

Ref: http://www.rocket99.com/oracle/oracle32.html

–END–

The it will depend on your OS platform:
You need to list all files that are present in the import directory:
So, do something like $DI_VARIABLE = exec(‘cat’, ‘/var/pdfimport/’, 4); if you’re on Unix or exec(‘cmd.exe’, ‘dir C:\temp\pdfimport’, 4); on Windows.

The you’ll have a variable containing the list of the files from your directory (not all, just what can fit in your variable size) (separated by spaces, unfortunatly) [I solved this using a perl] [You can also do using bat : FOR %i IN C:\temp\pdfimport\ ECHO %i ;]

You put that line at top of your DI job in a Script object;

Then a While where condition is that $DI_VARIABLE length() is > 0.
Inside the while you extract first file name getting position of the first delimiter etc.

You put this file in a $CURRENT_FILE variable and use it in a dummy datatflow placed inside the while.

Row Generation : 1 Row,
Query that maps generated column to a DUMMY field that will be mapped as sql(‘MyDATASTORE’, ‘call proc_imp_jpg(’’ || $CURRENT_FILE || ‘’,101)’),
that all to a dummy file format for closing the flow.

That’s up to me the most DI usage you can do in such processing.
I suggest you to just use DI Scheduler to run PL/SQL procedure that will list, iterate and load your files.

Hope can help,

Hervé.


dataintegrator.net (BOB member since 2006-01-12)

OK actually it depends on pdf file type (compressed or not).
PDF files are just like RTF or TXT files, they are simple text files that describes to the viewer what to do (images inside are UUEncoded, etc.)
So it’s pure text until it’s compressed [Unfortunatly you cannot build a solution based on that if you index more than your own PDF files].


dataintegrator.net (BOB member since 2006-01-12)

Hervé - thanks for sharing all this info!!
Much appreciated.
Nico


dataintegrator.net (BOB member since 2006-01-12)

Thanks for all the assistance. I managed to get another (text based) source for the data for now so have loaded it the standard way. But I will definitely need to do this in the future and will try your technique. One complication is that the db is one Unix and the files will be on Windows. So I will have to FTP them across as a first step. Be nice if we didnt need all the mucking around and could just load them directly!! Could be an enhancement request.

Also the part of the example that showed how to get a list of files and iterate through them was very useful. I am going to use this for another problem I am working on.

Thanks again.

Simon


dataintegrator.net (BOB member since 2006-01-12)

I you just need PDF content (Text) you can manage other way :

On Unix you can, for shure, find a pdf to text Perl Script of something.
On Windows you can create a vbs that will print this PDF to “Generic Text” Windows Standard Printer, specifying a filename in printer port mapping, you’ll get a txt file filled with all text-printable symbols from your PDF document (skip drawing lines, just grabb all chars sent to the printer).
You can see what it would look doing so : select all text from your PDF, copy and paste in notepad :wink:

Hope can help.


dataintegrator.net (BOB member since 2006-01-12)