BusinessObjects Board

Reading Structured Data from PDF

Hello,

We have data in a PDF, as shown in the attachment ‘sample’. We need to extract it and load it in structured form into the target database, so that a universe can be created over it for analysis.

I have tried to extract the data using text entity extraction, but only a few port names are identified. The converted_text field contains the whole dump in a single field (which is still unstructured), as shown in sample 2.

In sample 3, we are again getting the whole dump in one field only. I am attaching screenshots for more clarity.

Please suggest how to proceed to fulfil this requirement.

Many thanks
sample 3.JPG
sample 2.png
sample.png


oops2001 (BOB member since 2016-03-22)

an idea could be:

  1. read the text file into a 3-column table: column 1 is a sequence (ID field), column 2 holds the whole file line (VAL field) and column 3 is initially NULL (TYPE field). You have 11 rows in 6 columns, hence the table should have a total of 11 x 6 = 66 rows;

  2. update the TYPE field 6 times, setting it to “Col_x” for the rows with ID from ((x - 1) * TotRows/6 + 1) to (x * TotRows/6), with “x” running from 1 to 6. With 66 rows, IDs 1 to 11 get Col_1, IDs 12 to 22 get Col_2, and so on.
    Now you should have something like the following:

ID - VAL - TYPE
1 - Rank - Col_1
2 - 1 - Col_1
...
11 - 10 - Col_1
12 - Port - Col_2
...

  3. pivot the table around the TYPE column, leaving the ID field out. Now you should have what you display in sample 3.png;

  4. move everything to your final table, excluding the first row (which contains the column names).
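
The four steps above can be sketched roughly as follows in plain Python, just to make the ID arithmetic concrete. This is only an illustration under assumptions: the function name pivot_column_dump, the three-column sample dump and the "Volume" column are made up, and the row key used for the pivot (position of the line inside its block) is my reading of the idea, not something taken from the attachments. In practice the same logic would live in your ETL job or in SQL.

```python
def pivot_column_dump(lines, n_cols):
    """Rebuild a table from a dump that lists each column's cells one after another."""
    if n_cols <= 0 or len(lines) % n_cols != 0:
        raise ValueError("line count must be a multiple of the column count")
    rows_per_col = len(lines) // n_cols

    # Steps 1 and 2: staging rows of (ID, VAL, TYPE), with TYPE derived from ID.
    staging = [
        (i, val, f"Col_{(i - 1) // rows_per_col + 1}")
        for i, val in enumerate(lines, start=1)
    ]

    # Step 3: pivot around TYPE. The position of a line inside its block
    # is used as the row key, so the ID itself is not carried over.
    table = [[None] * n_cols for _ in range(rows_per_col)]
    for i, val, col_type in staging:
        row = (i - 1) % rows_per_col
        col = int(col_type.split("_")[1]) - 1
        table[row][col] = val

    # Step 4: split off the first row, which holds the column names.
    return table[0], table[1:]


# Made-up sample: 3 columns of 3 lines each (1 header line + 2 values).
dump = ["Rank", "1", "2", "Port", "Port A", "Port B", "Volume", "100", "90"]
header, rows = pivot_column_dump(dump, n_cols=3)
print(header)
print(rows)
```

For the real file you would call it with n_cols=6 on the 66 extracted lines. It only works if every column block has exactly the same number of lines, so empty cells in the PDF would need to be handled before this step.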


CLS69 (BOB member since 2009-06-11)