Thread: [RESOLVED] Advice on reading tabular data from PDF format.

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Apr 2011
    Location
    England
    Posts
    421

    Hi guys,

    I am working on a small project where I need to read some text from PDF files. After some googling I tried iTextSharp, which can extract the text, but its documentation states that it is not possible to read tabular data. I understand why, but I wonder whether anybody has used a library that can.

    Here is an example of the layout of the PDF:
    [Attached image: pdf_example.jpg]
    With iTextSharp the text is read line by line, ignoring the break between the columns.

    Has anybody come across a (preferably free) library that is able to read text from sections so that I can grab the text from the left "column" and then the text from the right?

    Thanks
    Jay

  2. #2

    Thread Starter
    Hyperactive Member
    Join Date
    Apr 2011
    Location
    England
    Posts
    421

    Re: Advice on reading tabular data from PDF format.

    I have got it working using iTextSharp. My code is below in case anybody else finds it useful. It will extract the text within the left half of a PDF page. You would just change the rectangle dimensions to suit your own region:

    VB.NET Code:
    'iTextSharp Imports
    Imports iTextSharp.text
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser

    Public Function GetPDFTextFromRectangle(ByVal PDFPath As String, ByVal PageNo As Integer) As String

        Dim Reader As PdfReader = Nothing
        Dim PDFOutput As String = Nothing

        Try
            Reader = New PdfReader(PDFPath)

            'Get the Page Width/Height
            Dim PageHeight As Single = Reader.GetPageSize(PageNo).Height
            Dim PageWidth As Single = Reader.GetPageSize(PageNo).Width
            'Rectangle representing the area that contains the text
            '(here, the left half of the page). Parameters:
            '  Bottom-Left-X
            '  Bottom-Left-Y
            '  Top-Right-X
            '  Top-Right-Y
            Dim PageRect As New iTextSharp.text.Rectangle(0, 0, PageWidth / 2, PageHeight)
            'Required Filter and Strategy to extract text
            Dim Filter As RenderFilter = New RegionTextRenderFilter(PageRect)
            Dim Strategy As ITextExtractionStrategy = New FilteredTextRenderListener( _
                                                      New LocationTextExtractionStrategy, _
                                                      Filter)
            'Extract the text from the rectangle region of the given page number
            PDFOutput = PdfTextExtractor.GetTextFromPage(Reader, PageNo, Strategy)

        Catch Ex As Exception
            MessageBox.Show(Ex.Message)
        Finally
            'Reader is still Nothing if the PdfReader constructor threw
            If Reader IsNot Nothing Then Reader.Close()
        End Try

        'Return sits after End Try: VB.NET does not allow branching out of a Finally block
        Return PDFOutput

    End Function
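
    To grab the right-hand column as well, the same idea can be generalised so the caller picks which column to read. This is only a minimal sketch, assuming iTextSharp 5.x, equal-width columns, and a page whose lower-left corner is at (0, 0). `PdfColumnReader`, `GetColumnBounds` and `GetPDFTextFromColumn` are hypothetical names for illustration, not part of iTextSharp:

```vbnet
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Module PdfColumnReader

    'Hypothetical helper: returns {LeftX, RightX} for one column,
    'assuming the page is split into ColumnCount equal-width columns.
    Public Function GetColumnBounds(ByVal PageWidth As Single, _
                                    ByVal ColumnIndex As Integer, _
                                    ByVal ColumnCount As Integer) As Single()
        Dim ColumnWidth As Single = PageWidth / ColumnCount
        Return New Single() {ColumnWidth * ColumnIndex, ColumnWidth * (ColumnIndex + 1)}
    End Function

    'Hypothetical wrapper: extracts the text of one column (0 = leftmost).
    Public Function GetPDFTextFromColumn(ByVal PDFPath As String, _
                                         ByVal PageNo As Integer, _
                                         ByVal ColumnIndex As Integer, _
                                         ByVal ColumnCount As Integer) As String
        Dim Reader As New PdfReader(PDFPath)
        Try
            Dim PageSize As iTextSharp.text.Rectangle = Reader.GetPageSize(PageNo)
            Dim Bounds As Single() = GetColumnBounds(PageSize.Width, ColumnIndex, ColumnCount)
            'Bottom-Left-X, Bottom-Left-Y, Top-Right-X, Top-Right-Y
            '(assumes the page origin is at 0, 0)
            Dim ColumnRect As New iTextSharp.text.Rectangle(Bounds(0), 0, Bounds(1), PageSize.Height)
            Dim Filter As RenderFilter = New RegionTextRenderFilter(ColumnRect)
            Dim Strategy As ITextExtractionStrategy = New FilteredTextRenderListener( _
                                                      New LocationTextExtractionStrategy(), _
                                                      Filter)
            Return PdfTextExtractor.GetTextFromPage(Reader, PageNo, Strategy)
        Finally
            'Runs whether or not extraction succeeded
            Reader.Close()
        End Try
    End Function

End Module
```

    For a two-column page, `GetPDFTextFromColumn(PDFPath, 1, 0, 2)` would read the left column of page 1 and `GetPDFTextFromColumn(PDFPath, 1, 1, 2)` the right. Unlike the function above, this sketch lets exceptions propagate to the caller rather than showing a message box. If your columns are not equal width you would pass explicit X coordinates instead.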
