[RESOLVED] Advice on reading tabular data from PDF format.
Hi guys,
I am working on a small project where I need to read some text from PDF files. After some googling I tried iTextSharp, which can extract the text, but apparently it is not possible to read tabular data with it. I understand why, but I wonder whether anybody has used a library that can.
Here is an example of the layout of the PDF:
Attachment 99119
With iTextSharp the text is read line by line, ignoring the break between the two columns.
Has anybody come across a (preferably free) library that is able to read text from sections so that I can grab the text from the left "column" and then the text from the right?
Thanks
Jay
Re: Advice on reading tabular data from PDF format.
I have got it working using iTextSharp. My code is below in case anybody else finds it useful. It will extract the text within the left half of a PDF page. You would just change the rectangle dimensions to suit your own region:
VB.NET Code:
'iTextSharp Imports
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser
Public Function GetPDFTextFromRectangle(ByVal PDFPath As String, ByVal PageNo As Integer) As String
    Dim Reader As PdfReader = Nothing
    Dim PDFOutput As String = Nothing
    Try
        Reader = New PdfReader(PDFPath)
        'Get the Page Width/Height
        Dim PageHeight As Single = Reader.GetPageSize(PageNo).Height
        Dim PageWidth As Single = Reader.GetPageSize(PageNo).Width
        'Rectangle representing the area that contains the text. Parameters:
        ' Bottom-Left-X
        ' Bottom-Left-Y
        ' Top-Right-X
        ' Top-Right-Y
        Dim PageRect As New iTextSharp.text.Rectangle(0, 0, PageWidth / 2, PageHeight)
        'Required Filter and Strategy to extract text
        Dim Filter As RenderFilter = New RegionTextRenderFilter(PageRect)
        Dim Strategy As ITextExtractionStrategy = New FilteredTextRenderListener( _
            New LocationTextExtractionStrategy, _
            Filter)
        'Extract the text from the rectangle region of the given page number
        PDFOutput = PdfTextExtractor.GetTextFromPage(Reader, PageNo, Strategy)
    Catch Ex As Exception
        MessageBox.Show(Ex.Message)
    Finally
        'Reader is still Nothing if the file could not be opened
        If Reader IsNot Nothing Then Reader.Close()
    End Try
    'Return after the Try block; branching out of a Finally block is not valid
    Return PDFOutput
End Function
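For the right-hand "column" from the original question, the same approach should work with a second rectangle covering the other half of the page. Below is a rough sketch only, assuming iTextSharp 5.x and a page whose lower-left corner is at (0, 0). The names PdfColumnSketch, GetPDFTextFromRegion and GetPDFColumns are made up for illustration and are not part of iTextSharp.

```vbnet
'iTextSharp Imports
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Module PdfColumnSketch

    'Hypothetical helper: extracts the text that falls inside Region on the given page.
    'A new strategy is created on every call because it accumulates text as it runs.
    Public Function GetPDFTextFromRegion(ByVal Reader As PdfReader, _
                                         ByVal PageNo As Integer, _
                                         ByVal Region As iTextSharp.text.Rectangle) As String
        Dim Filter As RenderFilter = New RegionTextRenderFilter(Region)
        Dim Strategy As ITextExtractionStrategy = New FilteredTextRenderListener( _
            New LocationTextExtractionStrategy(), _
            Filter)
        Return PdfTextExtractor.GetTextFromPage(Reader, PageNo, Strategy)
    End Function

    'Hypothetical helper: returns the left column text at index 0 and the right at index 1.
    'Assumes the column break sits exactly at the middle of the page.
    Public Function GetPDFColumns(ByVal PDFPath As String, ByVal PageNo As Integer) As String()
        Dim Reader As New PdfReader(PDFPath)
        Try
            Dim PageSize As iTextSharp.text.Rectangle = Reader.GetPageSize(PageNo)
            Dim Middle As Single = PageSize.Width / 2
            'Rectangle parameters: Bottom-Left-X, Bottom-Left-Y, Top-Right-X, Top-Right-Y
            Dim LeftRect As New iTextSharp.text.Rectangle(0, 0, Middle, PageSize.Height)
            Dim RightRect As New iTextSharp.text.Rectangle(Middle, 0, PageSize.Width, PageSize.Height)
            Return New String() { _
                GetPDFTextFromRegion(Reader, PageNo, LeftRect), _
                GetPDFTextFromRegion(Reader, PageNo, RightRect)}
        Finally
            Reader.Close()
        End Try
    End Function

End Module
```

If the break between the columns is not at the centre of the page, you would adjust Middle to suit. Text that straddles the dividing line may well turn up in both results, so it is worth checking the output against a sample page.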