Alternative to PDFBox - .NET Version
I am trying to read a PDF file line by line using PDFBox.
This is the first time I have attempted this with a PDF, so I am not sure what I was expecting, but I thought I would end up with some "markup" that I could use to pick the lines I was looking for out of the file.
It didn't work out the way I had planned, and I am not sure whether that is because I am using the wrong tool or using the right tool incorrectly. I managed to extract the text, but I didn't see any markup that would be useful for parsing the file.
Does anybody have any experience with extracting text from PDFs? What tool(s) do you recommend?
Re: Alternative to PDFBox - .NET Version
I've never used it but it seems to me that the most commonly used managed PDF library is iText#.
Re: Alternative to PDFBox - .NET Version
Hey FastEddie,
Use the iTextSharp DLL. The following code, posted on VBForums, reads the text from a PDF file.
Enjoy :-)
vb Code:
Imports System.Windows.Forms
Imports iTextSharp.text.pdf

' Extracts the text of pages fromPageNum..toPageNum (1-based, inclusive) from a PDF file.
' A value of 0 means "first page" for fromPageNum and "last page" for toPageNum.
Public Function ParsePdfText(ByVal sourcePDF As String, Optional ByVal fromPageNum As Integer = 0, Optional ByVal toPageNum As Integer = 0) As String
    Dim sb As New System.Text.StringBuilder()
    Dim reader As PdfReader = Nothing
    Try
        reader = New PdfReader(sourcePDF)
        Dim pageBytes() As Byte = Nothing
        Dim token As PRTokeniser = Nothing
        Dim tknType As Integer = -1
        Dim tknValue As String = String.Empty
        If fromPageNum = 0 Then
            fromPageNum = 1
        End If
        If toPageNum = 0 Then
            toPageNum = reader.NumberOfPages
        End If
        If fromPageNum > toPageNum Then
            Throw New ApplicationException("Parameter error: The value of fromPageNum cannot be larger than the value of toPageNum")
        End If
        For i As Integer = fromPageNum To toPageNum
            ' Raw content stream of page i
            pageBytes = reader.GetPageContent(i)
            If pageBytes IsNot Nothing Then
                token = New PRTokeniser(pageBytes)
                While token.NextToken()
                    tknType = token.TokenType()
                    tknValue = token.StringValue
                    If tknType = PRTokeniser.TK_STRING Then
                        ' A string token holds text that is shown on the page
                        sb.Append(tknValue)
                    'I need to add these additional tests to properly add whitespace to the output string
                    ' (1 and 10 are the raw token type values seen here for numbers and operators;
                    ' "-600" is the spacing adjustment found in this particular file)
                    ElseIf tknType = 1 AndAlso tknValue = "-600" Then
                        sb.Append(" ")
                    ElseIf tknType = 10 AndAlso tknValue = "TJ" Then
                        sb.Append(" ")
                    End If
                End While
            End If
        Next i
    Catch ex As Exception
        MessageBox.Show("Exception occurred. " & ex.Message)
        Return String.Empty
    Finally
        ' Release the file even when an exception was thrown
        If reader IsNot Nothing Then
            reader.Close()
        End If
    End Try
    Return sb.ToString()
End Function
Re: Alternative to PDFBox - .NET Version
Or you can go for ABCpdf.
Re: Alternative to PDFBox - .NET Version
Hi, I tried the code posted by Su_ki and ran into some trouble. Even when you try to make the spaces appear, what I've discovered is that the end-of-line markers are not present. It seems to be related to the Tj tokens, but I haven't found anything very clear on this topic yet. Can you lend a hand?
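One possible way to approach the missing line ends, offered only as a rough sketch: a PDF content stream does not store newline characters. Tj and TJ only show text. A move to the next line is usually expressed by the text-positioning operators T*, ' and ", or by Td/TD with a non-zero vertical offset. The LineBreakTracker class and the demo module below are hypothetical names, not part of iTextSharp. The sketch ignores the Tm operator, which can also reposition text, and it may emit an extra line break for the first Td of a text block, so results will vary from file to file.

```vb
Imports System
Imports System.Globalization
Imports System.Text

' Hypothetical helper (not part of iTextSharp): decides whether an operator
' in a PDF content stream moves the text position to a new line.
Public Class LineBreakTracker
    ' Most recent numeric operand seen. For "tx ty Td" this ends up being ty.
    Private lastNumber As Double = 0.0

    ' Call this for every numeric token, in stream order.
    Public Sub OnNumber(ByVal value As String)
        ' TryParse leaves lastNumber at 0 when the value is not a valid number.
        Double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, lastNumber)
    End Sub

    ' Call this for every operator token. Returns True when a line break
    ' should be appended to the extracted text.
    Public Function OnOperator(ByVal op As String) As Boolean
        Select Case op
            Case "T*", "'", """"
                ' These operators always move to the start of the next line.
                Return True
            Case "Td", "TD"
                ' A zero vertical offset is only a horizontal move on the same line.
                Return lastNumber <> 0.0
            Case Else
                Return False
        End Select
    End Function
End Class

Module LineBreakDemo
    Sub Main()
        ' Hand-written tokens standing in for: (Hello) Tj 0 -14 Td (World) Tj
        ' Each entry is {kind, value}; the kinds "str", "num" and "op" are made up for this demo.
        Dim tokens As String()() = New String()() { _
            New String() {"str", "Hello"}, _
            New String() {"op", "Tj"}, _
            New String() {"num", "0"}, _
            New String() {"num", "-14"}, _
            New String() {"op", "Td"}, _
            New String() {"str", "World"}, _
            New String() {"op", "Tj"}}

        Dim tracker As New LineBreakTracker()
        Dim sb As New StringBuilder()
        For Each t As String() In tokens
            Select Case t(0)
                Case "str"
                    sb.Append(t(1))
                Case "num"
                    tracker.OnNumber(t(1))
                Case "op"
                    If tracker.OnOperator(t(1)) Then
                        sb.Append(Environment.NewLine)
                    End If
            End Select
        Next
        Console.WriteLine(sb.ToString())
    End Sub
End Module
```

In the ParsePdfText loop posted earlier, the same idea would mean calling tracker.OnNumber(tknValue) for the numeric tokens (type 1 in that code) and appending Environment.NewLine whenever tracker.OnOperator(tknValue) returns True for the operator tokens (type 10 in that code).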