-
May 4th, 2009, 08:56 PM
#1
Thread Starter
Hyperactive Member
Alternative to PDFBox - .NET Version
I am trying to read a PDF file line by line using PDFBox.
This is the first time I have attempted this with a PDF, so I am not sure what I was expecting, but I thought I would end up with some "markup" that I could use to pick the lines I was looking for out of the file.
Anyway, it didn't work out the way I had planned, and I am not sure whether that is because I am using the wrong tool or using the right tool incorrectly. I managed to extract the text, but I didn't see any markup that would be useful for parsing the file.
Does anybody have any experience with extracting text from PDFs? What tool(s) do you recommend?
Last edited by FastEddie; May 5th, 2009 at 06:53 AM.
-
May 4th, 2009, 09:14 PM
#2
Re: Alternative to PDFBox - .NET Version
I've never used it but it seems to me that the most commonly used managed PDF library is iText#.
-
May 5th, 2009, 12:12 AM
#3
Hyperactive Member
Re: Alternative to PDFBox - .NET Version
hey FastEddie
Use the iTextSharp DLL. The following code, posted on vbforums, reads a PDF file.
enjoy :-)
vb Code:
Imports iTextSharp.text.pdf

Public Function ParsePdfText(ByVal sourcePDF As String, Optional ByVal fromPageNum As Integer = 0, Optional ByVal toPageNum As Integer = 0) As String
    Dim sb As New System.Text.StringBuilder()
    Try
        Dim reader As New PdfReader(sourcePDF)
        Dim pageBytes() As Byte = Nothing
        Dim token As PRTokeniser = Nothing
        Dim tknType As Integer = -1
        Dim tknValue As String = String.Empty

        If fromPageNum = 0 Then
            fromPageNum = 1
        End If
        If toPageNum = 0 Then
            toPageNum = reader.NumberOfPages
        End If
        If fromPageNum > toPageNum Then
            Throw New ApplicationException("Parameter error: The value of fromPageNum can " & "not be larger than the value of toPageNum")
        End If

        For i As Integer = fromPageNum To toPageNum Step 1
            pageBytes = reader.GetPageContent(i)
            If Not IsNothing(pageBytes) Then
                token = New PRTokeniser(pageBytes)
                While token.NextToken()
                    tknType = token.TokenType()
                    tknValue = token.StringValue
                    If tknType = PRTokeniser.TK_STRING Then
                        sb.Append(token.StringValue)
                    'I need to add these additional tests to properly add whitespace to the output string
                    ElseIf tknType = 1 AndAlso tknValue = "-600" Then
                        sb.Append(" ")
                    ElseIf tknType = 10 AndAlso tknValue = "TJ" Then
                        sb.Append(" ")
                    End If
                End While
            End If
        Next i
    Catch ex As Exception
        MessageBox.Show("Exception occurred. " & ex.Message)
        Return String.Empty
    End Try
    Return sb.ToString()
End Function
-
May 5th, 2009, 12:35 AM
#4
Re: Alternative to PDFBox - .NET Version
Or you can go for ABCpdf.
-
Jun 25th, 2010, 07:45 PM
#5
New Member
Re: Alternative to PDFBox - .NET Version
Hi, I tried the code posted by Mr. Su_ki and ran into some problems. Even if you try to make the spaces "appear", what I've discovered is that the end-of-line markers are not present. It seems to be related to the Tj tokens, but I haven't found anything very clear about this topic yet. Can you lend a hand?
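For what it's worth, a likely reason for the missing line ends: in a PDF content stream, Tj and TJ only show text and never move to the next line. The line moves come from the text-positioning operators (Td, TD, T*) and from the ' and " operators, which move down a line before showing their string. Below is a minimal sketch of that idea, not a definitive fix. The Token structure and the names StartsNewLine, ShowsText and JoinLines are hypothetical, made up for illustration; the sketch works on a hand-built token list rather than on PRTokeniser, and it assumes the tokeniser hands back operators such as T* as plain operator tokens.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.Text

Module LineBreakSketch

    ' Hypothetical stand-in for a tokeniser result.
    ' Kind is "string", "number" or "operator".
    Structure Token
        Public Kind As String
        Public Value As String

        Public Sub New(ByVal tokenKind As String, ByVal tokenValue As String)
            Kind = tokenKind
            Value = tokenValue
        End Sub
    End Structure

    ' True for operators that move the text position to another line.
    Function StartsNewLine(ByVal op As String, ByVal numbers As List(Of String)) As Boolean
        Select Case op
            Case "T*", "'", """"
                Return True
            Case "Td", "TD"
                ' Operands are "tx ty"; only a non-zero ty leaves the current line.
                Dim ty As Double
                If numbers.Count >= 2 AndAlso Double.TryParse(numbers(numbers.Count - 1), NumberStyles.Float, CultureInfo.InvariantCulture, ty) Then
                    Return ty <> 0.0
                End If
                Return False
            Case Else
                Return False
        End Select
    End Function

    ' True for the operators that actually paint their string operands.
    Function ShowsText(ByVal op As String) As Boolean
        Return op = "Tj" OrElse op = "TJ" OrElse op = "'" OrElse op = """"
    End Function

    ' Operands come before their operator, so they are buffered until the operator arrives.
    Function JoinLines(ByVal tokens As IEnumerable(Of Token)) As String
        Dim sb As New StringBuilder()
        Dim strings As New List(Of String)()
        Dim numbers As New List(Of String)()

        For Each t As Token In tokens
            If t.Kind = "string" Then
                strings.Add(t.Value)
            ElseIf t.Kind = "number" Then
                numbers.Add(t.Value)
            ElseIf t.Kind = "operator" Then
                ' Skip the line break before the very first piece of text.
                If StartsNewLine(t.Value, numbers) AndAlso sb.Length > 0 Then
                    sb.Append(Environment.NewLine)
                End If
                If ShowsText(t.Value) Then
                    For Each s As String In strings
                        sb.Append(s)
                    Next
                End If
                strings.Clear()
                numbers.Clear()
            End If
        Next

        Return sb.ToString()
    End Function

    Sub Main()
        ' Hand-built tokens for: BT 72 700 Td (Hello) Tj 0 -14 Td (World) Tj T* (Again) Tj ET
        Dim sample As Token() = { _
            New Token("operator", "BT"), _
            New Token("number", "72"), New Token("number", "700"), New Token("operator", "Td"), _
            New Token("string", "Hello"), New Token("operator", "Tj"), _
            New Token("number", "0"), New Token("number", "-14"), New Token("operator", "Td"), _
            New Token("string", "World"), New Token("operator", "Tj"), _
            New Token("operator", "T*"), _
            New Token("string", "Again"), New Token("operator", "Tj"), _
            New Token("operator", "ET") _
        }
        Console.WriteLine(JoinLines(sample))
    End Sub

End Module
```

With the sample tokens, Hello, World and Again should each come out on their own line. To try this against the loop in post #3, the same checks would go where the "TJ" test is now. Note that this sketch ignores Tm, so a PDF that positions every line by setting the text matrix, or that puts each line in its own BT/ET block, would need extra handling (for example, comparing the y position with the previous one).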