
Thread: Alternative to PDFBox - .NET Version

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Nov 2005
    Posts
    259

    Alternative to PDFBox - .NET Version

    I am trying to read a PDF file line by line using PDFBox.

    This is the first time I have ever attempted to do this with a PDF, so I am not sure what I was expecting, but I thought I would end up with some markup that I could use to pick the lines I was looking for out of the file.

    Anyway, it didn't work out the way I had planned, and I am not sure whether that is because I am using the wrong tool or using the right tool incorrectly. I managed to extract the text, but I didn't see any markup that would be useful for parsing the file.

    Does anybody have any experience with extracting text from PDFs? What tool(s) do you recommend?
    Last edited by FastEddie; May 5th, 2009 at 06:53 AM.

  2. #2
    Super Moderator jmcilhinney's Avatar
    Join Date
    May 2005
    Location
    Sydney, Australia
    Posts
    111,221

    Re: Alternative to PDFBox - .NET Version

    I've never used it but it seems to me that the most commonly used managed PDF library is iText#.

  3. #3
    Hyperactive Member su ki's Avatar
    Join Date
    Oct 2007
    Posts
    354

    Re: Alternative to PDFBox - .NET Version

    Hey FastEddie,
    Use the iTextSharp DLL. The following code, posted on VBForums, reads the text of a PDF file.

    Enjoy :-)

    vb Code:
    Imports System.Windows.Forms ' needed for MessageBox
    Imports iTextSharp.text.pdf

    ' Extracts the text of pages fromPageNum..toPageNum (1-based, inclusive).
    ' Passing 0 for either bound means "first page" or "last page" respectively.
    Public Function ParsePdfText(ByVal sourcePDF As String, Optional ByVal fromPageNum As Integer = 0, Optional ByVal toPageNum As Integer = 0) As String
        Dim sb As New System.Text.StringBuilder()
        Dim reader As PdfReader = Nothing
        Try
            reader = New PdfReader(sourcePDF)
            Dim pageBytes() As Byte = Nothing
            Dim token As PRTokeniser = Nothing
            Dim tknType As Integer = -1
            Dim tknValue As String = String.Empty

            If fromPageNum = 0 Then
                fromPageNum = 1
            End If
            If toPageNum = 0 Then
                toPageNum = reader.NumberOfPages
            End If

            If fromPageNum > toPageNum Then
                Throw New ApplicationException("Parameter error: The value of fromPageNum cannot be larger than the value of toPageNum")
            End If

            For i As Integer = fromPageNum To toPageNum
                ' Raw content stream of the page (operands followed by operators)
                pageBytes = reader.GetPageContent(i)
                If pageBytes IsNot Nothing Then
                    token = New PRTokeniser(pageBytes)
                    While token.NextToken()
                        tknType = token.TokenType()
                        tknValue = token.StringValue
                        If tknType = PRTokeniser.TK_STRING Then
                            sb.Append(tknValue)
                        ' I need these additional tests to add whitespace to the output string.
                        ' Token type 1 is a number: "-600" is the kerning value this PDF uses as a word gap.
                        ElseIf tknType = 1 AndAlso tknValue = "-600" Then
                            sb.Append(" ")
                        ' Token type 10 is an operator: "TJ" ends a run of shown text.
                        ElseIf tknType = 10 AndAlso tknValue = "TJ" Then
                            sb.Append(" ")
                        End If
                    End While
                End If
            Next i
        Catch ex As Exception
            MessageBox.Show("Exception occurred. " & ex.Message)
            Return String.Empty
        Finally
            ' Release the file handle even when an exception was thrown
            If reader IsNot Nothing Then reader.Close()
        End Try
        Return sb.ToString()
    End Function
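
The loop above only ever appends spaces, so line ends never show up in the output. As a rough sketch of one way to recover them (not tested against real PDFs): in a PDF content stream the operators `T*`, `Td`, `TD`, `'` and `"` all move the text position, so they can be treated as line breaks. The `Token` structure and `JoinWithLineBreaks` below are hypothetical names made up for this illustration; `Token` just stands in for the type/value pair that PRTokeniser reports, so the sketch runs without iTextSharp. Treating every `Td`/`TD` as a new line is a simplifying assumption, since a purely horizontal move stays on the same line, and `Tm` is not handled at all.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text

Module ContentStreamLineBreaks

    ' Hypothetical stand-in for a PRTokeniser token:
    ' IsString is True for string tokens, False for numbers, operators, brackets.
    Public Structure Token
        Public IsString As Boolean
        Public Value As String
        Public Sub New(ByVal stringToken As Boolean, ByVal text As String)
            IsString = stringToken
            Value = text
        End Sub
    End Structure

    ' Operands come BEFORE their operator in a content stream, so strings are
    ' buffered in "pending" until the operator that shows them arrives.
    Public Function JoinWithLineBreaks(ByVal tokens As IEnumerable(Of Token)) As String
        Dim output As New StringBuilder()
        Dim pending As New StringBuilder()

        For Each t As Token In tokens
            If t.IsString Then
                pending.Append(t.Value)
                Continue For
            End If

            Select Case t.Value
                Case "Tj", "TJ"
                    ' Show text at the current position
                    output.Append(pending.ToString())
                    pending.Length = 0
                Case "'", """"
                    ' Move to the next line, then show text
                    output.Append(Environment.NewLine).Append(pending.ToString())
                    pending.Length = 0
                Case "T*", "Td", "TD"
                    ' Move the text position (assumed here to mean a new line)
                    output.Append(Environment.NewLine)
            End Select
            ' Numbers, array brackets and all other operators are ignored
        Next

        ' The first Td after BT would otherwise leave a leading blank line
        Return output.ToString().Trim()
    End Function

    Sub Main()
        ' Tokens for: BT 72 700 Td (Line one) Tj 0 -14 Td (Line two) Tj ET
        Dim tokens As New List(Of Token)()
        tokens.Add(New Token(False, "BT"))
        tokens.Add(New Token(False, "72"))
        tokens.Add(New Token(False, "700"))
        tokens.Add(New Token(False, "Td"))
        tokens.Add(New Token(True, "Line one"))
        tokens.Add(New Token(False, "Tj"))
        tokens.Add(New Token(False, "0"))
        tokens.Add(New Token(False, "-14"))
        tokens.Add(New Token(False, "Td"))
        tokens.Add(New Token(True, "Line two"))
        tokens.Add(New Token(False, "Tj"))
        tokens.Add(New Token(False, "ET"))

        Console.WriteLine(JoinWithLineBreaks(tokens))
    End Sub
End Module
```

To try this inside ParsePdfText, the same Select Case could be driven from the While loop, using tknType = PRTokeniser.TK_STRING in place of IsString. Strings that belong to non-text operators (marked-content properties, for example) would leak into the buffer in this sketch, so expect to refine it for real documents.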

  4. #4
    Learning .Net danasegarane's Avatar
    Join Date
    Aug 2004
    Location
    VBForums
    Posts
    5,853

    Re: Alternative to PDFBox - .NET Version

    Or you can go for ABCpdf.

  5. #5
    New Member
    Join Date
    Dec 2007
    Posts
    11

    Re: Alternative to PDFBox - .NET Version

    Hi, I tried the code posted by su ki and ran into some trouble. Even when you try to make the spaces appear, I've found that the end-of-line markers are not present in the output. It seems to be related to the Tj tokens, but I haven't found anything very clear on this topic yet. Can you lend a hand?
