Jun 25th, 2007, 01:18 PM
#1
Extract Text from Pdfs using iTextSharp (02-03/2005)
Hello all,
I was recently working on a PDF manipulation project. One of the things I needed to do was extract the text from PDF files and search it for a specific phrase. I was using iTextSharp to manipulate the PDFs. While iTextSharp includes a PdfReader class, it can't extract text out of the box. I did some Googling, and all I could find was this project by Zollor: http://www.codeproject.com/useritems/PDFToText.asp. Unfortunately, his code can't extract text from the PDFs created by our company. PdfBox can, but using it requires another library reference, adds another 16MB to the final footprint of my project, and it is very sloooowwww. So I just went ahead and wrote my own function...
And here it is. To use it, you'll have to add a reference to itextsharp.dll to your project and import the iTextSharp.text.pdf namespace.
VB Code:
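Imports System.Windows.Forms 'Needed for MessageBox unless already imported at project level
Imports iTextSharp.text.pdf

'Extracts the text of a PDF file, optionally limited to a page range.
'Passing 0 for fromPageNum / toPageNum means "first page" / "last page".
'Returns String.Empty if anything goes wrong.
Public Function ParsePdfText(ByVal sourcePDF As String, _
                             Optional ByVal fromPageNum As Integer = 0, _
                             Optional ByVal toPageNum As Integer = 0) As String
    Dim sb As New System.Text.StringBuilder()
    Dim reader As PdfReader = Nothing
    Try
        reader = New PdfReader(sourcePDF)
        Dim pageBytes() As Byte = Nothing
        Dim token As PRTokeniser = Nothing
        Dim tknType As Integer = -1
        Dim tknValue As String = String.Empty

        If fromPageNum = 0 Then
            fromPageNum = 1
        End If
        If toPageNum = 0 Then
            toPageNum = reader.NumberOfPages
        End If
        If fromPageNum > toPageNum Then
            Throw New ApplicationException("Parameter error: The value of fromPageNum can " & _
                                           "not be larger than the value of toPageNum")
        End If

        For i As Integer = fromPageNum To toPageNum Step 1
            'Raw content stream of the page (drawing and text operators)
            pageBytes = reader.GetPageContent(i)
            If Not IsNothing(pageBytes) Then
                token = New PRTokeniser(pageBytes)
                While token.NextToken()
                    tknType = token.TokenType()
                    tknValue = token.StringValue
                    If tknType = PRTokeniser.TK_STRING Then
                        'String operands hold the visible text
                        sb.Append(tknValue)
                    'I need these additional tests to properly add whitespace to the output string.
                    'In my PDFs a word gap shows up as a -600 position adjustment inside a TJ array;
                    'other PDFs may use a different value.
                    '(TK_NUMBER = 1 and TK_OTHER = 10, the literal values used in the original version.)
                    ElseIf tknType = PRTokeniser.TK_NUMBER AndAlso tknValue = "-600" Then
                        sb.Append(" ")
                    ElseIf tknType = PRTokeniser.TK_OTHER AndAlso tknValue = "TJ" Then
                        'End of a TJ (show text) operator: separate it from the next run of text
                        sb.Append(" ")
                    End If
                End While
            End If
        Next i
    Catch ex As Exception
        MessageBox.Show("Exception occurred. " & ex.Message)
        Return String.Empty
    Finally
        'Release the file handle even when an exception was thrown
        If reader IsNot Nothing Then
            reader.Close()
        End If
    End Try
    Return sb.ToString()
End Function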
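For the "search for a specific phrase" part, here is a minimal sketch of how the returned string could be searched. ContainsPhrase is a hypothetical helper, not part of the function above and not part of iTextSharp. It assumes that the spaces ParsePdfText inserts are only approximate, so it collapses runs of whitespace and ignores case before comparing. The sample text and the file path in the comment are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PhraseSearchSketch

    'Hypothetical helper: collapses runs of whitespace in both strings,
    'then does a case-insensitive substring search.
    Public Function ContainsPhrase(ByVal text As String, ByVal phrase As String) As Boolean
        If String.IsNullOrEmpty(text) OrElse String.IsNullOrEmpty(phrase) Then
            Return False
        End If
        Dim normalizedText As String = Regex.Replace(text, "\s+", " ")
        Dim normalizedPhrase As String = Regex.Replace(phrase.Trim(), "\s+", " ")
        If normalizedPhrase.Length = 0 Then
            Return False
        End If
        Return normalizedText.IndexOf(normalizedPhrase, StringComparison.OrdinalIgnoreCase) >= 0
    End Function

    Sub Main()
        'Stand-in for the string ParsePdfText would return. In a real project
        'this would be something like:
        '    Dim extracted As String = ParsePdfText("C:\docs\sample.pdf")  'hypothetical path
        Dim extracted As String = "Invoice  number   12345 " & Environment.NewLine & "Total due"

        Console.WriteLine(ContainsPhrase(extracted, "invoice number"))  'prints True
        Console.WriteLine(ContainsPhrase(extracted, "purchase order"))  'prints False
    End Sub

End Module
```

Because ParsePdfText returns String.Empty on failure, ContainsPhrase simply returns False in that case, so a failed extraction looks the same as "phrase not found". If you need to tell the two apart, check the result for an empty string first.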
Last edited by stanav; Jun 25th, 2007 at 01:22 PM.