-
Jun 25th, 2007, 01:18 PM
#1
Extract Text from Pdfs using iTextSharp (02-03/2005)
Hello all,
I was recently working on a PDF manipulation project. One of the things I needed to do was extract the text from PDF files and search it for a specific phrase. I was using iTextSharp to manipulate the PDFs. While iTextSharp includes a PdfReader class, it can't extract text out of the box. I did some Googling, and all I could find was this project by Zollor: http://www.codeproject.com/useritems/PDFToText.asp. Unfortunately, his code can't extract text from the PDFs created by our company. PdfBox can, but using it requires another library reference, adds another 16MB to the final footprint of my project, and it is very slow. So I just went ahead and wrote my own function...
And here it is. To use it, you'll have to add a reference to itextsharp.dll to your project and import iTextSharp.text.pdf:
VB Code:
Imports System.Windows.Forms
Imports iTextSharp.text.pdf

' Extracts the text of pages fromPageNum..toPageNum (1-based, inclusive).
' Passing 0 for fromPageNum means "first page"; 0 for toPageNum means "last page".
Public Function ParsePdfText(ByVal sourcePDF As String, _
                             Optional ByVal fromPageNum As Integer = 0, _
                             Optional ByVal toPageNum As Integer = 0) As String
    Dim sb As New System.Text.StringBuilder()
    Dim reader As PdfReader = Nothing
    Try
        reader = New PdfReader(sourcePDF)
        Dim pageBytes() As Byte = Nothing
        Dim token As PRTokeniser = Nothing
        Dim tknType As Integer = -1
        Dim tknValue As String = String.Empty

        If fromPageNum = 0 Then
            fromPageNum = 1
        End If
        If toPageNum = 0 Then
            toPageNum = reader.NumberOfPages
        End If
        If fromPageNum > toPageNum Then
            Throw New ApplicationException("Parameter error: The value of fromPageNum can " & _
                                           "not be larger than the value of toPageNum")
        End If

        For i As Integer = fromPageNum To toPageNum Step 1
            pageBytes = reader.GetPageContent(i)
            If Not IsNothing(pageBytes) Then
                token = New PRTokeniser(pageBytes)
                While token.NextToken()
                    tknType = token.TokenType()
                    tknValue = token.StringValue
                    If tknType = PRTokeniser.TK_STRING Then
                        sb.Append(tknValue)
                    'I need these additional tests to properly add whitespace to the output string.
                    'Token type 1 is a number; "-600" is the spacing value seen in our PDFs.
                    ElseIf tknType = 1 AndAlso tknValue = "-600" Then
                        sb.Append(" ")
                    'Token type 10 is an operator; "TJ" ends a show-text operation.
                    ElseIf tknType = 10 AndAlso tknValue = "TJ" Then
                        sb.Append(" ")
                    End If
                End While
            End If
        Next i
    Catch ex As Exception
        MessageBox.Show("Exception occurred. " & ex.Message)
        Return String.Empty
    Finally
        'Release the file whether or not extraction succeeded.
        If Not reader Is Nothing Then reader.Close()
    End Try
    Return sb.ToString()
End Function
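For the "search for a specific phrase" part, here is a minimal sketch of how the extracted string could be searched. ContainsPhrase is a hypothetical helper name (it is not part of iTextSharp), and it assumes the text has already been extracted into a String. It collapses whitespace first, because the extraction can leave extra spaces and line breaks between words.

```vb
Imports System
Imports System.Text.RegularExpressions

Module PhraseSearchSketch
    ' Hypothetical helper, not part of iTextSharp: case-insensitive phrase
    ' search that ignores differences in the amount of whitespace.
    Public Function ContainsPhrase(ByVal extractedText As String, _
                                   ByVal phrase As String) As Boolean
        If String.IsNullOrEmpty(extractedText) OrElse String.IsNullOrEmpty(phrase) Then
            Return False
        End If
        ' Collapse every run of whitespace (spaces, tabs, line breaks) to one space.
        Dim haystack As String = Regex.Replace(extractedText, "\s+", " ")
        Dim needle As String = Regex.Replace(phrase.Trim(), "\s+", " ")
        If needle.Length = 0 Then
            Return False ' phrase was whitespace only
        End If
        Return haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0
    End Function

    Sub Main()
        ' Made-up sample text standing in for the output of ParsePdfText.
        Dim sample As String = "Total   amount" & vbCrLf & "due: 42"
        Console.WriteLine(ContainsPhrase(sample, "Amount Due"))
    End Sub
End Module
```

In a real project you would pass the return value of ParsePdfText as the first argument instead of the sample string.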
Last edited by stanav; Jun 25th, 2007 at 01:22 PM.
-
Jun 25th, 2007, 02:45 PM
#2
Hyperactive Member
Re: Extract Text from Pdfs using iTextSharp (02-03/2005)
Thanks,
Good timing.
Exactly what I was looking for.
I'd like to figure out a way to diff two PDFs; converting them to text first seems like a viable approach.
Last edited by gtilles; Jun 25th, 2007 at 03:43 PM.
-
Jun 26th, 2007, 02:10 AM
#3
Re: Extract Text from Pdfs using iTextSharp (02-03/2005)
Nice one, stanav.
Why don't you add one more: extracting images from a PDF?
-
Apr 15th, 2009, 07:07 AM
#4
Member
Re: Extract Text from Pdfs using iTextSharp (02-03/2005)
Hi,
I extract data from PDF files using ASP.NET 2005, but there is one PDF file that I can't extract data from.
This PDF file is read-only (you can't copy data from it), so I think that is why the extraction fails.
If anyone has any ideas, please help me.
Thanks.
-
Apr 15th, 2009, 07:17 AM
#5
Re: Extract Text from Pdfs using iTextSharp (02-03/2005)
PDF files can be created in many different ways, and whether you can extract text from a PDF depends on how it was created. For example, if a PDF was made by scanning a document, each page is an image, and you cannot extract the text using iTextSharp or any other PDF library. In that case, you will need some kind of OCR software to do it.
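As a rough way to spot such files in code, here is a small sketch. ProbablyNeedsOcr and the default threshold of 20 characters are my own assumptions for illustration, not anything provided by iTextSharp: if the extracted text contains almost no letters or digits, the file is probably image-only and a candidate for OCR.

```vb
Imports System

Module OcrHintSketch
    ' Hypothetical heuristic: returns True when the extracted text has fewer
    ' than minChars letters or digits, which suggests there is no text layer.
    ' The default of 20 is an arbitrary assumption; tune it for your files.
    Public Function ProbablyNeedsOcr(ByVal extractedText As String, _
                                     Optional ByVal minChars As Integer = 20) As Boolean
        If extractedText Is Nothing Then
            Return True
        End If
        Dim count As Integer = 0
        For Each c As Char In extractedText
            If Char.IsLetterOrDigit(c) Then
                count += 1
            End If
        Next
        Return count < minChars
    End Function

    Sub Main()
        ' An empty result, as an extraction function might return for a scanned file.
        Console.WriteLine(ProbablyNeedsOcr(String.Empty))
    End Sub
End Module
```

This only tells you that no usable text came out. It does not tell you why, so a file that fails for another reason would also be flagged.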
-
Apr 15th, 2009, 07:30 AM
#6
Member
Re: Extract Text from Pdfs using iTextSharp (02-03/2005)
-
May 13th, 2009, 11:20 PM
#7
New Member
Re: Extract Text from Pdfs using iTextSharp (02-03/2005)
Hi, I want to extract the "Tags" from a "Tagged" PDF using C# or VB.NET. How can I do this with iTextSharp or any other open-source PDF library?
Thanks,
KadSat
-
Apr 9th, 2010, 07:41 AM
#8
Addicted Member
Re: Extract Text from Pdfs using iTextSharp (02-03/2005)
Hi, is it possible to search for a word in a PDF and get the font of that word in the file?