Extract Text from Pdfs using iTextSharp (02-03/2005)

**stanav** · Jun 25th, 2007, 01:18 PM

Hello all,
I was recently working on a PDF manipulation project. One of the things I needed to do was extract the text from PDF files and search for a specific phrase. I was using iTextSharp to manipulate the PDFs. While iTextSharp includes a PdfReader class, it can't extract text out of the box. I did some Googling and all I could find was this project by Zollor (http://www.codeproject.com/useritems/PDFToText.asp). Unfortunately, his code can't extract text from the PDFs created by our company. PdfBox can, but using it requires another library reference, adds another 16MB to the final footprint of my project, and is very slow. So I just went ahead and wrote my own function...
And here it is. To use it, you'll have to add a reference to itextsharp.dll to your project and import iTextSharp.text.pdf.

VB Code:

Imports iTextSharp.text.pdf

Public Function ParsePdfText(ByVal sourcePDF As String, _
                             Optional ByVal fromPageNum As Integer = 0, _
                             Optional ByVal toPageNum As Integer = 0) As String

    Dim sb As New System.Text.StringBuilder()
    Dim reader As PdfReader = Nothing
    Try
        reader = New PdfReader(sourcePDF)
        Dim pageBytes() As Byte = Nothing
        Dim token As PRTokeniser = Nothing
        Dim tknType As Integer = -1
        Dim tknValue As String = String.Empty

        'A value of 0 means "not specified": default to the first and last pages
        If fromPageNum = 0 Then
            fromPageNum = 1
        End If
        If toPageNum = 0 Then
            toPageNum = reader.NumberOfPages
        End If

        If fromPageNum > toPageNum Then
            Throw New ApplicationException("Parameter error: The value of fromPageNum cannot " & _
                                           "be larger than the value of toPageNum")
        End If

        For i As Integer = fromPageNum To toPageNum
            pageBytes = reader.GetPageContent(i)
            If Not IsNothing(pageBytes) Then
                token = New PRTokeniser(pageBytes)
                While token.NextToken()
                    tknType = token.TokenType()
                    tknValue = token.StringValue
                    If tknType = PRTokeniser.TK_STRING Then
                        sb.Append(tknValue)
                    'These additional tests add whitespace to the output string:
                    'a "-600" number token (TK_NUMBER = 1) and the "TJ" operator
                    '(TK_OTHER = 10) are both treated as a gap between words
                    ElseIf tknType = PRTokeniser.TK_NUMBER AndAlso tknValue = "-600" Then
                        sb.Append(" ")
                    ElseIf tknType = PRTokeniser.TK_OTHER AndAlso tknValue = "TJ" Then
                        sb.Append(" ")
                    End If
                End While
            End If
        Next i
    Catch ex As Exception
        'MessageBox needs a reference to System.Windows.Forms
        MessageBox.Show("Exception occurred. " & ex.Message)
        Return String.Empty
    Finally
        'Release the file handle whether or not parsing succeeded
        If Not reader Is Nothing Then reader.Close()
    End Try
    Return sb.ToString()
End Function
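
Since the goal was to search the extracted text for a specific phrase, here is a minimal sketch of that search step. The names PdfTextSearch, NormalizeWhitespace and ContainsPhrase are made up for illustration and are not part of iTextSharp. The sketch assumes the extracted text may contain runs of spaces or line breaks (the function above appends a space for every TJ operator), so it collapses whitespace before comparing.

```vb
Imports System
Imports System.Text.RegularExpressions

Public Module PdfTextSearch

    'Hypothetical helper: collapses every run of whitespace (spaces, tabs,
    'line breaks) into a single space and trims both ends.
    Public Function NormalizeWhitespace(ByVal value As String) As String
        If String.IsNullOrEmpty(value) Then
            Return String.Empty
        End If
        Return Regex.Replace(value, "\s+", " ").Trim()
    End Function

    'Hypothetical helper: case-insensitive phrase search over the normalized
    'text. An empty phrase is reported as "not found".
    Public Function ContainsPhrase(ByVal extractedText As String, _
                                   ByVal phrase As String) As Boolean
        Dim haystack As String = NormalizeWhitespace(extractedText)
        Dim needle As String = NormalizeWhitespace(phrase)
        If needle.Length = 0 Then
            Return False
        End If
        Return haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0
    End Function

End Module
```

With this in place, a call such as ContainsPhrase(ParsePdfText(pdfPath), "some phrase") would report whether the phrase appears, where pdfPath is whatever file you are checking. Text whose characters are split across several string tokens in unusual ways may still not match, so treat this as a starting point.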

**gtilles** · Jun 25th, 2007, 02:45 PM

Thanks,
Good timing
Exactly what I was looking for....
I'd like to figure out a way to diff two PDFs. Converting them to text first seems like a viable solution.

**danasegarane** · Jun 26th, 2007, 02:10 AM

Nice one, stanav.
Why don't you add one more, to extract images from a PDF?

**Dipal** · Apr 15th, 2009, 07:07 AM

Hi,

I am extracting data from PDF files using ASP.NET 2005, but there is one PDF file I can't extract data from.
This file is read-only (you can't copy data from it), so I think that's why the extraction fails.

If anyone has any ideas, please help me.

Thanks.

**stanav** · Apr 15th, 2009, 07:17 AM

PDF files can be created in many different ways, and whether you can extract text from a PDF depends on how it was created. For example, if a PDF was made by scanning a document, each page is an image and you cannot extract the text using iTextSharp or any other PDF library. In that case, you will need some kind of OCR software to do it.

**Dipal** · Apr 15th, 2009, 07:30 AM

Thanks, stanav.

**kadsat** · May 13th, 2009, 11:20 PM

Hi, I want to extract the tags from a tagged PDF using C# or VB.NET. How can I do this with iTextSharp or any other open-source PDF library?

Thanks,
KadSat

**jivangoyal** · Apr 9th, 2010, 07:41 AM

Hi, is it possible to search for a word in a PDF and get the font used for that word in the file?
