Thread: [RESOLVED] "replace" Words in PDF file using iTextSharp

  1. #1

    Thread Starter
    Frenzied Member HanneSThEGreaT's Avatar
    Join Date
    Nov 2003
    Location
    Vereeniging, South Africa
    Posts
    1,491

    [RESOLVED] "replace" Words in PDF file using iTextSharp

    Hello.

    Long time no see.

    I have been given a task to replace text within an existing PDF file. I played around with iTextSharp and am about halfway there.

    I did come across an excellent sample on the CodeBank by stanav, which I have been using.

    Now, I know that you cannot replace the existing text on the file, because a PDF document is not a Word document as such. So my way of thinking is to draw a block around the existing text, and resave the file.

    For this, I need help. I need to find the precise x & y location of the text, and then I could draw the block(s) over it - that way, blanking the words out.

    This is my code:
    Code:
    Imports System.IO
    Imports System.Text
    Imports System.Collections.Generic
    Imports System.Linq
    Imports iTextSharp.text
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser
    
    Public Class Form1
    
        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    
            ParsePdfText(Application.StartupPath & "\file.pdf", "Test")
        End Sub
    
        Public Function ParsePdfText(ByVal sourcePDF As String, ByVal texttosearch As String, _
                                      Optional ByVal fromPageNum As Integer = 0, _
                                      Optional ByVal toPageNum As Integer = 0) As String
            Dim sb As New System.Text.StringBuilder()
            Try
                Dim reader As New PdfReader(sourcePDF)
                Dim pageBytes() As Byte = Nothing
                Dim token As PRTokeniser = Nothing
                Dim tknType As Integer = -1
                Dim tknValue As String = String.Empty
                If fromPageNum = 0 Then
                    fromPageNum = 1
                End If
                If toPageNum = 0 Then
                    toPageNum = reader.NumberOfPages
                End If
                If fromPageNum > toPageNum Then
                    Throw New ApplicationException("Parameter error: The value of fromPageNum can " & _
                                               "not be larger than the value of toPageNum")
                End If
                For i As Integer = fromPageNum To toPageNum Step 1
                    pageBytes = reader.GetPageContent(i)
                    If Not IsNothing(pageBytes) Then
                        token = New PRTokeniser(pageBytes)
                        While token.NextToken()
                            tknType = token.TokenType()
                            tknValue = token.StringValue
                            If tknType = PRTokeniser.TokType.STRING AndAlso tknValue = texttosearch Then
    
                                sb.Append(token.StringValue)
                                MsgBox("found")
    
                            End If
                        End While
                    End If
                Next i
            Catch ex As Exception
                MessageBox.Show("Exception occurred. " & ex.Message)
                Return String.Empty
            End Try
            Return sb.ToString()
           
        End Function
    
    
        Public Sub Write(ByVal stream As Stream)
            ' step 1
            Using document As New Document()
                ' step 2
                PdfWriter.GetInstance(document, New FileStream(Application.StartupPath & "\Output1.pdf", FileMode.Create))
                ' step 3
                document.Open()
                ' step 4
                document.Add(New Paragraph("Hello World!"))
            End Using
        End Sub
    
        Public Function ReadPdfFile(ByVal fileName As String) As String
            Dim text As New StringBuilder()
    
            If File.Exists(fileName) Then
                Dim pdfReader As New PdfReader(fileName)
    
                For page As Integer = 1 To pdfReader.NumberOfPages
                    Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
                    Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)
    
                    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.[Default], Encoding.UTF8, Encoding.[Default].GetBytes(currentText)))
                    text.Append(currentText)
                Next
                ' Close the reader only after every page has been read;
                ' closing it inside the loop breaks extraction from page 2 onwards.
                pdfReader.Close()
            End If
            Return text.ToString()
        End Function
    End Class
    I need to be able to blank out the word(s) and save the result as a new file.

    Can anyone help an old dumb guy out please?

  2. #2
    PowerPoster jcis's Avatar
    Join Date
    Jan 2003
    Location
    Argentina
    Posts
    4,423

    Re: "replace" Words in PDF file using iTextSharp

    Is creating a template from your PDF and using fields (AcroFields) not an option? In case you haven't considered this alternative, here is a link where it's explained:
    Using a template to programmatically create PDFs with C# and iTextSharp

  3. #3
    PowerPoster jcis's Avatar
    Join Date
    Jan 2003
    Location
    Argentina
    Posts
    4,423

    Re: "replace" Words in PDF file using iTextSharp

    As for the method you described (placing blocks over the words on the OverContent layer to blank them out), the only part that's going to be difficult is getting the x,y coordinates of the matching words in the document. After doing a Google search and reading lots of answers saying "it can't be done", I found this one. The idea in the answer given there is simply to copy the LocationTextExtractionStrategy class from the iTextSharp source code into your project, because it contains the coordinates you need. I don't have time right now to investigate further, but I'll take a look later. Once you have these coordinates, the rest should be easy.
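
    A minimal sketch of how those coordinates can be read, assuming iTextSharp 5.x (where TextRenderInfo exposes GetAscentLine and GetDescentLine) and unrotated text. The class name ChunkBoundsStrategy and its Bounds/Texts lists are made up for illustration; they are not part of iTextSharp and not the code from the linked answer. Each rectangle covers one text chunk as the PDF stores it, which is not always one whole word.

```vbnet
Imports System.Collections.Generic
Imports iTextSharp.text.pdf.parser

' Hypothetical helper (not part of iTextSharp): remembers a bounding box for every text chunk.
Public Class ChunkBoundsStrategy
    Implements ITextExtractionStrategy

    ' One rectangle per chunk, in PDF points, in the order the chunks were rendered.
    Public ReadOnly Bounds As New List(Of iTextSharp.text.Rectangle)()
    ' The chunk texts, in the same order as Bounds.
    Public ReadOnly Texts As New List(Of String)()

    Public Sub BeginTextBlock() Implements IRenderListener.BeginTextBlock
    End Sub

    Public Sub EndTextBlock() Implements IRenderListener.EndTextBlock
    End Sub

    Public Sub RenderImage(ByVal renderInfo As ImageRenderInfo) Implements IRenderListener.RenderImage
    End Sub

    ' Called by the parser once per text chunk on the page.
    Public Sub RenderText(ByVal renderInfo As TextRenderInfo) Implements IRenderListener.RenderText
        ' Lower-left corner: start of the descent line. Upper-right corner: end of the ascent line.
        Dim lowerLeft As Vector = renderInfo.GetDescentLine().GetStartPoint()
        Dim upperRight As Vector = renderInfo.GetAscentLine().GetEndPoint()
        Bounds.Add(New iTextSharp.text.Rectangle(lowerLeft(Vector.I1), lowerLeft(Vector.I2), _
                                                 upperRight(Vector.I1), upperRight(Vector.I2)))
        Texts.Add(renderInfo.GetText())
    End Sub

    ' Chunks joined in render order, with no attempt to rebuild spaces or lines.
    Public Function GetResultantText() As String Implements ITextExtractionStrategy.GetResultantText
        Return String.Join("", Texts.ToArray())
    End Function
End Class
```

    To fill the lists, pass an instance to PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy) and read strategy.Bounds afterwards.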
    Last edited by jcis; Jun 6th, 2012 at 08:40 PM.

  4. #4

    Thread Starter
    Frenzied Member HanneSThEGreaT's Avatar
    Join Date
    Nov 2003
    Location
    Vereeniging, South Africa
    Posts
    1,491

    Re: "replace" Words in PDF file using iTextSharp

    Hi. Thanks for all your kind help.

    This is what I have done so far :

    Code:
        Public Function ParsePdfText(ByVal sourcePDF As String, ByVal texttosearch As String, ByVal DestPDF As String, _
                                      Optional ByVal fromPageNum As Integer = 0, _
                                      Optional ByVal toPageNum As Integer = 0) As String
            Dim sb As New System.Text.StringBuilder()
            Try
                Using existingFileStream = New FileStream(sourcePDF, FileMode.Open)
                    Using newFileStream = New FileStream(DestPDF, FileMode.Create)
                        Dim reader As New PdfReader(existingFileStream)
                        Dim stamper = New PdfStamper(reader, newFileStream)
    
                        Dim pageBytes() As Byte = Nothing
                        Dim token As PRTokeniser = Nothing
                        Dim tknType As Integer = -1
                        Dim tknValue As String = String.Empty
                        If fromPageNum = 0 Then
                            fromPageNum = 1
                        End If
                        If toPageNum = 0 Then
                            toPageNum = reader.NumberOfPages
                        End If
                        If fromPageNum > toPageNum Then
                            Throw New ApplicationException("Parameter error: The value of fromPageNum can " & _
                                                       "not be larger than the value of toPageNum")
                        End If
                        For i As Integer = fromPageNum To toPageNum Step 1
                            pageBytes = reader.GetPageContent(i)
                            If Not IsNothing(pageBytes) Then
                                token = New PRTokeniser(pageBytes)
                                While token.NextToken()
                                    tknType = token.TokenType()
                                    tknValue = token.StringValue
                                    If tknType = PRTokeniser.TokType.STRING AndAlso tknValue = texttosearch Then
    
                                        sb.Append(token.StringValue)
                                        RichTextBox1.Text = "found"
                                        FindLocation()
    
                                    End If
                                End While
                            End If
                        Next i
                        stamper.FormFlattening = True
    
    
                        stamper.Close()
                        reader.Close()
                    End Using
                End Using
            Catch ex As Exception
                MessageBox.Show("Exception occurred. " & ex.Message)
                Return String.Empty
            End Try
            Return sb.ToString()
    
        End Function
    
        Public Function FindLocation()
            Dim tempstr As String
            ' Dim tr As New parser.TextRenderInfo
            Dim LST As New LocationTextExtractionStrategy()
            tempstr = LST.GetResultantText()
            RichTextBox1.Text = tempstr
        End Function
    And I added the LocationTextExtractionStrategy class :

    Code:
    Imports System.Collections.Generic
    Imports System.Text
    Imports iTextSharp.text.pdf.parser
    
    
    Public Class LocationTextExtractionStrategy
        Implements ITextExtractionStrategy
    
    
        '* set to true for debugging 
    
        Public Shared DUMP_STATE As Boolean = False
    
        '* a summary of all found text 
    
        Private locationalResult As New List(Of TextChunk)()
    
        '*
        '         * Creates a new text extraction renderer.
        '         
    
        Public Sub New()
        End Sub
    
        '*
        '         * @see com.itextpdf.text.pdf.parser.RenderListener#beginTextBlock()
        '         
    
        Public Overridable Sub BeginTextBlock()
        End Sub
    
        '*
        '         * @see com.itextpdf.text.pdf.parser.RenderListener#endTextBlock()
        '         
    
        Public Overridable Sub EndTextBlock()
        End Sub
    
        '*
        '         * Returns the result so far.
        '         * @return  a String with the resulting text.
        '         
    
        Public Overridable Function GetResultantText() As [String]
    
            If DUMP_STATE Then
                DumpState()
            End If
    
            locationalResult.Sort()
    
            Dim sb As New StringBuilder()
            Dim lastChunk As TextChunk = Nothing
            For Each chunk As TextChunk In locationalResult
    
                If lastChunk Is Nothing Then
                    sb.Append(chunk.text)
                Else
                    If chunk.SameLine(lastChunk) Then
                        Dim dist As Single = chunk.DistanceFromEndOf(lastChunk)
    
                        If dist < -chunk.charSpaceWidth Then
                            sb.Append(" "c)
    
                            ' we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                        ElseIf dist > chunk.charSpaceWidth / 2.0F AndAlso chunk.text(0) <> " "c AndAlso lastChunk.text(lastChunk.text.Length - 1) <> " "c Then
                            sb.Append(" "c)
                        End If
    
                        sb.Append(chunk.text)
                    Else
                        sb.Append(ControlChars.Lf)
                        sb.Append(chunk.text)
                    End If
                End If
                lastChunk = chunk
            Next
    
            Return sb.ToString()
    
        End Function
    
        '* Used for debugging only 
    
        Private Sub DumpState()
            For Each location As TextChunk In locationalResult
    
                location.PrintDiagnostics()
    
                Console.WriteLine()
            Next
    
        End Sub
    
        '*
        '         * 
        '         * @see com.itextpdf.text.pdf.parser.RenderListener#renderText(com.itextpdf.text.pdf.parser.TextRenderInfo)
        '         
    
        Public Overridable Sub RenderText(ByVal renderInfo As TextRenderInfo)
            Dim segment As LineSegment = renderInfo.GetBaseline()
            Dim location As New TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth())
            locationalResult.Add(location)
        End Sub
    
    
    
        '*
        '         * Represents a chunk of text, it's orientation, and location relative to the orientation vector
        '         
    
        Private Class TextChunk
            Implements IComparable(Of TextChunk)
    
            '* the text of the chunk 
    
            Friend text As [String]
            '* the starting location of the chunk 
    
            Friend startLocation As Vector
            '* the ending location of the chunk 
    
            Friend endLocation As Vector
            '* unit vector in the orientation of the chunk 
    
            Friend orientationVector As Vector
            '* the orientation as a scalar for quick sorting 
    
            Friend orientationMagnitude As Integer
            '* perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system)
            '             * we round to the nearest integer to handle the fuzziness of comparing floats 
    
            Friend distPerpendicular As Integer
            '* distance of the start of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system) 
    
            Friend distParallelStart As Single
            '* distance of the end of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system) 
    
            Friend distParallelEnd As Single
            '* the width of a single space character in the font of the chunk 
    
            Friend charSpaceWidth As Single
    
            Public Sub New(ByVal str As [String], ByVal startLocation As Vector, ByVal endLocation As Vector, ByVal charSpaceWidth As Single)
                Me.text = str
                Me.startLocation = startLocation
                Me.endLocation = endLocation
                Me.charSpaceWidth = charSpaceWidth
    
                orientationVector = endLocation.Subtract(startLocation).Normalize()
                orientationMagnitude = CInt(Math.Truncate(Math.Atan2(orientationVector(Vector.I2), orientationVector(Vector.I1)) * 1000))
    
                ' see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
                ' the two vectors we are crossing are in the same plane, so the result will be purely
                ' in the z-axis (out of plane) direction, so we just take the I3 component of the result
                Dim origin As New Vector(0, 0, 1)
                distPerpendicular = CInt((startLocation.Subtract(origin)).Cross(orientationVector)(Vector.I3))
    
                distParallelStart = orientationVector.Dot(startLocation)
                distParallelEnd = orientationVector.Dot(endLocation)
            End Sub
    
            Public Sub PrintDiagnostics()
                Console.WriteLine("Text (@" & Convert.ToString(startLocation) & " -> " & Convert.ToString(endLocation) & "): " & text)
                Console.WriteLine("orientationMagnitude: " & orientationMagnitude)
                Console.WriteLine("distPerpendicular: " & distPerpendicular)
                Console.WriteLine("distParallel: " & distParallelStart)
            End Sub
    
            '*
            '             * @param as the location to compare to
            '             * @return true is this location is on the the same line as the other
            '             
    
            Public Function SameLine(ByVal a As TextChunk) As Boolean
                If orientationMagnitude <> a.orientationMagnitude Then
                    Return False
                End If
                If distPerpendicular <> a.distPerpendicular Then
                    Return False
                End If
                Return True
            End Function
    
            '*
            '             * Computes the distance between the end of 'other' and the beginning of this chunk
            '             * in the direction of this chunk's orientation vector.  Note that it's a bad idea
            '             * to call this for chunks that aren't on the same line and orientation, but we don't
            '             * explicitly check for that condition for performance reasons.
            '             * @param other
            '             * @return the number of spaces between the end of 'other' and the beginning of this chunk
            '             
    
            Public Function DistanceFromEndOf(ByVal other As TextChunk) As Single
                Dim distance As Single = distParallelStart - other.distParallelEnd
                Return distance
            End Function
    
            '*
            '             * Compares based on orientation, perpendicular distance, then parallel distance
            '             * @see java.lang.Comparable#compareTo(java.lang.Object)
            '             
    
            Public Function CompareTo(ByVal rhs As TextChunk) As Integer
                If Me Is rhs Then
                    Return 0
                End If
                ' not really needed, but just in case
                Dim rslt As Integer
                rslt = CompareInts(orientationMagnitude, rhs.orientationMagnitude)
                If rslt <> 0 Then
                    Return rslt
                End If
    
                rslt = CompareInts(distPerpendicular, rhs.distPerpendicular)
                If rslt <> 0 Then
                    Return rslt
                End If
    
                ' note: it's never safe to check floating point numbers for equality, and if two chunks
                ' are truly right on top of each other, which one comes first or second just doesn't matter
                ' so we arbitrarily choose this way.
                rslt = If(distParallelStart < rhs.distParallelStart, -1, 1)
    
                Return rslt
            End Function
    
            '*
            '             *
            '             * @param int1
            '             * @param int2
            '             * @return comparison of the two integers
            '             
    
            Private Shared Function CompareInts(ByVal int1 As Integer, ByVal int2 As Integer) As Integer
                Return If(int1 = int2, 0, If(int1 < int2, -1, 1))
            End Function
    
    
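            Public Function CompareTo1(ByVal other As TextChunk) As Integer Implements System.IComparable(Of TextChunk).CompareTo
                ' Delegate to CompareTo above. With an empty body this would always
                ' return 0, so locationalResult.Sort() would not order the chunks.
                Return CompareTo(other)
            End Function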
        End Class
    
    End Class
    I did play around with it in my FindLocation function, but I am getting nowhere. I am missing a trick somewhere. Any advice?
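
    One likely reason FindLocation shows nothing: it creates a brand-new strategy and asks it for text straight away, but a strategy only receives text through RenderText calls, and those are made while a page is being parsed. A minimal sketch of feeding a page through a strategy, assuming iTextSharp 5.x; FindLocationOnPage is a made-up name, and the strategy used here is the one that ships inside iTextSharp rather than the copied class above.

```vbnet
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Module FindLocationSketch
    ' Hypothetical replacement for FindLocation: parses one page through the strategy,
    ' so RenderText is called for every text chunk before the text is requested.
    Public Function FindLocationOnPage(ByVal reader As PdfReader, ByVal pageNum As Integer) As String
        Dim strategy As New iTextSharp.text.pdf.parser.LocationTextExtractionStrategy()
        ' GetTextFromPage drives the parser and then returns strategy.GetResultantText().
        Return PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy)
    End Function
End Module
```

    Inside the page loop this would be called as RichTextBox1.Text = FindLocationOnPage(reader, i). It still returns text only; to get x,y values out, the start and end points collected in RenderText have to be exposed from the strategy.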

  5. #5
    New Member
    Join Date
    Jun 2012
    Posts
    1

    Re: "replace" Words in PDF file using iTextSharp

    Hi, could you solve the problem?

  6. #6
    PowerPoster jcis's Avatar
    Join Date
    Jan 2003
    Location
    Argentina
    Posts
    4,423

    Re: "replace" Words in PDF file using iTextSharp

    I did some experiments with the OverContent approach. It seems to be the only way to do it, because altering a PDF file, or even creating a new one with the same format but with changes in the text, looks even harder than decompiling a C++ EXE.

    I think I'll be able to find text with x,y coordinates. Doing this would be easier with some modifications to the iTextSharp dll, but I don't want to give a solution that requires recompiling the dll's source code. iTextSharp comes from iText (Java) and is still in a development state: some things look a bit "experimental", and some important objects/methods are not exposed from inside the deeper classes, which makes things difficult from outside.

    But I'm getting closer. I'll post back as soon as I have the part about placing blank blocks over words or phrases working.

    About the Replace idea that comes after all this (adding other text over these blank rectangles), it has known limitations. The text you add on the OverContent layer won't be available when the user searches the document. Also, if you replace a long word (or phrase) with a much shorter one, a big portion of the blank rectangle will not be covered, which is not a very nice visual effect.
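
    The size of that uncovered gap can be estimated before drawing anything. A rough sketch, assuming iTextSharp 5.x and assuming both strings are set in built-in Helvetica with WinAnsi encoding; a real document would need the font actually used on the page. UncoveredWidth is a made-up helper name.

```vbnet
Imports iTextSharp.text.pdf

Public Module ReplacementWidthSketch
    ' Hypothetical helper: how many points of the blanked-out rectangle stay empty
    ' when "replacement" is drawn in place of "original" at the same font size.
    ' A negative result means the replacement is wider than the rectangle.
    Public Function UncoveredWidth(ByVal original As String, ByVal replacement As String, _
                                   ByVal fontSize As Single) As Single
        ' Assumption: Helvetica stands in for whatever font the page really uses.
        Dim bf As BaseFont = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED)
        Dim originalWidth As Single = bf.GetWidthPoint(original, fontSize)
        Dim replacementWidth As Single = bf.GetWidthPoint(replacement, fontSize)
        Return originalWidth - replacementWidth
    End Function
End Module
```

    If the result is large, centring the new text in the rectangle or shrinking the rectangle to the new width are two ways to soften the effect.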
    Last edited by jcis; Jun 13th, 2012 at 01:25 PM.

  7. #7
    PowerPoster jcis's Avatar
    Join Date
    Jan 2003
    Location
    Argentina
    Posts
    4,423

    Re: "replace" Words in PDF file using iTextSharp

    Ok, See the project attached. Add the reference to your iTextSharp dll.

    See the comments in code. In the call you should specify the text to search, compare method, source and destination file paths.

    The example will highlight in pink all words/sentences found in the PDF document that match the search text.

    I'm not sure what the best way to make solid white rectangles is, but if you remove the line cb.Fill() it will cover those rectangles with white, making all these words/sentences disappear.

    This will get the correct coordinates in the PDF for each word/sentence you search for, even taking font name and font size into account when making these calculations.

    Report back if you find an error or any weird behavior, or if there is something else to add, for example adding different text where the words have been removed or whatever.
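
    One way that should give solid white rectangles, sketched under the assumption of iTextSharp 5.x: set the fill colour to white before drawing the rectangle and keep the Fill() call. CoverArea and its parameters are made-up names, not code from the attached project, and the rectangle is assumed to come from whatever coordinate search you use.

```vbnet
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Public Module BlankOutSketch
    ' Hypothetical helper: paints one opaque white rectangle over "area" on the given page
    ' and saves the result as a new file.
    Public Sub CoverArea(ByVal sourcePdf As String, ByVal destPdf As String, _
                         ByVal pageNum As Integer, ByVal area As iTextSharp.text.Rectangle)
        Dim reader As New PdfReader(sourcePdf)
        Using output As New FileStream(destPdf, FileMode.Create)
            Dim stamper As New PdfStamper(reader, output)
            ' OverContent is drawn on top of the existing page content.
            Dim cb As PdfContentByte = stamper.GetOverContent(pageNum)
            cb.SaveState()
            cb.SetColorFill(BaseColor.WHITE)
            cb.Rectangle(area.Left, area.Bottom, area.Width, area.Height)
            cb.Fill()
            cb.RestoreState()
            ' The stamper must be closed before the stream, or the output file is incomplete.
            stamper.Close()
        End Using
        reader.Close()
    End Sub
End Module
```

    Keep in mind that this only hides the words visually. The original text is still in the page's content stream, so it can still be found by search or copied out by a text extractor.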
    Attached Files
    Last edited by jcis; Jul 30th, 2012 at 06:22 PM.

  8. #8

    Thread Starter
    Frenzied Member HanneSThEGreaT's Avatar
    Join Date
    Nov 2003
    Location
    Vereeniging, South Africa
    Posts
    1,491

    Re: "replace" Words in PDF file using iTextSharp

    Wow. Words cannot describe your kindness. Your way of doing things is way simpler than mine. I'm getting old. My brain is not what it used to be.

    Thank you for helping me!

  9. #9
    New Member
    Join Date
    Aug 2011
    Posts
    10

    Re: "replace" Words in PDF file using iTextSharp

    Hello JCIS,

    First of all, thanks for the code in the test.zip file you wrote for replacing text in a PDF file.

    I need a little help with it: I am using VS 2008 (i.e. C# 2008) and am getting compile errors
    in the code below.

    Error 1 :- .Last is not a member of List
    Function Name:- GetTextLocations
    Code Line : If ThisLineChunks.Count > 0 AndAlso Not chunk.SameLine(ThisLineChunks.Last) Then

    Error 2 :- .ElementAt is not a member of IList
    Function Name:- GetRectangleFromText
    Code Line : Dim LineTextWidth As Single = GetStringWidth(sTextinChunks, LastChunk.curFontSize, _
    LastChunk.charSpaceWidth, _
    ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))

    So could you please help me by sharing code that is compatible with C# 2008, i.e. Framework 2.0?

    Your code would be a great help.

    Thanks,
    RK Tech
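
    Last and ElementAt are LINQ extension methods from System.Linq, which arrived with .NET Framework 3.5, so a project targeting Framework 2.0 cannot see them. A small sketch of plain replacements, written in VB to match the two failing lines; LastItem and ItemAt are made-up helper names, not part of the attached project.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module Net20Helpers
    ' Stand-in for list.Last: the final element of an indexable list.
    Public Function LastItem(Of T)(ByVal items As IList(Of T)) As T
        If items.Count = 0 Then
            Throw New InvalidOperationException("The list is empty.")
        End If
        Return items(items.Count - 1)
    End Function

    ' Stand-in for sequence.ElementAt(index): walks the sequence until the index is reached.
    Public Function ItemAt(Of T)(ByVal items As IEnumerable(Of T), ByVal index As Integer) As T
        Dim position As Integer = 0
        For Each item As T In items
            If position = index Then
                Return item
            End If
            position += 1
        Next
        Throw New ArgumentOutOfRangeException("index")
    End Function
End Module
```

    The two failing expressions would then become chunk.SameLine(LastItem(ThisLineChunks)) and ItemAt(ThisPdfDocFonts.Values, LastChunk.FontIndex). If the project can target Framework 3.5 instead, adding a reference to System.Core and Imports System.Linq should make the original lines compile unchanged.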

  10. #10
    New Member
    Join Date
    Apr 2015
    Posts
    1

    Re: [RESOLVED] "replace" Words in PDF file using iTextSharp

    I'm having a similar problem with the replacement of text. I am using a modified version of the test solution, which creates the new file where I need it to be, but when I try to open it I get the error: "There was an error opening this document. The file is damaged and could not be repaired."
