Itextsharp search word in multiple PDF then isolate the PDF document
My first post. I already used the iTextSharp PdfReader to find a word in multiple PDF documents. Now I want to copy the PDF that contains the word into a new PDF. How can I download Manipulatepdf2.vb? I think the class includes a method to accomplish what I am trying to do. Thank you all.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
From what you wrote, it seems like you want to merge 2 or more pdf documents into a single pdf. To do this, you can use either the MergePdfFiles or MergePdfFilesWithBookmarks method found in PdfManipulation2 class. You can download that class here: http://www.vbforums.com/showthread.p...ing-iTextSharp
Let us have faith that right makes might, and in that faith, let us, to the end, dare to do our duty as we understand it. - Abraham Lincoln -
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Actually, I do not want to merge. I found and downloaded the PdfManipulation2 class. What I want is this: I used PdfReader to search for a word in a multi-document PDF; in other words, the file contains all kinds of individual PDFs, each 1 to 3 pages long. Now let's say I searched for "Dave Jones" and found it on page 5. How do I get the page number or document to pass to PdfManipulation2.ExtractPdfPage("sss.pdf", pagenumber, outputpdf)? Thank you.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
I tried this from your prior post, but it is not working. Thank you.
'Specify the path to the source pdf file
Dim sourcePdf As String = "C:\Documents and Settings\rch\Desktop\psm2010\venteps\out\table40.pdf"
'Extract page # 2 from the above pdf file
Dim pageNumberToExtract As Integer = 2
'And then save it to a new pdf named 'table40_page2.pdf'
Dim outputPdf As String = "C:\Documents and Settings\rch\Desktop\psm2010\venteps\out\table40_page2.pdf"
'Call the sub somewhere in your program passing in the above arguments
PdfManipulation.ExtractPdfPage(sourcePdf, pageNumberToExtract, outputPdf)
Re: Itextsharp search word in multiple PDF then isolate the PDF document
1. To know on which page your search term was found, you need to search for it page by page. That is, run a loop through the pdf pages and do the search on each page. If the term is found, record that page number (e.g. add it to a list) for later use. Once you are out of the loop, check whether the list contains anything. If it does, loop through the list and extract the pages.
2. "It's not working" isn't very informative. It's like going to a doctor and saying "I'm sick" without any detailed description of the symptoms... You need to tell me what happened and/or what didn't happen when you ran that code.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Stanav, thank you very much for your help. I think I put the error message in a different post. The message was "the item has already being created": the copypdf line creates the file, and the AddPage(page) line was choking.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
I have the search word "Hello" on page 15 of a 30-page pdf document which is made up of 10 separate pdf documents. When I run this function, it finds the word after the first read, when i=1, and sets sOut = "Hello and the rest of the information on the page". What am I doing wrong?
BTW I also have input directory and output directory with different file names.
Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
    Dim sOut As String
    Dim _pageNumber As Integer
    Dim i As Integer
    sOut = "Hello"
    Dim x As Integer = 1
    For i = 1 To oReader.NumberOfPages
        Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
        sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
    Next
    Return sOut
End Function
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Try something like this:
vb.net Code:
''' <summary>
''' Simple page by page text search of a PDF file; returns a list of the page numbers where a match was found.
''' </summary>
''' <param name="sourcePdf">the full path to the pdf file to be searched</param>
''' <param name="searchPhrase">the string to search for</param>
''' <param name="caseSensitive">True to match case exactly; False (the default) to ignore case</param>
''' <returns>List(Of Integer) containing the numbers of the pages that contain one or more matches</returns>
''' <remarks></remarks>
Public Shared Function SearchTextFromPdf(ByVal sourcePdf As String, ByVal searchPhrase As String, Optional ByVal caseSensitive As Boolean = False) As List(Of Integer)
    Dim foundList As New List(Of Integer)
    Dim raf As iTextSharp.text.pdf.RandomAccessFileOrArray = Nothing
    Dim reader As iTextSharp.text.pdf.PdfReader = Nothing
    Try
        raf = New iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf)
        reader = New iTextSharp.text.pdf.PdfReader(raf, Nothing)
        If caseSensitive = False Then
            searchPhrase = searchPhrase.ToLower()
        End If
        For i As Integer = 1 To reader.NumberOfPages()
            Dim pageText As String = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, i)
            If caseSensitive = False Then
                pageText = pageText.ToLower()
            End If
            If pageText.Contains(searchPhrase) Then
                foundList.Add(i)
            End If
        Next
        reader.Close()
    Catch ex As Exception
        MessageBox.Show(ex.Message)
    End Try
    Return foundList
End Function
After you've got the list of the page numbers where the search matched, simply loop through it and call the ExtractPdfPage method of the PdfManipulation2 class.
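A minimal sketch of that loop, for illustration only. It assumes SearchTextFromPdf has been pasted into the PdfManipulation2 class (the thread does not say where it lives), and the ExtractMatches name and the output file naming are made up here. Each matching page ends up in its own output file.

```vb
Imports System.Collections.Generic
Imports System.IO

Module ExtractMatchesSketch
    ' Hypothetical helper: writes every page that contains searchPhrase
    ' to its own pdf in outputFolder.
    Sub ExtractMatches(ByVal sourcePdf As String, ByVal searchPhrase As String, ByVal outputFolder As String)
        ' Assumption: SearchTextFromPdf was added to PdfManipulation2.
        Dim pages As List(Of Integer) = PdfManipulation2.SearchTextFromPdf(sourcePdf, searchPhrase)

        For Each pageNumber As Integer In pages
            ' e.g. table40.pdf, page 5 -> table40_page5.pdf
            Dim outName As String = Path.GetFileNameWithoutExtension(sourcePdf) & "_page" & pageNumber.ToString() & ".pdf"
            Dim outPdf As String = Path.Combine(outputFolder, outName)
            PdfManipulation2.ExtractPdfPage(sourcePdf, pageNumber, outPdf)
        Next
    End Sub
End Module
```

If the list comes back empty, nothing is written, so the caller may want to check the count first and report "no match".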
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Hey Stanav, please shed some light on how to retrieve a 3-page PDF from a PDF file made up of multiple PDFs. I search for a word and find it on page 5, but page 5 is the first page of a 3-page PDF and I want to retrieve all 3 pages into the output folder. Currently I have combined methods from PdfManipulation2, but it has become a mess. Thanks for your help.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
How can you tell how many pages to grab after each search match? There has to be a rule of some sort. Computers are not human, and if the commands/rules aren't clear, they can't be executed reliably.
As for achieving the task you are working on, I've given you all the relevant code needed to get it done. It's now just a matter of using it - modify it when necessary - to make it work the way you want. Programming is a lot more than copying and pasting.
In order for me to provide further help, you need to:
1. Upload a sample pdf file that I can use to test with
2. State clearly what you want to do with the pdf
3. State any rules/patterns that must be obeyed...
I don't promise anything, but if I can spare some time, I'll give it a try.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
I can't tell you what's wrong until I have a chance to examine it myself. That's the reason why I asked you to upload a test file and provide me the necessary info to do the test. The line you pointed out where the error occurred is entirely within iTextSharp code, and therefore I suspect that something is wrong in your code rather than a bug in iTextSharp.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Due to constraints of the contents of the data, I cannot send actual stuff but this is what I am doing. Thanks for your patience.
I call this function with the supplied parameters
PdfManipulation2.ExtractPdfPage("c:\documents\xyz.PDF", 22, "c:\single_XYZPdf\single.PDF")
"c:\documents\xyz.PDF" --contains 30 customer letters(each 3 pages long)
"c:\single_XYZPdf\single.PDF" -- will contain page 22. ** I will write code to loop from page 22 for 3 pages to output to single_PDF
This code is from PdfManipulation2.ExtractPdfPage:
pdfCpy = New iTextSharp.text.pdf.PdfCopy(doc, New IO.FileStream(outPdf, IO.FileMode.Create)) ' This line creates the PDF.
doc.Open()
page = pdfCpy.GetImportedPage(reader, pageNumberToExtract)
pdfCpy.AddPage(page) ' ERROR occurs here.
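Since the source file here is described as 30 customer letters of exactly 3 pages each, the first page of the letter containing any matching page can be worked out with integer arithmetic. The sketch below is only that arithmetic; the FirstPageOfLetter name is made up, and it assumes every letter really has the same length and that the first letter starts on page 1.

```vb
Imports System

Module LetterRangeSketch
    ' Hypothetical helper: returns the 1-based first page of the fixed-length
    ' letter that contains matchPage.
    Function FirstPageOfLetter(ByVal matchPage As Integer, ByVal pagesPerLetter As Integer) As Integer
        If matchPage < 1 Then Throw New ArgumentOutOfRangeException("matchPage")
        If pagesPerLetter < 1 Then Throw New ArgumentOutOfRangeException("pagesPerLetter")
        ' Integer division (\) drops the remainder, giving the zero-based letter index.
        Return ((matchPage - 1) \ pagesPerLetter) * pagesPerLetter + 1
    End Function

    Sub Main()
        Dim pagesPerLetter As Integer = 3
        Dim firstPage As Integer = FirstPageOfLetter(23, pagesPerLetter)
        Dim lastPage As Integer = firstPage + pagesPerLetter - 1
        ' A match on page 23 belongs to the letter on pages 22 to 24.
        Console.WriteLine("Extract pages {0} to {1}", firstPage, lastPage)
    End Sub
End Module
```

If the letters vary in length, this rule no longer holds and something on the page itself (a label, a "Page 1 of 3" line) has to mark the boundaries.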
Re: Itextsharp search word in multiple PDF then isolate the PDF document
As far as I can see, I can't replicate the error... The code works as intended each and every time I run it. For testing purposes, use a different pdf file and extract a random page from it. Does that work?
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Hello Stanav, I tried with the attached PDF.
Private Sub btnSearch_Click(sender As System.Object, e As System.EventArgs) Handles btnSearch.Click
    Dim sourcepdf As String = "C:\HH\diabeteslbs.pdf"
    PdfManipulation2.ExtractPdfPage(sourcepdf, 4, "c:\HO\Page_4.pdf")
    MessageBox.Show("Done!")
End Sub
I still got message "An item with the same key has already been added."
Thanks for your help.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
I still can't replicate the error using the sample file you uploaded. It has to be something in your project, or the PdfManipulation2.ExtractPdfPage code has been modified.
Can you compare the code you have with this one? If yours is different, then you know why it didn't work.
Code:
Public Overloads Shared Sub ExtractPdfPage(ByVal sourcePdf As String, ByVal pageNumberToExtract As Integer, ByVal outPdf As String)
    Dim reader As iTextSharp.text.pdf.PdfReader = Nothing
    Dim doc As iTextSharp.text.Document = Nothing
    Dim pdfCpy As iTextSharp.text.pdf.PdfCopy = Nothing
    Dim page As iTextSharp.text.pdf.PdfImportedPage = Nothing
    Try
        reader = New iTextSharp.text.pdf.PdfReader(sourcePdf)
        doc = New iTextSharp.text.Document(reader.GetPageSizeWithRotation(1))
        pdfCpy = New iTextSharp.text.pdf.PdfCopy(doc, New IO.FileStream(outPdf, IO.FileMode.Create))
        doc.Open()
        page = pdfCpy.GetImportedPage(reader, pageNumberToExtract)
        pdfCpy.AddPage(page)
        doc.Close()
        reader.Close()
    Catch ex As Exception
        Throw ex
    End Try
End Sub
Alternately, you can re-download the PdfManipulation2 class and start a fresh project to test the function. If it works, and I'm pretty sure that it will, you have your conclusion...
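For the earlier question about grabbing a run of consecutive pages (e.g. a 3-page letter) into one output file, the same PdfCopy pattern can be put in a loop. This is a sketch only: ExtractPdfPageRange is a made-up name and is not part of PdfManipulation2, and it has only the same iTextSharp calls as ExtractPdfPage above to go on.

```vb
Imports System

Module PageRangeSketch
    ' Hypothetical helper: copies pages firstPage..lastPage (1-based, inclusive)
    ' of sourcePdf into a single new file, outPdf.
    Sub ExtractPdfPageRange(ByVal sourcePdf As String, ByVal firstPage As Integer, ByVal lastPage As Integer, ByVal outPdf As String)
        Dim reader As New iTextSharp.text.pdf.PdfReader(sourcePdf)
        Try
            If firstPage < 1 OrElse lastPage > reader.NumberOfPages OrElse firstPage > lastPage Then
                Throw New ArgumentOutOfRangeException("firstPage", "Page range is outside the document.")
            End If

            Dim doc As New iTextSharp.text.Document(reader.GetPageSizeWithRotation(firstPage))
            Dim pdfCpy As New iTextSharp.text.pdf.PdfCopy(doc, New IO.FileStream(outPdf, IO.FileMode.Create))
            doc.Open()
            For i As Integer = firstPage To lastPage
                ' Import each page from the source and append it to the copy.
                pdfCpy.AddPage(pdfCpy.GetImportedPage(reader, i))
            Next
            ' Closing the document finishes the output file.
            doc.Close()
        Finally
            ' Release the source file even if something above failed.
            reader.Close()
        End Try
    End Sub
End Module
```

A call such as ExtractPdfPageRange("c:\documents\xyz.PDF", 22, 24, "c:\single_XYZPdf\single.PDF") would then put pages 22 to 24 into one file.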
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Hey Stanav, please advise on the difference between the two snippets. I re-downloaded PdfManipulation2.vb. You can see the "Throw ex" happened when I ran the code again.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
The 2 code snippets are the same, except the 3 lines that are commented out which is OK. I have no idea why you keep getting that error while I don't... Are you using the right version of iTextSharp? It should be 5.2.1.0 or newer.
For testing purposes, can you start a new project and test the function again?
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Ah... The new itextsharp version 5.3.4 seems to be the culprit. I tested using that new version and sure enough, I got the same error as you did.
Use this 5.2.1 version below and you should be good to go.... https://dl.dropbox.com/u/20581085/itextsharp_5.2.1.zip
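If it is unclear which iTextSharp build a project actually loads, the assembly version can be printed at run time. This is a small sketch using standard .NET reflection; it only reports the version and makes no claim about which versions other than the two mentioned in this thread work.

```vb
Imports System

Module VersionCheckSketch
    Sub Main()
        ' Ask the loaded iTextSharp assembly for its version number.
        Dim loaded As Version = GetType(iTextSharp.text.pdf.PdfReader).Assembly.GetName().Version
        Console.WriteLine("iTextSharp version in use: " & loaded.ToString())

        ' 5.2.1.0 is the version reported as working in this thread.
        If loaded > New Version(5, 2, 1, 0) Then
            Console.WriteLine("Newer than 5.2.1.0; PdfManipulation2 may need the older build.")
        End If
    End Sub
End Module
```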
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Okay, I knew you were smarter than me. Thank you for sticking with me.
Now I am going back to my original project which is
1. Search a pdf document for a word (label).
2. Once found, take that page as the starting page and copy all the pages that contain the label into a separate pdf.
For example:
I have a 30-page pdf that actually contains 10 customer invoices. Each invoice has a unique label. I search for the label "Rome34". When I find it, I want to select all the pages that have "Rome34" to create a separate pdf.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Copy and paste this function to the PdfManipulation2 class
Code:
Public Shared Function FindAndExtract(ByVal sourcePdf As String, ByVal outPdf As String, ByVal searchPhrase As String, Optional ByVal caseSensitive As Boolean = False) As Boolean
    Dim result As Boolean = False
    Dim raf As iTextSharp.text.pdf.RandomAccessFileOrArray = Nothing
    Dim reader As iTextSharp.text.pdf.PdfReader = Nothing
    Dim doc As iTextSharp.text.Document = Nothing
    Dim pdfCpy As iTextSharp.text.pdf.PdfCopy = Nothing
    Dim page As iTextSharp.text.pdf.PdfImportedPage = Nothing
    Try
        raf = New iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf)
        reader = New iTextSharp.text.pdf.PdfReader(raf, Nothing)
        If caseSensitive = False Then
            searchPhrase = searchPhrase.ToLower()
        End If
        For i As Integer = 1 To reader.NumberOfPages()
            Dim pageText As String = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, i)
            If caseSensitive = False Then
                pageText = pageText.ToLower()
            End If
            If pageText.Contains(searchPhrase) Then
                If doc Is Nothing Then
                    doc = New iTextSharp.text.Document(reader.GetPageSizeWithRotation(1))
                    pdfCpy = New iTextSharp.text.pdf.PdfCopy(doc, New IO.FileStream(outPdf, IO.FileMode.Create))
                    doc.Open()
                End If
                page = pdfCpy.GetImportedPage(reader, i)
                pdfCpy.AddPage(page)
            End If
        Next
        If doc IsNot Nothing Then
            doc.Close()
            result = True
        End If
        reader.Close()
    Catch ex As Exception
        Throw ex
    End Try
    Return result
End Function
Usage example:
Code:
Private Sub Button1_Click(ByVal sender As Object, ByVal e As EventArgs) Handles Button1.Click
    Dim searchText As String = "4 oz. regular soda"
    Dim result As Boolean = PdfManipulation2.FindAndExtract("d:\test1.pdf", "d:\test1_extracted.pdf", searchText)
    MessageBox.Show(result.ToString)
End Sub
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Hey Stanav, I have folder X with 20 PDF files. I want to merge them into one PDF and output to folder Y.
Can I use the wildcard *.pdf with the SourceTable, instead of stringing all the PDF file names together as an array?
pdfManipulation.ExtractAndMergePdfPages(SourceTable, outPdf)
Thanks
Re: Itextsharp search word in multiple PDF then isolate the PDF document
You need to use this method:
Code:
'Merge multiple pdfs into a single one.
Public Shared Function MergePdfFiles(ByVal pdfFiles() As String, ByVal outputPath As String, _
Optional ByVal authorName As String = "", _
Optional ByVal creatorName As String = "", _
Optional ByVal subject As String = "", _
Optional ByVal title As String = "", _
Optional ByVal keywords As String = "") As Boolean
As you see in the function signature, it takes an array of pdf files and merges them into a single output pdf file. And yes, you can use System.IO.Directory.GetFiles(folderPath, "*.pdf") to get the pdf files and feed that array to the function.
The method that you mentioned, pdfManipulation.ExtractAndMergePdfPages(SourceTable, outPdf), is for extracting some pages from each pdf and merging them into one single pdf. For example, take pages 1, 3, 7 from A.pdf, pages 9, 11, 30 from B.pdf, and pages 2, 8, 11 from C.pdf and merge them into a new pdf. As you can see, since the parameters are pretty complex, it's easier to build a datatable to feed the function. However, you don't have to worry about this method, since MergePdfFiles will do exactly what you need.
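Put together, the folder merge might look like the sketch below. The folder and file paths are placeholders, and the sort is an addition here on the assumption that the merge order should be predictable; MergePdfFiles is called with only its two required parameters from the signature above.

```vb
Imports System
Imports System.IO

Module MergeFolderSketch
    Sub Main()
        Dim inputFolder As String = "C:\X"            ' placeholder input folder
        Dim outputPdf As String = "C:\Y\merged.pdf"   ' placeholder output file

        ' Collect every pdf in the folder (top level only).
        Dim pdfFiles() As String = Directory.GetFiles(inputFolder, "*.pdf")
        If pdfFiles.Length = 0 Then
            Console.WriteLine("No pdf files found in " & inputFolder)
            Return
        End If

        ' Sort by name so the files are merged in a predictable order.
        Array.Sort(pdfFiles, StringComparer.OrdinalIgnoreCase)

        Dim merged As Boolean = PdfManipulation2.MergePdfFiles(pdfFiles, outputPdf)
        Console.WriteLine("Merged: " & merged.ToString())
    End Sub
End Module
```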
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Oops! I meant it worked beautifully. I have a question. How fast do you think the method to Extract PDF will retrieve a Tagged PDF(meaning a unique identifier) from one million page PDF? Do you also know any disk size calculation for one million pages of a PDF document?
Re: Itextsharp search word in multiple PDF then isolate the PDF document
I haven't worked with any tagged pdf so I can't be sure on this, but if you're asking about using the FindAndExtract method above, I'd say it won't make much difference compared to non-tagged pdfs. The whole 1 million pages are still looped through one by one. As for how long it'll take to complete a 1 million page pdf, you'll have to try it and time it yourself. I don't have anything that large. The largest pdf file I've ever worked on was around 30k pages, and iTextSharp handled it without any problems.
How much disk space does a 1 million page pdf take? It's a tricky question because there are way too many variables involved in creating a pdf page: images, embedded resources, layers... just to name a few. And no, I don't know of any way you can calculate or estimate the final disk size of a pdf before it is created.
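One way to time it yourself is to wrap the call in a Stopwatch, as in this sketch. The file paths reuse the ones from the usage example earlier in the thread and the search label is the "Rome34" example; substitute your own.

```vb
Imports System
Imports System.Diagnostics

Module TimingSketch
    Sub Main()
        ' Start timing just before the search and extraction.
        Dim sw As Stopwatch = Stopwatch.StartNew()
        Dim found As Boolean = PdfManipulation2.FindAndExtract("d:\test1.pdf", "d:\test1_extracted.pdf", "Rome34")
        sw.Stop()

        Console.WriteLine("Found: {0}, elapsed: {1:F1} s", found, sw.Elapsed.TotalSeconds)
    End Sub
End Module
```

Timing a smaller file first gives a rough feel for the cost per page, though the text extraction time will vary with how much content each page holds.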
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Hi Stanav, hope you are well. I am using PdfManipulation2. Everything is fine. I just want to know how to move the bookmark (the BLUE font) from the top of the page to the bottom, preferably into the footer section, during merging. I don't want to replace the current footer, but insert the bookmark in the two-line footer at the bottom. If that is too difficult, then at the end of the document before the footer. Thanks
Re: Itextsharp search word in multiple PDF then isolate the PDF document
You see in the code how a paragraph is added to every page 1 of a pdf file before the original pdf page is copied over. That paragraph is what makes the bookmark. If you want the bookmark to be at the bottom of the page, just add the paragraph after you add the copied page to the new document. That is, change the inner while loop to this:
Code:
While i < pageCount
    i += 1
    'Get the input page size
    pdfDoc.SetPageSize(reader.GetPageSizeWithRotation(i))
    'Create a new page on the output document
    pdfDoc.NewPage()
    'Now we get the imported page
    page = writer.GetImportedPage(reader, i)
    'Read the imported page's rotation
    rotation = reader.GetPageRotation(i)
    'Then add the imported page to the PdfContentByte object as a template based on the page's rotation
    If rotation = 90 Then
        cb.AddTemplate(page, 0, -1.0F, 1.0F, 0, 0, reader.GetPageSizeWithRotation(i).Height)
    ElseIf rotation = 270 Then
        cb.AddTemplate(page, 0, 1.0F, -1.0F, 0, reader.GetPageSizeWithRotation(i).Width + 60, -30)
    Else
        cb.AddTemplate(page, 1.0F, 0, 0, 1.0F, 0, 0)
    End If
    'If it is the 1st page, we add bookmarks to the page
    If i = 1 Then
        'First create a paragraph using the filename as the heading
        Dim para As New iTextSharp.text.Paragraph(IO.Path.GetFileName(fileName).ToUpper(), bookmarkFont)
        'Then create a chapter from the above paragraph
        Dim chpter As New iTextSharp.text.Chapter(para, f + 1)
        'Finally add the chapter to the document
        pdfDoc.Add(chpter)
    End If
End While
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Hey Stanav, I'm back. I hope all is well with you.
Issue: I have all the pdfs in a folder. The question is how to programmatically (VB.NET) open the folder and print all the pdfs stored in it as individual documents.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
You would get a list of all the pdf files in that folder and then loop through the list printing 1 at a time.
1. To get the pdfs in a folder, you can use System.IO.Directory.GetFiles method.
2. To print a pdf file using the default application and printer, you start a process, set the verb to "print" and pass in the filepath as the argument. Search the forum and you will find examples.
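A rough sketch of those two steps. The folder path is a placeholder, and it assumes a PDF viewer that registers a "print" verb is installed; how the viewer behaves afterwards (some stay open) is up to that application, so the wait below is capped rather than open-ended.

```vb
Imports System
Imports System.Diagnostics
Imports System.IO

Module PrintFolderSketch
    Sub Main()
        Dim folderPath As String = "C:\PdfsToPrint"   ' placeholder folder

        For Each pdfPath As String In Directory.GetFiles(folderPath, "*.pdf")
            Dim psi As New ProcessStartInfo(pdfPath)
            psi.Verb = "print"               ' ask the default application to print the file
            psi.UseShellExecute = True       ' verbs only work through the shell
            psi.CreateNoWindow = True
            psi.WindowStyle = ProcessWindowStyle.Hidden

            Using p As Process = Process.Start(psi)
                ' Give the viewer up to 30 seconds before moving to the next file.
                If p IsNot Nothing Then p.WaitForExit(30000)
            End Using
        Next
    End Sub
End Module
```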
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Hello Stanav,
I am trying to shrink the size of the pdf files in a folder. Basically, compress each page by 80% or more without affecting the contents. I checked out the ResizePage function in PdfManipulation2, but I am not sure it will do what I need. Ideally, I would like to set the dpi to 72 and reduce the pixel count. Any ideas will be appreciated.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Hello Stanav,
I have just downloaded your PdfManipulation2 and it is great. However in the functions that use "token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)" I get:
Error 1 Value of type '1-dimensional array of Byte' cannot be converted to 'iTextSharp.text.pdf.RandomAccessFileOrArray'.
Also getting these warnings:
Warning 2 'Public Sub New(raf As iTextSharp.text.pdf.RandomAccessFileOrArray, ownerPassword() As Byte)' is obsolete: 'Use the constructor that takes a RandomAccessFileOrArray'.
Am I doing something wrong? I am using "VB Express 2012"
Thanks for your help
Brad
Re: Itextsharp search word in multiple PDF then isolate the PDF document
iTextSharp has evolved quite a bit since the last version that I worked on... So to answer your question, I'd need to know 2 things:
1. What version of iTextSharp are you using?
2. What exactly is it that you're trying to do?
I've been extremely busy, and since I haven't had a need to use newer versions of iTextSharp, it's unlikely that I will update the PdfManipulation2 class any time soon. If you're using an iTextSharp version newer than 5.2.1, I'd suggest you download 5.2.1 and try again. Most of the time that will resolve the issues.
Re: Itextsharp search word in multiple PDF then isolate the PDF document
Hello Stanav,
I am back after searching all of PdfManipulation2. Here is the situation: every time I convert an MS Word 2007 document to PDF, the PDF is shrunk to about 90% of the Word document. Is there any way to send printer commands to keep the PDF at 100%?
Thank you.