-
Jul 9th, 2008, 08:22 AM
#1
[2005]Extract Images from a PDF file using iTextSharp
A long while ago, when I posted the code to extract text from a PDF using iTextSharp, a VBF member asked me to write a function to extract images too. I was busy at the time and didn't dig too deep into it. Recently, while trying to find a way to extract hyperlinks from a PDF (asked by a VBF member), I also figured out how to get the images, so I thought I would post the code here to share with everyone.
Note1: You'll need to add a reference to iTextSharp.dll in your project. It can be downloaded by Googling for "itextsharp download" if you don't already have it.
Note2: This code was written targeting the .Net 2.0 framework. It will still work on .Net 1.x if you replace every occurrence of "List(Of Image)" in the code with an ArrayList.
vb.net Code:
'Requires Imports System.Collections.Generic, System.Drawing and System.Windows.Forms
Public Shared Function ExtractImages(ByVal sourcePdf As String) As List(Of Image)
    Dim imgList As New List(Of Image)
    Dim raf As iTextSharp.text.pdf.RandomAccessFileOrArray = Nothing
    Dim reader As iTextSharp.text.pdf.PdfReader = Nothing
    Dim pdfObj As iTextSharp.text.pdf.PdfObject = Nothing
    Dim pdfStrem As iTextSharp.text.pdf.PdfStream = Nothing
    Try
        raf = New iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf)
        reader = New iTextSharp.text.pdf.PdfReader(raf, Nothing)
        'Walk every object in the cross-reference table and keep the image streams
        For i As Integer = 0 To reader.XrefSize - 1
            pdfObj = reader.GetPdfObject(i)
            If Not IsNothing(pdfObj) AndAlso pdfObj.IsStream() Then
                pdfStrem = DirectCast(pdfObj, iTextSharp.text.pdf.PdfStream)
                Dim subtype As iTextSharp.text.pdf.PdfObject = pdfStrem.Get(iTextSharp.text.pdf.PdfName.SUBTYPE)
                If Not IsNothing(subtype) AndAlso subtype.ToString = iTextSharp.text.pdf.PdfName.IMAGE.ToString Then
                    'Raw bytes = the stream exactly as stored in the PDF (no filter applied)
                    Dim bytes() As Byte = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw(CType(pdfStrem, iTextSharp.text.pdf.PRStream))
                    If Not IsNothing(bytes) Then
                        Try
                            Using memStream As New System.IO.MemoryStream(bytes)
                                memStream.Position = 0
                                Using img As Image = Image.FromStream(memStream)
                                    'Image.FromStream needs its stream open for the life of the image,
                                    'so add a copy that stays valid after memStream is disposed
                                    imgList.Add(New Bitmap(img))
                                End Using
                            End Using
                        Catch ex As Exception
                            'Most likely the image is in an unsupported format
                            'Do nothing
                            'You can add your own code to handle this exception if you want to
                        End Try
                    End If
                End If
            End If
        Next
    Catch ex As Exception
        MessageBox.Show(ex.Message)
    Finally
        'Close the reader even if an exception was thrown
        If Not IsNothing(reader) Then reader.Close()
    End Try
    Return imgList
End Function
Let us have faith that right makes might, and in that faith, let us, to the end, dare to do our duty as we understand it.
- Abraham Lincoln -
-
Feb 24th, 2009, 09:37 AM
#2
New Member
Re: [2005]Extract Images from a PDF file using iTextSharp
Thank you for your example. I found that the xref index order (0 to XrefSize - 1) does not necessarily match the order in which the images appear in the PDF. I need to know the order the images appear in the PDF. Do you have any suggestions on how I could get that order and extract the images in that order?
-
Feb 24th, 2009, 03:39 PM
#3
New Member
Re: [2005]Extract Images from a PDF file using iTextSharp
I found a way to pull images per page. This example only gets the first image from each page. Sorry, this is in C#, but you can convert it here for free: http://www.developerfusion.com/tools.../csharp-to-vb/
Hope this helps someone.
Code:
using System;
using System.IO;
using System.Drawing.Imaging;
using iTextSharp.text;
using iTextSharp.text.pdf;

#region ExtractImagesFromPDF
public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
{
    // NOTE: This will only get the first image it finds per page.
    PdfReader pdf = new PdfReader(sourcePdf);
    try
    {
        for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
        {
            PdfDictionary pg = pdf.GetPageN(pageNumber);
            PdfDictionary res =
                (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
            // A page may have no /Resources dictionary at all.
            if (res == null)
                continue;
            PdfDictionary xobj =
                (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
            if (xobj != null)
            {
                foreach (PdfName name in xobj.Keys)
                {
                    PdfObject obj = xobj.Get(name);
                    if (obj.IsIndirect())
                    {
                        PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                        PdfName type =
                            (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
                        if (PdfName.IMAGE.Equals(type))
                        {
                            // Look the stream up by its object number in the xref table.
                            int XrefIndex = ((PRIndirectReference)obj).Number;
                            PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
                            PdfStream pdfStrem = (PdfStream)pdfObj;
                            byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStrem);
                            if ((bytes != null))
                            {
                                using (System.IO.MemoryStream memStream = new System.IO.MemoryStream(bytes))
                                {
                                    memStream.Position = 0;
                                    using (System.Drawing.Image img = System.Drawing.Image.FromStream(memStream))
                                    {
                                        // must save the file while stream is open.
                                        if (!Directory.Exists(outputPath))
                                            Directory.CreateDirectory(outputPath);
                                        string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));
                                        System.Drawing.Imaging.EncoderParameters parms = new System.Drawing.Imaging.EncoderParameters(1);
                                        parms.Param[0] = new System.Drawing.Imaging.EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
                                        // GetImageEncoder is found below this method
                                        System.Drawing.Imaging.ImageCodecInfo jpegEncoder = GetImageEncoder("JPEG");
                                        img.Save(path, jpegEncoder, parms);
                                    }
                                    // Stop after the first image on this page.
                                    break;
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    finally
    {
        pdf.Close();
    }
}
#endregion

#region GetImageEncoder
public static System.Drawing.Imaging.ImageCodecInfo GetImageEncoder(string imageType)
{
    // Match on the codec's format description, e.g. "JPEG" or "PNG".
    imageType = imageType.ToUpperInvariant();
    foreach (ImageCodecInfo info in ImageCodecInfo.GetImageEncoders())
    {
        if (info.FormatDescription == imageType)
        {
            return info;
        }
    }
    return null;
}
#endregion
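The method above names each file after its page number and stops at the first image, so a second image on the same page would overwrite the first if the break were simply removed. A minimal sketch of one way around that, assuming you keep a per-page counter yourself: the class ImageNaming and the method BuildImagePath are hypothetical names for illustration, not part of iTextSharp.

```csharp
using System;
using System.IO;

public static class ImageNaming
{
    // Hypothetical helper: builds "<outputPath>/<page>_<index>.jpg" so that
    // several images from the same page get distinct file names.
    public static string BuildImagePath(string outputPath, int pageNumber, int imageIndex)
    {
        string fileName = String.Format("{0}_{1}.jpg", pageNumber, imageIndex);
        return Path.Combine(outputPath, fileName);
    }
}
```

To use it, start a counter at 1 for each page, pass it as imageIndex in place of the String.Format call in the save step, increment it after each successful save, and drop the break statement.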
-
Feb 24th, 2009, 04:55 PM
#4
Re: [2005]Extract Images from a PDF file using iTextSharp
iTextSharp doesn't seem to provide a way to link an image back to which page it was extracted from (at least I couldn't find a way to work it out). Sorry, can't help you further...
-
Mar 15th, 2009, 01:48 AM
#5
Fanatic Member
Re: [2005]Extract Images from a PDF file using iTextSharp
Visual Studio.net 2010
If this post is useful, rate it
-
Mar 17th, 2009, 02:46 PM
#6
New Member
Re: [2005]Extract Images from a PDF file using iTextSharp
I am getting the exception "Parameter is not valid" on this line:
Code:
System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
Does anyone know why?
-
Mar 17th, 2009, 04:24 PM
#7
Re: [2005]Extract Images from a PDF file using iTextSharp
Originally Posted by prankster624
I am getting the exception "Parameter is not valid" on this line:
Code:
System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
Does anyone know why?
That's why you have to put that line in a try/catch block. As I explained in my original post, when this happens the image is probably in an unsupported format, and there's really nothing much you can do except ignore that image and go on.
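For anyone who wants to see which streams will fail before calling FromStream: GetStreamBytesRaw returns the stream exactly as it is stored in the PDF, without applying its /Filter. When the filter is DCTDecode, those bytes are a complete JPEG file and FromStream can open them. When it is FlateDecode, they are compressed pixel samples with no file header, which is a likely cause of "Parameter is not valid". A minimal sketch that checks the leading bytes for a known image signature; ImageSniffer and DetectImageFormat are hypothetical names, not iTextSharp or .NET APIs, and the signature list is deliberately short.

```csharp
using System;

public static class ImageSniffer
{
    // Returns a short format name if the bytes start with a well-known
    // image file signature, or null if no signature matches.
    public static string DetectImageFormat(byte[] bytes)
    {
        if (bytes == null) return null;
        if (StartsWith(bytes, 0xFF, 0xD8, 0xFF)) return "JPEG";
        if (StartsWith(bytes, 0x89, 0x50, 0x4E, 0x47)) return "PNG";
        if (StartsWith(bytes, 0x47, 0x49, 0x46, 0x38)) return "GIF";
        if (StartsWith(bytes, 0x49, 0x49, 0x2A, 0x00)) return "TIFF"; // little-endian
        if (StartsWith(bytes, 0x4D, 0x4D, 0x00, 0x2A)) return "TIFF"; // big-endian
        return null;
    }

    // True when data begins with exactly the given signature bytes.
    private static bool StartsWith(byte[] data, params byte[] signature)
    {
        if (data.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

If DetectImageFormat(bytes) returns null, you can skip the stream (or log it) instead of relying on the exception. A match only means the header looks right; FromStream can still reject a damaged file, so the try/catch should stay.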
-
Mar 17th, 2009, 04:54 PM
#8
New Member
Re: [2005]Extract Images from a PDF file using iTextSharp
Thanks for the quick reply. Do you know of a way to convert each page of a PDF document to an image? I'm trying to convert a multipage PDF to a multipage TIFF. I don't necessarily want to pull the images out of a PDF page, I want to treat the whole page as an image (without having to buy a third party library). Thanks in advance!
-
Jul 14th, 2010, 08:06 AM
#9
Fanatic Member
Re: [2005]Extract Images from a PDF file using iTextSharp
Hi,
Is it possible to get the position (x, y, height, width) of an image in the PDF?
-
Oct 28th, 2010, 12:27 PM
#10
New Member
Re: [2005]Extract Images from a PDF file using iTextSharp
I need the same. Please help me extract text with iText.
-
Oct 28th, 2010, 12:29 PM
#11
New Member
Re: [2005]Extract Images from a PDF file using iTextSharp
Sorry about my last post. I need help extracting text from a specific position.
-
May 3rd, 2011, 07:33 AM
#12
New Member
Re: [2005]Extract Images from a PDF file using iTextSharp
Can you share the code to extract all the hyperlinks from a PDF file along with their text?
-
May 3rd, 2011, 09:26 AM
#13
Re: [2005]Extract Images from a PDF file using iTextSharp
Originally Posted by gayatri
Can you share the code to extract all the hyperlinks from a pdf file along with its text.
Check out this thread
http://www.vbforums.com/showthread.php?t=490456
You need to download the PdfManipulation2 class and use the ExtractURLs method.
-
Dec 16th, 2011, 04:16 AM
#14
New Member
Re: [2005]Extract Images from a PDF file using iTextSharp
Originally Posted by stanav
=========
Hello Stanav,
Is there code or sample logic that can remove the watermark text and watermark image from a PDF file?
thanks,
RkTech