[2005]Extract Images from a PDF file using iTextSharp
A long while ago, when I posted the code to extract text from a PDF using iTextSharp, a VBF member asked me to write a function to extract images too. I was busy at the time and didn't dig too deep into it. Recently, while trying to find a way to extract hyperlinks from a PDF (also requested by a VBF member), I figured out how to get the images as well. So I thought I would post the code here to share with everyone.
Note1: You'll need to add a reference to iTextSharp.dll in your project. If you don't already have it, you can find it by Googling for "itextsharp download".
Note2: This code was written targeting the .NET 2.0 framework. It will still work on .NET 1.x if you replace every occurrence of "List(Of Image)" in the code with an ArrayList and replace each Using block with a Try/Finally that disposes the object.
vb.net Code:
'Requires Imports for System.Collections.Generic, System.Drawing and System.Windows.Forms
Public Shared Function ExtractImages(ByVal sourcePdf As String) As List(Of Image)
    Dim imgList As New List(Of Image)
    Dim raf As iTextSharp.text.pdf.RandomAccessFileOrArray = Nothing
    Dim reader As iTextSharp.text.pdf.PdfReader = Nothing
    Dim pdfObj As iTextSharp.text.pdf.PdfObject = Nothing
    Dim pdfStrem As iTextSharp.text.pdf.PdfStream = Nothing

    Try
        raf = New iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf)
        reader = New iTextSharp.text.pdf.PdfReader(raf, Nothing)

        'Walk every object in the cross-reference table and keep the image streams
        For i As Integer = 0 To reader.XrefSize - 1
            pdfObj = reader.GetPdfObject(i)
            If Not IsNothing(pdfObj) AndAlso pdfObj.IsStream() Then
                pdfStrem = DirectCast(pdfObj, iTextSharp.text.pdf.PdfStream)
                Dim subtype As iTextSharp.text.pdf.PdfObject = pdfStrem.Get(iTextSharp.text.pdf.PdfName.SUBTYPE)
                If Not IsNothing(subtype) AndAlso subtype.ToString = iTextSharp.text.pdf.PdfName.IMAGE.ToString Then
                    'Raw bytes: the stream's filter (compression) has not been decoded
                    Dim bytes() As Byte = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw(CType(pdfStrem, iTextSharp.text.pdf.PRStream))
                    If Not IsNothing(bytes) Then
                        Try
                            Using memStream As New System.IO.MemoryStream(bytes)
                                'An Image created by FromStream needs its stream kept open,
                                'so copy it into a Bitmap before the stream is disposed
                                Using img As Image = Image.FromStream(memStream)
                                    imgList.Add(New Bitmap(img))
                                End Using
                            End Using
                        Catch ex As Exception
                            'Most likely the image is in an unsupported format
                            'Do nothing
                            'You can add your own code to handle this exception if you want to
                        End Try
                    End If
                End If
            End If
        Next
    Catch ex As Exception
        MessageBox.Show(ex.Message)
    Finally
        'Release the file even if an exception was thrown
        If Not IsNothing(reader) Then
            reader.Close()
        ElseIf Not IsNothing(raf) Then
            raf.Close()
        End If
    End Try

    Return imgList
End Function
Re: [2005]Extract Images from a PDF file using iTextSharp
Thank you for your example. I found that the order of the objects in the xref table (indexes 0 to XrefSize - 1) does not necessarily match the order in which the images appear in the PDF. I need to know the order in which the images appear. Do you have any suggestions on how I could get that order and extract the images in that order?
Re: [2005]Extract Images from a PDF file using iTextSharp
I found a way to pull images page by page. This example only gets the first image from each page. Sorry, this is in C#, but you can convert it here for free: http://www.developerfusion.com/tools.../csharp-to-vb/
Hope this helps someone.
Code:
using System;
using System.IO;
using System.Drawing.Imaging;
using iTextSharp.text;
using iTextSharp.text.pdf;

#region ExtractImagesFromPDF
public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
{
    // NOTE: This will only get the first image it finds per page.
    PdfReader pdf = new PdfReader(sourcePdf);
    try
    {
        for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
        {
            PdfDictionary pg = pdf.GetPageN(pageNumber);
            PdfDictionary res =
                (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
            if (res == null)
                continue; // this page has no resource dictionary
            PdfDictionary xobj =
                (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
            if (xobj != null)
            {
                foreach (PdfName name in xobj.Keys)
                {
                    PdfObject obj = xobj.Get(name);
                    if (obj.IsIndirect())
                    {
                        PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                        PdfName type =
                            (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
                        if (PdfName.IMAGE.Equals(type))
                        {
                            // Object number of the image stream in the xref table
                            int xrefIndex = ((PRIndirectReference)obj).Number;
                            PdfObject pdfObj = pdf.GetPdfObject(xrefIndex);
                            PdfStream pdfStrem = (PdfStream)pdfObj;
                            // Raw bytes: the stream's filter has not been decoded
                            byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStrem);
                            if (bytes != null)
                            {
                                using (MemoryStream memStream = new MemoryStream(bytes))
                                using (System.Drawing.Image img = System.Drawing.Image.FromStream(memStream))
                                {
                                    // must save the file while stream is open.
                                    if (!Directory.Exists(outputPath))
                                        Directory.CreateDirectory(outputPath);
                                    string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));
                                    System.Drawing.Imaging.EncoderParameters parms = new System.Drawing.Imaging.EncoderParameters(1);
                                    parms.Param[0] = new System.Drawing.Imaging.EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
                                    // GetImageEncoder is found below this method
                                    System.Drawing.Imaging.ImageCodecInfo jpegEncoder = GetImageEncoder("JPEG");
                                    img.Save(path, jpegEncoder, parms);
                                    break; // first image only; move on to the next page
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    finally
    {
        // Always release the file, whether or not an exception was thrown
        pdf.Close();
    }
}
#endregion

#region GetImageEncoder
public static System.Drawing.Imaging.ImageCodecInfo GetImageEncoder(string imageType)
{
    imageType = imageType.ToUpperInvariant();
    foreach (ImageCodecInfo info in ImageCodecInfo.GetImageEncoders())
    {
        // FormatDescription is an upper-case name such as "JPEG"
        if (info.FormatDescription == imageType)
        {
            return info;
        }
    }
    return null; // no encoder installed for this format
}
#endregion
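Because the file name above is built from the page number alone, a second image from the same page would overwrite the first if the break were removed. A minimal sketch of one way to name the files by page and by position on the page follows. ImageFileNames and Build are hypothetical names made up for this illustration; they are not part of iTextSharp or of the code above.

```csharp
using System;
using System.Globalization;
using System.IO;

public static class ImageFileNames
{
    // Hypothetical helper: returns "<outputPath>/<page>_<index>.jpg" so that
    // several images taken from the same page do not overwrite each other.
    public static string Build(string outputPath, int pageNumber, int imageIndex)
    {
        string fileName = string.Format(
            CultureInfo.InvariantCulture, "{0}_{1}.jpg", pageNumber, imageIndex);
        return Path.Combine(outputPath, fileName);
    }
}
```

To use it, you would keep a counter inside the foreach loop, increment it for every image found on the page, pass it as imageIndex, and drop the break.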
Re: [2005]Extract Images from a PDF file using iTextSharp
iTextSharp doesn't seem to provide a way to link an image back to which page it was extracted from (at least I couldn't find a way to work it out). Sorry, can't help you further...
Re: [2005]Extract Images from a PDF file using iTextSharp
I am getting the exception "Parameter is not valid" on this line:
Code:
System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
Does anyone know why?
Re: [2005]Extract Images from a PDF file using iTextSharp
Quote: Originally Posted by prankster624
I am getting the exception "Parameter is not valid" on this line:
Code:
System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
Does anyone know why?
That's why that line has to go inside a try/catch block. As I explained in my original post, when this happens the image is probably in an unsupported format, and there is not much you can do except skip that image and move on.
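One likely cause, offered as a sketch and not as a confirmed diagnosis of that particular PDF: GetStreamBytesRaw returns the stream bytes without decoding the stream's filter. When the image is stored with the DCTDecode filter, those bytes are a complete JPEG file and Image.FromStream can read them. With other filters (FlateDecode, for example) the bytes are compressed pixel data with no image file header, so Image.FromStream rejects them. The helper below checks for the JPEG signature before you try to load the bytes. ImageStreamSniffer and LooksLikeJpeg are hypothetical names made up for this illustration.

```csharp
using System;

public static class ImageStreamSniffer
{
    // Hypothetical helper, not part of iTextSharp or System.Drawing.
    // Returns true when the buffer starts with the JPEG start-of-image
    // marker (FF D8) followed by the first byte of the next marker (FF).
    public static bool LooksLikeJpeg(byte[] bytes)
    {
        return bytes != null
            && bytes.Length >= 3
            && bytes[0] == 0xFF
            && bytes[1] == 0xD8
            && bytes[2] == 0xFF;
    }
}
```

Calling LooksLikeJpeg(bytes) before Image.FromStream lets you skip or log the streams that are not JPEG files instead of relying only on the exception. It only recognises JPEG data, so keep the try/catch in place as well.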
Re: [2005]Extract Images from a PDF file using iTextSharp
Thanks for the quick reply. Do you know of a way to convert each page of a PDF document to an image? I'm trying to convert a multipage PDF to a multipage TIFF. I don't necessarily want to pull the images out of a PDF page, I want to treat the whole page as an image (without having to buy a third party library). Thanks in advance!
Re: [2005]Extract Images from a PDF file using iTextSharp
Hi,
Is it possible to get the position (x, y, height, width) of an image in the PDF?
Re: [2005]Extract Images from a PDF file using iTextSharp
I need the same. Please help me extract text with iText.
Re: [2005]Extract Images from a PDF file using iTextSharp
Sorry about my last post. I need help extracting text from a specific position...
Re: [2005]Extract Images from a PDF file using iTextSharp
Can you share the code to extract all the hyperlinks from a PDF file along with their text?
Re: [2005]Extract Images from a PDF file using iTextSharp
Quote: Originally Posted by gayatri
Can you share the code to extract all the hyperlinks from a pdf file along with its text.
Check out this thread
http://www.vbforums.com/showthread.php?t=490456
You need to download the PdfManipulation2 class and use the ExtractURLs method.
Re: [2005]Extract Images from a PDF file using iTextSharp
Quote: Originally Posted by stanav
Hello Stanav,
Do you have code or sample logic that can remove watermark text and watermark images from a PDF file?
thanks,
RkTech