[2005]Extract Images from a PDF file using iTextSharp

A long while ago, when I posted the code to extract text from a PDF using iTextSharp, a VBF member asked me to write a function to extract images too. I was busy at the time and didn't dig too deep into it. Recently, while trying to find a way to extract hyperlinks from a PDF (asked by a VBF member), I also figured out how to get the images. So I thought I would post the code here to share with everyone.

Note1: You'll need to add a reference to iTextSharp.dll in your project. If you don't already have it, you can find it by Googling for "itextsharp download".

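Note2: This code was written targeting the .NET 2.0 Framework. It will still work on .NET 1.x if you replace every occurrence of "List(Of Image)" in the code with an ArrayList. The Using blocks are VB 2005 syntax, so on 1.x you would also need to replace them with Try/Finally and dispose the objects yourself.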

vb.net Code:

Public Shared Function ExtractImages(ByVal sourcePdf As String) As List(Of Image)
    Dim imgList As New List(Of Image)

    Dim raf As iTextSharp.text.pdf.RandomAccessFileOrArray = Nothing
    Dim reader As iTextSharp.text.pdf.PdfReader = Nothing
    Dim pdfObj As iTextSharp.text.pdf.PdfObject = Nothing
    Dim pdfStream As iTextSharp.text.pdf.PdfStream = Nothing

    Try
        raf = New iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf)
        reader = New iTextSharp.text.pdf.PdfReader(raf, Nothing)

        'Walk every object in the cross-reference table and pick out the image streams
        For i As Integer = 0 To reader.XrefSize - 1
            pdfObj = reader.GetPdfObject(i)
            If Not IsNothing(pdfObj) AndAlso pdfObj.IsStream() Then
                pdfStream = DirectCast(pdfObj, iTextSharp.text.pdf.PdfStream)
                Dim subtype As iTextSharp.text.pdf.PdfObject = pdfStream.Get(iTextSharp.text.pdf.PdfName.SUBTYPE)
                If Not IsNothing(subtype) AndAlso subtype.ToString = iTextSharp.text.pdf.PdfName.IMAGE.ToString Then
                    'These are the raw (undecoded) stream bytes
                    Dim bytes() As Byte = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw(CType(pdfStream, iTextSharp.text.pdf.PRStream))
                    If Not IsNothing(bytes) Then
                        Try
                            Using memStream As New System.IO.MemoryStream(bytes)
                                Using img As Image = Image.FromStream(memStream)
                                    'Copy the image so it stays usable after the stream is disposed
                                    imgList.Add(New Bitmap(img))
                                End Using
                            End Using
                        Catch ex As Exception
                            'Most likely the image is in an unsupported format
                            'Do nothing
                            'You can add your own code to handle this exception if you want to
                        End Try
                    End If
                End If
            End If
        Next
    Catch ex As Exception
        MessageBox.Show(ex.Message)
    Finally
        'Close the reader even if an exception was thrown
        If Not IsNothing(reader) Then reader.Close()
    End Try
    Return imgList
End Function

Re: [2005]Extract Images from a PDF file using iTextSharp

Thank you for your example. I found that the order of the objects in the xref table (the 0 To XrefSize - 1 loop) does not necessarily match the order in which the images appear in the PDF. I need to know the order the images appear in the PDF. Do you have any suggestions on how I could get that order and extract the images in that order?

Re: [2005]Extract Images from a PDF file using iTextSharp

I found a way to pull images per page. This example only gets the first image from each page. Sorry, this is in C#, but you can convert it here for free: http://www.developerfusion.com/tools.../csharp-to-vb/

Hope this helps someone.

Code:

using System;
using System.IO;
using System.Drawing.Imaging;
using iTextSharp.text;
using iTextSharp.text.pdf;

#region ExtractImagesFromPDF
public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
{
    // NOTE: This will only get the first image it finds per page.
    PdfReader pdf = new PdfReader(sourcePdf);

    try
    {
        if (!Directory.Exists(outputPath))
            Directory.CreateDirectory(outputPath);

        for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
        {
            PdfDictionary pg = pdf.GetPageN(pageNumber);
            PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
            if (res == null)
                continue; // page has no resources

            PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
            if (xobj == null)
                continue; // page has no XObjects, so no images

            foreach (PdfName name in xobj.Keys)
            {
                PdfObject obj = xobj.Get(name);
                if (!obj.IsIndirect())
                    continue;

                PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
                if (!PdfName.IMAGE.Equals(type))
                    continue;

                // Look the image stream up by its xref number
                int xrefIndex = ((PRIndirectReference)obj).Number;
                PdfObject pdfObj = pdf.GetPdfObject(xrefIndex);
                PdfStream pdfStream = (PdfStream)pdfObj;
                byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStream);
                if (bytes == null)
                    continue;

                using (MemoryStream memStream = new MemoryStream(bytes))
                using (System.Drawing.Image img = System.Drawing.Image.FromStream(memStream))
                {
                    // must save the file while stream is open.
                    string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));

                    // The JPEG encoder is controlled through Quality (0-100);
                    // Encoder.Compression is a TIFF setting.
                    EncoderParameters parms = new EncoderParameters(1);
                    parms.Param[0] = new EncoderParameter(System.Drawing.Imaging.Encoder.Quality, 100L);

                    // GetImageEncoder is found below this method
                    ImageCodecInfo jpegEncoder = GetImageEncoder("JPEG");
                    img.Save(path, jpegEncoder, parms);
                }
                break; // first image on this page only
            }
        }
    }
    finally
    {
        pdf.Close();
    }
}
#endregion

#region GetImageEncoder
public static ImageCodecInfo GetImageEncoder(string imageType)
{
    imageType = imageType.ToUpperInvariant();
    foreach (ImageCodecInfo info in ImageCodecInfo.GetImageEncoders())
    {
        if (info.FormatDescription == imageType)
        {
            return info;
        }
    }
    return null;
}
#endregion

Re: [2005]Extract Images from a PDF file using iTextSharp

iTextSharp doesn't seem to provide a way to link an image back to which page it was extracted from (at least I couldn't find a way to work it out). Sorry, can't help you further...

Re: [2005]Extract Images from a PDF file using iTextSharp

excellent one stanav...

Re: [2005]Extract Images from a PDF file using iTextSharp

I am getting the exception "Parameter is not valid" on this line:

Code:

System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);

Does anyone know why?

Re: [2005]Extract Images from a PDF file using iTextSharp

Quote:

Originally Posted by prankster624

I am getting the exception "Parameter is not valid" on this line:

Code:

System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);

Does anyone know why?

That's why you have to put that line in a try/catch block.... In my original post, I did explain that when this happens, the image is probably in an unsupported format, and there's really nothing much you can do except to ignore that image and go on...
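One likely reason for the error, for what it's worth: GetStreamBytesRaw returns the stream bytes exactly as they are stored in the PDF, without decoding them. Those bytes are only a complete image file when the PDF stores the image as JPEG data (the DCTDecode filter). Flate-compressed images are just compressed pixel data, which Image.FromStream cannot read. Below is a minimal sketch of a pre-check that looks for the JPEG start-of-image marker before handing the bytes to GDI+. The ImageSniffer class and LooksLikeJpeg method are names I made up for illustration; they are not part of iTextSharp or the code posted above.

```csharp
public static class ImageSniffer
{
    // Hypothetical helper (not an iTextSharp API).
    // Returns true when the buffer starts with the JPEG
    // start-of-image marker: the bytes FF D8 FF.
    public static bool LooksLikeJpeg(byte[] bytes)
    {
        return bytes != null
            && bytes.Length >= 3
            && bytes[0] == 0xFF
            && bytes[1] == 0xD8
            && bytes[2] == 0xFF;
    }
}
```

You could call ImageSniffer.LooksLikeJpeg(bytes) right before Image.FromStream and skip the image when it returns false. The check only covers JPEG, so keep the try/catch around FromStream in any case.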

Re: [2005]Extract Images from a PDF file using iTextSharp

Thanks for the quick reply. Do you know of a way to convert each page of a PDF document to an image? I'm trying to convert a multipage PDF to a multipage TIFF. I don't necessarily want to pull the images out of a PDF page, I want to treat the whole page as an image (without having to buy a third party library). Thanks in advance!

Re: [2005]Extract Images from a PDF file using iTextSharp

Hi,
Is it possible to get the position (x, y, height, width) of an image in the PDF?

Re: [2005]Extract Images from a PDF file using iTextSharp

I need the same. Please help to extract text with itext

Re: [2005]Extract Images from a PDF file using iTextSharp

Sorry about my last post. I need help extracting text from a specific position...

Re: [2005]Extract Images from a PDF file using iTextSharp

Can you share the code to extract all the hyperlinks from a pdf file along with its text.

Re: [2005]Extract Images from a PDF file using iTextSharp

Quote:

Originally Posted by gayatri

Can you share the code to extract all the hyperlinks from a pdf file along with its text.

Check out this thread
http://www.vbforums.com/showthread.php?t=490456

You need to download the PdfManipulation2 class and use the ExtractURLs method.

Re: [2005]Extract Images from a PDF file using iTextSharp

Quote:

Originally Posted by stanav

Check out this thread
http://www.vbforums.com/showthread.php?t=490456

You need to download the PdfManipulation2 class and use the ExtractURLs method.

=========
Hello Stanav,

Do we have code or sample logic that can remove the watermark text & watermark image from a pdf file ?

thanks,
RkTech