Thread: [2005]Extract Images from a PDF file using iTextSharp

  1. #1

    Thread Starter
    PowerPoster stanav's Avatar
    Join Date
    Jul 2006
    Location
    Providence, RI - USA
    Posts
    9,289

    [2005]Extract Images from a PDF file using iTextSharp

    A long while ago, when I posted the code to extract text from a PDF using iTextSharp, a VBF member asked me to write a function to extract images too. I was busy at the time and didn't dig too deep into it. Recently, while trying to find a way to extract hyperlinks from a PDF (another VBF member's request), I also figured out how to get the images, so I thought I would post the code here to share with everyone.

    Note 1: You'll need to add a reference to iTextSharp.dll in your project. If you don't already have it, you can find it by Googling for "itextsharp download".

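    Note 2: This code was written targeting the .NET 2.0 framework. It should still work on .NET 1.x if you replace every occurrence of "List(Of Image)" in the code with an ArrayList and rewrite the Using block as Try/Finally, since VB has no Using statement before the 2005 release.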

    vb.net Code:
    'Place these Imports at the top of the file
    Imports System.Collections.Generic
    Imports System.Drawing
    Imports System.Windows.Forms

    Public Shared Function ExtractImages(ByVal sourcePdf As String) As List(Of Image)
        Dim imgList As New List(Of Image)

        Dim raf As iTextSharp.text.pdf.RandomAccessFileOrArray = Nothing
        Dim reader As iTextSharp.text.pdf.PdfReader = Nothing
        Dim pdfObj As iTextSharp.text.pdf.PdfObject = Nothing
        Dim imageStream As iTextSharp.text.pdf.PdfStream = Nothing

        Try
            raf = New iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf)
            reader = New iTextSharp.text.pdf.PdfReader(raf, Nothing)

            'Scan every entry in the cross-reference table for image streams
            For i As Integer = 0 To reader.XrefSize - 1
                pdfObj = reader.GetPdfObject(i)
                If Not IsNothing(pdfObj) AndAlso pdfObj.IsStream() Then
                    imageStream = DirectCast(pdfObj, iTextSharp.text.pdf.PdfStream)
                    Dim subtype As iTextSharp.text.pdf.PdfObject = imageStream.Get(iTextSharp.text.pdf.PdfName.SUBTYPE)
                    If Not IsNothing(subtype) AndAlso subtype.ToString = iTextSharp.text.pdf.PdfName.IMAGE.ToString Then
                        Dim bytes() As Byte = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw(CType(imageStream, iTextSharp.text.pdf.PRStream))
                        If Not IsNothing(bytes) Then
                            Try
                                Using memStream As New System.IO.MemoryStream(bytes)
                                    Using img As Image = Image.FromStream(memStream)
                                        'GDI+ needs the stream to stay open for the life of an image loaded from it,
                                        'so add an independent copy to the list instead
                                        imgList.Add(New Bitmap(img))
                                    End Using
                                End Using
                            Catch ex As Exception
                                'Most likely the image is in an unsupported format
                                'Do nothing
                                'You can add your own code to handle this exception if you want to
                            End Try
                        End If
                    End If
                End If
            Next
        Catch ex As Exception
            MessageBox.Show(ex.Message)
        Finally
            'Close the reader even when an exception was thrown
            If Not IsNothing(reader) Then reader.Close()
        End Try
        Return imgList
    End Function
    Let us have faith that right makes might, and in that faith, let us, to the end, dare to do our duty as we understand it.
    - Abraham Lincoln -

  2. #2
    New Member
    Join Date
    Feb 2009
    Posts
    2

    Re: [2005]Extract Images from a PDF file using iTextSharp

    Thank you for your example. I found that the XrefSize index does not necessarily match the order in which the images appear in the PDF, and I need to know that order. Do you have any suggestions on how I could get it and extract the images in that order?

  3. #3
    New Member
    Join Date
    Feb 2009
    Posts
    2

    Re: [2005]Extract Images from a PDF file using iTextSharp

    I found a way to pull images per page. This example only gets the first image from each page. Sorry, this is in C#, but you can convert it here for free: http://www.developerfusion.com/tools.../csharp-to-vb/

    Hope this helps someone.

    Code:
    using System;
    using System.IO;
    using System.Drawing.Imaging;
    using iTextSharp.text;
    using iTextSharp.text.pdf;

    #region ExtractImagesFromPDF
    public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
    {
        // NOTE:  This will only get the first image it finds per page.
        PdfReader pdf = new PdfReader(sourcePdf);

        try
        {
            for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
            {
                PdfDictionary pg = pdf.GetPageN(pageNumber);
                PdfDictionary res =
                  (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
                if (res == null)
                    continue; // nothing to scan on this page
                PdfDictionary xobj =
                  (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
                if (xobj != null)
                {
                    foreach (PdfName name in xobj.Keys)
                    {
                        PdfObject obj = xobj.Get(name);
                        if (obj.IsIndirect())
                        {
                            PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                            PdfName type =
                              (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
                            if (PdfName.IMAGE.Equals(type))
                            {
                                // Number is already an int, so no string round-trip is needed.
                                int xrefIndex = ((PRIndirectReference)obj).Number;
                                PdfObject pdfObj = pdf.GetPdfObject(xrefIndex);
                                PdfStream pdfStream = (PdfStream)pdfObj;
                                byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStream);
                                if (bytes != null)
                                {
                                    using (MemoryStream memStream = new MemoryStream(bytes))
                                    using (System.Drawing.Image img = System.Drawing.Image.FromStream(memStream))
                                    {
                                        // must save the file while stream is open.
                                        if (!Directory.Exists(outputPath))
                                            Directory.CreateDirectory(outputPath);

                                        string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));
                                        EncoderParameters parms = new EncoderParameters(1);
                                        parms.Param[0] = new EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
                                        // GetImageEncoder is found below this method
                                        ImageCodecInfo jpegEncoder = GetImageEncoder("JPEG");
                                        img.Save(path, jpegEncoder, parms);
                                        break; // first image only; move on to the next page
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        finally
        {
            // Always release the reader, even if an image fails to load.
            pdf.Close();
        }
    }
    #endregion

    #region GetImageEncoder
    public static ImageCodecInfo GetImageEncoder(string imageType)
    {
        imageType = imageType.ToUpperInvariant();

        // Look for the installed encoder whose description matches, e.g. "JPEG".
        foreach (ImageCodecInfo info in ImageCodecInfo.GetImageEncoders())
        {
            if (info.FormatDescription == imageType)
            {
                return info;
            }
        }

        return null;
    }
    #endregion
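
    If you remove the break so that every image on a page is saved, the "{0}.jpg" file name will make each image overwrite the previous one for that page. A minimal sketch of one way around that, where ImageNaming and BuildImagePath are hypothetical names and not part of iTextSharp:

```csharp
using System;
using System.Globalization;
using System.IO;

public static class ImageNaming
{
    // Hypothetical helper: builds paths such as "page-0003-img-02.jpg" so that
    // several images from the same page get distinct, sortable file names.
    public static string BuildImagePath(string outputPath, int pageNumber, int imageIndex)
    {
        string fileName = String.Format(
            CultureInfo.InvariantCulture,
            "page-{0:D4}-img-{1:D2}.jpg",
            pageNumber,
            imageIndex);
        return Path.Combine(outputPath, fileName);
    }
}
```

    To use it, keep a counter that is reset to 1 at the start of each page, pass it as imageIndex in place of the String.Format call, and increment it after each save.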

  4. #4

    Thread Starter
    PowerPoster stanav's Avatar
    Join Date
    Jul 2006
    Location
    Providence, RI - USA
    Posts
    9,289

    Re: [2005]Extract Images from a PDF file using iTextSharp

    iTextSharp doesn't seem to provide a way to link an image back to which page it was extracted from (at least I couldn't find a way to work it out). Sorry, can't help you further...

  5. #5
    Fanatic Member vijy's Avatar
    Join Date
    May 2007
    Location
    India
    Posts
    548

    Re: [2005]Extract Images from a PDF file using iTextSharp

    excellent one stanav...
    Visual Studio.net 2010
    If this post is useful, rate it


  6. #6
    New Member
    Join Date
    May 2006
    Posts
    14

    Re: [2005]Extract Images from a PDF file using iTextSharp

    I am getting the exception "Parameter is not valid" on this line:

    Code:
    System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
    Does anyone know why?

  7. #7

    Thread Starter
    PowerPoster stanav's Avatar
    Join Date
    Jul 2006
    Location
    Providence, RI - USA
    Posts
    9,289

    Re: [2005]Extract Images from a PDF file using iTextSharp

    Originally posted by prankster624:
    I am getting the exception "Parameter is not valid" on this line:

    Code:
    System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
    Does anyone know why?
    That's why that line has to sit inside a try/catch block. As I explained in my original post, when this happens the image is probably in a format that Image.FromStream doesn't support, and there isn't much you can do except skip that image and move on.
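
    To make "unsupported format" a bit more concrete: GetStreamBytesRaw returns the stream bytes as they are stored in the PDF, without decoding the stream's filters. Image.FromStream can therefore only succeed when those bytes already form a complete image file, which in practice mostly means JPEG data stored with the DCTDecode filter. Streams that use FlateDecode and similar filters hold bare pixel samples with no file header. A minimal sketch of a pre-check on the leading bytes, where ImageSniffer and GuessFormat are hypothetical names and not part of iTextSharp:

```csharp
using System;

public static class ImageSniffer
{
    // Hypothetical helper: returns a short label based on the leading "magic"
    // bytes of the data, or null when it does not start like a known image file.
    public static string GuessFormat(byte[] data)
    {
        if (data == null || data.Length < 4)
            return null;

        // JPEG files start with FF D8 FF.
        if (data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF)
            return "jpeg";

        // PNG files start with 89 'P' 'N' 'G'.
        if (data[0] == 0x89 && data[1] == 0x50 && data[2] == 0x4E && data[3] == 0x47)
            return "png";

        // TIFF files start with "II*\0" (little-endian) or "MM\0*" (big-endian).
        if ((data[0] == 0x49 && data[1] == 0x49 && data[2] == 0x2A && data[3] == 0x00) ||
            (data[0] == 0x4D && data[1] == 0x4D && data[2] == 0x00 && data[3] == 0x2A))
            return "tiff";

        return null;
    }
}
```

    Calling GuessFormat(bytes) before Image.FromStream lets you skip, or at least log, the streams that have no chance of loading instead of relying only on the exception.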

  8. #8
    New Member
    Join Date
    May 2006
    Posts
    14

    Re: [2005]Extract Images from a PDF file using iTextSharp

    Thanks for the quick reply. Do you know of a way to convert each page of a PDF document to an image? I'm trying to convert a multipage PDF to a multipage TIFF. I don't necessarily want to pull the images out of a PDF page, I want to treat the whole page as an image (without having to buy a third party library). Thanks in advance!

  9. #9
    Fanatic Member vijy's Avatar
    Join Date
    May 2007
    Location
    India
    Posts
    548

    Re: [2005]Extract Images from a PDF file using iTextSharp

    Hi,
    Is it possible to get the position (x, y, height, width) of an image in the PDF?


  10. #10
    New Member
    Join Date
    Apr 2010
    Posts
    2

    Re: [2005]Extract Images from a PDF file using iTextSharp

    I need the same. Please help me extract text with iText.

  11. #11
    New Member
    Join Date
    Apr 2010
    Posts
    2

    Re: [2005]Extract Images from a PDF file using iTextSharp

    Sorry about my last post. I need help extracting text from a specific position on the page.

  12. #12
    New Member
    Join Date
    May 2011
    Posts
    1

    Re: [2005]Extract Images from a PDF file using iTextSharp

    Can you share the code to extract all the hyperlinks from a PDF file along with their text?

  13. #13

    Thread Starter
    PowerPoster stanav's Avatar
    Join Date
    Jul 2006
    Location
    Providence, RI - USA
    Posts
    9,289

    Re: [2005]Extract Images from a PDF file using iTextSharp

    Originally posted by gayatri:
    Can you share the code to extract all the hyperlinks from a pdf file along with its text.
    Check out this thread
    http://www.vbforums.com/showthread.php?t=490456

    You need to download the PdfManipulation2 class and use the ExtractURLs method.

  14. #14
    New Member
    Join Date
    Aug 2011
    Posts
    10

    Re: [2005]Extract Images from a PDF file using iTextSharp

    Originally posted by stanav:
    Check out this thread
    http://www.vbforums.com/showthread.php?t=490456

    You need to download the PdfManipulation2 class and use the ExtractURLs method.
    =========
    Hello Stanav,

    Do you have code or sample logic that can remove watermark text and watermark images from a PDF file?

    thanks,
    RkTech
