
Thread: Extracting Images from A .PDF file...Revisit

  1. #1

    Thread Starter
    PowerPoster SamOscarBrown's Avatar
    Join Date
    Aug 2012
    Location
    NC, USA
    Posts
    9,145

    Extracting Images from A .PDF file...Revisit

    I had marked this thread, http://www.vbforums.com/showthread.p...ure-Extraction, as RESOLVED, simply because I did the manual process of capturing each image in the PDF using MS's Snipping Tool. As I will be getting NEW PDFs on a regular basis, I would like to automate the process using VB6. I would also prefer not to convert the PDF to MS Word first. I would like to save each image in the PDF as a .JPG file and store them in a directory.

    On that thread, LeandroA suggested a section of code which I could not execute successfully: arrStream was never populated, and I could not track down why.

    I prefer NOT to download any 3rd Party apps, but could if necessary.

    Advice?

    Sam

  2. #2
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Extracting Images from A .PDF file...Revisit

    Do you want to sniff and scrape for images within the PDF? If not and you want full pages, text and all, then why not the more obvious TIFF instead of JPEG? TIFF is multipage like a PDF.

  3. #3

    Thread Starter
    PowerPoster SamOscarBrown's Avatar
    Join Date
    Aug 2012
    Location
    NC, USA
    Posts
    9,145

    Re: Extracting Images from A .PDF file...Revisit

    S & S...just images. I get a publication from a company which produces the .PDF. It will have approximately 125 images (all the same size), and I would like to capture those images with a VB6 routine so I can put them into a directory. The images WILL have text below each (identifying the image), but I really don't need it (although it could come in handy to scrape that information as well).

    In addition to the section of the PDF which includes those images, there will be other pages which I definitely do not want to capture. So what it really is, is a document containing a certain number of pages with a series of images, 12 per page (fewer on the last page of images), 3 across, 4 down, followed by other pages with text I do not need/want (I already have THAT information in my database, and can easily correlate the images with what I already have). Make sense?

    COULD I use 3rd party apps? I would most assuredly expect I could, but would rather simply use VB6. I have seen VB.NET routines which supposedly do this, but not being too well versed in .NET, I would rather not use it. Besides, this function would be an add-in to my already fairly robust (in MY opinion) program.

    Attached is a SAMPLE pdf with 12 images and text below each (this is NOT what I am programming!)...one could use it for a test bed, however.

    Sam
    Last edited by SamOscarBrown; Aug 10th, 2019 at 01:28 PM. Reason: Added pdf

  4. #4
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Extracting Images from A .PDF file...Revisit

    You'll need some sort of library or command-line utility, e.g. pdfimages.

  5. #5
    PowerPoster ChrisE's Avatar
    Join Date
    Jun 2017
    Location
    Frankfurt
    Posts
    3,046

    Re: Extracting Images from A .PDF file...Revisit

    Hi sam,

    to extract with .NET you can use iTextSharp. I just tried it; here is an image after extraction:

    [Attached image: samFord.jpg]


    and here is a link to a tool to use with VB6: https://bytescout.com/products/devel...6-and-VBScript

    I have never tried it with VB6, so I can't say if it's good. Do you want the .NET code?
    to hunt a species to extinction is not logical !
    since 2010 the number of Tigers are rising again in 2016 - 3900 were counted. with Baby Callas it's 3901, my wife and I had 2-3 months the privilege of raising a Baby Tiger.

  6. #6
    PowerPoster
    Join Date
    Dec 2004
    Posts
    25,618

    Re: Extracting Images from A .PDF file...Revisit

    you can download the xpdf command line tools, free to use, from http://www.xpdfreader.com/download.html, then shell out to pdfimages, like
    Code:
    Shell """c:\temp\extract\pdfimages.exe"" -j ""c:\temp\extract\fords.pdf"" ""c:\temp\extract\fords"""
    worked with your sample file, change paths etc to suit

    if you are working with multiple pdf files i would suggest shellandwait

    this is the same as suggested by dilettante
    i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case.
    Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next

    dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part

    come back and mark your original post as resolved if your problem is fixed
    pete
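
    A minimal sketch of the "shellandwait" idea mentioned above, for anyone processing several PDFs in a row. VB6's Shell returns immediately, so this waits on the launched process with the Win32 calls OpenProcess, WaitForSingleObject and CloseHandle. The names ShellAndWait and ExtractAll are illustrative, and the paths are simply the ones from the example above; treat it as a starting point rather than a finished routine (no timeout, no error handling, and it blocks the UI while waiting).

```vb
Option Explicit

Private Const SYNCHRONIZE As Long = &H100000
Private Const INFINITE As Long = &HFFFFFFFF

Private Declare Function OpenProcess Lib "kernel32" ( _
    ByVal dwDesiredAccess As Long, ByVal bInheritHandle As Long, _
    ByVal dwProcessId As Long) As Long
Private Declare Function WaitForSingleObject Lib "kernel32" ( _
    ByVal hHandle As Long, ByVal dwMilliseconds As Long) As Long
Private Declare Function CloseHandle Lib "kernel32" ( _
    ByVal hObject As Long) As Long

' Runs a command line hidden and blocks until the launched process exits.
Public Sub ShellAndWait(ByVal CommandLine As String)
    Dim lPid As Long, hProcess As Long

    lPid = Shell(CommandLine, vbHide)           ' Shell returns the process ID
    hProcess = OpenProcess(SYNCHRONIZE, 0&, lPid)
    If hProcess <> 0 Then
        WaitForSingleObject hProcess, INFINITE  ' wait for pdfimages to finish
        CloseHandle hProcess
    End If
End Sub

' Hypothetical driver: runs pdfimages -j on every PDF in one folder,
' one file at a time, using the PDF's base name as the image root.
Public Sub ExtractAll()
    Const FOLDER As String = "c:\temp\extract\"   ' assumed location
    Dim sFile As String, sRoot As String

    sFile = Dir$(FOLDER & "*.pdf")
    Do While Len(sFile) > 0
        sRoot = Left$(sFile, Len(sFile) - 4)      ' strip ".pdf"
        ShellAndWait """" & FOLDER & "pdfimages.exe"" -j """ & _
                     FOLDER & sFile & """ """ & FOLDER & sRoot & """"
        sFile = Dir$
    Loop
End Sub
```

    Waiting on each call keeps the output of one PDF from interleaving with the next and lets the program pick up the extracted files as soon as the call returns.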

  7. #7

    Thread Starter
    PowerPoster SamOscarBrown's Avatar
    Join Date
    Aug 2012
    Location
    NC, USA
    Posts
    9,145

    Re: Extracting Images from A .PDF file...Revisit

    @ westconn1....super! Works as you suggested. Of course, I did not originally want to use 3rd party, but this was so simple, no real problem.

    BUT, still leaving this open for other suggestions while I implement this command line call in my program. I printed the 'help file' to better understand its usage.

    Sam

  8. #8
    PowerPoster
    Join Date
    Dec 2004
    Posts
    25,618

    Re: Extracting Images from A .PDF file...Revisit

    On that thread, LeandroA suggested a section of code
    i did, with some changes, get the posted code to execute successfully without error, so that could also be used instead

  9. #9
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: Extracting Images from A .PDF file...Revisit

    Quote Originally Posted by westconn1 View Post
    ...you can download xpdf command line tools, free to use...
    One needs to be careful here regarding "free to use", since it is licensed under the GPL
    (and thus not commercially distributable alongside your App).

    For Sam's use-case (personal private use, via an occasionally "shelled" commandline),
    there won't be any issues - the GPL allows that - but to be able to ship it with(in) a commercial App-package,
    one has to contact Glyph & Cog for an appropriate (commercial) license: http://www.glyphandcog.com/

    The only PDF-lib I know of, which comes under a license that allows commercial usage, would be libPDFium
    (developed by Foxit for the most part - then incorporated by special agreement between Google and Foxit into the Google-Chrome-Browser,
    later opened also as standalone-library - and relicensed by Google under the quite generous BSD-license).

    VB-friendly (__stdcall exporting) binaries are available in several online-repos,
    as for example the builds I'm using for years now (from a build-service, Pieter van Ginkel is providing here, on GitHub):
    https://github.com/pvginkel/PdfiumBu.../master/Builds

    With that single-file-lib (zipped about 2MB), one can "ship and integrate" one's own PDF-Viewer without licensing-issues
    (Image- or Plain-Text extraction is then only a matter of "a dozen lines of VB-code").

    HTH

    Olaf

  10. #10

    Thread Starter
    PowerPoster SamOscarBrown's Avatar
    Join Date
    Aug 2012
    Location
    NC, USA
    Posts
    9,145

    Re: Extracting Images from A .PDF file...Revisit

    Quote Originally Posted by westconn1 View Post
    i did, with some changes, get the posted code to execute successfully without error, so that could also be used instead
    Would you mind posting here (or there?)...would like to see what you did so it worked correctly.

    Sammi

  11. #11
    PowerPoster
    Join Date
    Dec 2004
    Posts
    25,618

    Re: Extracting Images from A .PDF file...Revisit

    i was going to post it above, but i will have to clean it up a lot as i made a big mess getting it to work
    basically the issue was line separators and lack of, the original code had provision for different line separators, but did not work for your sample file
    i do not know if there is a specific spec for building pdf files regarding line separators, but maybe platform specific or some other criteria
    so that being the case, what i have now may not work with your real pdf files, so i will try to anticipate all variations that may be possible
    i will revisit it tonight

    i know i have had issues before with generated pdf files not being the same when reading the pdf as text, i spent some time, trying to figure why one pdf from a standard template was different in layout to all the others, with no resolution, even after generating the file multiple times, it always had the same differences

    i will be looking at this same code to see if it will help with my application, which currently shells to an outside library

  12. #12
    Frenzied Member
    Join Date
    Dec 2008
    Location
    Melbourne Australia
    Posts
    1,487

    Re: Extracting Images from A .PDF file...Revisit

    I have been on the search for PDF to JPG software for some time.
    Just came across a program that is not too expensive (previous programs I have checked out, are in the thousands, if you wish command line interface)
    This one is $55 - reaconverter
    If you can insert page breaks in your PDF (so you have one img per page), you just move a copy of the PDF into a watched folder, and it will create many JPGs (_page_01 _page_02 etc) in your pre-selected folder, and it can delete the PDF when it is done.
    That would mean that your img may not fill the JPG, can you live with that ?
    Rob
    PS the Standard version does not have command line, but I don't need that, as I leave reaconverter running all the time, and it is watching the PDF folder. (reaconverter uses a config file)

  13. #13
    PowerPoster
    Join Date
    Dec 2004
    Posts
    25,618

    Re: Extracting Images from A .PDF file...Revisit

    here is my version of the code posted by LeandroA

    Code:
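    Private Sub ExtractImgPDF(ByVal PathPDF As String, ByVal DestPath As String)
        Dim i As Long, lRet As Long
        Dim FF As Integer
        Dim sBuff As String
        Dim ArrStream() As String
        Dim sStream As String
        Dim lCount As Long
        Dim imgstrt As String
        lCount = 1
        
        ' Read the whole PDF into a string (one character per byte on Western ANSI code pages)
        FF = FreeFile
        Open PathPDF For Binary As #FF
            sBuff = Space$(LOF(FF))
            Get #FF, , sBuff
        Close #FF
        
        If Right$(DestPath, 1) <> "\" Then DestPath = DestPath & "\"
        
        ' JPEG start-of-image marker, bytes FF D8 FF (the literal "ÿØÿ" on a Windows-1252 system)
        imgstrt = Chr$(&HFF) & Chr$(&HD8) & Chr$(&HFF)
        ArrStream = Split(sBuff, imgstrt)
        
        ' Element 0 is everything before the first marker, so start at 1
        For i = 1 To UBound(ArrStream)
            lRet = InStr(ArrStream(i), "endstream")
            If lRet Then
                ' Restore the marker that Split removed and cut at the end of the PDF stream
                sStream = imgstrt & Left$(ArrStream(i), lRet - 1)
                
                FF = FreeFile
                Open DestPath & "Image " & lCount & ".jpg" For Binary As #FF
                    Put #FF, , sStream
                Close #FF
                
                lCount = lCount + 1
            Else
                ' No "endstream" after this marker: show the start of the unmatched chunk
                Debug.Print Left$(ArrStream(i), 6)
            End If
        Next
    End Sub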
    tested, working correctly with sample file posted above

  14. #14
    Fanatic Member
    Join Date
    Jan 2015
    Posts
    596

    Re: Extracting Images from A .PDF file...Revisit

    Olaf, do you have a sample of using the libPDFium library from VB6?
    It could be interesting.
    Thanks

    For the moment, I use MODI to display PDF (I extract everything in JPG to show PDF)


  15. #15

    Thread Starter
    PowerPoster SamOscarBrown's Avatar
    Join Date
    Aug 2012
    Location
    NC, USA
    Posts
    9,145

    Re: Extracting Images from A .PDF file...Revisit

    @WestConn1 (as in 'Connecticut'???): thanks for 'fixing' that code. It works as stated on my practice file (Fords.pdf). But the PDF I receive annually obviously has the pictures in it in a different format (I can't attach those pictures as they are of people; it's a "Directory" for the church I attend). When I run this code on that document, all of the images come out as negatives...so, I guess I'll just go back to annually using the Snipping Tool and capture each one for my program. But thanks (and same to all) for posting ideas and code...

    Sam

  16. #16

    Thread Starter
    PowerPoster SamOscarBrown's Avatar
    Join Date
    Aug 2012
    Location
    NC, USA
    Posts
    9,145

    Re: Extracting Images from A .PDF file...Revisit

    @Bobbles....would like each image to fill the whole jpg.

  17. #17
    PowerPoster
    Join Date
    Dec 2004
    Posts
    25,618

    Re: Extracting Images from A .PDF file...Revisit

    as in 'Connecticut'???
    in no way related

    I can't attach those pictures as they are of people
    if you want to send it to me privately, pm for email address

    did shelling to xpdf extract work correctly on the real file?

  18. #18

    Thread Starter
    PowerPoster SamOscarBrown's Avatar
    Join Date
    Aug 2012
    Location
    NC, USA
    Posts
    9,145

    Re: Extracting Images from A .PDF file...Revisit

    Using xpdf, when I shelled it, they all came out as .PPM files...which I know nada about! I guess I could figure out how to convert them to jpgs with another external program...but didn't yet.

    stand by for PM with modified 'new' document.
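
    On the .PPM output: per the pdfimages documentation, the -j option only writes .jpg files for images stored in the PDF in DCT (JPEG) format; all other images are written as PBM/PPM as usual. A rough way to see what a particular PDF contains is to count the stream filter names in the raw file. The sketch below does that; ReportPdfFilters and CountToken are illustrative names, and the counts are only approximate (FlateDecode, for instance, is also used for page content and fonts, not just images).

```vb
Option Explicit

' Counts non-overlapping occurrences of Token in Text (case-sensitive).
Private Function CountToken(ByRef Text As String, ByVal Token As String) As Long
    Dim pos As Long

    pos = InStr(1, Text, Token, vbBinaryCompare)
    Do While pos > 0
        CountToken = CountToken + 1
        pos = InStr(pos + Len(Token), Text, Token, vbBinaryCompare)
    Loop
End Function

' Prints how often each common PDF stream filter name appears in the file.
Public Sub ReportPdfFilters(ByVal PathPDF As String)
    Dim FF As Integer, sBuff As String

    ' Same whole-file read as the extraction code posted in this thread
    FF = FreeFile
    Open PathPDF For Binary Access Read As #FF
    sBuff = Space$(LOF(FF))
    Get #FF, , sBuff
    Close #FF

    Debug.Print "DCTDecode (JPEG):        "; CountToken(sBuff, "/DCTDecode")
    Debug.Print "JPXDecode (JPEG 2000):   "; CountToken(sBuff, "/JPXDecode")
    Debug.Print "FlateDecode (zip-style): "; CountToken(sBuff, "/FlateDecode")
    Debug.Print "CCITTFaxDecode (fax):    "; CountToken(sBuff, "/CCITTFaxDecode")
    Debug.Print "DeviceCMYK colour space: "; CountToken(sBuff, "/DeviceCMYK")
End Sub
```

    If the DCTDecode count is zero, neither -j nor a scan for JPEG markers can produce .jpg files directly, and converting the PPM output is the remaining route.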

  19. #19
    Frenzied Member
    Join Date
    Dec 2008
    Location
    Melbourne Australia
    Posts
    1,487

    Re: Extracting Images from A .PDF file...Revisit

    Quote Originally Posted by SamOscarBrown View Post
    @Bobbles....would like each image to fill the whole jpg.
    Sam,
    It might be worth perusing their site.
    Not only do you use a config file, but you can also automate image manipulations by using an Action file (during the same operation that the config file is doing).
    Regards,
    Rob

  20. #20

    Thread Starter
    PowerPoster SamOscarBrown's Avatar
    Join Date
    Aug 2012
    Location
    NC, USA
    Posts
    9,145

    Re: Extracting Images from A .PDF file...Revisit

    I have several ways of doing what I want to get done, such as extracting all images to .ppm through a third-party app (xpdf) and then using another 3rd party app to convert from ppm to jpg. I was unaware of the several formats pictures can be stored in within PDF files. Anyway, my 'sample' Ford.pdf is what I created with jpg images on my computer, but the company I get the real PDF from every year obviously creates their PDFs with pictures in a different format. Using my Ford.pdf, the code from that other thread as modified by westconn1 works well, but against the 'real' PDF, all images come out in a negative format (a negative of a picture, if you know what I mean).

    I really can't attach that PDF as it contains PII (personally identifiable information). Even extracting just one page of pictures would include PII, so I am unable to do that either (I won't even in a PM). SO, I will probably end up closing this thread shortly, as I cannot provide a sample for anyone to demonstrate how to properly extract jpg images. I'll just go the longer route...but, again, thanks to all of you who offered some great advice.

    Sam
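
    One possible explanation for the "negative" images, offered as an assumption since nothing in this thread confirms it: the real PDF may store 4-component (CMYK/YCCK) JPEGs, which many viewers show with inverted colours when the raw stream is saved as-is. A quick check is to read the component count from the start-of-frame segment of one of the extracted .jpg files. The sketch below does that; JpegComponentCount is an illustrative name, and it only diagnoses the file, it does not repair it.

```vb
Option Explicit

' Returns the number of colour components declared in the JPEG's
' start-of-frame segment (1 = grey, 3 = colour, 4 = CMYK/YCCK),
' or 0 if the file is not a JPEG or no frame header is found.
Public Function JpegComponentCount(ByVal JpgPath As String) As Long
    Dim b() As Byte, FF As Integer
    Dim pos As Long, marker As Long, segLen As Long

    FF = FreeFile
    Open JpgPath For Binary Access Read As #FF
    If LOF(FF) < 4 Then Close #FF: Exit Function
    ReDim b(0 To LOF(FF) - 1)
    Get #FF, , b
    Close #FF

    If b(0) <> &HFF Or b(1) <> &HD8 Then Exit Function   ' no SOI marker

    pos = 2
    Do While pos + 3 <= UBound(b)
        If b(pos) <> &HFF Then Exit Function             ' lost marker sync
        marker = b(pos + 1)
        If marker = &HFF Then
            pos = pos + 1                                ' fill byte, skip it
        ElseIf marker = &HD8 Or (marker >= &HD0 And marker <= &HD7) Or marker = &H1 Then
            pos = pos + 2                                ' marker without a length field
        Else
            segLen = b(pos + 2) * 256& + b(pos + 3)      ' big-endian segment length
            Select Case marker
                Case &HC0 To &HC3, &HC5 To &HC7, &HC9 To &HCB, &HCD To &HCF
                    ' SOF layout: length(2) precision(1) height(2) width(2) components(1)
                    If pos + 9 <= UBound(b) Then JpegComponentCount = b(pos + 9)
                    Exit Function
                Case &HDA, &HD9                          ' scan data or end of image
                    Exit Function
            End Select
            pos = pos + 2 + segLen                       ' jump to the next marker
        End If
    Loop
End Function
```

    For example, `Debug.Print JpegComponentCount("c:\temp\extract\Image 1.jpg")` printing 4 would point to CMYK data, in which case the images need a colour conversion step rather than a different extraction method.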

  21. #21
    PowerPoster
    Join Date
    Dec 2004
    Posts
    25,618

    Re: Extracting Images from A .PDF file...Revisit

    extracting all images to .ppm through a third-party app (xpdf)
    the -j option should save them directly as .jpg

  22. #22
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Extracting Images from A .PDF file...Revisit

    The prospects look pretty grim to me unless you happen to get lucky with the particular PDFs you have.

    I found some with images where even Adobe Reader could not successfully copy images to the clipboard. Some images ended up mirrored with the origin shifted well away from (0, 0) as clipboard bitmaps. Others ended up even more wacky as malformed EMFs. It suggests to me that Acrobat's left hand doesn't know what its right hand is doing: it can render to screen or printer but can't copy to clipboard?!?

    PDF is a "dead" format, i.e. only for display and no longer data in any meaningful sense. It may as well be printed sheets of paper. It's like formatting a numeric data type to dead text: you may be able to parse it back to the numeric value with full fidelity or you might not but there are no hints as to range, precision, or much else.

    Trying to use it as a data exchange format is pretty desperate. You may be stuck with rendering it to the screen and then scraping the images manually.

  23. #23

    Thread Starter
    PowerPoster SamOscarBrown's Avatar
    Join Date
    Aug 2012
    Location
    NC, USA
    Posts
    9,145

    Re: Extracting Images from A .PDF file...Revisit

    @westconn1---I am using that option, comes out ppm.

    @dilettante---yup, just about settled in on that...as it is only 125 images, an hour at the most is all it really takes. I've spent WAY more than that just trying alternatives. An hour (or so) a year is not that much of an investment.

  24. #24
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: Extracting Images from A .PDF file...Revisit

    Ok, I've just uploaded the requested Demo into the CodeBank, which shows how to use lib-pdfium (sorry for the delay...):
    http://www.vbforums.com/showthread.p...-ImageExports)



    HTH

    Olaf
