-
Aug 10th, 2019, 08:36 AM
#1
Extracting Images from A .PDF file...Revisit
I had marked this thread, http://www.vbforums.com/showthread.p...ure-Extraction, as RESOLVED, simply because I did the manual process of capturing each image in the PDF using MS's Snipping Tool. As I will be getting NEW PDFs on a regular basis, I would like to automate the process using VB6. I would also prefer not to convert the PDF to MS Word first. I would like to save each image in the PDF as a .JPG file and store them in a directory.
On that thread, LeandroA suggested a section of code which I could not execute successfully: arrStream never populated, and I could not track down why.
I prefer NOT to download any 3rd Party apps, but could if necessary.
Advice?
Sam
-
Aug 10th, 2019, 12:39 PM
#2
Re: Extracting Images from A .PDF file...Revisit
Do you want to sniff and scrape for images within the PDF? If not and you want full pages, text and all, then why not the more obvious TIFF instead of JPEG? TIFF is multipage like a PDF.
-
Aug 10th, 2019, 01:16 PM
#3
Re: Extracting Images from A .PDF file...Revisit
S & S...just images. I get a publication from a company which produces the .PDF. It will have approximately 125 images (all the same size)...I would like to capture those images with a VB6 routine, so I can put them into a directory. The images WILL have text below each (identifying the image), but I really don't need it (although it could come in handy to scrape that information as well). In addition to the section of the PDF which includes those images, there will be other pages which I definitely do not want to capture...so, what it really is, is a document containing a certain number of pages with series of images, 12 per page (less than that on the last page of the images), 3 across, 4 down, followed by other pages with text I do not need/want (I already have THAT information in my database, and can easily correlate the images with what I already have). Make sense? COULD I use 3rd party apps...I would most assuredly expect I could, but would rather simply use VB6. I have seen VB.NET routines which supposedly do this, but not being too well versed in .Net, would rather not use it...besides, this function would be an add-in to my already fairly robust (in MY opinion) program.
Attached is a SAMPLE pdf with 12 images and text below each (this is NOT what I am programming!)...one could use it for a test bed, however.
Sam
Last edited by SamOscarBrown; Aug 10th, 2019 at 01:28 PM.
Reason: Added pdf
-
Aug 10th, 2019, 04:24 PM
#4
Re: Extracting Images from A .PDF file...Revisit
You'll need some sort of library or command-line utility, e.g. pdfimages.
-
Aug 11th, 2019, 04:29 AM
#5
Re: Extracting Images from A .PDF file...Revisit
Hi sam,
to extract with .Net you can use iTextSharp, I just tried it here: [image after extraction]
and here is a link to a tool to use with VB6: https://bytescout.com/products/devel...6-and-VBScript
I have never tried it with VB6, so I can't say if it's good. Do you want the .Net code?
to hunt a species to extinction is not logical !
since 2010 the number of Tigers are rising again in 2016 - 3900 were counted. with Baby Callas it's 3901, my wife and I had 2-3 months the privilege of raising a Baby Tiger.
-
Aug 11th, 2019, 06:08 AM
#6
Re: Extracting Images from A .PDF file...Revisit
you can download xpdf command line tools, free to use, from http://www.xpdfreader.com/download.html, then shell the pdfimages
like
Code:
Shell """c:\temp\extract\pdfimages.exe"" -j ""c:\temp\extract\fords.pdf"" ""c:\temp\extract\fords"""
worked with your sample file, change paths etc to suit
if you are working with multiple pdf files i would suggest shellandwait
this is the same as suggested by dilettante
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case.
Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
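A minimal sketch of the "shellandwait" idea mentioned above, for the case of several PDF files in one folder. It is not the code from any post in this thread: the helper names `ShellAndWait` and `ExtractAllPdfs` and their parameters are made up for illustration, and the waiting is done through the `Run` method of the Windows Script Host `WScript.Shell` object (third argument `True` means wait for the process to exit) rather than through API calls.

```vb
Option Explicit

' Hypothetical helper: runs a command line hidden and blocks until it exits.
' Returns the exit code reported by the process.
Public Function ShellAndWait(ByVal CommandLine As String) As Long
    Dim oShell As Object
    Set oShell = CreateObject("WScript.Shell")
    ' 0 = hidden window, True = wait for the process to finish
    ShellAndWait = oShell.Run(CommandLine, 0, True)
End Function

' Hypothetical helper: calls pdfimages once per PDF found in Folder.
' Output files are named after each PDF (image root = PDF name without ".pdf").
Public Sub ExtractAllPdfs(ByVal Folder As String, ByVal PdfImagesExe As String)
    Dim sPdf As String
    Dim sRoot As String
    Dim sCmd As String

    sPdf = Dir$(Folder & "\*.pdf")
    Do While Len(sPdf) > 0
        sRoot = Folder & "\" & Left$(sPdf, Len(sPdf) - 4)
        ' Builds: "exe" -j "folder\file.pdf" "folder\file"
        sCmd = """" & PdfImagesExe & """ -j """ & Folder & "\" & sPdf & """ """ & sRoot & """"
        Debug.Print sPdf, ShellAndWait(sCmd)
        sPdf = Dir$
    Loop
End Sub
```

With the paths used in this post, a call would look like `ExtractAllPdfs "c:\temp\extract", "c:\temp\extract\pdfimages.exe"`. Because each call waits for pdfimages to finish, the image files for one PDF are complete before the next PDF is started.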
-
Aug 11th, 2019, 06:55 AM
#7
Re: Extracting Images from A .PDF file...Revisit
@ westconn1....super! Works as you suggested. Of course, I did not originally want to use 3rd party, but this was so simple, no real problem.
BUT, still leaving this open for other suggestions while I implement this command line call in my program. I printed the 'help file' to better understand its usage.
Sam
-
Aug 11th, 2019, 07:48 AM
#8
Re: Extracting Images from A .PDF file...Revisit
On that thread, LeandroA suggested a section of code
i did, with some changes, get the posted code to execute successfully without error, so that could also be used instead
pete
-
Aug 11th, 2019, 11:16 AM
#9
Re: Extracting Images from A .PDF file...Revisit
Originally Posted by westconn1
...you can download xpdf command line tools, free to use...
One needs to be careful here, regarding "free to use" - since it is licensed under GPL
(and thus not commercially distributable alongside your App).
For Sam's use-case (personal private use, via an occasionally "shelled" commandline),
there won't be any issues - the GPL allows that - but to be able to ship it with(in) a commercial App-package,
one has to contact Glyph & Cog for an appropriate (commercial) license: http://www.glyphandcog.com/
The only PDF-lib I know of, which comes under a license that allows commercial usage, would be libPDFium
(developed by Foxit for the most part - then incorporated by special agreement between Google and Foxit into the Google-Chrome-Browser,
later opened also as standalone-library - and relicensed by Google under the quite generous BSD-license).
VB-friendly (__stdcall exporting) binaries are available in several online-repos,
as for example the builds I'm using for years now (from a build-service, Pieter van Ginkel is providing here, on GitHub):
https://github.com/pvginkel/PdfiumBu.../master/Builds
With that single-file-lib (zipped about 2MB), one can "ship and integrate" one's own PDF-Viewer without licensing-issues
(Image- or Plain-Text extraction then only a matter of "a dozen lines of VB-code").
HTH
Olaf
-
Aug 11th, 2019, 11:59 AM
#10
Re: Extracting Images from A .PDF file...Revisit
Originally Posted by westconn1
i did, with some changes, get the posted code to execute successfully without error, so that could also be used instead
Would you mind posting here (or there?)...would like to see what you did so it worked correctly.
Sammi
-
Aug 11th, 2019, 04:26 PM
#11
Re: Extracting Images from A .PDF file...Revisit
i was going to post it above, but i will have to clean it up a lot as i made a big mess getting it to work
basically the issue was line separators and lack of, the original code had provision for different line separators, but did not work for your sample file
i do not know if there is a specific spec for building pdf files regarding line separators, but maybe platform specific or some other criteria
so that being the case, what i have now may not work with your real pdf files, so i will try to anticipate all variations that may be possible
i will revisit it tonight
i know i have had issues before with generated pdf files not being the same when reading the pdf as text, i spent some time, trying to figure why one pdf from a standard template was different in layout to all the others, with no resolution, even after generating the file multiple times, it always had the same differences
i will be looking at this same code to see if it will help with my application, which currently shells to an outside library
pete
-
Aug 12th, 2019, 07:00 PM
#12
Re: Extracting Images from A .PDF file...Revisit
I have been on the search for PDF to JPG software for some time.
Just came across a program that is not too expensive (previous programs I have checked out, are in the thousands, if you wish command line interface)
This one is $55 - reaconverter
If you can insert page breaks in your PDF (so you have one img per page), you just move a copy of the PDF into a watched folder, and it will create many JPGs (_page_01 _page_02 etc) in your pre-selected folder, and it can delete the PDF when it is done.
That would mean that your img may not fill the JPG, can you live with that ?
Rob
PS the Standard version does not have command line, but I don't need that, as I leave reaconverter running all the time, and it is watching the PDF folder. (reaconverter uses a config file)
-
Aug 13th, 2019, 04:26 PM
#13
Re: Extracting Images from A .PDF file...Revisit
here is my version of the code posted by LeandroA
Code:
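Private Sub ExtractImgPDF(ByVal PathPDF As String, ByVal DestPath As String)
    Dim i As Long, lRet As Long
    Dim FF As Integer
    Dim sBuff As String
    Dim ArrStream() As String
    Dim sStream As String
    Dim sFile As String
    Dim lCount As Long
    Dim imgstrt As String

    lCount = 1

    ' Read the whole PDF into a string (bytes are mapped through the ANSI code page)
    FF = FreeFile
    Open PathPDF For Binary As #FF
    sBuff = Space$(LOF(FF))
    Get #FF, , sBuff
    Close #FF

    If Right$(DestPath, 1) <> "\" Then DestPath = DestPath & "\"

    ' JPEG start-of-image marker, bytes FF D8 FF ("ÿØÿ" on a Windows-1252 system)
    imgstrt = Chr$(&HFF) & Chr$(&HD8) & Chr$(&HFF)
    ArrStream = Split(sBuff, imgstrt)

    ' Element 0 is everything before the first marker, so start at 1
    For i = 1 To UBound(ArrStream)
        lRet = InStr(ArrStream(i), "endstream")
        If lRet Then
            sStream = imgstrt & Left$(ArrStream(i), lRet - 1)
            sFile = DestPath & "Image " & lCount & ".jpg"
            ' Binary mode does not truncate, so remove any older, longer file first
            If Len(Dir$(sFile)) > 0 Then Kill sFile
            FF = FreeFile
            Open sFile For Binary As #FF
            Put #FF, , sStream
            Close #FF
            lCount = lCount + 1
        Else
            ' No "endstream" after this marker: show the start of the skipped chunk
            Debug.Print Left$(ArrStream(i), 6)
        End If
    Next
End Sub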
tested, working correctly with sample file posted above
pete
-
Aug 14th, 2019, 01:14 AM
#14
Re: Extracting Images from A .PDF file...Revisit
Olaf, do you have a sample of use in vB6 for the library libPDFium ?
It could be interesting.
Thanks
For the moment, I use MODI to display PDF (I extract everything in JPG to show PDF)
-
Aug 16th, 2019, 06:16 AM
#15
Re: Extracting Images from A .PDF file...Revisit
@WestConn1 (as in 'Connecticut'???)---thanks for 'fixing' that code...works as stated (on my practice file, Fords.pdf). But the pdf I receive annually obviously has the pictures in it in a different format (I can't attach those pictures as they are of people---it's a "Directory" for the church I attend.) When I run this code on that document, all of the images come out as negatives...so, I guess I'll just go back to annually using the Snipping Tool and capture each one for my program. But thanks (and same to all) for posting ideas and code...
Sam
-
Aug 16th, 2019, 06:17 AM
#16
Re: Extracting Images from A .PDF file...Revisit
@Bobbles....would like each image to fill the whole jpg.
-
Aug 16th, 2019, 05:32 PM
#17
Re: Extracting Images from A .PDF file...Revisit
in no way related
I can't attach those pictures as they are of people
if you want to send it to me privately, pm for email address
did shelling to xpdf to extract work correctly on the real file?
pete
-
Aug 16th, 2019, 05:56 PM
#18
Re: Extracting Images from A .PDF file...Revisit
Using xpdf, when I shelled it, they all came out as .PPM files...which I know nada about! I guess I could figure out how to convert them to jpgs with another external program...but didn't yet.
stand by for PM with modified 'new' document.
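For readers who, like Sam, have not met .PPM before: a binary PPM ("P6") file is a short text header (magic "P6", width, height, maximum sample value) followed by raw red/green/blue bytes, rows stored top to bottom. The following is only a sketch of turning such a file into a 24-bit .BMP, which VB6's `LoadPicture` can open. The names `PpmToBmp`, `NextNumber` and `PutLong` are hypothetical, and the code assumes a header without comment lines, a maximum value of 255, and a single whitespace byte before the pixel data. Monochrome .PBM files are not handled, and going from BMP to JPG would still need something else.

```vb
Option Explicit

' Hypothetical helper: reads the next whitespace-delimited decimal number.
Private Function NextNumber(b() As Byte, ByRef pos As Long) As Long
    Do While b(pos) = 32 Or b(pos) = 9 Or b(pos) = 13 Or b(pos) = 10
        pos = pos + 1                   ' skip space, tab, CR, LF
    Loop
    Do While b(pos) >= 48 And b(pos) <= 57
        NextNumber = NextNumber * 10 + (b(pos) - 48)
        pos = pos + 1
    Loop
End Function

' Hypothetical helper: stores a non-negative Long as 4 little-endian bytes.
Private Sub PutLong(b() As Byte, ByVal offset As Long, ByVal value As Long)
    b(offset) = value And &HFF&
    b(offset + 1) = (value \ &H100&) And &HFF&
    b(offset + 2) = (value \ &H10000) And &HFF&
    b(offset + 3) = (value \ &H1000000) And &HFF&
End Sub

Public Sub PpmToBmp(ByVal PpmPath As String, ByVal BmpPath As String)
    Dim FF As Integer
    Dim src() As Byte, bmp() As Byte
    Dim pos As Long, w As Long, h As Long, maxVal As Long
    Dim x As Long, y As Long, s As Long, d As Long, stride As Long

    FF = FreeFile
    Open PpmPath For Binary As #FF
    ReDim src(0 To LOF(FF) - 1)
    Get #FF, , src
    Close #FF

    ' 80 = "P", 54 = "6"
    If src(0) <> 80 Or src(1) <> 54 Then Err.Raise 5, , "Not a binary PPM (P6) file"
    pos = 2
    w = NextNumber(src, pos)
    h = NextNumber(src, pos)
    maxVal = NextNumber(src, pos)
    If maxVal <> 255 Then Err.Raise 5, , "Only 8-bit samples (maxval 255) are handled"
    pos = pos + 1                       ' one whitespace byte separates header and pixels

    stride = ((w * 3 + 3) \ 4) * 4      ' BMP rows are padded to a multiple of 4 bytes
    ReDim bmp(0 To 54 + stride * h - 1)

    bmp(0) = 66: bmp(1) = 77            ' "BM"
    PutLong bmp, 2, 54 + stride * h     ' total file size
    PutLong bmp, 10, 54                 ' offset of the pixel data
    PutLong bmp, 14, 40                 ' BITMAPINFOHEADER size
    PutLong bmp, 18, w
    PutLong bmp, 22, h                  ' positive height = rows stored bottom-up
    bmp(26) = 1                         ' planes
    bmp(28) = 24                        ' bits per pixel
    PutLong bmp, 34, stride * h         ' size of the pixel data

    For y = 0 To h - 1
        s = pos + y * w * 3             ' PPM rows run top-down
        d = 54 + (h - 1 - y) * stride   ' BMP rows run bottom-up
        For x = 0 To w - 1
            bmp(d + x * 3) = src(s + x * 3 + 2)       ' blue
            bmp(d + x * 3 + 1) = src(s + x * 3 + 1)   ' green
            bmp(d + x * 3 + 2) = src(s + x * 3)       ' red
        Next
    Next

    If Len(Dir$(BmpPath)) > 0 Then Kill BmpPath
    FF = FreeFile
    Open BmpPath For Binary As #FF
    Put #FF, , bmp
    Close #FF
End Sub
```

The only real work is the byte shuffling: PPM stores pixels as R, G, B from the top row down, while an uncompressed BMP stores B, G, R from the bottom row up with each row padded to 4 bytes.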
-
Aug 16th, 2019, 07:37 PM
#19
Re: Extracting Images from A .PDF file...Revisit
Originally Posted by SamOscarBrown
@Bobbles....would like each image to fill the whole jpg.
Sam,
It might be worth perusing their site.
Not only do you use a config file, but you can also automate image manipulations by using an Action file (during the same operation that the config file is doing).
Regards,
Rob
-
Aug 17th, 2019, 04:50 PM
#20
Re: Extracting Images from A .PDF file...Revisit
I have several ways of doing what I want to get done... extracting all images to .ppm through a third-party app (xpdf). and then using another 3rd party app to convert from ppm to jpg. I was unaware of the several formats pictures are stored in in pdf files. Anyway, my 'sample' Ford.pdf is what I created with jpg images on my computer; but what I get from a company every year obviously created their pdfs with different format pictures. Using my Ford.pdf and the code from that other thread, and modified by westconn1 works well, but against the 'real' pdf, all images are in a negative format (a negative of a picture, if you know what I mean). I really can't attach that pdf as it contains pii (personally identifiable information). Even attempting to extract just one page of pictures also would include pii, so I am unable to do that as well (even in a PM I won't). SO, will probably end up closing this thread shortly, as I cannot provide a sample for anyone to demonstrate how to properly extract jpg images....I'll just go the longer route....but, again, thanks to all of you who offered some great advice.
Sam
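On the point about pictures being stored in several formats inside a PDF: each stream in a PDF names the filter(s) needed to decode it, for example `/DCTDecode` for JPEG data, `/JPXDecode` for JPEG 2000, `/FlateDecode` for zlib-compressed data, and `/CCITTFaxDecode` or `/JBIG2Decode` for bilevel scans. A rough way to see what a given file uses is to count those names in the raw bytes. The sketch below does only that; `ReportPdfFilters` and `CountMarker` are hypothetical names, and the counts are a hint rather than an inventory, since `/FlateDecode` is also used for non-image streams and some PDFs keep their dictionaries inside compressed object streams where a plain text search cannot see them.

```vb
Option Explicit

' Hypothetical helper: counts non-overlapping occurrences of sMarker in sData.
Private Function CountMarker(ByRef sData As String, ByVal sMarker As String) As Long
    Dim lPos As Long
    lPos = InStr(1, sData, sMarker, vbBinaryCompare)
    Do While lPos > 0
        CountMarker = CountMarker + 1
        lPos = InStr(lPos + Len(sMarker), sData, sMarker, vbBinaryCompare)
    Loop
End Function

' Hypothetical helper: prints how often each stream filter name appears in the PDF.
Public Sub ReportPdfFilters(ByVal PathPDF As String)
    Dim FF As Integer
    Dim sBuff As String
    Dim v As Variant

    FF = FreeFile
    Open PathPDF For Binary As #FF
    sBuff = Space$(LOF(FF))
    Get #FF, , sBuff
    Close #FF

    For Each v In Array("/DCTDecode", "/JPXDecode", "/FlateDecode", _
                        "/CCITTFaxDecode", "/JBIG2Decode")
        Debug.Print v, CountMarker(sBuff, CStr(v))
    Next
End Sub
```

If `/DCTDecode` shows up about once per picture, the images are stored as JPEG data and the marker-splitting approach from the other thread has something to find; if it is absent, the pictures are stored some other way and have to be decoded rather than copied out.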
-
Aug 17th, 2019, 07:50 PM
#21
Re: Extracting Images from A .PDF file...Revisit
extracting all images to .ppm through a third-party app (xpdf)
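the -j option should save them directly as .jpg, but only for images stored in the PDF as JPEG (DCT) data; anything else still comes out as .ppm or .pbm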
pete
-
Aug 17th, 2019, 11:53 PM
#22
Re: Extracting Images from A .PDF file...Revisit
The prospects look pretty grim to me unless you happen to get lucky with the particular PDFs you have.
I found some with images where even Adobe Reader could not successfully copy images to the clipboard. Some images ended up mirrored with the origin shifted well away from (0, 0) as clipboard bitmaps. Others ended up even more wacky as malformed EMFs. It suggests to me that Acrobat's left hand doesn't know what its right hand is doing: it can render to screen or printer but can't copy to clipboard?!?
PDF is a "dead" format, i.e. only for display and no longer data in any meaningful sense. It may as well be printed sheets of paper. It's like formatting a numeric data type to dead text: you may be able to parse it back to the numeric value with full fidelity or you might not but there are no hints as to range, precision, or much else.
Trying to use it as a data exchange format is pretty desperate. You may be stuck with rendering it to the screen and then scraping the images manually.
-
Aug 19th, 2019, 05:37 AM
#23
Re: Extracting Images from A .PDF file...Revisit
@westconn1---I am using that option, comes out ppm.
@dilettante---yup, just about settled in on that...as it is only 125 images, an hour at the most is all it really takes. I've spent WAY more than that just trying alternatives. An hour (or so) a year is not that much of an investment.
-
Aug 24th, 2019, 09:45 AM
#24
Re: Extracting Images from A .PDF file...Revisit
Ok, I've just uploaded the requested Demo into the CodeBank, which shows how to use lib-pdfium (sorry for the delay...):
http://www.vbforums.com/showthread.p...-ImageExports)
HTH
Olaf