Recommendation for "printing" PDF to remove all metadata?
I need to remove all metadata from PDFs before passing them on, for example when I forward a PDF that I received from person A (say, a co-worker) to person B (a customer).
I really don't like how much metadata is in PDFs and how difficult it is to remove, so I developed the habit of printing each PDF and then scanning it to get rid of all metadata.
Can anybody recommend a simpler way to do that than printing them out physically?
Losing the ability to extract text from it is not a problem; I only need the visual representation of the PDF contents.
Also, I don't want to use any 3rd party tools anymore except RC6. In fact, Olaf's work is one of the few that I trust without seeing the source code.
Thank you for any recommendation.
Re: Recommendation for "printing" PDF to remove all metadata?
You could try deleting the metadata using the IPropertyStore interface. IPropertyStore::SetValue with VT_EMPTY deletes a value from the property store.
Re: Recommendation for "printing" PDF to remove all metadata?
Quote (originally posted by -Franky-):
You could try deleting the metadata using the IPropertyStore interface. IPropertyStore::SetValue with VT_EMPTY deletes a value from the property store.
Wouldn't that require the installation of a 3rd party property handler shell extension? AFAIK there's nothing built in for PDF, at least through Win10. See if you can do it in Explorer. If you can do it in Explorer, then you have a decent chance of being able to do it through IPropertyStore (or guaranteed you can if you use twinBASIC; not all property handlers install 32bit versions for VB6 these days; if there's only a 64bit one that won't load in an out of process server, you'd need a 64bit exe).
Using only RC6+native VB is going to be a big problem for any kind of PDF handling beyond basic 'Print to PDF'. If you can't trust Microsoft Office or Google's open source pdfium, it's going to mean manually writing a pdf parser yourself in all likelihood.
Re: Recommendation for "printing" PDF to remove all metadata?
I am under the impression (please correct me if I am wrong) that PDF is going to stay a standard.
Lawyers, tax counsellors, etc. all use PDF.
In fact, I was hoping that ultimately somebody in the community would create a PDF library in TB or in VB6 so that we don't have to rely on third party tools anymore.
But obviously this is not the case yet.
Re: Recommendation for "printing" PDF to remove all metadata?
Google's pdfium is open source and easily used from VB6, VBA, and twinBASIC; so is xpdf. Writing a PDF library is a massive undertaking, so I don't know why anyone wouldn't use well-established open source tools, or why anyone would devote 6+ months of full-time work for free to duplicate those efforts, winding up with a codebase so complex that 99% of VB6 users couldn't modify it anyway. tB supports static linking, so you could probably eliminate the .dll if you wanted to, but at least it's a flat DLL, so there's not even ActiveX registration hell.
Re: Recommendation for "printing" PDF to remove all metadata?
Could you lend me a hand by telling me if there is a library, or a wrapper around a library, that I could use from within VB6 and that would do what I need, even if it reduced the PDF contents to pure images?
I took a look at Olaf Schmidt's PDF image extractor, but I think I misunderstood its purpose. It does extract images, but it does not convert the contents to images. This is what I would actually need.
Re: Recommendation for "printing" PDF to remove all metadata?
The OrdoPdfReader control displays pages as images with pdfium.
My gPdfMerge works with pdfium text functions to perform searches.
Stripping metadata without just going through image rendering or text extraction seems to be such a pain that only Adobe's commercial closed source crap can do it. So I don't know who'd even be able to make such a tool in any language. If you install Adobe's stuff, it comes with property handlers, so you could remove the metadata easily through the Windows shell.
Re: Recommendation for "printing" PDF to remove all metadata?
Thank you.
I took a look at Olaf's solution again.
I was so stupid (aka under time pressure) that I didn't realize it has everything I need.
Re: Recommendation for "printing" PDF to remove all metadata?
The irony of not wanting 3rd party dependencies only to use *both* closed source RC6 comprised of *several* dlls PLUS pdfium instead of a complete open source solution of just pdfium with the pdf page to image code in the ordo pdf project.... :eek2:
Re: Recommendation for "printing" PDF to remove all metadata?
Starting with Win10, you can use WinRT to render the pages of a PDF into images. These images can then be combined into a PDF using a PDF printer, for example.
Re: Recommendation for "printing" PDF to remove all metadata?
I didn't express myself clearly. I meant to say that I want to be able to trust the end result.
Using Olaf's project I can convert the PDF pages to images and then put these together as PDF pages again. That is what I needed.
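For illustration, here is a minimal sketch of the "put the images back together as PDF pages" step, in Python with only the standard library. It is not Olaf's code and not how RC6 or pdfium do it. It assumes each page has already been rendered to raw 8-bit RGB pixels by whatever renderer you trust; `build_image_pdf` and its parameters are hypothetical names made up for this example. The point is that a PDF written this way holds only a catalog, a page tree, and one image per page, so there is no /Info dictionary or /Metadata stream unless you add one.

```python
import zlib

def build_image_pdf(pages):
    """Build a PDF with one full-page RGB image per page.

    pages: list of (width_px, height_px, rgb_bytes, dpi) tuples, where
    rgb_bytes is raw 8-bit RGB data (an assumption about the renderer output).
    The result has no /Info dictionary and no /Metadata stream.
    """
    objects = []  # object bodies; object number = list index + 1

    def add(body):
        objects.append(body)
        return len(objects)

    catalog_num = add(b"")  # placeholders, filled in once the pages exist
    pages_num = add(b"")
    kids = []
    for width, height, rgb, dpi in pages:
        if len(rgb) != width * height * 3:
            raise ValueError("rgb buffer does not match width * height * 3")
        data = zlib.compress(rgb)
        image_num = add(
            b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
            b"/ColorSpace /DeviceRGB /BitsPerComponent 8 "
            b"/Filter /FlateDecode /Length %d >>\nstream\n"
            % (width, height, len(data))
            + data + b"\nendstream"
        )
        # Page size in points (1/72 inch), derived from pixel size and dpi.
        w_pt = width * 72.0 / dpi
        h_pt = height * 72.0 / dpi
        # Content stream: scale the unit square to the page and draw the image.
        content = b"q %.2f 0 0 %.2f 0 0 cm /Im0 Do Q" % (w_pt, h_pt)
        content_num = add(
            b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream"
        )
        kids.append(add(
            b"<< /Type /Page /Parent %d 0 R /MediaBox [0 0 %.2f %.2f] "
            b"/Resources << /XObject << /Im0 %d 0 R >> >> /Contents %d 0 R >>"
            % (pages_num, w_pt, h_pt, image_num, content_num)
        ))
    objects[catalog_num - 1] = b"<< /Type /Catalog /Pages %d 0 R >>" % pages_num
    objects[pages_num - 1] = b"<< /Type /Pages /Count %d /Kids [%s] >>" % (
        len(kids), b" ".join(b"%d 0 R" % k for k in kids))

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    # The trailer deliberately has no /Info entry.
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1, catalog_num, xref_pos)
    return bytes(out)
```

Calling `build_image_pdf([(w, h, rgb, 150)])` with one rendered page returns the bytes of a one-page PDF that can be written straight to disk. The sketch stores pixels with Flate compression only; a real tool would probably embed JPEG data to keep file sizes reasonable.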
I did realize that Windows has the "Print to PDF" option, but since I had no idea what it does under the hood or whether the output still contains metadata, I didn't want to use it. In fact, Print to PDF gave me inconsistent results with editable fields: sometimes they showed up in the output and sometimes they didn't, and sometimes they stayed editable and sometimes they didn't. As I needed images only, I ditched this approach.
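One rough way to get a first impression of whether an output file still carries metadata is to scan its raw bytes for the usual markers. The sketch below is a Python heuristic under stated assumptions, not a validator: `find_metadata_markers` and the marker list are made up for this example, and newer PDFs can keep their dictionaries inside compressed object streams, so an empty result is a hint and not proof. Binary image data can also trigger false positives.

```python
# Common places where PDFs keep metadata: the document information
# dictionary (/Info and its keys) and the XMP packet (/Metadata, xmpmeta).
MARKERS = (
    b"/Info",
    b"/Metadata",
    b"/Author",
    b"/Creator",
    b"/Producer",
    b"/CreationDate",
    b"/ModDate",
    b"<x:xmpmeta",
)

def find_metadata_markers(pdf_bytes):
    """Return the marker names that occur anywhere in the raw PDF bytes."""
    return [m.decode("ascii") for m in MARKERS if m in pdf_bytes]

if __name__ == "__main__":
    import sys
    # Usage: python check_pdf.py some_file.pdf
    with open(sys.argv[1], "rb") as f:
        found = find_metadata_markers(f.read())
    print("possible metadata markers:", found or "none found")
```

If this reports markers for a Print to PDF output, that is at least a reason to open the file in a hex or text viewer and look at what the producer wrote.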
Re: Recommendation for "printing" PDF to remove all metadata?
Quote (originally posted by tmighty2):
I didn't express myself clearly. I meant to say that I want to be able to trust the end result.
Using Olaf's project I can convert the PDF pages to images and then put these together as PDF pages again. That is what I needed.
I did realize that Windows has the "Print to PDF" option, but since I had no idea what it does under the hood or whether the output still contains metadata, I didn't want to use it. In fact, Print to PDF gave me inconsistent results with editable fields: sometimes they showed up in the output and sometimes they didn't, and sometimes they stayed editable and sometimes they didn't. As I needed images only, I ditched this approach.
Your original post said you didn't want 3rd party tools, but RC6 is just wrapping pdfium, so you're still using it, just adding a bunch of other closed source 3rd party DLLs instead of just calling FPDF_RenderPage in the pdfium dll yourself.