
Thread: Extract Text from PDF

  1. #1

    Thread Starter
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,415

    Extract Text from PDF

    Hi Folks,

    yes, i've seen the Threads here reg. this topic, and i've seen gazillions on the Net

    What i could gather: it boils down to using a (3rd-Party) Tool to do extraction, and then parse the resulting text-File (or other format)

    Background: i've taken on a project to automate the following process:

    Mail-Client in our company: Outlook 365 - 64-Bit
    1) Client sends automated Email with a machine-readable PDF in the Attachment
    2) The ladies in Dispatch-Department download it (Save As to a dedicated Folder on our File-Server), then open every single PDF, and enter the necessary data in our ERP
    As you can imagine, Step 2 is a Time-Killer.

    I was thinking along these lines:
    1) E-Mail arrives (known sender-address)
    2) VBA-Macro saves the attachment to the Folder on the File-Server
    3) Run through the PDF-Files and convert them to my specified format (currently i'm inclined to use HTML as a Target)
    4) Parse the resulting File
    5) If i find all needed fields (Client Part-No, Quantity, Target-Location), take that and fire a SQL-Query against our Database to retrieve the necessary Data from DB
    6) Write that Data from DB to somewhere else (not decided yet) - Not important right now
    7) clean up Folder with html-files
    8) Move PDF's to another Folder called "processed" (or whatever)
    9) Delete EMail
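
Steps 4 and 5 above can be sketched roughly as follows. This is only an illustration of the logic, written in Python for brevity (the project itself is planned in VBA). The label texts in `FIELD_PATTERNS` and the function name `extract_fields` are assumptions, since the thread does not show the real PDF layout.

```python
import re
from typing import Optional

# Hypothetical labels: the actual wording in the client's PDF is not known.
FIELD_PATTERNS = {
    "part_no": re.compile(r"Part-No\.?\s*:\s*(\S+)"),
    "quantity": re.compile(r"Quantity\s*:\s*(\d+)"),
    "target_location": re.compile(r"Target-Location\s*:\s*(.+)"),
}


def extract_fields(text: str) -> Optional[dict]:
    """Return all three fields, or None as soon as one of them is missing."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            # Step 5 only proceeds when every field was found.
            return None
        found[name] = match.group(1).strip()
    return found
```

Returning `None` for an incomplete document keeps the database query from ever running on partial data; such files could be left in the folder for manual handling.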

    My Question:

    Anyone here got any recommendations (or experience doing it?)?
    Any suggestions to my Workflow?

    NOTE: I can't use Tools, which need a setup/installation (restricted Company-Computers)

    NOTE 2: Will be done in VBA (Not VB6!)

    right now i'm testing xpdfreader (because of standalone CLI-Tools)
    https://www.xpdfreader.com/download.html
    Last edited by Zvoni; Tomorrow at 31:69 PM.
    ----------------------------------------------------------------------------------------

    One System to rule them all, One Code to find them,
    One IDE to bring them all, and to the Framework bind them,
    in the Land of Redmond, where the Windows lie
    ---------------------------------------------------------------------------------
    People call me crazy because i'm jumping out of perfectly fine airplanes.
    ---------------------------------------------------------------------------------
    Code is like a joke: If you have to explain it, it's bad

  2. #2
    PowerPoster
    Join Date
    Aug 2010
    Location
    Canada
    Posts
    2,412

    Re: Extract Text from PDF

    There are 2 steps I would consider changing.

    At step #5, I would show a window to a human with the parsed data side-by-side with the PDF for verification. There are some weird PDFs out there that can trip up text extraction, and you might only get a partial value or backwards value, or otherwise incorrect value. The downside is that your process is no longer fully automated, but the upside is that it is now much more difficult for an incorrect order to be processed.

    At step #9, I would not delete the email, but instead move it to a folder ("Auto-Processed Orders" or something like that). You might need to go back to the original email & attachment to audit it. Sure you've got the PDF saved in the processed network folder, but sometimes stuff happens and it's good to have the full original email.

  3. #3
    Fanatic Member
    Join Date
    Jan 2015
    Posts
    596

    Re: Extract Text from PDF

    This is almost exactly what one of the products I wrote does. It is a commercial .NET application integrated into Outlook.
    It also does many more things that are very specific to our other applications, but it could be modified.
    If you need more info, contact me by PM.

  4. #4

    Thread Starter
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,415

    Re: Extract Text from PDF

    Quote Originally Posted by jpbro View Post
    There are 2 steps I would consider changing.

    At step #5, I would show a window to a human with the parsed data side-by-side with the PDF for verification. There are some weird PDFs out there that can trip up text extraction, and you might only get a partial value or backwards value, or otherwise incorrect value. The downside is that your process is no longer fully automated, but the upside is that it is now much more difficult for an incorrect order to be processed.

    At step #9, I would not delete the email, but instead move it to a folder ("Auto-Processed Orders" or something like that). You might need to go back to the original email & attachment to audit it. Sure you've got the PDF saved in the processed network folder, but sometimes stuff happens and it's good to have the full original email.

    jp,
    reg. Step 5 i was considering that, too, and yes, it would defeat the "automation" in a way.
    Maybe as an Option to give the user to check, at least in the beginning.
    Sound idea!
    Thx

    Reg. Step 9
    Currently, the user applies a digital "stamp" to the PDF, and saves it to that dedicated folder (Now, THAT ONE IS A TIME-KILLER!)

    The thing is: Those Mails/PDF's are not Client-Orders in the classical sense: They are Kanban-CallOffs.

    For those not familiar with "Kanban":
    Basically, you have a blanket/general contract with a client
    and a Kanban-Calloff is an automated order for both sides, because such a Calloff might occur in different variations, depending on the "physical" System:
    It might be an empty bin coming back from the client, whose barcode gets scanned, and off you go
    It might be an empty bin being thrown into our RFID-Box, which scans the RFID-tag and sends it to us, automatically creating a "refill"
    It might be sent via EDI, the ERP-Systems communicating directly with each other

    And this one particular client is not capable of any of those scenarios above, but sends those automated eMails.
    So, as in my Scenario above, i don't need to keep the "original" E-Mail for Audit-Purposes, because "sent by mail" is just the transmission-method.
    It could happen tomorrow that our client buys a new ERP-System, and switches to EDI from one day to the next

  5. #5
    PowerPoster
    Join Date
    Dec 2004
    Posts
    25,618

    Re: Extract Text from PDF

    Anyone here got any recommendations (or experience doing it?)?
    i use something similar for pdf invoices: read the customer from the pdf, look it up in a database to get the email address, then send the pdf as an attachment to the customer

    not what you are wanting to do, but it uses many of the same steps. i shell pdftotext, and also pdftk for some other task (splitting multi-page pdf files into individual pages); both are 3rd party applications and need to be copied over, but not installed
    it works correctly on about 98% of the pdf invoices; i have not been able to determine how to resolve the very few that fail because of some difference in the pdf output
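
Shelling out to pdftotext as described above might look like the sketch below. It is written in Python purely as an illustration (in VBA the equivalent would be a shell call with the same arguments). The helper names `build_pdftotext_cmd` and `pdf_to_text` are made up for this example; `-layout` and `-enc UTF-8` are standard pdftotext options, and the executable path is whatever location the tool was copied to.

```python
import subprocess
from pathlib import Path


def build_pdftotext_cmd(exe, pdf, txt, layout=True):
    """Build the argument list: pdftotext [options] <PDF-file> <text-file>."""
    cmd = [str(exe)]
    if layout:
        # Keep the physical page layout, which helps position-based parsing.
        cmd.append("-layout")
    cmd += ["-enc", "UTF-8", str(pdf), str(txt)]
    return cmd


def pdf_to_text(exe, pdf, txt, layout=True):
    """Run pdftotext and return the extracted text.

    check=True raises CalledProcessError on a non-zero exit code, so a
    failed conversion is not silently treated as an empty document.
    """
    subprocess.run(build_pdftotext_cmd(exe, pdf, txt, layout), check=True)
    return Path(txt).read_text(encoding="utf-8", errors="replace")
```

Passing the arguments as a list avoids quoting problems with spaces in file-server paths.
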
    i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case.
    Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next

    dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part

    come back and mark your original post as resolved if your problem is fixed
    pete

  6. #6
    Addicted Member jg.sa's Avatar
    Join Date
    Nov 2017
    Location
    South Australia ( SA )
    Posts
    198

    Re: Extract Text from PDF

    G'Day Zvoni

    I have built a number of these agents triggered via 'on new mail arrival', all back in the days before SAP got big

    Quote Originally Posted by Zvoni View Post
    1) Client sends automated Email with a machine-readable PDF in the Attachment

    NOTE: I can't use Tools, which need a setup/installation (restricted Company-Computers)
    Always on a SMTP server and distribution after processing was via email.


    This is a concern

    5) If i find all needed fields

    Failing to 'Pass Go' is a fail, not a hope you will get out of jail quickly !!!

  7. #7
    The Idiot
    Join Date
    Dec 2014
    Posts
    2,721

    Re: Extract Text from PDF

    I have not worked with PDF, only used 3rd party apps to extract.
    but if I needed to extract text I would:

    - learn the protocol/header of PDF
    - see if there is a way to parse its content following the protocol/header
    - work out how to convert that data into strings.

    if I were unable to do that, I would:

    - create a "screenshot" of each page of the PDF as a bitmap
    - run it through a bitmap-to-text converter (OCR).

    I did all this work when I learned about flash, how to extract data and even inject data.
    it's all about learning the header/protocol of the file.
    but there isn't always info about that, and then you will need to find other ways.

  8. #8

    Thread Starter
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,415

    Re: Extract Text from PDF

    Quote Originally Posted by jg.sa View Post
    G'Day Zvoni

    I have built a number of these agents triggered via 'on new mail arrival', all back in the days before SAP got big



    Always on a SMTP server and distribution after processing was via email.


    This is a concern

    5) If i find all needed fields

    Failing to 'Pass Go' is a fail, not a hope you will get out of jail quickly !!!

    Yeah, i'm aware that i have to sanity-check the output.
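
A sanity check of that kind could be as small as the sketch below (Python used only for illustration; the field names `part_no`, `quantity` and `target_location` and the rules themselves are assumptions, not the actual business rules).

```python
def sanity_check(fields):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Every required field must be present and non-blank.
    for name in ("part_no", "quantity", "target_location"):
        if not str(fields.get(name, "")).strip():
            problems.append(f"missing {name}")
    # A quantity that is present must be a positive whole number.
    qty = str(fields.get("quantity", "")).strip()
    if qty and not (qty.isdecimal() and int(qty) > 0):
        problems.append("quantity is not a positive whole number")
    return problems
```

Collecting all problems instead of stopping at the first one gives the user a complete message when a call-off has to be handled by hand.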

  9. #9

    Thread Starter
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,415

    Re: Extract Text from PDF

    Quote Originally Posted by baka View Post
    I have not worked with PDF, only used 3rd party apps to extract.
    but if I needed to extract text I would:

    - learn the protocol/header of PDF
    - see if there is a way to parse its content following the protocol/header
    - work out how to convert that data into strings.

    if I were unable to do that, I would:

    - create a "screenshot" of each page of the PDF as a bitmap
    - run it through a bitmap-to-text converter (OCR).

    I did all this work when I learned about flash, how to extract data and even inject data.
    it's all about learning the header/protocol of the file.
    but there isn't always info about that, and then you will need to find other ways.

    baka,
    i've found the documentation for the PDF-Format, but i think i will learn to speak Mandarin fluently before i understand the PDF-Format.
    Holy Smoke, but PDF is IMO a convoluted mess.
    i created a small test-PDF (just 2 Lines in a Word-Document exported as PDF), and opened it in Notepad++.
    ....a bucket full of worms is a pretty sight compared to that.....

  10. #10
    Fanatic Member
    Join Date
    Jan 2015
    Posts
    596

    Re: Extract Text from PDF

    My solution for analysing PDFs and retrieving all the data in them uses a mix of MODI, Poppler, GS and Nicomsoft.

    It meets my goals very well.

    I just processed around 80 invoices with full OCR in around 10 minutes.

  11. #11

  12. #12

    Thread Starter
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,415

    Re: Extract Text from PDF

    Quote Originally Posted by Eduardo- View Post
    Here there is a PDF parser, IDK how well it works.
    Thx Ed.
    Will take a look and report back

  13. #13
    PowerPoster
    Join Date
    Feb 2017
    Posts
    4,995

    Re: Extract Text from PDF

    Quote Originally Posted by Zvoni View Post
    Thx Ed.
    Will take a look and report back

  14. #14

    Thread Starter
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,415

    Re: Extract Text from PDF

    Quote Originally Posted by Eduardo- View Post

    Reporting back.
    Not exactly what i'm looking for. Yes, i can get stuff like MetaData, but not the actual content.
    At least it's a starting point.

    Thx anyway

  15. #15
    PowerPoster
    Join Date
    Feb 2017
    Posts
    4,995

    Re: Extract Text from PDF

    Quote Originally Posted by Zvoni View Post
    Reporting back.
    Not exactly what i'm looking for. Yes, i can get stuff like MetaData, but not the actual content.
    At least it's a starting point.

    Thx anyway
    Ah, OK, thanks for the information.

  16. #16
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: Extract Text from PDF

    Quote Originally Posted by Zvoni View Post
    Not exactly what i'm looking for.
    What happened to your plan, to use the commandline-tool (PdfToText) from the link in your first posting?
    (Have looked at the Arguments you can pass along there, and they are quite impressive with regards to "influencing the format of the Text-Output").

    Olaf

  17. #17

    Thread Starter
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,415

    Re: Extract Text from PDF

    Quote Originally Posted by Schmidt View Post
    What happened to your plan, to use the commandline-tool (PdfToText) from the link in your first posting?
    (Have looked at the Arguments you can pass along there, and they are quite impressive with regards to "influencing the format of the Text-Output").

    Olaf

    Olaf,
    still testing, though i'll probably switch to pdftotext (instead of pdftoHTML), since i want to grab plain text.
    I'm just running tests with it (incl. the different output-layouts), firing it against multiple PDF's from that client to check if the Information i want to grab is always in the same position.
    If it is, then it will be enough for me (some idiot from the Customer changing the PDF-Layout notwithstanding).
    Currently, i'm doing the tests as well as writing down some kind of workflow and ideas (do i save the output somewhere in a local db, sanity-checks, how to inform the user if something fails, that kind of thing)
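
The "always in the same position" test described above could be automated along these lines. This is a sketch in Python for illustration only; it assumes the text files were produced with a layout-preserving mode, and the function names `label_position` and `positions_agree` as well as any label text passed in are hypothetical.

```python
def label_position(text, label):
    """Return (line_index, column) of the first occurrence of label, or None."""
    for line_no, line in enumerate(text.splitlines()):
        col = line.find(label)
        if col != -1:
            return (line_no, col)
    return None


def positions_agree(texts, label):
    """True if label sits at one identical position in every extracted text."""
    positions = {label_position(t, label) for t in texts}
    # Exactly one distinct position, and it must not be "not found".
    return len(positions) == 1 and None not in positions
```

Running this over a batch of the client's past PDFs for each label of interest would show quickly whether position-based parsing is safe, or whether searching for the label text is the more robust route.
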

    EDIT:
    Oh, and it's not a short-term project (as in: Must be done tomorrow. Anyone familiar with it? *g*)
    projected time-line Q1 + Q2 2022, so around summer 2022 is expected finish
