
Thread: Problem parsing a pdf

  1. #1

    Thread Starter
    Addicted Member
    Join Date
    Jan 2009
    Posts
    231

    Problem parsing a pdf

    I want to parse a PDF into a text file so I can use VB to read some data into tables.
    I can do it with Adobe, but the resulting text file is messy.

    I found this site:

    https://products.aspose.app/pdf/parser

    That seems to work better.
    The text file is more readable. It looks like I can identify bookmarks in it for vb to find the data (words-numbers).

    But the problem with this is the line breaks.
    In Notepad the lines show up separately (line 1, line 2, line 3, ...), but when I read those lines into VB they
    don't come out right: line 1 in VB is line 1 + line 2 + line 3, with similar trouble further down.

    Is there something I can do about it?

  2. #2
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Problem parsing a pdf

    Wild guess, but the file may be suffering from "Linux Disease" using LF line delimiters instead of Earth Standard CRLF or Old Apple CR.

    VB's intrinsic Line Input statement will cope with either CRLF or CR for Old Apple compatibility.

    The Script Runtime TextStream object handles CRLF and (I think) LF.

    The ADODB.Stream object allows you to specify any of the 3 delimiter string options via the LineSeparator Property.
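
    A minimal, untested sketch of the ADODB.Stream option. It is late-bound, so the ADO enum values are written out as local constants, and it assumes the file is UTF-8, which may not match what the parser actually produces. The Sub name ReadLfLines is only illustrative.

    ```vb
    ' Sketch: read an LF-delimited text file line by line with ADODB.Stream.
    Public Sub ReadLfLines(ByVal FilePath As String)
        ' Values of the ADO enums, declared locally because of late binding.
        Const adTypeText As Long = 2
        Const adLF As Long = 10
        Const adReadLine As Long = -2

        Dim stm As Object
        Dim lineText As String

        Set stm = CreateObject("ADODB.Stream")
        With stm
            .Type = adTypeText
            .Charset = "utf-8"        ' assumption: change to the file's real encoding
            .LineSeparator = adLF     ' treat a bare LF as the end of a line
            .Open
            .LoadFromFile FilePath
            Do Until .EOS
                lineText = .ReadText(adReadLine)
                Debug.Print lineText  ' replace with the real per-line processing
            Loop
            .Close
        End With
    End Sub
    ```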

  3. #3

    Thread Starter
    Addicted Member
    Join Date
    Jan 2009
    Posts
    231

    Re: Problem parsing a pdf

    If I open the text file in VB as binary, it appears that where there should be a carriage return + line feed (Chr$(13) + Chr$(10)) there is just Chr$(10).
    So Notepad separates the lines but VB doesn't.

  4. #4

    Thread Starter
    Addicted Member
    Join Date
    Jan 2009
    Posts
    231

    Re: Problem parsing a pdf

    Originally posted by dilettante:
    Wild guess, but the file may be suffering from "Linux Disease" using LF line delimiters instead of Earth Standard CRLF or Old Apple CR.

    VB's intrinsic Line Input statement will cope with either CRLF or CR for Old Apple compatibility.

    The Script Runtime TextStream object handles CRLF and (I think) LF.

    The ADODB.Stream object allows you to specify any of the 3 delimiter string options via the LineSeparator Property.

    I can make it read characters one by one.
    Open the file as #1, then write w$ = Input(1, #1) and add w$ to the current word if it is not Chr$(10), and so on.
    Will that make it intolerably slow? I don't know at this moment.
    Line Input doesn't cope, as I said. Any other way?
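
    Roughly, the character-by-character idea would look like the sketch below. It is untested; the Sub name ReadCharByChar is made up for illustration, and it assumes a single-byte (ANSI) text file. The file is opened For Binary so that Loc and LOF can be used to detect the end of the file.

    ```vb
    ' Sketch: build lines one character at a time, breaking on Chr$(10).
    Public Sub ReadCharByChar(ByVal FilePath As String)
        Dim f As Integer
        Dim ch As String
        Dim lineText As String

        f = FreeFile
        Open FilePath For Binary Access Read As #f
        Do While Loc(f) < LOF(f)
            ch = Input$(1, #f)            ' read a single character
            Select Case ch
                Case vbLf                 ' Chr$(10): the line is complete
                    Debug.Print lineText
                    lineText = vbNullString
                Case vbCr                 ' skip Chr$(13) in case some lines are CRLF
                Case Else
                    lineText = lineText & ch
            End Select
        Loop
        ' Emit the last line if the file does not end with a line feed.
        If Len(lineText) > 0 Then Debug.Print lineText
        Close #f
    End Sub
    ```

    Each Input$ call and each string concatenation has a cost, so this is likely to be noticeably slower than reading the file in one go.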

  5. #5

    Thread Starter
    Addicted Member
    Join Date
    Jan 2009
    Posts
    231

    Re: Problem parsing a pdf

    That was simple to resolve, really.
    You just use the Replace function to change LF to CRLF.
    But I'm unhappy. The supposedly good online parser is worse than Adobe in the end: it makes many mistakes and puts things in random order.
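
    For reference, the Replace fix can be sketched like this. It is untested, the Sub name ConvertLfToCrLf and both path parameters are illustrative, and it assumes a single-byte (ANSI) file that fits comfortably in memory.

    ```vb
    ' Sketch: rewrite a file with bare LF line endings as CRLF so that
    ' Line Input can read it one line at a time.
    Public Sub ConvertLfToCrLf(ByVal InPath As String, ByVal OutPath As String)
        Dim f As Integer
        Dim buf As String

        ' Read the whole file into one string.
        f = FreeFile
        Open InPath For Binary Access Read As #f
        buf = Space$(LOF(f))
        Get #f, , buf
        Close #f

        ' Collapse any existing CRLF to LF first so nothing ends up as CR CR LF.
        buf = Replace(buf, vbCrLf, vbLf)
        buf = Replace(buf, vbLf, vbCrLf)

        ' Output mode truncates an existing file; the trailing semicolon
        ' stops Print # from appending its own line break.
        f = FreeFile
        Open OutPath For Output As #f
        Print #f, buf;
        Close #f
    End Sub
    ```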

  6. #6
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Problem parsing a pdf

    I suspect that's a risk inherent in trying to scrape data from PDFs. For one thing, PDF is meant as a "dead" format: hard, published, read-only data. For another, that gives authors the freedom to place text blobs into the file pretty much at random if they choose to.

    For reliability you might have to render PDF pages as images and then OCR them. Lots of work and that introduces another source of error.

    Are you sure you can't get access to the legitimate source data?

  7. #7
    Fanatic Member
    Join Date
    Jan 2015
    Posts
    596

    Re: Problem parsing a pdf

    I wrote my own PDF text recognition (OCR) by combining Document Imaging, Ghostscript & Poppler.
    The OCR accuracy is better than 90%.
    Very well implemented.

    You could do the same, of course depending on the type of PDF to recognize, the language, etc.

  8. #8

    Thread Starter
    Addicted Member
    Join Date
    Jan 2009
    Posts
    231

    Re: Problem parsing a pdf

    Originally posted by dilettante:
    I suspect that's a risk inherent in trying to scrape data from PDFs. For one thing, PDF is meant as a "dead" format: hard, published, read-only data. For another, that gives authors the freedom to place text blobs into the file pretty much at random if they choose to.

    For reliability you might have to render PDF pages as images and then OCR them. Lots of work and that introduces another source of error.

    Are you sure you can't get access to the legitimate source data?

    The data may exist as simple text files somewhere else, but not all of it, I think.
    Also the language is Greek and most online PDF services won't understand Greek, though the one I mentioned did.
    Adobe loses some because I can't get it to spot them correctly; the positions are randomized somewhat.
    So you mean if the guy who writes the PDF places his text in some order c-b-a rather than a-b-c, it will affect things?
    If I OCR them, will it work better?

  9. #9
    PowerPoster
    Join Date
    Feb 2012
    Location
    West Virginia
    Posts
    14,205

    Re: Problem parsing a pdf

    Originally posted by johnywalker:
    I can make it read characters one by one.
    Open the file as #1, then write w$ = Input(1, #1) and add w$ to the current word if it is not Chr$(10), and so on.
    Will that make it intolerably slow? I don't know at this moment.
    Line Input doesn't cope, as I said. Any other way?

    Read a large chunk or, if the file is not too huge, read the entire file at once into a variable, then use Split() on the linefeed to generate an array. Each element will contain one line from the file. You can then loop through the array for processing or writing to an output file using the Print # statement.
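
    Something along these lines, as an untested sketch. The names ReadAllLines and DumpLines and the file path are made up for illustration, and it assumes a single-byte (ANSI) file small enough to hold in memory.

    ```vb
    ' Sketch: load the whole file, then Split on LF to get one element per line.
    Public Function ReadAllLines(ByVal FilePath As String) As String()
        Dim f As Integer
        Dim buf As String

        f = FreeFile
        Open FilePath For Binary Access Read As #f
        buf = Space$(LOF(f))     ' size the buffer to the file length
        Get #f, , buf            ' one read for the entire file
        Close #f

        ' Dropping CR first makes this work for LF and CRLF files alike.
        ReadAllLines = Split(Replace(buf, vbCr, vbNullString), vbLf)
    End Function

    ' Usage: loop through the array and process each line.
    Public Sub DumpLines()
        Dim lines() As String
        Dim i As Long

        lines = ReadAllLines("C:\data\parsed.txt")   ' hypothetical path
        For i = LBound(lines) To UBound(lines)
            Debug.Print lines(i)
        Next i
    End Sub
    ```

    If the file ends with a line feed, the last array element will be an empty string.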

  10. #10

    Thread Starter
    Addicted Member
    Join Date
    Jan 2009
    Posts
    231

    Re: Problem parsing a pdf

    Originally posted by DataMiser:
    Read a large chunk or, if the file is not too huge, read the entire file at once into a variable, then use Split() on the linefeed to generate an array. Each element will contain one line from the file. You can then loop through the array for processing or writing to an output file using the Print # statement.

    That was a simple problem of the text file format.
    But the parsing was not good.

  11. #11
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,253

    Re: Problem parsing a pdf

    Originally posted by johnywalker:
    I want to parse a PDF into a text file...

    You might want to do that via pdfium: https://www.vbforums.com/showthread....-ImageExports)

    HTH

    Olaf

  12. #12
    Frenzied Member
    Join Date
    Feb 2003
    Posts
    1,807

    Re: Problem parsing a pdf

    Originally posted by johnywalker:
    I can make it read characters one by one.
    Open the file as #1, then write w$ = Input(1, #1) and add w$ to the current word if it is not Chr$(10), and so on.
    Will that make it intolerably slow? I don't know at this moment.
    Line Input doesn't cope, as I said. Any other way?

    Unless the file is very large, I would suggest you simply load it into a String or Byte array and then parse that. Reading characters one by one from a file is much slower than just reading the file all at once.
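
    As a rough, untested illustration of the Byte array variant, the sketch below loads the file with a single Get and then scans the bytes in memory for line feeds. The function name CountLfLines is hypothetical, and counting lines merely stands in for whatever parsing is really needed.

    ```vb
    ' Sketch: load the file into a Byte array in one Get, then scan for LF (10).
    Public Function CountLfLines(ByVal FilePath As String) As Long
        Dim f As Integer
        Dim bytes() As Byte
        Dim i As Long
        Dim n As Long

        f = FreeFile
        Open FilePath For Binary Access Read As #f
        n = LOF(f)
        If n > 0 Then
            ReDim bytes(0 To n - 1)
            Get #f, , bytes          ' one read for the whole file
        End If
        Close #f

        ' Scan in memory; the loop body never runs for an empty file.
        For i = 0 To n - 1
            If bytes(i) = 10 Then CountLfLines = CountLfLines + 1
        Next i

        ' A last line without a trailing LF still counts as a line.
        If n > 0 Then
            If bytes(n - 1) <> 10 Then CountLfLines = CountLfLines + 1
        End If
    End Function
    ```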

  13. #13
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Problem parsing a pdf

    That issue is moot if the text is too scrambled to be useful.

  14. #14
    Frenzied Member
    Join Date
    Feb 2003
    Posts
    1,807

    Re: Problem parsing a pdf

    @johnywalker:
    I just did some checking: apparently the PDF format is publicly documented. Did you search for any documentation?

    @dilettante:
    Okay, I still think it's worth mentioning regarding good programming practices in general.

  15. #15
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Problem parsing a pdf

    "Slurp and split" has never been good practice, but the amount of data is likely so small it isn't an issue and it seems unlikely this will ever be server-side code or a scheduled task. Of course that makes alternatives make even more sense.

  16. #16
    Frenzied Member
    Join Date
    Jun 2015
    Posts
    1,068

    Re: Problem parsing a pdf

    You can poke through the internal structure of the PDF document and the raw stream data with a tool such as this (open source VB6):

    http://sandsprite.com/blogs/index.php?uid=7&pid=57

    Streams can have different encodings and compressions applied to them, so raw parsing of the PDF document itself is not recommended unless a very simple generator that ignored these features was used to produce it.

    Depending on how the documents were made, the raw stream data can be a pure mess, such as in documents with tables or text formatting.

    There are some command line apps which can extract pages for you, like pdfbox, which includes

    extractText.exe -startPage 1 -endPage 99 [file.pdf] [out_path]

    It's generally a messy deal with formatting though. iTextSharp (.NET) also has messy extractions as text.

    The render/OCR route might be the best path if you don't have access to the original data in any other way.

  17. #17

    Thread Starter
    Addicted Member
    Join Date
    Jan 2009
    Posts
    231

    Re: Problem parsing a pdf


    Browser guard reports as trojan.

  18. #18
    Frenzied Member
    Join Date
    Jun 2015
    Posts
    1,068

    Re: Problem parsing a pdf

    I literally can't keep 60+ AV programs happy. It does contain tools for analyzing shellcode and detecting PDF exploits (the primary purpose of its creation).

    Just run it in a VM or compile from source.

    File: PDFStreamDumper_Setup.exe
    Size: 3797442
    MD5: 3AC32A72F85C543A25A5152C671A701D
    Scan Date: 2021-06-11 19:12:21
    Detections: 1/68

    https://www.virustotal.com/gui/file/...7599/detection
    Last edited by dz32; Jun 11th, 2021 at 05:50 PM.
