Problem parsing a pdf

**johnywalker** · Jun 9th, 2021, 04:30 PM

I want to parse a PDF into a text file so I can use VB to read some data into tables.
I can do it with Adobe, but the text file is messy.

I found this site:

https://products.aspose.app/pdf/parser

That seems to work better.
The text file is more readable. It looks like I can identify markers in it that VB can use to find the data (words and numbers).

But the problem with this is the lines.
Notepad shows line 1, line 2, line 3 and so on, but when I try to read those lines into VB they don't come out well: line 1 in VB is line 1 + line 2 + line 3, with similar trouble further down.

Is there something I can do about it?

**dilettante** · Jun 9th, 2021, 04:54 PM

Wild guess, but the file may be suffering from "Linux Disease" using LF line delimiters instead of Earth Standard CRLF or Old Apple CR.

VB's intrinsic Line Input statement will cope with either CRLF or CR for Old Apple compatibility.

The Script Runtime TextStream object handles CRLF and (I think) LF.

The ADODB.Stream object allows you to specify any of the 3 delimiter string options via the LineSeparator Property.
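
As a rough illustration of the ADODB.Stream option, here is a minimal VB6 sketch that reads an LF-delimited file one line at a time. The procedure name `ReadLfLines` is made up for this example, the `"utf-8"` charset is only an assumption about the file, and late binding is used so the ADO constants are declared by hand.

```vb
' Minimal sketch: read an LF-delimited text file line by line with ADODB.Stream.
' Late binding is used, so the ADO enum values are declared locally.
Private Const adTypeText As Long = 2
Private Const adLF As Long = 10
Private Const adReadLine As Long = -2

' ReadLfLines is a hypothetical helper name, not part of ADO.
Public Sub ReadLfLines(ByVal FileName As String)
    Dim Stm As Object
    Dim TextLine As String

    Set Stm = CreateObject("ADODB.Stream")
    With Stm
        .Type = adTypeText
        .Charset = "utf-8"        ' assumption: change to the file's real encoding
        .LineSeparator = adLF     ' treat a bare LF as the end of a line
        .Open
        .LoadFromFile FileName
        Do Until .EOS
            TextLine = .ReadText(adReadLine)
            Debug.Print TextLine  ' replace with the real per-line processing
        Loop
        .Close
    End With
End Sub
```

Setting `Charset` explicitly matters for non-Latin text, because the stream decodes the bytes before the lines are handed back as VB strings.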

**johnywalker** · Jun 9th, 2021, 04:55 PM

If I open the text file in VB as binary, it appears that where there should be a carriage return + line feed ( Chr$(13) + Chr$(10) ) there is just Chr$(10).
So Notepad separates the lines but VB doesn't.

**johnywalker** · Jun 9th, 2021, 05:08 PM

Originally Posted by dilettante

> Wild guess, but the file may be suffering from "Linux Disease" using LF line delimiters instead of Earth Standard CRLF or Old Apple CR.
>
> VB's intrinsic Line Input statement will cope with either CRLF or CR for Old Apple compatibility.
>
> The Script Runtime TextStream object handles CRLF and (I think) LF.
>
> The ADODB.Stream object allows you to specify any of the 3 delimiter string options via the LineSeparator Property.

I can make it read characters one by one.
Open the file as #1, then do w$ = Input(1, 1) and append w$ to the current word unless it is Chr$(10), and so on.
Will that make it intolerably slow? I don't know at this moment.
Line Input doesn't cope, as I say. Any other way?

**johnywalker** · Jun 9th, 2021, 08:16 PM

That was simple to resolve really.
You just use the replace function, LF to CRLF.
But I'm unhappy. The supposedly good online parser is worse than Adobe in the end: it makes many mistakes and puts things in random order.
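
For reference, the LF-to-CRLF replacement described above might look like the following VB6 sketch. The function name `NormalizeToCrLf` is made up for this example. It first collapses any CRLF pairs that are already present, on the assumption that the input could contain a mix of both styles.

```vb
' Minimal sketch: normalise line endings in a string to CRLF.
' NormalizeToCrLf is a hypothetical helper name.
Public Function NormalizeToCrLf(ByVal Source As String) As String
    ' Collapse any existing CRLF first so it does not turn into CR + CR + LF.
    Source = Replace(Source, vbCrLf, vbLf)
    ' Now every line break is a bare LF and can be expanded safely.
    NormalizeToCrLf = Replace(Source, vbLf, vbCrLf)
End Function
```

Once the text has been rewritten to a file this way, the ordinary Line Input statement should see one line per line.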

**dilettante** · Jun 9th, 2021, 08:45 PM

I suspect that's a risk inherent in trying to scrape data from PDFs. For one thing it is meant as a "dead" format: hard, published, read-only data. For another, that gives them the freedom to place text blobs into the file pretty much at random if the author chooses to.

For reliability you might have to render PDF pages as images and then OCR them. Lots of work and that introduces another source of error.

Are you sure you can't get access to the legitimate source data?

**Thierry69** · Jun 9th, 2021, 10:29 PM

I wrote my own PDF recognition and OCR tool by combining Document Imaging, Ghostscript and Poppler.
The OCR accuracy is better than 90%.
It is very well implemented.

You could do the same, depending of course on the type of PDF to recognize, the language, etc.

**johnywalker** · Jun 9th, 2021, 10:34 PM

Originally Posted by dilettante

> I suspect that's a risk inherent in trying to scrape data from PDFs. For one thing it is meant as a "dead" format: hard, published, read-only data. For another, that gives them the freedom to place text blobs into the file pretty much at random if the author chooses to.
>
> For reliability you might have to render PDF pages as images and then OCR them. Lots of work and that introduces another source of error.
>
> Are you sure you can't get access to the legitimate source data?

The data may exist in a simple text file somewhere else, but not all of it, I think.
Also, the language is Greek and most online PDF services won't understand Greek, though the one I mentioned did.
Adobe loses some of the data because I can't get it to spot the items correctly; the positions are somewhat randomized.
So you mean that if the person who creates the PDF places the text in the order c-b-a rather than a-b-c, it will affect things?
If I OCR the pages, will it work better?

**DataMiser** · Jun 9th, 2021, 10:38 PM

Originally Posted by johnywalker

> I can make it read characters one by one.
> Open the file as #1, then do w$ = Input(1, 1) and append w$ to the current word unless it is Chr$(10), and so on.
> Will that make it intolerably slow? I don't know at this moment.
> Line Input doesn't cope, as I say. Any other way?

Read a large chunk or, if the file is not too huge, read the entire file at once into a variable, then use Split() on the linefeed to generate an array. Each element will contain one line from the file. You can then loop through the array for processing or for writing to an output file using the Print # statement.
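
A minimal VB6 sketch of that read, Split() and Print # approach follows. The procedure name `SplitAndRewrite` and its parameters are made up for this example, and reading into a String this way assumes the file is ANSI text in the system code page.

```vb
' Minimal sketch: load a whole file, split it on LF, and write each line out again.
' SplitAndRewrite, InFile and OutFile are hypothetical names.
Public Sub SplitAndRewrite(ByVal InFile As String, ByVal OutFile As String)
    Dim FileNum As Integer
    Dim Buffer As String
    Dim TextLines() As String
    Dim i As Long

    ' Read the entire file in one Get call.
    FileNum = FreeFile
    Open InFile For Binary Access Read As #FileNum
    Buffer = Space$(LOF(FileNum))   ' assumption: ANSI text, small enough for memory
    Get #FileNum, , Buffer
    Close #FileNum

    ' One array element per line.
    TextLines = Split(Buffer, vbLf)

    FileNum = FreeFile
    Open OutFile For Output As #FileNum
    For i = LBound(TextLines) To UBound(TextLines)
        ' Drop a trailing CR in case some lines were already CRLF-terminated.
        If Right$(TextLines(i), 1) = vbCr Then
            TextLines(i) = Left$(TextLines(i), Len(TextLines(i)) - 1)
        End If
        Print #FileNum, TextLines(i)   ' Print # ends each line with CRLF
    Next i
    Close #FileNum
End Sub
```

If the file ends with a line feed, the last array element is an empty string, so the loop may need to skip it depending on what the processing expects.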

**johnywalker** · Jun 9th, 2021, 10:49 PM

Originally Posted by DataMiser

> Read a large chunk or, if the file is not too huge, read the entire file at once into a variable, then use Split() on the linefeed to generate an array. Each element will contain one line from the file. You can then loop through the array for processing or for writing to an output file using the Print # statement.

That was a simple problem of the text file format.
But the parsing was not good.

**Schmidt** · Jun 10th, 2021, 02:21 AM

Originally Posted by johnywalker

> I want to parse a pdf into text file...

You might want to do that via pdfium: https://www.vbforums.com/showthread....-ImageExports)

HTH

Olaf

**Peter Swinkels** · Jun 10th, 2021, 02:34 AM

Originally Posted by johnywalker

> I can make it read characters one by one.
> Open the file as #1, then do w$ = Input(1, 1) and append w$ to the current word unless it is Chr$(10), and so on.
> Will that make it intolerably slow? I don't know at this moment.
> Line Input doesn't cope, as I say. Any other way?

Unless the file is very large, I would suggest you simply load it into a String or Byte array and then parse that. Reading characters one by one from a file is much slower than reading the file all at once.

**dilettante** · Jun 10th, 2021, 02:39 AM

That issue is moot if the text is too scrambled to be useful.

**Peter Swinkels** · Jun 10th, 2021, 02:55 AM

@johnywalker:
I just did some checking; apparently the PDF format is publicly documented. Did you search for any documentation?

@dilettante:
Okay, I still think it's worth mentioning regarding good programming practices in general.

**dilettante** · Jun 10th, 2021, 03:12 AM

"Slurp and split" has never been good practice, but the amount of data is likely so small it isn't an issue and it seems unlikely this will ever be server-side code or a scheduled task. Of course that makes alternatives make even more sense.

**dz32** · Jun 11th, 2021, 04:36 AM

you can poke through the internal structure of the pdf document and the raw stream data with a tool such as this (open source vb6)

http://sandsprite.com/blogs/index.php?uid=7&pid=57

streams can have different encodings and compressions applied to them, so raw parsing of the pdf document itself is not recommended unless a very simple generator was used to produce them that ignored these features.

depending on how the documents were made the raw stream data can be a pure mess such as those with tables or with text formatting.

there are some command line apps which can extract pages for you like pdfbox which includes

extractText.exe -startPage 1 -endPage 99 [file.pdf] [out_path]

It's generally a messy deal with formatting though. iTextSharp (.NET) also has messy extractions as text.

The render/OCR route might be the best path if you don't have access to the original data in any other way.

**johnywalker** · Jun 11th, 2021, 05:04 PM

Originally Posted by dz32

> http://sandsprite.com/blogs/index.php?uid=7&pid=57

Browser guard reports it as a trojan.

**dz32** · Jun 11th, 2021, 05:45 PM

literally can't keep 60+ av programs happy. It does contain tools for analyzing shellcode and detecting pdf exploits (the primary purpose it was created for)

just run it in a VM or compile from source.

File: PDFStreamDumper_Setup.exe
Size: 3797442
MD5: 3AC32A72F85C543A25A5152C671A701D
Scan Date: 2021-06-11 19:12:21
Detections: 1/68

https://www.virustotal.com/gui/file/...7599/detection
