-
Jun 9th, 2021, 04:30 PM
#1
Thread Starter
Addicted Member
Problem parsing a pdf
I want to parse a PDF into a text file so I can use VB to read some data into tables.
I can do it with Adobe but the text file is messy.
I found this site:
https://products.aspose.app/pdf/parser
That seems to work better.
The text file is more readable. It looks like I can identify bookmarks in it for vb to find the data (words-numbers).
But the problem with this is the lines.
In Notepad the lines show up as line 1, line 2, line 3 and so on, but when I try to input those lines into VB they
don't come out well. Line 1 in VB is line 1 + line 2 + line 3, with similar trouble further down.
Is there something I can do about it?
-
Jun 9th, 2021, 04:54 PM
#2
Re: Problem parsing a pdf
Wild guess, but the file may be suffering from "Linux Disease" using LF line delimiters instead of Earth Standard CRLF or Old Apple CR.
VB's intrinsic Line Input statement will cope with either CRLF or CR for Old Apple compatibility.
The Script Runtime TextStream object handles CRLF and (I think) LF.
The ADODB.Stream object allows you to specify any of the 3 delimiter string options via the LineSeparator Property.
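As a rough illustration of the ADODB.Stream option mentioned above, the sketch below reads an LF-delimited file one line at a time. The file path and the "utf-8" charset are assumptions made for the example (they are not given in the thread), and the project is assumed to have a reference to the Microsoft ActiveX Data Objects library.

```vb
' Minimal sketch: read an LF-delimited text file line by line.
' Assumes a project reference to "Microsoft ActiveX Data Objects x.x Library".
Sub ReadLfLines()
    Dim stm As ADODB.Stream
    Set stm = New ADODB.Stream
    With stm
        .Type = adTypeText
        .Charset = "utf-8"            ' assumption: adjust to the file's real encoding
        .LineSeparator = adLF         ' the other options are adCR and adCRLF
        .Open
        .LoadFromFile "C:\data\parsed.txt"   ' hypothetical path
        Do Until .EOS
            ' adReadLine returns the text up to the next LineSeparator
            Debug.Print .ReadText(adReadLine)
        Loop
        .Close
    End With
End Sub
```

Setting the charset explicitly may also matter for non-Latin text, since the stream decodes the bytes rather than relying on the system code page.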
-
Jun 9th, 2021, 04:55 PM
#3
Thread Starter
Addicted Member
Re: Problem parsing a pdf
If I open the text file in VB as binary, it appears that where there should be a carriage return plus line feed ( Chr$(13) + Chr$(10) ) there is just Chr$(10).
So Notepad separates the lines but VB does not.
-
Jun 9th, 2021, 05:08 PM
#4
Thread Starter
Addicted Member
Re: Problem parsing a pdf
Originally Posted by dilettante
Wild guess, but the file may be suffering from "Linux Disease" using LF line delimiters instead of Earth Standard CRLF or Old Apple CR.
VB's intrinsic Line Input statement will cope with either CRLF or CR for Old Apple compatibility.
The Script Runtime TextStream object handles CRLF and (I think) LF.
The ADODB.Stream object allows you to specify any of the 3 delimiter string options via the LineSeparator Property.
I can make it read characters one by one.
Call the stream #1, then write w$ = Input(1, #1) and add w$ to the word if it is not Chr$(10), and so on.
Will that make it intolerably slow? I don't know at this moment.
Line Input doesn't cope, as I said. Any other way?
-
Jun 9th, 2021, 08:16 PM
#5
Thread Starter
Addicted Member
Re: Problem parsing a pdf
That was simple to resolve really.
You just use the Replace function to convert LF to CRLF.
But I'm unhappy. The supposedly good online parser is worse than Adobe in the end: it makes many mistakes and puts things in random order.
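The Replace fix described in this post could be sketched as follows. The function name NormalizeToCrLf is hypothetical. The first two Replace calls are an added precaution, not part of the original fix: without them, a file that already contains some CRLF pairs would end up with doubled carriage returns.

```vb
' Minimal sketch: convert any mix of LF, CR and CRLF line endings to CRLF.
Function NormalizeToCrLf(ByVal s As String) As String
    s = Replace(s, vbCrLf, vbLf)   ' collapse existing CRLF pairs to LF first
    s = Replace(s, vbCr, vbLf)     ' treat any lone CR as a line break too
    NormalizeToCrLf = Replace(s, vbLf, vbCrLf)
End Function
```

Once the text has been normalized and written back to disk, Line Input should see one line per read.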
-
Jun 9th, 2021, 08:45 PM
#6
Re: Problem parsing a pdf
I suspect that's a risk inherent in trying to scrape data from PDFs. For one thing it is meant as a "dead" format: hard, published, read-only data. For another, that gives them the freedom to place text blobs into the file pretty much at random if the author chooses to.
For reliability you might have to render PDF pages as images and then OCR them. Lots of work and that introduces another source of error.
Are you sure you can't get access to the legitimate source data?
-
Jun 9th, 2021, 10:29 PM
#7
Re: Problem parsing a pdf
I wrote my own PDF recognition and OCR tool by combining Document Imaging, Ghostscript and Poppler.
The OCR is correct more than 90% of the time.
It is very well implemented.
You could do the same, of course depending on the type of PDF to recognize, the language, etc.
-
Jun 9th, 2021, 10:34 PM
#8
Thread Starter
Addicted Member
Re: Problem parsing a pdf
Originally Posted by dilettante
I suspect that's a risk inherent in trying to scrape data from PDFs. For one thing it is meant as a "dead" format: hard, published, read-only data. For another, that gives them the freedom to place text blobs into the file pretty much at random if the author chooses to.
For reliability you might have to render PDF pages as images and then OCR them. Lots of work and that introduces another source of error.
Are you sure you can't get access to the legitimate source data?
The data may exist in a simple text file somewhere else, but not all of it, I think.
Also the language is Greek and most online PDF services won't understand Greek, though the one I mentioned did.
Adobe loses some of the data because I can't get it to spot the items correctly: the positions are somewhat randomized.
So you mean that if the person who writes the PDF places the text in the order c-b-a rather than a-b-c, it will affect things?
If I OCR them, will it work better?
-
Jun 9th, 2021, 10:38 PM
#9
Re: Problem parsing a pdf
Originally Posted by johnywalker
I can make it read characters one by one.
Call the stream #1, then write w$ = Input(1, #1) and add w$ to the word if it is not Chr$(10), and so on.
Will that make it intolerably slow? I don't know at this moment.
Line Input doesn't cope, as I said. Any other way?
Read a large chunk or, if the file is not too huge, read the entire file at once into a variable and then use Split() on the linefeed to generate an array. Each element will contain one line from the file. You can then loop through the array for processing or for writing to an output file using the Print # statement.
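A minimal sketch of that read-everything-then-Split approach is below. The function name ReadAllLines and its path argument are hypothetical. It also assumes the file is ANSI text in the system code page, because Get converts the bytes to a VB String that way; a UTF-8 file would need a different loader.

```vb
' Minimal sketch: load a whole text file and split it into lines on LF.
' Assumes ANSI text in the system code page (see the note above).
Function ReadAllLines(ByVal path As String) As String()
    Dim f As Integer
    Dim s As String

    f = FreeFile
    Open path For Binary Access Read As #f
    s = Space$(LOF(f))      ' size the buffer to the file length
    Get #f, , s             ' read the whole file in one call
    Close #f

    s = Replace(s, vbCrLf, vbLf)   ' tolerate files that already use CRLF
    ReadAllLines = Split(s, vbLf)
End Function
```

The result can then be walked with a loop from LBound to UBound, handling one line per element.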
-
Jun 9th, 2021, 10:49 PM
#10
Thread Starter
Addicted Member
Re: Problem parsing a pdf
Originally Posted by DataMiser
Read a large chunk or, if the file is not too huge, read the entire file at once into a variable and then use Split() on the linefeed to generate an array. Each element will contain one line from the file. You can then loop through the array for processing or for writing to an output file using the Print # statement.
That was a simple problem of the text file format.
But the parsing was not good.
-
Jun 10th, 2021, 02:21 AM
#11
Re: Problem parsing a pdf
Originally Posted by johnywalker
I want to parse a pdf into text file...
You might want to do that via pdfium: https://www.vbforums.com/showthread....-ImageExports)
HTH
Olaf
-
Jun 10th, 2021, 02:34 AM
#12
Re: Problem parsing a pdf
Originally Posted by johnywalker
I can make it read characters one by one.
Call the stream #1, then write w$ = Input(1, #1) and add w$ to the word if it is not Chr$(10), and so on.
Will that make it intolerably slow? I don't know at this moment.
Line Input doesn't cope, as I said. Any other way?
Unless the file is very large, I would suggest you simply load it into a String or Byte array and then parse that. Reading characters one by one from a file is much slower than reading the file all at once.
-
Jun 10th, 2021, 02:39 AM
#13
Re: Problem parsing a pdf
That issue is moot if the text is too scrambled to be useful.
-
Jun 10th, 2021, 02:55 AM
#14
Re: Problem parsing a pdf
@johnywalker:
I just did some checking, apparently PDF's format is public, did you search for any documentation?
@dilettante:
Okay, I still think it's worth mentioning regarding good programming practices in general.
-
Jun 10th, 2021, 03:12 AM
#15
Re: Problem parsing a pdf
"Slurp and split" has never been good practice, but the amount of data is likely so small it isn't an issue and it seems unlikely this will ever be server-side code or a scheduled task. Of course that makes alternatives make even more sense.
-
Jun 11th, 2021, 04:36 AM
#16
Re: Problem parsing a pdf
you can poke through the internal structure of the pdf document and the raw stream data with a tool such as this (open source vb6)
http://sandsprite.com/blogs/index.php?uid=7&pid=57
streams can have different encodings and compressions applied to them, so raw parsing of the pdf document itself is not recommended unless a very simple generator was used to produce them that ignored these features.
depending on how the documents were made the raw stream data can be a pure mess such as those with tables or with text formatting.
there are some command line apps which can extract pages for you like pdfbox which includes
extractText.exe -startPage 1 -endPage 99 [file.pdf] [out_path]
It's generally a messy deal with formatting though. iTextSharp (.NET) also produces messy text extractions.
The render/OCR route might be the best path if you don't have access to the original data in any other way.
-
Jun 11th, 2021, 05:04 PM
#17
Thread Starter
Addicted Member
Re: Problem parsing a pdf
Originally Posted by dz32
My browser guard reports it as a trojan.
-
Jun 11th, 2021, 05:45 PM
#18
Re: Problem parsing a pdf
I literally can't keep 60+ AV programs happy. It does contain tools for analyzing shellcode and detecting PDF exploits (the primary purpose of its creation).
Just run it in a VM or compile it from source.
File: PDFStreamDumper_Setup.exe
Size: 3797442
MD5: 3AC32A72F85C543A25A5152C671A701D
Scan Date: 2021-06-11 19:12:21
Detections: 1/68
https://www.virustotal.com/gui/file/...7599/detection
Last edited by dz32; Jun 11th, 2021 at 05:50 PM.