-
Mar 16th, 2015, 08:03 PM
#1
Thread Starter
Hyperactive Member
PDF to Text parser
I'm working on automating data entry by parsing PDF files to text, using a script I found online:
http://www.codeproject.com/Articles/...m-PDF-in-C-NET
I tried a few others as well. They all work and generate text files, but the output is messy and not in the order the text appears on the page,
which makes it impossible to read and process programmatically.
Any other suggestions?
Thanks in advance.
-
Mar 16th, 2015, 08:35 PM
#2
Re: PDF to Text parser
PDF is a complex format, and converting it to text will almost always produce messy output. Are there specific pieces of information (fields, etc.) you're looking to change? It would be easier to target specific structures.
There is an SDK for this purpose: http://www.pdfonline.com/easypdf/sdk/sample_code.htm. Perhaps that would be a better option.
- If my post helped you, please Rate it
- If your problem is solved please also mark the thread resolved
I use VS2015 (unless otherwise stated).
_________________________________________________________________________________
B.Sc(Hons), AUS.P, C.Eng, MIET, MIEEE, MBCS / MCSE+Sec, MCSA+Sec, MCP, A+, Net+, Sec+, MCIWD, CIWP, CIWA
I wrote my very first program in 1979, using machine code on a mechanical Olivetti teletype connected to an 8-bit, 78 instruction, 1MHz, Motorola 6800 multi-user system with 2k of memory. Using Windows, I dont think my situation has improved.
-
Mar 16th, 2015, 09:07 PM
#3
Thread Starter
Hyperactive Member
Re: PDF to Text parser
Unfortunately no, different clients will have different templates. If I can get the text in order, I can then set up a pre-configured layout for each client.
-
Mar 17th, 2015, 05:18 AM
#4
Re: PDF to Text parser
I'm not sure I understand your requirement. Do you have a set of PDF files that you want to extract information from and then use for some other purpose? Or do you want to open the PDF files and put additional information into them in certain places?
I tried the reference that you mentioned, and it output a plain text file in text order (at least for the file I tried, which was my phone manual).
-
Mar 17th, 2015, 06:09 AM
#5
Hyperactive Member
Re: PDF to Text parser
All PDF files use the same structure for text. I can't remember it exactly, but it is something like:
0 0 0 RG - for the colour of the text
0 0 TD - for the position of the text, using chart-like coordinates
/F1 sample text Tj - for the font and text within the document
I can check this for you later if you want the exact example.
-
Mar 17th, 2015, 08:14 AM
#6
Re: PDF to Text parser
This is already done for you in the iTextSharp library, which is what I use. I modified it slightly to extract text on the same line according to a certain criterion (i.e. pieces of text whose tops start within x pixels above or below each other are classified as the same line).
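The same-line criterion can be sketched roughly as follows. This is not iTextSharp code: it is a minimal Python illustration that assumes the extraction step has already produced text fragments as (x, y, text) tuples. The tuple shape, the group_into_lines name and the y_tolerance parameter are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical fragment shape: (x, y, text), in PDF units with y growing upward.
Fragment = Tuple[float, float, str]


def group_into_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Cluster fragments with similar y positions, then order each line left to right."""
    lines: List[List[Fragment]] = []
    # Sort by descending y so the page is read from top to bottom.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Compare against the first fragment of the current line.
        if lines and abs(lines[-1][0][1] - frag[1]) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, sort by x and join the pieces with spaces.
    return [
        " ".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
        for line in lines
    ]


if __name__ == "__main__":
    sample = [(200, 700.5, "Total"), (50, 701, "Invoice"), (50, 680, "Item"), (300, 679, "9.99")]
    for line in group_into_lines(sample):
        print(line)
```

The tolerance would probably need tuning per client template, since font sizes and line spacing differ between documents.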
-
Mar 17th, 2015, 05:58 PM
#7
Hyperactive Member
Re: PDF to Text parser
Below is the correct version:
BT
/F1 16 Tf - font and size
25 795 Td - position
(your string here) Tj - text value
ET
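To show how those operators could be read back programmatically, here is a minimal Python sketch that pulls the font, position and string out of a BT ... ET block. It assumes the content stream is already decompressed and only uses Tf, Td and Tj with simple literal strings. Real PDFs often compress streams and use other operators (Tm, TJ and so on) and custom font encodings, so this is an illustration rather than a working parser. The function name extract_text_runs and the returned dictionary keys are made up for the example.

```python
import re

# Tokens: literal strings "(...)", names "/F1", numbers, and operators such as BT, Tf, Tj.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|/[^\s()/\[\]<>]+|[-+]?\d*\.?\d+|[A-Za-z'\"*]+")


def extract_text_runs(stream: str):
    """Return one dict per Tj operator found inside BT ... ET (simplified)."""
    runs = []
    operands = []            # operands seen since the last operator
    font, size = None, None
    x = y = 0.0
    in_text = False
    for tok in TOKEN.findall(stream):
        if tok[0] in "(/" or tok[0] in "+-." or tok[0].isdigit():
            operands.append(tok)  # operand: string, name or number
            continue
        # Anything else is treated as an operator.
        if tok == "BT":
            in_text, x, y = True, 0.0, 0.0
        elif tok == "ET":
            in_text = False
        elif tok == "Tf" and len(operands) >= 2:
            font, size = operands[-2].lstrip("/"), float(operands[-1])
        elif tok == "Td" and len(operands) >= 2:
            # Td moves relative to the start of the current line.
            x += float(operands[-2])
            y += float(operands[-1])
        elif tok == "Tj" and in_text and operands:
            # Strip the parentheses and unescape \( \) \\ only.
            text = re.sub(r"\\([()\\])", r"\1", operands[-1][1:-1])
            runs.append({"font": font, "size": size, "x": x, "y": y, "text": text})
        operands = []
    return runs


if __name__ == "__main__":
    sample = "BT\n/F1 16 Tf\n25 795 Td\n(your string here) Tj\nET"
    print(extract_text_runs(sample))
```

The x and y values recovered this way are what a line-grouping step would need in order to put the text back into reading order.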