
Thread: PDF to Text parser

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Dec 2009
    Location
    sydney
    Posts
    265

    PDF to Text parser

    I'm working on automating data entry by parsing PDF files to text, using a script I found online:
    http://www.codeproject.com/Articles/...m-PDF-in-C-NET

    I tried a few other ones as well. They all work and generate text files, but the text comes out messy and not in the order it appears on the page,
    so it cannot be read and processed programmatically.

    Any other suggestions?

    Thanks in advance

  2. #2
    Frenzied Member Bulldog's Avatar
    Join Date
    Jun 2005
    Location
    South UK
    Posts
    1,950

    Re: PDF to Text parser

    PDF is a complex format, and converting it to text will usually produce messy output. Is there certain information (fields, etc.) you're looking to change? It would be easier to target certain structures.

    There is an SDK for this purpose (http://www.pdfonline.com/easypdf/sdk/sample_code.htm); perhaps that would be a better option.


    • If my post helped you, please Rate it
    • If your problem is solved please also mark the thread resolved

    I use VS2015 (unless otherwise stated).
    _________________________________________________________________________________
    B.Sc(Hons), AUS.P, C.Eng, MIET, MIEEE, MBCS / MCSE+Sec, MCSA+Sec, MCP, A+, Net+, Sec+, MCIWD, CIWP, CIWA
    I wrote my very first program in 1979, using machine code on a mechanical Olivetti teletype connected to an 8-bit, 78 instruction, 1MHz, Motorola 6800 multi-user system with 2k of memory. Using Windows, I dont think my situation has improved.

  3. #3

    Thread Starter
    Hyperactive Member
    Join Date
    Dec 2009
    Location
    sydney
    Posts
    265

    Re: PDF to Text parser

    Unfortunately no, different clients will have different templates. If I can get the text out in order, I can then pre-configure the parsing for each client.

  4. #4
    Frenzied Member Bulldog's Avatar
    Join Date
    Jun 2005
    Location
    South UK
    Posts
    1,950

    Re: PDF to Text parser

    Not sure I understand your requirement. Is it that you have a set of PDF files that you want to extract information from and then use for some other purpose? Or is it that you want to open the PDF files and put additional information into them in certain places?

    I tried the reference that you mentioned, and it outputs a plain text file in reading order (at least for the file I tried, which was my phone manual).



  5. #5
    Hyperactive Member
    Join Date
    Sep 2014
    Posts
    404

    Re: PDF to Text parser

    PDF files generally use the same structure for text. I can't remember the exact structure, but it is something like:

    0 0 0 RG - for the colour of the text
    0 0 TD - for the position of the text, using chart-like coordinates
    /F1 sample text Tj - for the font and text within the document

    I can check this for you later if you want the exact example.

  6. #6
    PowerPoster
    Join Date
    Mar 2002
    Location
    UK
    Posts
    4,780

    Re: PDF to Text parser

    This is already done for you in the iTextSharp library, which is what I use. I modified it slightly so that it gathers text onto the same line according to a tolerance (i.e. fragments whose tops start within x pixels above or below each other are classified as the same line).
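
    The same-line grouping described above can be sketched roughly as follows. This is not iTextSharp's API, just a minimal illustration in Python; the (x, y, text) fragment shape, the function name group_into_lines and the default tolerance are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical fragment shape: (x, y, text), as an extraction library might report it.
Fragment = Tuple[float, float, str]


def group_into_lines(fragments: List[Fragment], tolerance: float = 2.0) -> List[str]:
    """Join fragments whose y positions lie within `tolerance` of a line's first fragment."""
    lines: List[Tuple[float, List[Fragment]]] = []  # (reference y, fragments on that line)
    # PDF y coordinates grow upward, so the largest y is the top of the page.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - frag[1]) <= tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag[1], [frag]))
    # Within each line, order the fragments left to right by x.
    return [
        " ".join(text for _, _, text in sorted(frags, key=lambda f: f[0]))
        for _, frags in lines
    ]
```

    The tolerance would need tuning per template, since fonts of different sizes on one visual line can report slightly different positions.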

  7. #7
    Hyperactive Member
    Join Date
    Sep 2014
    Posts
    404

    Re: PDF to Text parser

    Below is the correct version:

    BT
    /F1 16 Tf - font and size
    25 795 Td - position
    (your string here) Tj - text value
    ET
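
    To show how a block like this could be read back out, here is a minimal sketch in Python. It assumes the content stream is already uncompressed and uses only literal strings with Td and Tj. Real files usually compress their streams and also use TJ arrays, text matrices, nested parentheses and font-specific encodings, none of which are handled here. The names TOKEN and extract_text are made up for the example.

```python
import re
from typing import List, Tuple

# Matches either "x y Td" or a literal string followed by Tj, e.g. "(Hello) Tj".
TOKEN = re.compile(
    r"(?<![\w.])(?P<x>-?\d+(?:\.\d+)?)\s+(?P<y>-?\d+(?:\.\d+)?)\s+Td"
    r"|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj"
)


def extract_text(content: str) -> List[Tuple[float, float, str]]:
    """Return (x, y, text) for each Tj inside BT ... ET blocks of a decoded stream."""
    results = []
    for block in re.findall(r"\bBT\b(.*?)\bET\b", content, flags=re.S):
        x = y = 0.0  # the text position is reset at every BT
        for m in TOKEN.finditer(block):
            if m.group("text") is None:
                # Td moves relative to the start of the current line.
                x += float(m.group("x"))
                y += float(m.group("y"))
            else:
                # Undo only the \\, \( and \) escapes; other escapes are left as-is.
                text = re.sub(r"\\([\\()])", r"\1", m.group("text"))
                results.append((x, y, text))
    return results
```

    The positions it returns are in text space only, so they ignore any page-level transformation, but they are enough to sort fragments into a rough reading order.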
