Thread: [RESOLVED] Looking for an API

  1. #1

    Thread Starter
    I'm about to be a PowerPoster! mendhak's Avatar
    Join Date
    Feb 2002
    Location
    Ulaan Baator GooGoo: Frog
    Posts
    38,170

    [RESOLVED] Looking for an API

    I'm looking for an API that can:

    1. Extract text from a PDF file (obviously, OCR)
    2. Create images, as in a snapshot, of a page in a PDF file

    For #1, preferably to XML or XHTML format.

    Any suggestions?


    I am using C#, so a .NET API or even a COM API would do.
    Last edited by mendhak; May 17th, 2007 at 05:06 AM.

  2. #2
    Shared Member
    Join Date
    May 2005
    Location
    Kashmir, India
    Posts
    2,277

    Re: Looking for an API

    Quote Originally Posted by mendhak
    1. Extract text from a PDF file (obviously, OCR)
    No need for OCR. Take a look at http://www.codeproject.com/useritems/PDFToText.asp
    Use [code] source code here[/code] tags when you post source code.

    My Articles

  3. #3

    Thread Starter
    I'm about to be a PowerPoster! mendhak's Avatar
    Join Date
    Feb 2002
    Location
    Ulaan Baator GooGoo: Frog
    Posts
    38,170

    Re: Looking for an API

    I do know how to extract text from a PDF, but I still need it to extract images of the pages in the PDF file.

    Forget what I said about OCRs, my mistake. We're looking to purchase some class libraries for this.

    I can find libraries that convert to HTML/XML. I can find libraries that convert to images. But I was hoping for one library to do both.

  4. #4
    PoorPoster iPrank's Avatar
    Join Date
    Oct 2005
    Location
    In a black hole
    Posts
    2,729

    Re: Looking for an API

    http://www.vbforums.com/showpost.php...00&postcount=2

    GSView/Ghostscript supports command-line switches (and there are some batch files available). See if you can use it from C#.
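
    For anyone trying this, here is a minimal sketch of driving Ghostscript's command line from a script to get one image per page. It is written in Python for brevity; the same argument list should carry over to a process launch from C#. The switches are standard Ghostscript ones, but the executable name, the output file pattern, the resolution and the function names are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def build_gs_command(pdf_path, out_dir, dpi=150, gs_exe="gs"):
    """Build a Ghostscript argument list that renders each PDF page to a PNG."""
    # %03d is expanded by Ghostscript to the page number (001, 002, ...).
    out_pattern = str(Path(out_dir) / "page-%03d.png")
    return [
        gs_exe,               # assumption: "gs" is on PATH; Windows builds ship gswin32c/gswin64c
        "-dBATCH",            # exit once the input file has been processed
        "-dNOPAUSE",          # do not wait for a keypress between pages
        "-dSAFER",            # restrict file access from within the PDF
        "-sDEVICE=png16m",    # 24-bit colour PNG output device
        f"-r{dpi}",           # rendering resolution in dots per inch
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def render_pages(pdf_path, out_dir, dpi=150, gs_exe="gs"):
    """Run Ghostscript and return the image files it produced, in page order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_gs_command(pdf_path, out_dir, dpi, gs_exe), check=True)
    return sorted(Path(out_dir).glob("page-*.png"))
```

    Swapping the device for one of Ghostscript's TIFF devices (for example tiffg4) would give TIFF pages instead, which is the format OCR tools usually prefer.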
    Usefull VBF Threads/Posts I Found . My flickr page .
    "I love being married. It's so great to find that one special person you want to annoy for the rest of your life." - Rita Rudner


  5. #5
    Just Married shakti5385's Avatar
    Join Date
    Mar 2006
    Location
    Udaipur,Rajasthan(INDIA)
    Posts
    3,747

    Re: Looking for an API

    I'd like to ask a related question here: can we read text from an image or a fax page?

  6. #6

    Thread Starter
    I'm about to be a PowerPoster! mendhak's Avatar
    Join Date
    Feb 2002
    Location
    Ulaan Baator GooGoo: Frog
    Posts
    38,170

    Re: Looking for an API

    Sorry to be an idiot, but we do need OCR capabilities in the API. Reason: The PDFs are actually scanned images of magazine articles.

    So, to be very specific... is there an API that can do the OCR reading for me?

  7. #7
    PoorPoster iPrank's Avatar
    Join Date
    Oct 2005
    Location
    In a black hole
    Posts
    2,729

    Re: Looking for an API

    No idea about library.

    If I remember correctly, ABBYY PDF Transformer uses the same OCR technology as ABBYY FineReader. That product may be able to do it.


  8. #8
    PowerPoster 2.0 Negative0's Avatar
    Join Date
    Jun 2000
    Location
    Southeastern MI
    Posts
    4,367

    Re: Looking for an API

    I have used ABBYY FineReader to take TIFs and make text PDFs from them, and it works very well.

    Just so you know, OCR is a very CPU-intensive process. I have seen some customers who have multiple machines that do nothing but OCR 24/7.

  9. #9
    Hyperactive Member
    Join Date
    Oct 2006
    Posts
    354

    Re: Looking for an API

    Quote Originally Posted by mendhak
    Sorry to be an idiot, but we do need OCR capabilities in the API. Reason: The PDFs are actually scanned images of magazine articles.

    So, to be very specific... is there an API that can do the OCR reading for me?
    I worked on a DoD contract that used an API that could do OCR on pages with images and text and put the results into PDFs that were stored in an Oracle DB. It was a few years ago, and I don't remember what it was called. But yeah, there is one out there.

  10. #10

    Thread Starter
    I'm about to be a PowerPoster! mendhak's Avatar
    Join Date
    Feb 2002
    Location
    Ulaan Baator GooGoo: Frog
    Posts
    38,170

    Re: Looking for an API

    Let me try explaining again. The PDF contains an image. The image has text on it. The OCR needs to be able to read the PDF's image for that text. Are there any APIs out there for this?


    superbovine, did that OCR work with PDFs too?

  11. #11
    PoorPoster iPrank's Avatar
    Join Date
    Oct 2005
    Location
    In a black hole
    Posts
    2,729

    Re: Looking for an API

    Have you actually tried ABBYY PDF Transformer?

    You can export the pages from your PDF document as TIFFs using GSView, then run recognition on them with ABBYY FineReader.
    Last edited by iPrank; May 21st, 2007 at 05:04 AM.


  12. #12

    Thread Starter
    I'm about to be a PowerPoster! mendhak's Avatar
    Join Date
    Feb 2002
    Location
    Ulaan Baator GooGoo: Frog
    Posts
    38,170

    Re: Looking for an API

    I did, unfortunately it wouldn't be suitable enough, considering that there are about 4 million PDF files to process. :vomit:
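
    To put a volume like that in perspective, here is a rough back-of-the-envelope sketch in Python. Only the 4 million figure comes from this thread; the per-file time and the deadline are made-up placeholders, and the model assumes each machine works through files one at a time, around the clock.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def machines_needed(file_count, seconds_per_file, deadline_days):
    """Machines required to process file_count files within deadline_days,
    assuming one file at a time per machine, running 24/7."""
    total_seconds = file_count * seconds_per_file
    machine_days = total_seconds / SECONDS_PER_DAY
    return math.ceil(machine_days / deadline_days)


# Hypothetical inputs: 10 s per PDF and a 30-day window are placeholders, not measurements.
print(machines_needed(4_000_000, 10, 30))  # prints 16
```

    Measuring the real per-file time on a sample of the PDFs would be the first step before trusting any number like this.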

  13. #13
    PoorPoster iPrank's Avatar
    Join Date
    Oct 2005
    Location
    In a black hole
    Posts
    2,729

    Re: Looking for an API

    I feel sorry for you. (oops, wrong smiley)


  14. #14
    PowerPoster 2.0 Negative0's Avatar
    Join Date
    Jun 2000
    Location
    Southeastern MI
    Posts
    4,367

    Re: Looking for an API

    From looking at the ABBYY SDK pages, it supports PDF as both an input and an output format:

    http://www.abbyy.com/sdk/?param=60493#p5

    So the ABBYY FineReader Engine may work for you.

  15. #15
    Shared Member
    Join Date
    May 2005
    Location
    Kashmir, India
    Posts
    2,277

    Re: Looking for an API

    I have never done this kind of work before, but this article might be of some help:
    http://www.codeproject.com/showcase/SearchablePDFs.asp

  16. #16
    Hyperactive Member
    Join Date
    Oct 2006
    Posts
    354

    Re: Looking for an API

    Quote Originally Posted by mendhak
    Let me try explaining again. The PDF contains an image. The image has text on it. The OCR needs to be able to read the PDF's image for that text. Are there any APIs out there for this?


    superbovine, did that OCR work with PDFs too?
    I don't know for sure; that wasn't really my section. I talked to another guy who was on base with me, and he doesn't remember the name of the API or whether it worked on PDFs.

  17. #17

    Thread Starter
    I'm about to be a PowerPoster! mendhak's Avatar
    Join Date
    Feb 2002
    Location
    Ulaan Baator GooGoo: Frog
    Posts
    38,170

    Re: Looking for an API

    Bloody Brilliant, people.

    I <3 some of you, and <3 some of you even more!

  18. #18
    PoorPoster iPrank's Avatar
    Join Date
    Oct 2005
    Location
    In a black hole
    Posts
    2,729

    Re: Looking for an API

    So what did you go with in the end?

    PS. What does "<3" mean?


  19. #19

    Thread Starter
    I'm about to be a PowerPoster! mendhak's Avatar
    Join Date
    Feb 2002
    Location
    Ulaan Baator GooGoo: Frog
    Posts
    38,170

    Re: Looking for an API

    It's l33tpr0n speak for 'love'. (heart)

    I will be going for http://www.codeproject.com/showcase/SearchablePDFs.asp in conjunction with the Adobe SDK to create thumbnails.
