[RESOLVED] Looking for an API

May 17th, 2007, 05:03 AM
mendhak

[RESOLVED] Looking for an API

I'm looking for an API that can:

1. Extract text from a PDF file (obviously, OCR)
2. Create images, as in a snapshot, of a page in a PDF file

For #1, preferably to XML or XHTML format.

Any suggestions?

I am using C#, so a .NET API or even a COM API would do.
May 17th, 2007, 05:29 AM
Shuja Ali

Re: Looking for an API

Quote:

Originally Posted by mendhak

1. Extract text from a PDF file (obviously, OCR)

No need for OCR. Take a look at http://www.codeproject.com/useritems/PDFToText.asp
May 17th, 2007, 05:48 AM
mendhak

Re: Looking for an API

I do know how to extract text from a PDF, but I still need it to extract images of the pages in the PDF file.

Forget what I said about OCRs, my mistake. We're looking to purchase some class libraries for this.

I can find libraries that convert to HTML/XML. I can find libraries that convert to images. But I was hoping for one library to do both.
May 17th, 2007, 06:39 AM
iPrank

Re: Looking for an API

http://www.vbforums.com/showpost.php...00&postcount=2

GSView/Ghostscript supports command-line switches (and there are some batch files available). See if you can use it from C#.
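
As a rough sketch of what those command-line switches look like, the snippet below builds a Ghostscript call that renders each page of a PDF to an image file. It is written in Python only to keep it short; the same argument list could be handed to System.Diagnostics.Process from C#. The helper names, the file names, the 150 dpi default and the `gs` executable name are assumptions (the Windows console build is usually named gswin32c.exe or gswin64c.exe), so treat it as a starting point, not a tested pipeline.

```python
import subprocess
from pathlib import Path

# Output file extension for a few common Ghostscript raster devices.
_SUFFIX = {"png16m": "png", "jpeg": "jpg", "tiffg4": "tif"}


def ghostscript_args(pdf_path, out_dir, device="png16m", dpi=150, gs_exe="gs"):
    """Build (but do not run) a Ghostscript command that renders every page.

    gs_exe is an assumption: adjust it to whatever the console executable
    is called on your machine.
    """
    suffix = _SUFFIX[device]
    # %03d is expanded by Ghostscript to the page number (001, 002, ...).
    out_pattern = Path(out_dir) / f"{Path(pdf_path).stem}-%03d.{suffix}"
    return [
        gs_exe,
        "-dSAFER",                       # restrict file access
        "-dBATCH",                       # exit when the input is processed
        "-dNOPAUSE",                     # do not wait for a key press per page
        f"-sDEVICE={device}",            # output format
        f"-r{dpi}",                      # resolution in dots per inch
        f"-sOutputFile={out_pattern}",   # one file per page
        str(pdf_path),
    ]


def render_pages(pdf_path, out_dir, **options):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_args(pdf_path, out_dir, **options), check=True)
```

Something like render_pages("article.pdf", "out", device="tiffg4", dpi=300) should write one black-and-white TIFF per page, which is a common input format for OCR tools, while png16m or jpeg is probably the better choice for page snapshots.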
May 17th, 2007, 07:21 AM
shakti5385

Re: Looking for an API

I'd like to ask a related question here: can we read text from an image or a fax page?
May 17th, 2007, 08:32 AM
mendhak

Re: Looking for an API

Sorry to be an idiot, but we do need OCR capabilities in the API. Reason: The PDFs are actually scanned images of magazine articles.

So, to be very specific... is there an API that can do the OCR reading for me?
May 17th, 2007, 10:18 AM
iPrank

Re: Looking for an API

No idea about library.

If I remember correctly, ABBYY PDF Transformer uses the same OCR technology as ABBYY FineReader. That product may be able to do it.
May 17th, 2007, 09:18 PM
Negative0

Re: Looking for an API

I have used ABBYY FineReader to take TIFFs and make text PDFs from them, and it works very well.

Just so you know, OCR is a very CPU-intensive process. I have seen some customers with multiple machines that do nothing but OCR 24/7.
May 20th, 2007, 10:31 PM
superbovine

Re: Looking for an API

Quote:

Originally Posted by mendhak

Sorry to be an idiot, but we do need OCR capabilities in the API. Reason: The PDFs are actually scanned images of magazine articles.

So, to be very specific... is there an API that can do the OCR reading for me?

I worked on a DoD contract that used an API that could do OCR on pages with images and text and put the results into PDFs that were stored in an Oracle database. It was a few years ago, and I don't remember what it was called. But yeah, there is one out there.
May 21st, 2007, 04:29 AM
mendhak

Re: Looking for an API

Let me try explaining again. The PDF contains an image. The image has text on it. The OCR needs to be able to read the PDF's image for that text. Are there any APIs out there for this?

superbovine, did that OCR work with PDFs too?
May 21st, 2007, 05:00 AM
iPrank

Re: Looking for an API

Have you actually tried ABBYY PDF Transformer?

You can export the pages from your PDF document as TIFF using GhostView, then recognise them with ABBYY FineReader.
May 21st, 2007, 05:35 AM
mendhak

Re: Looking for an API

I did. Unfortunately it isn't really suitable, considering that there are about 4 million PDF files to process. :vomit:
May 21st, 2007, 05:43 AM
iPrank

Re: Looking for an API

I feel sorry for you. :D (oops, wrong smiley)
May 21st, 2007, 09:39 AM
Negative0

Re: Looking for an API

From looking at the ABBYY SDK pages, it supports PDF as both an input and an output format:

http://www.abbyy.com/sdk/?param=60493#p5

So the ABBYY FineReader Engine may work for you.
May 22nd, 2007, 02:22 AM
Shuja Ali

Re: Looking for an API

I have never done this kind of work before, but this article might be of some help:
http://www.codeproject.com/showcase/SearchablePDFs.asp
May 23rd, 2007, 11:18 AM
superbovine

Re: Looking for an API

Quote:

Originally Posted by mendhak

Let me try explaining again. The PDF contains an image. The image has text on it. The OCR needs to be able to read the PDF's image for that text. Are there any APIs out there for this?

superbovine, did that OCR work with PDFs too?

I don't know for sure; that wasn't really my section. I talked to another guy who was on base with me, and he doesn't remember the name of the API or whether it worked on PDFs.
May 23rd, 2007, 04:00 PM
mendhak

Re: Looking for an API

Bloody Brilliant, people.

I <3 some of you, and <3 some of you even more!
May 23rd, 2007, 04:29 PM
iPrank

Re: Looking for an API

So what did you go with in the end?

PS. What does "<3" mean?
May 24th, 2007, 04:23 PM
mendhak

Re: Looking for an API

It's l33tpr0n speak for 'love'. (heart)

I will be going for http://www.codeproject.com/showcase/SearchablePDFs.asp in conjunction with the Adobe SDK to create thumbnails.

All times are GMT -5.