[RESOLVED] Looking for an API


mendhak
May 17th, 2007, 05:03 AM
I'm looking for an API that can:

1. Extract text from a PDF file (obviously, OCR)
2. Create images, as in a snapshot, of a page in a PDF file

For #1, preferably to XML or XHTML format.

Any suggestions?


I am using C#, so a .NET API or even a COM API would do.

Shuja Ali
May 17th, 2007, 05:29 AM
1. Extract text from a PDF file (obviously, OCR)

No need for OCR. Take a look at http://www.codeproject.com/useritems/PDFToText.asp

mendhak
May 17th, 2007, 05:48 AM
I do know how to extract text from a PDF, but I still need it to extract images of the pages in the PDF file.

Forget what I said about OCRs, my mistake. We're looking to purchase some class libraries for this.

I can find libraries that convert to HTML/XML. I can find libraries that convert to images. But I was hoping for one library to do both.

iPrank
May 17th, 2007, 06:39 AM
http://www.vbforums.com/showpost.php?p=2880900&postcount=2

GSView/Ghostscript supports command-line switches (and there are some batch files available). See if you can use it from C#.
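
As a rough sketch of what that could look like: assuming Ghostscript is installed and its console executable is on the PATH (called `gs` here; on Windows it is typically `gswin32c`), the following builds an argument list for rendering PDF pages to PNG files. The helper names `build_gs_args` and `render_pages` and the file names are made up for illustration. From C#, the same arguments could presumably be passed through `System.Diagnostics.Process`.

```python
import subprocess


def build_gs_args(pdf_path, out_pattern, dpi=150,
                  first_page=None, last_page=None, exe="gs"):
    """Build a Ghostscript command line that renders PDF pages as PNG images.

    out_pattern may contain a %d placeholder (e.g. "page-%03d.png")
    so that each page is written to its own file.
    """
    args = [
        exe,                # assumption: executable name, adjust per platform
        "-dBATCH",          # exit when done instead of waiting at a prompt
        "-dNOPAUSE",        # do not pause between pages
        "-dSAFER",          # restrict file system access
        "-sDEVICE=png16m",  # 24-bit colour PNG output device
        f"-r{dpi}",         # rendering resolution in dots per inch
    ]
    if first_page is not None:
        args.append(f"-dFirstPage={first_page}")
    if last_page is not None:
        args.append(f"-dLastPage={last_page}")
    args.append(f"-sOutputFile={out_pattern}")
    args.append(pdf_path)
    return args


def render_pages(pdf_path, out_pattern, **kwargs):
    """Run Ghostscript; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_gs_args(pdf_path, out_pattern, **kwargs), check=True)
```

Calling `render_pages("article.pdf", "page-%03d.png", first_page=1, last_page=1)` would render only the first page, which may be enough for a thumbnail.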

shakti5385
May 17th, 2007, 07:21 AM
I'd like to ask a related question here: can we read text from an image or a fax page?

mendhak
May 17th, 2007, 08:32 AM
Sorry to be an idiot, but we do need OCR capabilities in the API. Reason: The PDFs are actually scanned images of magazine articles.

So, to be very specific... is there an API that can do the OCR reading for me?

iPrank
May 17th, 2007, 10:18 AM
No idea about a library.

If I remember correctly, ABBYY PDF Transformer (http://www.pdftransformer.com/) uses the same OCR technology as ABBYY FineReader. That product may be able to do it.

Negative0
May 17th, 2007, 09:18 PM
I have used ABBYY FineReader to take TIFs and make text PDFs from them, and it works very well.

Just so you know, OCR is a very CPU-intensive process. I have seen some customers that have multiple machines that do only OCR 24/7.

superbovine
May 20th, 2007, 10:31 PM
Sorry to be an idiot, but we do need OCR capabilities in the API. Reason: The PDFs are actually scanned images of magazine articles.

So, to be very specific... is there an API that can do the OCR reading for me?

I worked on a DoD contract that used an API that could do OCR on pages with images and text and put them into a PDF that was stored in an Oracle DB. It was a few years ago, and I don't remember what it was called. But yeah, there is one out there.

mendhak
May 21st, 2007, 04:29 AM
Let me try explaining again. The PDF contains an image. The image has text on it. The OCR needs to be able to read the PDF's image for that text. Are there any APIs out there for this?


superbovine, did that OCR work with PDFs too?

iPrank
May 21st, 2007, 05:00 AM
Have you actually tried ABBYY PDF Transformer?

You can export the pages from your PDF document as TIFF using GhostView, then run them through ABBYY FineReader for recognition.

mendhak
May 21st, 2007, 05:35 AM
I did; unfortunately it wouldn't be suitable, considering that there are about 4 million PDF files to process. :vomit:
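
For a sense of scale, a back-of-the-envelope estimate can be sketched as below. Only the figure of about 4 million files comes from this thread; the pages per file and the OCR seconds per page are pure assumptions, not measurements, and the function name `machine_days` is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def machine_days(num_files, pages_per_file, seconds_per_page, machines=1):
    """Estimate wall-clock days of OCR work, split evenly across machines."""
    total_seconds = num_files * pages_per_file * seconds_per_page
    return total_seconds / SECONDS_PER_DAY / machines


# Assumed inputs: 3 pages per file and 5 seconds of OCR per page.
single = machine_days(4_000_000, pages_per_file=3, seconds_per_page=5)
farm = machine_days(4_000_000, pages_per_file=3, seconds_per_page=5, machines=10)
print(f"1 machine: {single:.0f} days, 10 machines: {farm:.0f} days")
```

Under those assumed numbers, a single machine would need roughly two years, which is why a batch-capable SDK and several machines matter more here than a desktop tool.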

iPrank
May 21st, 2007, 05:43 AM
I feel sorry for you. :D (oops, wrong smiley)

Negative0
May 21st, 2007, 09:39 AM
From looking at the ABBYY SDK pages, it supports PDF for both input and output formats:

http://www.abbyy.com/sdk/?param=60493#p5

So the ABBYY FineReader Engine may work for you.

Shuja Ali
May 22nd, 2007, 02:22 AM
I have never done this kind of work before, but this article might be of some help:
http://www.codeproject.com/showcase/SearchablePDFs.asp

superbovine
May 23rd, 2007, 11:18 AM
Let me try explaining again. The PDF contains an image. The image has text on it. The OCR needs to be able to read the PDF's image for that text. Are there any APIs out there for this?


superbovine, did that OCR work with PDFs too?

I don't know for sure; that wasn't really my section. I talked to another guy who was on base with me, and he doesn't remember the name of the API or whether it worked on PDFs.

mendhak
May 23rd, 2007, 04:00 PM
Bloody Brilliant, people.

I <3 some of you, and <3 some of you even more!

iPrank
May 23rd, 2007, 04:29 PM
So what did you do in the end?

PS. What does "<3" mean?

mendhak
May 24th, 2007, 04:23 PM
It's l33tpr0n speak for 'love'. (heart)

I will be going for http://www.codeproject.com/showcase/SearchablePDFs.asp in conjunction with the Adobe SDK to create thumbnails.