-
May 17th, 2007, 05:03 AM
#1
[RESOLVED] Looking for an API
I'm looking for an API that can:
1. Extract text from a PDF file (obviously, OCR)
2. Create images, as in a snapshot, of a page in a PDF file
For #1, preferably to XML or XHTML format.
Any suggestions?
I am using C#, so a .NET API or even a COM API would do.
Last edited by mendhak; May 17th, 2007 at 05:06 AM.
-
May 17th, 2007, 05:29 AM
#2
Re: Looking for an API
 Originally Posted by mendhak
1. Extract text from a PDF file (obviously, OCR)
No need for OCR. Take a look at http://www.codeproject.com/useritems/PDFToText.asp
Use [code] source code here[/code] tags when you post source code.
My Articles
-
May 17th, 2007, 05:48 AM
#3
Re: Looking for an API
I do know how to extract text from a PDF, but I still need it to extract images of the pages in the PDF file.
Forget what I said about OCR; my mistake. We're looking to purchase some class libraries for this.
I can find libraries that convert to HTML/XML. I can find libraries that convert to images. But I was hoping for one library that does both.
-
May 17th, 2007, 06:39 AM
#4
Re: Looking for an API
http://www.vbforums.com/showpost.php...00&postcount=2
GSview/Ghostscript supports command-line switches (and there are some batch files available). See if you can call it from C#.
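To illustrate the command-line route, here is a minimal sketch of rendering each page of a PDF to an image with Ghostscript. It is written in Python for brevity; the same argument list could be handed to System.Diagnostics.Process from C#. The helper names (build_gs_command, render_pages) and the file names are made up for this example, and it assumes the Ghostscript console executable is on the PATH as gs (on Windows it is typically gswin32c or gswin64c).

```python
import subprocess
from pathlib import Path


def build_gs_command(pdf_path, out_dir, device="png16m", dpi=150, gs_exe="gs"):
    """Build a Ghostscript argument list that renders every page to an image.

    device: a Ghostscript output device, e.g. "png16m" (24-bit PNG)
            or "tiffg4" (black-and-white TIFF, a common OCR input).
    gs_exe: assumed executable name; adjust for your installation.
    """
    ext = "tif" if device.startswith("tiff") else "png"
    # %03d is expanded by Ghostscript to the page number (001, 002, ...).
    out_pattern = Path(out_dir) / f"page-%03d.{ext}"
    return [
        gs_exe,
        "-dNOPAUSE",   # do not pause between pages
        "-dBATCH",     # exit when the input has been processed
        "-dSAFER",     # restrict file access from within the PDF
        f"-sDEVICE={device}",
        f"-r{dpi}",    # output resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def render_pages(pdf_path, out_dir, **options):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_gs_command(pdf_path, out_dir, **options), check=True)
```

Calling render_pages("article.pdf", "pages") would write one PNG per page into the pages folder; passing device="tiffg4" and dpi=300 produces TIFFs that are better suited as OCR input.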
-
May 17th, 2007, 07:21 AM
#5
Re: Looking for an API
I want to ask a similar question here: can we read text from an image or a fax page?
-
May 17th, 2007, 08:32 AM
#6
Re: Looking for an API
Sorry to be an idiot, but we do need OCR capabilities in the API. Reason: The PDFs are actually scanned images of magazine articles.
So, to be very specific... is there an API that can do the OCR reading for me?
-
May 17th, 2007, 10:18 AM
#7
Re: Looking for an API
No idea about a library.
If I remember correctly, ABBYY PDF Transformer uses the same OCR technology as ABBYY FineReader. That product may be able to do it.
-
May 17th, 2007, 09:18 PM
#8
Re: Looking for an API
I have used ABBYY FineReader to take TIFFs and make text PDFs from them, and it works very well.
Just so you know, OCR is a very CPU-intensive process. I have seen some customers who have multiple machines that do nothing but OCR 24/7.
-
May 20th, 2007, 10:31 PM
#9
Hyperactive Member
Re: Looking for an API
 Originally Posted by mendhak
Sorry to be an idiot, but we do need OCR capabilities in the API. Reason: The PDFs are actually scanned images of magazine articles.
So, to be very specific... is there an API that can do the OCR reading for me?
I worked on a DoD contract that used an API that could do OCR on pages with images and text and put them into a PDF that was stored in an Oracle database. It was a few years ago, and I don't remember what it was called. But yeah, there is one out there.
-
May 21st, 2007, 04:29 AM
#10
Re: Looking for an API
Let me try explaining again. The PDF contains an image. The image has text on it. The OCR needs to be able to read the PDF's image for that text. Are there any APIs out there for this?
superbovine, did that OCR work with PDFs too?
-
May 21st, 2007, 05:00 AM
#11
Re: Looking for an API
Have you actually tried ABBYY PDF Transformer?
You can export the pages from your PDF document as TIFFs using GhostView, then run them through ABBYY FineReader for recognition.
Last edited by iPrank; May 21st, 2007 at 05:04 AM.
-
May 21st, 2007, 05:35 AM
#12
Re: Looking for an API
I did. Unfortunately, it wouldn't be suitable, considering that there are about 4 million PDF files to process. :vomit:
-
May 21st, 2007, 05:43 AM
#13
Re: Looking for an API
I feel sorry for you. (oops, wrong smiley)
-
May 21st, 2007, 09:39 AM
#14
Re: Looking for an API
From looking at the ABBYY SDK pages, it supports PDF as both an input and an output format:
http://www.abbyy.com/sdk/?param=60493#p5
So the ABBYY FineReader Engine may work for you.
-
May 22nd, 2007, 02:22 AM
#15
Re: Looking for an API
I have never done this kind of work before, but this article might be of some help:
http://www.codeproject.com/showcase/SearchablePDFs.asp
-
May 23rd, 2007, 11:18 AM
#16
Hyperactive Member
Re: Looking for an API
 Originally Posted by mendhak
Let me try explaining again. The PDF contains an image. The image has text on it. The OCR needs to be able to read the PDF's image for that text. Are there any APIs out there for this?
superbovine, did that OCR work with PDFs too?
I don't know for sure; that wasn't really my section. I talked to another guy who was on base with me, and he doesn't remember the name of the API or whether it worked on PDFs.
-
May 23rd, 2007, 04:00 PM
#17
Re: Looking for an API
Bloody Brilliant, people.
I <3 some of you, and <3 some of you even more!
-
May 23rd, 2007, 04:29 PM
#18
Re: Looking for an API
So what did you go with in the end?
PS. What does "<3" mean?
-
May 24th, 2007, 04:23 PM
#19
Re: Looking for an API
It's l33tpr0n speak for 'love'. (heart)
I will be going for http://www.codeproject.com/showcase/SearchablePDFs.asp in conjunction with the Adobe SDK to create thumbnails.