[VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
After getting frustrated relying on Adobe Acrobat to extract text from PDFs, I started hunting around for an alternative solution.
The first release of pdftotext.dll for VB6 is on GitHub. Binary download on the Releases page.
Usage (current as of v4.03-2 release)
Code:
Private Declare Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare Function extractTextSlice Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, ByVal iPage As Integer, ByVal iSliceX As Integer, ByVal iSliceY As Integer, ByVal iSliceW As Integer, ByVal iSliceH As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare Function getPageSize Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal iPage As Integer, ByRef dWidth As Double, ByRef dHeight As Double, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Dim strOutput As String
Dim Width As Double, Height As Double
pages = getNumPages("filename.pdf", AddressOf LogCallback, "pass", "anotherpass")
ret = extractText("filename.pdf", VarPtr(strOutput), 1, 3, "UTF-8", "rawOrder", AddressOf LogCallback, "pass", "anotherpass")
ret = extractTextSlice("filename.pdf", VarPtr(strOutput), 1, 207, 100, 300, 200, "UTF-8", "table", AddressOf LogCallback, "pass", "anotherpass")
ret = getPageSize("filename.pdf", 1, Width, Height, AddressOf LogCallback, "pass", "anotherpass")
' Must be in a standard module (not Form or Class)
Public Sub LogCallback(ByVal str As String)
Debug.Print "Log: " & str
End Sub
Almost all arguments are optional. For example, the following works:
Code:
Dim strOutput As String
Dim Width As Double, Height As Double
pages = getNumPages("filename.pdf")
ret = extractText("filename.pdf", VarPtr(strOutput))
ret = extractTextSlice("filename.pdf", VarPtr(strOutput), 1, 207, 100, 300, 200) 'However, you probably want to use the "table" layout
ret = getPageSize("filename.pdf", 1, Width, Height)
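For per-page processing, the two calls can be combined. This is only a sketch built on the declarations above: DumpPages is a hypothetical helper name, and it assumes that passing the same number for iFirstPage and iLastPage extracts just that page. The meaning of the return values is not documented in this thread, so ret is only printed, and the loop simply does nothing if getNumPages does not return a positive count.

```vb
' Place in a standard module. Declarations copied from the usage section above.
Private Declare Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer

' Hypothetical helper: prints the text of each page separately.
Public Sub DumpPages(ByVal sFile As String)
    Dim iPages As Integer
    Dim i As Integer
    Dim ret As Integer
    Dim sPage As String

    iPages = getNumPages(sFile)
    For i = 1 To iPages
        sPage = vbNullString                        ' start each page with an empty string
        ret = extractText(sFile, VarPtr(sPage), i, i)
        Debug.Print "--- page " & i & " (ret=" & ret & ") ---"
        Debug.Print sPage
    Next i
End Sub
```

Call it from the Immediate window with, for example, DumpPages "filename.pdf".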
Last edited by pdey; Dec 28th, 2024 at 04:54 PM.
Reason: Updated code snippet to reflect latest version
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
Very good work.
Could the problems with accented characters be solved?
Code:
FORMACIÓN
- Cursos básicos de prevención de riesgos laborales. - Formación adaptada a tu puesto de trabajo. - TPC (tarjeta profesional para la Construcción, Metal,
Madera, Vidrio y Cerámica). - Amianto. - Manipulador de alimentos.
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
Originally Posted by yokesee
Very good work.
Could the problems with accented characters be solved?
Code:
FORMACIÓN
- Cursos básicos de prevención de riesgos laborales. - Formación adaptada a tu puesto de trabajo. - TPC (tarjeta profesional para la Construcción, Metal,
Madera, Vidrio y Cerámica). - Amianto. - Manipulador de alimentos.
Regards
Can you please raise an issue in GitHub and attach an example document?
Looks like a UTF8 encoding issue.
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
Originally Posted by Dan_W
Hi. This is great. I tried it in TwinBasic 32-bit, and it worked flawlessly.
I have no experience compiling DLLs, using CMAKE, etc - is it possible/easy enough to compile the DLL in 64bit too? Say, for consumption in VBA?
Certainly convenient but for more than simple needs I'd just use the regular dll... Usually someone posts builds somewhere. That's what I did for pdfium; though for VBA I didn't see a recent build source with _stdcall so VBA 32bit is a problem if you need features added after 2018.
It's a few more lines but you can see how to get text with 64bit compatibility with pdfium in my gPdfMerge project.
If you use the DLLs from the original version they'd support VBA32 instead of just VB6 (via VBCDeclFix)/tB32/tB64/VBA64 like the ones in the latest version.
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
For those encoding issues (i.e. with accented characters) - please test the latest build.
Details are in the issue report.
Note that code changes are required.
It would be helpful if you could supply some "clean" examples of PDFs with the issue - i.e. not scanned & OCR'd documents.
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
Aloa,
It is better, but I had to add this line:
strOutput = Replace(strOutput, Chr$(0), vbNullString)
The extracted text : "N u m é r o d e c l i e n t : 2 1 5 3 9 7 8 3 4 0 N u m é r o d e f a c t u r e : 7 0 8 0 1 1 9 9 2 2 9 2 "
it should be : "Numéro de client: 2 153 978 340 Numéro de facture: 708 011 992 292"
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
This new version does not work for me.
ret returns 0.
Code:
Option Explicit
Private Declare Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByRef lpTextOutput As Long, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Sub Command1_Click()
Dim strOutput As Long
Dim pages As Integer
Dim ret As Integer
pages = getNumPages("filename.pdf", AddressOf LogCallback, "pass", "anotherpass")
LabelNumberpages.Caption = pages
ret = extractText("filename.pdf", VarPtr(strOutput), 1, 3, "UTF-8", "rawOrder", AddressOf LogCallback, "pass", "anotherpass")
Msgbox strOutput
End Sub
Private Sub Command2_Click()
Dim strOutput As Long
Dim pages As Integer
Dim ret As Integer
pages = getNumPages("filename.pdf")
LabelNumberpages.Caption = pages
ret = extractText("filename.pdf", VarPtr(strOutput))
Msgbox strOutput
End Sub
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
Originally Posted by Thierry69
Aloa,
It is better, but I had to add this line:
strOutput = Replace(strOutput, Chr$(0), vbNullString)
The extracted text : "N u m é r o d e c l i e n t : 2 1 5 3 9 7 8 3 4 0 N u m é r o d e f a c t u r e : 7 0 8 0 1 1 9 9 2 2 9 2 "
it should be : "Numéro de client: 2 153 978 340 Numéro de facture: 708 011 992 292"
If you have a NULL for every second character, it is likely that you have not updated the function declaration, and are not passing the return variable using StrPtr.
If the function declaration still uses String instead of Long for lpTextOutput, then VB6 assumes the text coming back is ANSI, not Unicode, and re-encodes it as UTF-16, introducing the NULL characters.
Please see README.md for example code with correct usage.
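The effect can be reproduced without the DLL. The sketch below is not the DLL's code; it only imitates the ANSI-to-Unicode widening described above by using StrConv with vbUnicode, which treats the bytes of an existing UTF-16 string as ANSI characters. DemoAnsiMisread is a hypothetical name, and the sample text is plain ASCII so the result does not depend on the system code page.

```vb
' Place in a standard module and run from the Immediate window.
Public Sub DemoAnsiMisread()
    Dim s As String
    Dim misread As String

    s = "Facture"                      ' ordinary VB string: 7 characters, 14 bytes of UTF-16

    ' Widen every byte of s as if it were an ANSI character. Each zero
    ' high byte of the UTF-16 text becomes its own NULL character.
    misread = StrConv(s, vbUnicode)

    Debug.Print Len(s); Len(misread)                          ' 7  14
    Debug.Print Asc(Mid$(misread, 2, 1))                      ' 0 (a NULL after the "F")
    Debug.Print Replace(misread, Chr$(0), vbNullString) = s   ' True
End Sub
```

Stripping the NULLs, as in the Replace workaround quoted above, only repairs the text when every character fits in a single ANSI byte. Fixing the declaration avoids the extra conversion altogether.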
Originally Posted by yokesee
This new version does not work for me.
ret returns 0.
Your strOutput is defined as Long, not as a String - so obviously, no string is returned.
Please see README.md for example code with correct usage.
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
The examples don't work for me with the new version of the dll xpdf-dll/issues/1
Originally Posted by Thierry69
haven't seen it, sorry
I'll solve your problem with the DLL like this.
Code:
Private Declare Function SetCurrentDirectory Lib "kernel32" Alias "SetCurrentDirectoryA" (ByVal lpPathName As String) As Long
Private Sub Form_Load()
SetCurrentDirectory App.Path
End Sub
I'm sorry, I only just saw that you had responded.
Sure enough, that adjustment managed to get the DLL to compile. I had my doubts when I saw the flurry of WARNING messages, but it did actually compile a 64-bit DLL.
I fully expected it to crash, though, because nothing with 64-bit VBA is ever straightforward or painless, but I rewrote the declarations, and it kinda worked. The getNumPages function worked fine, but extractText didn't appear to do anything at all. It wasn't until I adjusted the declaration as per Cliv's message above that it came back with something. Anyway, after a little more work, here are the 64-bit API declarations and code that worked for me:
Code:
Private Declare PtrSafe Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare PtrSafe Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As LongPtr, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Sub Test()
Dim strOutput As String, Filename As String, Ret As Integer
Filename = "C:\VBA\TestPDF.pdf"
Ret = extractText(Filename, VarPtr(strOutput), 1, 3, "UTF-8", "rawOrder", AddressOf LogCallback)
Debug.Print StrConv(strOutput, vbUnicode)
End Sub
Sub Test2()
Dim strOutput As String, Filename As String, Pages As Long, Ret As Integer
Filename = "C:\VBA\TestPDF.pdf"
Pages = getNumPages(Filename)
Ret = extractText(Filename, VarPtr(strOutput), 1, Pages \ 2)
Debug.Print StrConv(strOutput, vbUnicode)
End Sub
Public Sub LogCallback(ByVal str As String)
Debug.Print "Log: " & str
End Sub
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
Originally Posted by pdey
I've tested the link on two different machines - works on both...
I am not logged in; would this make a difference? I am on my work PC and this also gives the GitHub 404 error page.
The link provided is the one I clicked on, but in the image the link has changed.
Last edited by k_zeon; Dec 16th, 2024 at 06:38 AM.
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
@pdey:
It'd sure be nice if we had this all in pure VB6 code. I took a look at your GitHub fork, and there are all kinds of libraries pulled together to do this. In addition, it seems to be a conglomeration of C, C++, & Python. It's entirely too much for me to maintain motivation to dig through and translate. But maybe there are core pieces that outline how to read the PDF structure and extract text that you could point out, since you're obviously familiar with it at this point.
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
pdfium is just a single pure C++ DLL, so that would be *easier* to port... but it's still a massive undertaking. You'd be hard pressed to find someone willing to make that time commitment for free among the dwindling active population of people who know both VB6 and C++ well enough to port it.
A standard dll doesn't have the registration hell of activex so it's not that bad to have one. Could always look at static linking when tB is stable and complete enough for you holdouts.
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
Originally Posted by cliv
For extractTextSlice a new function would be very helpful getPageSize(iPage, dWidth, dHeight)
Yes, I've been looking for a pdf text extractor that can read the margins (formatting) so the extracted text could be easily rebuilt in MSoffice or OpenOffice. This would be an excellent addition!
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
Originally Posted by Elroy
It'd sure be nice if we had this all in pure VB6 code.
I'm not sure I see the point. Having this as a single, easily redistributable DLL with no registration required seems convenient enough. Regardless, my initial motivation was replacing Adobe Acrobat with something with substantially lower installation/support complexity. According to @Dan_W, this even seems to work in 64-bit VBA with no actual code changes.
Originally Posted by Elroy
But maybe there are core pieces that outline how to read the PDF structure and extract text that you could point out, since you're obviously familiar with it at this point.
Porting this to pure VB6 would be a non-trivial project.
Text in PDF documents is non-linear. Xpdf does not simply "extract" the text, but rather renders the text, then returns the text rendering so that the text is in the correct order, as would be seen on screen. The bulk of what you would want to port is probably in xpdf/: PdfDoc.cc, Catalog.cc (describing the document structure), Page.cc and TextOutputDev.cc (which renders the text).
Originally Posted by taishan
Yes, I've been looking for a pdf text extractor that can read the margins (formatting) so the extracted text could be easily rebuilt in MSoffice or OpenOffice. This would be an excellent addition!
A link to a new test version is in Post #23 which includes this.
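On units: the thread does not say what getPageSize reports. If it returns PDF points, the default unit in PDF, where 1 point = 1/72 inch (an assumption, not something confirmed here), then converting to inches or pixels is simple arithmetic. The helper names below are hypothetical and the DPI value is whatever the target surface uses.

```vb
' Place in a standard module.
' Assumption: page dimensions are in PDF points (1 point = 1/72 inch).
Public Function PointsToInches(ByVal dPoints As Double) As Double
    PointsToInches = dPoints / 72#
End Function

Public Function PointsToPixels(ByVal dPoints As Double, ByVal dDpi As Double) As Double
    PointsToPixels = dPoints / 72# * dDpi
End Function

Public Sub DemoPageSizeUnits()
    ' 612 x 792 points is a US Letter page.
    Debug.Print PointsToInches(612); PointsToInches(792)   ' 8.5  11
    Debug.Print PointsToPixels(612, 96)                    ' 816
End Sub
```

The same conversion would apply to the extractTextSlice coordinates only if they use the same unit, which should be checked against README.md.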
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
@pdey Thanks for the dll. Here's a complete project with everything needed to test it.
I'm pretty sure the xpdf declaration returns should be As Long instead of As Integer in VB6.
I am out of time working with this, so please modify this project to show the full capabilities of the dll.
I'd especially like to know how to retrieve the margins in Px or inches!
Cheers
Last edited by taishan; Dec 19th, 2024 at 01:58 PM.
Thank you for the update. Out of curiosity, do you plan on adding any further functionality?
Following your instructions from last time, I have compiled the updated 64bit DLL, and can confirm that it worked perfectly like last time.
Having the output correctly encoded is a huge help - thank you. As before, on the off-chance it might be useful to someone else using VBA, I set out the 64bit and 32bit API declarations that worked for me.
Code:
#If VBA7 Then
Private Declare PtrSafe Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As LongPtr, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare PtrSafe Function extractTextSlice Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As LongPtr, ByVal iPage As Integer, ByVal iSliceX As Integer, ByVal iSliceY As Integer, ByVal iSliceW As Integer, ByVal iSliceH As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare PtrSafe Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare PtrSafe Function getPageSize Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal iPage As Integer, ByRef dWidth As Double, ByRef dHeight As Double, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
#Else
Private Declare Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare Function extractTextSlice Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, ByVal iPage As Integer, ByVal iSliceX As Integer, ByVal iSliceY As Integer, ByVal iSliceW As Integer, ByVal iSliceH As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
Private Declare Function getPageSize Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal iPage As Integer, ByRef dWidth As Double, ByRef dHeight As Double, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
#End If
This originally started as a way to replace a finicky Adobe Acrobat dependency, so I don't have any active plans to add or develop significant functionality. Perhaps a function to extract a single page as a PNG.
I will however try to get automated builds working on GitHub for both the 32-bit and 64-bit versions, and unify your fork into one codebase.
Ah yeah, I guess 32-bit VBA kind of got left out in the cold with the new pdfium builds, so it's good to have this. There is no VBCDeclFix for 32-bit VBA. But just FYI, if anyone was wondering: you can use the "cdecl" 64-bit pdfium.dll for VBA7x64, because the cdecl/stdcall distinction is ignored there and everything uses the standard x64 calling convention. So just delete the 'CDecl' from those declares.
Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs
Happy to see that you created a fork of Glyph & Cog's XPDF! I did this many many years ago but didn't keep up with the many changes. I may have to replace my DLL with your newer version.
This originally started as a way to replace a finicky Adobe Acrobat dependency, so I don't have any active plans to add or develop significant functionality. Perhaps a function to extract a single page as a PNG.
Personally, I use the XPDF Toolset either to extract text or to generate PNG images of the PDF pages, so that would be a welcome addition. Also, I think it makes sense given the inclusion of the extractTextSlice and getPageSize functions: being able to visualise and draw on the page would help map the coordinates of the slice.
Originally Posted by pdey
I will however try to get automated builds working on GitHub for both the 32-bit and 64-bit versions, and unify your fork into one codebase.