Thread: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

  1. #1

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Post [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    After getting frustrated relying on Adobe Acrobat to extract text from PDFs, I started hunting around for an alternative solution.

    The first release of pdftotext.dll for VB6 is on GitHub. Binary download on the Releases page.

    Usage (current as of v4.03-2 release)
    Code:
    Private Declare Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
    Private Declare Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
    Private Declare Function extractTextSlice Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, ByVal iPage As Integer, ByVal iSliceX As Integer, ByVal iSliceY As Integer, ByVal iSliceW As Integer, ByVal iSliceH As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
    Private Declare Function getPageSize Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal iPage as Integer, ByRef dWidth as Double, ByRef dHeight as Double, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
    
    Dim strOutput As String
    Dim Width As Double, Height As Double
    Dim pages As Integer, ret As Integer
    
    pages = getNumPages("filename.pdf", AddressOf LogCallback, "pass", "anotherpass")
    ret = extractText("filename.pdf", VarPtr(strOutput), 1, 3, "UTF-8", "rawOrder", AddressOf LogCallback, "pass", "anotherpass")
    ret = extractTextSlice("filename.pdf", VarPtr(strOutput), 1, 207, 100, 300, 200, "UTF-8", "table", AddressOf LogCallback, "pass", "anotherpass")
    ret = getPageSize("filename.pdf", 1, Width, Height, AddressOf LogCallback, "pass", "anotherpass")
    
    ' Must be in a standard module (not Form or Class)
    Public Sub LogCallback(ByVal str As String)
    	Debug.Print "Log: " & str
    End Sub
    Almost all arguments are optional. For example, the following works:
    Code:
    Dim strOutput As String
    Dim Width As Double, Height As Double
    Dim pages As Integer, ret As Integer
    
    pages = getNumPages("filename.pdf")
    ret = extractText("filename.pdf", VarPtr(strOutput))
    ret = extractTextSlice("filename.pdf", VarPtr(strOutput), 1, 207, 100, 300, 200) 'However, you probably want to use the "table" layout
    ret = getPageSize("filename.pdf", 1, Width, Height)
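    As a rough sketch of how getPageSize and extractTextSlice might be combined (this is not taken from the README): the helper below reads the size of page 1 and requests the top half of that page with the "table" layout. ExtractTopHalf is a hypothetical name. The code assumes that the slice coordinates use the same units as getPageSize and that the origin is the top-left corner of the page, neither of which is confirmed in this thread. The meaning of the return values is also not documented here, so they are not interpreted.

    ```vb
    ' Declarations copied from the usage example above (standard module).
    Private Declare Function getPageSize Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal iPage As Integer, ByRef dWidth As Double, ByRef dHeight As Double, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
    Private Declare Function extractTextSlice Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, ByVal iPage As Integer, ByVal iSliceX As Integer, ByVal iSliceY As Integer, ByVal iSliceW As Integer, ByVal iSliceH As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer

    ' Hypothetical helper: returns the text found in the top half of page 1.
    Public Function ExtractTopHalf(ByVal pdfPath As String) As String
        Dim pageWidth As Double, pageHeight As Double
        Dim ret As Integer
        Dim strOutput As String

        ' Fill pageWidth/pageHeight for page 1.
        ret = getPageSize(pdfPath, 1, pageWidth, pageHeight)

        ' ASSUMPTION: slice x/y/w/h share the units and origin of getPageSize.
        ret = extractTextSlice(pdfPath, VarPtr(strOutput), 1, 0, 0, _
                               CInt(pageWidth), CInt(pageHeight / 2), "UTF-8", "table")

        ExtractTopHalf = strOutput
    End Function
    ```

    Calling Debug.Print ExtractTopHalf("filename.pdf") from the Immediate window is enough to try it.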
    Last edited by pdey; Dec 28th, 2024 at 04:54 PM. Reason: Updated code snippet to reflect latest version

  2. #2
    Fanatic Member
    Join Date
    Jun 2016
    Location
    España
    Posts
    563

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Very good work.
    It would be good if the problems with accents could be solved.

    Code:
    FORMACIÓN
    - Cursos básicos de prevención de riesgos laborales. - Formación adaptada a tu puesto de trabajo. - TPC (tarjeta profesional para la Construcción, Metal,
    Madera, Vidrio y Cerámica). - Amianto. - Manipulador de alimentos.
    Regards

  3. #3
    Fanatic Member
    Join Date
    Jan 2015
    Posts
    625

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Nice DLL, and fast.
    But indeed, accent chars are not recognized

  4. #4

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by yokesee View Post
    Very good work.
    It would be good if the problems with accents could be solved.

    Code:
    FORMACIÓN
    - Cursos básicos de prevención de riesgos laborales. - Formación adaptada a tu puesto de trabajo. - TPC (tarjeta profesional para la Construcción, Metal,
    Madera, Vidrio y Cerámica). - Amianto. - Manipulador de alimentos.
    Regards
    Can you please raise an issue in GitHub and attach an example document?
    Looks like a UTF8 encoding issue.

  5. #5
    Lively Member
    Join Date
    Feb 2006
    Posts
    116

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Very fast! Thank you.
    Can you add a new function extractTextFromRect(iPage, x0, y0, x1, y1) ? Very helpful for table layout.
    Last edited by cliv; Dec 5th, 2024 at 07:55 AM.

  6. #6
    Lively Member
    Join Date
    May 2021
    Posts
    118

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Hi. This is great. I tried it in TwinBasic 32-bit, and it worked flawlessly.

    I have no experience compiling DLLs, using CMAKE, etc - is it possible/easy enough to compile the DLL in 64bit too? Say, for consumption in VBA?

  7. #7
    PowerPoster
    Join Date
    Jul 2010
    Location
    NYC
    Posts
    6,636

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by Dan_W View Post
    Hi. This is great. I tried it in TwinBasic 32-bit, and it worked flawlessly.

    I have no experience compiling DLLs, using CMAKE, etc - is it possible/easy enough to compile the DLL in 64bit too? Say, for consumption in VBA?
    Certainly convenient but for more than simple needs I'd just use the regular dll... Usually someone posts builds somewhere. That's what I did for pdfium; though for VBA I didn't see a recent build source with _stdcall so VBA 32bit is a problem if you need features added after 2018.
    It's a few more lines but you can see how to get text with 64bit compatibility with pdfium in my gPdfMerge project.

    If you use the DLLs from the original version they'd support VBA32 instead of just VB6 (via VBCDeclFix)/tB32/tB64/VBA64 like the ones in the latest version.

  8. #8

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by Dan_W View Post
    is it possible/easy enough to compile the DLL in 64bit too? Say, for consumption in VBA?
    This is probably just a one-line change:
    https://github.com/peterdey/pdftotex...eLists.txt#L13

    Change WIN32 to x64.
    Last edited by pdey; Dec 28th, 2024 at 04:30 PM.

  9. #9

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    For those encoding issues (i.e. with accented characters) - please test the latest build.
    Details are in the issue report.
    Note that code changes are required.

    It would be helpful if you could supply some "clean" examples of PDFs with the issue - i.e. not scanned & OCR'd documents.
    Last edited by pdey; Dec 28th, 2024 at 04:31 PM.

  10. #10
    Fanatic Member
    Join Date
    Jan 2015
    Posts
    625

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Aloa,
    It is better, but I had to add this line:
    strOutput = Replace(strOutput, Chr$(0), vbNullString)
    The extracted text: "N u m é r o d e c l i e n t : 2 1 5 3 9 7 8 3 4 0 N u m é r o d e f a c t u r e : 7 0 8 0 1 1 9 9 2 2 9 2 "
    It should be: "Numéro de client: 2 153 978 340 Numéro de facture: 708 011 992 292"

  11. #11
    Fanatic Member
    Join Date
    Jun 2016
    Location
    España
    Posts
    563

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    This new version does not work for me.
    ret returns 0.

    Code:
    Option Explicit
    Private Declare Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
    Private Declare Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByRef lpTextOutput As Long, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
    
    Private Sub Command1_Click()
        Dim strOutput As Long
        Dim pages As Integer
        Dim ret As Integer
        pages = getNumPages("filename.pdf", AddressOf LogCallback, "pass", "anotherpass")
        LabelNumberpages.Caption = pages
        ret = extractText("filename.pdf", VarPtr(strOutput), 1, 3, "UTF-8", "rawOrder", AddressOf LogCallback, "pass", "anotherpass")
        Msgbox strOutput
    End Sub
    
    Private Sub Command2_Click()
        Dim strOutput As Long
        Dim pages As Integer
        Dim ret As Integer
        pages = getNumPages("filename.pdf")
        LabelNumberpages.Caption = pages
        ret = extractText("filename.pdf", VarPtr(strOutput))
        Msgbox strOutput
    End Sub

  12. #12
    Fanatic Member
    Join Date
    Nov 2011
    Posts
    591

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    so i have it working with latest dll with basic function

    Code:
    Dim strOutput as String
    pages = getNumPages("filename.pdf")
    ret = extractText("filename.pdf", strOutput)
    but cannot use

    Code:
    ret = extractText("filename.pdf", strOutput, 1, 3, "UTF-8", "rawOrder", AddressOf LogCallback, "pass", "anotherpass")
    An error pops up: invalid use of AddressOf callback.

    Also, I have to put the full path to where the DLL is, otherwise it says it cannot find pdftotext.dll.

  13. #13
    PowerPoster
    Join Date
    Jul 2010
    Location
    NYC
    Posts
    6,636

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    is LogCallback in a standard module? Can't be in a form or class or UC unless you use tB or asm thunks.

  14. #14
    Fanatic Member
    Join Date
    Nov 2011
    Posts
    591

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by fafalone View Post
    is LogCallback in a standard module? Can't be in a form or class or UC unless you use tB or asm thunks.
    tks fafalone. yes i had in form. moved to module and works fine.

    also re the declaration. Why does
    Code:
     Private Declare Function getNumPages Lib "pdftotext.dll"
    not work and i have to specify the full path

    forget it, just compiled and left as
    Code:
     Private Declare Function getNumPages Lib "pdftotext.dll"
    and works fine.

    I am at work, so when I run as admin it must be having trouble finding the DLL because it is running under a different account.

  15. #15

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by Thierry69 View Post
    Aloa,
    It is better, but I had to add this line:
    strOutput = Replace(strOutput, Chr$(0), vbNullString)
    The extracted text: "N u m é r o d e c l i e n t : 2 1 5 3 9 7 8 3 4 0 N u m é r o d e f a c t u r e : 7 0 8 0 1 1 9 9 2 2 9 2 "
    It should be: "Numéro de client: 2 153 978 340 Numéro de facture: 708 011 992 292"
    If you have a NULL for every second character, it is likely that you have not updated the function declaration, and are not passing the return variable using VarPtr.

    If the function declaration still uses String instead of Long for lpTextOutput, then VB6 assumes the text coming back is ANSI, not Unicode, and re-encodes it as UTF-16, introducing the NULL characters.
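
    The effect can be reproduced without the DLL. The sketch below is only an illustration of the mechanism described above, not code from the project: it uses StrConv with vbUnicode to widen bytes that are already UTF-16, the same kind of conversion VB6 applies to a buffer it believes is ANSI. DemoDoubleWidening is a hypothetical name.

    ```vb
    ' Illustration only: widen bytes that are already UTF-16 as if they were ANSI.
    Public Sub DemoDoubleWidening()
        Dim original As String
        Dim widened As String

        original = "client"

        ' StrConv(..., vbUnicode) treats every byte of the string data as one
        ' ANSI character. The UTF-16 bytes of "client" are 63 00 6C 00 ..., so
        ' every second character of the result is Chr$(0).
        widened = StrConv(original, vbUnicode)

        ' The widened string should be twice as long as the original.
        Debug.Print Len(original), Len(widened)

        ' Stripping the NULLs recovers plain ASCII text, which is why the
        ' Replace workaround appears to help.
        Debug.Print Replace(widened, vbNullChar, vbNullString) = original
    End Sub
    ```

    For characters outside ASCII the Replace workaround may still give wrong results, so updating the declaration and passing VarPtr(strOutput) is the fix to prefer.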

    Please see README.md for example code with correct usage.

    Quote Originally Posted by yokesee View Post
    This new version does not work for me.
    ret returns 0.
    Your strOutput is defined as Long, not as a String - so obviously, no string is returned.

    Please see README.md for example code with correct usage.
    Last edited by pdey; Dec 10th, 2024 at 07:05 AM.

  16. #16
    Fanatic Member
    Join Date
    Jan 2015
    Posts
    625

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    haven't seen it, sorry

  17. #17
    Fanatic Member
    Join Date
    Jun 2016
    Location
    España
    Posts
    563

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    The examples don't work for me with the new version of the dll
    xpdf-dll/issues/1

    Quote Originally Posted by Thierry69 View Post
    haven't seen it, sorry
    I'll solve your problem with the DLL like this.
    Code:
    Private Declare Function SetCurrentDirectory Lib "kernel32" Alias "SetCurrentDirectoryA" (ByVal lpPathName As String) As Long
    
    Private Sub Form_Load()
        SetCurrentDirectory App.Path
    End Sub

  18. #18

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by yokesee View Post
    I'll solve your problem with the DLL like this.
    As with all DLLs, the DLL file is located using Windows' Dynamic-link library search order.
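
    One common way to avoid depending on the current directory is to load the DLL explicitly by full path before the first call through a Declare statement. Once a module named pdftotext.dll is loaded into the process, later Declare calls generally resolve to that loaded copy. A minimal sketch, assuming the DLL sits in the same folder as the executable (PreloadPdfToText is a hypothetical helper name, not part of pdftotext.dll):

    ```vb
    Private Declare Function LoadLibrary Lib "kernel32" Alias "LoadLibraryA" (ByVal lpLibFileName As String) As Long

    ' Hypothetical helper: call once at startup, before any pdftotext.dll function.
    ' Returns True if the DLL was loaded from the application folder.
    Public Function PreloadPdfToText() As Boolean
        Dim dllPath As String

        dllPath = App.Path
        ' App.Path has no trailing backslash except for a drive root such as "C:\".
        If Right$(dllPath, 1) <> "\" Then dllPath = dllPath & "\"
        dllPath = dllPath & "pdftotext.dll"

        ' LoadLibrary returns 0 on failure and a module handle otherwise.
        PreloadPdfToText = (LoadLibrary(dllPath) <> 0)
    End Function
    ```

    Unlike SetCurrentDirectory, this does not change where relative file paths elsewhere in the program point.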

  19. #19

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by cliv View Post
    Can you add a new function extractTextFromRect(iPage, x0, y0, x1, y1) ? Very helpful for table layout.
    New branch with my first attempt at this:
    https://github.com/peterdey/pdftotex...tractTextSlice

    Binary DLL is available here:
    https://github.com/peterdey/pdftotex...cts/2308295086

    I've found the "table" layout seems to be the most effective/consistent here.
    Last edited by pdey; Dec 28th, 2024 at 04:37 PM.

  20. #20
    Lively Member
    Join Date
    Feb 2006
    Posts
    116

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    If extractText does not work with the new version of the DLL, the problem is in the declaration of extractText:
    Code:
    replace 
    ByRef lpTextOutput As Long with ByVal lpTextOutput As Long
    For extractTextSlice, a new function getPageSize(iPage, dWidth, dHeight) would be very helpful.
    Last edited by cliv; Dec 13th, 2024 at 02:37 AM.

  21. #21
    Fanatic Member
    Join Date
    Jun 2016
    Location
    España
    Posts
    563

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by cliv View Post
    For me Problem is in declaration of extractText:
    Code:
    replace 
    ByRef lpTextOutput As Long with ByVal lpTextOutput As Long
    For extractTextSlice a new function would be very helpful getPageSize(iPage, dWidth, dHeight)

    Exactly, that's the mistake, now everything works for me.
    Thank you very much

  22. #22
    Lively Member
    Join Date
    May 2021
    Posts
    118

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by pdey View Post
    This is probably just a one-line change:
    https://github.com/peterdey/xpdf-dll...eLists.txt#L13

    Change WIN32 to x64.
    I'm sorry, I only just saw that you had responded.

    Sure enough, that adjustment managed to get the DLL to compile. I had my doubts when I saw the flurry of WARNING messages, but it did actually compile a 64bit DLL.

    I fully expected it would crash, though, because nothing with 64bit VBA is ever straightforward or painless, but I rewrote the declarations, and it kinda worked. So the getNumPages function worked fine, but extractText didn't appear to do anything at all. It wasn't until I adjusted the declaration as per Cliv's message above that it came back with something. Anyway, after a little bit more work, here are the 64bit API declarations and code that worked for me:

    Code:
    Private Declare PtrSafe Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
    Private Declare PtrSafe Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As LongPtr, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
    
    Sub Test()
      
      Dim strOutput As String, Filename As String, Ret As Integer
      Filename = "C:\VBA\TestPDF.pdf"
      Ret = extractText(Filename, VarPtr(strOutput), 1, 3, "UTF-8", "rawOrder", AddressOf LogCallback)
      Debug.Print StrConv(strOutput, vbUnicode)
      
    End Sub
    
    Sub Test2()
    
      Dim strOutput As String, Filename As String, Pages As Long, Ret As Integer
      Filename = "C:\VBA\TestPDF.pdf"
      Pages = getNumPages(Filename)
      Ret = extractText(Filename, VarPtr(strOutput), 1, Pages \ 2)
      Debug.Print StrConv(strOutput, vbUnicode)
      
    End Sub
    
    Public Sub LogCallback(ByVal str As String)
      
      Debug.Print "Log: " & str
    
    End Sub
    Thank you again!

  23. #23

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by cliv View Post
    Problem is in declaration of extractText:
    Code:
    replace 
    ByRef lpTextOutput As Long with ByVal lpTextOutput As Long
    You're absolutely right - I missed this update in the documentation.
    Updated on GitHub. Thanks!

    Quote Originally Posted by cliv View Post
    For extractTextSlice a new function would be very helpful getPageSize(iPage, dWidth, dHeight)
    New version on the extractTextSlice branch.
    Binary DLL here: https://github.com/peterdey/pdftotex...cts/2321553571
    Please see README.md (on the branch) for usage documentation.
    Last edited by pdey; Dec 28th, 2024 at 04:37 PM.

  24. #24
    Fanatic Member
    Join Date
    Jun 2016
    Location
    España
    Posts
    563

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    good job pdey

  25. #25
    Fanatic Member
    Join Date
    Nov 2011
    Posts
    591

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by pdey View Post
    You're absolutely right - I missed this update in the documentation.
    Updated on GitHub. Thanks!



    New version on the extractTextSlice branch.
    Binary DLL here: https://github.com/peterdey/xpdf-dll...cts/2321553571
    Please see README.md (on the branch) for usage documentation.
    the binary link is not working, i get a 404 page error

  26. #26

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by k_zeon View Post
    the binary link is not working, i get a 404 page error
    I've tested the link on two different machines - works on both...

  27. #27
    Fanatic Member
    Join Date
    Nov 2011
    Posts
    591

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by pdey View Post
    I've tested the link on two different machines - works on both...
    I am not logged in; would this make a difference? I am on my work PC and this too gives the 404 error page for GitHub.
    The link provided is what I clicked on, but in the image the link has changed.
    [Attached image: 404.jpg]
    Last edited by k_zeon; Dec 16th, 2024 at 06:38 AM.

  28. #28

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by k_zeon View Post
    I am not logged in , would this make a difference.
    Good point. Looks like GitHub requires you to be logged in to download automated build artefacts.

  29. #29
    Fanatic Member
    Join Date
    Nov 2011
    Posts
    591

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by pdey View Post
    Good point. Looks like GitHub requires you to be logged in to download automated build artefacts.
    ok, tks. will do this one at home. tks

  30. #30
    PowerPoster
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    10,665

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    @pdey:

    It'd sure be nice if we had this all in pure VB6 code. I took a look at your GitHub fork, and there are all kinds of libraries pulled together to do this. In addition, it seems to be a conglomeration of C, C++, & Python. It's entirely too much for me to maintain motivation to dig through and translate. But maybe there are core pieces that outline how to read the PDF structure and extract text that you could point out, since you're obviously familiar with it at this point.
    Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.

  31. #31
    PowerPoster
    Join Date
    Jul 2010
    Location
    NYC
    Posts
    6,636

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    pdfium is just a single pure c++ dll, so that would be *easier* to port... but its still a massive undertaking, you'd be hard pressed to find someone willing to make that time commitment for free, among the dwindling active population of people who know both vb6 and c++ well enough to port.

    A standard dll doesn't have the registration hell of activex so it's not that bad to have one. Could always look at static linking when tB is stable and complete enough for you holdouts.

  32. #32
    Addicted Member
    Join Date
    Feb 2022
    Posts
    214

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by cliv View Post

    For extractTextSlice a new function would be very helpful getPageSize(iPage, dWidth, dHeight)
    Yes, I've been looking for a pdf text extractor that can read the margins (formatting) so the extracted text could be easily rebuilt in MSoffice or OpenOffice. This would be an excellent addition!

  33. #33

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Quote Originally Posted by Elroy View Post
    It'd sure be nice if we had this all in pure VB6 code.
    I'm not sure I see the point. Having this as a single, easily redistributable DLL with no registration required seems convenient enough. Regardless, my initial motivation was replacing Adobe Acrobat with something with a substantially lower installation/support complexity. According to @Dan_W, this seems to even work in 64-bit VBA with no actual code changes.

    Quote Originally Posted by Elroy View Post
    But maybe there are core pieces that outline how to read the PDF structure and extract text that you could point out, since you're obviously familiar with it at this point.
    Porting this to pure VB6 would be a non-trivial project.

    Text in PDF documents is non-linear. Xpdf does not simply "extract" the text, but rather renders the text, then returns the text rendering so that the text is in the correct order, as would be seen on screen. The bulk of what you would want to port is probably in xpdf/: PDFDoc.cc, Catalog.cc (describing the document structure), Page.cc and TextOutputDev.cc (which renders the text).
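
    To make the ordering problem concrete, here is a toy sketch in VB6. It is not Xpdf's algorithm (TextOutputDev.cc is far more involved); it only shows why positions matter: fragments arrive in arbitrary order and have to be sorted top-to-bottom, then left-to-right, before they read correctly. TextFragment, ReadingOrder and ComesBefore are hypothetical names, and the sketch assumes each fragment already has a known position with Y measured from the top of the page.

    ```vb
    ' Place in a standard module. Toy illustration only, not Xpdf's method.
    Public Type TextFragment
        X As Double
        Y As Double       ' ASSUMPTION: distance from the top of the page
        Text As String
    End Type

    ' Sorts frags in place (insertion sort) and returns the joined text.
    Public Function ReadingOrder(frags() As TextFragment) As String
        Dim i As Long, j As Long
        Dim tmp As TextFragment

        For i = LBound(frags) + 1 To UBound(frags)
            tmp = frags(i)
            j = i - 1
            ' VB6 does not short-circuit And, so the bounds check and the
            ' comparison are kept in separate statements.
            Do While j >= LBound(frags)
                If Not ComesBefore(tmp, frags(j)) Then Exit Do
                frags(j + 1) = frags(j)
                j = j - 1
            Loop
            frags(j + 1) = tmp
        Next i

        For i = LBound(frags) To UBound(frags)
            ReadingOrder = ReadingOrder & frags(i).Text & " "
        Next i
        ReadingOrder = RTrim$(ReadingOrder)
    End Function

    ' True if a should be read before b: higher on the page first, then further left.
    Private Function ComesBefore(a As TextFragment, b As TextFragment) As Boolean
        If a.Y <> b.Y Then
            ComesBefore = (a.Y < b.Y)
        Else
            ComesBefore = (a.X < b.X)
        End If
    End Function
    ```

    Real documents also need line grouping, column detection and rotation handling, which is a large part of why a pure VB6 port would be non-trivial.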

    Quote Originally Posted by taishan View Post
    Yes, I've been looking for a pdf text extractor that can read the margins (formatting) so the extracted text could be easily rebuilt in MSoffice or OpenOffice. This would be an excellent addition!
    A link to a new test version is in Post #23 which includes this.

  34. #34
    Addicted Member
    Join Date
    Feb 2022
    Posts
    214

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    @pdey Thanks for the dll. Here's a complete project with everything needed to test it.
    I'm pretty sure the xpdf declaration returns should be As Long instead of As Integer in VB6.
    I am out of time working with this, so please modify this project to show the full capabilities of the dll.
    I'd especially like to know how to retrieve the margins in Px or inches!
    Cheers
    Attached Files
    Last edited by taishan; Dec 19th, 2024 at 01:58 PM.

  35. #35

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Version 4.03-2 released

    Version 4.03-2 released

    Note the repository URL has changed from "xpdf-dll" to "pdftotext-dll"

    Breaking changes - Refer to the updated example code in README.md
    • Function definition has changed (lpTextOutput is now Long) to allow raw Unicode text to be returned
    • lpTextOutput must now be passed using VarPtr


    New features
    • extractTextSlice function to extract the text from a specified rectangle
    • getPageSize to return the dimensions of the page


    Bug fixes
    • Output now returned as a correctly-encoded UTF-16 BSTR (see "Breaking changes")

  36. #36
    Lively Member
    Join Date
    May 2021
    Posts
    118

    Re: Version 4.03-2 released

    Thank you for the update. Out of curiosity, do you plan on adding any further functionality?

    Following your instructions from last time, I have compiled the updated 64bit DLL, and can confirm that it worked perfectly like last time.

    Having the output correctly encoded is a huge help - thank you. As before, on the off-chance it might be useful to someone else using VBA, I set out the 64bit and 32bit API declarations that worked for me.

    Code:
      #If VBA7 Then
        Private Declare PtrSafe Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As LongPtr, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare PtrSafe Function extractTextSlice Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As LongPtr, ByVal iPage As Integer, ByVal iSliceX As Integer, ByVal iSliceY As Integer, ByVal iSliceW As Integer, ByVal iSliceH As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare PtrSafe Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare PtrSafe Function getPageSize Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal iPage As Integer, ByRef dWidth As Double, ByRef dHeight As Double, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
      #Else
        Private Declare Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare Function extractTextSlice Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, ByVal iPage As Integer, ByVal iSliceX As Integer, ByVal iSliceY As Integer, ByVal iSliceW As Integer, ByVal iSliceH As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare Function getPageSize Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal iPage as Integer, ByRef dWidth as Double, ByRef dHeight as Double, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
      #End If

  37. #37

    Thread Starter
    New Member
    Join Date
    Dec 2024
    Posts
    13

    Re: Version 4.03-2 released

    This originally started as a way to replace a finicky Adobe Acrobat dependency, so I don't have any active plans to add or develop significant functionality. Perhaps a function to extract a single page as a PNG.

    I will however try to get automated builds working on GitHub for both the 32-bit and 64-bit versions, and unify your fork into one codebase.

  38. #38
    PowerPoster
    Join Date
    Jul 2010
    Location
    NYC
    Posts
    6,636

    Re: Version 4.03-2 released

    Quote Originally Posted by Dan_W View Post
    Thank you for the update. Out of curiosity, do you plan on adding any further functionality?

    Following your instructions from last time, I have compiled the updated 64bit DLL, and can confirm that it worked perfectly like last time.

    Having the output correctly encoded is a huge help - thank you. As before, on the off-chance it might be useful to someone else using VBA, I set out the 64bit and 32bit API declarations that worked for me.

    Code:
      #If VBA7 Then
        Private Declare PtrSafe Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As LongPtr, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare PtrSafe Function extractTextSlice Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As LongPtr, ByVal iPage As Integer, ByVal iSliceX As Integer, ByVal iSliceY As Integer, ByVal iSliceW As Integer, ByVal iSliceH As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare PtrSafe Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare PtrSafe Function getPageSize Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal iPage As Integer, ByRef dWidth As Double, ByRef dHeight As Double, Optional ByVal lpLogCallbackFunc As LongPtr, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
      #Else
        Private Declare Function getNumPages Lib "pdftotext.dll" (ByVal lpFileName As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare Function extractText Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, Optional ByVal iFirstPage As Integer, Optional ByVal iLastPage As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare Function extractTextSlice Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal lpTextOutput As Long, ByVal iPage As Integer, ByVal iSliceX As Integer, ByVal iSliceY As Integer, ByVal iSliceW As Integer, ByVal iSliceH As Integer, Optional ByVal lpTextOutEnc As String, Optional ByVal lpLayout As String, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
        Private Declare Function getPageSize Lib "pdftotext.dll" (ByVal lpFileName As String, ByVal iPage as Integer, ByRef dWidth as Double, ByRef dHeight as Double, Optional ByVal lpLogCallbackFunc As Long, Optional ByVal lpOwnerPassword As String, Optional ByVal lpUserPassword As String) As Integer
      #End If
    Ah yeah I guess 32bit VBA kind of got left out in the cold with the new pdfium builds, so good to have this. No VBCdeclFix for 32bit VBA. But just FYI if anyone was wondering... you can use the "cdecl" 64bit pdfium.dll for VBA7x64 because the cdecl or stdcall calling convention is ignored and all will use the standard x64 calling convention. So just delete the 'CDecl' from those declares.

  39. #39
    Fanatic Member
    Join Date
    Aug 2011
    Location
    Palm Coast, FL
    Posts
    653

    Re: [VB6] pdftotext.dll - VB6-compatible DLL for extracting text from PDFs

    Happy to see that you created a fork of Glyph & Cog's XPDF! I did this many many years ago but didn't keep up with the many changes. I may have to replace my DLL with your newer version.

  40. #40
    Lively Member
    Join Date
    May 2021
    Posts
    118

    Re: Version 4.03-2 released

    Quote Originally Posted by pdey View Post
    This originally started as a way to replace a finicky Adobe Acrobat dependency, so I don't have any active plans to add or develop significant functionality. Perhaps a function to extract a single page as a PNG.
    Personally, I use the XPDF Toolset either to extract text or to generate PNG images of the PDF pages, so that would be a welcome addition. Also, I think it makes sense given the inclusion of the extractTextSlice and getPageSize functions - being able to visualise and draw on the page would help map the coordinates of the slice.

    Quote Originally Posted by pdey View Post
    I will however try to get automated builds working on GitHub for both the 32-bit and 64-bit versions, and unify your fork into one codebase.
    That would be great, thank you.
