Results 1 to 14 of 14

Thread: VB6 Search Box to search for text inside multiple word documents

Threaded View

  1. #8
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    10,913

    Re: VB6 Search Box to search for text inside multiple word documents

    Yes, you've already gotten some pretty good advice, and it does make a great deal of difference if we're talking about a .DOC or a .DOCX file.

    Also, you say that you ...

    Quote Originally Posted by irfi View Post
    don't want to open the files while they are being searched.
    I assume that you mean you don't want to use Word automation from VB6. You must be clear about those things. If you don't wish to open the files in any way, then I can't imagine how you could search them.


    It's been a while since I've messed this but I have in the past. For the older .DOC files, I believe they put all of their text in the files as ASCII/ANSI (although the entire file is not ASCII/ANSI). Therefore, they can be searched directly, opening them as binary. Here's a binary file search routine I often use with Excel files (with the older .XLS files also placing text in the files as ASCII/ANSI). Also, this code has the advantage of being able to handle a file of any size, because it doesn't read it all into memory.

    Code:
    Option Explicit
    
    Public Function BinaryFileSearch(sFileSpec As String, sSearchString As String, Optional bCaseSensitive As Boolean = True, _
                                     Optional lStartPosition As Long = 1, Optional lFoundPosition As Long, _
                                     Optional lFileHandleToUse As Long = 0) As Boolean
        ' Returns true if sSearchString is found, else false.
        ' sSearchString can be no longer than 128.
        ' This will work even if Word or Excel has the file open.
        ' The lFoundPosition is a return argument.
        '    It returns the latest position before lStartPosition (if there isn't one after lStartPosition) or
        '    it returns the earliest position after lStartPosition.
        Dim iFle As Long
        Dim FileData As String
        Dim FilePointer As Long
        Dim FileLength As Long
        Dim sFind As String
        Dim iPos As Long
        '
        If Len(sSearchString) > 128 Then
            Error 1234
            Exit Function
        End If
        '
        If lFileHandleToUse = 0 Then
            If Not bFileExists(sFileSpec) Then Exit Function
            iFle = FreeFile
            On Error Resume Next
            Open sFileSpec For Binary As iFle
            If Err <> 0 Then Close iFle: Exit Function
            On Error GoTo 0
            If Len(iFle) = 0 Then Close iFle: Exit Function
        Else
            iFle = lFileHandleToUse ' The file MUST be opened BINARY for this to work.
        End If
        '
        If bCaseSensitive Then
            sFind = sSearchString
        Else
            sFind = LCase$(sSearchString)
        End If
        FileData = Space(1024)
        FileLength = LOF(iFle)
        FilePointer = lStartPosition
        Do
            If FilePointer > FileLength Then Exit Do
            Get iFle, FilePointer, FileData
            If Not bCaseSensitive Then FileData = LCase$(FileData)
            iPos = InStr(FileData, sFind)
            If iPos <> 0 Then
                lFoundPosition = FilePointer + iPos - 1
                If lFoundPosition >= lStartPosition Then
                    BinaryFileSearch = True
                    Exit Do
                End If
            End If
            FilePointer = ((FilePointer + 1024) - Len(sFind)) + 1
        Loop
        If lFileHandleToUse = 0 Then Close iFle
    End Function
    
    Public Function bFileExists(fle As String) As Boolean
        On Error GoTo FileExistsError
        ' If no error then something existed.
        bFileExists = (GetAttr(fle) And vbDirectory) = 0
        Exit Function
    FileExistsError:
        bFileExists = False
        Exit Function
    End Function

    Now, westconn1 already mentioned this, but the newer .DOCX files are actually a zipped collection of files. If you use something like 7zip, it's easy to see this. Just right-click the .DOCX file, say 7Zip / Open Archive, and they'll open right up. Once open, you'll want to go into the "word" folder and you'll see a document named "document.xml". That's an ANSI file that you can easily search. Therefore, with these newer .DOCX files, you've got the extra step of unzipping them and extracting this "document.xml" file into some temporary location.

    Regarding the temporary location, I'd tend to use the API and get the Windows temp folder for this. Here's code to do that:

    Code:
    Option Explicit
    '
    Private Declare Function GetTempPath Lib "kernel32" Alias "GetTempPathA" (ByVal nBufferLength As Long, ByVal lpBuffer As String) As Long
    '
    
    Public Function WindowsTempPath()
        Dim s As String
        Const MAX_PATH = 1024 ' Maximum long filename path length
        '
        s = Space$(MAX_PATH)
        GetTempPath MAX_PATH, s
        s = Left$(s, InStr(s, vbNullChar) - 1)
        If Right$(s, 1) <> "\" Then s = s & "\"
        WindowsTempPath = s
    End Function
    Now, regarding unzipping a file, this isn't something I regularly do. I know that there are several threads around here that discuss this. And there's probably several threads in the VB6 codebank. However, here's a piece of code I've got in my VB6 junk-drawer. I can't say it's well tested, but here it is. Also, this requires a DLL reference to the "Microsoft Shell Controls And Automation" (shell32.dll). I'll leave you on your own to work that out, but it should be pretty easy. If you're on a 32-bit OS, that DLL should be in your Windows/System32 folder. If you're on a 64-bit OS, it should be in Windows/SysWOW64.

    Code:
    '
    ' Set reference to "Microsoft Shell Controls and Automation" to use this procedure!!!!!!!!!!
    '
    ' Be aware when using the shell automation interface to unzip files as it
    ' leaves copies of the zip files in the temp directory (defined by %TEMP%).
    ' Folders named "Temporary Directory X for demo.zip" are generated where X
    ' is a sequential number from 1 - 99.  When it reaches 99 you will then get
    ' a error dialog saying "The file exists" and it will not continue.
    ' I've no idea why Windows doesn't clean up after itself when unzipping files,
    ' but it is most annoying...
    '
    ' Ziping a file using the Windows Shell API creates another thread where the zipping is executed.
    ' This means that it is possible that this app would end before the zipping thread
    ' starts to execute which would cause the zip to never occur and you will end up with just
    ' an empty zip file. So wait a second and give the zipping thread time to get started.
    '
    ' If we're sure the program will continue then we can skip the pause.
    ' However, we must also be careful to not use unzipped files before they are fully unzipped.
    '
    Option Explicit
    '
    Public Enum ZipFlagsEnum
        ZipFlagDefault = 0              ' Default. No options specified.
        ZipFlagNoProgBar = 4            ' Do not display a progress dialog box.
        ZipFlagRenameTarget = 8         ' Rename the target file if a file exists at the target location with the same name.
        ZipFlagYesToAll = 16            ' Click "Yes to All" in any dialog box displayed.
        ZipFlagPreserveUndo = 64        ' Preserve undo information, if possible.
        ZipFlagOnlyWildcard = 128       ' Perform the operation only if a wildcard file name (*.*) is specified.
        ZipFlagProgNoName = 256         ' Display a progress dialog box but do not show the file names.
        ZipFlagNoConfirm = 512          ' Do not confirm the creation of a new directory if the operation requires one to be created.
        ZipFlagNoInterface = 1024       ' Do not display a user interface if an error occurs.
        ZipFlagNoSecurity = 2048        ' Do not copy security attributes.
        ZipFlagNoRecursion = 4096       ' Disable recursion through sub-folders.
        ZipFlagNoConnectedFiles = 8192  ' Do not copy connected files as a group. Only copy the specified files.
    End Enum
    '
    
    Private Sub CreateEmptyZipFile(sZipFileSpec As String)
        Dim EmptyZip() As Byte
        Dim i As Long
        Dim iFle As Long
        '
        ' Create zip header.
        ReDim EmptyZip(1 To 22)
        EmptyZip(1) = 80
        EmptyZip(2) = 75
        EmptyZip(3) = 5
        EmptyZip(4) = 6
        For i = 5 To UBound(EmptyZip)
            EmptyZip(i) = 0
        Next
        ' Create empty zip file with header.
        iFle = FreeFile
        Open sZipFileSpec For Binary Access Write As #iFle
        For i = LBound(EmptyZip) To UBound(EmptyZip)
            Put #iFle, , EmptyZip(i)
        Next
        Close #iFle
    End Sub
    
    Public Sub AddFileToZipFile(sSourceFileSpec As String, sZipFileSpec As String, _
                                Optional ZipFlags As ZipFlagsEnum = ZipFlagNoProgBar + ZipFlagYesToAll + ZipFlagNoConfirm + ZipFlagNoInterface)
        ' The zip file must already exist.
        Dim ShellClass  As Shell32.Shell
        Dim FileDest    As Shell32.Folder
        '
        Set ShellClass = New Shell32.Shell
        Set FileDest = ShellClass.NameSpace(sZipFileSpec)
        FileDest.CopyHere sSourceFileSpec, ZipFlags
    End Sub
    
    Public Sub AddFolderToZipFile(sSourceFolder As String, sZipFileSpec As String, _
                                  Optional ZipFlags As ZipFlagsEnum = ZipFlagNoProgBar + ZipFlagYesToAll + ZipFlagNoConfirm + ZipFlagNoInterface)
        ' The zip file must already exist.
        ' All sub-folders are added unless the ZipFlagNoRecursion flag is used.
        Dim ShellClass  As Shell32.Shell
        Dim FileSource  As Shell32.Folder
        Dim FileDest    As Shell32.Folder
        Dim FolderItems As Shell32.FolderItems
        '
        Set ShellClass = New Shell32.Shell
        Set FileSource = ShellClass.NameSpace(sSourceFolder)
        Set FileDest = ShellClass.NameSpace(sZipFileSpec)
        Set FolderItems = FileSource.Items
        FileDest.CopyHere FolderItems, ZipFlags
    End Sub
    
    Public Sub ExtractZipFileToFolder(sZipFileSpec As String, sDestFolder As String, _
                                      Optional ZipFlags As ZipFlagsEnum = ZipFlagNoProgBar + ZipFlagYesToAll + ZipFlagNoConfirm + ZipFlagNoInterface)
        ' The zip file must already exist.
        Dim ShellClass  As Shell32.Shell
        Dim FileSource  As Shell32.Folder
        Dim FileDest    As Shell32.Folder
        Dim FolderItems As Shell32.FolderItems
        '
        Set ShellClass = New Shell32.Shell
        Set FileSource = ShellClass.NameSpace(sZipFileSpec)
        Set FileDest = ShellClass.NameSpace(sDestFolder)
        Set FolderItems = FileSource.Items
        FileDest.CopyHere FolderItems, ZipFlags
    End Sub
    Now, you haven't specified how you're finding these Word documents. For simplicity, I'll assume that you know where they are. In that case, it's just some judicious use of the Dir$ function. If you don't know where they are, or if you wish to traverse all sub-folders, that's a more nettlesome problem. There's been some excellent code posted in the codebank lately regarding this. I've got my own, but I think the code in the codebank is better (or, at least, faster) than mine.

    Good Luck,
    Elroy

    EDIT: Regarding PDF files, I'm less familiar with the internal format of those. However, that BinarySearch procedure I gave you may work. It's definitely worth a try. However, they may use a UTF-16 or UTF-8 storage method, which presents a whole other set of problems. Or, they may embed formatting in with the text, which also causes problems for doing searches, especially if/when parts of a word or sentence are formatted differently from other parts. Word used to do that many moons ago, but they quit doing it LONG ago. I'm not as positive about the XML documents, but, as westconn1 stated, it really shouldn't be much work to parse out (and throw away) all the XML codes, leaving just the text.

    EDIT2: I've given you many (if not all) of the pieces, but it's up to you to fit them together.

    EDIT3: It dawns on me that maybe you want to tap into the Windows search routines from VB6 using some kind of API calls. That's something I've never done, and would be interested to see how to do it. If that's what you're trying to do, I look forward to the replies of others.
    Last edited by Elroy; Aug 16th, 2016 at 06:34 AM.
    Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width