This demonstrates a faster way than the ordinary string buffer method to search a file and pull out the full line that contains a given search string. It uses API calls to open the file and read it into a buffer, then runs InStr over the buffer to find a match. It also compares this against the ordinary string buffer method: the API version wins by a factor of about 3.5 for a 1 MB text file, which is not bad.
You could also open the file as binary and use InStr; I didn't bother with that, though.
I wrote a competing API code that does things slightly differently. Instead of using StrConv for all the data, it just reads it into a string variable directly. I also removed the complex InStr + InStrRev code and just put in InStr vs. InStrB with a simple string search for "abc".
With an 11.8 MB file, memory usage with my function was around 12 - 13 MB. Jmacp's original method jumped to around 36 - 37 MB.
On the speed side the differences are greater: my code is some 5 - 6 times faster, and the gap widens as the file size grows.
Just a reminder of the byte versions of string functions: InStrB, LeftB$, MidB$, RightB$, LenB, ChrB$, AscB. That would be "true" binary file handling using string functions, as no textual conversion takes place. Also if you're reading UTF-8 data it is much more straightforward to pass this kind of a string to Windows string conversion function and get out a string that is ready-to-use in VB6.
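A minimal sketch of what the byte versions do, assuming text that has already been converted to one byte per character (the sample text and the procedure name are made up for illustration):

```vb
Option Explicit

Public Sub ByteFunctionsDemo()
    Dim sAnsi As String, sFind As String

    ' One byte per character, as the data would arrive from an ANSI file.
    sAnsi = StrConv("one abc two", vbFromUnicode)
    sFind = StrConv("abc", vbFromUnicode)

    Debug.Print LenB(sAnsi)             ' 11 (bytes, not characters)
    Debug.Print InStrB(sAnsi, sFind)    ' 5 (byte position of "abc")

    ' Convert back only the small piece you actually want to display.
    Debug.Print StrConv(MidB$(sAnsi, 5, 3), vbUnicode)   ' abc
End Sub
```

The point of the design is that only the search term and the final result get converted; the bulk of the file data is never touched.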
Edit!
If you also need an ANSI Split, kinda like SplitB, see my QuickSplitB sub.
I should have said that my code obviously wasn't polished up. I was just throwing in some ideas; the whole InStr/InStrRev part was just to get to the end point quickly, while the API part (CreateFile, ReadFile) was the real substance. I am sure your version is better, well done!
Hello Merri,
I integrated your modified function into my application and it's by far the fastest file search I've used and makes the application perfectly usable now. It took just 4 minutes to scoot through a total of 6GB of CSV data. My question to you is: is it possible to use your function in conjunction with a regular expression pattern instead of a fixed search string, and still maintain its performance?
That would depend on the regular expression mechanism: if you can pass it a pointer to a buffer in memory, and if you can make repeated calls without the regular expression pattern being analyzed again on each call, then you could achieve pretty good speeds.
In comparison if you'd need to pass "normal" strings and needed a conversion, that would cause a massive amount of extra work, a bit like how Jmacp's original code is when compared to what I did.
It is all about keeping data unmodified as much as possible.
The leak is probably introduced by the way you use the code. For example, the PutMem4 part of the code places the created string into a string variable. If you do this in a loop and never assign vbNullString to the string variable, you never free the strings from memory and thus keep hogging more and more memory.
I believe that is accurate. Is there an API to free up the memory? When I set the string to "", it clears it out, but that has to happen within the loop.
You can't use "" because that allocates an empty string. You must use vbNullString. In this case the use of vbNullString is faster than using an API call.
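The difference between the two can be seen with StrPtr; this is only a small illustration, and the procedure name is made up:

```vb
Option Explicit

Public Sub NullStringDemo()
    Dim s As String

    s = ""
    Debug.Print StrPtr(s)   ' non-zero: an empty string is still an allocated string

    s = vbNullString
    Debug.Print StrPtr(s)   ' 0: the variable holds no string at all
End Sub
```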
Note that you can also just create the buffer once and keep filling it again and again, you don't need to create the buffer over and over again. Clearing up the buffer with vbNullString would be good practice (that was left out from that example... and it should have more comments).
I did what you recommended and updated the "" to vbNullString. That was a good recommendation.
I did what you said and now create the buffer only once, by moving the string allocation API call out of the loop. Somehow, that turned my data into gibberish. Was I supposed to move the PutMem4 out of the loop too?
Yes, the string allocation and PutMem4 should always go together and be as close to each other as possible. So if you move one you must move the other. Also, regarding the end of file when buffer will be larger than the remaining file, you must decrease the buffer size. Easiest way and probably the fastest is to use LeftB$.
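A rough sketch of the LeftB$ step, assuming the buffer is a byte string and that the read call reports how many bytes it actually delivered (the function and parameter names are hypothetical):

```vb
Option Explicit

' Returns only the part of the buffer that the last read actually filled.
Public Function TrimToBytesRead(ByRef sBuffer As String, ByVal lBytesRead As Long) As String
    If lBytesRead >= LenB(sBuffer) Then
        ' Full buffer: nothing to cut off.
        TrimToBytesRead = sBuffer
    ElseIf lBytesRead > 0 Then
        ' Last chunk of the file: keep only the bytes that were read.
        TrimToBytesRead = LeftB$(sBuffer, lBytesRead)
    End If
    ' With lBytesRead <= 0 the result stays a null string.
End Function
```

In a tight loop you would probably search the buffer directly while it is full and call something like this only for the final, shorter chunk, since returning the whole buffer makes a copy.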
I have a problem: if I search a txt file for IB, and the txt file contains GOTLIB, then it finds it. Is there some way to get a positive hit only if the file contains IB as a word of its own?
To keep it relatively fast you need to code additional conditions once you have found a match. Basically you check the character before and after the match to see whether it is or isn't something you want to be there. If the characters are what you don't want then you search again.
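One possible shape for that check, sketched for ordinary VB strings; the function names are made up, and treating only A-Z, a-z and 0-9 as "word" characters is an assumption you would adjust to your data. A byte-string variant would use InStrB, MidB$ and AscB in the same way.

```vb
Option Explicit

' Returns the position of sWord as a whole word in sText, or 0 if there is none.
Public Function FindWholeWord(ByRef sText As String, ByRef sWord As String, _
                              Optional ByVal lStart As Long = 1) As Long
    Dim lPos As Long

    If Len(sWord) = 0 Then Exit Function
    lPos = InStr(lStart, sText, sWord)
    Do While lPos > 0
        ' Accept the hit only if neither neighbour is a word character.
        If Not IsWordChar(sText, lPos - 1) Then
            If Not IsWordChar(sText, lPos + Len(sWord)) Then
                FindWholeWord = lPos
                Exit Function
            End If
        End If
        ' Unwanted neighbour: search again from the next position.
        lPos = InStr(lPos + 1, sText, sWord)
    Loop
End Function

' Positions outside the text count as "not a word character".
Private Function IsWordChar(ByRef sText As String, ByVal lPos As Long) As Boolean
    If lPos < 1 Or lPos > Len(sText) Then Exit Function
    Select Case AscW(Mid$(sText, lPos, 1))
        Case 48 To 57, 65 To 90, 97 To 122   ' 0-9, A-Z, a-z
            IsWordChar = True
    End Select
End Function
```

With this, searching "GOTLIB and IB" for "IB" skips the hit inside GOTLIB and returns the position of the standalone IB.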
Alternatively, if there is always a specific character before and after the string to be found, such as line change, then you can simply include them in your search.
There is no specific way to identify the words, so I have an mdb file with over 10000 words that the program has to go through. It takes the program 30 minutes to finish, so I am in search of some code that can reduce that time.
Open up a thread in the classic VB forum and post some of the code you use. People will probably be able to tell you about the issues in your existing code; in the best case only a few small things need to change to bring the speed to bearable levels. Also, try to say what information you want to get out, i.e. do you just need to know that the word is in the file, or is there something more?
Merri, thanks for code. I don't really understand it but I have cribbed your project and put one line:
Label2.Caption = InStrB(API_Merri, Find)
, into a loop, with the Find string being read from a file. I have output the time it takes to process each batch of 1000 searches. This shows that the search gets slower and slower.
1-1000, 2 secs
1000-2000 3 secs
2000 - 3000 4 secs etc etc.
The length of the Find string does not change. I have found that if I remove the InStrB search, or if the Find string is a constant, the speed does not deteriorate. The speed improves with shorter Find strings. Whether the Find string is found or not makes no difference. Also, the memory usage does not increase.
Any ideas would be greatly appreciated.
Here's my code, thanks in advance:
Code:
FileNo = FreeFile
Open TESTFILE For Input As #FileNo
StartTime = Now()
i = 0
Do While Not EOF(FileNo)
    Line Input #FileNo, Find
    Find = StrConv(Find, vbFromUnicode)
    Label2.Caption = InStrB(API_Merri, Find)
    i = i + 1
    If Int(i / 1000) = i / 1000 Then
        Debug.Print i & " - " & Format(Now() - StartTime, "hh:mm:ss")
        StartTime = Now()
    End If
Loop
Close #FileNo
The further down the match is in the file you are searching, the longer it takes to find. So if your later search keywords generally tend to sit towards the end of the searched file, those searches take longer.
Note that this may also mean there is room for further optimization. If the keywords are always found in file order, you don't need to search from the beginning of the file each time; simply continue from the last position. The same applies if it is possible to sort the keywords into the order in which they occur in the file.
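Continuing from the last position could look roughly like this. It is only a sketch: the function name is made up, and it is valid only under the stated assumption that the keywords occur in the file in the same order in which you search for them.

```vb
Option Explicit

' Searches sData from lResume onwards and moves lResume past the hit.
' Returns the byte position of the hit, or 0 if sFind was not found.
Public Function FindNextB(ByRef sData As String, ByRef sFind As String, _
                          ByRef lResume As Long) As Long
    If lResume < 1 Then lResume = 1
    FindNextB = InStrB(lResume, sData, sFind)
    ' On a miss lResume is left alone, so the next keyword starts from the same spot.
    If FindNextB > 0 Then lResume = FindNextB + LenB(sFind)
End Function
```

The caller keeps one Long variable alive across all keyword searches and passes it as lResume each time.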
On another note: as things keep getting faster, you don't want to update Label2.Caption on each loop iteration, because interacting with controls is slow. It may seem small, but in reality a lot happens each time you change something in a control (drawing to the screen, string storage, etc.).
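One way to keep the display without paying for it on every iteration is to store the result in a variable and touch the label only now and then. A small sketch, assuming it lives in the form that owns Label2; the procedure name and the interval of 1000 are arbitrary choices:

```vb
Option Explicit

' Call this from the search loop with the loop counter and the latest result.
Private Sub ShowProgress(ByVal lIteration As Long, ByVal lResult As Long)
    ' Touch the control only once per 1000 iterations.
    If lIteration Mod 1000 = 0 Then
        Label2.Caption = CStr(lResult)
        Label2.Refresh   ' repaint now, since a busy loop handles no paint messages
    End If
End Sub
```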
Merri, I like your code and want to use it in a project, but I want a multiple keyword search. Can I do this without looping? I have a great many keywords that I want to search for at once. Could you also let me know what performance issues I should expect?
Last edited by coolcurrent4u; Feb 4th, 2011 at 07:24 AM.
It would require more complex code than that. You can't use InStr, because it always looks for a single given keyword, so you would be forced to loop through all the data multiple times.
To make it more efficient and truly loop through the data just once you'd need to 1) sort the keywords, and 2) do the string matching manually against the keyword list. 3) As the keyword list is sorted, it is quite fast to know whether you've found what you're looking for: you don't need to check against all the strings, you just go on until you have either a perfect match, or only a partial match where the next keyword can't match. Finally, 4) applying a search algorithm such as binary search should make things quite fast, and those algorithms require the keywords to be sorted. You'll need only a couple of lookups into the keyword list instead of going through all the keywords. That is the power of sorting and a good search algorithm.
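The lookup part of that idea can be sketched as a plain binary search over a sorted keyword array. This is an illustration only: the function name is made up, the array is assumed to be sorted with the same binary comparison, and splitting the file data into candidate words is left out.

```vb
Option Explicit

' Returns the index of sWord in the sorted array sKeys, or LBound(sKeys) - 1 if absent.
Public Function FindKeyword(ByRef sKeys() As String, ByRef sWord As String) As Long
    Dim lLo As Long, lHi As Long, lMid As Long

    lLo = LBound(sKeys)
    lHi = UBound(sKeys)
    FindKeyword = lLo - 1                      ' "not found" marker
    Do While lLo <= lHi
        lMid = lLo + (lHi - lLo) \ 2           ' middle of the remaining range
        Select Case StrComp(sKeys(lMid), sWord, vbBinaryCompare)
            Case 0
                FindKeyword = lMid             ' perfect match
                Exit Function
            Case Is < 0
                lLo = lMid + 1                 ' keyword sorts before sWord: look higher
            Case Else
                lHi = lMid - 1                 ' keyword sorts after sWord: look lower
        End Select
    Loop
End Function
```

For a list of 10000 keywords this needs at most about 14 comparisons per candidate word, instead of up to 10000.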
I'm getting a 'Run-time error 9 subscript out of range' as indicated in the below code:
Code:
Public Function ApiReadFile(ByVal strFilename As String, ByVal strStringToFind As String) As String
    Dim hFile As Long, bContent() As Byte
    Dim FileLenght As Long, Result As Long
    hFile = CreateFile(strFilename, GENERIC_READ, FILE_SHARE_READ Or FILE_SHARE_WRITE, ByVal 0&, OPEN_EXISTING, 0, 0)
    FileLenght = GetFileSize(hFile, 0)
    SetFilePointer hFile, 0, 0, FILE_BEGIN
    ReDim bContent(1 To FileLenght) As Byte '<<--- Error 9
    ReadFile hFile, bContent(1), UBound(bContent), Result, ByVal 0&
    If Result <> UBound(bContent) Then MsgBox "Error reading file ..."
    CloseHandle hFile
    ApiReadFile = StrConv(bContent, vbUnicode)
    Label1.Caption = InStr(ApiReadFile, strStringToFind) 'Mid(ApiReadFile, InStrRev(ApiReadFile, vbNewLine, InStr(1, ApiReadFile, strStringToFind)), (InStr(InStr(1, ApiReadFile, strStringToFind), ApiReadFile, vbNewLine)) - InStrRev(ApiReadFile, vbNewLine, InStr(1, ApiReadFile, strStringToFind)))
    ReDim bContent(0) As Byte
End Function
What is the size of the tested file? If it is 0 bytes then the code fails as it does not check if the length is valid.
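A validity check of that kind could be sketched as follows. The Declare statements are the usual API Viewer ones; TryOpenForRead and its parameters are hypothetical names, not part of either sample project.

```vb
Option Explicit

Private Declare Function CreateFile Lib "kernel32" Alias "CreateFileA" ( _
    ByVal lpFileName As String, ByVal dwDesiredAccess As Long, _
    ByVal dwShareMode As Long, lpSecurityAttributes As Any, _
    ByVal dwCreationDisposition As Long, ByVal dwFlagsAndAttributes As Long, _
    ByVal hTemplateFile As Long) As Long
Private Declare Function GetFileSize Lib "kernel32" ( _
    ByVal hFile As Long, lpFileSizeHigh As Long) As Long
Private Declare Function CloseHandle Lib "kernel32" (ByVal hObject As Long) As Long

Private Const GENERIC_READ As Long = &H80000000
Private Const FILE_SHARE_READ As Long = &H1
Private Const FILE_SHARE_WRITE As Long = &H2
Private Const OPEN_EXISTING As Long = 3
Private Const INVALID_HANDLE_VALUE As Long = -1

' True only when the file opened and its size is between 1 byte and 2 GB - 1.
Public Function TryOpenForRead(ByVal sFile As String, ByRef hFile As Long, _
                               ByRef lSize As Long) As Boolean
    Dim lHigh As Long

    hFile = CreateFile(sFile, GENERIC_READ, FILE_SHARE_READ Or FILE_SHARE_WRITE, _
                       ByVal 0&, OPEN_EXISTING, 0, 0)
    If hFile = INVALID_HANDLE_VALUE Then Exit Function   ' missing file, no access, ...

    lSize = GetFileSize(hFile, lHigh)
    ' Reject empty files, failed calls (-1) and sizes that do not fit a positive Long.
    If lSize < 1 Or lHigh <> 0 Then
        CloseHandle hFile
        hFile = INVALID_HANDLE_VALUE
        Exit Function
    End If
    TryOpenForRead = True
End Function
```

Only when this returns True would the code go on to ReDim the buffer to lSize bytes and call ReadFile.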
On the other hand, I can't recall whether the code creates a file or not. If it does create a file, then make sure the project is located in a folder that you have write access to (Vista & 7 aren't as "nice" as XP is about file permissions).
I noticed if I create the file "C:\Test.txt" and leave it empty then attempt to use your code I receive a message box saying:
File too big: 0.000 gigabytes. Shouldn't that be file too small?
Originally Posted by Aaron02
I manually created the file:
Code:
Const TESTFILE = "C:\Test.txt"
It contains 4 bytes.
Same error.
The app didn't create it.
Did you write something in the file then save it?
When you quote a post, could you please do it via the "Reply With Quote" button, or if it is multiple posts, click the "''+" button and then the "Reply With Quote" button.
Aaron02: now that I had time to download the sample I noticed that the referred code is jmacp's original code and you get "subscript out of range" error if the file is not there. My code has a check for valid handle and it tells it could not find the file, so clicking the second button first should tell you this.
Nightwalker83: the bug is there but considering the nature of the sample it shouldn't matter that much: it is quite a small change to fix the problem and a rewrite is required for use in other purposes.
How do you get it to find the last matching entry? (It currently finds only the first instance.)
I.e. to show the last GetAccountId for "30180AV", which should be 0001769011GjRH9BkX58roI7eAmCurl6G6q7C5yYzJ6lwS21oJ.
The only way I can think of is loading the text file into a string array, reversing the string array with another loop,
and then searching.
But there must be an easier way.
Thanks
I have a large txt file with a layout similar to this, plus other stuff in it:
2012-09-25 15:01:03,421 INFO - 25 September 2012 15:01:03.421 +01:00 : [0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0] POST - GetAccountId(0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0) returning "19678JU"
2012-09-25 15:02:47,093 INFO - 25 September 2012 15:02:47.093 +01:00 : [0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0] POST - GetAccountId(0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0) returning "19678JU"
2012-09-25 15:02:53,468 INFO - 25 September 2012 15:02:53.468 +01:00 : [0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0] POST - GetAccountId(0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0) returning "19678JU"
2012-09-25 15:03:00,250 INFO - 25 September 2012 15:03:00.250 +01:00 : [0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0] POST - GetAccountId(0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0) returning "19678JU"
2012-09-25 15:03:27,656 INFO - 25 September 2012 15:03:27.656 +01:00 : [0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0] POST - GetAccountId(0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0) returning "19678JU"
2012-09-25 15:05:17,265 INFO - 25 September 2012 15:05:17.265 +01:00 : [0001769003uH4h8ZWtQEur2kyZbBeaRA10W0FC8L1Da6ifaAB8] POST - GetAccountId(0001769003uH4h8ZWtQEur2kyZbBeaRA10W0FC8L1Da6ifaAB8) returning "30180AV"
2012-09-25 15:12:50,734 INFO - 25 September 2012 15:12:50.734 +01:00 : [0001769003uH4h8ZWtQEur2kyZbBeaRA10W0FC8L1Da6ifaAB8] POST - GetAccountId(0001769003uH4h8ZWtQEur2kyZbBeaRA10W0FC8L1Da6ifaAB8) returning "30180AV"
2012-09-25 15:31:06,703 INFO - 25 September 2012 15:31:06.703 +01:00 : [0001769011GjRH9BkX58roI7eAmCurl6G6q7C5yYzJ6lwS21oJ] POST - GetAccountId(0001769011GjRH9BkX58roI7eAmCurl6G6q7C5yYzJ6lwS21oJ) returning "30180AV"
2012-09-25 15:31:57,250 INFO - 25 September 2012 15:31:57.250 +01:00 : [0001769013Lpp8tvJXEadblzvvdyLy1CMWHQcV7fduWVjHGz4n] POST - GetAccountId(0001769013Lpp8tvJXEadblzvvdyLy1CMWHQcV7fduWVjHGz4n) returning "30180AV"
2012-09-25 15:34:35,593 INFO - 25 September 2012 15:34:35.593 +01:00 : [0001769009A3cuBbG20shtUgPv8y5uBBgTbfCc8k19BedVN1dL] POST - GetAccountId(0001769009A3cuBbG20shtUgPv8y5uBBgTbfCc8k19BedVN1dL) returning "91896YY"
2012-09-25 15:34:41,828 INFO - 25 September 2012 15:34:41.828 +01:00 : [0001769009A3cuBbG20shtUgPv8y5uBBgTbfCc8k19BedVN1dL] POST - GetAccountId(0001769009A3cuBbG20shtUgPv8y5uBBgTbfCc8k19BedVN1dL) returning "91896YY"