-
Feb 7th, 2010, 03:32 PM
#1
Thread Starter
New Member
Please help - Seach 100000 txt files very fast
I'm new to this forum, so please correct me if I make any mistakes (I have been programming in VB for at least five years).
I have a problem with txt files. I need to search over 100000 txt files for two exact words, and it has to be very fast, because the first word can be any of 45000 different words (taken from an mdb file) and the second word can be any of 200000 different words (also taken from an mdb file).
If a txt file contains both words, the program should move the txt file to a different folder.
I have been looking at http://www.vbforums.com/showthread.php?p=2316263 but those approaches also give me a hit when the file contains only part of the word the program is searching for. For example, if one of the words from the mdb file is IB, the program also finds GOTLIB, because IB is part of that word. The program should only respond if the word is exactly IB. That is where I'm stuck and I can't figure out what to do, so please help.
-
Feb 8th, 2010, 02:45 AM
#2
Re: Please help - Seach 100000 txt files very fast
Put a space before and after the search string; then you only get whole words.
On its own that fails to find the first or last word of the file, so also add a space before and after the string being searched.
It will also fail to find words followed by any punctuation:
" stop " will not match " stop. "
It may be faster to check for the search string both without and with common punctuation than to strip the punctuation from the string being searched.
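The padding idea above can be sketched in VB6 roughly as follows. This is only a minimal sketch: the function name HasWholeWord and the list of punctuation characters are assumptions made for illustration, and line breaks are turned into spaces so that a word at the end of a line still counts.

```vb
' Hypothetical helper: True when Word occurs in Text as a whole word.
Public Function HasWholeWord(ByVal Text As String, ByVal Word As String) As Boolean
    Dim padded As String
    Dim tails As Variant
    Dim i As Long

    If Len(Word) = 0 Then Exit Function

    ' Pad the text so the first and last words also have a space on each side,
    ' and treat line breaks as spaces.
    padded = " " & Replace(Replace(Text, vbCr, " "), vbLf, " ") & " "

    ' Characters allowed directly after the word (assumed list, extend as needed).
    tails = Array(" ", ".", ",", ";", ":", "!", "?")

    For i = LBound(tails) To UBound(tails)
        If InStr(1, padded, " " & Word & tails(i), vbBinaryCompare) > 0 Then
            HasWholeWord = True
            Exit Function   ' no need to try the other variants after a hit
        End If
    Next i
End Function
```

With this sketch, HasWholeWord("GOTLIB is here", "IB") returns False, while HasWholeWord("the IB.", "IB") returns True. Punctuation in front of the word (an opening quote, for example) is not handled.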
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case.
Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
-
Feb 8th, 2010, 05:01 AM
#3
Thread Starter
New Member
Re: Please help - Seach 100000 txt files very fast
Thx westconn1 for your reply.
I just tried this, and it works fine. The big problem is that my program takes 1 hr. to execute, so I was wondering if you have any ideas on how to improve the time.
-
Feb 8th, 2010, 05:41 AM
#4
Re: Please help - Seach 100000 txt files very fast
Well, you never need to check the other variants once the first result has been found.
How long did it take before?
If you post the code you are using now, maybe someone will pick up on something that will improve the speed.
-
Feb 8th, 2010, 06:22 AM
#5
PowerPoster
Re: Please help - Seach 100000 txt files very fast
I can improve your speed considerably.
The problem is you're loading each text file every time you do the searches. Loading the file and performing the search each time is why it is so slow.
Set up two two-dimensional arrays and perform the search only once, checking each file for each word only once. If a word is found, index it in the array as a hit, then when all that is done just check the two arrays to see if both words are found. It'll greatly improve the speed at which you perform the checks.
If the words have to be consecutive in the text, you can then check successful hits to see if they're consecutive in the text file.
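A rough VB6 sketch of that bookkeeping follows. It is a simplification rather than the exact layout described above: it keeps one Boolean flag per file for each word list instead of a full two-dimensional file-by-word table, which would be very large at 100000 files. All names (FlagFiles, ContainsAny, ReadWholeFile, files, wordsA, wordsB) are hypothetical, and words are assumed to be separated by plain spaces.

```vb
Option Explicit

' Hypothetical sketch: files(), wordsA() and wordsB() are assumed to be filled
' elsewhere (for example from a Dir loop and the two mdb tables).
Public Sub FlagFiles(files() As String, wordsA() As String, wordsB() As String, _
                     hitA() As Boolean, hitB() As Boolean)
    Dim f As Long
    Dim body As String

    ReDim hitA(LBound(files) To UBound(files))
    ReDim hitB(LBound(files) To UBound(files))

    For f = LBound(files) To UBound(files)
        ' Read the file once and pad it so the first and last words match too.
        body = " " & ReadWholeFile(files(f)) & " "
        hitA(f) = ContainsAny(body, wordsA)
        hitB(f) = ContainsAny(body, wordsB)
        ' A file qualifies for the move when hitA(f) and hitB(f) are both True.
    Next f
End Sub

' True as soon as one word from the list is found as a space-delimited word.
Private Function ContainsAny(body As String, words() As String) As Boolean
    Dim w As Long
    For w = LBound(words) To UBound(words)
        If InStr(1, body, " " & words(w) & " ", vbBinaryCompare) > 0 Then
            ContainsAny = True
            Exit Function
        End If
    Next w
End Function

' Reads a whole file into a string with native VB file I/O.
Private Function ReadWholeFile(ByVal path As String) As String
    Dim fn As Integer
    Dim s As String
    fn = FreeFile
    Open path For Binary Access Read As #fn
    s = Space$(LOF(fn))
    Get #fn, , s
    Close #fn
    ReadWholeFile = s
End Function
```

The point of the structure is that each file is read from disk exactly once, however many words are checked against it.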
Well, everyone else has been doing it :-)
Loading a file into memory QUICKLY - Using SendKeys - HyperLabel - A highly customisable label replacement - Using resource files/DLLs with VB - Adding GZip to your projects
Expect more to come in future
If I have helped you, RATE ME! :-)
I love helping noobs with their VB problems (probably because, as an amateur programmer, I am only slightly better at VB than them :-)) but if you SERIOUSLY want to get help for free from a community such as VBForums, you have to first have a grounding (basic knowledge) in VB6, otherwise you're way too much work to help...You've got to give a little if you want to get help from us, in other words!
And we DON'T do your homework. If your tutor doesn't teach you enough to help you make the project without his or her help, FIND A BETTER TUTOR or try reading books on programming! We are happy to help with minor things regarding the project, but you have to understand the rest of it if you want our help to be useful.
-
Feb 8th, 2010, 06:25 AM
#6
Thread Starter
New Member
Re: Please help - Seach 100000 txt files very fast
thx for your reply
I am using the following code to search for a word in a txt file. The words come from two separate databases.
There are two functions, but they are alike, so I'm only posting one of them. Some public declarations are written under Option Explicit, so if a variable is not declared in the sub or the function, please don't comment on it; the code works, it is just slow.
Command1 searches just one file, and I loop it over all the files, so that I can search over 100000 files.
Note
I have adjusted the names of the dbs and txt files.
Code:
Private Sub Command1_Click()
    Dim Gnosis As Database
    Dim GnosisRS As DAO.Recordset
    Dim i As Long
    Dim i1 As Integer
    Dim j As Integer
    Dim a1 As Integer
    Dim b1 As String
    Dim b2 As String
    Dim b3 As String

    ' First word list: the file is opened and searched once per record
    Set Gnosis = OpenDatabase("c:\DB1.mdb")
    Set GnosisRS = Gnosis.OpenRecordset("Table1", dbOpenTable)
    GnosisRS.MoveFirst
    For i = 0 To GnosisRS.RecordCount - 1 Step 1
        Hn = GnosisRS.Fields(2).Value
        API1_V "c:\3076.txt", Hn
        GnosisRS.MoveNext
    Next
    GnosisRS.Close
    Gnosis.Close
    Set Gnosis = Nothing

    ' Second word list: same pattern against the second database
    Set Gnosis = OpenDatabase("c:\DB2.mdb")
    Set GnosisRS = Gnosis.OpenRecordset("Table2", dbOpenTable)
    GnosisRS.MoveFirst
    For i = 0 To GnosisRS.RecordCount - 1 Step 1
        Sb = GnosisRS.Fields(1).Value
        API2_V "c:\3076.txt", Sb
        GnosisRS.MoveNext
    Next
    GnosisRS.Close
    Gnosis.Close
    Set Gnosis = Nothing

    Text1.Text = SH
    Text2.Text = SS
End Sub

Public Function API1_V(ByRef Filename As String, ByVal Find As String) As String
    Dim lngFile As Long
    Dim lngPtr As Long
    Dim lngRead As Long
    Dim curSize As Currency64
    Dim lngSize As Long64
    Dim a1 As Double
    Dim i As Double

    ' Convert the search word to ANSI bytes to match the raw file contents
    Find = StrConv(Find, vbFromUnicode)
    lngFile = CreateFileW(StrPtr(Filename), GENERIC_READ, FILE_SHARE_READ Or FILE_SHARE_WRITE, ByVal 0&, OPEN_EXISTING, 0, 0)
    If lngFile <> INVALID_HANDLE_VALUE Then
        lngSize.Low = GetFileSize(lngFile, lngSize.High)
        If lngSize.High = 0 And lngSize.Low > 0 Then
            If SetFilePointer(lngFile, 0, 0, FILE_BEGIN) <> INVALID_SET_FILE_POINTER Then
                lngPtr = SysAllocStringByteLen(0, lngSize.Low)
                If lngPtr Then
                    ' Make the function's return string point at the new buffer
                    PutMem4 ByVal VarPtr(API1_V), ByVal lngPtr
                    If ReadFile(lngFile, ByVal lngPtr, lngSize.Low, lngRead, ByVal 0&) <> 0 Then
                        ' Byte-wise substring search; this also matches partial words
                        a1 = InStrB(API1_V, Find)
                        If a1 > 0 Then
                            SH = SH + 1
                            Print Hn
                        End If
                    End If
                End If
            Else
                ' SetFilePointer failed: nothing to search
            End If
        Else
            LSet curSize = lngSize
        End If
        CloseHandle lngFile
    End If
End Function
Last edited by si_the_geek; Feb 8th, 2010 at 08:35 AM.
Reason: added code tags
-
Feb 8th, 2010, 06:27 AM
#7
Thread Starter
New Member
Re: Please help - Seach 100000 txt files very fast
Hi smUX,
How can I load all the files into one big txt file and, if I get a hit, split it up again in the program to see which txt files the hit is in?
-
Feb 8th, 2010, 06:33 AM
#8
PowerPoster
Re: Please help - Seach 100000 txt files very fast
That would also be doable, but I'm not sure if it would make a great deal of difference to the speed...it might make some though.
You would have to load each file and append it to the string then store the length of the string in an array. +1 to that value would be the start of the next string so it would be a simple matter to work out which file the found word is in and you wouldn't need to split to find the result.
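The lookup from a hit position back to a file can be sketched like this, assuming an array that holds the running length of the combined string after each file was appended. The names FileIndexAt and ends are hypothetical. Because the stored lengths only ever grow, a binary search over them works.

```vb
' Hypothetical helper: ends(i) holds the length of the combined string after
' file i was appended, so file i covers positions ends(i - 1) + 1 To ends(i).
' pos is a 1-based position as returned by InStr.
Public Function FileIndexAt(ends() As Long, ByVal pos As Long) As Long
    Dim lo As Long
    Dim hi As Long
    Dim m As Long

    lo = LBound(ends)
    hi = UBound(ends)

    ' Find the first entry whose running length reaches pos.
    Do While lo < hi
        m = lo + (hi - lo) \ 2
        If ends(m) >= pos Then
            hi = m
        Else
            lo = m + 1
        End If
    Loop

    FileIndexAt = lo
End Function
```

For example, with three files of 5, 3 and 4 characters in a zero-based array, ends holds 5, 8 and 12, and a hit at position 6 gives index 1, the second file. One thing to watch for with this approach: without a separator between the appended files, a match could straddle the end of one file and the start of the next.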
-
Feb 8th, 2010, 06:46 AM
#9
Re: Please help - Seach 100000 txt files very fast
Maybe it would be quicker if you open the file first and then loop through all the words in the database, rather than opening the same file for every word in the database.
Also make sure to test whether this code works if the user is not an administrator.
-
Feb 8th, 2010, 06:58 AM
#10
Thread Starter
New Member
Re: Please help - Seach 100000 txt files very fast
Thx. for your replies.
Good point about not opening the file each time.
I will try to save all the files in one big file, and just open it once, and see what the results will be. Thx.
-
Feb 8th, 2010, 03:14 PM
#11
Re: Please help - Seach 100000 txt files very fast
Opening and combining all the files may be slower than searching the files separately, but I believe that opening each file in a separate procedure for every word in the database would be an absolute killer.
All you can do is speed test each option.
-
Feb 8th, 2010, 05:06 PM
#12
PowerPoster
Re: Please help - Seach 100000 txt files very fast
I'd write some test code for myself if I had 100k text files and a list of words to check against them, but it's too much work otherwise :-)
-
Feb 8th, 2010, 07:24 PM
#13
Re: Please help - Seach 100000 txt files very fast
One file would be enough. You can use the high resolution timer to measure the performance on one file, make changes and compare. There is no reason to process everything each time. It is also good to locate the parts of the code that run slowly.
For seeing the time it takes for code to run see this post.
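A minimal sketch of such a timer in VB6, using the Windows high resolution counter, could look like the following. The function name TimerSeconds is hypothetical. Declaring the 64-bit API values as Currency is a common VB6 workaround, not something taken from this thread.

```vb
Option Explicit

Private Declare Function QueryPerformanceCounter Lib "kernel32" _
    (lpPerformanceCount As Currency) As Long
Private Declare Function QueryPerformanceFrequency Lib "kernel32" _
    (lpFrequency As Currency) As Long

' Returns the seconds elapsed since an arbitrary starting point.
Public Function TimerSeconds() As Double
    Dim ticks As Currency
    Dim freq As Currency

    QueryPerformanceFrequency freq
    QueryPerformanceCounter ticks

    ' Currency scales both 64-bit values by 10000, so the factor cancels out.
    If freq <> 0 Then TimerSeconds = ticks / freq
End Function

' Usage: take a reading before and after the code under test.
Public Sub TimeOneFile()
    Dim t As Double
    t = TimerSeconds
    ' ... call the search for one file here ...
    Debug.Print "Elapsed seconds: "; TimerSeconds - t
End Sub
```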
The first thing that would be important to achieve is the correct behavior of the code. I guess we do need a sample text file so we can see where we stand with it. Later on, as we are dealing with words, there is one other important thing: sorting before searching makes lookups much faster, and an even later optimization would be binary tree search (or whatever it was; I haven't done it in practice myself, but it finds a matching string among a million strings in just a few steps). These together would probably bring it down to a few seconds at worst. But before we can go there we need to have the correct results first.
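As an illustration of the "very few steps" point, here is a sketch of a plain binary search over a sorted string array. This may not be exactly the technique the post has in mind. The name SortedContains is hypothetical, and the array is assumed to be sorted already with the same binary comparison that StrComp uses here.

```vb
' Hypothetical helper: sortedWords() must be sorted in ascending order using
' binary (case-sensitive) comparison.
Public Function SortedContains(sortedWords() As String, ByVal target As String) As Boolean
    Dim lo As Long
    Dim hi As Long
    Dim m As Long

    lo = LBound(sortedWords)
    hi = UBound(sortedWords)

    Do While lo <= hi
        m = lo + (hi - lo) \ 2
        Select Case StrComp(sortedWords(m), target, vbBinaryCompare)
            Case 0
                SortedContains = True
                Exit Function
            Case Is < 0
                lo = m + 1      ' target sorts after the middle entry
            Case Else
                hi = m - 1      ' target sorts before the middle entry
        End Select
    Loop
End Function
```

Each pass halves the remaining range, so a list of 200000 words needs at most about 18 comparisons per lookup. The text file would be split into words and each word looked up, instead of scanning the file once per database word.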
-
Feb 8th, 2010, 09:38 PM
#14
Re: Please help - Seach 100000 txt files very fast
 Originally Posted by Merri
Later on, as we are dealing with words, there is one other important thing: sorting before searching makes lookups much faster, and an even later optimization would be binary tree search (or whatever it was; I haven't done it in practice myself, but it finds a matching string among a million strings in just a few steps). These together would probably bring it down to a few seconds at worst. But before we can go there we need to have the correct results first.
At that point I would just create a giant word table like how vBulletin does for searching posts. I'm guessing it's set up with 1 record for each unique word in each file, then you could just search that word table using indexed seeks. Lightning fast for a good database. (Access is not a good database but it could conceivably muddle through well enough with DAO seeks. SQL Server or FoxPro would be super fast.)
The question is, how volatile are the 100,000 files? Do they change a lot? If they are a static library, creating the word lookup would absolutely be the best bet.
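A DAO indexed seek against such a word table might look roughly like this. Everything named here is an assumption for illustration: the database c:\Index.mdb, the table FileWords with fields Word and FileName, and the index idxWord on the Word field. The word table itself would have to be built beforehand, with one record per unique word per file.

```vb
' Hypothetical names: c:\Index.mdb, table FileWords(Word, FileName),
' index "idxWord" on the Word field. Requires a reference to DAO.
Public Sub ListFilesWithWord(ByVal target As String)
    Dim db As DAO.Database
    Dim rs As DAO.Recordset

    Set db = OpenDatabase("c:\Index.mdb")
    ' Seek only works on a table-type recordset with a current index.
    Set rs = db.OpenRecordset("FileWords", dbOpenTable)
    rs.Index = "idxWord"
    rs.Seek "=", target

    If Not rs.NoMatch Then
        ' The index keeps equal words together, so walk forward until the word changes.
        Do Until rs.EOF
            If StrComp(rs!Word, target, vbTextCompare) <> 0 Then Exit Do
            Debug.Print rs!FileName
            rs.MoveNext
        Loop
    End If

    rs.Close
    db.Close
End Sub
```

Because the table stores whole words, a seek for IB cannot match GOTLIB, which also takes care of the partial-word problem from the first post.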
-
Feb 9th, 2010, 04:37 AM
#15
PowerPoster
Re: Please help - Seach 100000 txt files very fast
 Originally Posted by Ellis Dee
At that point I would just create a giant word table like how vBulletin does for searching posts.
Which is basically what I suggested in post #5, but word lookup only using the words in the two lists rather than the words in all the text files like vBulletin would do.
-
Feb 9th, 2010, 02:41 PM
#16
Re: Please help - Seach 100000 txt files very fast
 Originally Posted by smUX
Which is basically what I suggested in post #5, but word lookup only using the words in the two lists rather than the words in all the text files like vBulletin would do.
Post 5 talks of arrays, which implies running the check each time the program runs. A database is a different animal in that it persists over time.
-
Feb 9th, 2010, 02:56 PM
#17
PowerPoster
Re: Please help - Seach 100000 txt files very fast
 Originally Posted by Ellis Dee
Post 5 talks of arrays, which implies running the check each time the program runs. A database is a different animal in that it persists over time.
"Set up two two-dimensional arrays and perform the search only once, checking each file for each word only once. If a word is found, index it in the array as a hit, then when all that is done just check the two arrays to see if both words are found. It'll greatly improve the speed at which you perform the checks."
Sure about that?
Also an array being different to a database is also total bull...a database and an array both hold data, one in memory (faster) and one in memory or hard drive (hard drive being slower)...other than that, if the array is done right, there's no discernible difference between an array and a database with regards to this apart from the fact that databases can be SQLed and such to get data out quicker...but it's a bit like using a sledgehammer to crack open a walnut
-
Feb 9th, 2010, 03:50 PM
#18
Re: Please help - Seach 100000 txt files very fast
 Originally Posted by smUX
Sure about that?
Yes, very much so.
"Also an array being different to a database is also total bull...a database and an array both hold data, one in memory (faster) and one in memory or hard drive (hard drive being slower)"
There is nothing "bull...." about it. It is exactly as I described, and also exactly as you yourself described right here. Databases persist over time, while arrays have to be recreated every time the program runs. Let's say it ends up taking an hour and a half to create the array or database. Your array solution means that the user has to sit through an hour and a half prep time every morning while the array gets built. The database solution, on the other hand, only ever has to do it once. Surely you understand that is not a "bull...." difference, right?
"...other than that, if the array is done right, there's no discernible difference between an array and a database with regards to this apart from the fact that databases can be SQLed and such to get data out quicker...but it's a bit like using a sledgehammer to crack open a walnut"
Uh, no, it's absolutely not like using a sledgehammer to crack a walnut. There is no conceivable way that using SQL is in any way, shape or form overcoding for a problem like this. On the contrary, this is exactly the kind of thing SQL is made for.