-
I'm starting the process of creating a CDROM distribution and will be writing the systems in VB.
One of the functions of the CD should be free-text searching of the 10,000 (or so) documents stored on it (HTML or text).
I'm pretty comfortable with our Website implementation of this as I'll be using ColdFusion with Verity as the indexing software. Works a treat. Trust me.
Anyone have any fine ideas on products that'll allow me to implement the same speedy search results / indexing on the CDROM using a VB program?
I've seen a few products on the market but at prices of $6000+ (with extra licensing costs too! Marvellous...)
As this CD is likely to be for fairly limited distribution it's simply not worth the cost. Anyone seen any MUCH cheaper products that'll do the job?
-
Why don't you just write a search algorithm yourself?
First go though the drive and record all filenames + paths in an array.
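Then go through the array and read each entire file into a buffer:
Code:
Dim Buffer As String
Open file(i) For Binary Access Read As #1
'Get only fills a string up to its current length, so size it to the file first
Buffer = Space$(LOF(1))
Get #1, , Buffer
If (InStr(Buffer, SomeWord) <> 0) Then
    'Word was found in that file ...
End If
Close #1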
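For the first step (recording all filenames + paths), here is a minimal sketch of one way to walk a folder tree in VB6. It is only an illustration: CollectFiles and the "D:\docs\" path are made-up names, and it fills a Collection rather than the array mentioned above, simply because a Collection grows without ReDim.

```vb
Option Explicit

' Hypothetical helper: adds the full path of every file under startDir
' (which must end in "\") to the files collection.
Private Sub CollectFiles(ByVal startDir As String, ByVal files As Collection)
    Dim entry As String
    Dim subDirs As New Collection
    Dim subDir As Variant

    entry = Dir$(startDir & "*.*", vbDirectory Or vbReadOnly Or vbHidden)
    Do While Len(entry) > 0
        If entry <> "." And entry <> ".." Then
            If (GetAttr(startDir & entry) And vbDirectory) = vbDirectory Then
                subDirs.Add startDir & entry & "\"
            Else
                files.Add startDir & entry
            End If
        End If
        entry = Dir$   ' next entry in the same folder
    Loop

    ' Dir$ keeps a single search state, so recurse only after the loop is done.
    For Each subDir In subDirs
        CollectFiles CStr(subDir), files
    Next subDir
End Sub

Public Sub Main()
    Dim files As New Collection
    CollectFiles "D:\docs\", files   ' placeholder path, not from the thread
    Debug.Print files.Count & " files found"
End Sub
```

The subfolders are queued and visited after the loop because calling Dir$ with a new pattern in the middle of a listing would reset the listing that is in progress.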
- jamie
-
Well, another thing you could do is list all the files and put them into an array. Then, kind of like below:
Code:
Dim var_array() As String
Dim Buffer As String
Open file(i) For Binary Access Read As #1
Buffer = Space$(LOF(1)) 'size the buffer so Get reads the whole file
Get #1, , Buffer
var_array = Split(Buffer, " ")
Close #1
The above piece of code would then put every word (a word being any string separated by a space), into an array. Then you could iterate through the array using a loop, and add the word to a dictionary.
The dictionary object is quite cool :)
It can add about 4500 words to a dictionary in less than 2 seconds (well on this P-III 650 anyway).
You would do it something like :
Code:
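'Needs a project reference to "Microsoft Scripting Runtime"
Private d As New Dictionary
Private Sub Add(x As String, y As String)
    d.Add x, y
End Sub
Private Sub SomeFunction(var_array() As String, filename_it_was_found_in As String)
    Dim i As Long
    For i = 0 To UBound(var_array)
        'Only the first file found for each word is stored here
        If (d.Exists(var_array(i)) = False) Then
            Add var_array(i), filename_it_was_found_in
        End If
    Next i
End Sub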
You could store the word list then on the cdrom somewhere (encrypted I'd say), and then load at runtime. In one of my apps, I load 4300 entries into a dictionary at runtime. Takes less than a second.
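As a rough sketch of the "store it on the CD, load it at runtime" idea: the two routines below write a dictionary out as plain text and read it back. SaveIndex and LoadIndex are hypothetical names, the one-line-per-word tab-separated layout is just an assumption (it requires that no word contains a tab), and no encryption is attempted. The dictionary is created with CreateObject so the sketch works without a project reference.

```vb
Option Explicit

' Hypothetical helper: writes one "word<TAB>value" line per dictionary entry.
Public Sub SaveIndex(ByVal d As Object, ByVal path As String)
    Dim f As Integer
    Dim k As Variant

    f = FreeFile
    Open path For Output As #f
    For Each k In d.Keys
        Print #f, k & vbTab & d(k)
    Next k
    Close #f
End Sub

' Hypothetical helper: reads the file back into a new Scripting.Dictionary.
Public Function LoadIndex(ByVal path As String) As Object
    Dim d As Object
    Dim f As Integer
    Dim textLine As String
    Dim p As Long

    Set d = CreateObject("Scripting.Dictionary")
    f = FreeFile
    Open path For Input As #f
    Do While Not EOF(f)
        Line Input #f, textLine
        p = InStr(textLine, vbTab)
        ' Lines without a tab are skipped rather than guessed at.
        If p > 0 Then d(Left$(textLine, p - 1)) = Mid$(textLine, p + 1)
    Loop
    Close #f
    Set LoadIndex = d
End Function
```

SaveIndex would run once on the machine that builds the CD; only LoadIndex needs to ship in the program on the disc.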
- jamie
-
Hey, smart!
Yep, I like that one!
I was worried about access times more than anything, but judging by your example timings, maybe I shouldn't be!
...Checked out dictionary object & you're right! It's a good one...
I guess maybe I could add another dimension for the number of times the word appears in the file (increment on each find) and I'd have the basis of a "weighted" search as well.
(Maybe that rules out the dictionary object, as it needs 3 dimensions - but I'll keep thinking!)
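One possible way to get the "third dimension" without giving up the dictionary is to nest them: the outer dictionary maps a word to an inner dictionary, and the inner one maps a filename to a count. This is only a sketch under that assumption; AddOccurrence and the module-level index variable are made-up names.

```vb
Option Explicit

' word -> (filename -> count); both levels are Scripting.Dictionary objects.
Private index As Object

' Hypothetical helper: records one occurrence of word in fileName.
Public Sub AddOccurrence(ByVal word As String, ByVal fileName As String)
    Dim perFile As Object

    If index Is Nothing Then Set index = CreateObject("Scripting.Dictionary")

    If index.Exists(word) Then
        Set perFile = index(word)
    Else
        Set perFile = CreateObject("Scripting.Dictionary")
        index.Add word, perFile
    End If

    ' Reading a missing key through Item creates it as Empty, and Empty + 1 = 1.
    perFile(fileName) = perFile(fileName) + 1
End Sub

Public Sub Main()
    AddOccurrence "search", "a.htm"
    AddOccurrence "search", "a.htm"
    AddOccurrence "search", "b.htm"
    Debug.Print index("search")("a.htm"), index("search")("b.htm")
End Sub
```

Each word only carries entries for the files it actually appears in, so nothing is allocated for word/file pairs that never occur.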
Thanks very much for the help on this!!
Cheers, Shaun.
-
Well,
you could do something like :
Code:
Add word(i), "file1;file2;file3"
Separate the files that the word appears in with a semi-colon, and then use the Split function later:
Code:
Dim var_array() As String
Dim j As Long
var_array = Split(d.Item(word(i)), ";")
For j = 0 To UBound(var_array)
    Debug.Print "Word : " & word(i) & " appears in : " & var_array(j)
Next j
It's been a while since I've coded with the dictionary object, so treat the above as something along those lines rather than tested code.
In relation to a multidimensional dynamic array, I'd avoid it. They just eat memory.
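To build up the semi-colon separated list as files are parsed, something like the following might do. It is a sketch only: AddWordFile is a made-up name, d is assumed to be a Scripting.Dictionary, and filenames are assumed not to contain a semi-colon.

```vb
Option Explicit

' Hypothetical helper: d maps word -> "file1;file2;...".
Public Sub AddWordFile(ByVal d As Object, ByVal word As String, ByVal fileName As String)
    If Not d.Exists(word) Then
        ' First file seen for this word.
        d.Add word, fileName
    ElseIf InStr(";" & d(word) & ";", ";" & fileName & ";") = 0 Then
        ' Word already known: append the file unless it is already listed.
        d(word) = d(word) & ";" & fileName
    End If
End Sub

Public Sub Main()
    Dim d As Object
    Set d = CreateObject("Scripting.Dictionary")
    AddWordFile d, "search", "a.htm"
    AddWordFile d, "search", "b.htm"
    AddWordFile d, "search", "a.htm"   ' duplicate, ignored
    Debug.Print d("search")
End Sub
```

Wrapping both strings in semi-colons before the InStr test stops "a.htm" from matching inside a longer name such as "data.htm".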
- jamie
-
Yep! and Yep! again!
Even better, you're right!!
Thanks again! :)
Shaun.
-
I dunno if you know this, but when you resize a dynamic multidimensional array with ReDim Preserve you can only change the last dimension, and the change applies to the whole array.
For example:
There is a 2-dimensional array called var_array()
i.e. var_array(x, y)
Let's say the array has been dimensioned so that
var_array(99, 99) is the upper bound of the entire array.
So at the moment, that's 99*99 = 9801 array elements (counting indexes from 1, as with Option Base 1; with the default lower bound of 0 it is 100*100 = 10000).
Each array element will take up a minimum of 1 byte of memory (the Byte data type uses the least memory).
Then you do:
ReDim Preserve var_array(99, 128)
The total number of elements is now: 128*99 = 12672 (or 129*100 = 12900 with a lower bound of 0).
One might think that only index 99 of the first dimension now has 128 elements and the others still have 99, but the new size applies to every index of the first dimension.
So if you were to use more than 2 dimensions, you're just wasting memory big-time.
So I use them sparingly.
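A small sketch to check those numbers, using explicit 1-based bounds so the element counts match the figures above (the bounds style is my assumption, not something from the example):

```vb
Option Explicit

Public Sub Main()
    Dim var_array() As Byte

    ReDim var_array(1 To 99, 1 To 99)
    Debug.Print UBound(var_array, 1) * UBound(var_array, 2)   ' 9801

    ' Only the last dimension may change when Preserve is used,
    ' and every index of the first dimension gets the new size.
    ReDim Preserve var_array(1 To 99, 1 To 128)
    Debug.Print UBound(var_array, 1) * UBound(var_array, 2)   ' 12672

    ' Changing the first dimension with Preserve fails at run time
    ' (error 9, "Subscript out of range"):
    ' ReDim Preserve var_array(1 To 120, 1 To 128)
End Sub
```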
- jamie
-
1/2 way there...
...OK,OK...proof that this forum is cool...
Largely, thanks to the advice given, I'm halfway there!
:D
Created a database indexed on all of the words, all parsed nice and neat, along with the files they appear in and the number of times...Been testing the results and they seem accurate.
It's pretty slow to do the indexing (there's quite a few parsing algorithms needed) but as long as retrieval is quick I don't care! :)
Now to the retrieval program...
Thanks again & I'll be avoiding the arrays too!
-
Well you only have to do the parsing once.
So I'd spend as much time on it as possible, and have every word etc. indexed.
Then in future you just look up the index.
But you know that bit already :)
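For the lookup side, here is a hedged sketch of ranking the files for one word by how often the word appears. It assumes an in-memory index shaped as word -> (filename -> count) using Scripting.Dictionary objects, which is not necessarily how the database in this thread is laid out; RankFiles is a made-up name.

```vb
Option Explicit

' Hypothetical helper: returns the filenames for word, highest count first.
' index is assumed to map word -> Scripting.Dictionary of filename -> count.
Public Function RankFiles(ByVal index As Object, ByVal word As String) As Variant
    Dim perFile As Object
    Dim names As Variant
    Dim tmp As Variant
    Dim i As Long, j As Long

    If Not index.Exists(word) Then
        RankFiles = Array()   ' empty result for an unknown word
        Exit Function
    End If

    Set perFile = index(word)
    names = perFile.Keys      ' zero-based Variant array of filenames

    ' Plain insertion sort, descending by count.
    For i = 1 To UBound(names)
        tmp = names(i)
        j = i - 1
        Do While j >= 0
            If perFile(names(j)) >= perFile(tmp) Then Exit Do
            names(j + 1) = names(j)
            j = j - 1
        Loop
        names(j + 1) = tmp
    Next i

    RankFiles = names
End Function
```

An insertion sort is enough here because the list being sorted is only the files for a single word, which should normally be short compared with the whole index.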
- jamie