@Olaf
On my machine it takes 12 sec for binary search with auto sorting, 20 sec if I use the old one that compares only equal-length words, and 90 sec if I use binary search plus a quicksort for every entry.
As I said before, there are routines that do much better, and yours is very good, but my task here is not to settle for those solutions, which are a black box, but to explore approaches whose inner workings I can understand and share with others. You share ways to do things, which is good, but you didn't convey the deep knowledge.
I think I have to use a hash table. I have never made one, but I will try.
Merry Christmas
George
Last edited by georgekar; Dec 21st, 2014 at 08:09 PM.
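For anyone who wants to attempt the same exercise, a hand-rolled hash table for word counting can be sketched roughly like this: a minimal separate-chaining table where the sizes, names, and hash function are all my own illustrative choices, not George's eventual code.

```vb
' Sketch of a tiny hand-rolled hash table for word counting.
' Separate chaining via parallel arrays; no growth/overflow handling.
Private Const BUCKETS As Long = 16384

Private hashKey(BUCKETS * 4) As String   ' stored words (slot 0 reserved)
Private hashCnt(BUCKETS * 4) As Long     ' occurrence counts
Private hashNext(BUCKETS * 4) As Long    ' chain links (0 = end of chain)
Private bucketHead(BUCKETS - 1) As Long  ' 0 = empty bucket
Private slotsUsed As Long

Private Function HashOf(s As String) As Long
    Dim i As Long, h As Long
    For i = 1 To Len(s)
        h = ((h * 31) And &HFFFFF) + Asc(Mid$(s, i, 1))
    Next
    HashOf = h Mod BUCKETS
End Function

Private Sub AddWord(ByVal word As String)
    Dim h As Long, slot As Long
    word = LCase$(word)
    h = HashOf(word)
    slot = bucketHead(h)
    Do While slot <> 0                   ' walk the chain for this bucket
        If hashKey(slot) = word Then
            hashCnt(slot) = hashCnt(slot) + 1
            Exit Sub
        End If
        slot = hashNext(slot)
    Loop
    slotsUsed = slotsUsed + 1            ' new word: claim the next slot
    hashKey(slotsUsed) = word
    hashCnt(slotsUsed) = 1
    hashNext(slotsUsed) = bucketHead(h)  ' prepend to the bucket's chain
    bucketHead(h) = slotsUsed
End Sub
```

Note that, unlike the sorted approaches discussed in this thread, a table like this gives O(1) average lookups but no sorted output.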
@DataMiser,
Perhaps you will find that this:
? StrComp("Aaronites","Aaron's")
and this:
? StrComp("Aaronites","Aaron's", vbTextCompare)
give totally different output...
George, maybe you noticed that the "make it case-insensitive" flag of the cSortedDictionary is a different Enum type, named TextCompare instead of vbTextCompare.
So it is not surprising that you get results which are sorted a bit differently.
If you want to play around with all kinds of different sort and compare schemes, the cSortedDictionary supports that as well, via its SetStringCompareFlags method, which is quite versatile.
E.g. if you add a new small function to the CodeBase as:
Code:
Private Function ReSortDictionaryKeys(WD As cSortedDictionary, Flags As CmpFlags, Optional ByVal LCID As LangIDs) As cSortedDictionary
    Dim i As Long
    Set ReSortDictionaryKeys = New_c.SortedDictionary(, False)
    ReSortDictionaryKeys.SetStringCompareFlags Flags, LCID
    For i = 0 To WD.Count - 1
        ReSortDictionaryKeys.Add WD.KeyByIndex(i), WD.ItemByIndex(i)
    Next i
End Function
You could then apply it after the counting has gone through, by adding an additional line to the List-Click event:
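That additional line could look something like this (my own illustrative call site, using the ReSortDictionaryKeys signature above):

```vb
' Hypothetical extra line in the List-Click event, after counting:
Set WD = ReSortDictionaryKeys(WD, cmpIgnoreCase, English_United_States)
```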
That additional line will cause the performance to drop by ~10% (now about 0.37-0.38 seconds instead of the ~0.34 sec without it).
But those who want to sort the results by whatever LCID (or sub-sorting scheme) of their specific countries can now do so.
Intellisense for playing around with that is supported, and I can easily choose e.g.:
cSortedDictionary.SetStringCompareFlags cmpIgnoreCase, German_Germany, SORT_GERMAN_PHONE_BOOK
(in case I want the special treatment for German umlauts, according to German phone-book rules).
Olaf
Last edited by Schmidt; Dec 21st, 2014 at 09:55 PM.
Originally Posted by georgekar
As I said before there are routines to do much better,
There are some approaches which *could* do better, but not by a very high margin - and neither your word-splitter routine nor your current binary-sorting-whilst-adding approach (which is, BTW, also used within the cSortedDictionary) is currently up to the task.
Originally Posted by georgekar
...the task here for me is not to stay to that solutions, which are a black box, ...
That's kind of hypocritical, because if you apply that principle consistently, then you should also refrain from using the CompareStringW call in your code (which sits in the "black box" kernel32.dll). Please re-implement all of its functionality in your own procedures first, to understand it better.
Originally Posted by georgekar
You share the ways to do something, that is good, but you didn't give the deep knowledge.
That's something one could really take offense at - George, the amount of code I have shared with the community over the last two decades is immense. And if you have specific questions about something, you only need to ask (though perhaps better in another, dedicated thread).
And it's again not really decent behaviour from you that you didn't even *consider* looking at (or working with) the word-splitting routine I posted; instead you stubbornly insist on using your own, although it is slower by an order of magnitude.
For fast string routines in VB, you have to "avoid using Strings" - and as "oxymoronic" as that sounds, it is nevertheless true. Please look at the fastest-performing string routines at: http://www.xbeat.net/vbspeed/
See, I could even point you to some older postings of mine where I explained how to implement something like a sorted dictionary (with code) - but given your disregard of the code I *did* post here already, I'm not sure whether that would make any sense.
@Schmidt
Your example is very fast, but not much of the speed gain comes from using the SafeArray.
To prove that, please insert this line:
WD.SetStringCompareFlags cmpIgnoreCase, English_United_States, SORT_DEFAULT
After these lines
WordCountTotal = 0 'reset the WordTotal-Counter
WB(-2) = 0 'set the initial StrLenB to zero
Set WD = New_c.SortedDictionary(TextCompare)
..............
WD.SetStringCompareFlags cmpIgnoreCase, English_United_States, SORT_DEFAULT
in your original example. As you will see, it now needs 3 times longer to finish.
Can you describe what exactly the TextCompare option does in a SortedDictionary?
Last edited by georgekar; Dec 22nd, 2014 at 08:30 PM.
Trying to wade thru all these posts. Let me see if I got this right?
1) Both lists are sorted
2) You want to count how many times a word is repeated in List1
3) Then you want to add that word, with number of repeats, to List2, deleting the previous List2 entry (if one exists?)
If I got the above correct, I see one flaw in your original routine & correct me if I'm wrong:
Code:
If LCase(List2.List(i)) = LCase(List2.List(totListItms)) Then...
Since each item in List2 has an occurrence count appended to the word, are you only removing the entry if it has the same occurrence count as before? In other words, let's say List2 already has an entry: "The - 5 Occurrences", and after processing List1 you have "The - 3 Occurrences". What is supposed to happen? Is the previous entry to be removed, or removed only if the previous and new entries match exactly? The equal sign in your code is a problem if you are trying to match just on the word in List2.
Either way, I see this being done by looping thru List1 just once, and, using APIs, not looping thru List2 at all.
Edited: Here's an example using the API to locate a match just on the word, not the entire entry in List2. It could be what you are looking for, or close enough to modify. The API usage prevents us from having to loop thru List2 on every iteration until we find a 'match' or exhaust the list.
Code:
Private Declare Function SendMessage Lib "user32.dll" Alias "SendMessageA" (ByVal hwnd As Long, ByVal wMsg As Long, ByVal wParam As Long, ByRef lParam As Any) As Long
Private Const LB_FINDSTRING As Long = &H18F

Private Sub Command1_Click()
    Dim l1 As Long, l2 As Long, nrRepeat As Long
    Dim lLast As Long, sEntry As String
    nrRepeat = 1
    For l1 = 1 To List1.ListCount - 1
        If StrComp(List1.List(l1), List1.List(lLast), vbTextCompare) = 0 Then ' match
            nrRepeat = nrRepeat + 1
        Else
            sEntry = List1.List(lLast) & " - "
            l2 = SendMessage(List2.hwnd, LB_FINDSTRING, -1, ByVal sEntry)
            sEntry = sEntry & nrRepeat & " Occurrences"
            If l2 = -1 Then ' no match found on the word, add it
                List2.AddItem sEntry
            Else ' match found, replace it
                List2.List(l2) = sEntry
            End If
            nrRepeat = 1
            lLast = l1
        End If
    Next
    ' flush the final run of repeated words
    sEntry = List1.List(lLast) & " - "
    l2 = SendMessage(List2.hwnd, LB_FINDSTRING, -1, ByVal sEntry)
    sEntry = sEntry & nrRepeat & " Occurrences"
    If l2 = -1 Then ' no match found on the word, add it
        List2.AddItem sEntry
    Else ' match found, replace it
        List2.List(l2) = sEntry
    End If
End Sub
I didn't add a check to see if List1 is empty. You should probably do that.
If an exact match is what you need, then replace that API constant with this one:
Code:
Private Const LB_FINDSTRINGEXACT As Long = &H1A2
Last edited by LaVolpe; Dec 22nd, 2014 at 08:56 PM.
Insomnia is just a byproduct of, "It can't be done"
I have 6 examples. We have to supply the Bible13.txt file.
Because some examples cannot work with books, I made all examples count all the books. I include an example with a HashTable class (from Francesco Balena, changed slightly by me), my binary-search routine, and my first slow routine with a sort routine. So I have three more: one using the MS Dictionary without Olaf's SafeArray, and one using Olaf's RC5 cSortedDictionary but with proper sorting by using
WD.SetStringCompareFlags cmpIgnoreCase, English_United_States, SORT_DEFAULT
(It seems that TextCompare ignores case but does not use the right sort order unless we set these flags.)
And finally we can click on the list of books (any item gives all the books and the introduction text).
I did all of this to find out which factors add speed. On my computer, the first version of the binary-search way of counting took 12000 ms; the last version took 1400 ms (my slow routine started at 20000 ms and ended at 7500 ms). These two routines, and Olaf's with RC5, give sorted results. Olaf's without the CompareFlags settings took 550 ms; with these settings, 1550 ms. He proposes doing a sort after the routine has run, and with that optimization it took 760 ms (200 ms more, because the sort routine runs before we check the timer). But this is not the same as for the other routines that do the proper comparison.
The HashTable has no sorted output. The Dictionary output, started with a click in the listbox, is 200 ms quicker than my implementation started from a button labeled Dictionary (it is like Olaf's without the SafeArray). So the SafeArray does add speed.
From all these experiments I have to note that Olaf has done very good work with cSortedDictionary, but some things about its options must be clarified. I looked at the help, and only the syntax is described - not what we can do with it, no specifications. (And this is the same for all the classes in RC5, so not only is it a black box, it also has no embedded info or examples for each class, method, and property. If I need a lifetime to understand it, how can this RC5 be useful?)
Originally Posted by LaVolpe
Trying to wade thru all these posts. Let me see if I got this right?
1) Both lists are sorted
2) You want to count how many times a word is repeated in List1
3) Then you want to add that word, with number of repeats, to List2, deleting the previous List2 entry (if one exists?)
1 - It is Sam! (Actually, Samuel Oscar Brown (shortened to Sam Oscar Brown) is simply a moniker I created a LONG, LONG time ago. "Sam Brown" has a unique/funny history behind it; I used to have it posted on my website (MacIsIrish.com) but have since discontinued the site. I added Oscar, yes, to make it SOB. I hope I don't appear that way on this site; my life has changed considerably since the event occurred wherein I started using this alias.)
2 - Yes: 1, 2 and 3 are precisely how I had approached the problem, and yesterday (very busy day) I used an array instead of List1 - speed was about the same.
I have not yet had time to experiment any further...hopefully today....with the many suggestions above.
Here is what it looks like (and it works fine, it seems) except for the speed, when I was using two lists...
I'll respond to more of your post in a few, but the 'answers' appear correct with my original code. (Again, I have not had the opportunity to play around with others' suggestions yet.) From the image, you can see I used chapter 1 of Genesis in the NIV version - I clicked on the right-hand GO button. (The left GO button would do the entire book of, in this case, Genesis, and takes around 5 minutes or so.) (Just an image to show how I originally approached this.)
LV - using YOUR code, it only took 0.593 seconds to separate and count the words in Genesis (13082 words in the NIV).
NOW, to experiment with others' suggestions, and THEN the hard part: try to understand the logic and code y'all presented.
Don't think you'll get much faster using a listbox - a tad, if uber-optimizing. Since the items are already sorted, looping thru once is as fast as you can hope for. Probably more time is taken adding the words to the listbox.
Any massive improvement will be thru the use of methods other than the sorted listbox.
It takes 2 seconds (LB_FINDSTRING) or 3 seconds (LB_FINDSTRINGEXACT) to complete (I modified LaVolpe's code a bit).
I know this can probably look better, but I am in a hurry this morning.
Code:
Private Sub PopList4()
    Dim i As Long
    Dim l1 As Long
    Dim l2 As Long
    Dim sTime As Single     ' Timer returns a Single
    Dim nrRepeat As Long
    Dim sEntry As String
    Dim S() As String
    sTime = Timer
    ReDim S(List1.ListCount - 1)
    For i = 0 To List1.ListCount - 1
        S(i) = List1.List(i)
    Next
    List1.Visible = False
    List2.Visible = False
    nrRepeat = 0
    Do While List1.ListCount > 0
        DoEvents
        For l1 = List1.ListCount - 1 To 0 Step -1
            If List1.List(l1) = " " Or List1.List(l1) = "" Then
                If l1 < List1.ListCount Then List1.RemoveItem l1
                Exit For
            End If
            sEntry = LCase(List1.List(l1))
            Do Until l2 = -1
                DoEvents
                l2 = SendMessage(List1.hwnd, LB_FINDSTRINGEXACT, -1, ByVal sEntry)
                If l2 = -1 Then ' no match left, write out the count
                    If Not sEntry = "" Then List2.AddItem sEntry & " - " & nrRepeat & " Occurrences"
                    nrRepeat = 0
                    l2 = 0
                    Exit Do
                Else ' match found, remove it and count it
                    List1.RemoveItem l2
                    nrRepeat = nrRepeat + 1
                End If
            Loop
        Next
    Loop
    List1.Visible = True
    List2.Visible = True
    For i = 0 To UBound(S)
        List1.AddItem S(i)
    Next
    Debug.Print Timer - sTime
End Sub
List1 contains bible13.txt, split by space (Split(Text1.Text, " ")).
List2 has Sorted = True (not necessary).
1) When Sam said 1/2 second, I'm sure he wasn't counting the time taken to get the words into the list.
2) DoEvents is probably slowing your routine down.
3) Do While is a tad slower; it has to re-evaluate List1.ListCount on each iteration.
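On point 3, a small sketch of the difference (variable names are my own). Hoisting the count is only valid when the loop body doesn't add or remove items, which is why a routine that removes entries must keep re-reading it:

```vb
Dim i As Long, n As Long

' Re-evaluates the ListCount property on every pass:
Do While i < List1.ListCount
    i = i + 1
Loop

' Evaluates the bound once, up front (only safe if the list
' is not modified inside the loop):
n = List1.ListCount
For i = 0 To n - 1
    ' ... work with List1.List(i) ...
Next
```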
LaVolpe, I added DoEvents (I always add it for testing) because if the do-loop crashes I lose all the code... it can be removed; there were no errors or infinite loops in my example.
I did this on the fly, but yes, it can be done a lot better - both the look and the speed.
I just wanted to add the looping to find/add every word in one go.
Edit:
You do save 0.1 to 0.2 seconds (100 ms to 200 ms) by removing the DoEvents.
Correct, I was not counting the populating of List1; however, that was very rapid as well (2.63 secs), for a total of 3.223 seconds for both lists to be populated.
Now, when I don't use List1 at all, but populate an array instead (which was 'instantaneous', versus 2.63 seconds to populate List1),
and use the array in lieu of List1 to populate List2, the total time to populate the array (~0) and List2 was 5.787 seconds. Very interesting... I thought that using an array in lieu of List1 would be faster at populating List2, but no - a total of ~2.5 seconds MORE.
So, back to using two lists; I am MORE THAN PLEASED with the total of 3.222 seconds, when MY times originally were in the MINUTES.
Sam, the 5.787 seconds might have been an anomaly? Did you try it more than once without unloading the form? Maybe some initialization was going on between VB and List2, something that happens just once?
LV, DM: Here's the code (LV's - the only thing changed is that I substituted wordArray() for List1). (wordArray, for Genesis, has 13081 'words'.)
No redimming, simply looping through the array once.
Times are pretty consistently between 5.7 and 6.6 seconds (first run, second click, etc.), as compared to ~3.2 seconds using List1.
Code:
StartTiming
List2.Clear
Dim l1 As Long, l2 As Long, nrRepeat As Long
Dim lLast As Long, sEntry As String
nrRepeat = 1
For l1 = 1 To UBound(wordArray) ' UBound is the last index
    If StrComp(wordArray(l1), wordArray(lLast), vbTextCompare) = 0 Then ' match
        nrRepeat = nrRepeat + 1
    Else
        sEntry = wordArray(lLast) & " - "
        l2 = SendMessage(List2.hwnd, LB_FINDSTRING, -1, ByVal sEntry)
        sEntry = sEntry & nrRepeat & " Occurrences"
        If l2 = -1 Then ' no match found on the word, add it
            List2.AddItem sEntry
        Else ' match found, replace it
            List2.List(l2) = sEntry
        End If
        nrRepeat = 1
        lLast = l1
    End If
Next
' flush the final run
sEntry = wordArray(lLast) & " - "
l2 = SendMessage(List2.hwnd, LB_FINDSTRING, -1, ByVal sEntry)
sEntry = sEntry & nrRepeat & " Occurrences"
If l2 = -1 Then ' no match found on the word, add it
    List2.AddItem sEntry
Else ' match found, replace it
    List2.List(l2) = sEntry
End If
MsgBox GetTime
That's a head-scratcher. You'd figure that comparing List1.List(l1) to List1.List(lLast) would be slower, since calls to the listbox are needed to extract the strings for comparison - unless VB is optimizing something under the covers. Another possible reason: by using an array, I'd imagine twice the bytes are being compared - wordArray() holds Unicode strings, where List1.List(x) is an ANSI string? A head-scratcher.
To describe it better: the first item, at position 0, is tracked by a variable used as the starting offset, named lLast. So the next will be 1, the next 2; there is a line to advance the offset to the next one: lLast = l1.
What is in the StartTiming and GetTime functions? I would use:
Code:
Dim sTime As Single ' Timer returns a Single
sTime = Timer
'execute code here
Debug.Print Timer - sTime
These are those functions.... I was going to use Timer but have done it this way before....
Again, tho, the results are about the same, ~6 seconds (WITH THE ARRAY VICE THE LISTBOX).
Code:
Private Declare Function GetTickCount Lib "kernel32" () As Long
Private m_StartTime As Long

Private Sub StartTiming()
    m_StartTime = GetTickCount()
End Sub

Private Function GetTime() As Single
    Dim StopTime As Long
    StopTime = GetTickCount()
    GetTime = (StopTime - m_StartTime) / 1000
End Function
I'm going to go ahead and mark this as Resolved (as it works fast now), but will keep experimenting with the other options provided. Yes, I'm curious how the same code using a listbox was faster than the array. Oh well.
THANKS TO ALL for educating this old man....still learnin'
Actually, PROBABLY, most likely, it is that I did not have the array sorted first, like I did List1. THAT is probably the reason for the extra time. I may go in and sort the array first, and then test the speed. OPINIONS?
I think a sorted List1 gives a sorted List2. If you have List2 auto-sorted, then unsorted input may cost some speed, but not much. So use an unsorted List2 and a sorted List1.
Yup, as suspected... If I sort my array first (as I had List1 sorted first), the time to count the words and populate List2 is reduced from ~5 seconds to various instances between 0.007 and 0.99 secs - big difference.
So, I suppose, if I eliminate BOTH listboxes and use two arrays, it will be much faster to populate that second array than to populate List2.
If I try that, I still need to show the results on screen, but what I may do is take that second array and populate an MSFlexGrid with two columns, the first being the count, the second being the word, and sort it by count (highest first). That would make the most sense to me if I wanted to see the most common words. I will then probably add a search routine to find the grid's row for any given word. Sounds like a fun little challenge.
Last edited by SamOscarBrown; Dec 23rd, 2014 at 01:57 PM.
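The speedup Sam observed above is what the single-pass approach relies on: in a sorted array, equal words sit in one contiguous run, so counting takes one comparison per element. A minimal sketch of that idea (the names are my own; it assumes words() is a 0-based, sorted String array):

```vb
' Count word runs in a pre-sorted, 0-based String array.
Private Sub CountSortedRuns(words() As String)
    Dim i As Long, runStart As Long
    runStart = 0
    For i = 1 To UBound(words)
        If StrComp(words(i), words(runStart), vbTextCompare) <> 0 Then
            Debug.Print words(runStart) & " - " & (i - runStart) & " Occurrences"
            runStart = i       ' a new run begins here
        End If
    Next
    ' flush the final run
    Debug.Print words(runStart) & " - " & (UBound(words) - runStart + 1) & " Occurrences"
End Sub
```

On unsorted input the same loop produces false counts, which matches the slowdown and wrong results discussed in this thread.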
No, if you use an array as the second list you slow the process down, because you then have to implement the search routine yourself... So use the MS Dictionary to get a fast search...
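A minimal sketch of that suggestion, using the MS Scripting.Dictionary via late binding (the names are my own; with a reference to Microsoft Scripting Runtime you could declare it As Scripting.Dictionary instead):

```vb
' Word counting with the MS Scripting.Dictionary: the hash lookup
' replaces any hand-written search over a second array or list.
Private Function CountWords(words() As String) As Object
    Dim dict As Object, i As Long, key As String
    Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = vbTextCompare      ' case-insensitive keys
    For i = 0 To UBound(words)
        key = words(i)
        If dict.Exists(key) Then
            dict(key) = dict(key) + 1
        Else
            dict.Add key, 1
        End If
    Next
    Set CountWords = dict
End Function
```

Like the HashTable class mentioned earlier, the Dictionary's keys come back unsorted, so a separate sort stage is still needed for ordered output.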
Well, you may be right... What I did was keep the second listbox, split the text values of that listbox into word and number of occurrences, then populated a table in my database, then retrieved the information back into a grid with an SQL query which brought back the most occurrences first, etc. The WHOLE process (including reading the data in the first time, until the grid was fully populated) took only 1.15 seconds.
Of course, now my code is commented out in many, many places, so I have to go back and clean it up... and, once again, fully understand why the populating of the second listbox actually occurred so much more rapidly than my first attempt.
Yup, as suspected....If I sort my array first (as I had list1 sorted first), the time to count the words and populate list2 was reduced from ~ 5 seconds to various instances between 0.007 and 0.99 secs...big difference.
That could explain the speed hit. Not only that, the routines depend on a sorted source list. Without it being sorted, you are not only getting false occurrence counts, but are also updating List2 way too often.
I did some practice with the examples, to optimize them. I see that making big arrays from the start is a worse scenario than growing them in steps. So here is the 7-example program. I conform to the proposal for later sorting with Olaf's super-fast vbRichClient, so it now has the full speed with appropriate sorting afterwards.
The 7th example is the myDoc class (for documents), which has the binary-search routine "on board". So if we have a document (myDoc is also combined with the text viewer, an editor I have in M2000 - look at the code - with a setting to reference the text viewer, which just uses the data from myDoc), I made some functions, such as counting, in ms, the number of words (very fast); we can get all the words in an array or a listbox, or we can count the unique words and, if we want, later use the saved words to fill a listbox or an array.
The binary-search routine counts the words at the same time that it sorts them. It now has comparable speed: the stand-alone routine needs 1.3 seconds to find, among 792367 words, the 13231 unique words and their number of appearances. MS Dictionary with a sorting stage needs 1.118 s; Olaf's routine 0.574 s (very fast, including the second sort), and 1.156 s if we apply the sort and include the time to the end (this is for the MS Dictionary, clicking in the left listbox with the books). (We must hide all other windows to get a good measurement.) Binary search: 1.330 s; HashTable: 1.515 s including the sort routine; the slow routine: 10.500 s (including the sort routine). Finally, the 7th routine (it has the same binary routine but gets the data from the Document class and runs in the space of that class): 1.480 s. So, except for the slow routine, all routines fall between 0.574 s and 1.515 s. The binary-search routine in the Document class is more versatile, because it works for Unicode (and sorts for it). We can pass the LCID, so we can sort for any country (although I have not managed to fit the sub-regions; that is too much for now).
That's all...
(Now I am placing the new Document class myDoc.class into M2000, so I am making new commands to count words and read the number of appearances for any word.)