Hi all.
This is a code I wrote several years ago, and it works ok, but it's really slow with large blocks of text. I'm wondering if anybody has a better idea.
It actually runs in a word macro, but close enough to VB6.
The idea is I have a block of text that is English and Chinese all mixed together, and my task is to get all the clusters of Chinese characters sorted into an array.
My approach is to replace everything that is not ASC(63) (question mark, and Chinese characters in Word macros) with an asterisk.
Then I replace all asterisk clusters with single asterisks, then I split the text over the asterisks to get my Chinese character cluster array.
Like I said, works ok, but on my slow office computer, larger blocks of text take minutes, not seconds, to break down.
My code:
For t = 1 To Len(strWholeDoc)
If Asc(Mid(strWholeDoc, t, 1)) <> 63 Then strWholeDoc = Replace(strWholeDoc, Mid(strWholeDoc, t, 1), "*")
Next t
Do Until InStr(strWholeDoc, "**") = 0
strWholeDoc = Replace(strWholeDoc, "**", "*")
Loop
sTerm = Split(strWholeDoc, "*")
I was thinking maybe Regular Expressions could do something instantaneous, but I never get Regular Expressions right without a hundred attempts.
Regular expressions are not fast at all. They are just somewhat faster than the scripting languages they were designed to be used with.
VBA isn't VB6 and can't be compiled to native code, but even VBA p-code interpretation is faster than most scripting languages and there are function one can use to avoid doing everything in pure p-code anyway.
Just to begin with, stop using Variant-valued functions so much. Avoid the ones that imply Unicode-to-ANSI conversion too. That might gain you back a little perfomance.
Where does the question mark come into the picture? Is that just the "failed" character from a failing Unicode to ANSI conversion? How do you get the text in the first place, i.e. has the data already been mangled through conversions or does the String have clean Unicode data in it?
Or are you saying mangling has already occurred and all of your "won't map into ANSI" characters have been turned into "?" already?
I guess where I'm going is that without sample data I'm not sure we can help much.
Yes, I also noticed the use of variant-valued functions. In other words, use Mid$ instead of Mid.
Also, your ...
Code:
If Asc(Mid(strWholeDoc, t, 1)) <> 63 Then strWholeDoc = Replace(strWholeDoc, Mid(strWholeDoc, t, 1), "*")
... line of code seems to be doing LOTS of extra work. Aren't you re-replacing all of your asterisk characters over and over and OVER?
How about...
Code:
If Asc(Mid(strWholeDoc, t, 1)) <> 63 And Asc(Mid(strWholeDoc, t, 1)) <> Asc("*") Then strWholeDoc = Replace(strWholeDoc, Mid(strWholeDoc, t, 1), "*")
Also, replacing Asc("*") with 42 will help, and also creating a temporary variable for Asc(Mid(strWholeDoc, t, 1)) will help ever so slightly.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Here is an example in VB6. That might not do you any good in itself if you don't have VB6, but this code should be portable to VBA:
Code:
Private Function NonAnsiSubstrings(ByRef Text As String) As Collection
Dim Collection As Collection 'Items hold: Array(start, length, text)
Dim TextBytes() As Byte
Dim I As Long
Dim StartPos As Long
Dim J As Long
Dim EndPos As Long
Dim FoundEnd As Boolean
Set Collection = New Collection
If Len(Text) > 0 Then
TextBytes = Text
For I = 1 To UBound(TextBytes) Step 2
If TextBytes(I) > 0 Then
StartPos = (I - 1) \ 2 + 1 'Convert to character position in a String.
For J = I + 2 To UBound(TextBytes) Step 2
If TextBytes(J) = 0 Then
EndPos = (J - 3) \ 2 + 1 'Back up 1 char, convert to char position.
Collection.Add Array(StartPos, _
EndPos - StartPos + 1, _
Mid$(Text, StartPos, EndPos - StartPos + 1))
FoundEnd = True
Exit For
End If
Next
I = J 'Adjust I past the substring.
If FoundEnd Then
FoundEnd = False 'Prepare for next substring.
Else
'We ran off the end while scanning.
Collection.Add Array(StartPos, _
Len(Text) - StartPos + 1, _
Mid$(Text, StartPos))
End If
End If
Next
End If
Set NonAnsiSubstrings = Collection
End Function
If you really want an array you could either: (a.) use one and manage its growth via Redim Preserve statements, or else (b.) you could copy the resulting Collection into an array after calling this Function.
Seems to run pretty quickly though, even with the 196KB UTF-16LE sample text file in the attachment.
I was thinking maybe Regular Expressions could do something instantaneous, but I never get Regular Expressions right without a hundred attempts.
See if the following RegEx solution is quick enough for you:
Code:
With CreateObject("VBScript.RegExp")
.Global = True
.Pattern = "[\x00-\xFF]+"
sTerm = Split(.Replace(strWholeDoc, vbNullChar), vbNullChar)
End With
On Local Error Resume Next: If Not Empty Is Nothing Then Do While Null: ReDim i(True To False) As Currency: Loop: Else Debug.Assert CCur(CLng(CInt(CBool(False Imp True Xor False Eqv True)))): Stop: On Local Error GoTo 0
Public Function ChangeString(sWholeDoc As String) As String
' #VBIDEUtils#************************************************************
' * Author : Thierry69
' * Module Name : Form1
' * Module Filename :
' * Procedure Name : ChangeString
' * Purpose :
' * Parameters :
' * sWholeDoc As String
' * Purpose :
' **********************************************************************
' * Comments :
' *
' *
' * Example :
' *
' * See Also :
' *
' * History :
' *
' *
' **********************************************************************
Dim sFinal As String
Dim sChar As String
Dim nLen As Long
Dim nI As Long
Dim bFound As Boolean
sFinal = vbNullString
nLen = Len(sWholeDoc)
bFound = True
For nI = 1 To nLen
sChar = Mid$(sWholeDoc, nI, 1)
If Asc(sChar) <> 63 Then
If bFound Then
sFinal = sFinal & "*"
bFound = False
End If
Else
sFinal = sFinal & sChar
bFound = True
End If
Next
ChangeString = sFinal
End Function
And the test :
Code:
Dim dTime As Double
Dim sTest As String
Dim nI As Long
sTest = "azsireo??????sqdf????qsdqsd???"
For nI = 1 To 13
sTest = sTest & sTest
Next
dTime = Now
Call ChangeString(sTest)
Debug.Print dTime - Now
I tested with your code and mine, it tooks 2 sec for my code, and after 1 min, I stopped to test yours
Then, of course :
sTerm = Split(ChangeString(sTest), "*")
Last edited by Thierry69; Jul 9th, 2016 at 01:45 AM.
The idea is I have a block of text that is English and Chinese all mixed together, and my task is to get all the clusters of Chinese characters sorted into an array.
Sorry, I must have misinterpreted this.
I took it as "extract the substrings made up of Chinese characters." I assumed that characters outside the ANSI subset range (i.e. any where the upper byte is non-zero) is a good enough working definition of "Chinese characters" for this purpose.
Elroy I never realized I was replacing the asterisk with itself. I wrote that code over five years ago, and I guess if it occurred to me at the time, I must have thought it wouldn't make much difference speedwise. But my current office computer is super slow, so the difference was apparent. That little correction changed the code from over 1 min to instantaneous.
Dilettante, no you understood correctly. It just so happens that it was the repetition of asterisk replacing that was slowing the document down. The string can be 20,000 characters or longer sometimes (it's just a grab of all the text in a word doc). So the extra work was giving my computer a tough time. Thanks for your replies.
On Local Error Resume Next: If Not Empty Is Nothing Then Do While Null: ReDim i(True To False) As Currency: Loop: Else Debug.Assert CCur(CLng(CInt(CBool(False Imp True Xor False Eqv True)))): Stop: On Local Error GoTo 0