short code improvement challenge

**wengang** · Jul 8th, 2016, 12:44 PM

Hi all.
This is a code I wrote several years ago, and it works ok, but it's really slow with large blocks of text. I'm wondering if anybody has a better idea.

It actually runs in a word macro, but close enough to VB6.
The idea is I have a block of text that is English and Chinese all mixed together, and my task is to get all the clusters of Chinese characters sorted into an array.
My approach is to replace everything that is not ASC(63) (question mark, and Chinese characters in Word macros) with an asterisk.
Then I replace all asterisk clusters with single asterisks, then I split the text over the asterisks to get my Chinese character cluster array.
Like I said, works ok, but on my slow office computer, larger blocks of text take minutes, not seconds, to break down.
My code:

For t = 1 To Len(strWholeDoc)
If Asc(Mid(strWholeDoc, t, 1)) <> 63 Then strWholeDoc = Replace(strWholeDoc, Mid(strWholeDoc, t, 1), "*")
Next t
Do Until InStr(strWholeDoc, "**") = 0
strWholeDoc = Replace(strWholeDoc, "**", "*")
Loop
sTerm = Split(strWholeDoc, "*")

I was thinking maybe Regular Expressions could do something instantaneous, but I never get Regular Expressions right without a hundred attempts.

Anybody?
Thanks.

**dilettante** · Jul 8th, 2016, 12:54 PM

Regular expressions are not fast at all. They are just somewhat faster than the scripting languages they were designed to be used with.

VBA isn't VB6 and can't be compiled to native code, but even VBA p-code interpretation is faster than most scripting languages and there are function one can use to avoid doing everything in pure p-code anyway.

Just to begin with, stop using Variant-valued functions so much. Avoid the ones that imply Unicode-to-ANSI conversion too. That might gain you back a little perfomance.

Where does the question mark come into the picture? Is that just the "failed" character from a failing Unicode to ANSI conversion? How do you get the text in the first place, i.e. has the data already been mangled through conversions or does the String have clean Unicode data in it?

Or are you saying mangling has already occurred and all of your "won't map into ANSI" characters have been turned into "?" already?

I guess where I'm going is that without sample data I'm not sure we can help much.

**Elroy** · Jul 8th, 2016, 01:33 PM

Yes, I also noticed the use of variant-valued functions. In other words, use Mid$ instead of Mid.

Also, your ...

Code:

If Asc(Mid(strWholeDoc, t, 1)) <> 63 Then strWholeDoc = Replace(strWholeDoc, Mid(strWholeDoc, t, 1), "*")

... line of code seems to be doing LOTS of extra work. Aren't you re-replacing all of your asterisk characters over and over and OVER?

How about...

Code:

If Asc(Mid(strWholeDoc, t, 1)) <> 63 And Asc(Mid(strWholeDoc, t, 1)) <> Asc("*") Then strWholeDoc = Replace(strWholeDoc, Mid(strWholeDoc, t, 1), "*")

Also, replacing Asc("*") with 42 will help, and also creating a temporary variable for Asc(Mid(strWholeDoc, t, 1)) will help ever so slightly.

**dilettante** · Jul 8th, 2016, 02:07 PM

Here is an example in VB6. That might not do you any good in itself if you don't have VB6, but this code should be portable to VBA:

Code:

Private Function NonAnsiSubstrings(ByRef Text As String) As Collection
    Dim Collection As Collection 'Items hold: Array(start, length, text)
    Dim TextBytes() As Byte
    Dim I As Long
    Dim StartPos As Long
    Dim J As Long
    Dim EndPos As Long
    Dim FoundEnd As Boolean

    Set Collection = New Collection
    If Len(Text) > 0 Then
        TextBytes = Text
        For I = 1 To UBound(TextBytes) Step 2
            If TextBytes(I) > 0 Then
                StartPos = (I - 1) \ 2 + 1 'Convert to character position in a String.
                For J = I + 2 To UBound(TextBytes) Step 2
                    If TextBytes(J) = 0 Then
                        EndPos = (J - 3) \ 2 + 1 'Back up 1 char, convert to char position.
                        Collection.Add Array(StartPos, _
                                             EndPos - StartPos + 1, _
                                             Mid$(Text, StartPos, EndPos - StartPos + 1))
                        FoundEnd = True
                        Exit For
                    End If
                Next
                I = J 'Adjust I past the substring.
                If FoundEnd Then
                    FoundEnd = False 'Prepare for next substring.
                Else
                    'We ran off the end while scanning.
                    Collection.Add Array(StartPos, _
                                         Len(Text) - StartPos + 1, _
                                         Mid$(Text, StartPos))
                End If
            End If
        Next
    End If
    Set NonAnsiSubstrings = Collection
End Function

If you really want an array you could either: (a.) use one and manage its growth via Redim Preserve statements, or else (b.) you could copy the resulting Collection into an array after calling this Function.

Seems to run pretty quickly though, even with the 196KB UTF-16LE sample text file in the attachment.

Name: sshot.png
Views: 163
Size: 25.3 KB

**Bonnie West** · Jul 8th, 2016, 03:18 PM

Originally Posted by wengang

I was thinking maybe Regular Expressions could do something instantaneous, but I never get Regular Expressions right without a hundred attempts.

See if the following RegEx solution is quick enough for you:

Code:

With CreateObject("VBScript.RegExp")
    .Global = True
    .Pattern = "[\x00-\xFF]+"

    sTerm = Split(.Replace(strWholeDoc, vbNullChar), vbNullChar)
End With

**Thierry69** · Jul 9th, 2016, 01:42 AM

Here is my idea.

Code:

Public Function ChangeString(sWholeDoc As String) As String
   ' #VBIDEUtils#************************************************************
   ' * Author           : Thierry69
   ' * Module Name      : Form1
   ' * Module Filename  :
   ' * Procedure Name   : ChangeString
   ' * Purpose          :
   ' * Parameters       :
   ' *                    sWholeDoc As String
   ' * Purpose          :
   ' **********************************************************************
   ' * Comments         :
   ' *
   ' *
   ' * Example          :
   ' *
   ' * See Also         :
   ' *
   ' * History          :
   ' *
   ' *
   ' **********************************************************************

   Dim sFinal     As String
   Dim sChar      As String
   Dim nLen       As Long
   Dim nI         As Long
   Dim bFound     As Boolean
   
   sFinal = vbNullString
   
   nLen = Len(sWholeDoc)
   bFound = True
   
   For nI = 1 To nLen
      sChar = Mid$(sWholeDoc, nI, 1)
      If Asc(sChar) <> 63 Then
         If bFound Then
            sFinal = sFinal & "*"
            bFound = False
         End If
      Else
         sFinal = sFinal & sChar
         bFound = True
      End If
   Next

   ChangeString = sFinal

End Function

And the test :

Code:

   Dim dTime            As Double
   Dim sTest            As String
   
   Dim nI               As Long

   sTest = "azsireo??????sqdf????qsdqsd???"
   For nI = 1 To 13
      sTest = sTest & sTest
   Next
   
   
   dTime = Now
   Call ChangeString(sTest)
   Debug.Print dTime - Now

I tested with your code and mine, it tooks 2 sec for my code, and after 1 min, I stopped to test yours

Then, of course :
sTerm = Split(ChangeString(sTest), "*")

**dilettante** · Jul 9th, 2016, 10:07 AM

Originally Posted by wengang

The idea is I have a block of text that is English and Chinese all mixed together, and my task is to get all the clusters of Chinese characters sorted into an array.

Sorry, I must have misinterpreted this.

I took it as "extract the substrings made up of Chinese characters." I assumed that characters outside the ANSI subset range (i.e. any where the upper byte is non-zero) is a good enough working definition of "Chinese characters" for this purpose.

**wengang** · Jul 11th, 2016, 08:12 AM

Elroy I never realized I was replacing the asterisk with itself. I wrote that code over five years ago, and I guess if it occurred to me at the time, I must have thought it wouldn't make much difference speedwise. But my current office computer is super slow, so the difference was apparent. That little correction changed the code from over 1 min to instantaneous.

**wengang** · Jul 11th, 2016, 08:14 AM

Dilettante, no you understood correctly. It just so happens that it was the repetition of asterisk replacing that was slowing the document down. The string can be 20,000 characters or longer sometimes (it's just a grab of all the text in a word doc). So the extra work was giving my computer a tough time. Thanks for your replies.

**Bonnie West** · Jul 11th, 2016, 06:46 PM

Contrary to popular belief, it turns out the VBScript Regular Expressions Object isn't always slow:

**Thierry69** · Jul 11th, 2016, 11:56 PM

Interesting Bonnie.

Personally, I never look to RegEx, I prefer easily maintainable code, but according your picture, I'll take a look, when I'll need such

Thread: short code improvement challenge

Thread Tools

Display

short code improvement challenge

Re: short code improvement challenge

Re: short code improvement challenge

Re: short code improvement challenge

Re: short code improvement challenge

Re: short code improvement challenge

Re: short code improvement challenge

Re: short code improvement challenge

Re: short code improvement challenge

Re: short code improvement challenge

Re: short code improvement challenge

Posting Permissions