Results 1 to 11 of 11

Thread: short code improvement challenge

  1. #1

    Thread Starter
    Frenzied Member wengang's Avatar
    Join Date
    Mar 2000
    Location
    Beijing, China
    Posts
    1,602

    short code improvement challenge

    Hi all.
    This is a code I wrote several years ago, and it works ok, but it's really slow with large blocks of text. I'm wondering if anybody has a better idea.

    It actually runs in a word macro, but close enough to VB6.
    The idea is I have a block of text that is English and Chinese all mixed together, and my task is to get all the clusters of Chinese characters sorted into an array.
    My approach is to replace everything that is not ASC(63) (question mark, and Chinese characters in Word macros) with an asterisk.
    Then I replace all asterisk clusters with single asterisks, then I split the text over the asterisks to get my Chinese character cluster array.
    Like I said, works ok, but on my slow office computer, larger blocks of text take minutes, not seconds, to break down.
    My code:

    For t = 1 To Len(strWholeDoc)
    If Asc(Mid(strWholeDoc, t, 1)) <> 63 Then strWholeDoc = Replace(strWholeDoc, Mid(strWholeDoc, t, 1), "*")
    Next t
    Do Until InStr(strWholeDoc, "**") = 0
    strWholeDoc = Replace(strWholeDoc, "**", "*")
    Loop
    sTerm = Split(strWholeDoc, "*")


    I was thinking maybe Regular Expressions could do something instantaneous, but I never get Regular Expressions right without a hundred attempts.

    Anybody?
    Thanks.
    Wen Gang, Programmer
    VB6, QB, HTML, ASP, VBScript, Visual C++, Java

  2. #2
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: short code improvement challenge

    Regular expressions are not fast at all. They are just somewhat faster than the scripting languages they were designed to be used with.

    VBA isn't VB6 and can't be compiled to native code, but even VBA p-code interpretation is faster than most scripting languages and there are function one can use to avoid doing everything in pure p-code anyway.

    Just to begin with, stop using Variant-valued functions so much. Avoid the ones that imply Unicode-to-ANSI conversion too. That might gain you back a little perfomance.


    Where does the question mark come into the picture? Is that just the "failed" character from a failing Unicode to ANSI conversion? How do you get the text in the first place, i.e. has the data already been mangled through conversions or does the String have clean Unicode data in it?

    Or are you saying mangling has already occurred and all of your "won't map into ANSI" characters have been turned into "?" already?

    I guess where I'm going is that without sample data I'm not sure we can help much.

  3. #3
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    9,940

    Re: short code improvement challenge

    Yes, I also noticed the use of variant-valued functions. In other words, use Mid$ instead of Mid.

    Also, your ...
    Code:
    If Asc(Mid(strWholeDoc, t, 1)) <> 63 Then strWholeDoc = Replace(strWholeDoc, Mid(strWholeDoc, t, 1), "*")
    ... line of code seems to be doing LOTS of extra work. Aren't you re-replacing all of your asterisk characters over and over and OVER?

    How about...
    Code:
    If Asc(Mid(strWholeDoc, t, 1)) <> 63 And Asc(Mid(strWholeDoc, t, 1)) <> Asc("*") Then strWholeDoc = Replace(strWholeDoc, Mid(strWholeDoc, t, 1), "*")
    Also, replacing Asc("*") with 42 will help, and also creating a temporary variable for Asc(Mid(strWholeDoc, t, 1)) will help ever so slightly.
    Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.

  4. #4
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: short code improvement challenge

    Here is an example in VB6. That might not do you any good in itself if you don't have VB6, but this code should be portable to VBA:

    Code:
    Private Function NonAnsiSubstrings(ByRef Text As String) As Collection
        Dim Collection As Collection 'Items hold: Array(start, length, text)
        Dim TextBytes() As Byte
        Dim I As Long
        Dim StartPos As Long
        Dim J As Long
        Dim EndPos As Long
        Dim FoundEnd As Boolean
    
        Set Collection = New Collection
        If Len(Text) > 0 Then
            TextBytes = Text
            For I = 1 To UBound(TextBytes) Step 2
                If TextBytes(I) > 0 Then
                    StartPos = (I - 1) \ 2 + 1 'Convert to character position in a String.
                    For J = I + 2 To UBound(TextBytes) Step 2
                        If TextBytes(J) = 0 Then
                            EndPos = (J - 3) \ 2 + 1 'Back up 1 char, convert to char position.
                            Collection.Add Array(StartPos, _
                                                 EndPos - StartPos + 1, _
                                                 Mid$(Text, StartPos, EndPos - StartPos + 1))
                            FoundEnd = True
                            Exit For
                        End If
                    Next
                    I = J 'Adjust I past the substring.
                    If FoundEnd Then
                        FoundEnd = False 'Prepare for next substring.
                    Else
                        'We ran off the end while scanning.
                        Collection.Add Array(StartPos, _
                                             Len(Text) - StartPos + 1, _
                                             Mid$(Text, StartPos))
                    End If
                End If
            Next
        End If
        Set NonAnsiSubstrings = Collection
    End Function
    If you really want an array you could either: (a.) use one and manage its growth via Redim Preserve statements, or else (b.) you could copy the resulting Collection into an array after calling this Function.

    Seems to run pretty quickly though, even with the 196KB UTF-16LE sample text file in the attachment.

    Name:  sshot.png
Views: 163
Size:  25.3 KB
    Attached Files Attached Files

  5. #5
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: short code improvement challenge

    Quote Originally Posted by wengang View Post
    I was thinking maybe Regular Expressions could do something instantaneous, but I never get Regular Expressions right without a hundred attempts.
    See if the following RegEx solution is quick enough for you:

    Code:
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "[\x00-\xFF]+"
    
        sTerm = Split(.Replace(strWholeDoc, vbNullChar), vbNullChar)
    End With
    On Local Error Resume Next: If Not Empty Is Nothing Then Do While Null: ReDim i(True To False) As Currency: Loop: Else Debug.Assert CCur(CLng(CInt(CBool(False Imp True Xor False Eqv True)))): Stop: On Local Error GoTo 0
    Declare Sub CrashVB Lib "msvbvm60" (Optional DontPassMe As Any)

  6. #6
    Fanatic Member
    Join Date
    Jan 2015
    Posts
    596

    Re: short code improvement challenge

    Here is my idea.

    Code:
    Public Function ChangeString(sWholeDoc As String) As String
       ' #VBIDEUtils#************************************************************
       ' * Author           : Thierry69
       ' * Module Name      : Form1
       ' * Module Filename  :
       ' * Procedure Name   : ChangeString
       ' * Purpose          :
       ' * Parameters       :
       ' *                    sWholeDoc As String
       ' * Purpose          :
       ' **********************************************************************
       ' * Comments         :
       ' *
       ' *
       ' * Example          :
       ' *
       ' * See Also         :
       ' *
       ' * History          :
       ' *
       ' *
       ' **********************************************************************
    
       Dim sFinal     As String
       Dim sChar      As String
       Dim nLen       As Long
       Dim nI         As Long
       Dim bFound     As Boolean
       
       sFinal = vbNullString
       
       nLen = Len(sWholeDoc)
       bFound = True
       
       For nI = 1 To nLen
          sChar = Mid$(sWholeDoc, nI, 1)
          If Asc(sChar) <> 63 Then
             If bFound Then
                sFinal = sFinal & "*"
                bFound = False
             End If
          Else
             sFinal = sFinal & sChar
             bFound = True
          End If
       Next
    
       ChangeString = sFinal
    
    End Function
    And the test :
    Code:
       Dim dTime            As Double
       Dim sTest            As String
       
       Dim nI               As Long
    
       sTest = "azsireo??????sqdf????qsdqsd???"
       For nI = 1 To 13
          sTest = sTest & sTest
       Next
       
       
       dTime = Now
       Call ChangeString(sTest)
       Debug.Print dTime - Now

    I tested with your code and mine, it tooks 2 sec for my code, and after 1 min, I stopped to test yours

    Then, of course :
    sTerm = Split(ChangeString(sTest), "*")
    Last edited by Thierry69; Jul 9th, 2016 at 01:45 AM.

  7. #7
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: short code improvement challenge

    Quote Originally Posted by wengang View Post
    The idea is I have a block of text that is English and Chinese all mixed together, and my task is to get all the clusters of Chinese characters sorted into an array.
    Sorry, I must have misinterpreted this.

    I took it as "extract the substrings made up of Chinese characters." I assumed that characters outside the ANSI subset range (i.e. any where the upper byte is non-zero) is a good enough working definition of "Chinese characters" for this purpose.

  8. #8

    Thread Starter
    Frenzied Member wengang's Avatar
    Join Date
    Mar 2000
    Location
    Beijing, China
    Posts
    1,602

    Re: short code improvement challenge

    Elroy I never realized I was replacing the asterisk with itself. I wrote that code over five years ago, and I guess if it occurred to me at the time, I must have thought it wouldn't make much difference speedwise. But my current office computer is super slow, so the difference was apparent. That little correction changed the code from over 1 min to instantaneous.
    Wen Gang, Programmer
    VB6, QB, HTML, ASP, VBScript, Visual C++, Java

  9. #9

    Thread Starter
    Frenzied Member wengang's Avatar
    Join Date
    Mar 2000
    Location
    Beijing, China
    Posts
    1,602

    Re: short code improvement challenge

    Dilettante, no you understood correctly. It just so happens that it was the repetition of asterisk replacing that was slowing the document down. The string can be 20,000 characters or longer sometimes (it's just a grab of all the text in a word doc). So the extra work was giving my computer a tough time. Thanks for your replies.
    Wen Gang, Programmer
    VB6, QB, HTML, ASP, VBScript, Visual C++, Java

  10. #10
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: short code improvement challenge

    Contrary to popular belief, it turns out the VBScript Regular Expressions Object isn't always slow:





    On Local Error Resume Next: If Not Empty Is Nothing Then Do While Null: ReDim i(True To False) As Currency: Loop: Else Debug.Assert CCur(CLng(CInt(CBool(False Imp True Xor False Eqv True)))): Stop: On Local Error GoTo 0
    Declare Sub CrashVB Lib "msvbvm60" (Optional DontPassMe As Any)

  11. #11
    Fanatic Member
    Join Date
    Jan 2015
    Posts
    596

    Re: short code improvement challenge

    Interesting Bonnie.

    Personally, I never look to RegEx, I prefer easily maintainable code, but according your picture, I'll take a look, when I'll need such

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width