Results 1 to 21 of 21

Thread: [RESOLVED] Fastest way to remove strings between two substrings?

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2008
    Posts
    355

    Resolved [RESOLVED] Fastest way to remove strings between two substrings?

    I am looking for a fastest possible way to remove a set of strings (they are not the same) placed between two characters or sub-strings.

    For example:

    If I have a sentence like this:

    I am looking for a <-- fastest possible --> way to remove <--a set of--> strings <--(they are not the same)--> placed between two <--characters or--> sub-strings.

    How would I remove the texts between <-- and --> in the fastest possible way.

    Thank you.

  2. #2
    Next Of Kin baja_yu's Avatar
    Join Date
    Aug 2002
    Location
    /dev/root
    Posts
    5,989

    Re: Fastest way to remove strings between two substrings?

    Code:
    Private Sub Form_Load()
       Dim strOriginal            As String
       Dim strNew                 As String
       
       strOriginal = "I am looking for a <-- fastest possible --> way to remove <--a set of-->  strings <--(they are not the same)--> placed between two <--characters or-->  sub-strings."
       
       Do Until InStr(strOriginal, "<--") = 0
          strOriginal = Replace$(strOriginal, Mid$(strOriginal, InStr(strOriginal, "<--"), InStr(strOriginal, "-->") - InStr(strOriginal, "<--") + 3), "")
       Loop
       
       MsgBox strOriginal
    End Sub
    But maybe it can be faster using regular expressions.

  3. #3
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: Fastest way to remove strings between two substrings?

    You can get a good speed by using Mid$() = Mid$() technique, working within the same string. For best optimization for speed you would need to first go through all occurances, storing all start positions and lengths, then moving the remaining substrings towards the beginning of the string. Then you can simply cut off the rest. The most important thing is to reduce amount of memory consumption by working with the same string, using Mid$ on both sides.


    Replace and regular expressions by comparison are a very slow method.

  4. #4

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2008
    Posts
    355

    Re: Fastest way to remove strings between two substrings?

    I will try to apply this onto what I need. RegEx wouldn't do much good because the enclosing sub-strings vary,,,

    I will let you know if this does it (of course in a modified form for what I need )

  5. #5

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2008
    Posts
    355

    Re: Fastest way to remove strings between two substrings?

    The most important thing is to reduce amount of memory consumption by working with the same string, using Mid$ on both sides.
    Yes,, this is what I have problems with,, LARGE strings I have here,, lol

  6. #6

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2008
    Posts
    355

    Re: Fastest way to remove strings between two substrings?

    @ baja_yu

    I think I will get a significant increase in speed with your sample. I had to modify a little bit but it works pretty good Still a few tweaks to do and I am pretty sure this is it.

    Thank you.

  7. #7
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    Does this function perform better?
    Code:
    Public Function CleanBetween(Text As String, Before As String, After As String) As String
        Dim lngA As Long, lngCur As Long, lngLength As Long
        Dim lngPos() As Long, lngLen() As Long
        ' calculate maximum possible number of matches
        lngA = Len(Text) \ (Len(Before) + Len(After))
        ' work arrays
        ReDim lngPos(lngA)
        ReDim lngLen(lngA)
        ' find first
        lngPos(0) = 1
        lngLen(0) = InStr(Text, Before) - 1
        ' got anything to work with?
        If lngLen(0) >= 0 Then
            ' total length found
            lngLength = lngLen(0)
            ' keep on working until no match found or maximum theoretical items found
            For lngCur = 1 To lngA
                ' find next start position for a substring to preserve
                lngPos(lngCur) = InStr(lngPos(lngCur - 1) + lngLen(lngCur - 1), Text, After)
                If lngPos(lngCur) Then
                    ' fix start position to after the tag
                    lngPos(lngCur) = lngPos(lngCur) + Len(After)
                    ' then find the next start tag to ignore
                    lngLen(lngCur) = InStr(lngPos(lngCur), Text, Before)
                    ' did we get it?
                    If lngLen(lngCur) Then
                        ' get actual length
                        lngLen(lngCur) = lngLen(lngCur) - lngPos(lngCur)
                        ' increase total length
                        lngLength = lngLength + lngLen(lngCur)
                    Else
                        ' add all remaining text as no start tag found
                        lngLen(lngCur) = Len(Text) - lngPos(lngCur) + 1
                        ' increase total length
                        lngLength = lngLength + lngLen(lngCur)
                        ' nothing more to process
                        Exit For
                    End If
                Else
                    ' ignore last one (we have a start/end tag mismatch!)
                    lngCur = lngCur - 1
                    ' nothing more to process
                    Exit For
                End If
            Next lngCur
            ' create output buffer
            CleanBetween = Left$(Text, lngLength)
            ' copy position
            lngA = 1
            For lngCur = 0 To lngCur
                ' see if anything to copy and copy if there is!
                If lngLen(lngCur) Then
                    Mid$(CleanBetween, lngA, lngLen(lngCur)) = Mid$(Text, lngPos(lngCur), lngLen(lngCur))
                    ' next copy position
                    lngA = lngA + lngLen(lngCur)
                End If
            Next lngCur
        Else
            ' return everything
            CleanBetween = Text
        End If
    End Function
    Last edited by Merri; May 10th, 2010 at 05:57 PM. Reason: Fixed potential endless loop & added comments

  8. #8

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2008
    Posts
    355

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    @ Merri

    I think it does actually,, I am using a very large sample and the speed is not noticeably different but it seems smoother, so I will have to time it to see exactly,, but the function itself is very usable as you presented it.

    Awesome. Thanks for the help. This is a good solution for what I need.

  9. #9
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    Updated with comments & fixed a potential endless loop error

    The biggest potential performance issue is with are the constantly redimmed Long arrays, it would gain speed if not used as a function but instead would be modified into a inline code that would only ReDim those arrays if they needed to be made bigger.
    Last edited by Merri; May 10th, 2010 at 06:02 PM.

  10. #10
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    Here is a version with Static work arrays: it will consume more memory and actual processing slows down, but as the arrays are not constantly being redimmed and thus memory consumption doesn't go up'n'down all the time it should be overall faster.
    Code:
    Public Function CleanBetween(Text As String, Before As String, After As String) As String
        Dim lngA As Long, lngCur As Long, lngLength As Long
        Static lngPos() As Long, lngLen() As Long
        ' calculate maximum possible number of matches
        lngA = Len(Text) \ (Len(Before) + Len(After))
        ' work arrays
        If Not Not lngPos Then
            If lngA > UBound(lngPos) Then
                ReDim lngPos(lngA)
                ReDim lngLen(lngA)
            End If
        Else
            ReDim lngPos(lngA)
            ReDim lngLen(lngA)
        End If
        ' VB6 IDE error fix (with Not Array)
        Debug.Assert App.hInstance
        ' find first
        lngPos(0) = 1
        lngLen(0) = InStr(Text, Before) - 1
        ' got anything to work with?
        If lngLen(0) >= 0 Then
            ' total length found
            lngLength = lngLen(0)
            ' keep on working until no match found or maximum theoretical items found
            For lngCur = 1 To lngA
                ' find next start position for a substring to preserve
                lngPos(lngCur) = InStr(lngPos(lngCur - 1) + lngLen(lngCur - 1), Text, After)
                If lngPos(lngCur) Then
                    ' fix start position to after the tag
                    lngPos(lngCur) = lngPos(lngCur) + Len(After)
                    ' then find the next start tag to ignore
                    lngLen(lngCur) = InStr(lngPos(lngCur), Text, Before)
                    ' did we get it?
                    If lngLen(lngCur) Then
                        ' get actual length
                        lngLen(lngCur) = lngLen(lngCur) - lngPos(lngCur)
                        ' increase total length
                        lngLength = lngLength + lngLen(lngCur)
                    Else
                        ' add all remaining text as no start tag found
                        lngLen(lngCur) = Len(Text) - lngPos(lngCur) + 1
                        ' increase total length
                        lngLength = lngLength + lngLen(lngCur)
                        ' nothing more to process
                        Exit For
                    End If
                Else
                    ' ignore last one (we have a start/end tag mismatch!)
                    lngCur = lngCur - 1
                    ' nothing more to process
                    Exit For
                End If
            Next lngCur
            ' create output buffer
            CleanBetween = Left$(Text, lngLength)
            ' copy position
            lngA = 1
            For lngCur = 0 To lngCur
                ' see if anything to copy and copy if there is!
                If lngLen(lngCur) Then
                    Mid$(CleanBetween, lngA, lngLen(lngCur)) = Mid$(Text, lngPos(lngCur), lngLen(lngCur))
                    ' next copy position
                    lngA = lngA + lngLen(lngCur)
                End If
            Next lngCur
        Else
            ' return everything
            CleanBetween = Text
        End If
    End Function

  11. #11

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2008
    Posts
    355

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    I am forced to use it as a function My delimiters are variable.. and I use it in multiple locations.. This is really convenient.

    and again.. THANKS!

  12. #12
    Head Hunted anhn's Avatar
    Join Date
    Aug 2007
    Location
    Australia
    Posts
    3,669

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    I tried for 10 minutes but fail to understand Merri's method.
    This is my way. Much simpler and perhaps not bad with speed.
    Code:
    Public Function RemoveBetween(sText As String, sBefore As String, sAfter As String) As String
        Dim arTemp() As String
        Dim i As Long
        
        arTemp = Split(sText, sBefore)
        For i = 1 To UBound(arTemp)
            arTemp(i) = Mid$(arTemp(i), InStr(arTemp(i), sAfter) + Len(sAfter))
        Next
        RemoveBetween = Join(arTemp, "")
    End Function
    • Don't forget to use [CODE]your code here[/CODE] when posting code
    • If your question was answered please use Thread Tools to mark your thread [RESOLVED]
    • Don't forget to RATE helpful posts

    • Baby Steps a guided tour
    • IsDigits() and IsNumber() functions • Wichmann-Hill Random() function • >> and << functions for VB • CopyFileByChunk

  13. #13
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    anhn: it is slower than baja_yu's Replace method, which is slower than CleanBetween. This is on the easy side to see, because the sole purpose of CleanBetween's logic is to avoid creating new strings (and thus memory allocation/deallocation). What Split does is to create an array of many strings, and then almost each of these strings is modified again creating more new strings (and also removing old ones). This is why it is so bad for performance. Replace does quite well for such a short code, so it would be the #1 choice in most cases.

    CleanBetween has two arrays, which contain the start position to copy from and the length to copy. These are the substrings that are to be preserved. Handling these arrays is much faster than multiple strings. The end of the procedure is the important part, where CleanBetween is first filled with the final size using Left$ and then Mid$ is used to construct the final string from the substrings that are wanted.

    However, the speed difference only becomes truly noticeable when string size grows to very big. If you had a lot of very short strings then baja_yu's suggestion would be better in code vs. results ratio.
    Attached Images Attached Images  
    Attached Files Attached Files
    Last edited by Merri; May 11th, 2010 at 04:23 AM.

  14. #14
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    Okay, then I noticed one thing that I ignored earlier: baya_yu's Replace method is only very effective if there is a lot of same matches, which is something that happens with a replicating method that I used to build the test string. If I change test string builder code to this:
    Code:
    Private Sub Form_Load()
        Dim strOriginal As String
        Dim lngA As Long
        For lngA = 1 To 10000
            TEST = TEST & ((ChrW$(lngA) & "<-- " & ChrW$(lngA) & " --> ") & String$(lngA \ 100 + 10, ChrW$(lngA)))
        Next lngA
        Me.Caption = "Test string length: " & Format$(Len(TEST) / 1024 / 1024, "0.00") & " MB"
    End Sub
    The test string ends up being "only" 0.67 MB, but the Replace method runs over a minute. CleanBetween does the same in 7 ms and RemoveBetween 14 ms. Thus this makes RemoveBetween the better short code alternative. I knew there was something I was missing with Replace but just didn't get into my head in time (I put the blame on my gf... and she'll just end up on laughing and tell I'm unfair, which I am.)

  15. #15
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    This doesn't use arrays at all and is a much shorter code, and faster!
    Code:
    Public Function CleanBetweenShort2(ByVal Text As String, Before As String, After As String) As String
        Dim lngTo As Long, lngFrom As Long, lngLenBefore As Long, lngLenAfter As Long
        Dim lngCur As Long, lngLen As Long
        lngLenBefore = Len(Before)
        lngLenAfter = Len(After)
        lngTo = InStr(Text, Before)
        lngCur = lngTo
        If lngTo Then
            Do
                lngFrom = InStr(lngTo + lngLenBefore, Text, After)
                If lngFrom Then
                    lngTo = InStr(lngFrom + lngLenAfter, Text, Before)
                    If lngTo Then
                        lngLen = lngTo - lngFrom - lngLenAfter
                    Else
                        lngLen = Len(Text) - lngFrom - lngLenAfter + 1
                    End If
                    Mid$(Text, lngCur, lngLen) = Mid$(Text, lngFrom + lngLenAfter, lngLen)
                    lngCur = lngCur + lngLen
                Else
                    lngTo = 0
                End If
            Loop While lngTo
            CleanBetweenShort2 = Left$(Text, lngCur - 1)
        End If
    End Function
    CleanBetween = original try
    CleanBetween2 = use only one array
    CleanBetween3 = use a Collection instead of array (not a very bright idea as you can see)
    CleanBetweenShort = try to simplify the idea
    CleanBetweenShort2 = Get It Right

    And that "Same!" is just a confirmation that CleanBetweenShort2 gives the same result as RemoveBetween.
    Attached Images Attached Images  
    Attached Files Attached Files
    Last edited by Merri; May 12th, 2010 at 05:54 AM. Reason: NOTE: post code edited with one important fix

  16. #16
    PowerPoster
    Join Date
    Jun 2001
    Location
    Trafalgar, IN
    Posts
    4,141

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    I know that this has been marked as Resolved but regualr expression is still a great option here.
    Code:
    Private Sub Command1_Click()
    Dim strInput As String
    Dim strRet As String
    Dim regEx As RegExp
    
        strInput = "I am looking for a <-- fastest possible --> way to remove <--a set of--> strings <--(they are not the same)--> placed between two <--characters or--> sub-strings."
        
        Set regEx = CreateObject("VBScript.RegExp")
        
        ' Set the pattern to remove everything between <-- and -->
        regEx.Pattern = "<--(.*?)-->"
        regEx.Global = True
        
        ' Remove everything between <-- and -->
        strRet = regEx.Replace(strInput, "")
        
        ' Set the pattern to replace 2 spaces with 1
        regEx.Pattern = "  "
        regEx.Global = True
        
        ' Remove double spaces
        strRet = regEx.Replace(strRet, " ")
        Debug.Print strRet
        
        ' Cleanup
        Set regEx = Nothing
    End Sub

  17. #17
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    Using the following:
    Code:
        Dim RE As Object
        Set RE = CreateObject("VBScript.RegExp")
        RE.Pattern = "<--(.*?)-->"
        RE.Global = True
        Timing = 0
        strResult = RE.Replace(TEST, vbNullString)
        List1.AddItem Format$(Timing * 1000, "0.00000") & " RegExp"
    Hitting @ ~5 ms using the attached project, thus faster than RemoveBetween. Probably the best short code solution, although for some reason it gives me a failing match against valid output data. I haven't checked character by character where the difference is, if it is related to characters 1 - 31 then there is no problem, but if it because of Unicode characters then there is a bigger issue for some cases.

  18. #18
    Head Hunted anhn's Avatar
    Join Date
    Aug 2007
    Location
    Australia
    Posts
    3,669

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    Merri, surprisingly this is very similar logic but shorter.
    Code:
    Public Function DeleteBetween(ByVal sText As String, sBefore As String, sAfter As String) As String
        Dim p1 As Long, p2 As Long, p As Long, n As Long
        Dim L1 As Long: L1 = Len(sBefore)
        Dim L2 As Long: L2 = Len(sAfter)
        
        If L1 = 0 Or L2 = 0 Then Exit Function
        p2 = 1 - L2
        Do
            p2 = p2 + L2
            p1 = InStr(p2, sText, sBefore)
            If p1 = 0 Then p1 = Len(sText) + 1
            p = p1 - p2
            Mid$(sText, n + 1, p) = Mid$(sText, p2, p)
            n = n + p
            p2 = InStr(p1 + L1, sText, sAfter)
        Loop While p2
        DeleteBetween = Left$(sText, n)
    End Function
    • Don't forget to use [CODE]your code here[/CODE] when posting code
    • If your question was answered please use Thread Tools to mark your thread [RESOLVED]
    • Don't forget to RATE helpful posts

    • Baby Steps a guided tour
    • IsDigits() and IsNumber() functions • Wichmann-Hill Random() function • >> and << functions for VB • CopyFileByChunk

  19. #19
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    It is very much the same logic, Mid$ = Mid$ is the same, which is what brings the speed, other than that it mostly moves around some of the numeric variables to minify the number of code lines Thus it runs the same speed as my last submission. The only real complain would be that it does not check whether sText has anything or not, so empty or null string crashes it. Easy to fix, my only change suggestions, of which the two latter are a matter of taste:

    If LenB(sText) = 0 Or L1 = 0 Or L2 = 0 Then Exit Function

    If p Then Mid$(sText, n + 1, p) = Mid$(sText, p2, p)

    If n Then DeleteBetween = Left$(sText, n)

  20. #20
    Head Hunted anhn's Avatar
    Join Date
    Aug 2007
    Location
    Australia
    Posts
    3,669

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    If change all Instr, Mid, Left, Len to InstrB, MidB, LeftB, LenB that can speed up about 10&#37;.
    If test on LenB(sText) for quick exit then other tests on p and n are not required.
    • Don't forget to use [CODE]your code here[/CODE] when posting code
    • If your question was answered please use Thread Tools to mark your thread [RESOLVED]
    • Don't forget to RATE helpful posts

    • Baby Steps a guided tour
    • IsDigits() and IsNumber() functions • Wichmann-Hill Random() function • >> and << functions for VB • CopyFileByChunk

  21. #21
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: [RESOLVED] Fastest way to remove strings between two substrings?

    The 10% increase comes purely from InStr vs. InStrB, LeftB$ and MidB$ are equal in speed to Left$ and Mid$

    The n test is only for returning a NULL string instead of EMPTY string when there is nothing to return. A very minor detail, personally prefer returning vbNullString over "".

    Edit!
    Checking for Mid$ copy length does matter. If there are empty items then speed increases. Here is an updated version of DeleteBetween.
    Code:
    Public Function DeleteBetween(ByVal sText As String, sBefore As String, sAfter As String) As String
        Dim P1 As Long, P2 As Long, L As Long, T As Long
        Dim L1 As Long: L1 = LenB(sBefore)
        Dim L2 As Long: L2 = LenB(sAfter)
        
        If LenB(sText) = 0 Or L1 = 0 Or L2 = 0 Then Exit Function
        P2 = 1 - L2
        Do
            P2 = P2 + L2: P1 = P2 - 1
            Do: P1 = InStrB(P1 + 1, sText, sBefore)
            Loop While (P1 And 1) = 0 And (P1 > 0)
            If P1 = 0 Then P1 = LenB(sText) + 1
            L = P1 - P2
            If L Then MidB$(sText, T + 1, L) = MidB$(sText, P2, L)
            T = T + L
            P2 = P1 + L1 - 1
            Do: P2 = InStrB(P2 + 1, sText, sAfter)
            Loop While (P2 And 1) = 0 And (P2 > 0)
        Loop While P2
        If T Then DeleteBetween = LeftB$(sText, T)
    End Function
    Updated attachment code includes the earlier test with some empty extra stuff. If you remove If L Then you'll see it slowing down –*testing the length is very quick.
    Attached Images Attached Images  
    Attached Files Attached Files
    Last edited by Merri; May 12th, 2010 at 07:40 PM.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width