Results 1 to 2 of 2

Thread: String functions for VB6 to handle BSTR strings with surrogate-pairs

  1. #1

    Thread Starter
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    9,853

    String functions for VB6 to handle BSTR strings with surrogate-pairs

    Ok, I just knocked this out today. I've now tested fairly well, including edge conditions, so I'm done unless someone finds a bug.

    Others are also certainly willing to test and report any problems.

    For changes/updates, see "Version" comments in the code.

    It'll be curious if anyone actually uses this thing. But it sure creates some discussion when this UCS-2 vs UTF-16 issue comes up.

    For a BAS module:
    Code:
    
    Option Explicit
    '
    '   Version 1.00    ' Posted on VBForums CodeBank.
    '   Version 1.01    ' Fixed LeftEx edge condition.
    '   Version 1.02    ' Fixed math problem in MidEx.
    '   Version 1.03    ' Fixed iStart default in InstrEx.
    '   Version 1.04    ' Fixed logic problem in InstrEx and InstrRevEx.
    '
    ' Some explanation:
    '
    '   The UTF-16 character set encoding is made up of the following:
    '       The UCS-2 characters, which are always 2-bytes.
    '       Surrogate-pair characters, which are always 4-bytes.
    '
    '   If a character is a surrogate-pair:
    '       The low-order-word  is always in the range of &HDC00 to &HDFFF.
    '       The high-order-word is always in the range of &HD800 to &HDBFF.
    '
    '   To avoid any possible confusion, if a character is not a surrogate-pair,
    '   it can't be anywhere in the range between &HD800 and &HDFFF.
    '   That's part of the UTF-16 specifications.
    '
    '   The built-in VB6 functions always assume the characters are UCS-2
    '   characters, i.e., 2-bytes long.  Therefore, we need a special set
    '   of functions to deal with strings that may contain surrogate-pairs.
    '
    '   Just as a note, the above does provide an opportunity for "garbage"
    '   to be in a string.  For instance, if a word in a string is in the
    '   range of >=&HD800 And <=&HDBFF, but the next word isn't in the
    '   range of >=&HDC00 And <=&HDFFF, this would be garbage.  In the same
    '   vane, if a word is in the range of >=&HDC00 And <=&HDFFF, but the
    '   prior word isn't in the range of >=&HD800 And <=&HDBFF, this would
    '   be garbage as well.  And the following don't check for such garbage,
    '   and may return inaccurate results if a string has such garbage.
    '
    '   Functions reworked:
    '       AscWEx
    '       ChrWEx
    '       InStrEx
    '       InStrRevEx
    '       LeftEx
    '       LenEx
    '       MidEx
    '       RightEx
    '
    '       Split and Join should work just fine as they are.
    '
    '   Extra "helper" functions (that can be used by anyone):
    '       IsUcs2Char
    '       IsLowSurrogate
    '       IsHighSurrogate
    '       IsSurrogatePair
    '       HasSurrogatePair
    '       SurrogatePairCount
    '
    Private Declare Function GetMem4 Lib "msvbvm60" (ByRef Source As Any, ByRef Dest As Any) As Long ' Always ignore the returned value, it's useless.
    Private Declare Function GetMem2 Lib "msvbvm60" (ByRef Source As Any, ByRef Dest As Any) As Long ' Always ignore the returned value, it's useless.
    '
    
    
    Public Function AscWEx(sChar As String) As Long
        ' Returns double-word, so surrogate-pairs can be represented.
        ' Examines only the first character of the string.
        '
        If IsSurrogatePair(sChar) Then  ' Words are swapped to accomodate Little Endian (LE).
            GetMem2 ByVal StrPtr(sChar) + 2&, AscWEx
            GetMem2 ByVal StrPtr(sChar), ByVal VarPtr(AscWEx) + 2&
        Else
            GetMem2 ByVal StrPtr(sChar), AscWEx
        End If
    End Function
    
    Public Function ChrWEx(ByVal iChar As Long) As String
        ' Ok, let's check for a bit of garbage.
        If iChar >= &HDC000000 And iChar <= &HDFFFFFFF Then Err.Raise 5&
        ' Now, let's just decide if we're dealing with a surrogate-pair or not.
        If iChar >= &HD8000000 And iChar <= &HDBFFFFFF Then
            ChrWEx = "  "
            GetMem2 iChar, ByVal StrPtr(ChrWEx) + 2&                ' Still must deal with LE going back in.
            GetMem2 ByVal VarPtr(iChar) + 2&, ByVal StrPtr(ChrWEx)
        Else
            ' If it's not a surrogate pair, we're going to ignore the high word.
            ChrWEx = ChrW$(CInt(iChar And &HFFFF))
        End If
    End Function
    
    Public Function InStrEx(sHay As String, sNeedle As String, Optional ByVal iStart As Long = 1&) As Long
        ' The optional iStart is at the end, as VB6 doesn't provide the kind of overloading the InStr() function does.
        ' All the InStrEx searches as done as vbBinaryCompare (the Instr() default) as that's the only want that really makes sense when searching for surrogate-pairs.
        ' iStart counts surrogate-pair characters only once, to respect them as characters.
        '
        ' The only real issue with this one (as opposed to Instr) is correctly handling iStart.
        '
        If iStart < 1& Then Err.Raise 5&        ' Same way Instr() handles it.
        Dim iPreCnt As Long
        iPreCnt = SurrogatePairCount(LeftEx(sHay, iStart - 1&))
        iStart = iStart + iPreCnt
        InStrEx = InStr(iStart, sHay, sNeedle, vbBinaryCompare) - iPreCnt
    End Function
    
    Public Function InStrRevEx(sHay As String, sNeedle As String, Optional ByVal iStart As Long = -1&) As Long
        ' All the InStrEx searches as done as vbBinaryCompare (the InstrRev() default) as that's the only want that really makes sense when searching for surrogate-pairs.
        ' iStart counts surrogate-pair characters only once, to respect them as characters.
        '
        Dim iPairCount As Long
        iPairCount = SurrogatePairCount(sHay)                       ' We'll need this a couple of times.
        Dim iLenEx As Long
        iLenEx = Len(sHay) - iPairCount                             ' We'll need this a couple of times.
        '
        If iStart = -1& Then iStart = iLenEx
        If iStart < 1& Then Err.Raise 5&                            ' Same way InStrRev() handles it.
        If iLenEx < iStart Then Exit Function                       ' Same way InStrRev() does it, even though it doesn't really make sense.
        '
        iStart = iStart + SurrogatePairCount(LeftEx(sHay, iStart))  ' Make it a no-surrogate-pair version, so we can use it in InStrRev().
        InStrRevEx = InStrRev(sHay, sNeedle, iStart)                ' But we haven't correctly handled surrogate-pairs, yet.
        If InStrRevEx > 1& Then                                     ' If it's one, we're good to go, either way.
            InStrRevEx = InStrRevEx - SurrogatePairCount(Left$(sHay, InStrRevEx - 1&))
        End If                                                      ' Above, adjust for surrogate pairs prior to our "find".
    End Function
    
    Public Function LeftEx(sStr As String, ByVal iLength As Long) As String
        ' We assume that iLength is characters, including surrogate-pairs (counted once each).
        '
        If iLength = 0& Then Exit Function                          ' Easy.
        If iLength < 0& Then Err.Raise 5&                           ' Same way Left$() handles it.
        '
        LeftEx = Left$(sStr, iLength + SurrogatePairCount(sStr))    ' Start by assuming they're all in the piece we want.
        Do                                                          ' Loop until we've trimmed to correct length.
            If LenEx(LeftEx) <= iLength Then Exit Function          ' Return when we've got the correct length (or asked for more than there are).
            '                                                       ' This test works even if there aren't surrogate-pairs in the tested piece.
            If IsLowSurrogate(Right$(LeftEx, 1&)) Then              ' Is the right-most word a high of a surrogate-pair?
                LeftEx = Left$(LeftEx, Len(LeftEx) - 2&)            ' Trim surrogate-pair.
            Else
                LeftEx = Left$(LeftEx, Len(LeftEx) - 1&)            ' Trim UCS-2 character.
            End If
        Loop
    End Function
    
    Public Function LenEx(sStr As String) As Long
        LenEx = Len(sStr) - SurrogatePairCount(sStr)
    End Function
    
    Public Function MidEx(sStr As String, ByVal iStart As Long, Optional ByVal iLength As Long) As String
        ' We assume that iStart and iLength is characters, including surrogate-pairs (counted once each).
        '
        If iStart < 1& Then Err.Raise 5&                            ' Same way Mid$() handles it.
        iStart = iStart - 1&                                        ' Make iStart 0 based, it's just easier.
        If iLength < 0& Then Err.Raise 5&                           ' Same way Mid$() handles it.
        If iLength = 0& Then iLength = &H7FFFFFFF                   ' Just makes it easy.  We want all that's remaining.
        Dim iLenEx As Long
        iLenEx = LenEx(sStr)
        If iLength > iLenEx - iStart Then                           ' Adjust length to be exactly what we want.
            iLength = iLenEx - iStart
        End If
        If iLength <= 0& Then Exit Function                         ' Return empty string, same way Mid$() does it.
        MidEx = LeftEx(RightEx(sStr, iLenEx - iStart), iLength)
    End Function
    
    Public Function RightEx(sStr As String, ByVal iLength As Long) As String
        ' We assume that iLength is characters, including surrogate-pairs (counted once each).
        '
        If iLength = 0& Then Exit Function                          ' Easy.
        If iLength < 0& Then Err.Raise 5&                           ' Same way Right$() handles it.
        '
        RightEx = Right$(sStr, iLength + SurrogatePairCount(sStr))  ' Start by assuming they're all in the piece we want.
        Do                                                          ' Loop until we've trimmed to correct length.
            If LenEx(RightEx) <= iLength Then Exit Function         ' Return when we've got the correct length (or asked for more than there are).
            '                                                       ' This test works even if there aren't surrogate-pairs in the tested piece.
            If IsSurrogatePair(RightEx) Then                        ' Is the left-most character a surrogate-pair?
                RightEx = Right$(RightEx, Len(RightEx) - 2&)        ' Trim surrogate-pair.
            Else
                RightEx = Right$(RightEx, Len(RightEx) - 1&)        ' Trim UCS-2 character.
            End If
        Loop
    End Function
    
    
    
    Public Function IsUcs2Char(sChar As String) As Boolean
        ' Only tests the first character of sChar.
        ' Just say "Not IsUcs2Char" to see if a word of a surrogate-pair.
        If Len(sChar) = 0& Then Exit Function
        Dim i As Integer:   i = AscW(sChar)
        IsUcs2Char = i < &HD800 Or i > &HDFFF
    End Function
    
    Public Function IsLowSurrogate(sChar As String) As Boolean
        ' Only tests the first character of sChar.
        If Len(sChar) = 0& Then Exit Function
        Dim i As Integer:   i = AscW(sChar)
        IsLowSurrogate = i >= &HDC00 And i <= &HDFFF
    End Function
    
    Public Function IsHighSurrogate(sChar As String) As Boolean
        ' Only tests the first character of sChar.
        If Len(sChar) = 0& Then Exit Function
        Dim i As Integer:   i = AscW(sChar)
        IsHighSurrogate = i >= &HD800 And i <= &HDBFF
    End Function
    
    Public Function IsSurrogatePair(sChar As String) As Boolean
        ' Looks precisely at the first FOUR bytes in sChar.
        If Len(sChar) = 0& Then Exit Function
        Static i(1&) As Integer
        GetMem4 ByVal StrPtr(sChar), i(0&)   ' This is safe because, if we're at the last character, we'll just get the null terminator.
        IsSurrogatePair = i(1&) >= &HDC00 And i(1&) <= &HDFFF And i(0&) >= &HD800 And i(0&) <= &HDBFF
    End Function
    
    Public Function HasSurrogatePair(sStr As String) As Boolean
        Dim bb() As Byte:   bb = sStr
        Dim i As Long
        For i = 1& To UBound(bb) Step 2&    ' We're looking only at the high-bytes.
            If bb(i) >= &HD8 And bb(i) <= &HDB Then
                If i + 2& <= UBound(bb) Then
                    HasSurrogatePair = bb(i + 2&) >= &HDC And bb(i + 2&) <= &HDF
                End If
                Exit Function       ' If we return False here, the string has garbage.
            End If
        Next
    End Function
    
    Public Function SurrogatePairCount(sStr As String) As Long
        Dim bb() As Byte:   bb = sStr
        Dim i As Long
        For i = 1& To UBound(bb) Step 2&    ' We're looking only at the high-bytes.
            If bb(i) >= &HD8 And bb(i) <= &HDB Then
                If i + 2& <= UBound(bb) Then
                    If bb(i + 2&) >= &HDC And bb(i + 2&) <= &HDF Then
                        SurrogatePairCount = SurrogatePairCount + 1&
                        i = i + 2&
                    End If
                End If
            End If
        Next
    End Function
    
    
    There's some testing code in the next post.

    Enjoy,
    Elroy
    Last edited by Elroy; May 19th, 2023 at 06:37 PM.
    Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.

  2. #2

    Thread Starter
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    9,853

    Re: String functions for VB6 to handle BSTR strings with surrogate-pairs

    Here's a bit of Form1 test code for the above BAS module.

    I might do more with this in the coming days. But I've tested quite a bit beyond the following.

    Code:
    
    Option Explicit
    
    Private Sub Form_Load()
        Dim sMail As String
        Dim sTest As String
        Dim sTst2 As String
    
        sMail = ChrW$(&HD83D) & ChrW$(&HDEEB)           ' Open-mailbox surrogate-pair.
        sTest = "asdf" & sMail & "qwer"
        sTst2 = sMail & sTest & sTest & sTest & sMail   ' Five surrogate-pairs.
    
        Debug.Print IsSurrogatePair(sMail)
        Debug.Print HasSurrogatePair(sTest)
        Debug.Print SurrogatePairCount(sTst2)
        Debug.Print SurrogatePairCount("asdf")
    
        Dim i As Long
        i = AscWEx(sMail)
    
        Debug.Print Hex$(i)
        Dim sTst3 As String
        sTst3 = ChrWEx(i)
        Debug.Print sTst3 = sMail
    
    End Sub
    
    
    Last edited by Elroy; May 19th, 2023 at 06:08 PM.
    Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width