-
May 19th, 2023, 02:54 PM
#1
String functions for VB6 to handle BSTR strings with surrogate-pairs
Ok, I just knocked this out today. I've now tested fairly well, including edge conditions, so I'm done unless someone finds a bug.
Others are also certainly willing to test and report any problems.
For changes/updates, see "Version" comments in the code.
It'll be curious if anyone actually uses this thing. But it sure creates some discussion when this UCS-2 vs UTF-16 issue comes up.
For a BAS module:
Code:
Option Explicit
'
' Version 1.00 ' Posted on VBForums CodeBank.
' Version 1.01 ' Fixed LeftEx edge condition.
' Version 1.02 ' Fixed math problem in MidEx.
' Version 1.03 ' Fixed iStart default in InstrEx.
' Version 1.04 ' Fixed logic problem in InstrEx and InstrRevEx.
'
' Some explanation:
'
' The UTF-16 character set encoding is made up of the following:
' The UCS-2 characters, which are always 2-bytes.
' Surrogate-pair characters, which are always 4-bytes.
'
' If a character is a surrogate-pair:
' The low-order-word is always in the range of &HDC00 to &HDFFF.
' The high-order-word is always in the range of &HD800 to &HDBFF.
'
' To avoid any possible confusion, if a character is not a surrogate-pair,
' it can't be anywhere in the range between &HD800 and &HDFFF.
' That's part of the UTF-16 specifications.
'
' The built-in VB6 functions always assume the characters are UCS-2
' characters, i.e., 2-bytes long. Therefore, we need a special set
' of functions to deal with strings that may contain surrogate-pairs.
'
' Just as a note, the above does provide an opportunity for "garbage"
' to be in a string. For instance, if a word in a string is in the
' range of >=&HD800 And <=&HDBFF, but the next word isn't in the
' range of >=&HDC00 And <=&HDFFF, this would be garbage. In the same
' vane, if a word is in the range of >=&HDC00 And <=&HDFFF, but the
' prior word isn't in the range of >=&HD800 And <=&HDBFF, this would
' be garbage as well. And the following don't check for such garbage,
' and may return inaccurate results if a string has such garbage.
'
' Functions reworked:
' AscWEx
' ChrWEx
' InStrEx
' InStrRevEx
' LeftEx
' LenEx
' MidEx
' RightEx
'
' Split and Join should work just fine as they are.
'
' Extra "helper" functions (that can be used by anyone):
' IsUcs2Char
' IsLowSurrogate
' IsHighSurrogate
' IsSurrogatePair
' HasSurrogatePair
' SurrogatePairCount
'
Private Declare Function GetMem4 Lib "msvbvm60" (ByRef Source As Any, ByRef Dest As Any) As Long ' Always ignore the returned value, it's useless.
Private Declare Function GetMem2 Lib "msvbvm60" (ByRef Source As Any, ByRef Dest As Any) As Long ' Always ignore the returned value, it's useless.
'
Public Function AscWEx(sChar As String) As Long
' Returns double-word, so surrogate-pairs can be represented.
' Examines only the first character of the string.
'
If IsSurrogatePair(sChar) Then ' Words are swapped to accomodate Little Endian (LE).
GetMem2 ByVal StrPtr(sChar) + 2&, AscWEx
GetMem2 ByVal StrPtr(sChar), ByVal VarPtr(AscWEx) + 2&
Else
GetMem2 ByVal StrPtr(sChar), AscWEx
End If
End Function
Public Function ChrWEx(ByVal iChar As Long) As String
' Ok, let's check for a bit of garbage.
If iChar >= &HDC000000 And iChar <= &HDFFFFFFF Then Err.Raise 5&
' Now, let's just decide if we're dealing with a surrogate-pair or not.
If iChar >= &HD8000000 And iChar <= &HDBFFFFFF Then
ChrWEx = " "
GetMem2 iChar, ByVal StrPtr(ChrWEx) + 2& ' Still must deal with LE going back in.
GetMem2 ByVal VarPtr(iChar) + 2&, ByVal StrPtr(ChrWEx)
Else
' If it's not a surrogate pair, we're going to ignore the high word.
ChrWEx = ChrW$(CInt(iChar And &HFFFF))
End If
End Function
Public Function InStrEx(sHay As String, sNeedle As String, Optional ByVal iStart As Long = 1&) As Long
' The optional iStart is at the end, as VB6 doesn't provide the kind of overloading the InStr() function does.
' All the InStrEx searches as done as vbBinaryCompare (the Instr() default) as that's the only want that really makes sense when searching for surrogate-pairs.
' iStart counts surrogate-pair characters only once, to respect them as characters.
'
' The only real issue with this one (as opposed to Instr) is correctly handling iStart.
'
If iStart < 1& Then Err.Raise 5& ' Same way Instr() handles it.
Dim iPreCnt As Long
iPreCnt = SurrogatePairCount(LeftEx(sHay, iStart - 1&))
iStart = iStart + iPreCnt
InStrEx = InStr(iStart, sHay, sNeedle, vbBinaryCompare) - iPreCnt
End Function
Public Function InStrRevEx(sHay As String, sNeedle As String, Optional ByVal iStart As Long = -1&) As Long
' All the InStrEx searches as done as vbBinaryCompare (the InstrRev() default) as that's the only want that really makes sense when searching for surrogate-pairs.
' iStart counts surrogate-pair characters only once, to respect them as characters.
'
Dim iPairCount As Long
iPairCount = SurrogatePairCount(sHay) ' We'll need this a couple of times.
Dim iLenEx As Long
iLenEx = Len(sHay) - iPairCount ' We'll need this a couple of times.
'
If iStart = -1& Then iStart = iLenEx
If iStart < 1& Then Err.Raise 5& ' Same way InStrRev() handles it.
If iLenEx < iStart Then Exit Function ' Same way InStrRev() does it, even though it doesn't really make sense.
'
iStart = iStart + SurrogatePairCount(LeftEx(sHay, iStart)) ' Make it a no-surrogate-pair version, so we can use it in InStrRev().
InStrRevEx = InStrRev(sHay, sNeedle, iStart) ' But we haven't correctly handled surrogate-pairs, yet.
If InStrRevEx > 1& Then ' If it's one, we're good to go, either way.
InStrRevEx = InStrRevEx - SurrogatePairCount(Left$(sHay, InStrRevEx - 1&))
End If ' Above, adjust for surrogate pairs prior to our "find".
End Function
Public Function LeftEx(sStr As String, ByVal iLength As Long) As String
' We assume that iLength is characters, including surrogate-pairs (counted once each).
'
If iLength = 0& Then Exit Function ' Easy.
If iLength < 0& Then Err.Raise 5& ' Same way Left$() handles it.
'
LeftEx = Left$(sStr, iLength + SurrogatePairCount(sStr)) ' Start by assuming they're all in the piece we want.
Do ' Loop until we've trimmed to correct length.
If LenEx(LeftEx) <= iLength Then Exit Function ' Return when we've got the correct length (or asked for more than there are).
' ' This test works even if there aren't surrogate-pairs in the tested piece.
If IsLowSurrogate(Right$(LeftEx, 1&)) Then ' Is the right-most word a high of a surrogate-pair?
LeftEx = Left$(LeftEx, Len(LeftEx) - 2&) ' Trim surrogate-pair.
Else
LeftEx = Left$(LeftEx, Len(LeftEx) - 1&) ' Trim UCS-2 character.
End If
Loop
End Function
Public Function LenEx(sStr As String) As Long
LenEx = Len(sStr) - SurrogatePairCount(sStr)
End Function
Public Function MidEx(sStr As String, ByVal iStart As Long, Optional ByVal iLength As Long) As String
' We assume that iStart and iLength is characters, including surrogate-pairs (counted once each).
'
If iStart < 1& Then Err.Raise 5& ' Same way Mid$() handles it.
iStart = iStart - 1& ' Make iStart 0 based, it's just easier.
If iLength < 0& Then Err.Raise 5& ' Same way Mid$() handles it.
If iLength = 0& Then iLength = &H7FFFFFFF ' Just makes it easy. We want all that's remaining.
Dim iLenEx As Long
iLenEx = LenEx(sStr)
If iLength > iLenEx - iStart Then ' Adjust length to be exactly what we want.
iLength = iLenEx - iStart
End If
If iLength <= 0& Then Exit Function ' Return empty string, same way Mid$() does it.
MidEx = LeftEx(RightEx(sStr, iLenEx - iStart), iLength)
End Function
Public Function RightEx(sStr As String, ByVal iLength As Long) As String
' We assume that iLength is characters, including surrogate-pairs (counted once each).
'
If iLength = 0& Then Exit Function ' Easy.
If iLength < 0& Then Err.Raise 5& ' Same way Right$() handles it.
'
RightEx = Right$(sStr, iLength + SurrogatePairCount(sStr)) ' Start by assuming they're all in the piece we want.
Do ' Loop until we've trimmed to correct length.
If LenEx(RightEx) <= iLength Then Exit Function ' Return when we've got the correct length (or asked for more than there are).
' ' This test works even if there aren't surrogate-pairs in the tested piece.
If IsSurrogatePair(RightEx) Then ' Is the left-most character a surrogate-pair?
RightEx = Right$(RightEx, Len(RightEx) - 2&) ' Trim surrogate-pair.
Else
RightEx = Right$(RightEx, Len(RightEx) - 1&) ' Trim UCS-2 character.
End If
Loop
End Function
Public Function IsUcs2Char(sChar As String) As Boolean
' Only tests the first character of sChar.
' Just say "Not IsUcs2Char" to see if a word of a surrogate-pair.
If Len(sChar) = 0& Then Exit Function
Dim i As Integer: i = AscW(sChar)
IsUcs2Char = i < &HD800 Or i > &HDFFF
End Function
Public Function IsLowSurrogate(sChar As String) As Boolean
' Only tests the first character of sChar.
If Len(sChar) = 0& Then Exit Function
Dim i As Integer: i = AscW(sChar)
IsLowSurrogate = i >= &HDC00 And i <= &HDFFF
End Function
Public Function IsHighSurrogate(sChar As String) As Boolean
' Only tests the first character of sChar.
If Len(sChar) = 0& Then Exit Function
Dim i As Integer: i = AscW(sChar)
IsHighSurrogate = i >= &HD800 And i <= &HDBFF
End Function
Public Function IsSurrogatePair(sChar As String) As Boolean
' Looks precisely at the first FOUR bytes in sChar.
If Len(sChar) = 0& Then Exit Function
Static i(1&) As Integer
GetMem4 ByVal StrPtr(sChar), i(0&) ' This is safe because, if we're at the last character, we'll just get the null terminator.
IsSurrogatePair = i(1&) >= &HDC00 And i(1&) <= &HDFFF And i(0&) >= &HD800 And i(0&) <= &HDBFF
End Function
Public Function HasSurrogatePair(sStr As String) As Boolean
Dim bb() As Byte: bb = sStr
Dim i As Long
For i = 1& To UBound(bb) Step 2& ' We're looking only at the high-bytes.
If bb(i) >= &HD8 And bb(i) <= &HDB Then
If i + 2& <= UBound(bb) Then
HasSurrogatePair = bb(i + 2&) >= &HDC And bb(i + 2&) <= &HDF
End If
Exit Function ' If we return False here, the string has garbage.
End If
Next
End Function
Public Function SurrogatePairCount(sStr As String) As Long
Dim bb() As Byte: bb = sStr
Dim i As Long
For i = 1& To UBound(bb) Step 2& ' We're looking only at the high-bytes.
If bb(i) >= &HD8 And bb(i) <= &HDB Then
If i + 2& <= UBound(bb) Then
If bb(i + 2&) >= &HDC And bb(i + 2&) <= &HDF Then
SurrogatePairCount = SurrogatePairCount + 1&
i = i + 2&
End If
End If
End If
Next
End Function
There's some testing code in the next post.
Enjoy,
Elroy
Last edited by Elroy; May 19th, 2023 at 06:37 PM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
-
May 19th, 2023, 02:55 PM
#2
Re: String functions for VB6 to handle BSTR strings with surrogate-pairs
Here's a bit of Form1 test code for the above BAS module.
I might do more with this in the coming days. But I've tested quite a bit beyond the following.
Code:
Option Explicit
Private Sub Form_Load()
Dim sMail As String
Dim sTest As String
Dim sTst2 As String
sMail = ChrW$(&HD83D) & ChrW$(&HDEEB) ' Open-mailbox surrogate-pair.
sTest = "asdf" & sMail & "qwer"
sTst2 = sMail & sTest & sTest & sTest & sMail ' Five surrogate-pairs.
Debug.Print IsSurrogatePair(sMail)
Debug.Print HasSurrogatePair(sTest)
Debug.Print SurrogatePairCount(sTst2)
Debug.Print SurrogatePairCount("asdf")
Dim i As Long
i = AscWEx(sMail)
Debug.Print Hex$(i)
Dim sTst3 As String
sTst3 = ChrWEx(i)
Debug.Print sTst3 = sMail
End Sub
Last edited by Elroy; May 19th, 2023 at 06:08 PM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|