-
May 1st, 2026, 11:09 AM
#1
Thread Starter
Lively Member
AscU unicode equivalent of Asc
This proposal introduces the AscU utility, a variant of the Asc function designed to retrieve the Unicode code point of any character.
While the standard AscW function is often used for this purpose, it is frequently misunderstood and misused. AscW returns a signed 16-bit Integer (ranging from -32,768 to 32,767), so any UTF-16 code unit from &H8000 upward comes back as a negative number. Some developers use it for UTF-8 conversion, where range tests on the raw result then misclassify those characters, e.g.:
Code:
Select Case AscW(Mid(Txt, i, 1)) 'AscW may return a negative value
Case Is < 128:
Case Is < 2048:
...
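To make the sign problem concrete, here is a minimal sketch (the Sub name is just for illustration). It uses U+FF21, the fullwidth "A", as a sample character above &H7FFF, and shows the raw AscW result next to the masked one:

```vb
Sub DemoAscWSign()
    Dim ch As String
    ch = ChrW$(&HFF21&)                 ' U+FF21, fullwidth "A"
    Debug.Print AscW(ch)                ' -223, so it would match "Case Is < 128"
    Debug.Print AscW(ch) And &HFFFF&    ' 65313 (= &HFF21), the unsigned code unit
End Sub
```

Masking with &HFFFF& (note the trailing &, which makes the literal a Long) is the same trick AscU uses.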
AscU:
Code:
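Function AscU(s As String, aPos) As Long
    ' Returns the Unicode code point at position aPos of s.
    ' aPos is passed ByRef and is advanced past the character that was read
    ' (by 2 for a valid surrogate pair, by 1 otherwise).
    Dim h As Long, l As Long
    h = AscW(Mid$(s, aPos, 1)) And &HFFFF&    ' mask to an unsigned 16-bit value
    aPos = aPos + 1
    ' High surrogate: only look ahead if another code unit actually follows
    If (h >= &HD800&) And (h <= &HDBFF&) And (aPos <= Len(s)) Then
        l = AscW(Mid$(s, aPos, 1)) And &HFFFF&
        If (l >= &HDC00&) And (l <= &HDFFF&) Then
            ' Valid pair: consume the low surrogate and combine both halves
            aPos = aPos + 1
            AscU = (((h And &H3FF&) * 1024) Or (l And &H3FF&)) + &H10000
            Exit Function
        End If
        ' Not a low surrogate: leave aPos on it so the caller reads it next
    End If
    AscU = h    ' BMP character or unpaired surrogate
End Function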
Example:
Code:
Private Sub test()
    Dim i As Long, s As String
    i = 1
    s = TextBox1 ' Unicode office TextBox control
    While i <= Len(s)
        ' AscU advances i past the character it read, so the loop terminates
        Debug.Print Hex(AscU(s, i))
    Wend
End Sub
Hope this works.
-
May 4th, 2026, 08:31 AM
#2
Re: AscU unicode equivalent of Asc
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
-
May 5th, 2026, 03:53 AM
#3
Thread Starter
Lively Member
Re: AscU unicode equivalent of Asc
Thanks for your contribution.
I think your code has some major limitations, at least for AscWEx:
It doesn't handle individual characters in the middle of a string, for example.
Valid code points range from 0 to 0x10FFFF, but AscWEx returns the concatenated surrogate pair instead.
Ex: code point 0x20024
AscWEx returns: D840DC24
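For reference, here is the arithmetic that takes the pair D840/DC24 back to the code point, as a small sketch (the Sub and variable names are only illustrative):

```vb
Sub DemoPairToCodePoint()
    Dim h As Long, l As Long, cp As Long
    h = &HD840&    ' high surrogate: its low 10 bits are &H040
    l = &HDC24&    ' low surrogate:  its low 10 bits are &H024
    ' &H040 * &H400 = &H10000; Or &H024 gives &H10024; adding &H10000 gives &H20024
    cp = (((h And &H3FF&) * &H400&) Or (l And &H3FF&)) + &H10000
    Debug.Print Hex$(cp)    ' 20024
End Sub
```

So the concatenated value and the code point carry the same information, but only the second is what the Unicode charts list.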
Last edited by anycoder; May 5th, 2026 at 04:00 AM.
-
May 7th, 2026, 07:09 AM
#4
Re: AscU unicode equivalent of Asc
anycoder, when you're dealing with a surrogate pair (four bytes), all four bytes together represent a single character. That's the way surrogate pairs work.
All my functions are doing is expanding the handling of strings from the UCS-2 character set, which VB6 was designed around, to the entire UTF-16 character set, which includes the surrogate pairs.
VB6, with its intrinsic functions, assumes there's no such thing as surrogate pairs, which, when dealing with the complete UTF-16 character set, isn't correct. My functions simply correct this oversight.
Now, if you wish to bring UTF-8 into the discussion (which you may be trying to do), that's a completely different discussion. None of the VB6 intrinsic string functions ever make a UTF-8 assumption. To treat a VB6 BSTR string as UTF-8 would take a completely different set of functions. And, truth be told, it isn't really worth it. It'd be far easier to convert the UTF-8 to UTF-16, get your work done, and then possibly convert back to UTF-8, if that's what you need.
Last edited by Elroy; May 7th, 2026 at 07:15 AM.
-
May 15th, 2026, 02:04 AM
#5
Thread Starter
Lively Member
Re: AscU unicode equivalent of Asc
Originally Posted by Elroy
VB6, with its intrinsic functions, assumes there's no such thing as surrogate pairs, which, when dealing with the complete UTF-16 character set, isn't correct. My functions simply correct this oversight.
Nowadays, surrogate pairs are appearing more frequently in text, and they aren't just limited to emojis.
Since the beginning, scripts that use them have caused issues not only for storage, but also due to linguistic characteristics that make standard string functions like Mid or InStr unusable.
-
May 19th, 2026, 02:50 PM
#6
Re: AscU unicode equivalent of Asc
 Originally Posted by anycoder
Nowadays, surrogate pairs are appearing more frequently in text, and they aren't just limited to emojis.
Since the beginning, scripts that use them have caused issues not only for storage, but also due to linguistic characteristics that make standard string functions like Mid or InStr unusable.
I agree. Also, regarding your statement about wanting a "code point": I assume you're talking about a Unicode code point (with no specific Unicode 'flavor' specified). It might be useful to write some functions with names like:
CodePointFromUTF16(UTF16 As Long) As Long
UTF16FromCodePoint(CodePoint As Long) As Long
For anyone actually wanting the Unicode Code Point values, they could use those functions. They'd be quite easy to write, as all the non-surrogate-pairs are already the code point values. And the math for the surrogate-pairs is fairly straightforward.
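A rough sketch of what those two functions might look like follows. It rests on an assumption that isn't spelled out above: that the "UTF16" Long holds a surrogate pair packed as high surrogate in the upper 16 bits and low surrogate in the lower 16 bits (the D840DC24 form mentioned earlier), or a single BMP code unit. ByVal is added to the signatures for the sketch. Packed pairs are negative as a signed Long, which is why the code masks before dividing and avoids a plain multiply:

```vb
' Sketch only: assumes pairs are packed as &HhhhhLLLL in one Long.
Public Function CodePointFromUTF16(ByVal UTF16 As Long) As Long
    Dim h As Long, l As Long
    ' Clear the low word first so the (negative) value divides exactly
    h = ((UTF16 And &HFFFF0000) \ &H10000) And &HFFFF&
    l = UTF16 And &HFFFF&
    If h >= &HD800& And h <= &HDBFF& And l >= &HDC00& And l <= &HDFFF& Then
        CodePointFromUTF16 = (((h And &H3FF&) * &H400&) Or (l And &H3FF&)) + &H10000
    Else
        CodePointFromUTF16 = UTF16    ' not a pair: already the code point
    End If
End Function

Public Function UTF16FromCodePoint(ByVal CodePoint As Long) As Long
    Dim v As Long, h As Long, l As Long
    If CodePoint < &H10000 Or CodePoint > &H10FFFF Then
        UTF16FromCodePoint = CodePoint    ' BMP (or out of range): returned unchanged
        Exit Function
    End If
    v = CodePoint - &H10000
    h = &HD800& + (v \ &H400&)        ' top 10 bits go into the high surrogate
    l = &HDC00& + (v And &H3FF&)      ' bottom 10 bits go into the low surrogate
    ' h * &H10000 would overflow a signed Long, so shift h - &H10000 instead;
    ' the resulting bit pattern is the same
    UTF16FromCodePoint = ((h - &H10000) * &H10000) Or l
End Function
```

With those in place, Debug.Print Hex$(CodePointFromUTF16(&HD840DC24)) prints 20024, and Debug.Print Hex$(UTF16FromCodePoint(&H20024)) prints D840DC24.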