Results 1 to 7 of 7

Thread: Almost found another UTF-8 codec

  1. #1

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    22,501

    Almost found another UTF-8 codec

    This might have been nice... if it worked.

    Instead this classic ASP helper object is worthless to us.

    Code:
    Option Explicit
    
    'Needs a reference to "oleprn 1.0 Type Library" and three multiline TextBox controls.
    
    Private Function ToHex(ByRef Raw As String) As String
        Const HexDigits As String = "0123456789ABCDEF"
        Dim HexText As String
        Dim I As Long
        Dim RawByte As Byte
        
        HexText = Space$((LenB(Raw)) * 3)
        For I = 1 To LenB(Raw)
            RawByte = AscB(MidB$(Raw, I, 1))
            Mid$(HexText, (I - 1) * 3 + 1) = Mid$(HexDigits, RawByte \ &H10& + 1, 1)
            Mid$(HexText, (I - 1) * 3 + 2) = Mid$(HexDigits, RawByte Mod &H10& + 1, 1)
        Next
        ToHex = Left$(HexText, Len(HexText) - 1)
    End Function
    
    Private Sub Form_Load()
        Const CP_UTF8 As Long = 65001
        Dim Unicode As String
        Dim UTF8 As String
    
        Unicode = "" & vbCrLf _
                & "abcdefghijklmnopqrstuvwxyz " _
                & "0123456789 " _
                & "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        Text1.Text = Unicode
        With New OLEPRNLib.OleCvt
            UTF8 = .ToUtf8(Unicode)
            Text2.Text = ToHex(UTF8)
            Text3.Text = .ToUnicode(UTF8, CP_UTF8) 'Convert back.
        End With
    End Sub
    
    Private Sub Form_Resize()
        If WindowState <> vbMinimized Then
            With Text1
                .Move 0, 0, ScaleWidth, ScaleHeight / 3
                Text2.Move 0, .Height, ScaleWidth, .Height
                Text3.Move 0, Text2.Top + Text2.Height, ScaleWidth, ScaleHeight - 2 * .Height
            End With
        End If
    End Sub
    Hex dump of the "converted" data:

    Code:
    C2 00 B6 00 C2 00 B6 00 0D 00 0A 00 61 00 62 00 
    63 00 64 00 65 00 66 00 67 00 68 00 69 00 6A 00 
    6B 00 6C 00 6D 00 6E 00 6F 00 70 00 71 00 72 00 
    73 00 74 00 75 00 76 00 77 00 78 00 79 00 7A 00 
    20 00 30 00 31 00 32 00 33 00 34 00 35 00 36 00 
    37 00 38 00 39 00 20 00 41 00 42 00 43 00 44 00 
    45 00 46 00 47 00 48 00 49 00 4A 00 4B 00 4C 00 
    4D 00 4E 00 4F 00 50 00 51 00 52 00 53 00 54 00 
    55 00 56 00 57 00 58 00 59 00 5A 00
    This is "correct" except that it stuffs a NUL after each character. I guess the problem was that the IIS developers who made this thing didn't really understand BSTRs.

    How very odd. So close but yet so wrong.

    At least it is consistent though, converting the weird "stuffed UTF-8" back works using its ToUnicode() method.

  2. #2
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Posts
    3,094

    Re: Almost found another UTF-8 codec

    OLEPRNLib.OleCvt exposed-- full source code leaked!

    Code:
    Option Explicit
    
    '--- for WideCharToMultiByte
    Private Const CP_UTF8                       As Long = 65001
    
    Private Declare Function WideCharToMultiByte Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, lpMultiByteStr As Any, ByVal cchMultiByte As Long, ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long
    
    Public Function ToUtf8(ByVal sText As String) As String
        Dim lSize           As Long
        
        ToUtf8 = String$(4 * Len(sText), 0)
        lSize = WideCharToMultiByte(CP_UTF8, 0, StrPtr(sText), Len(sText), ByVal ToUtf8, Len(ToUtf8), 0, 0)
        ToUtf8 = Left$(ToUtf8, lSize)
    End Function
    Just kidding but this is *the* full source code of a class in an AxDLL I'm using in some VBScript quick hacks when fast UTF-8 conversion is needed. (Slow is already available using ADODB.Stream I guess.)

    cheers,
    </wqw>

  3. #3

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    22,501

    Re: Almost found another UTF-8 codec

    We all know about WideCharToMultiByte(), but many seem to struggle to use it properly.

    Your example is concise but wastes a lot of String space. Guessing "4 times" as enough buffer works, but for normal text that's almost 4 times as much as actually needed.

    The alternate is to spend more cycles, and call WideCharToMultiByte() once for the buffer needed, then call WideCharToMultiByte() a second time.

    So you pick your pain.

  4. #4
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Posts
    3,094

    Re: Almost found another UTF-8 codec

    Quote Originally Posted by dilettante View Post
    We all know about WideCharToMultiByte(), but many seem to struggle to use it properly.

    Your example is concise but wastes a lot of String space. Guessing "4 times" as enough buffer works, but for normal text that's almost 4 times as much as actually needed.

    The alternate is to spend more cycles, and call WideCharToMultiByte() once for the buffer needed, then call WideCharToMultiByte() a second time.

    So you pick your pain.
    This is what a classical space-time tradeoff looks in pracitce not quite like the ones discussed in CS academia :-))

    cheers,
    </wqw>

  5. #5

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    22,501

    Re: Almost found another UTF-8 codec

    I'm beginning to suspect that the weird OLEPRNLib.OleCvt "UTF-8" had a lot to do with VBScript.

    There you have to jump through hoops to get a Byte array (returned by some object like ADODB.Stream, etc.), and early versions of VBScript had no MidB(), LeftB(), etc.

  6. #6
    Frenzied Member
    Join Date
    Dec 2012
    Posts
    1,070

    Re: Almost found another UTF-8 codec

    Why not just ignore every second byte?
    Code:
    Private Function ToHex(ByRef Raw As String) As String
        Const HexDigits As String = "0123456789ABCDEF"
        Dim HexText As String
        Dim I As Long
        Dim RawByte As Byte
        
        HexText = Space$((Len(Raw)) * 3)
        For I = 1 To LenB(Raw) Step 2
            RawByte = AscB(MidB$(Raw, I, 1))
            Mid$(HexText, (I - 1) / 2 * 3 + 1) = Mid$(HexDigits, RawByte \ &H10& + 1, 1)
            Mid$(HexText, (I - 1) / 2 * 3 + 2) = Mid$(HexDigits, RawByte Mod &H10& + 1, 1)
        Next
        ToHex = Left$(HexText, Len(HexText))
    End Function
    Code:
    abcdefghijklmnopqrstuvwxyz 0123456789 
    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    
    C2 B6 C2 B6 0D 0A 61 62 63 64 65 66 67 68 69 6A 
    6B 6C 6D 6E 6F 70 71 72 73 74 75 76 77 78 79 7A 
    20 30 31 32 33 34 35 36 37 38 39 20 41 42 43 44 
    45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 
    55 56 57 58 59 5A 
    
    
    abcdefghijklmnopqrstuvwxyz 0123456789 
    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    J.A. Coutts

  7. #7

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    22,501

    Re: Almost found another UTF-8 codec

    BTW:

    Your ToUtf8() As String can work of course, after a fashion.

    However there is a strong convention that any arguments passed As String are assumed to be Unicode (UTF-16LE). MSXML2 objects, WinHttpRequest, and others come to mind.

    Better to hold UTF-8 text in Byte arrays as a general rule. Not required, but less error prone. Many properties and method arguments are Variant, and the logic may transcode any String passed while taking a passed Byte array literally.

    Of course if you are careful you can do as you please in your own programs.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width