Thread: VB6 - The case for UTF-8

  1. #1

    Thread Starter
    Fanatic Member
    Join Date
    Dec 2012
    Posts
    798

    VB6 - The case for UTF-8

    Some people have been critical of the fact that my clsCNG.cls does not preserve Unicode. So with this post, I have attempted to correct that situation. I am by no means any kind of expert on Unicode, and until recently I have only cursed its existence. The Unicode standards are very loose (much like SMTP), but at least there is a fair amount of information out there if you are willing to dig for it.

    clsCNG.cls is a general-purpose class designed to perform encryption services on anything that is passed to it. With one small change to the StrToByte routine, it now detects double-wide characters and passes the entire string instead of just the low-order bytes. But with that flexibility comes a new "gotcha". In the image below, you will see the Russian Unicode string does not produce the correct Hash. That is because it is a mixed string, consisting of both ASCII and Russian Unicode. This is not uncommon in HTML code, and this particular string was intercepted from http://www.humancomp.org/unichtm/unichtm.htm using a packet sniffer. There are a couple of ways around that issue. One way is to remove the NULL characters associated with the ASCII characters. The other way is to encode the string using UTF-8. This is the preferred method and is demonstrated using the "Hash UTF-8" button. I should mention at this point that I am using the TextBox provided by the Microsoft Forms 2.0 Object Library to display the Unicode characters. The regular TextBox only accepts ASCII.

    The change to the StrToByte routine allowed the implementation of 2 new routines called "ByteToStrShort" and "HexToStrShort". These routines create a string without the intermediate NULL bytes and shorten the process time.

    Using UTF-8 introduces another "gotcha". The Unicode standard, and in particular UTF-8, only works with true ASCII characters less than 128 (&H80). If there is any chance that your application could pass ANSI characters above &H7F, you should provide a detection routine to avoid passing it to "clsCNG.cls". DO NOT use "StrConv", as it will cause problems, especially if you are using a non-Latin System Locale.

    That's the easy part. Recognizing an incoming byte string as Double-wide Unicode or UTF-8 is difficult to say the least. There is no standard methodology to deal with it. HTTP and XML will announce their intention to use UTF-8. For the Russian page below, the line:
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    was provided. There is another methodology called BOM which is sometimes employed. It stands for Byte Order Mark, and is used to specify "Big Endian" or "Little Endian" order for encoded strings. Since UTF-8 uses bytes instead of words, Endian has little meaning, and it is often referred to as a "UTF-8 Signature" (EF BB BF). HTML5 requires an application to respect it, and it takes precedence over the notification. Unicode standards do not restrict or require its use, but if you are building an application where you can control both ends, it would make sense to use it. In either case, your application should be prepared to recognize and remove it before display.
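
    Recognizing and removing the signature could look something like the sketch below. StripUTF8BOM is a hypothetical helper name, not a routine in clsCNG.cls, and it assumes the caller passes a dimensioned byte array.

    Code:
    'Sketch only: returns a copy of bData without a leading UTF-8 signature (EF BB BF).
    'StripUTF8BOM is a hypothetical name; bData is assumed to be dimensioned.
    Public Function StripUTF8BOM(bData() As Byte) As Byte()
        Dim bOut() As Byte
        Dim lFirst As Long
        Dim lLen As Long
        Dim i As Long

        lFirst = LBound(bData)
        lLen = UBound(bData) - lFirst + 1
        If lLen >= 3 Then
            If bData(lFirst) = &HEF And bData(lFirst + 1) = &HBB And bData(lFirst + 2) = &HBF Then
                If lLen > 3 Then
                    'Copy everything after the three signature bytes
                    ReDim bOut(0 To lLen - 4)
                    For i = 0 To lLen - 4
                        bOut(i) = bData(lFirst + 3 + i)
                    Next i
                    StripUTF8BOM = bOut
                End If
                'If the data was nothing but a signature, the result stays unallocated
                Exit Function
            End If
        End If
        StripUTF8BOM = bData    'no signature found: return the bytes unchanged
    End Function

    Stripping at the byte level, before any conversion, keeps the signature from turning into a stray character in the displayed string.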

    Mozilla (and I assume other browsers as well) will use the information provided to determine the type of encoding used on incoming data, and if that fails or is not provided, it then uses a heuristic approach. So I set out to provide my own routine to detect UTF-8. My first reaction was to question the need to scan the data twice. If you are going to convert the string if UTF-8 is detected, why not just attempt to convert the string and respond to any errors. Unfortunately, MultiByteToWideChar does not return encoding errors; it just does the best that it can. So the scan is necessary to detect if the incoming string is indeed UTF-8. The IsUTF8 routine is my interpretation of a C++ routine that I found on the net. It has not been tested extensively, and it could probably be executed more efficiently. Determining if an incoming string is Unicode or not is a different story, and I have not found a reliable way to do that. I tested the Microsoft "IsTextUnicode" function, but as most of the literature indicated, it is virtually useless.
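
    One alternative that may be worth testing: as far as I know, MultiByteToWideChar accepts the MB_ERR_INVALID_CHARS flag together with CP_UTF8 on Windows Vista and later, and with that flag it returns 0 when it meets a malformed sequence instead of doing the best that it can. The sketch below relies on that assumption; LooksLikeUTF8 is a hypothetical name, and the array is assumed to be dimensioned and non-empty.

    Code:
    Private Const CP_UTF8 As Long = 65001
    Private Const MB_ERR_INVALID_CHARS As Long = &H8
    Private Declare Function MultiByteToWideChar Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, _
        ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

    'Sketch only: True if the API accepts bData as well-formed UTF-8.
    'Assumes Vista or later and a dimensioned, non-empty array.
    Public Function LooksLikeUTF8(bData() As Byte) As Boolean
        Dim lLen As Long
        lLen = UBound(bData) - LBound(bData) + 1
        'A size query (no output buffer) is enough; 0 means the call failed
        LooksLikeUTF8 = (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, _
            VarPtr(bData(LBound(bData))), lLen, 0, 0) > 0)
    End Function

    Note that plain ASCII also passes this test, because ASCII is valid UTF-8, so a True result only means the bytes are not malformed, not that the sender intended UTF-8.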

    I discovered another "gotcha" with MultiByteToWideChar. It will return NULL characters at the end of the string, depending on the length. That is not a problem with C++, as NULL characters signify the end of the string. But with VB, that is a problem because it identifies the string length in its definition. So the FromUTF8 routine was modified to remove any NULL characters.

    If you convert the Russian sample, you will notice that the UTF-8 string is shorter than the original (due to ASCII NULL removal), but the Chinese sample converts to a longer string. That is because the Chinese sample converts 2 byte characters to mostly 3 byte characters. Even considering the downsides, UTF-8 appears to be the most logical solution.
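
    The size difference can be checked with a few lines. This is only a sketch: UTF8ByteCount and ShowSizes are hypothetical names, and the three sample characters are my own picks rather than the strings from the screenshots.

    Code:
    Private Const CP_UTF8 As Long = 65001
    Private Declare Function WideCharToMultiByte Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, _
        ByVal cchWideChar As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long

    'Sketch only: number of bytes Text would occupy as UTF-8 (0 for an empty string)
    Public Function UTF8ByteCount(ByVal Text As String) As Long
        If Len(Text) = 0 Then Exit Function
        UTF8ByteCount = WideCharToMultiByte(CP_UTF8, 0, StrPtr(Text), Len(Text), 0, 0, 0, 0)
    End Function

    Public Sub ShowSizes()
        Debug.Print LenB("A"), UTF8ByteCount("A")                           '2 bytes in VB, 1 as UTF-8
        Debug.Print LenB(ChrW$(&H43F)), UTF8ByteCount(ChrW$(&H43F))       'Cyrillic: 2 and 2
        Debug.Print LenB(ChrW$(&H4E2D)), UTF8ByteCount(ChrW$(&H4E2D))     'Chinese: 2 and 3
    End Sub

    A mixed ASCII/Russian page shrinks because every ASCII character drops from 2 bytes to 1, while a mostly Chinese page grows because most of its characters go from 2 bytes to 3.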

    J.A. Coutts

  2. #2
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    Re: VB6 - The case for UTF-8

    You might want to try this instead to clean up your off by 1 issues in FromUTF8. If this doesn't work - you're probably feeding it NullChars on the input.

    Code:
    Public Function FromUTF8(bUTF8() As Byte) As String
        Const CP_UTF8 As Long = 65001 '&HFDE9
        Dim lRet As Long
        Dim sUTF8 As String
        lRet = MultiByteToWideChar(CP_UTF8, 0, VarPtr(bUTF8(0)), GetbSize(bUTF8), 0, 0)
        If lRet > 0 Then
            sUTF8 = String$(lRet, 0)
            lRet = MultiByteToWideChar(CP_UTF8, 0, VarPtr(bUTF8(0)), GetbSize(bUTF8), StrPtr(sUTF8), lRet)
            If lRet > 0 Then
                FromUTF8 = Left$(sUTF8, lRet)
            End If
        End If
    End Function
    Whether or not to count NullChars in buffers, on input and output counts of an API is an _extremely_ common issue, due to the fact that almost identical API Calls randomly include them in the counts, or not. The only thing that is consistent is that the Documentation specifies whether or not a nullchar is included in the buffer counts.

    Feel free to post any Strings that are troublesome or that have invalid NullChar's in them after conversion. Also remember that a BSTR always ends in a NullChar that is not included in a string's Length. Not only that - if you include a nullchar in the conversion input buffer, you're going to get one on the output too.

    One cool tip about UTF-8 is that you can treat it almost exactly like ASCII. It only gets weird when you're dealing with non-ASCII chars. This is how Linux works: it essentially treats everything as ASCII. It's only when characters are displayed that UTF-8 comes into play. ASCII newlines are UTF-8 newlines.


    Oh man... your ToUTF8 is a disaster. It returns a String instead of a Byte Array??? And what the heck is GetsSize... rounding?
    Also, instead of "Erase"ing your dynamic array, you ReDim it to 1 element?

    None of it makes sense. You've built yourself so many pitfalls, it's a miracle any of it works.
    Last edited by DEXWERX; Feb 8th, 2016 at 02:05 PM. Reason: reverted due to more testing.

  3. #3

    Thread Starter
    Fanatic Member
    Join Date
    Dec 2012
    Posts
    798

    Re: VB6 - The case for UTF-8

    Quote Originally Posted by DEXWERX View Post
    You might want to try this instead to clean up your off by 1 issues in FromUTF8. If this doesn't work - you're probably feeding it NullChars on the input.
    The problem turns out to be in the IsUTF8 routine, and not the FromUTF8 routine at all. Because Chinese works with 3 byte UTF-8 sequences, the length can be an odd number. Unicode characters in VB are always even, and odd numbered sequences could end in a NULL. UTF-8 standards do not allow NULL bytes, and the IsUTF8 routine is supposed to check for NULL bytes. But this one is at the very end and passes through the "If bUTF8(lPt) < &H80 Then" statement. The routine supposedly checks for "Overlong", but it doesn't check for "Underlong" (byte array is longer than the number of characters), and "MultiByteToWideChar" doesn't care. I will have to do some more testing.

    Your other comments are not productive and will be ignored.

    J.A. Coutts

  4. #4
    PowerPoster
    Join Date
    Jun 2013
    Posts
    4,514

    Re: VB6 - The case for UTF-8

    Quote Originally Posted by couttsj View Post
    Your other comments are not productive and will be ignored.
    But he's right... also took a look at your code - and the whole unicode-part
    is in quite a disarray - such that one doesn't know "where to start with
    more productive criticism".

    I know that is hard to swallow as it stands there - but it's not meant as
    it may sound to you - I agree with Dex, that your Crypto-Classes would
    have great value, when the Unicode-stuff wouldn't be in there...

    The best advice one can give to you at this stage is, to change all the String-Input -
    Params and Output-Results of your Class-Functions simply to ByteArrays.

    Then letting the Users of your Classes decide, how to pass and convert those
    ByteArrays on their own (at the outside) - and you're done.

    For ANSI-conversions from and to ByteArrays (on the outside), there's always
    StrConv - and for convenience you might want to offer a VBStringToUTF8 -
    and an UTF8ToVBString-Routine in your Classes...

    But those two should not be used by yourself inside the Crypto-Code (and should
    not contain any "Extra-For-Next-Loops" at all) - just applying MB2WC and WC2MB
    correctly should be enough in those two helpers, for the outside users' convenience.
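
    A minimal sketch of what those two convenience helpers might look like, with no loops and only the two API calls. The routine names are the ones suggested above; the bodies are one possible reading, not code from clsCNG.cls, and UTF8ToVBString assumes a dimensioned, zero-based array.

    Code:
    Private Const CP_UTF8 As Long = 65001
    Private Declare Function WideCharToMultiByte Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, _
        ByVal cchWideChar As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long
    Private Declare Function MultiByteToWideChar Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, _
        ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

    'Sketch only: VB string -> UTF-8 bytes (unallocated array for an empty string or on failure)
    Public Function VBStringToUTF8(ByVal Text As String) As Byte()
        Dim bOut() As Byte
        Dim lLen As Long
        If Len(Text) = 0 Then Exit Function
        lLen = WideCharToMultiByte(CP_UTF8, 0, StrPtr(Text), Len(Text), 0, 0, 0, 0)
        If lLen > 0 Then
            ReDim bOut(0 To lLen - 1)
            If WideCharToMultiByte(CP_UTF8, 0, StrPtr(Text), Len(Text), VarPtr(bOut(0)), lLen, 0, 0) > 0 Then
                VBStringToUTF8 = bOut
            End If
        End If
    End Function

    'Sketch only: UTF-8 bytes -> VB string; assumes bUTF8 is dimensioned and zero-based
    Public Function UTF8ToVBString(bUTF8() As Byte) As String
        Dim sOut As String
        Dim lBytes As Long
        Dim lChars As Long
        lBytes = UBound(bUTF8) + 1
        'First call asks for the required length in characters
        lChars = MultiByteToWideChar(CP_UTF8, 0, VarPtr(bUTF8(0)), lBytes, 0, 0)
        If lChars > 0 Then
            sOut = String$(lChars, vbNullChar)
            lChars = MultiByteToWideChar(CP_UTF8, 0, VarPtr(bUTF8(0)), lBytes, StrPtr(sOut), lChars)
            If lChars > 0 Then UTF8ToVBString = Left$(sOut, lChars)
        End If
    End Function

    Because the exact byte count is passed in and the exact character count is passed out, neither direction should add or lose a NullChar.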

    Peace... :-)

    Olaf

  5. #5
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    Re: VB6 - The case for UTF-8

    I've commented some of the really suspicious parts of the code.

    Code:
    Public Function GetsSize(bArray() As Byte) As Long
        On Error GoTo GetSizeErr
        GetsSize = ((UBound(bArray) + 1) / 2) + 0.4 'DEX: Rounding?
        Exit Function
    GetSizeErr:
        GetsSize = 0 
    End Function
    
    Public Function ToUTF8(ByVal Text As String, Optional ByVal UTF8Flg As Boolean) As String
        Const CP_UTF8 As Long = 65001 '&HFDE9
        Dim lRet As Long
        Dim bUTF8() As Byte
        Dim bTmp() As Byte
        Dim sUTF8 As String
        lRet = WideCharToMultiByte(CP_UTF8, 0, StrPtr(Text), Len(Text), 0, 0, 0, 0)
        If lRet > 0 Then
            ReDim bUTF8(lRet - 1)
            If WideCharToMultiByte(CP_UTF8, 0, StrPtr(Text), Len(Text), VarPtr(bUTF8(0)), lRet, 0, 0) = 0 Then
                ReDim bUTF8(0)
            End If
            If UTF8Flg Then    'DEX: This looks like it's adding a BOM
                bTmp = bUTF8
                ReDim bUTF8(lRet + 3)
                bUTF8(0) = b0: bUTF8(1) = b1: bUTF8(2) = b2
                CopyMemory bUTF8(3), bTmp(0), GetbSize(bTmp)
            End If
        Else
            ReDim bUTF8(0) 'DEX: This Redimensions the Array to 1 Byte? Why 1 Byte? Every Zero length String, now has 1/2 a NullChar
        End If
    'DEX: Why are we converting this back to a String? What about Odd Byte Lengths? IS that why you're rounding?
        sUTF8 = String$(GetsSize(bUTF8), Chr$(0)) 
        CopyMemory ByVal StrPtr(sUTF8), bUTF8(0), GetbSize(bUTF8) 'DEX: If the array is empty (ie: 1 Byte???) or odd length - now you're copying a garbage half char into your String
        ToUTF8 = sUTF8
    End Function

    Here's some functions you should use. Also you should Erase a dynamic array, if you want to empty it.

    Code:
    Private Declare Function ArrPtr Lib "msvbvm60" Alias "VarPtr" (Arr() As Any) As Long
    Private Declare Function GetMem4 Lib "msvbvm60" (Src As Any, Dst As Any) As Long
    Private Function DeRef(ByVal Address As Long) As Long
        GetMem4 ByVal Address, DeRef
    End Function
    Private Function Length(ArrayName() As Byte) As Long
        If (DeRef(ArrPtr(ArrayName)) = 0&) Then Exit Function
        Length = UBound(ArrayName) - LBound(ArrayName) + 1
    End Function
    Last edited by DEXWERX; Feb 9th, 2016 at 12:07 PM.

  6. #6

    Thread Starter
    Fanatic Member
    Join Date
    Dec 2012
    Posts
    798

    Re: VB6 - The case for UTF-8

    To DEXWERX & Schmidt;

    To quote Karl E. Peterson (who is no stranger to Visual Basic), "Unfortunately, Microsoft set a horrific precedent and chose to redefine the fundamental String data type, rather than provide a new one. UniMess, as it came to be known, effectively broke every line of BASIC binary file i/o code ever written." He goes on to explain that many VB programmers use arrays of strings containing binary data, and as far as I know, that is not easily implemented with byte arrays. I found his comments while searching for a way to do a search of a byte array. He demonstrated that "InStrB" can be used with byte arrays, even though the help file says that it is used with strings. Unfortunately, I was looking for a reverse search, so I abandoned the effort and went back to using strings.

    Whether you agree with it or not, the design philosophy I employed with clsCNG.cls was to make the demarcation point between string use and byte array use at the boundary of the class. For my purposes, the class itself MUST be able to handle anything that is passed to it, and that includes binary data above &H7F. I have provided the class at zero cost to you, and demonstrated that it can be used with Unicode. What you do with it from there is up to you. In all likelihood, I will never use it with Unicode again, as I don't use anything but ASCII for text. I find the string a convenient carrier for binary data. I also find it a lot easier to do data manipulation with strings than with byte arrays. Unicode just made it a little more difficult. If you want to change it and make it only work with UTF-8, then by all means go ahead. That is your option, but it doesn't give you license to criticise the approach that I took.

    J.A. Coutts

  7. #7
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,785

    Re: VB6 - The case for UTF-8

    I throw up my hands and walk away.

  8. #8
    Fanatic Member
    Join Date
    Aug 2013
    Posts
    806

    Re: VB6 - The case for UTF-8

    Quote Originally Posted by dilettante View Post
    I throw up my hands and walk away.
    I think this is the only sane response at this point. You can't help someone that doesn't want to be helped.

    Once a coder starts quoting Karl E. Peterson's comments in reference to VB4 - a version that still targeted Windows 3.1, for ****'s sake - there's not much any of us can do.

    @couttsj: I'm not sure why you think there's a conspiracy here. The dozens of people who have tried to help you understand these concepts aren't doing it for personal gain. We're doing it to make your life easier. Between us, we've written thousands of Unicode-compatible projects that use 1/10th the lines of code that your "solutions" do, and are far more intuitive, elegant, and safe for both us and our customers. We'd like to share our combined wisdom with you, and it's bizarre that you take offense at it.

    Frankly, if you don't want external input, don't share your code. Close-source it and sell it. Open-sourcing something is pointless if you don't want to incorporate feedback you receive.
    Check out PhotoDemon, a pro-grade photo editor written completely in VB6. (Full source available at GitHub.)

  9. #9

    Thread Starter
    Fanatic Member
    Join Date
    Dec 2012
    Posts
    798

    Re: VB6 - The case for UTF-8

    Quote Originally Posted by Tanner_H View Post
    Frankly, if you don't want external input, don't share your code. Close-source it and sell it. Open-sourcing something is pointless if you don't want to incorporate feedback you receive.
    If you really want to help, then you could try implementing a TLS 1.2 connection using your approach. I would seriously be very interested.

    J.A. Coutts

  10. #10
    Fanatic Member
    Join Date
    Aug 2013
    Posts
    806

    Re: VB6 - The case for UTF-8

    Quote Originally Posted by couttsj View Post
    If you really want to help, then you could try implementing a TLS 1.2 connection using your approach.
    And risk you dismissing it out of hand, like you've done with all previous advice on this topic? I can't speak for others, but that hardly seems like a constructive use of time.

    The first step to fixing the problem is simple, and it's been repeated by all of us at one point or another. To quote Schmidt, above:

    The best advice one can give to you at this stage is, to change all the String-Input Params and Output-Results of your class functions simply to ByteArrays.
    This is step one. Cryptography functions must always operate on raw bytes with no semantic data attached. This is true for pretty much all cryptography APIs in all programming languages.

    In some programming languages, there are multiple data types that fulfill the description of "raw bytes with no semantic data attached." In VB, we are not that fortunate. We have one option: actual Byte arrays.

    Strings do not meet this definition. No amount of wishing can change this. VB strings are integer arrays with specialized headers and trailers and they have loads of semantic data attached. VB functions that operate on strings may perform all kinds of silent changes, with the understanding that the data they are operating on is text from the current codepage. This is by design, in the same way that you can't pass text data to a sound function or sound data to a graphics function and expect them to behave correctly.

    Once you have accepted the inevitability of operating only on byte arrays, then we can tackle the next problem, which is how to move various types of data (including strings) into bare byte arrays.

    But as long as you insist on working only with strings in your various cryptography functions, you will continue to run into problems, and your code will continue to be extremely complex and error-prone. (And this is not even getting into the flat-out falsehoods in your original post, like "The Unicode standards are very loose" or "The Unicode standard, and in particular UTF-8, only works with true ASCII characters less than 128 (&H80)". I literally have no idea how a person could arrive at conclusions like that, given that the exact opposite is true... )

  11. #11
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    Re: VB6 - The case for UTF-8

    Quote Originally Posted by Tanner_H View Post
    Strings do not meet this definition. No amount of wishing can change this. VB strings are integer arrays with specialized headers and trailers and they have loads of semantic data attached. VB functions that operate on strings may perform all kinds of silent changes, with the understanding that the data they are operating on is text from the current codepage. This is by design, in the same way that you can't pass text data to a sound function or sound data to a graphics function and expect them to behave correctly.
    Not to be pedantic - but technically VB Strings can be used as Binary data. Historically the Basic language allowed this, but ever since VB4/5 you have to use the Binary friendly string functions, and you definitely can't treat a Binary String Buffer as a Wide/Unicode String interchangeably. This is where all the errors are cropping up.

    No one uses this technique anymore, and it really shouldn't be used, because it is error-prone and non-intuitive. Couttsj's difficulties are a perfect example of why NOT to do this.

    The quick fix is as Schmidt and you suggested: just use Byte Arrays.
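
    To illustrate the difference between the character view and the byte view of the same String, here is a small sketch. ShowStringAsBytes is a hypothetical name and "AB" is an arbitrary sample.

    Code:
    'Sketch only: the same VB String seen as characters and as raw bytes
    Public Sub ShowStringAsBytes()
        Dim s As String
        Dim b() As Byte

        s = "AB"
        b = s                               'copies the string's raw UTF-16 bytes into the array
        Debug.Print Len(s), LenB(s)         '2 characters, 4 bytes
        Debug.Print UBound(b) + 1           '4 (bytes &H41 &H00 &H42 &H00)
        Debug.Print AscB(MidB$(s, 2, 1))    '0: the high byte of "A"
    End Sub

    Every second byte here is a zero that the text functions (Len, Mid$, Asc) hide and the binary functions (LenB, MidB$, AscB) expose, which is why a binary buffer and a text string can't be mixed freely.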

  12. #12
    Fanatic Member
    Join Date
    Aug 2013
    Posts
    806

    Re: VB6 - The case for UTF-8

    Quote Originally Posted by DEXWERX View Post
    Not to be pedantic - but technically VB Strings can be used as Binary data. Historically the Basic language allowed this, but ever since VB4/5 you have to use the Binary friendly string functions, and you definitely can't treat a Binary String Buffer as a Wide/Unicode String interchangeably. This is where all the errors are cropping up.

    No one uses this technique anymore, and it really shouldn't be used, because it is error-prone and non-intuitive. Couttsj's difficulties are a perfect example of why NOT to do this.

    The quick fix is as Schmidt and you suggested: just use Byte Arrays.
    Certainly true, but I didn't even want to MENTION that possibility, because as we know couttsj is fixated on making that technique work, despite the endless pitfalls.

    (I mean, technically any data can be treated as binary data by wrapping it with a SafeArray header. I think it could be really constructive to discuss this with couttsj in more detail, but that discussion is basically pointless until his code is rewritten to operate on byte arrays.)

  13. #13
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,785

    Re: VB6 - The case for UTF-8

    BasicBuffer, Binary Stream Class or something along similar lines might go a long way toward providing an alternative to working with String variables.

  14. #14

    Thread Starter
    Fanatic Member
    Join Date
    Dec 2012
    Posts
    798

    Re: VB6 - The case for UTF-8

    I would like to think that I am capable of thinking "outside the box". That is to say "There is more than one way to skin a cat (not the domestic variety)". I have produced a class that has the capability to execute cryptographic functions using VB6 & CNG. I don't think that anyone else can claim that, and I have shared it with the VB community. But there are those that seem to think there is only one way of doing things, and if your code doesn't fit their niche methods, it is wrong. Fortunately, I don't share that restrictive view. End of story. I will not respond further to this thread.

    J.A. Coutts

  15. #15
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,785

    Re: VB6 - The case for UTF-8

    Maybe:

    RSA Public Key Encryption via CNG

    RSA Data Signing via CNG

    Nothing especially challenging there except for the need to wait until Windows XP became a thing of the past. Before that most people couldn't justify looking into CNG.

    We face similar issues with new APIs introduced in post-Win7 versions of Windows now. One that comes to mind immediately is the new Compression API.

    Another neglected API is the one described under Packaging API. This one comes in Windows 7 and the Vista Platform Update but I haven't seen anyone post a result of experimentation with it in VB6 yet.

  16. #16
    Fanatic Member
    Join Date
    Aug 2013
    Posts
    806

    Re: VB6 - The case for UTF-8

    If a new user stumbles onto this thread thinking it will be helpful for UTF-8 handling, my only advice can be: "look elsewhere". Converting a VB string to UTF-8 is as simple as:

    Declarations:
    Code:
    Private Const CP_UTF8 As Long = 65001
    Private Declare Function WideCharToMultiByte Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, _
        ByVal cchWideChar As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long
    Code:
    Code:
    'Given some VB string named "Text"...
    Dim lenUTF8 As Long
    Dim UTF8() As Byte
    lenUTF8 = WideCharToMultiByte(CP_UTF8, 0, StrPtr(Text), Len(Text), 0, 0, 0, 0)
    If lenUTF8 > 0 Then
        ReDim UTF8(lenUTF8 - 1)
        WideCharToMultiByte CP_UTF8, 0, StrPtr(Text), Len(Text), VarPtr(UTF8(0)), lenUTF8, 0, 0
    End If
    UTF-8 characters have variable length (anywhere from one to four bytes per character), so byte arrays are the only VB unit that can sensibly deal with them. Dr Unicode's comprehensive tutorial has a great deal more code on this topic.

  17. #17
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    Re: VB6 - The case for UTF-8

    Quote Originally Posted by couttsj View Post
    I would like to think that I am capable of thinking "outside the box". That is to say "There is more than one way to skin a cat (not the domestic variety)". I have produced a class that has the capability to execute cryptographic functions using VB6 & CNG. I don't think that anyone else can claim that, and I have shared it with the VB community. But there are those that seem to think there is only one way of doing things, and if your code doesn't fit their niche methods, it is wrong. Fortunately, I don't share that restrictive view. End of story. I will not respond further to this thread.

    J.A. Coutts
    It really has nothing to do with that. You've got a bug-ridden code base with so many pitfalls that it shows not only that you don't know how strings or Unicode work, but that you don't even know how to convert a string to a byte array or erase a dynamic array. You're just completely missing the boat on the fundamentals of the language. You're over-complicating things because you don't understand the basics. I'll leave you alone; I just wanted to try to get you to understand where you're going so wrong.

  18. #18
    Super Moderator Shaggy Hiker's Avatar
    Join Date
    Aug 2002
    Location
    Idaho
    Posts
    34,445

    Re: VB6 - The case for UTF-8

    Let's just let this one fade away. Everybody has had their say, and many of the points were good, but it was beginning to wander away from a constructive discussion to more destructive ends.
    My usual boring signature: Nothing
