Thread: Question about unicode

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Nov 2011
    Posts
    498

    Question about unicode

    Hi. I don't really know how to implement Unicode in my apps. I see that you have to use the W versions of the API functions, etc.
    I am really trying to do this for my own learning, as the apps I make are just for my own use.

    I have a grid that lists filenames, and the grid is Unicode-aware.
    I load some filenames into the grid.
    Now some files are plain ANSI and a couple might be Unicode.

    When going through the list and wanting to rename the files, I know there is the MoveFileW API.

    My question is this, regarding filenames:

    Do I need to check whether the filename is Unicode and then use either the standard rename function or MoveFileW,
    or can I use the MoveFileW API to rename all files?

    Can I just code for Unicode and that's it, or do I need to code for both?

    If I need to do both, then how would I check whether a filename is Unicode?

    ie.

    If IsUnicodeFile(Filename) = True then

    tks
    Last edited by k_zeon; Feb 5th, 2023 at 12:58 PM.

  2. #2
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Question about unicode

    There is no "plain ANSI."

    ANSI/DBCS encodings vary and any given text transcoded from Unicode to ANSI for any given codepage can lose fidelity. Typically "lost" characters become "?" symbols by default.

    There is no "Unicode or not" since text is always encoded. Either as a Unicode encoding or something lossier. I suspect what you meant was something closer to "Is this string safe for encoding as ANSI for my codepage or do I need to use Unicode to avoid losing fidelity?"


    Whether ANSI is "safe" or not depends on the characters involved and whether or not you need to move the text across locales with different codepage values.

    No simple function can enter the necessary Socratic dialog with you to ask enough questions to determine what your intent really is.


    Perhaps what you really need instead is some function that accepts Unicode text and returns a value that means "Foreign to me or not?", taking the current codepage into account?

    Why bother? Just use Unicode (-W entrypoints) when in doubt. These are faster on NT-based Windows (everything since the end of the Win9x days) anyway.

  3. #3

    Thread Starter
    Hyperactive Member
    Join Date
    Nov 2011
    Posts
    498

    Re: Question about unicode

    Thanks, dilettante. So I should just code my functions for Unicode filenames.

  4. #4
    PowerPoster
    Join Date
    Jul 2010
    Location
    NYC
    Posts
    5,625

    Re: Question about unicode

    Taking a peek at the Windows source, all the A apis simply convert the string to Unicode and call the W api.

  5. #5
    Angel of Code Niya's Avatar
    Join Date
    Nov 2011
    Posts
    8,598

    Re: Question about unicode

    Quote Originally Posted by fafalone
    Taking a peek at the Windows source
    Where? You mean the Windows XP leaks?
    Treeview with NodeAdded/NodesRemoved events | BlinkLabel control | Calculate Permutations | Object Enums | ComboBox with centered items | .Net Internals article(not mine) | Wizard Control | Understanding Multi-Threading | Simple file compression | Demon Arena

    Copy/move files using Windows Shell | I'm not wanted

    C++ programmers will dismiss you as a cretinous simpleton for your inability to keep track of pointers chained 6 levels deep and Java programmers will pillory you for buying into the evils of Microsoft. Meanwhile C# programmers will get paid just a little bit more than you for writing exactly the same code and VB6 programmers will continue to whitter on about "footprints". - FunkyDexter

    There's just no reason to use garbage like InputBox. - jmcilhinney

    The threads I start are Niya and Olaf free zones. No arguing about the benefits of VB6 over .NET here please. Happiness must reign. - yereverluvinuncleber

  6. #6
    Angel of Code Niya's Avatar
    Join Date
    Nov 2011
    Posts
    8,598

    Re: Question about unicode

    Quote Originally Posted by k_zeon
    Now some files are plain Ansi and a couple might be unicode.
    All file names in Windows are Unicode.

  7. #7
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    9,817

    Re: Question about unicode

    Ok, I'll jump in here. This can all be dizzying to the uninitiated.

    ANSI, in some sense, is more complex than Unicode. And maybe it's best to start with ASCII.

    ASCII is a 7-bit encoding (always setting the high 8th bit to zero). It covers the English letters, base-10 digits, and all the special characters seen on a typical English-style keyboard. In addition, ASCII has a few control characters (like backspace, tab, etc) encoded into it.

    In the beginning, ANSI was an extension of ASCII whereby the encoding set was doubled, using the 8th bit to get twice as many encodings. The first passes just added characters for Latin-style languages to cover things like ñ, Á and other frequently seen letters. But then ANSI "pages" (code pages) were introduced to specify what the characters were in the high-bit-on encodings, and that system is still in use.

    But ANSI has gotten even more complex and goes beyond just the 8th bit being on. In fact, some Windows "ANSI" code pages are multi-byte (DBCS) encodings, and there are even code-page numbers for Unicode encodings such as UTF-8, but that gets complex and I won't go into it.

    ------------

    Ok, Unicode ... there are several flavors of Unicode. To name a few:
    • UTF-8
    • UTF-16
    • UCS-2
    • UTF-32


    UCS-2 is a strict subset of UTF-16 in which every character is encoded as exactly two bytes. UTF-16 adds surrogate pairs (four bytes) for characters outside the Basic Multilingual Plane.

    UTF-8 is the most popular, being used for the vast majority of web pages and web traffic.

    Microsoft, on the other hand, tends to promote the UTF-16 flavor of Unicode, and all its ...W API calls expect incoming strings to be encoded as UTF-16. It does have another set of ...A API calls that expect ANSI strings to be passed. As a note, API calls that don't deal with strings don't have to worry about this.

    ------------

    So, how does this relate to VB6? Well, VB6 is a bit of a hodge-podge. Internally, VB6 considers its strings to be UCS-2 (and that's what all the VB6 string functions expect). (Some like to say that VB6 strings are UTF-16, but that's a debate I'll sidestep here.)

    But, VB6 was rushed out the door a bit, and most of its controls (like TextBox, ComboBox, etc) only understand ANSI (and a version of ANSI that's only one-byte-per-character). (Krool and others have corrected that by making full Unicode versions of the controls.) So, to say again, internally, VB6 strings are UCS-2, but typically displayed as ANSI.

    -------------

    Now here's another wrinkle. VB6 was set up (by default) to make API calls with ANSI strings. So, when you pass a String to a Declare'd API call (normally the one with the ...A suffix), VB6 converts your internal UCS-2 string to ANSI and then passes it to the API call. It's actually quicker to just use the ...W version of the API call, declare the string parameter As Long, and pass your string using StrPtr(YourString). That way, no conversion needs to be done. This works for both [in] and [out] strings for API calls (for [out] strings, size the string buffer before the call).
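
    As a rough sketch of that pattern applied to the rename question, assuming the rename goes through the kernel32 MoveFileW function (the RenameFileW wrapper name is made up for illustration, not something from this thread):

    Code:

    Option Explicit

    ' W entry point declared with Long parameters, so VB6 passes the
    ' UTF-16 string pointers through untouched (no ANSI conversion).
    Private Declare Function MoveFileW Lib "kernel32" ( _
        ByVal lpExistingFileName As Long, _
        ByVal lpNewFileName As Long) As Long

    ' Hypothetical helper: renames (or moves) a file given full paths.
    ' Returns True on success. On failure, Err.LastDllError holds the
    ' Win32 error code from the call.
    Public Function RenameFileW(ByVal OldPath As String, ByVal NewPath As String) As Boolean
        RenameFileW = (MoveFileW(StrPtr(OldPath), StrPtr(NewPath)) <> 0)
    End Function

    Something like If Not RenameFileW(OldName, NewName) Then Debug.Print Err.LastDllError would then work the same way for every filename, whether or not it happens to fit the current ANSI codepage.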

    --------------

    So, a couple of answers to your questions:

    1) If you just set everything up to use the ...W (Unicode) API calls, you're all set. No worries about ANSI as you'll never actually be using it. Everything will stay pure Unicode, including the VB6 strings.

    2) If you just really want to know if a string contains Unicode characters that won't easily convert to ANSI, you can do something like the following:

    Code:
    
    Public Function HighBytesUsed(s As String) As Boolean
        HighBytesUsed = s <> StrConv(StrConv(s, vbFromUnicode), vbUnicode)
    End Function
    
    
    I called it HighBytesUsed rather than something like HasUnicode because, technically, ASCII is a character subset of Unicode. So, strictly speaking, all VB6 strings are Unicode regardless of whether or not they can be converted to a one-byte encoding.

    Maybe that'll help,
    Elroy

    -----------------
    Added: I've decided to rename that above function again, because there are cases where two-byte UCS-2 encoding can successfully be converted to ANSI (and vice-versa). For example, the euro sign (U+20AC) has a non-zero high byte but maps to the single byte &H80 in code page 1252. So, HighBytesUsed isn't strictly correct. Here's a better name:

    Code:
    
    Public Function HasNonAnsi(s As String) As Boolean
        HasNonAnsi = s <> StrConv(StrConv(s, vbFromUnicode), vbUnicode)
    End Function
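
    A small usage sketch for the function above (the Sub name and sample strings are made up for illustration, and the result for non-ASCII text depends on the system's ANSI codepage):

    Code:

    Public Sub DemoHasNonAnsi()
        ' Plain ASCII round-trips on every ANSI codepage, so this is False.
        Debug.Print HasNonAnsi("report.txt")

        ' U+4E2D is a CJK ideograph: expect True on a Western codepage
        ' such as 1252, but False on one that contains it (e.g. 936).
        Debug.Print HasNonAnsi("report_" & ChrW$(&H4E2D) & ".txt")
    End Sub

    That codepage dependence is why the same filename can test as "safe" on one machine and not on another.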
    
    
    Last edited by Elroy; Feb 6th, 2023 at 12:00 PM.
    Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.

  8. #8
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    9,817

    Re: Question about unicode

    Just a touch more clarification (after a bit of review):

    I think it's fair to say that ANSI is (almost) always referring to a one-byte-per-character encoding scheme. The first 128 codes (0 thru 127) are ASCII, and the next 128 codes (128 thru 255) are specified by the code page designation set in the OS (typically Windows for us). So again, according to most sources, ANSI encoding is a one-byte encoding with the additional specification of a code-page needed for interpreting the second-half of the characters.

    However, the notion of "code-page" outgrew (or maybe never completely fit into) ANSI. In a certain sense, a code-page is the most general of character specifications. For instance, Windows assigns code-page numbers even to UTF-8 (65001) and UTF-16 (1200), which have nothing specifically to do with ANSI. If we look at the Wikipedia article on code pages, we can see that this terminology has a long and storied history.

    -----------

    And, just to summarize again:
    • VB6 strings internally are UCS-2.
    • VB6's controls typically prefer ANSI (with code-page specified by Windows). (And Krool, Eduardo, and others have corrected this oversight.)
    • VB6 (by default) converts its strings to ANSI when they are passed As String to Declare'd API calls.
    • Windows API calls with the ...W suffix accept VB6's strings as-is, provided the string pointer (StrPtr) is passed instead of a String.
    • If the high-byte is non-zero for UCS-2 strings, they probably won't convert to ANSI very well (but this is actually a longer discussion).

  9. #9
    PowerPoster
    Join Date
    Jul 2010
    Location
    NYC
    Posts
    5,625

    Re: Question about unicode

    Quote Originally Posted by Niya
    Where? You mean the Windows XP leaks?
    I usually start with Windows Server 2003 from the same leak as it's ever so slightly more recent... but man I wish I had access to more recent source. Can't believe Vista/7 hasn't leaked yet... that's when so many major, major changes were introduced... I could finally solve so many long standing issues if I could only look under the hood of that...

  10. #10
    Frenzied Member
    Join Date
    Dec 2012
    Posts
    1,468

    Re: Question about unicode

    Elroy has shown us a very good understanding of Unicode, and it is a pleasure to see. Like he says, the VB6 interpretation is in reality not Unicode, but rather Wide Character. As long as you work within VB6, that is not a problem. But if you want to communicate with non-VB6 programs, you should establish some kind of common ground. Web servers, for example, will most often communicate string information as UTF-8. A UTF-8 character can be one, two, three, or four bytes. It is also what I have chosen to use, although my work is basically all ASCII, so conversion is straightforward.
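
    A minimal sketch of that kind of conversion from a VB6 string to UTF-8 bytes, assuming the kernel32 WideCharToMultiByte function is used (the ToUtf8 name is hypothetical, just for illustration):

    Code:

    Option Explicit

    Private Const CP_UTF8 As Long = 65001

    Private Declare Function WideCharToMultiByte Lib "kernel32" ( _
        ByVal CodePage As Long, ByVal dwFlags As Long, _
        ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, _
        ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
        ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long

    ' Hypothetical helper: returns the UTF-8 bytes for a VB6 string.
    ' An empty string (or a failed call) yields an unallocated array,
    ' so callers should not assume UBound is valid in that case.
    Public Function ToUtf8(ByVal s As String) As Byte()
        Dim bytes() As Byte
        Dim n As Long
        If Len(s) = 0 Then Exit Function
        ' First call, with no output buffer, asks how many bytes are needed.
        n = WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), 0, 0, 0, 0)
        If n = 0 Then Exit Function
        ReDim bytes(0 To n - 1)
        ' Second call fills the buffer with the UTF-8 bytes.
        WideCharToMultiByte CP_UTF8, 0, StrPtr(s), Len(s), VarPtr(bytes(0)), n, 0, 0
        ToUtf8 = bytes
    End Function

    For pure ASCII text the result has one byte per character, which is why the conversion is so painless in that case.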

    J.A. Coutts

  11. #11
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Question about unicode

    This is painful. Is it April 1st?

  12. #12
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    9,817

    Re: Question about unicode

    Quote Originally Posted by dilettante
    This is painful. Is it April 1st?
    Personally, I've always found this to be somewhat painful. I've just always attributed it to all the varied attempts to deal with all the worldwide languages.

    And we still currently/frequently have to deal with the confluence of ANSI (our keyboards and computer codepage, as well as many VB6 controls), UCS-2 (VB6 strings), UTF-16 (API calls), and UTF-8 (web pages).

    If the United Nations made a worldwide law that everyone had to use UTF-32, the problem would be solved.
    Last edited by Elroy; Feb 6th, 2023 at 12:11 PM.

  13. #13
    Angel of Code Niya's Avatar
    Join Date
    Nov 2011
    Posts
    8,598

    Re: Question about unicode

    Quote Originally Posted by Elroy
    If the United Nations made a worldwide law that everyone had to use UTF-32, the problem would be solved.
    Yuck!

    UTF-8 is the most space-efficient Unicode encoding, at least for ASCII-heavy text.
