-
Feb 5th, 2023, 12:47 PM
#1
Thread Starter
Hyperactive Member
Question about unicode
Hi. I don't really know how to implement Unicode in my apps. I see that you have to use the W versions of the API calls, etc.
I'm really doing this for my own learning, as the apps I make are just for my own use.
I have a grid that lists filenames, and the grid is Unicode-aware.
I load some filenames into the grid.
Now, some filenames are plain ANSI and a couple might be Unicode.
When going through the list to rename the files, I know there is the MoveFileW API.
My question is this, regarding filenames:
Do I need to check whether the filename is Unicode and then use either the standard rename function or MoveFileW,
or can I use MoveFileW to rename all files?
Can I just code for Unicode and that's it, or do I need to code for both?
If I need to do both, how would I check whether a filename is Unicode?
i.e.
If IsUnicodeFile(Filename) = True Then
Thanks
Last edited by k_zeon; Feb 5th, 2023 at 12:58 PM.
-
Feb 5th, 2023, 01:19 PM
#2
Re: Question about unicode
There is no "plain ANSI."
ANSI/DBCS encodings vary and any given text transcoded from Unicode to ANSI for any given codepage can lose fidelity. Typically "lost" characters become "?" symbols by default.
There is no "Unicode or not" since text is always encoded, either as a Unicode encoding or as something lossier. I suspect what you meant was something closer to "Is this string safe for encoding as ANSI for my codepage or do I need to use Unicode to avoid losing fidelity?"
Whether ANSI is "safe" or not depends on the characters involved and whether or not you need to move the text across locales with different codepage values.
No simple function can enter the necessary Socratic dialog with you to ask enough questions to determine what your intent really is.
Perhaps what you really need is some function that accepts Unicode text and, taking the current codepage into account, returns a value that means "Foreign to me or not?"
Why bother? Just use Unicode (-W entrypoints) when in doubt. These are faster on NT-based Windows (everything since the end of the Win9x days) anyway.
-
Feb 5th, 2023, 01:30 PM
#3
Thread Starter
Hyperactive Member
Re: Question about unicode
Thanks, dilettante. So I'll just code my functions for Unicode filenames.
-
Feb 5th, 2023, 01:32 PM
#4
Re: Question about unicode
Taking a peek at the Windows source, all the A APIs simply convert the string to Unicode and call the W API.
-
Feb 5th, 2023, 01:52 PM
#5
Re: Question about unicode
 Originally Posted by fafalone
Taking a peek at the Windows source
Where? You mean the Windows XP leaks?
-
Feb 5th, 2023, 01:55 PM
#6
Re: Question about unicode
 Originally Posted by k_zeon
Now some files are plain Ansi and a couple might be unicode.
All file names in Windows are Unicode; NTFS stores them as 16-bit Unicode characters.
-
Feb 5th, 2023, 02:04 PM
#7
Re: Question about unicode
Ok, I'll jump in here. This can all be dizzying to the uninitiated.
ANSI, in some sense, is more complex than Unicode. And maybe it's best to start with ASCII.
ASCII is a 7-bit encoding (the 8th, high bit is always zero). It covers the English letters, the base-10 digits, and all the special characters seen on a typical English-style keyboard. In addition, ASCII has a set of control characters (like backspace, tab, etc.) encoded into it.
In the beginning, ANSI was an extension of ASCII whereby the encoding set was doubled, using the 8th bit to get twice as many encodings. The first passes just added characters for Latin-style languages to cover things like ñ, Á, and other frequently seen letters. But then ANSI "code pages" were introduced to specify what the characters were in the high-bit-on encodings, and that system is still in use.
But ANSI has gotten even more complex and goes beyond just the 8th bit being on: the East Asian (DBCS) code pages use one or two bytes per character. In fact, there are even code-page numbers assigned to Unicode encodings (65001 is UTF-8, for instance), but that gets complex and I won't go into it.
------------
Ok, Unicode ... there are several flavors (encodings) of Unicode. To name a few:
UCS-2 is a fixed-width subset of UTF-16 in which every character is encoded as exactly two bytes. UTF-16 proper adds surrogate pairs (two 16-bit units, so four bytes) for characters outside the Basic Multilingual Plane.
UTF-8 is the most popular, being used for the vast majority of web pages and web communications.
Microsoft, on the other hand, tends to promote the UTF-16 flavor of Unicode, and all its ...W API calls expect incoming strings to be encoded as UTF-16. It also has another set of ...A API calls that expect ANSI strings. As a note, API calls that don't deal with strings don't have to worry about this.
------------
So, how does this relate to VB6? Well, VB6 is a bit of a hodge-podge. Internally, VB6 considers its strings to be UCS-2 (and that's what all the VB6 string functions expect). (Some like to say that VB6 strings are UTF-16, but that's a debate I'll sidestep here.)
But, VB6 was rushed out the door a bit, and most of its controls (like TextBox, ComboBox, etc) only understand ANSI (and a version of ANSI that's only one-byte-per-character). (Krool and others have corrected that by making full Unicode versions of the controls.) So, to say again, internally, VB6 strings are UCS-2, but typically displayed as ANSI.
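To make the UCS-2 point concrete, here's a small sketch (ShowCodeUnits is just a name made up for illustration). It builds a single character from outside the Basic Multilingual Plane, U+1F600, as a UTF-16 surrogate pair and shows that VB6's string functions count 16-bit units rather than characters:
Code:
Public Sub ShowCodeUnits()
    Dim s As String
    ' U+1F600 encoded as a UTF-16 surrogate pair (high unit, then low unit).
    s = ChrW$(&HD83D&) & ChrW$(&HDE00&)
    Debug.Print Len(s)   ' 2: Len counts 16-bit code units
    Debug.Print LenB(s)  ' 4: two code units of two bytes each
End Sub
So one user-visible character can occupy two "characters" as far as Len, Mid$, and friends are concerned.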
-------------
Now here's another wrinkle. VB6 was set up (by default) to make API calls with ANSI strings. So, when you Declare an API parameter As String (the usual way to call the ...A functions), VB6 converts your internal UCS-2 string to ANSI and then passes it to the API call. It's actually quicker to use the ...W version of the API call, declare the parameter As Long, and pass your string using StrPtr(YourString). That way, no conversion needs to be done. This works for both [in] and [out] strings for API calls.
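As a minimal sketch of that approach for the renaming question (RenameFileW is a hypothetical wrapper name; MoveFileW itself is the real kernel32 function), the parameters are declared As Long so that VB6 passes the pointer untouched:
Code:
Private Declare Function MoveFileW Lib "kernel32" ( _
    ByVal lpExistingFileName As Long, _
    ByVal lpNewFileName As Long) As Long

' Renames (or moves) a file; returns True on success.
Public Function RenameFileW(ByVal OldPath As String, ByVal NewPath As String) As Boolean
    ' StrPtr hands the API the address of the string's UTF-16 data,
    ' so no ANSI conversion takes place.
    RenameFileW = (MoveFileW(StrPtr(OldPath), StrPtr(NewPath)) <> 0)
End Function
If it returns False, Err.LastDllError should hold the Windows error code, provided you read it before making another API call. The same function handles names that would have fit in ANSI and names that wouldn't, so there's no need for two code paths.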
--------------
So, a couple of answers to your questions:
1) If you just set everything up to use the ...W (Unicode) API calls, you're all set. No worries about ANSI as you'll never actually be using it. Everything will stay pure Unicode, including the VB6 strings.
2) If you really just want to know whether a string contains Unicode characters that won't convert to ANSI (for the current code page) without loss, you can do something like the following:
Code:
Public Function HighBytesUsed(s As String) As Boolean
    ' True if a round trip through the system ANSI code page changes the string.
    HighBytesUsed = s <> StrConv(StrConv(s, vbFromUnicode), vbUnicode)
End Function
I called it HighBytesUsed rather than something like HasUnicode because, technically, ASCII is a character subset of Unicode. So, strictly speaking, all VB6 strings are Unicode regardless of whether or not they can be converted to a one-byte encoding.
Maybe that'll help,
Elroy
-----------------
Added: I've decided to rename the above function, because there are cases where a character with a non-zero high byte in UCS-2 can successfully be converted to ANSI (and back). So, HighBytesUsed isn't strictly correct. Here's a better name:
Code:
Public Function HasNonAnsi(s As String) As Boolean
    ' True if a round trip through the system ANSI code page changes the string.
    HasNonAnsi = s <> StrConv(StrConv(s, vbFromUnicode), vbUnicode)
End Function
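A quick way to try it, assuming the HasNonAnsi function above is in the same project (TestHasNonAnsi is just an illustrative name). The second result depends on your system's ANSI code page, which is rather the point:
Code:
Public Sub TestHasNonAnsi()
    ' Pure ASCII survives the round trip on any Windows ANSI code page,
    ' so this should print False.
    Debug.Print HasNonAnsi("report.txt")
    ' U+4E2D is a CJK character. On a Western European (1252) system it
    ' becomes "?" in ANSI, so I'd expect True; on a Chinese-locale system
    ' it may well round-trip and give False.
    Debug.Print HasNonAnsi("report" & ChrW$(&H4E2D) & ".txt")
End Sub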
Last edited by Elroy; Feb 6th, 2023 at 12:00 PM.
Any software I post in these forums written by me is provided “AS IS” without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. Please understand that I’ve been programming since the mid-1970s and still have some of that code. My contemporary VB6 project is approaching 1,000 modules. In addition, I have a “VB6 random code folder” that is overflowing. I’ve been at this long enough to truly not know with absolute certainty from whence every single line of my code has come, with much of it coming from programmers under my employ who signed intellectual property transfers. I have not deliberately attempted to remove any licenses and/or attributions from any software. If someone finds that I have inadvertently done so, I sincerely apologize, and, upon notice and reasonable proof, will re-attach those licenses and/or attributions. To all, peace and happiness.
-
Feb 5th, 2023, 03:51 PM
#8
Re: Question about unicode
Just a touch more clarification (after a bit of review):
I think it's fair to say that ANSI (almost) always refers to a one-byte-per-character encoding scheme. The first 128 codes (0 thru 127) are ASCII, and the next 128 codes (128 thru 255) are specified by the code page set in the OS (typically Windows for us). So again, according to most sources, ANSI encoding is a one-byte encoding with the additional specification of a code page needed for interpreting the second half of the characters.
However, the notion of "code page" outgrew (or maybe never completely fit into) ANSI. In a certain sense, a code page is the most general of character specifications. For instance, Windows assigns code-page numbers to UTF-16 and UTF-8 as well (1200 and 65001), even though they have nothing specifically to do with ANSI. If we look at Wikipedia's coverage of code pages, we can see that this terminology has a long and storied history.
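Here's a rough sketch of what "the code page decides the upper 128" means in practice. DecodeByte is a helper name I made up; MultiByteToWideChar is the real kernel32 call. It's only meant for single-byte code pages, and it assumes the code pages you ask for are installed:
Code:
Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

' Returns the Unicode value that byte b maps to in the given code page,
' or -1 if the conversion fails (e.g. the code page isn't available).
Public Function DecodeByte(ByVal b As Byte, ByVal CodePage As Long) As Long
    Dim wide As String
    wide = String$(1, vbNullChar)   ' room for one UTF-16 code unit
    If MultiByteToWideChar(CodePage, 0, VarPtr(b), 1, StrPtr(wide), 1) = 1 Then
        DecodeByte = AscW(wide) And &HFFFF&   ' mask to an unsigned value
    Else
        DecodeByte = -1
    End If
End Function
With that, Hex$(DecodeByte(&HE9, 1252)) should give E9 (é in Western European) while Hex$(DecodeByte(&HE9, 1251)) should give 439 (й in Cyrillic): same byte, different character.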
-----------
And, just to summarize again:
- VB6 strings internally are UCS-2.
- VB6's controls typically prefer ANSI (with code-page specified by Windows). (And Krool, Eduardo, and others have corrected this oversight.)
- VB6's Declare statements (by default) convert As String arguments to ANSI before the Windows API sees them.
- Windows API calls with the ...W suffix take VB6's strings straight in (but a string pointer, StrPtr, must be passed).
- If the high byte is non-zero for any character in a UCS-2 string, it probably won't convert to ANSI very well (but this is actually a longer discussion).
-
Feb 5th, 2023, 06:46 PM
#9
Re: Question about unicode
 Originally Posted by Niya
Where? You mean the Windows XP leaks?
I usually start with Windows Server 2003 from the same leak as it's ever so slightly more recent... but man I wish I had access to more recent source. Can't believe Vista/7 hasn't leaked yet... that's when so many major, major changes were introduced... I could finally solve so many long standing issues if I could only look under the hood of that...
-
Feb 5th, 2023, 10:34 PM
#10
Re: Question about unicode
Elroy has shown us a very good understanding of Unicode, and it is a pleasure to see. As he says, the VB6 interpretation is in reality not Unicode, but rather Wide Character. As long as you work within VB6, that is not a problem. But if you want to communicate with non-VB6 programs, you should establish some kind of common ground. Web servers, for example, will most often communicate string information as UTF-8. A UTF-8 character can be one, two, three, or four bytes. It is also what I have chosen to use, although my work is basically all ASCII, so conversion is straightforward.
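For anyone wanting to see the one-to-four-byte behaviour, here is a minimal sketch of a VB6 string to UTF-8 conversion. ToUtf8 is a hypothetical helper name, not something from my own code; WideCharToMultiByte and the CP_UTF8 value are the real Windows API pieces:
Code:
Private Declare Function WideCharToMultiByte Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
    ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long

Private Const CP_UTF8 As Long = 65001

' Converts a VB6 (UTF-16) string to UTF-8 bytes.
' Returns an unallocated array for an empty string or on failure.
Public Function ToUtf8(ByVal s As String) As Byte()
    Dim cb As Long
    Dim buf() As Byte
    If Len(s) = 0 Then Exit Function
    ' First call: ask how many bytes the UTF-8 form needs.
    cb = WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), 0, 0, 0, 0)
    If cb = 0 Then Exit Function
    ReDim buf(0 To cb - 1)
    ' Second call: do the actual conversion into the buffer.
    WideCharToMultiByte CP_UTF8, 0, StrPtr(s), Len(s), VarPtr(buf(0)), cb, 0, 0
    ToUtf8 = buf
End Function
UBound(ToUtf8("A")) + 1 should be 1, ChrW$(&HE9) (é) should take 2 bytes, ChrW$(&H4E2D) should take 3, and a surrogate pair such as ChrW$(&HD83D&) & ChrW$(&HDE00&) should take 4.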
J.A. Coutts
-
Feb 5th, 2023, 11:23 PM
#11
Re: Question about unicode
This is painful. Is it April 1st?
-
Feb 6th, 2023, 12:05 PM
#12
Re: Question about unicode
 Originally Posted by dilettante
This is painful. Is it April 1st?
Personally, I've always found this to be somewhat painful. I've just always attributed it to all the varied attempts to deal with all the worldwide languages.
And we still currently/frequently have to deal with the confluence of ANSI (our keyboards and computer codepage, as well as many VB6 controls), UCS-2 (VB6 strings), UTF-16 (API calls), and UTF-8 (web pages).
If the United Nations made a worldwide law that everyone had to use UTF-32, the problem would be solved.
Last edited by Elroy; Feb 6th, 2023 at 12:11 PM.
-
Feb 6th, 2023, 07:44 PM
#13
Re: Question about unicode
 Originally Posted by Elroy
If the United Nations made a worldwide law that everyone had to use UTF-32, the problem would be solved. 
Yuck!
UTF-8 is the most space-efficient Unicode encoding, at least for text that is mostly ASCII.