-
Ansi/Unicoding Encoding Issue
I've got a bit of an issue with reading a file correctly on non-English PCs.
I've just changed my regional settings to Chinese. Viewing the file in a hex viewer, some characters appear to be replaced because the codepage has changed.
So the issue is getting the correct codepage set so that VB reads the file properly.
MultiByteToWideChar is way off....
StrConv is way off...
Even using the codepage returned by GetACP is still off.
Any ideas?
-
Re: Ansi/Unicoding Encoding Issue
What kind of file are you talking about?
One written by your application?
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
Arnoutdv
What kind of file are you talking about?
One written by your application?
No. It's exported/created by another app.
-
Re: Ansi/Unicoding Encoding Issue
Let's assume your hex viewer isn't busted itself.
What encoding is the file written with? For Chinese it could be a number of things.
So much depends on the writing program.
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
some1uk03
No. It's exported/created by another app.
Is it a text file?
If it is then Textpad or Notepad++ should be able to read it and show what encoding is used.
UTF8/16, Unicode or whatever.
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
some1uk03
So the issue is, getting the correct codepage set in order for VB to read properly.
I think you need to know in advance which code page an ANSI file was written in.
Perhaps with statistical analysis of the content it could be guessed, but to my understanding that's not what programs do normally.
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
dilettante
Let's assume your hex viewer isn't busted itself.
What encoding is the file written with? For Chinese it could be a number of things.
So much depends on the writing program.
I have an example of the output from the ENGLISH Locale version to compare with.
The default view in the hex viewer is ANSI; however, if I change the encoding to Chinese Simplified, it matches the English-locale version.
Quote:
Originally Posted by
Arnoutdv
Is it a text file?
If it is then Textpad or Notepad++ should be able to read it and show what encoding is used.
UTF8/16, Unicode or whatever.
Not a text file, but a custom file format.
As another app is outputting the data, I'm not sure what encoding/codepage it is written in, but presumably the default system codepage.
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
some1uk03
. . . but presumably the default system codepage.
There is a default user codepage and a default system codepage. When you use StrConv to convert from/to Unicode, it uses the default *user* codepage unless told otherwise.
You have to explicitly pass LOCALE_SYSTEM_DEFAULT = &H800 for the LocaleID parameter (the optional third one) to use the default *system* codepage.
This will probably not solve your issue at all, because it seems you don't have a clear definition of "correct" in "issue with reading a file correctly on NON English PC's."
cheers,
</wqw>
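To make the difference concrete, here is a minimal sketch of the three StrConv variants. The Sub name and the LCID_EN_US constant (1033, English - United States) are additions made up for illustration, not something taken from the original question, and the results depend on which locales are available on the machine.
Code:
Private Sub DemoStrConvLocale()
    ' Value quoted in the post above.
    Const LOCALE_SYSTEM_DEFAULT As Long = &H800
    ' Assumption for illustration: a fixed English (United States) LCID.
    Const LCID_EN_US As Long = 1033
    Dim bBytes(4) As Byte
    Dim sUser As String
    Dim sSystem As String
    Dim sEnglish As String
    bBytes(0) = 192
    bBytes(1) = 84
    bBytes(2) = 69
    bBytes(3) = 83
    bBytes(4) = 84
    ' No LocaleID: StrConv uses the default user codepage.
    sUser = StrConv(bBytes, vbUnicode)
    ' Explicit LocaleID: use the default system codepage instead.
    sSystem = StrConv(bBytes, vbUnicode, LOCALE_SYSTEM_DEFAULT)
    ' Fixed LocaleID: the same ANSI codepage is used on every PC.
    sEnglish = StrConv(bBytes, vbUnicode, LCID_EN_US)
    ' Compare the character counts; a DBCS codepage may merge two bytes into one character.
    Debug.Print Len(sUser), Len(sSystem), Len(sEnglish)
End Sub
If the file really holds one byte per symbol, passing a fixed LCID rather than relying on either default should at least make the result the same on every PC.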
-
Re: Ansi/Unicoding Encoding Issue
The output from the other app is correct.
My reading of the file into a byte array is also correct!
The problem arises when converting the byte array into a String: a conversion happens there, even though none is required!
All of the following go through a conversion:
Code:
sString = bBytes 'Conversion <<
sString = StrConv(bBytes, vbUnicode) 'Conversion <<
MultiByteToWideChar has same results as StrConv() too.
CopyMemory has same results too.
So, the question is, isn't there a way to convert the byte array to a string without any conversion taking place?
-
Re: Ansi/Unicoding Encoding Issue
If you assign the value of a dynamic Byte array to a String there is no transcoding ("conversion") performed.
It seems that the data on disk has been encoded as ANSI using some code page, and what you are really after is transcoding from that to UTF-16LE ("Unicode") and you aren't using the code page it was encoded in.
But we don't even know that. It seems very likely that your problem is that you are trying to display this stuff in an ANSI control that uses a different encoding, yielding scrambled results.
Or, for all we know, the data in the file is UTF-8 and you are expecting this to magically be transcoded properly.
One thing that might help get to a solution could be to show us a sample of the data and what you expect its interpretation as text to look like.
-
Re: Ansi/Unicoding Encoding Issue
Here's a simple way of replicating the issue:
Code:
Dim bBytes(4) As Byte
Dim sString As String
bBytes(0) = 192
bBytes(1) = 84
bBytes(2) = 69
bBytes(3) = 83
bBytes(4) = 84
sString = StrConv(bBytes, vbUnicode)
Dim xLoop As Integer
For xLoop = 1 To Len(sString)
    Debug.Print Asc(Mid$(sString, xLoop, 1))
Next
Debug.Print does not OUTPUT: 192,84,69,83,84
Seems like the 192 gets treated/converted way off based on the system codepage.
-
Re: Ansi/Unicoding Encoding Issue
No, because you specify a conversion
sString = bBytes should give the correct bytes, but is maybe not the correct text
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
Arnoutdv
No, because you specify a conversion
sString = bBytes should give the correct bytes, but is maybe not the correct text
sString = bBytes does NOT give the correct result either.
REMEMBER TO TEST THIS BY SETTING YOUR SYSTEM LOCALE TO CHINESE!
-
Re: Ansi/Unicoding Encoding Issue
Clearly there is some confusion. Most likely you are unaware of how "ANSI" works with DBCS code pages.
There is a reason why these are called multibyte encodings. For example:
Code:
Option Explicit
' Requires a project reference to Microsoft ActiveX Data Objects (for ADODB.Stream).
Private Declare Function TextOutW Lib "gdi32" ( _
    ByVal hDC As Long, _
    ByVal X As Long, _
    ByVal Y As Long, _
    ByVal lpString As Long, _
    ByVal nCount As Long) As Long
Private Sub Form_Load()
    Dim bBytes(4) As Byte
    Dim sString As String
    With Font
        .Name = "Segoe UI"
        .Size = 16
    End With
    AutoRedraw = True
    bBytes(0) = 192
    bBytes(1) = 84
    bBytes(2) = 69
    bBytes(3) = 83
    bBytes(4) = 84
    ' Write the raw bytes, then read them back as Big5 text into a UTF-16 String.
    With New ADODB.Stream
        .Open
        .Type = adTypeBinary
        .Write bBytes
        .Position = 0
        .Type = adTypeText
        .Charset = "big5"
        sString = .ReadText(adReadAll)
        .Close
    End With
    ' Draw the decoded text with the Unicode (wide) GDI call.
    TextOutW hDC, 0, 0, StrPtr(sString), Len(sString)
End Sub
Five bytes but only 4 characters.
-
Re: Ansi/Unicoding Encoding Issue
Also note:
Quote:
Asc Function
The range for returns is 0 – 255 on non-DBCS systems, but –32768 – 32767 on DBCS systems.
-
Re: Ansi/Unicoding Encoding Issue
BTW: big5 was just a guess, gb2312 is just as likely.
-
Re: Ansi/Unicoding Encoding Issue
OK, so the question is: how do we get sString to hold exactly the same values as the byte array?
-
Re: Ansi/Unicoding Encoding Issue
You don’t want the exact bytes in the string.
You want the correct representation.
Dilettante does a lot to help you, but you seem to ignore it
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
Arnoutdv
You don’t want the exact bytes in the string.
You want the correct representation.
Dilettante does a lot to help you, but you seem to ignore it
I always appreciate Dilettante's input, however I'm not trying to Output/display these bytes.
I need the exact bytes in a string. Not a representation.
That's how VB behaves with English Locale systems.
-
Re: Ansi/Unicoding Encoding Issue
Have you tried this?
Code:
Dim bBytes(4) As Byte
Dim sString As String
bBytes(0) = 192
bBytes(1) = 84
bBytes(2) = 69
bBytes(3) = 83
bBytes(4) = 84
sString = bBytes
Dim xLoop As Integer
For xLoop = 1 To LenB(sString)
    Debug.Print AscB(MidB$(sString, xLoop, 1))
Next
-
Re: Ansi/Unicoding Encoding Issue
I think a lot of confusion comes from mythologies that float around, like the weird assumption that bytes and characters are the same thing.
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
dilettante
I think a lot of confusion comes from mythologies that float around, like the weird assumption that bytes and characters are the same thing.
I don't think it is mythology but a logical assumption for the ones who still don't know.
I bet that once you also thought they were the same. Tell me that I'm wrong.
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
Eduardo-
Tell me that I'm wrong.
When I went to school we were taught about and programmed on several very different computers and learned about others.
1. had memory organized as BCD digits with a flag bit (5 bits per digit), no bytes. There, characters were two digits: 00 through 99.
2. had memory organized as 8-bit bytes, each character was 8 bits but encoded in EBCDIC most of the time (though 7-bit ASCII could also be accommodated).
3. had memory organized as 60-bit words, characters were 6 bits wide packed 10 to a word.
4. had 48-bit words, characters were 8-bit EBCDIC or ASCII packed 6 to a word or 6-bit BCL packed 8 to a word.
5. (which we only learned about) had 12-bit words and mostly stuffed 7-bit ASCII into the low bits, though you could also use characters made of 6-bit values packed two per word.
2 and 4 of those are actually still in use today. Schemes have been adopted to handle ANSI code pages, UTF-8, and UTF-16 on both over the years. First in software and later through the help of new op codes.
Sure, that all goes back a very long time. Certainly before computers became common, well before PCs were common.
So yeah, I was never confused between bytes and characters. But neither should anyone else be. Windows has been Unicode-based since NT 3.1 in 1993, though Win9x was ANSI and needed additional support for Unicode (unicows.dll, part of the VB5/6 runtimes, etc.).
-
Re: Ansi/Unicoding Encoding Issue
There are some problems with Asc and Chr functions with some characters in some locales.
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
some1uk03
I'm not trying to Output/display these bytes.
I need the exact bytes in a string. Not a representation.
As already mentioned by others, the direct assignments work without any conversions:
SomeString = SomeByteArray 'assign the exact ByteContent to a VB-StringVariable (without conversion)
SomeOtherByteArray = SomeString 'assign the StringContent to a ByteArray (without conversion)
There will be no "locale" involved in the two operations above.
Olaf
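A small sketch of that round trip, for anyone who wants to check it on their own locale. The Sub and variable names are made up for the example; it only relies on plain assignment plus LenB and UBound.
Code:
Private Sub DemoRawRoundTrip()
    Dim bIn() As Byte
    Dim bOut() As Byte
    Dim sRaw As String
    Dim i As Long
    ReDim bIn(0 To 4)
    bIn(0) = 192
    bIn(1) = 84
    bIn(2) = 69
    bIn(3) = 83
    bIn(4) = 84
    ' Raw copy: the bytes become the String's buffer, no codepage involved.
    sRaw = bIn
    ' Raw copy back: the String's buffer becomes a new byte array.
    bOut = sRaw
    ' Compare the byte count of the String with the size of the array copied back.
    Debug.Print LenB(sRaw), UBound(bOut) - LBound(bOut) + 1
    ' List the bytes that came back so they can be compared with the input.
    For i = LBound(bOut) To UBound(bOut)
        Debug.Print bOut(i);
    Next
    Debug.Print
End Sub
Keep in mind that such a String is only a container for the bytes: Len, Mid$ and Asc still treat every 2 bytes as one character, which is why the byte-oriented LenB, MidB$ and AscB are the functions to use on it.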
-
Re: Ansi/Unicoding Encoding Issue
Some1uk03 what do you need to do with data?
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
some1uk03
I need the exact bytes in a string. Not a representation.
That's how VB behaves with English Locale systems.
Assigning a string to a byte array gives you the "exact bytes", but each symbol occupies 2 bytes, which might not be what you need.
If you have been dealing with one-byte-per-symbol byte arrays, then some kind of transcoding must have been going on using the English locale.
Transcoding happens with StrConv: the vbUnicode/vbFromUnicode option determines the direction, and the optional third parameter selects the locale, which defaults to the current user locale. It maps 2-byte Unicode wide chars to a 1-byte or multi-byte ANSI representation and vice versa.
cheers,
</wqw>
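For illustration only: if what you need is one String character per byte regardless of locale, a sketch like the following avoids codepages entirely by widening each byte with ChrW$. The function name is hypothetical, and it assumes the array passed in is not empty.
Code:
' Hypothetical helper: builds a String in which character i has the
' code point of byte i (0 to 255), independent of any locale setting.
Private Function BytesToCharPerByte(bBytes() As Byte) As String
    Dim i As Long
    Dim sOut As String
    ' Pre-size the result: one character per input byte.
    sOut = String$(UBound(bBytes) - LBound(bBytes) + 1, vbNullChar)
    For i = LBound(bBytes) To UBound(bBytes)
        ' ChrW$ takes a Unicode code point, so no ANSI codepage is consulted.
        Mid$(sOut, i - LBound(bBytes) + 1, 1) = ChrW$(bBytes(i))
    Next
    BytesToCharPerByte = sOut
End Function
The values would then have to be read back with AscW rather than Asc, because Asc converts through the ANSI codepage again and brings the locale dependency back.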
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
dilettante
Have you tried this?
Code:
Dim bBytes(4) As Byte
Dim sString As String
bBytes(0) = 192
bBytes(1) = 84
bBytes(2) = 69
bBytes(3) = 83
bBytes(4) = 84
sString = bBytes
Dim xLoop As Integer
For xLoop = 1 To LenB(sString)
    Debug.Print AscB(MidB$(sString, xLoop, 1))
Next
OK, so quite a learning curve there. Using the AscB/MidB functions does return the same bytes.
Quote:
Originally Posted by
wqweto
If you have been dealing with a one byte per symbol byte-arrays then this must have had some kind of transcoding going on using English locale.
</wqw>
One byte per symbol is how the English locale is behaving by default.
So how do I proceed from here? Always work with byte arrays? (That is a no-go, as the app is too large to change it all now.)
Is there a way to convert DBCS to non-DBCS?
I understand the problem and how it's handling the strings, but I still can't see a solution/fix.
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
Arnoutdv
Some1uk03 what do you need to do with data?
If you can answer this, then maybe some of us will be able to help you further.
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
some1uk03
Is there a way to convert DBCS to non-DBCS.
That's what I was doing back in post #14. You can also do this calling MultiByteToWideChar() directly, passing the correct code page value.
I suspect there is no answer though without a ton of rewriting. "A character is a byte" is a deep fallacy, a hole that can be hard to dig out of.
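For reference, a minimal sketch of the direct API route. The helper name BytesToString is made up for the example, it assumes a non-empty array, and the code page you pass is something you have to know about the file (for example 950 for Big5 or 936 for Simplified Chinese GBK); nothing here can detect it for you.
Code:
' Use Private in a form or class module; in a standard module it may be Public.
Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, _
    ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, _
    ByVal cchWideChar As Long) As Long
' Hypothetical helper: decodes bBytes using the given code page.
Private Function BytesToString(bBytes() As Byte, ByVal CodePage As Long) As String
    Dim cb As Long
    Dim cch As Long
    Dim sOut As String
    cb = UBound(bBytes) - LBound(bBytes) + 1
    ' First call with no output buffer: returns the number of wide characters needed.
    cch = MultiByteToWideChar(CodePage, 0, VarPtr(bBytes(LBound(bBytes))), cb, 0, 0)
    If cch > 0 Then
        sOut = String$(cch, vbNullChar)
        ' Second call: decode straight into the String's buffer.
        MultiByteToWideChar CodePage, 0, VarPtr(bBytes(LBound(bBytes))), cb, StrPtr(sOut), cch
    End If
    BytesToString = sOut
End Function
Called as sString = BytesToString(bBytes, 950), this should decode the same way as the "big5" stream example from earlier in the thread, so the result depends only on the code page you pass and not on the user's regional settings.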
-
Re: Ansi/Unicoding Encoding Issue
Quote:
Originally Posted by
Arnoutdv
If you can answer this, then maybe some of us will be able to help you further.
I'm reading a proprietary file as a byte array, then passing it to a String, and from there on parsing/reading various chunks and populating a class object with the settings and parameters read from this string. (It goes much deeper than this, so I can't just easily convert everything to byte arrays.)
Quote:
Originally Posted by
dilettante
That's what I was doing back in post #14. You can also do this calling MultiByteToWideChar() directly, passing the correct code page value.
I suspect there is no answer though without a ton of rewriting. "A character is a byte" is a deep fallacy, a hole that can be hard to dig out of.
MultiByteToWideChar() is what I'm currently using anyway, rather than strConv, but that's no good either!
I'm quite surprised that there is no way forward here other than forcing users to change their system locale to English, or a total rewrite (which is not preferred).
-
Re: Ansi/Unicoding Encoding Issue
Uh no, if it’s all about byte arrays then using strings in all your objects is the wrong approach
-
Re: Ansi/Unicoding Encoding Issue
OK, I edited some mistakes that I made.