-
May 19th, 2021, 10:22 AM
#1
Thread Starter
Frenzied Member
Ansi/Unicoding Encoding Issue
I've got a bit of an issue with reading a file correctly on non-English PCs.
I've just changed my regional settings to Chinese. Viewing the file in a hex viewer shows some characters being replaced, because the code page has changed.
So the issue is, getting the correct codepage set in order for VB to read properly.
MultiByteToWideChar is way off....
StrConv is way off...
Even calling the correct GetACP setting is still off.
Any ideas?
_____________________________________________________________________
----If this post has helped you. Please take time to Rate it.
----If you've solved your problem, then please mark it as RESOLVED from Thread Tools.

-
May 19th, 2021, 11:05 AM
#2
Re: Ansi/Unicoding Encoding Issue
What kind of file are you talking about?
One written by your application?
-
May 19th, 2021, 11:38 AM
#3
Thread Starter
Frenzied Member
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by Arnoutdv
What kind of file are you talking about?
One written by your application?
No. It's exported/created by another app.
-
May 19th, 2021, 01:00 PM
#4
Re: Ansi/Unicoding Encoding Issue
Let's assume your hex viewer isn't busted itself.
What encoding is the file written with? For Chinese it could be a number of things.
So much depends on the writing program.
-
May 19th, 2021, 01:14 PM
#5
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by some1uk03
No. It's exported/created by another app.
Is it a text file?
If it is then Textpad or Notepad++ should be able to read it and show what encoding is used.
UTF8/16, Unicode or whatever.
-
May 19th, 2021, 01:18 PM
#6
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by some1uk03
So the issue is, getting the correct codepage set in order for VB to read properly.
I think you need to know in advance which code page an ANSI file was written in.
Perhaps it could be guessed through statistical analysis of the content, but to my understanding that's not what programs normally do.
Last edited by Eduardo-; May 19th, 2021 at 11:52 PM.
-
May 19th, 2021, 06:16 PM
#7
Thread Starter
Frenzied Member
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by dilettante
Let's assume your hex viewer isn't busted itself.
What encoding is the file written with? For Chinese it could be a number of things.
So much depends on the writing program.
I have an example of the output from the English locale version to compare with.
The hex viewer defaults to ANSI; however, if I change the encoding to Chinese Simplified, the output matches the English locale version.
 Originally Posted by Arnoutdv
Is it a text file?
If it is then Textpad or Notepad++ should be able to read it and show what encoding is used.
UTF8/16, Unicode or whatever.
Not a text file, but a custom file format.
As another app is outputting the data, I'm not sure what encoding/code page it is using, but presumably the default system code page.
-
May 21st, 2021, 03:19 AM
#8
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by some1uk03
. . . but presumably the default system codepage.
There is a default user code page and a default system code page. When you use StrConv to convert from/to Unicode, it uses the default *user* code page unless told otherwise.
You have to explicitly pass LOCALE_SYSTEM_DEFAULT = &H800 for the LocaleID parameter (the optional third one) to use the default *system* code page.
This will probably not solve your issue at all, because it seems you don't have a clear definition of "correct" in "issue with reading a file correctly on NON English PC's."
cheers,
</wqw>
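A minimal sketch of the LocaleID parameter described in this post, using five sample bytes. The constant name LCID_EN_US and the value 1033 are assumptions added for illustration (they presume the bytes are meant as Windows-1252); only LOCALE_SYSTEM_DEFAULT = &H800 comes from the post itself.

```vb
Option Explicit

Private Const LOCALE_SYSTEM_DEFAULT As Long = &H800
' Assumption for illustration: an explicit en-US LCID, which selects code page 1252.
Private Const LCID_EN_US As Long = 1033

Private Sub DemoStrConvLocale()
    Dim bBytes(4) As Byte
    Dim sUser As String
    Dim sSystem As String
    Dim sEnUs As String

    bBytes(0) = 192: bBytes(1) = 84: bBytes(2) = 69: bBytes(3) = 83: bBytes(4) = 84

    ' No LocaleID: the default user locale's ANSI code page is used.
    sUser = StrConv(bBytes, vbUnicode)
    ' Default system locale's ANSI code page.
    sSystem = StrConv(bBytes, vbUnicode, LOCALE_SYSTEM_DEFAULT)
    ' Fixed locale, independent of the regional settings of the PC.
    sEnUs = StrConv(bBytes, vbUnicode, LCID_EN_US)

    ' The character counts can differ when a DBCS code page is in effect.
    Debug.Print Len(sUser), Len(sSystem), Len(sEnUs)
End Sub
```

Passing a fixed LCID only helps if the file really was written in that code page. Functions such as Asc and Chr still use the current ANSI code page, so code that inspects the result would likely need AscW and ChrW instead.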
-
May 23rd, 2021, 01:14 PM
#9
Thread Starter
Frenzied Member
Re: Ansi/Unicoding Encoding Issue
The output from the other app is correct.
My reading of the file into a byte array is also correct!
The problem arises when converting the byte array into a String: a conversion happens there, even though none is required!
All of the following go through a conversion:
Code:
sString = bBytes 'Conversion <<
sString = StrConv(bBytes, vbFromUnicode) 'Conversion <<
MultiByteToWideChar has same results as StrConv() too.
CopyMemory has same results too.
So, the question is, isn't there a way to convert the byte array to a string without any conversion taking place?
-
May 23rd, 2021, 01:27 PM
#10
Re: Ansi/Unicoding Encoding Issue
If you assign the value of a dynamic Byte array to a String there is no transcoding ("conversion") performed.
It seems that the data on disk has been encoded as ANSI using some code page, and what you are really after is transcoding from that to UTF-16LE ("Unicode") and you aren't using the code page it was encoded in.
But we don't even know that. It seems very likely that your problem is that you are trying to display this stuff in an ANSI control that uses a different encoding, yielding scrambled results.
Or, for all we know, the data in the file is UTF-8 and you are expecting this to magically be transcoded properly.
One thing that might help get to a solution could be to show us a sample of the data and what you expect its interpretation as text to look like.
-
May 23rd, 2021, 02:15 PM
#11
Thread Starter
Frenzied Member
Re: Ansi/Unicoding Encoding Issue
Here's a simple way of replicating the issue:
Code:
Dim bBytes(4) As Byte
Dim sString As String
bBytes(0) = 192
bBytes(1) = 84
bBytes(2) = 69
bBytes(3) = 83
bBytes(4) = 84
sString = StrConv(bBytes, vbUnicode)
Dim xLoop As Integer
For xLoop = 1 To Len(sString)
    Debug.Print Asc(Mid$(sString, xLoop, 1))
Next
Debug.Print does not OUTPUT: 192,84,69,83,84
Seems like the 192 gets treated/converted way off based on the system codepage.
-
May 23rd, 2021, 02:19 PM
#12
Re: Ansi/Unicoding Encoding Issue
No, because you specify a conversion
sString = bBytes should give the correct bytes, but is maybe not the correct text
-
May 23rd, 2021, 02:26 PM
#13
Thread Starter
Frenzied Member
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by Arnoutdv
No, because you specify a conversion
sString = bBytes should give the correct bytes, but is maybe not the correct text
sString = bBytes, does NOT give correct result either.
REMEMBER TO TEST THIS BY SETTING YOUR SYSTEM LOCALE TO CHINESE!
-
May 23rd, 2021, 02:46 PM
#14
Re: Ansi/Unicoding Encoding Issue
Clearly there is some confusion. Most likely you are unaware of how "ANSI" works with DBCS code pages.
There is a reason why these are called multibyte encodings. For example:
Code:
Option Explicit

' Requires a project reference to Microsoft ActiveX Data Objects (for ADODB.Stream).
Private Declare Function TextOutW Lib "gdi32" ( _
    ByVal hDC As Long, _
    ByVal X As Long, _
    ByVal Y As Long, _
    ByVal lpString As Long, _
    ByVal nCount As Long) As Long

Private Sub Form_Load()
    Dim bBytes(4) As Byte
    Dim sString As String

    With Font
        .Name = "Segoe UI"
        .Size = 16
    End With
    AutoRedraw = True

    bBytes(0) = 192
    bBytes(1) = 84
    bBytes(2) = 69
    bBytes(3) = 83
    bBytes(4) = 84

    ' Write the raw bytes, then read them back as text decoded from Big5.
    With New ADODB.Stream
        .Open
        .Type = adTypeBinary
        .Write bBytes
        .Position = 0
        .Type = adTypeText
        .Charset = "big5"
        sString = .ReadText(adReadAll)
        .Close
    End With

    ' Draw the UTF-16 string directly, bypassing ANSI controls.
    TextOutW hDC, 0, 0, StrPtr(sString), Len(sString)
End Sub
Five bytes but only 4 characters.
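Why five bytes become four characters can be made visible by asking Windows which bytes are DBCS lead bytes. The following is a rough sketch only: the helper name DumpLeadBytes and the code page constants 950 (Big5) and 936 (GBK) are assumptions added for illustration, and the loop is a simplified pairing that does not validate trail bytes.

```vb
Option Explicit

Private Declare Function IsDBCSLeadByteEx Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal TestChar As Byte) As Long

Private Const CP_BIG5 As Long = 950   ' assumption: Traditional Chinese
Private Const CP_GBK As Long = 936    ' assumption: Simplified Chinese

' Hypothetical helper: prints how the bytes group into characters
' under the given code page (lead byte + following byte = one character).
Private Sub DumpLeadBytes(ByVal CodePage As Long, ByRef bBytes() As Byte)
    Dim i As Long

    i = LBound(bBytes)
    Do While i <= UBound(bBytes)
        If IsDBCSLeadByteEx(CodePage, bBytes(i)) <> 0 And i < UBound(bBytes) Then
            Debug.Print "lead+trail:"; bBytes(i); bBytes(i + 1)
            i = i + 2
        Else
            Debug.Print "single:"; bBytes(i)
            i = i + 1
        End If
    Loop
End Sub
```

Calling DumpLeadBytes CP_BIG5, bBytes with the array from the example above should show the first two bytes grouped together, which is where the "missing" character goes.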
-
May 23rd, 2021, 02:49 PM
#15
Re: Ansi/Unicoding Encoding Issue
Also note:
Asc Function
The range for returns is 0 – 255 on non-DBCS systems, but –32768 – 32767 on DBCS systems.
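Because Asc goes through the current ANSI code page, a locale-independent way to inspect a String is to read its UTF-16 code units with AscW. This is a small sketch; the helper name DumpCodeUnits is made up for illustration.

```vb
Option Explicit

' Hypothetical helper: prints each UTF-16 code unit of a string in hex.
Private Sub DumpCodeUnits(ByVal sText As String)
    Dim i As Long

    For i = 1 To Len(sText)
        ' AscW returns the code unit as a signed Integer; masking with
        ' &HFFFF& widens it to 0..65535 without any code page lookup.
        Debug.Print Hex$(AscW(Mid$(sText, i, 1)) And &HFFFF&)
    Next
End Sub
```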
-
May 23rd, 2021, 03:23 PM
#16
Re: Ansi/Unicoding Encoding Issue
BTW: big5 was just a guess, gb2312 is just as likely.
-
May 23rd, 2021, 03:32 PM
#17
Thread Starter
Frenzied Member
Re: Ansi/Unicoding Encoding Issue
OK, so the question is: how do we get sString to hold exactly the same values as the byte array?
-
May 23rd, 2021, 03:36 PM
#18
Re: Ansi/Unicoding Encoding Issue
You don’t want the exact bytes in the string.
You want the correct representation.
Dilettante does a lot to help you, but you seem to ignore it.
-
May 23rd, 2021, 04:05 PM
#19
Thread Starter
Frenzied Member
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by Arnoutdv
You don’t want the exact bytes in the string.
You want the correct representation.
Dilettante does a lot to help you, but you seem to ignore it.
I always appreciate Dilettante's input, however I'm not trying to Output/display these bytes.
I need the exact bytes in a string. Not a representation.
That's how VB behaves with English Locale systems.
-
May 23rd, 2021, 04:14 PM
#20
Re: Ansi/Unicoding Encoding Issue
Have you tried this?
Code:
Dim bBytes(4) As Byte
Dim sString As String
bBytes(0) = 192
bBytes(1) = 84
bBytes(2) = 69
bBytes(3) = 83
bBytes(4) = 84
sString = bBytes
Dim xLoop As Integer
For xLoop = 1 To LenB(sString)
    Debug.Print AscB(MidB$(sString, xLoop, 1))
Next
-
May 23rd, 2021, 05:44 PM
#21
Re: Ansi/Unicoding Encoding Issue
I think a lot of confusion comes from mythologies that float around, like the weird assumption that bytes and characters are the same thing.
-
May 23rd, 2021, 06:35 PM
#22
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by dilettante
I think a lot of confusion comes from mythologies that float around, like the weird assumption that bytes and characters are the same thing.
I don't think it is mythology, but a logical assumption for those who don't know better yet.
I bet that once you also thought they were the same. Tell me that I'm wrong.
-
May 24th, 2021, 01:34 AM
#23
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by Eduardo-
Tell me that I'm wrong.
When I went to school we were taught about and programmed on several very different computers and learned about others.
1. had memory organized as BCD digits with a flag bit (5 bits per digit), no bytes. There, characters were 2 digits: 00 through 99.
2. had memory organized as 8-bit bytes, each character was 8 bits but encoded in EBCDIC most of the time (though 7-bit ASCII could also be accommodated).
3. had memory organized as 60-bit words, characters were 6 bits wide packed 10 to a word.
4. had 48-bit words, characters were 8-bit EBCDIC or ASCII packed 6 to a word or 6-bit BCL packed 8 to a word.
5. (one we only learned about) had 12-bit words and mostly stuffed 7-bit ASCII into the low bits, though you could use characters made of two 6-bit values packed per word.
Numbers 2 and 4 are actually still in use today. Schemes have been adopted to handle ANSI code pages, UTF-8, and UTF-16 on both over the years, first in software and later with the help of new op codes.
Sure, that all goes back a very long time. Certainly before computers became common, well before PCs were common.
So yeah, I was never confused between bytes and characters. But neither should anyone else be. Windows has been Unicode-based since NT 3.1 in 1993, though Win9x was ANSI and needed additional support for Unicode (unicows.dll, part of the VB5/6 runtimes, etc.).
-
May 24th, 2021, 02:01 AM
#24
Re: Ansi/Unicoding Encoding Issue
There are some problems with Asc and Chr functions with some characters in some locales.
Last edited by Eduardo-; May 24th, 2021 at 12:05 PM.
-
May 24th, 2021, 02:20 AM
#25
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by some1uk03
I'm not trying to Output/display these bytes.
I need the exact bytes in a string. Not a representation.
As already mentioned by others, the direct assignments work without any conversions:
SomeString = SomeByteArray 'assign the exact ByteContent to a VB-StringVariable (without conversion)
SomeOtherByteArray = SomeString 'assign the StringContent to a ByteArray (without conversion)
There will be no "locale" involved in the two operations above.
Olaf
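The two direct assignments can be checked with a small round trip. This is a sketch under the assumption that the bytes are only stored and retrieved, not interpreted as text; the procedure name DemoRoundTrip is made up for illustration.

```vb
Option Explicit

Private Sub DemoRoundTrip()
    Dim bIn(4) As Byte
    Dim bOut() As Byte
    Dim sRaw As String
    Dim i As Long

    bIn(0) = 192: bIn(1) = 84: bIn(2) = 69: bIn(3) = 83: bIn(4) = 84

    sRaw = bIn      ' bytes copied into the string as-is
    bOut = sRaw     ' and copied back out as-is

    ' Len counts 16-bit units, so LenB is the one that reports bytes.
    Debug.Print "LenB:"; LenB(sRaw)
    For i = LBound(bOut) To UBound(bOut)
        Debug.Print bOut(i);
    Next
    Debug.Print
End Sub
```

Note that a string filled this way packs two file bytes into each character position, so Mid$, Asc and InStr will not line up with byte offsets; the byte variants (MidB$, AscB, InStrB) are the ones that do.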
-
May 24th, 2021, 02:59 AM
#26
Re: Ansi/Unicoding Encoding Issue
Some1uk03 what do you need to do with data?
-
May 24th, 2021, 03:55 AM
#27
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by some1uk03
I need the exact bytes in a string. Not a representation.
That's how VB behaves with English Locale systems.
Assigning a string to a byte array gives the "exact bytes", but each symbol occupies 2 bytes, which might not be what you need.
If you have been dealing with one-byte-per-symbol byte arrays, then some kind of transcoding must have been going on using the English locale.
Transcoding happens in StrConv: the vbUnicode/vbFromUnicode option determines the direction, and the optional third parameter selects the locale, which defaults to the current default user locale. The conversion maps 2-byte Unicode wide chars to a 1-byte or multi-byte ANSI representation and vice versa.
cheers,
</wqw>
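If the goal is one String character per file byte regardless of locale, one possible approach (not proposed by anyone in this thread, so treat it as an assumption) is to skip the code page entirely and map each byte value 0..255 to the character with the same code point via ChrW$. The helper names below are hypothetical.

```vb
Option Explicit

' Hypothetical helper: one character per byte, code point = byte value.
' Assumes bBytes is an allocated array.
Private Function BytesToWideString(ByRef bBytes() As Byte) As String
    Dim i As Long
    Dim sOut As String

    sOut = String$(UBound(bBytes) - LBound(bBytes) + 1, vbNullChar)
    For i = LBound(bBytes) To UBound(bBytes)
        Mid$(sOut, i - LBound(bBytes) + 1, 1) = ChrW$(bBytes(i))
    Next
    BytesToWideString = sOut
End Function

' Hypothetical inverse: keeps the low 8 bits of each code unit,
' so characters above 255 would be truncated.
Private Function WideStringToBytes(ByVal sText As String) As Byte()
    Dim i As Long
    Dim bOut() As Byte

    If Len(sText) = 0 Then Exit Function
    ReDim bOut(0 To Len(sText) - 1)
    For i = 1 To Len(sText)
        bOut(i - 1) = AscW(Mid$(sText, i, 1)) And &HFF
    Next
    WideStringToBytes = bOut
End Function
```

With this mapping Len, Mid$ and InStr line up with byte offsets on any locale, but existing calls to Asc and Chr for values above 127 would still go through the ANSI code page and would need to become AscW and ChrW. The resulting string is a container for bytes, not displayable text.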
-
May 24th, 2021, 05:38 AM
#28
Thread Starter
Frenzied Member
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by dilettante
Have you tried this?
Code:
Dim bBytes(4) As Byte
Dim sString As String
bBytes(0) = 192
bBytes(1) = 84
bBytes(2) = 69
bBytes(3) = 83
bBytes(4) = 84
sString = bBytes
Dim xLoop As Integer
For xLoop = 1 To LenB(sString)
    Debug.Print AscB(MidB$(sString, xLoop, 1))
Next
OK, so quite a learning curve there. Using the AscB/MidB functions does return the same bytes.
 Originally Posted by wqweto
If you have been dealing with a one byte per symbol byte-arrays then this must have had some kind of transcoding going on using English locale.
</wqw>
One byte per symbol is how the English locale behaves by default.
So how do I proceed from here? Always work with byte arrays? (That is a no-go, as the app is too large to change it all now.)
Is there a way to convert DBCS to non-DBCS?
I understand the problem and how it's handling the strings, but I still can't see a solution/fix.
-
May 24th, 2021, 06:05 AM
#29
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by Arnoutdv
Some1uk03 what do you need to do with data?
If you can answer this then maybe some of us will be able to help you further.
-
May 24th, 2021, 06:07 AM
#30
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by some1uk03
Is there a way to convert DBCS to non-DBCS.
That's what I was doing back in post #14. You can also do this calling MultiByteToWideChar() directly, passing the correct code page value.
I suspect there is no answer though without a ton of rewriting. "A character is a byte" is a deep fallacy, a hole that can be hard to dig out of.
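For reference, a minimal sketch of calling MultiByteToWideChar with an explicit code page rather than the system default. The wrapper name DecodeBytes is hypothetical, and which code page value is "correct" (for example 950 for Big5, 936 for GB2312/GBK, 1252 for Western European) is an assumption the caller has to supply from knowledge of the file.

```vb
Option Explicit

Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, _
    ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, _
    ByVal cchWideChar As Long) As Long

' Hypothetical wrapper: decodes bBytes from the given code page to a VB string.
' Assumes bBytes is an allocated array.
Private Function DecodeBytes(ByRef bBytes() As Byte, ByVal CodePage As Long) As String
    Dim cb As Long
    Dim cch As Long
    Dim sOut As String

    cb = UBound(bBytes) - LBound(bBytes) + 1

    ' First call with no output buffer: returns the required length in characters.
    cch = MultiByteToWideChar(CodePage, 0, VarPtr(bBytes(LBound(bBytes))), cb, 0, 0)
    If cch = 0 Then Exit Function

    ' Second call: decode into a preallocated string buffer.
    sOut = String$(cch, vbNullChar)
    MultiByteToWideChar CodePage, 0, VarPtr(bBytes(LBound(bBytes))), cb, StrPtr(sOut), cch
    DecodeBytes = sOut
End Function
```

For example, DecodeBytes(bBytes, 950) decodes as Big5, while DecodeBytes(bBytes, 1252) treats every byte as a single Western European character whatever the regional settings are.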
-
May 24th, 2021, 07:56 AM
#31
Thread Starter
Frenzied Member
Re: Ansi/Unicoding Encoding Issue
 Originally Posted by Arnoutdv
If you can answer this then maybe some of us will be able to help you further.
I'm reading a proprietary file as a byte array, then passing it to a String and, from there on, parsing/reading various chunks and populating a class object with the settings and parameters read from this string. (It goes much deeper than this, so I can't just easily convert everything to byte arrays.)
 Originally Posted by dilettante
That's what I was doing back in post #14. You can also do this calling MultiByteToWideChar() directly, passing the correct code page value.
I suspect there is no answer though without a ton of rewriting. "A character is a byte" is a deep fallacy, a hole that can be hard to dig out of.
MultiByteToWideChar() is what I'm currently using anyway, rather than StrConv, but that's no good either!
I'm quite surprised that there is no way forward other than forcing users to change their system locale to English, or a total rewrite (which is not preferred).
-
May 24th, 2021, 10:40 AM
#32
Re: Ansi/Unicoding Encoding Issue
Uh no, if it’s all about byte arrays then using strings in all your objects is the wrong approach
-
May 24th, 2021, 12:06 PM
#33
Re: Ansi/Unicoding Encoding Issue
OK, I edited some mistakes that I made.