Ansi/Unicoding Encoding Issue

**some1uk03** · May 19th, 2021, 10:22 AM

I've got a bit of an issue with reading a file correctly on non-English PCs.

I've just changed my Regional settings to CHINESE. Viewing the file in HEX seems to be replacing some characters, as the codepage has changed.
So the issue is, getting the correct codepage set in order for VB to read properly.

MultiByteToWideChar is way off....
StrConv is way off...

Even calling the correct getACP setting is still off.

Any ideas?

**Arnoutdv** · May 19th, 2021, 11:05 AM

What kind of file are you talking about?
One written by your application?

**some1uk03** · May 19th, 2021, 11:38 AM

Originally Posted by Arnoutdv

What kind of file are you talking about?
One written by your application?

No. It's exported/created by another app.

**dilettante** · May 19th, 2021, 01:00 PM

Let's assume your hex viewer isn't busted itself.

What encoding is the file written with? For Chinese it could be a number of things.

So much depends on the writing program.

**Arnoutdv** · May 19th, 2021, 01:14 PM

Originally Posted by some1uk03

No. It's exported/created by another app.

Is it a text file?
If it is then Textpad or Notepad++ should be able to read it and show what encoding is used.
UTF8/16, Unicode or whatever.

**Eduardo-** · May 19th, 2021, 01:18 PM

Originally Posted by some1uk03

So the issue is, getting the correct codepage set in order for VB to read properly.

I think you need to know in advance which code page an ANSI file was written in.

Perhaps with statistical analysis of the content it could be guessed, but to my understanding that's not what programs do normally.

**some1uk03** · May 19th, 2021, 06:16 PM

Originally Posted by dilettante

Let's assume your hex viewer isn't busted itself.

What encoding is the file written with? For Chinese it could be a number of things.

So much depends on the writing program.

I have an example of the output from the English locale version to compare with.
The default view in the hex editor is ANSI; however, if I change the encoding to Chinese Simplified, then it matches the English locale version.

Originally Posted by Arnoutdv

Is it a text file?
If it is then Textpad or Notepad++ should be able to read it and show what encoding is used.
UTF8/16, Unicode or whatever.

Not a text file, but a custom file format.

As another app is outputting the data, I'm not sure what encoding/codepage they're outputting as, but presumably the default system codepage.

**wqweto** · May 21st, 2021, 03:19 AM

Originally Posted by some1uk03

. . . but presumably the default system codepage.

There is a default user codepage and a default system codepage. When you do StrConv from/to Unicode, it uses the default *user* codepage by default.

You have to explicitly pass LOCALE_SYSTEM_DEFAULT = &H800 for the LocaleID parameter (the optional third one) to use the default *system* codepage.
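A minimal sketch of that difference follows. DecodeWithLocale and DemoLocales are hypothetical names, and LCID 1033 (English, United States) is only an assumption about what "behaves like an English PC" means here; the thread does not say which locale the data really needs.

```vb
Option Explicit

Private Const LOCALE_SYSTEM_DEFAULT As Long = &H800&
Private Const LCID_EN_US As Long = 1033   ' assumption: English (United States)

' Hypothetical helper: transcode ANSI bytes to a String using the
' code page that belongs to the given locale ID.
Public Function DecodeWithLocale(bBytes() As Byte, ByVal Lcid As Long) As String
    DecodeWithLocale = StrConv(bBytes, vbUnicode, Lcid)
End Function

Public Sub DemoLocales()
    Dim bBytes(4) As Byte
    bBytes(0) = 192: bBytes(1) = 84: bBytes(2) = 69: bBytes(3) = 83: bBytes(4) = 84

    ' No LCID: the default user locale decides the code page.
    Debug.Print "user:   "; Len(StrConv(bBytes, vbUnicode))
    ' Explicit system default locale.
    Debug.Print "system: "; Len(DecodeWithLocale(bBytes, LOCALE_SYSTEM_DEFAULT))
    ' Fixed locale: the result no longer depends on regional settings.
    Debug.Print "en-US:  "; Len(DecodeWithLocale(bBytes, LCID_EN_US))
End Sub
```

The character counts printed for the first two lines depend on the machine's regional settings, which is the point of the comparison.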

This will probably not solve your issue at all because it seems you don't have a clear definition of "correct" in "issue with reading a file correctly on NON English PC's."

cheers,
</wqw>

**some1uk03** · May 23rd, 2021, 01:14 PM

The output from the other APP is correct..
My reading of the FILE in to a ByteArray is also correct!
The problem arises when converting the ByteArray into a STRING: a conversion happens there, when it doesn't require any conversion!!

All of the following go through a conversion:

Code:

sString = bBytes                              'Conversion <<
sString = StrConv(bBytes, vbFromUnicode)      'Conversion <<

MultiByteToWideChar has same results as StrConv() too.
CopyMemory has same results too.

So, the question is, isn't there a way to convert the byte array to a string without any conversion taking place?

**dilettante** · May 23rd, 2021, 01:27 PM

If you assign the value of a dynamic Byte array to a String there is no transcoding ("conversion") performed.

It seems that the data on disk has been encoded as ANSI using some code page, and what you are really after is transcoding from that to UTF-16LE ("Unicode") and you aren't using the code page it was encoded in.

But we don't even know that. It seems very likely that your problem is that you are trying to display this stuff in an ANSI control that uses a different encoding, yielding scrambled results.

Or, for all we know, the data in the file is UTF-8 and you are expecting this to magically be transcoded properly.

One thing that might help get to a solution could be to show us a sample of the data and what you expect its interpretation as text to look like.

**some1uk03** · May 23rd, 2021, 02:15 PM

Here's a simple way of replicating the issue:

Code:

Dim bBytes(4) As Byte
Dim sString As String


bBytes(0) = 192
bBytes(1) = 84
bBytes(2) = 69
bBytes(3) = 83
bBytes(4) = 84


sString = StrConv(bBytes, vbUnicode)

Dim xLoop As Integer
For xLoop = 1 To Len(sString)
    Debug.Print Asc(Mid$(sString, xLoop, 1))
Next

Debug.Print does not OUTPUT: 192,84,69,83,84

Seems like the 192 gets treated/converted way off based on the system codepage.

**Arnoutdv** · May 23rd, 2021, 02:19 PM

No, because you specify a conversion
sString = bBytes should give the correct bytes, but is maybe not the correct text

**some1uk03** · May 23rd, 2021, 02:26 PM

Originally Posted by Arnoutdv

No, because you specify a conversion
sString = bBytes should give the correct bytes, but is maybe not the correct text

sString = bBytes, does NOT give correct result either.

REMEMBER TO TEST THIS BY SETTING YOUR SYSTEM LOCALE TO CHINESE!

**dilettante** · May 23rd, 2021, 02:46 PM

Clearly there is some confusion. Most likely you are unaware of how "ANSI" works with DBCS code pages.

There is a reason why these are called multibyte encodings. For example:

Code:

Option Explicit

Private Declare Function TextOutW Lib "gdi32" ( _
    ByVal hDC As Long, _
    ByVal X As Long, _
    ByVal Y As Long, _
    ByVal lpString As Long, _
    ByVal nCount As Long) As Long

Private Sub Form_Load()
    Dim bBytes(4) As Byte
    Dim sString As String

    With Font
        .Name = "Segoe UI"
        .Size = 16
    End With
    AutoRedraw = True

    bBytes(0) = 192
    bBytes(1) = 84
    bBytes(2) = 69
    bBytes(3) = 83
    bBytes(4) = 84
    With New ADODB.Stream
        .Open
        .Type = adTypeBinary
        .Write bBytes
        .Position = 0
        .Type = adTypeText
        .Charset = "big5"
        sString = .ReadText(adReadAll)
        .Close
    End With
    TextOutW hDC, 0, 0, StrPtr(sString), Len(sString)
End Sub

[Attached image: sshot.png]

Five bytes but only 4 characters.
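The count can also be reproduced without drawing anything. The sketch below walks the bytes and treats a lead byte plus the byte after it as one character. CountDbcsChars is a hypothetical helper, and code page 950 (Big5) is the same guess as the "big5" charset above; IsDBCSLeadByteEx is the kernel32 call that reports whether a byte starts a two-byte character in a given code page.

```vb
Option Explicit

Private Declare Function IsDBCSLeadByteEx Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal TestChar As Byte) As Long

' Hypothetical helper: count characters in ANSI bytes for a given code page.
' A lead byte and the byte that follows it count as a single character.
Public Function CountDbcsChars(bBytes() As Byte, ByVal CodePage As Long) As Long
    Dim i As Long
    Dim nChars As Long

    i = LBound(bBytes)
    Do While i <= UBound(bBytes)
        If IsDBCSLeadByteEx(CodePage, bBytes(i)) <> 0 And i < UBound(bBytes) Then
            i = i + 2          ' lead byte + trail byte
        Else
            i = i + 1          ' single-byte character
        End If
        nChars = nChars + 1
    Loop
    CountDbcsChars = nChars
End Function

Public Sub DemoCount()
    Dim bBytes(4) As Byte
    bBytes(0) = 192: bBytes(1) = 84: bBytes(2) = 69: bBytes(3) = 83: bBytes(4) = 84

    Debug.Print CountDbcsChars(bBytes, 950)    ' Big5: 192 is a lead byte
    Debug.Print CountDbcsChars(bBytes, 1252)   ' Western: every byte is a character
End Sub
```

Byte 192 (&HC0) falls in the lead-byte range of the Chinese code pages, so it pairs up with the following 84, while in a single-byte code page such as 1252 it stands alone.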

**dilettante** · May 23rd, 2021, 02:49 PM

Also note:

Asc Function

The range for returns is 0 – 255 on non-DBCS systems, but –32768 – 32767 on DBCS systems.

**dilettante** · May 23rd, 2021, 03:23 PM

BTW: big5 was just a guess, gb2312 is just as likely.

**some1uk03** · May 23rd, 2021, 03:32 PM

Ok, so the question is, how do we get sString to hold the exact same values as the bytes?

**Arnoutdv** · May 23rd, 2021, 03:36 PM

You don’t want the exact bytes in the string.
You want the correct representation.
Dilettante does a lot to help you, but you seem to ignore it.

**some1uk03** · May 23rd, 2021, 04:05 PM

Originally Posted by Arnoutdv

You don’t want the exact bytes in the string.
You want the correct representation.
Dilettante does a lot to help you, but you seem to ignore it.

I always appreciate Dilettante's input, however I'm not trying to Output/display these bytes.

I need the exact bytes in a string. Not a representation.
That's how VB behaves with English Locale systems.

**dilettante** · May 23rd, 2021, 04:14 PM

Have you tried this?

Code:

    Dim bBytes(4) As Byte
    Dim sString As String
    
    bBytes(0) = 192
    bBytes(1) = 84
    bBytes(2) = 69
    bBytes(3) = 83
    bBytes(4) = 84
    
    sString = bBytes
    
    Dim xLoop As Integer
    For xLoop = 1 To LenB(sString)
        Debug.Print AscB(MidB$(sString, xLoop, 1))
    Next

**dilettante** · May 23rd, 2021, 05:44 PM

I think a lot of confusion comes from mythologies that float around, like the weird assumption that bytes and characters are the same thing.

[Attached image: bytes is bytes b.png]

**Eduardo-** · May 23rd, 2021, 06:35 PM

Originally Posted by dilettante

I think a lot of confusion comes from mythologies that float around, like the weird assumption that bytes and characters are the same thing.

I don't think it is mythology but a logical assumption for the ones who still don't know.

I bet that once you also thought they were the same. Tell me that I'm wrong.

**dilettante** · May 24th, 2021, 01:34 AM

Originally Posted by Eduardo-

Tell me that I'm wrong.

When I went to school we were taught about and programmed on several very different computers and learned about others.

1. had memory organized as BCD digits with a flag bit (5 bits per digit), no bytes. There, characters were two digits: 00 through 99.

2. had memory organized as 8-bit bytes, each character was 8 bits but encoded in EBCDIC most of the time (though 7-bit ASCII could also be accommodated).

3. had memory organized as 60-bit words, characters were 6 bits wide packed 10 to a word.

4. had 48-bit words, characters were 8-bit EBCDIC or ASCII packed 6 to a word or 6-bit BCL packed 8 to a word.

5. (which we only learned about) had 12-bit words and mostly stuffed 7-bit ASCII into the low bits, though you could use characters made of two 6-bit values packed per word.

2 and 4 of those are actually still in use today. Schemes have been adopted to handle ANSI code pages, UTF-8, and UTF-16 on both over the years. First in software and later through the help of new op codes.

Sure, that all goes back a very long time. Certainly before computers became common, well before PCs were common.

So yeah, I was never confused between bytes and characters. But neither should anyone else be. Windows has been Unicode-based since NT 3.1 in 1993, though Win9x was ANSI and needed additional support for Unicode (unicows.dll, part of the VB5/6 runtimes, etc.).

**Eduardo-** · May 24th, 2021, 02:01 AM

There are some problems with Asc and Chr functions with some characters in some locales.

**Schmidt** · May 24th, 2021, 02:20 AM

Originally Posted by some1uk03

I'm not trying to Output/display these bytes.

I need the exact bytes in a string. Not a representation.

As already mentioned by others, the direct assignments work without any conversions:

SomeString = SomeByteArray 'assign the exact ByteContent to a VB-StringVariable (without conversion)

SomeOtherByteArray = SomeString 'assign the StringContent to a ByteArray (without conversion)

There will be no "locale" involved in the two operations above.

Olaf

**Arnoutdv** · May 24th, 2021, 02:59 AM

Some1uk03, what do you need to do with the data?

**wqweto** · May 24th, 2021, 03:55 AM

Originally Posted by some1uk03

I need the exact bytes in a string. Not a representation.
That's how VB behaves with English Locale systems.

Assigning a string to a byte array gives the "exact bytes", but each symbol occupies 2 bytes, which might not be what you need.

If you have been dealing with one-byte-per-symbol byte arrays, then some kind of transcoding must have been going on using the English locale.

Transcoding happens with StrConv: the vbUnicode/vbFromUnicode option sets the direction, and the optional third parameter sets the locale (mapping 2-byte Unicode wide chars to a 1-byte or multi-byte ANSI representation and vice versa). By default it uses the current default user locale.
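If the goal is one String character per byte regardless of locale, one possible sketch is to skip StrConv entirely and widen each byte with ChrW$. BytesToBinaryString and DemoBinaryString are hypothetical names, and this assumes the downstream parsing can switch from Asc/Chr$ to AscW/ChrW$, because Asc and Chr$ still go through the ANSI code page.

```vb
Option Explicit

' Hypothetical helper: store each byte value (0..255) as one UTF-16
' code unit. No code page is consulted, so the result does not depend
' on the user or system locale.
Public Function BytesToBinaryString(bBytes() As Byte) As String
    Dim i As Long
    Dim sResult As String

    sResult = Space$(UBound(bBytes) - LBound(bBytes) + 1)
    For i = LBound(bBytes) To UBound(bBytes)
        Mid$(sResult, i - LBound(bBytes) + 1, 1) = ChrW$(bBytes(i))
    Next
    BytesToBinaryString = sResult
End Function

Public Sub DemoBinaryString()
    Dim bBytes(4) As Byte
    Dim sString As String
    Dim xLoop As Long

    bBytes(0) = 192: bBytes(1) = 84: bBytes(2) = 69: bBytes(3) = 83: bBytes(4) = 84
    sString = BytesToBinaryString(bBytes)

    For xLoop = 1 To Len(sString)
        ' AscW reads the UTF-16 code unit directly; Asc would transcode.
        Debug.Print AscW(Mid$(sString, xLoop, 1))   ' 192, 84, 69, 83, 84
    Next
End Sub
```

The resulting string is a container for byte values, not readable text: any field that really holds text still has to be decoded with the code page the writing application used.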

cheers,
</wqw>

**some1uk03** · May 24th, 2021, 05:38 AM

Originally Posted by dilettante

Have you tried this?

Code:

    Dim bBytes(4) As Byte
    Dim sString As String
    
    bBytes(0) = 192
    bBytes(1) = 84
    bBytes(2) = 69
    bBytes(3) = 83
    bBytes(4) = 84
    
    sString = bBytes
    
    Dim xLoop As Integer
    For xLoop = 1 To LenB(sString)
        Debug.Print AscB(MidB$(sString, xLoop, 1))
    Next

Ok, so quite a learning curve there. Using the AscB/MidB functions does return the same bytes.

Originally Posted by wqweto

If you have been dealing with one-byte-per-symbol byte arrays, then some kind of transcoding must have been going on using the English locale.

One byte per symbol is how the English locale is behaving by default.

So how do I proceed from here? Always work with byte arrays? (That's a no-go, as the app is too big to change it all now.)

Is there a way to convert DBCS to non-DBCS?
I understand the problem and how it's handling the strings, but I still can't see a solution/fix.

**Arnoutdv** · May 24th, 2021, 06:05 AM

Originally Posted by Arnoutdv

Some1uk03, what do you need to do with the data?

If you can answer this then maybe some of us will be able to help you further.

**dilettante** · May 24th, 2021, 06:07 AM

Originally Posted by some1uk03

Is there a way to convert DBCS to non-DBCS?

That's what I was doing back in post #14. You can also do this calling MultiByteToWideChar() directly, passing the correct code page value.

I suspect there is no answer though without a ton of rewriting. "A character is a byte" is a deep fallacy, a hole that can be hard to dig out of.
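For reference, a direct MultiByteToWideChar call with an explicit code page might look like the sketch below. DecodeBytes and DemoDecode are hypothetical names, and 936 (Simplified Chinese GBK) and 1252 (Western) are example code pages only; the right value is whatever the writing application actually used.

```vb
Option Explicit

Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, _
    ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, _
    ByVal cchWideChar As Long) As Long

' Hypothetical helper: transcode ANSI/DBCS bytes to a VB String (UTF-16)
' using an explicitly chosen code page rather than the locale default.
Public Function DecodeBytes(bBytes() As Byte, ByVal CodePage As Long) As String
    Dim cbIn As Long
    Dim cchOut As Long
    Dim sResult As String

    cbIn = UBound(bBytes) - LBound(bBytes) + 1
    ' First call: ask how many UTF-16 code units are needed.
    cchOut = MultiByteToWideChar(CodePage, 0, _
        VarPtr(bBytes(LBound(bBytes))), cbIn, 0, 0)
    If cchOut > 0 Then
        sResult = String$(cchOut, vbNullChar)
        ' Second call: fill the preallocated string buffer.
        MultiByteToWideChar CodePage, 0, _
            VarPtr(bBytes(LBound(bBytes))), cbIn, StrPtr(sResult), cchOut
    End If
    DecodeBytes = sResult
End Function

Public Sub DemoDecode()
    Dim bBytes(4) As Byte
    bBytes(0) = 192: bBytes(1) = 84: bBytes(2) = 69: bBytes(3) = 83: bBytes(4) = 84

    Debug.Print Len(DecodeBytes(bBytes, 936))    ' DBCS code page
    Debug.Print Len(DecodeBytes(bBytes, 1252))   ' single-byte code page
End Sub
```

Passing the same fixed code page on every machine makes the result independent of the regional settings, which passing CP_ACP (0) does not.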

**some1uk03** · May 24th, 2021, 07:56 AM

Originally Posted by Arnoutdv

If you can answer this then maybe some of us will be able to help you further.

I'm reading a proprietary file as a byteArray, then passing it to a STRING and from there on, parsing / reading various chunks and populating them to a class OBJ with the settings & parameters which are read from this string. (It's much deeper than this, so I can't just easily convert everything to byte arrays.)

Originally Posted by dilettante

That's what I was doing back in post #14. You can also do this calling MultiByteToWideChar() directly, passing the correct code page value.

I suspect there is no answer though without a ton of rewriting. "A character is a byte" is a deep fallacy, a hole that can be hard to dig out of.

MultiByteToWideChar() is what I'm currently using anyway, rather than StrConv, but that's no good either!

I'm quite surprised that there is no way forward here other than forcing users to change their system locale to English, or a total rewrite (which is not preferred).

**Arnoutdv** · May 24th, 2021, 10:40 AM

Uh no, if it’s all about byte arrays then using strings in all your objects is the wrong approach.

**Eduardo-** · May 24th, 2021, 12:06 PM

OK, I edited some mistakes that I made.
