Thread: Ansi/Unicoding Encoding Issue

  1. #1

    Thread Starter
    Frenzied Member some1uk03's Avatar
    Join Date
    Jun 2006
    Location
    London, UK
    Posts
    1,675

    Question Ansi/Unicoding Encoding Issue

    I've got a bit of an issue with reading a file correctly on non-English PCs.

    I've just changed my regional settings to Chinese. In a hex viewer, some characters in the file now appear to be replaced, as the code page has changed.
    So the issue is getting the correct code page set so that VB reads the file properly.

    MultiByteToWideChar is way off...
    StrConv is way off...

    Even using the code page returned by GetACP is still off.

    Any ideas?
    _____________________________________________________________________

    ----If this post has helped you. Please take time to Rate it.
    ----If you've solved your problem, then please mark it as RESOLVED from Thread Tools.



  2. #2
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    6,733

    Re: Ansi/Unicoding Encoding Issue

    What kind of file are you talking about?
    One written by your application?

  3. #3

    Thread Starter
    Frenzied Member some1uk03's Avatar
    Join Date
    Jun 2006
    Location
    London, UK
    Posts
    1,675

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by Arnoutdv View Post
    What kind of file are you talking about?
    One written by your application?
    No. It's exported/created by another app.



  4. #4
    PowerPoster dilettante's Avatar
    Join Date
    Feb 2006
    Posts
    24,487

    Re: Ansi/Unicoding Encoding Issue

    Let's assume your hex viewer isn't busted itself.

    What encoding is the file written with? For Chinese it could be a number of things.

    So much depends on the writing program.

  5. #5
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    6,733

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by some1uk03 View Post
    No. It's exported/created by another app.
    Is it a text file?
    If it is then Textpad or Notepad++ should be able to read it and show what encoding is used.
    UTF8/16, Unicode or whatever.

  6. #6
    PowerPoster
    Join Date
    Feb 2017
    Posts
    5,666

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by some1uk03 View Post
    So the issue is, getting the correct codepage set in order for VB to read properly.
    I think you need to know in advance which code page an ANSI file was written in.

    Perhaps with statistical analysis of the content it could be guessed, but to my understanding that's not what programs do normally.

  7. #7

    Thread Starter
    Frenzied Member some1uk03's Avatar
    Join Date
    Jun 2006
    Location
    London, UK
    Posts
    1,675

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by dilettante View Post
    Let's assume your hex viewer isn't busted itself.

    What encoding is the file written with? For Chinese it could be a number of things.

    So much depends on the writing program.
    I have an example of the output from the English locale version to compare with.
    The default view in the hex viewer is ANSI; however, if I change the encoding to Chinese Simplified, then it matches the English locale version.



    Quote Originally Posted by Arnoutdv View Post
    Is it a text file?
    If it is then Textpad or Notepad++ should be able to read it and show what encoding is used.
    UTF8/16, Unicode or whatever.
    Not a text file, but a custom file format.

    As another app is outputting the data, I'm not sure what encoding/code page it is written in, but presumably the default system code page.



  8. #8
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Location
    Sofia, Bulgaria
    Posts
    6,167

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by some1uk03 View Post
    . . . but presumably the default system codepage.
    There is a default user code page and a default system code page. When you use StrConv to convert from/to Unicode, it uses the default *user* code page by default.

    You have to explicitly pass LOCALE_SYSTEM_DEFAULT = &H800 for the LocaleID parameter (the optional third one) to use the default *system* code page.

    This will probably not solve your issue at all, because it seems you don't have a clear definition of "correct" in "issue with reading a file correctly on NON English PC's."

    cheers,
    </wqw>
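
    A minimal sketch of the LocaleID point above, reusing the five sample bytes that appear later in this thread. The constant value is the one quoted in the post; the procedure name is made up for illustration. What the two calls return depends on how the machine's user and system locales are configured, so no output is shown.

    ```vb
    Option Explicit

    ' Value quoted in the post above.
    Private Const LOCALE_SYSTEM_DEFAULT As Long = &H800

    ' Hypothetical demo procedure, not part of the original thread.
    Private Sub DemoStrConvLocale()
        Dim bBytes(4) As Byte
        Dim sDefault As String
        Dim sSystem As String

        bBytes(0) = 192
        bBytes(1) = 84
        bBytes(2) = 69
        bBytes(3) = 83
        bBytes(4) = 84

        ' No LocaleID argument: StrConv chooses the locale itself.
        sDefault = StrConv(bBytes, vbUnicode)

        ' Explicit LocaleID: request the system default locale's code page.
        sSystem = StrConv(bBytes, vbUnicode, LOCALE_SYSTEM_DEFAULT)

        ' On a DBCS locale the character counts can be lower than the byte count.
        Debug.Print Len(sDefault), Len(sSystem)
    End Sub
    ```

    Either way a transcoding step still takes place; the third argument only changes which code page drives it.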

  9. #9

    Thread Starter
    Frenzied Member some1uk03's Avatar
    Join Date
    Jun 2006
    Location
    London, UK
    Posts
    1,675

    Re: Ansi/Unicoding Encoding Issue

    The output from the other app is correct.
    My reading of the file into a byte array is also correct!
    The problem arises when converting the byte array into a String: a conversion happens there, when none is required!

    All of the following go through a conversion:
    Code:
    sString = bBytes                              'Conversion <<
    sString = StrConv(bBytes, vbUnicode)          'Conversion <<
    MultiByteToWideChar has the same results as StrConv() too.
    CopyMemory has the same results too.

    So the question is: isn't there a way to convert the byte array to a string without any conversion taking place?



  10. #10
    PowerPoster dilettante's Avatar
    Join Date
    Feb 2006
    Posts
    24,487

    Re: Ansi/Unicoding Encoding Issue

    If you assign the value of a dynamic Byte array to a String there is no transcoding ("conversion") performed.

    It seems that the data on disk has been encoded as ANSI using some code page, and what you are really after is transcoding from that to UTF-16LE ("Unicode") and you aren't using the code page it was encoded in.

    But we don't even know that. It seems very likely that your problem is that you are trying to display this stuff in an ANSI control that uses a different encoding, yielding scrambled results.

    Or, for all we know, the data in the file is UTF-8 and you are expecting this to magically be transcoded properly.


    One thing that might help get to a solution could be to show us a sample of the data and what you expect its interpretation as text to look like.

  11. #11

    Thread Starter
    Frenzied Member some1uk03's Avatar
    Join Date
    Jun 2006
    Location
    London, UK
    Posts
    1,675

    Re: Ansi/Unicoding Encoding Issue

    Here's a simple way of replicating the issue:

    Code:
    Dim bBytes(4) As Byte
    Dim sString As String
    
    
    bBytes(0) = 192
    bBytes(1) = 84
    bBytes(2) = 69
    bBytes(3) = 83
    bBytes(4) = 84
    
    
    sString = StrConv(bBytes, vbUnicode)
    
    Dim xLoop As Integer
    For xLoop = 1 To Len(sString)
        Debug.Print Asc(Mid$(sString, xLoop, 1))
    Next
    Debug.Print does not output 192, 84, 69, 83, 84.

    It seems the 192 gets converted to something way off, depending on the system code page.



  12. #12
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    6,733

    Re: Ansi/Unicoding Encoding Issue

    No, because you specify a conversion.
    sString = bBytes should give the correct bytes, but maybe not the correct text.

  13. #13

    Thread Starter
    Frenzied Member some1uk03's Avatar
    Join Date
    Jun 2006
    Location
    London, UK
    Posts
    1,675

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by Arnoutdv View Post
    No, because you specify a conversion.
    sString = bBytes should give the correct bytes, but maybe not the correct text.
    sString = bBytes does NOT give the correct result either.

    REMEMBER TO TEST THIS BY SETTING YOUR SYSTEM LOCALE TO CHINESE!



  14. #14
    PowerPoster dilettante's Avatar
    Join Date
    Feb 2006
    Posts
    24,487

    Re: Ansi/Unicoding Encoding Issue

    Clearly there is some confusion. Most likely you are unaware of how "ANSI" works with DBCS code pages.

    There is a reason why these are called multibyte encodings. For example:

    Code:
    Option Explicit
    ' Needs a project reference to Microsoft ActiveX Data Objects (for ADODB.Stream).
    
    Private Declare Function TextOutW Lib "gdi32" ( _
        ByVal hDC As Long, _
        ByVal X As Long, _
        ByVal Y As Long, _
        ByVal lpString As Long, _
        ByVal nCount As Long) As Long
    
    Private Sub Form_Load()
        Dim bBytes(4) As Byte
        Dim sString As String
    
        With Font
            .Name = "Segoe UI"
            .Size = 16
        End With
        AutoRedraw = True
    
        bBytes(0) = 192
        bBytes(1) = 84
        bBytes(2) = 69
        bBytes(3) = 83
        bBytes(4) = 84
        With New ADODB.Stream
            .Open
            .Type = adTypeBinary
            .Write bBytes
            .Position = 0
            .Type = adTypeText
            .Charset = "big5"
            sString = .ReadText(adReadAll)
            .Close
        End With
        TextOutW hDC, 0, 0, StrPtr(sString), Len(sString)
    End Sub
    [Attached image: sshot.png]

    Five bytes but only 4 characters.

  15. #15
    PowerPoster dilettante's Avatar
    Join Date
    Feb 2006
    Posts
    24,487

    Re: Ansi/Unicoding Encoding Issue

    Also note:

    Asc Function

    The range for returns is 0 – 255 on non-DBCS systems, but –32768 – 32767 on DBCS systems.
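
    To see the difference this makes, here is a small sketch (the procedure name is hypothetical) that prints both Asc and AscW for each character of a test string. Asc goes through the ANSI code page, so its result for the first character depends on the locale; AscW reads the UTF-16 code unit exactly as stored in the String.

    ```vb
    Option Explicit

    ' Hypothetical demo procedure, not part of the original thread.
    Private Sub DemoAscVersusAscW()
        Dim s As String
        Dim i As Long

        ' U+00C0 followed by plain ASCII, built without any code page lookup.
        s = ChrW$(192) & "TEST"

        For i = 1 To Len(s)
            ' Left column: locale-dependent ANSI value.
            ' Right column: the stored UTF-16 code unit.
            Debug.Print Asc(Mid$(s, i, 1)), AscW(Mid$(s, i, 1))
        Next
    End Sub
    ```

    The right-hand column should read 192, 84, 69, 83, 84 on any locale, since AscW does no code page conversion.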

  16. #16
    PowerPoster dilettante's Avatar
    Join Date
    Feb 2006
    Posts
    24,487

    Re: Ansi/Unicoding Encoding Issue

    BTW: big5 was just a guess, gb2312 is just as likely.
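
    Since the charset is a guess, it can help to make it a parameter. The sketch below wraps the ADODB.Stream steps from the earlier example in a helper; the function name is made up, and it is late-bound (CreateObject) so the ad* constants are declared locally instead of coming from an ADO reference.

    ```vb
    Option Explicit

    ' Hypothetical helper: decode raw bytes using a named charset such as "big5".
    Public Function BytesToText(ByRef bData() As Byte, ByVal sCharset As String) As String
        Const adTypeBinary As Long = 1
        Const adTypeText As Long = 2
        Const adReadAll As Long = -1

        With CreateObject("ADODB.Stream")
            .Open
            .Type = adTypeBinary
            .Write bData            ' store the bytes unchanged
            .Position = 0           ' Type can only be switched at position 0
            .Type = adTypeText
            .Charset = sCharset     ' the encoding the bytes are assumed to be in
            BytesToText = .ReadText(adReadAll)
            .Close
        End With
    End Function
    ```

    Calling it as Len(BytesToText(bBytes, "big5")) and again with "gb2312" on the same bytes is a quick way to compare how each guess splits the data into characters.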

  17. #17

    Thread Starter
    Frenzied Member some1uk03's Avatar
    Join Date
    Jun 2006
    Location
    London, UK
    Posts
    1,675

    Re: Ansi/Unicoding Encoding Issue

    OK, so the question is: how do we get sString to hold exactly the same values as the byte array?



  18. #18
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    6,733

    Re: Ansi/Unicoding Encoding Issue

    You don't want the exact bytes in the string.
    You want the correct representation.
    Dilettante is doing a lot to help you, but you seem to be ignoring it.

  19. #19

    Thread Starter
    Frenzied Member some1uk03's Avatar
    Join Date
    Jun 2006
    Location
    London, UK
    Posts
    1,675

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by Arnoutdv View Post
    You don't want the exact bytes in the string.
    You want the correct representation.
    Dilettante is doing a lot to help you, but you seem to be ignoring it.
    I always appreciate Dilettante's input, however I'm not trying to Output/display these bytes.

    I need the exact bytes in a string. Not a representation.
    That's how VB behaves with English Locale systems.



  20. #20
    PowerPoster dilettante's Avatar
    Join Date
    Feb 2006
    Posts
    24,487

    Re: Ansi/Unicoding Encoding Issue

    Have you tried this?

    Code:
        Dim bBytes(4) As Byte
        Dim sString As String
        
        bBytes(0) = 192
        bBytes(1) = 84
        bBytes(2) = 69
        bBytes(3) = 83
        bBytes(4) = 84
        
        sString = bBytes
        
        Dim xLoop As Integer
        For xLoop = 1 To LenB(sString)
            Debug.Print AscB(MidB$(sString, xLoop, 1))
        Next

  21. #21
    PowerPoster dilettante's Avatar
    Join Date
    Feb 2006
    Posts
    24,487

    Re: Ansi/Unicoding Encoding Issue

    I think a lot of confusion comes from mythologies that float around, like the weird assumption that bytes and characters are the same thing.

    [Attached image: bytes is bytes b.png]

  22. #22
    PowerPoster
    Join Date
    Feb 2017
    Posts
    5,666

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by dilettante View Post
    I think a lot of confusion comes from mythologies that float around, like the weird assumption that bytes and characters are the same thing.

    [Attached image: bytes is bytes b.png]
    I don't think it is mythology but a logical assumption for the ones who still don't know.

    I bet that once you also thought they were the same. Tell me that I'm wrong.

  23. #23
    PowerPoster dilettante's Avatar
    Join Date
    Feb 2006
    Posts
    24,487

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by Eduardo- View Post
    Tell me that I'm wrong.
    When I went to school we were taught about and programmed on several very different computers and learned about others.

    1. had memory organized as BCD digits with a flag bit (5 bits per digit), no bytes. There, characters were 2 digits: 00 through 99.

    2. had memory organized as 8-bit bytes, each character was 8 bits but encoded in EBCDIC most of the time (though 7-bit ASCII could also be accommodated).

    3. had memory organized as 60-bit words, characters were 6 bits wide packed 10 to a word.

    4. had 48-bit words, characters were 8-bit EBCDIC or ASCII packed 6 to a word or 6-bit BCL packed 8 to a word.

    5. (one we only learned about) had 12-bit words and mostly stuffed 7-bit ASCII into the low bits, though you could use characters made of two 6-bit values packed per word.

    2 and 4 of those are actually still in use today. Schemes have been adopted to handle ANSI code pages, UTF-8, and UTF-16 on both over the years. First in software and later through the help of new op codes.


    Sure, that all goes back a very long time. Certainly before computers became common, well before PCs were common.

    So yeah, I was never confused between bytes and characters. But neither should anyone else be. Windows has been Unicode-based since NT 3.1 in 1993, though Win9x was ANSI and needed additional support for Unicode (unicows.dll, part of the VB5/6 runtimes, etc.).

  24. #24
    PowerPoster
    Join Date
    Feb 2017
    Posts
    5,666

    Re: Ansi/Unicoding Encoding Issue

    There are some problems with Asc and Chr functions with some characters in some locales.

  25. #25
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,454

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by some1uk03 View Post
    I'm not trying to Output/display these bytes.

    I need the exact bytes in a string. Not a representation.
    As already mentioned by others, the direct assignments work without any conversions:

    SomeString = SomeByteArray 'assign the exact ByteContent to a VB-StringVariable (without conversion)

    SomeOtherByteArray = SomeString 'assign the StringContent to a ByteArray (without conversion)

    There will be no "locale" involved in the two operations above.

    Olaf
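
    A short sketch of that round trip, with a hypothetical procedure name. It uses a dynamic array and pads the sample to six bytes (an assumption made here so the data fills whole 2-byte characters). Note that Len counts 2-byte characters while LenB counts bytes.

    ```vb
    Option Explicit

    ' Hypothetical demo procedure, not part of the original thread.
    Private Sub DemoRoundTrip()
        Dim bIn() As Byte
        Dim bOut() As Byte
        Dim s As String
        Dim i As Long

        ReDim bIn(0 To 5)
        bIn(0) = 192
        bIn(1) = 84
        bIn(2) = 69
        bIn(3) = 83
        bIn(4) = 84
        bIn(5) = 0          ' padding byte, added for this sketch only

        s = bIn             ' raw copy of the bytes into the String buffer
        bOut = s            ' raw copy back out into a byte array

        ' Byte count versus 2-byte character count.
        Debug.Print LenB(s), Len(s)

        ' The bytes come back in the same order and with the same values.
        For i = LBound(bOut) To UBound(bOut)
            Debug.Print bOut(i);
        Next
        Debug.Print
    End Sub
    ```

    The String here holds three 2-byte characters, not six one-byte symbols, which is why Len, Mid$ and Asc do not line up with byte positions.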

  26. #26
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    6,733

    Re: Ansi/Unicoding Encoding Issue

    Some1uk03, what do you need to do with the data?

  27. #27
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Location
    Sofia, Bulgaria
    Posts
    6,167

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by some1uk03 View Post
    I need the exact bytes in a string. Not a representation.
    That's how VB behaves with English Locale systems.
    Assigning a string to a byte array gives the "exact bytes", but each symbol occupies 2 bytes, which might not be what you need.

    If you have been dealing with one-byte-per-symbol byte arrays, then some kind of transcoding must have been going on using the English locale.

    Transcoding happens in StrConv, with the vbUnicode/vbFromUnicode option determining the direction and the optional third parameter giving the locale (mapping 2-byte Unicode wide chars to a 1-byte or multi-byte ANSI representation and vice versa); it uses the current default user locale by default.

    cheers,
    </wqw>
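
    If the goal really is one symbol per byte regardless of locale, one possible sketch is to skip code pages entirely and widen each byte to a UTF-16 code unit of the same value. The helper name is hypothetical, and it assumes the array has at least one element.

    ```vb
    Option Explicit

    ' Hypothetical helper: one String character per byte, same numeric value,
    ' with no ANSI code page involved.
    Public Function BytesToRawString(ByRef bData() As Byte) As String
        Dim i As Long
        Dim sOut As String

        ' Pre-size the result so the loop does not concatenate repeatedly.
        sOut = String$(UBound(bData) - LBound(bData) + 1, vbNullChar)

        For i = LBound(bData) To UBound(bData)
            ' ChrW$ builds the character directly from the number.
            Mid$(sOut, i - LBound(bData) + 1, 1) = ChrW$(bData(i))
        Next

        BytesToRawString = sOut
    End Function
    ```

    With such a string, Len and Mid$ positions match byte offsets, but the values must be read back with AscW rather than Asc, because Asc still consults the locale. Any text fields inside the data would still need proper decoding with the right code page before display.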

  28. #28

    Thread Starter
    Frenzied Member some1uk03's Avatar
    Join Date
    Jun 2006
    Location
    London, UK
    Posts
    1,675

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by dilettante View Post
    Have you tried this?

    Code:
        Dim bBytes(4) As Byte
        Dim sString As String
        
        bBytes(0) = 192
        bBytes(1) = 84
        bBytes(2) = 69
        bBytes(3) = 83
        bBytes(4) = 84
        
        sString = bBytes
        
        Dim xLoop As Integer
        For xLoop = 1 To LenB(sString)
            Debug.Print AscB(MidB$(sString, xLoop, 1))
        Next
    OK, so quite a learning curve there. Using the AscB/MidB$ functions does return the same bytes.


    Quote Originally Posted by wqweto View Post
    If you have been dealing with one-byte-per-symbol byte arrays, then some kind of transcoding must have been going on using the English locale.
    </wqw>
    One byte per symbol is how the English locale behaves by default.

    So how do I proceed from here? Always work with byte arrays? (That is a no-go, as the app is too large to change it all now.)

    Is there a way to convert DBCS to non-DBCS?
    I understand the problem and how it's handling the strings, but I still can't see a solution/fix.



  29. #29
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    6,733

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by Arnoutdv View Post
    Some1uk03, what do you need to do with the data?
    If you can answer this, then maybe some of us will be able to help you further.

  30. #30
    PowerPoster dilettante's Avatar
    Join Date
    Feb 2006
    Posts
    24,487

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by some1uk03 View Post
    Is there a way to convert DBCS to non-DBCS?
    That's what I was doing back in post #14. You can also do this calling MultiByteToWideChar() directly, passing the correct code page value.

    I suspect there is no answer though without a ton of rewriting. "A character is a byte" is a deep fallacy, a hole that can be hard to dig out of.
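
    A hedged sketch of the direct API route mentioned above. The helper name is made up, and the caller is assumed to know the code page the data was written in (for example 950 for Big5, 936 for Simplified Chinese GBK, 1252 for Western European). It also assumes a non-empty array.

    ```vb
    Option Explicit

    Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
        ByVal CodePage As Long, _
        ByVal dwFlags As Long, _
        ByVal lpMultiByteStr As Long, _
        ByVal cbMultiByte As Long, _
        ByVal lpWideCharStr As Long, _
        ByVal cchWideChar As Long) As Long

    ' Hypothetical helper: decode bytes with an explicit code page
    ' instead of whatever the machine's ANSI code page happens to be.
    Public Function DecodeBytes(ByRef bData() As Byte, ByVal lCodePage As Long) As String
        Dim cb As Long
        Dim cch As Long
        Dim sOut As String

        cb = UBound(bData) - LBound(bData) + 1

        ' First call: ask how many UTF-16 code units the result needs.
        cch = MultiByteToWideChar(lCodePage, 0, VarPtr(bData(LBound(bData))), cb, 0, 0)
        If cch = 0 Then Exit Function   ' conversion failed: return an empty string

        ' Second call: decode straight into the pre-sized String buffer.
        sOut = String$(cch, vbNullChar)
        MultiByteToWideChar lCodePage, 0, VarPtr(bData(LBound(bData))), cb, StrPtr(sOut), cch

        DecodeBytes = sOut
    End Function
    ```

    Passing a fixed code page makes the result the same on every machine; passing the value from GetACP simply reproduces the locale-dependent behaviour described in the first post.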

  31. #31

    Thread Starter
    Frenzied Member some1uk03's Avatar
    Join Date
    Jun 2006
    Location
    London, UK
    Posts
    1,675

    Re: Ansi/Unicoding Encoding Issue

    Quote Originally Posted by Arnoutdv View Post
    If you can answer this, then maybe some of us will be able to help you further.
    I'm reading a proprietary file as a byte array, then passing it to a String and, from there on, parsing various chunks and populating a class object with the settings and parameters read from this string. (It goes much deeper than this, so I can't just easily convert everything to byte arrays.)


    Quote Originally Posted by dilettante View Post
    That's what I was doing back in post #14. You can also do this calling MultiByteToWideChar() directly, passing the correct code page value.

    I suspect there is no answer though without a ton of rewriting. "A character is a byte" is a deep fallacy, a hole that can be hard to dig out of.
    MultiByteToWideChar() is what I'm currently using anyway, rather than StrConv, but that's no good either!


    I'm quite surprised that there is no way forward other than forcing users to change their system locale to English, or a total rewrite (which is not preferred).



  32. #32
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    6,733

    Re: Ansi/Unicoding Encoding Issue

    Uh no, if it's all about byte arrays, then using strings in all your objects is the wrong approach.

  33. #33
    PowerPoster
    Join Date
    Feb 2017
    Posts
    5,666

    Re: Ansi/Unicoding Encoding Issue

    OK, I edited some mistakes that I made.
