Thread: [RESOLVE] How to read an UTF-8 text file?

  1. #1

    Thread Starter
    Junior Member
    Join Date
    Oct 2005
    Location
    China
    Posts
    23

    [RESOLVE] How to read an UTF-8 text file?

    I wrote some Chinese words in Notepad, then saved the file with UTF-8 encoding.
    When I open it in VB, all of the Chinese words turn into wrong characters and can't be read.
    I must save it with UTF-8 encoding.
    Please help me, thank you!

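    VB Code:
    Function LoadFileEx(ByVal filename As String) As String
        Dim FileHandle As Integer
        Dim Contents As String
        FileHandle = FreeFile
        Open filename For Binary As #FileHandle
            ' Input() treats the bytes as text in the system ANSI code page, so UTF-8 is misread
            Contents = Input(LOF(FileHandle), #FileHandle) & vbCrLf
        Close #FileHandle
        LoadFileEx = Contents
    End Function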
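    ---------------------------------------
    I typed some Chinese into Notepad and saved it as a UTF-8 encoded document.
    When I open it in VB, the Chinese all turns into garbled characters and can't be read.
    It must be saved with UTF-8 encoding.
    Please help me, thank you.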
    Last edited by lichkingCN; Oct 29th, 2005 at 05:18 AM. Reason: resolve

  2. #2
    Banned dglienna
    Join Date
    Jun 2004
    Location
    Center of it all
    Posts
    17,901

    Re: How to read an UTF-8 text file?

    Try a RichTextBox instead of a Textbox. You can use Unicode.

  3. #3

    Thread Starter
    Junior Member
    Join Date
    Oct 2005
    Location
    China
    Posts
    23

    Re: How to read an UTF-8 text file?

    I tried it, but it failed. The result is just the same.

  4. #4
    Banned dglienna
    Join Date
    Jun 2004
    Location
    Center of it all
    Posts
    17,901

    Re: How to read an UTF-8 text file?

    Check the properties. You need to use the correct Font, I think.

  5. #5

    Thread Starter
    Junior Member
    Join Date
    Oct 2005
    Location
    China
    Posts
    23

    Re: How to read an UTF-8 text file?

    But... I think I have the correct font.
    I'm Chinese, so my OS is Chinese.
    I think the problem is the text file encoding.

  6. #6
    VB6, XHTML & CSS hobbyist Merri
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: How to read an UTF-8 text file?

    You have a two-part problem. First of all, VB can't directly read UTF-8 into a textbox; you'd see garbage. So you first need to convert the UTF-8 data to VB's native string format (two bytes per character, i.e. Unicode UTF-16). The Windows API provides a way to do the conversion.

    The other problem is that VB controls don't natively support Unicode. A TextBox can only contain SBCS and DBCS character sets (single-byte and double-byte character sets). So you need to set the right character set before you assign the text to the textbox, or else you will see just question marks.

    Now, to get past these problems, here you have some useful code:

    Code:
    'modCharset.bas
    Option Explicit
    
    Public Enum KnownCodePage
        CP_UNKNOWN = -1
        CP_ACP = 0
        CP_OEMCP = 1
        CP_SYMBOL = 42
    '   ARABIC
        CP_AWIN = 101   ' Bidi Windows codepage
        CP_709 = 102    ' MS-DOS Arabic Support CP 709
        CP_720 = 103    ' MS-DOS Arabic Support CP 720
        CP_A708 = 104   ' ASMO 708
        CP_A449 = 105   ' ASMO 449+
        CP_TARB = 106   ' MS Transparent Arabic
        CP_NAE = 107    ' Nafitha Enhanced Arabic Char Set
        CP_V4 = 108     ' Nafitha v 4.0
        CP_MA2 = 109    ' Mussaed Al Arabi (MA/2) CP 786
        CP_I864 = 110   ' IBM Arabic Supplement CP 864
        CP_A437 = 111   ' Ansi 437 codepage
        CP_AMAC = 112   ' Macintosh Code Page
    '   HEBREW
        CP_HWIN = 201   ' Bidi Windows codepage
        CP_862I = 202   ' IBM Hebrew Supplement CP 862
        CP_7BIT = 203   ' IBM Hebrew Supplement CP 862 Folded
        CP_ISO = 204    ' ISO Hebrew 8859-8 Character Set
        CP_H437 = 205   ' Ansi 437 codepage
        CP_HMAC = 206   ' Macintosh Code Page
    '   CODE PAGES
        CP_OEM_437 = 437
        CP_ARABICDOS = 708
        CP_DOS720 = 720
        CP_IBM850 = 850
        CP_IBM852 = 852
        CP_DOS862 = 862
        CP_IBM866 = 866
        CP_THAI = 874
        CP_JAPAN = 932
        CP_CHINA = 936
        CP_KOREA = 949
        CP_TAIWAN = 950
    '   UNICODE
        CP_UNICODELITTLE = 1200
        CP_UNICODEBIG = 1201
    '   CODE PAGES
        CP_EASTEUROPE = 1250
        CP_RUSSIAN = 1251
        CP_WESTEUROPE = 1252
        CP_GREEK = 1253
        CP_TURKISH = 1254
        CP_HEBREW = 1255
        CP_ARABIC = 1256
        CP_BALTIC = 1257
        CP_VIETNAMESE = 1258
    '   KOREAN
        CP_JOHAB = 1361
    '   MAC
        CP_MAC_ROMAN = 10000
        CP_MAC_JAPAN = 10001
        CP_MAC_ARABIC = 10004
        CP_MAC_GREEK = 10006
        CP_MAC_CYRILLIC = 10007
        CP_MAC_LATIN2 = 10029
        CP_MAC_TURKISH = 10081
    '   CODE PAGES
        CP_ASCII = 20127
        CP_RUSSIANKOI8R = 20866
        CP_RUSSIANKOI8U = 21866
        CP_ISOLATIN1 = 28591
        CP_ISOEASTEUROPE = 28592
        CP_ISOTURKISH = 28593
        CP_ISOBALTIC = 28594
        CP_ISORUSSIAN = 28595
        CP_ISOARABIC = 28596
        CP_ISOGREEK = 28597
        CP_ISOHEBREW = 28598
        CP_ISOTURKISH2 = 28599
        CP_ISOLATIN9 = 28605
        CP_HEBREWLOG = 38598
        CP_USER = 50000
        CP_AUTOALL = 50001
        CP_JAPANNHK = 50220
        CP_JAPANESC = 50221
        CP_JAPANISO = 50222
        CP_KOREAISO = 50225
        CP_TAIWANISO = 50227
        CP_CHINAISO = 50229
        CP_AUTOJAPAN = 50932
        CP_AUTOCHINA = 50936
        CP_AUTOKOREA = 50949
        CP_AUTOTAIWAN = 50950
        CP_AUTORUSSIAN = 51251
        CP_AUTOGREEK = 51253
        CP_AUTOARABIC = 51256
        CP_JAPANEUC = 51932
        CP_CHINAEUC = 51936
        CP_KOREAEUC = 51949
        CP_TAIWANEUC = 51950
        CP_CHINAHZ = 52936
    '   UNICODE
        CP_UTF7 = 65000
        CP_UTF8 = 65001
    End Enum
    
    ' Flags
    Public Const MB_PRECOMPOSED = &H1
    Public Const MB_COMPOSITE = &H2
    Public Const MB_USEGLYPHCHARS = &H4
    Public Const MB_ERR_INVALID_CHARS = &H8
    
    Public Const WC_DEFAULTCHECK = &H100                ' check for default char
    Public Const WC_COMPOSITECHECK = &H200              ' convert composite to precomposed
    Public Const WC_DISCARDNS = &H10                    ' discard non-spacing chars
    Public Const WC_SEPCHARS = &H20                     ' generate separate chars
    Public Const WC_DEFAULTCHAR = &H40                  ' replace with default char
    
    ' API
    Private Declare Function GetACP Lib "kernel32" () As Long
    Private Declare Function MultiByteToWideChar Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
    Private Declare Function WideCharToMultiByte Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpDefaultChar As Long, lpUsedDefaultChar As Long) As Long
    Public Function ANSItoUTF16(ByRef Text() As Byte, Optional ByVal cPage As KnownCodePage = CP_UNKNOWN, Optional lFlags As Long) As Byte()
        Static tmpArr() As Byte, textStr As String
        Dim tmpLen As Long, textLen As Long, A As Long
        If (Not Text) = True Then Exit Function
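        ' set code page to a valid one
        If cPage = CP_UNKNOWN Then cPage = GetACP
        ' fast path: widen each byte to two bytes (only exact for bytes that map 1:1 to
        ' Unicode code points; Windows-1252 bytes &H80-&H9F are not remapped here)
        If cPage = CP_ACP Or cPage = CP_WESTEUROPE Then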
            textLen = UBound(Text)
            tmpLen = textLen + textLen + 1
            If (Not tmpArr) = True Then ReDim Preserve tmpArr(tmpLen)
            If UBound(tmpArr) <> tmpLen Then ReDim Preserve tmpArr(tmpLen)
            For A = 0 To UBound(Text)
                tmpArr(A + A) = Text(A)
            Next A
        Else
            textStr = CStr(Text) & "|"
            textLen = LenB(textStr)
            tmpLen = textLen + textLen
            ReDim Preserve tmpArr(tmpLen + 1)
            'Debug.Print "SIZE OF TMPARR: " & tmpLen + 1
            ' get the new string to tmpArr
            tmpLen = MultiByteToWideChar(CLng(cPage), lFlags, ByVal StrPtr(textStr), -1, ByVal VarPtr(tmpArr(0)), tmpLen)
            'Debug.Print "ANSI to Unicode: " & tmpLen
            If tmpLen = 0 Then Exit Function
            tmpLen = tmpLen + tmpLen - 5
            'If tmpArr(tmpLen - 1) = 0 And tmpArr(tmpLen) = 0 Then tmpLen = tmpLen - 2
            If UBound(tmpArr) <> tmpLen Then ReDim Preserve tmpArr(tmpLen)
            'Debug.Print "SIZE OF TMPARR: " & tmpLen
        End If
        ' return the result
        ANSItoUTF16 = tmpArr
    End Function
    Public Function UTF16toANSI(ByRef Text() As Byte, Optional ByVal cPage As KnownCodePage = CP_UNKNOWN, Optional lFlags As Long) As Byte()
        Static tmpArr() As Byte
        Dim tmpLen As Long, textLen As Long, A As Long
        If (Not Text) = True Then Exit Function
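        ' set code page to a valid one
        If cPage = CP_UNKNOWN Then cPage = GetACP
        ' fast path: keep only the low byte of each UTF-16 character
        ' (anything above &HFF is truncated, not converted)
        If cPage = CP_ACP Or cPage = CP_WESTEUROPE Then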
            textLen = UBound(Text)
            tmpLen = (textLen + 1) \ 2 - 1
            If (Not tmpArr) = True Then ReDim Preserve tmpArr(tmpLen)
            If UBound(tmpArr) <> tmpLen Then ReDim Preserve tmpArr(tmpLen)
            For A = 0 To tmpLen
                tmpArr(A) = Text(A + A)
            Next A
        Else
            textLen = (UBound(Text) + 1) \ 2
            ' at maximum ANSI can be four bytes per character in new Chinese encoding GB18030–2000
            tmpLen = textLen + textLen + textLen + textLen + 1
            ReDim Preserve tmpArr(tmpLen - 1)
            ' get the new string to tmpArr
            tmpLen = WideCharToMultiByte(CLng(cPage), lFlags, ByVal VarPtr(Text(0)), textLen, ByVal VarPtr(tmpArr(0)), tmpLen, ByVal 0&, ByVal 0&)
            'Debug.Print "Unicode to ANSI: " & tmpLen
            If tmpLen = 0 Then Exit Function
            ' a hopeless try to correct a weird error?
            ReDim Preserve tmpArr(tmpLen - 1)
        End If
        ' return the result
        UTF16toANSI = tmpArr
    End Function

    These add ANSItoUTF16 and UTF16toANSI functions to your program. What they actually do is convert from one character set to another (e.g. Unicode to some common Chinese character set). The whole job takes a few steps: first, read the file (preferably into a byte array to avoid an extra conversion), then convert the UTF-8 byte array to Unicode and put the result in a string. Finally, display the result in a textbox that is set to show the correct character set.

    VB Code:
    ' a simple sample
    Dim barTemp() As Byte
    Open Filename For Binary Access Read As #1
        ' set buffer to the size of the file
        ReDim barTemp(FileLen(Filename) - 1)
        ' read file to buffer
        Get #1, , barTemp
    Close #1

    ' set TextBox character set
    ' (this will make sure the font is correct and able to display the characters)
    Text1.Font.Charset = 134
    ' you could define those as constants:
    ' Const GB2312_CHARSET = 134
    ' Const CHINESEBIG5_CHARSET = 136
    ' you can find other charsets using Google

    ' convert from UTF-8 to Unicode and assign to textbox
    ' (byte array is automatically converted to string)
    Text1.Text = ANSItoUTF16(barTemp, CP_UTF8, 0)
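
    The same read-then-convert idea can also be sketched more compactly. This is only a minimal sketch, not the module above: ReadUtf8File and UTF8_CODEPAGE are made-up names for illustration, it assumes the whole file fits in memory, and it assumes the file may start with the UTF-8 byte order mark (EF BB BF) that Notepad's UTF-8 save option has traditionally written. It calls MultiByteToWideChar twice: once to ask for the required length, once to convert.

    ```vb
    ' Put this in its own standard module (.bas)
    Option Explicit

    ' Hypothetical constant name; 65001 is the Windows code page number for UTF-8
    Private Const UTF8_CODEPAGE As Long = 65001

    Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
        ByVal CodePage As Long, ByVal dwFlags As Long, _
        ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
        ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

    ' Hypothetical helper: returns the contents of a UTF-8 file as a VB (UTF-16) string.
    ' Returns an empty string for an empty file or if the conversion fails.
    Public Function ReadUtf8File(ByVal FileName As String) As String
        Dim hFile As Integer
        Dim barData() As Byte
        Dim cbTotal As Long, iStart As Long, cbText As Long, cchText As Long
        Dim sResult As String

        cbTotal = FileLen(FileName)
        If cbTotal = 0 Then Exit Function

        ' read the raw bytes, no ANSI conversion involved
        hFile = FreeFile
        Open FileName For Binary Access Read As #hFile
        ReDim barData(0 To cbTotal - 1)
        Get #hFile, , barData
        Close #hFile

        ' skip the UTF-8 byte order mark if the file starts with one
        If cbTotal >= 3 Then
            If barData(0) = &HEF And barData(1) = &HBB And barData(2) = &HBF Then iStart = 3
        End If
        cbText = cbTotal - iStart
        If cbText = 0 Then Exit Function

        ' first call: ask how many UTF-16 characters the text needs
        cchText = MultiByteToWideChar(UTF8_CODEPAGE, 0&, VarPtr(barData(iStart)), cbText, 0&, 0&)
        If cchText = 0 Then Exit Function

        ' second call: convert straight into the string's own buffer
        sResult = String$(cchText, vbNullChar)
        MultiByteToWideChar UTF8_CODEPAGE, 0&, VarPtr(barData(iStart)), cbText, StrPtr(sResult), cchText

        ReadUtf8File = sResult
    End Function
    ```

    Usage would look something like Text1.Font.Charset = 134 followed by Text1.Text = ReadUtf8File("C:\test.txt"), where the path is just a placeholder. The TextBox charset still has to be set as in the sample above, because the control itself is not Unicode-aware.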

    Hope you get this to work


    For more information about Unicode in VB6, see this tutorial

  7. #7

    Thread Starter
    Junior Member
    Join Date
    Oct 2005
    Location
    China
    Posts
    23

    Re: How to read an UTF-8 text file?

    Thank you very much!
    It works very well!
    Thanks a lot!

  8. #8
    VB6, XHTML & CSS hobbyist Merri
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: How to read an UTF-8 text file?

    You can use Thread Tools at the top of the page and select Mark Thread Resolved from there. And great that it worked (I didn't test it myself).

  9. #9
    New Member
    Join Date
    Dec 2021
    Posts
    2

    Re: How to read an UTF-8 text file?


  10. #10
    PowerPoster Elroy
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    10,914

    Re: [RESOLVE] How to read an UTF-8 text file?

    Yaso, welcome to VBForums. You should come join us in the more recent threads, found here, rather than a 16 year old thread.

  11. #11
    New Member
    Join Date
    Dec 2021
    Posts
    2

    Re: [RESOLVE] How to read an UTF-8 text file?

    Quote Originally Posted by Elroy View Post
    Yaso, welcome to VBForums. You should come join us in the more recent threads, found here, rather than a 16 year old thread.
    You are right, but Google still gives the VBForums solution. So ...
