Results 1 to 6 of 6

Thread: Reading and Writing UTF-16 and UTF-8 Files

  1. #1

    Thread Starter
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    9,853

    Reading and Writing UTF-16 and UTF-8 Files

    Ok, here's my procrastination for the day. I've long been able to read Unicode (UTF-16) files, but I decided I also wanted to read and write UTF-8 files, so I did it. The attached "test" project is the best way to get it, but here's the essential code for the file IO. Focus specifically on the ReadAsciiOrUnicodeNotepadFile and WriteAsciiOrUnicodeNotepadFile procedures. I thought about making them Get/Let properties, but I think they're better this way. Again, don't forget that the attached ZIP has a nice demo.

    UTF8 and UTF16.zip

    Code:
    Option Explicit
    '
    Private Declare Function WideCharToMultiByte Lib "kernel32" (ByVal codepage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long
    Private Declare Function MultiByteToWideChar Lib "kernel32" (ByVal codepage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
    '
    Private Const Utf8CodePage As Long = 65001
    '
    Public Enum AsciiUnicodeEncoding
        AsciiEncode = 0
        Utf8Encode = 1
        Utf16Encode = 2
    End Enum
    '
    
    Public Function ReadAsciiOrUnicodeNotepadFile(sFileSpec As String) As String
        ' These are typically .TXT files.  They can be read with notepad.
        Dim iFle As Long
        Dim bb() As Byte
        Dim i As Integer
        Dim s As String
        '
        iFle = FreeFile
        Open sFileSpec For Binary As iFle
        If LOF(iFle) = 0 Then
            Close iFle
            Exit Function
        End If
        '
        Get iFle, , i
        Select Case i
        Case &HFEFF ' UTF16 file header.  First byte = FF, second byte = FE.
            ReDim bb(1 To LOF(iFle) - 2&)
            Get iFle, , bb
            ReadAsciiOrUnicodeNotepadFile = bb ' This directly copies the byte array to the Unicode string (no conversion).
        Case &HBBEF
            ReDim bb(1 To LOF(iFle) - 3&)
            Seek iFle, 4
            Get iFle, , bb
            ReadAsciiOrUnicodeNotepadFile = Utf8toUtf16(bb)
        Case Else ' Assume ascii.
            s = Space$(LOF(iFle))
            Seek iFle, 1
            Get iFle, , s
            ReadAsciiOrUnicodeNotepadFile = s
        End Select
        '
        Close iFle
    End Function
    
    Public Sub WriteAsciiOrUnicodeNotepadFile(sFileSpec As String, sData As String, Encoding As AsciiUnicodeEncoding)
        ' These are typically .TXT files.  They can be read with notepad.
        Dim iFle As Long
        '
        iFle = FreeFile
        Open sFileSpec For Binary As iFle
        Select Case Encoding
        Case AsciiEncode
            Put iFle, , sData
        Case Utf8Encode
            Put iFle, , CByte(&HEF)
            Put iFle, , CByte(&HBB)
            Put iFle, , CByte(&HBF)
            Put iFle, , Utf16toUtf8(sData)
        Case Utf16Encode
            Put iFle, , &HFEFF ' This is the Unicode header to a text file.  First byte = FF, second byte = FE.
            Put iFle, , Utf16ByteArrayFromString(sData)
        End Select
        Close iFle
    End Sub
    
    Public Function Utf16ByteArrayFromString(s As String) As Byte()
        ' This directly copies the Unicode string into the byte array, using two bytes per character (i.e., Unicode).
        Utf16ByteArrayFromString = s
    End Function
     
    Public Function Utf16toUtf8(s As String) As Byte()
        ' UTF-8 returned to VB6 as a byte array (zero based) because it's pretty useless to VB6 as anything else.
        Dim iLen As Long
        Dim bbBuf() As Byte
        '
        iLen = WideCharToMultiByte(Utf8CodePage, 0, StrPtr(s), Len(s), 0, 0, 0, 0)
        ReDim bbBuf(0 To iLen - 1) ' Will be initialized as all &h00.
        iLen = WideCharToMultiByte(Utf8CodePage, 0, StrPtr(s), Len(s), VarPtr(bbBuf(0)), iLen, 0, 0)
        Utf16toUtf8 = bbBuf
    End Function
     
    Public Function Utf8toUtf16(bb() As Byte) As String
        ' Incoming must be a dimensioned byte array with a UTF-8 string in it.
        Dim sBuf As String
        Dim iLen As Long
        '
        iLen = MultiByteToWideChar(Utf8CodePage, 0, VarPtr(bb(LBound(bb))), UBound(bb) - LBound(bb) + 1, 0, 0)
        sBuf = String$(iLen, 0)
        iLen = MultiByteToWideChar(Utf8CodePage, 0, VarPtr(bb(LBound(bb))), UBound(bb) - LBound(bb) + 1, StrPtr(sBuf), Len(sBuf))
        Utf8toUtf16 = sBuf
    End Function
    EDIT: This is in response to some of the following posts. If the above routine is to correctly read Unicode (UTF-16 and/or UTF-8), those files MUST have the Byte Order Marker (BOM) in the files. For UTF-16 files, they typically DO have their BOM. Many UTF-8 files also have this BOM header. If files are written by a relatively recent version of Windows Notepad, they will have these BOM markers, but there are Unicode files from sources other than notepad.

    If you wish for a routine that reads Unicode files without the BOM header (which will primarily be UTF-8 files), you may want to consider incorporating Arnoutdv's routine (in Post #3 below) into your work. For further reading on this entire issue, the following link outlines the problems well:

    http://blogs.msdn.com/b/oldnewthing/...7/2158334.aspx
    Last edited by Elroy; Oct 29th, 2014 at 10:22 PM.

  2. #2
    Fanatic Member
    Join Date
    Jan 2006
    Posts
    557

    Re: Reading and Writing UTF-16 and UTF-8 Files

    Actually this code will fail reading a UTF-8 file with no BOM.

    Notepad can detect UTF-8 even if there is no bom... the above code does not.

    Try to load the attached file here with your routine... and put the string in a text box....

    Then see the results opening this file with Notepad. The file is (the beginning of) an UTF-8 file whose 3 first bytes have been replaced with 3 spaces. Notepad can still identify the content as UTF-8. You will find plenty of UTF-8 files with no BOM out there, as I have experienced recently.

    q2.txt

    Edit... just to be sure and minimize uncertainty. I added the q3 file, same as q2, but without the 3 leading spaces. The content is indeed UTF-8 and Notepad recognizes it as such. Inspecting a file for a 2 or 3 bytes BOM and rule out UTF-8 is a too simplistic algorithm that is bound to fail in the real world.

    q3.txt
    Last edited by Navion; Oct 29th, 2014 at 07:58 AM.

  3. #3
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    5,872

    Re: Reading and Writing UTF-16 and UTF-8 Files

    The routines to detect UTF8 in a file without BOM are quite complex and hardly fool proof.
    Code:
    '---------------------------------------------------------------------------------------
    ' Procedure : ContainsUTF8
    ' DateTime  : 23-10-2012
    ' Author    : Arnoutdv
    ' Purpose   : Detect UTF8 Characters in byte array
    '           : If byte1 >= "&C2" and byte1 <= "D3" and
    '           :    byte2 >= "&80" and byte2 <= "BF" then UTF8 encoded character
    ' Source    : http://www.utf8-chartable.de/unicode-utf8-table.pl
    ' Revision  : 23-11-2012
    '           : 02-04-2014 - 3 bytes UTF8 : http://www.utf8-chartable.de/unicode-utf8-table.pl?start=2048
    '---------------------------------------------------------------------------------------
    Public Function ContainsUTF8(ByRef Source() As Byte) As Boolean
      Dim i As Long, lUBound As Long
      Dim hexC2 As Byte, hexDB As Byte, hexBF As Byte, hex80 As Byte, hexE0 As Byte
      Dim CurByte As Byte
      
      If pUTF8header(Source) Then
        ContainsUTF8 = True
      Else
        hexC2 = &HC2
        hexDB = &HDB
        hex80 = &H80
        hexBF = &HBF
        hexE0 = &HE0
        
        lUBound = UBound(Source)
        
        For i = 0 To lUBound - 1
          CurByte = Source(i)
          If CurByte >= hexC2 And CurByte <= hexDB Then
            If (Source(i + 1) And hex80) Then
              ContainsUTF8 = True
              Exit For
            End If
          End If
          If i + 2 <= lUBound Then
            If CurByte >= hexE0 Then
              If (Source(i + 1) And hex80) And (Source(i + 2) And hex80) Then
                ContainsUTF8 = True
                Exit For
              End If
            End If
          End If
        Next i
        
      End If
    End Function
    
    '---------------------------------------------------------------------------------------
    ' Procedure : pUTF8header
    ' DateTime  : 14-6-2013
    ' Author    : Arnoutdv
    ' Purpose   :
    ' Reference : http://en.wikipedia.org/wiki/Byte_order_mark
    '---------------------------------------------------------------------------------------
    Private Function pUTF8header(Source() As Byte) As Boolean
    
      If UBound(Source) >= 2 Then
        If Source(0) = &HEF Then
          If Source(1) = &HBB Then
            If Source(2) = &HBF Then
              pUTF8header = True
              Exit Function
            End If
          End If
        End If
      End If
    
    End Function

  4. #4

    Thread Starter
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    9,853

    Re: Reading and Writing UTF-16 and UTF-8 Files

    Hmmm, quite interesting, Navion. I tried your two tests and you are absolutely correct. I think I'll post a question on the main forum to clear this up. I'll also study your code, Arnoutdv.
    Last edited by Elroy; Oct 29th, 2014 at 10:05 AM.

  5. #5
    Frenzied Member VanGoghGaming's Avatar
    Join Date
    Jan 2020
    Location
    Eve Online - Mining, Missions & Market Trading!
    Posts
    1,324

    Resolved Re: Reading and Writing UTF-16 and UTF-8 Files

    Here are my functions to read and write Unicode files using the ADODB.Stream object which has the advantage of also reading file names containing Unicode characters in their path:

    Code:
    Public Enum eCodePageIds
        CP_Autodetect_All = 50001
        CP_Unicode_UTF_16_LE = 1200
        CP_Unicode_UTF_8 = 65001
        CP_Windows_1250_ANSI_Central_European_Latin_2 = 1250
        CP_Windows_1251_ANSI_Cyrillic = 1251
        CP_Windows_1252_ANSI_Western_European_Latin_1 = 1252
        CP_Windows_1253_ANSI_Greek = 1253
        CP_Windows_1254_ANSI_Turkish = 1254
        CP_Windows_1255_ANSI_Hebrew = 1255
        CP_Windows_1256_ANSI_Arabic = 1256
        CP_Windows_1257_ANSI_Baltic = 1257
        CP_Windows_1258_ANSI_Vietnamese = 1258
    End Enum
    
    Private Function GetCharset(Optional eCodePage As eCodePageIds = CP_Autodetect_All) As String
        Select Case eCodePage
            Case CP_Autodetect_All
                GetCharset = "_autodetect_all"
            Case CP_Unicode_UTF_16_LE
                GetCharset = "unicode"
            Case CP_Unicode_UTF_8
                GetCharset = "utf-8"
            Case Else
                GetCharset = "windows-" & eCodePage
        End Select
    End Function
    
    Public Sub ReadFile(sFileName As String, vData As Variant, Optional eCodePage As eCodePageIds = CP_Autodetect_All)
    Dim oStream As New ADODB.Stream
        With oStream
            Select Case VarType(vData)
                Case vbArray Or vbByte
                    .Type = adTypeBinary: .Open: .LoadFromFile sFileName: vData = .Read
                Case vbString
                    .Charset = GetCharset(eCodePage): .Type = adTypeText: .Open: .LoadFromFile sFileName: vData = .ReadText
            End Select
            If .State <> adStateClosed Then .Close
        End With
        Set oStream = Nothing
    End Sub
    
    Public Sub WriteFile(sFileName As String, vData As Variant, Optional eCodePage As eCodePageIds = CP_Autodetect_All, Optional bNoBOM As Boolean = False)
    Dim oStream As New ADODB.Stream, oStreamNoBOM As New ADODB.Stream
        With oStream
            Select Case VarType(vData)
                Case vbArray Or vbByte
                    .Type = adTypeBinary: .Open: .Write vData: .SaveToFile sFileName, adSaveCreateOverWrite
                Case vbString
                    .Charset = GetCharset(eCodePage): .Type = adTypeText: .Open: .WriteText vData
                    If bNoBOM Then
                        Select Case eCodePage
                            Case CP_Unicode_UTF_16_LE
                                .Position = 2
                            Case CP_Unicode_UTF_8
                                .Position = 3
                            Case Else
                                .Position = 0
                        End Select
                        With oStreamNoBOM
                            .Type = adTypeBinary: .Open: oStream.CopyTo oStreamNoBOM: .SaveToFile sFileName, adSaveCreateOverWrite: .Close
                        End With
                    Else
                        .SaveToFile sFileName, adSaveCreateOverWrite
                    End If
            End Select
            If .State <> adStateClosed Then .Close
        End With
        Set oStream = Nothing: Set oStreamNoBOM = Nothing
    End Sub
    You can supply strings or byte arrays as parameters. Byte arrays will be read/written in binary mode and strings will be read/written in text mode. The "WriteFile" function can write Unicode text either with or without a BOM (Byte Order Mark). By default a BOM is included. If you do not supply a CodePage parameter from the above Enumeration, the functions will try to autodetect the type of text to be read/written. If a BOM is present in the file then autodetect works fine, otherwise it's mostly hit-and-miss.

  6. #6

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width