Results 1 to 7 of 7

Thread: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM

  1. #1

    Thread Starter
    Frenzied Member
    Join Date
    Jan 2010
    Posts
    1,103

    Resolved [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM

    I wrote an ExportToCSV function to export Grid content to CSV format. For better Unicode support, the CSV file can be save in UTF8 or UTF16 (L or B) or UTF7 (not so good, Chrome even not support UTF7) or UTF32(L or B). My question is:

    1. Does it break the CSV Format standard? My answer is yes, native CSV is ANSI w/o BOM.

    2. Should I put BOM?

    I did some experimental in Excel. All have BOM signature.

    a. UTF32BE: Excel Can load and has CSV format but got a question mark at the front of the first row from BOM signature.
    b. UTF32LE: Excel lost csv format, but can load.
    c. UTF16BE: Excel Can load and has CSV format but got a question mark at the front of the first row from BOM signature.
    d. UTF16LE: Excel lost csv format, but can load.
    e. UTF8 : OK
    f. UTF7 : Excel Can load and has CSV format but got "+/v8" mark at the front of the first row from BOM signature.

    So, with BOM, UTF8 got the best support for MS Excel and Ms NotePad and it is my best choice. But for some software with poor or less unicode or BOM consideration, they may get problem to parse. I got into a dilemma.

    Refer to attachment for summary.
    Attached Images Attached Images   
    Last edited by Jonney; Sep 4th, 2015 at 04:30 AM.

  2. #2

    Thread Starter
    Frenzied Member
    Join Date
    Jan 2010
    Posts
    1,103

    Re: CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM

    OK. For flexibility, I made two options.

    Code:
    'encodations:
    Public Enum EncodingFormat
        UEF_ANSI = &H1&    'ANSI+DBCS
        UEF_UTF7 = &H2&      'UTF7 '&H38 76 2F 2B (5 types V8/V9/V+/V/V8-)
        UEF_UTF8 = &H3&      'UTF-8
        UEF_UTF16LE = &H4&  'UTF-16LE
        UEF_UTF16BE = &H5&  'UTF-16BE
        UEF_UTF32LE = &H6&   'UTF-32LE
        UEF_UTF32BE = &H7&   'UTF-32BE
    
        UEF_Auto = &H0& 'Auto Recognize Text File Format
    
    End Enum
    
    Public Enum BOMOption
       NoBOM = 0
       WriteBOM = 1
    End Enum
    
    Public Function ExportToCSV(ByVal FileName As String, Optional ExportFixedRow As Boolean = False, Optional ByVal ExportFixedColumn As Boolean = False, _
                                Optional ByVal Encodation As EncodingFormat = UEF_UTF8, Optional ByVal WriteBOMOption As BOMOption = WriteBOM) As Boolean
    
    '...
    
    End Function

    I mark this thread resolved even though I am not so satisfied with two options, which could confuse users, but just serviced the application-specific purpose.

    Reliance on the standard documented by RFC 4180 can simplify CSV exchange. However, this standard only specifies handling of text-based fields. Interpretation of the text of each field is still application-specific.
    Last edited by Jonney; Sep 4th, 2015 at 04:26 AM.

  3. #3
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM

    The problem is no "CSV standard" exists.

    Almost everything about CSV files is subject to variation, and in some locales comma (",") isn't even used but instead semicolon (";") gets substituted because the comma is so frequently used within number values as the decimal point.


    There is a "lowest common denominator" CSV format: 7-bit ASCII text, CRLF row delimiters, comma field delimiters, no commas allowed within field values. But this has obvious issues as soon as requirements deviate beyond its limitations.

    What if you need other characters, or commas in text or numbers, or CRLF isn't native to your platform and you want to use just LF? Then you have to begin modifying what you mean by "CSV" format.

    Maybe you move to an ANSI encoding. What codepage? Or a Unicode encoding. Then you have the BOM vs. no-BOM problem.

    Maybe you quote fields containing commas. What if the fields can have quotes too?

    How about boolean values? Use 1/0? Use True/False or local language equivalent?

    What about date values?


    So it can all fall apart pretty quickly.

  4. #4
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Re: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM

    I live in country, where comma is locale decimal point, but sometimes dot, i mean point is used also - aand have learned throught the years, that 0x9 (tab) character works best as a field separator in .csv files.

  5. #5

    Thread Starter
    Frenzied Member
    Join Date
    Jan 2010
    Posts
    1,103

    Re: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM

    Quote Originally Posted by dilettante View Post
    The problem is no "CSV standard" exists.

    Almost everything about CSV files is subject to variation, and in some locales comma (",") isn't even used but instead semicolon (";") gets substituted because the comma is so frequently used within number values as the decimal point.


    There is a "lowest common denominator" CSV format: 7-bit ASCII text, CRLF row delimiters, comma field delimiters, no commas allowed within field values. But this has obvious issues as soon as requirements deviate beyond its limitations.

    What if you need other characters, or commas in text or numbers, or CRLF isn't native to your platform and you want to use just LF? Then you have to begin modifying what you mean by "CSV" format.

    Maybe you move to an ANSI encoding. What codepage? Or a Unicode encoding. Then you have the BOM vs. no-BOM problem.

    Maybe you quote fields containing commas. What if the fields can have quotes too?

    How about boolean values? Use 1/0? Use True/False or local language equivalent?

    What about date values?


    So it can all fall apart pretty quickly.
    1. I used fixed comma as field separator, If content got comma, use Double quotation mark.
    An example:
    "c:\program files\my app\app.exe"
    escapedString = """c:\program files\my app\app.exe"""
    An example:
    123,456.09
    escapedString = "123,456.09"

    2. For better Unicode support, I made options for encoding. For ANSI, the SaveStringToTextFile function used current default Codepage.
    For UTF8 and others, We can choose whether put BOM or not.

  6. #6
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM

    That's a big fail, because quoting a field like "123,456.09" makes it a String value. Many programs that import CSV data auto-sense field data types and this would break them.

  7. #7

    Thread Starter
    Frenzied Member
    Join Date
    Jan 2010
    Posts
    1,103

    Re: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM

    Quote Originally Posted by dilettante View Post
    That's a big fail, because quoting a field like "123,456.09" makes it a String value. Many programs that import CSV data auto-sense field data types and this would break them.
    My Cell Type is String, so "123,456.09" with double quotation is OK.
    If the cell Mask type is numeric, the 123,456.09 will come to 123456.09 without DQ.
    Last edited by Jonney; Sep 4th, 2015 at 10:07 PM.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width