-
Sep 3rd, 2015, 09:27 PM
#1
Thread Starter
Frenzied Member
[RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM
I wrote an ExportToCSV function to export Grid content to CSV format. For better Unicode support, the CSV file can be save in UTF8 or UTF16 (L or B) or UTF7 (not so good, Chrome even not support UTF7) or UTF32(L or B). My question is:
1. Does it break the CSV Format standard? My answer is yes, native CSV is ANSI w/o BOM.
2. Should I put BOM?
I did some experimental in Excel. All have BOM signature.
a. UTF32BE: Excel Can load and has CSV format but got a question mark at the front of the first row from BOM signature.
b. UTF32LE: Excel lost csv format, but can load.
c. UTF16BE: Excel Can load and has CSV format but got a question mark at the front of the first row from BOM signature.
d. UTF16LE: Excel lost csv format, but can load.
e. UTF8 : OK
f. UTF7 : Excel Can load and has CSV format but got "+/v8" mark at the front of the first row from BOM signature.
So, with BOM, UTF8 got the best support for MS Excel and Ms NotePad and it is my best choice. But for some software with poor or less unicode or BOM consideration, they may get problem to parse. I got into a dilemma.
Refer to attachment for summary.
Last edited by Jonney; Sep 4th, 2015 at 04:30 AM.
-
Sep 4th, 2015, 04:16 AM
#2
Thread Starter
Frenzied Member
Re: CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM
OK. For flexibility, I made two options.
Code:
'encodations:
Public Enum EncodingFormat
UEF_ANSI = &H1& 'ANSI+DBCS
UEF_UTF7 = &H2& 'UTF7 '&H38 76 2F 2B (5 types V8/V9/V+/V/V8-)
UEF_UTF8 = &H3& 'UTF-8
UEF_UTF16LE = &H4& 'UTF-16LE
UEF_UTF16BE = &H5& 'UTF-16BE
UEF_UTF32LE = &H6& 'UTF-32LE
UEF_UTF32BE = &H7& 'UTF-32BE
UEF_Auto = &H0& 'Auto Recognize Text File Format
End Enum
Public Enum BOMOption
NoBOM = 0
WriteBOM = 1
End Enum
Public Function ExportToCSV(ByVal FileName As String, Optional ExportFixedRow As Boolean = False, Optional ByVal ExportFixedColumn As Boolean = False, _
Optional ByVal Encodation As EncodingFormat = UEF_UTF8, Optional ByVal WriteBOMOption As BOMOption = WriteBOM) As Boolean
'...
End Function
I mark this thread resolved even though I am not so satisfied with two options, which could confuse users, but just serviced the application-specific purpose.
Reliance on the standard documented by RFC 4180 can simplify CSV exchange. However, this standard only specifies handling of text-based fields. Interpretation of the text of each field is still application-specific.
Last edited by Jonney; Sep 4th, 2015 at 04:26 AM.
-
Sep 4th, 2015, 04:35 PM
#3
Re: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM
The problem is no "CSV standard" exists.
Almost everything about CSV files is subject to variation, and in some locales comma (",") isn't even used but instead semicolon (";") gets substituted because the comma is so frequently used within number values as the decimal point.
There is a "lowest common denominator" CSV format: 7-bit ASCII text, CRLF row delimiters, comma field delimiters, no commas allowed within field values. But this has obvious issues as soon as requirements deviate beyond its limitations.
What if you need other characters, or commas in text or numbers, or CRLF isn't native to your platform and you want to use just LF? Then you have to begin modifying what you mean by "CSV" format.
Maybe you move to an ANSI encoding. What codepage? Or a Unicode encoding. Then you have the BOM vs. no-BOM problem.
Maybe you quote fields containing commas. What if the fields can have quotes too?
How about boolean values? Use 1/0? Use True/False or local language equivalent?
What about date values?
So it can all fall apart pretty quickly.
-
Sep 4th, 2015, 06:25 PM
#4
Fanatic Member
Re: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM
I live in country, where comma is locale decimal point, but sometimes dot, i mean point is used also - aand have learned throught the years, that 0x9 (tab) character works best as a field separator in .csv files.
-
Sep 4th, 2015, 07:20 PM
#5
Thread Starter
Frenzied Member
Re: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM
Originally Posted by dilettante
The problem is no "CSV standard" exists.
Almost everything about CSV files is subject to variation, and in some locales comma (",") isn't even used but instead semicolon (";") gets substituted because the comma is so frequently used within number values as the decimal point.
There is a "lowest common denominator" CSV format: 7-bit ASCII text, CRLF row delimiters, comma field delimiters, no commas allowed within field values. But this has obvious issues as soon as requirements deviate beyond its limitations.
What if you need other characters, or commas in text or numbers, or CRLF isn't native to your platform and you want to use just LF? Then you have to begin modifying what you mean by "CSV" format.
Maybe you move to an ANSI encoding. What codepage? Or a Unicode encoding. Then you have the BOM vs. no-BOM problem.
Maybe you quote fields containing commas. What if the fields can have quotes too?
How about boolean values? Use 1/0? Use True/False or local language equivalent?
What about date values?
So it can all fall apart pretty quickly.
1. I used fixed comma as field separator, If content got comma, use Double quotation mark.
An example:
"c:\program files\my app\app.exe"
escapedString = """c:\program files\my app\app.exe"""
An example:
123,456.09
escapedString = "123,456.09"
2. For better Unicode support, I made options for encoding. For ANSI, the SaveStringToTextFile function used current default Codepage.
For UTF8 and others, We can choose whether put BOM or not.
-
Sep 4th, 2015, 09:26 PM
#6
Re: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM
That's a big fail, because quoting a field like "123,456.09" makes it a String value. Many programs that import CSV data auto-sense field data types and this would break them.
-
Sep 4th, 2015, 10:01 PM
#7
Thread Starter
Frenzied Member
Re: [RESOLVED] CSV File Format: UTF8 vs ANSI vs UTF16 w/o BOM
Originally Posted by dilettante
That's a big fail, because quoting a field like "123,456.09" makes it a String value. Many programs that import CSV data auto-sense field data types and this would break them.
My Cell Type is String, so "123,456.09" with double quotation is OK.
If the cell Mask type is numeric, the 123,456.09 will come to 123456.09 without DQ.
Last edited by Jonney; Sep 4th, 2015 at 10:07 PM.
Tags for this Thread
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|