-
[RESOLVED] CSV files and a big problem with notes field!
The only way to separate records in csv files is finding CR+LF in it.
At the end of each record there is CR+LF.
In CSV files there is a field with name of (Notes).
In this terrible field there are more CR+LF (13+10) characters!
With this terrible field i can't separate every record in CSV files
I'm using Open binary command
Excel can easily open csv files even with Notes field (with more CR+LF)!!!! i don't know why i can't find a logical map in csv files to open them normally like excel !!!?
Please help me :'(
tnx :(
-
Re: CSV files and a big problem with notes field!
Very often, VB6 (and .net) programmers use line input and the split method, both of which are easy and quick to write and appropriate in many cases but not always.
Programs such as Excel probably use stream input, one character at a time and follow parsing rules that are different from the quick and easy VB6 methods. In order to be able to differentiate those memo files, I bet that the memo field at least, and probably the other ones too, are enclosed in double quotes ("). Excel parses from a double quote pairs, anything else is a comma or a CrLf.
That's the most likely explanation. Open your file in Notepad to confirm that is the case. If so.... you will have to write a one character at a time parsing routine.
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Navion
Very often, VB6 (and .net) programmers use line input and the split method, both of which are easy and quick to write and appropriate in many cases but not always.
Programs such as Excel probably use stream input, one character at a time and follow parsing rules that are different from the quick and easy VB6 methods. In order to be able to differentiate those memo files, I bet that the memo field at least, and probably the other ones too, are enclosed in double quotes ("). Excel parses from a double quote pairs, anything else is a comma or a CrLf.
That's the most likely explanation. Open your file in Notepad to confirm that is the case. If so.... you will have to write a one character at a time parsing routine.
i know i can use (line input) or (input) or....
but it's impossible because CSV structure is not so simple!
let me show you a very simple and short map of CSV files:
===============================
CSV FILE (START)
===============================
"Title","First Name","Middle Name","Last Name","Mobile Phone","E-mail Address","Notes" {CRLF}
"","Mehdi","","Jazini","091111111","[email protected]","I'm Mehdi {CRLF} I'm a programmer" {CRLF}
"","Navion","","Foster","092222222" {CRLF}
===============================
CSV FILE (END)
===============================
- in the above simple example if you count the titles, they are 7 titles.
- the first item is (mehdi jazini) which has exactly 7 items. (equal with the titles and it's standard) and completely matched with the master template.
- but if for example, you add a new contact with name of (Navion Foster) and you decide to save it without email and notes, the microsoft outlook or google decide to export it like the above example!!!! they don't add something like this at the end of the record: ("","")
- at this time you can easily split all the elements by (comma) or by (double quotes) but it's not enough!!!!
- you must split every record too not just commas or double quotes!!! every child in programming can easily split them by (input) or (line input) or even (binary) in vb6! but the real art is: you be able to split records...
- in the above example we have 3 records. but no one can split them... because there is no a static format in the above example! there is no way!!! i think more... but i couldn't find excel plan to open them in seperated rows!!!
- {CRLF} in the (Notes) field and also none standard format of every row made me crazy! :(
- The question is: why excel can open this conflicted format but we can't !!!!
-
Re: CSV files and a big problem with notes field!
The way to read is not so complicated.
Numbers starting with no CRLF or " and spaces are skipped.
Numbers ending to , or crlf
Srtings started with " and ending with "
double "" at the start is an empty string, double "" after is a chr$(34) only.
When we found crlf or cr or lf or crcr or any <32 char we can change "record" or "line"
So for each turn, we read numbers or strings maybe with some crlf inside, and maybe less or more from previus record
Excel has no two dimensional array but a database where the key interpreted as a position in displayed table, of rows and columns. So we can do without excel. Just make a class as a holder for a variant type and use "a1" for the first cell an index in a collection of those objects. So you can read the variable item csv, maybe with multi line strings, in this collection. You can recreate the collection in the screen by giving the position "e12" end if not found (skip this error with on error) just display an empty cell.
From the above learn the 3 stages...parsing data, storing. displaying. These are 3 different stages. You miss the two last?..then you have no idea how to parse...
-
Re: CSV files and a big problem with notes field!
Have you tried using ADO to open the file and access the data?
-
Re: CSV files and a big problem with notes field!
Something like this perhaps;
Code:
Private Sub Command1_Click()
Dim a As String, b As String
Dim FinalString As String
Dim InQuotes As Boolean
f = FreeFile
Open "test.csv" For Input As f
a = Input$(LOF(f), f)
Close f
For i = 1 To Len(a)
b = Mid$(a, i, 1)
Select Case b
Case """"
InQuotes = Not InQuotes
Case vbCr, vbLf
If Not InQuotes Then
FinalString = FinalString & b
End If
Case Else
FinalString = FinalString & b
End Select
Next
Close f
Text1 = FinalString
End Sub
Note you should set the Multiline property of the TextBox to True at design time
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Mehdi Jazini
- in the above example we have 3 records. but no one can split them... because there is no a static format in the above example! there is no way!!! i think more... but i couldn't find excel plan to open them in seperated rows!!!
- {CRLF} in the (Notes) field and also none standard format of every row made me crazy! :(
- The question is: why excel can open this conflicted format but we can't !!!!
Your given example is valid and well-parsable CSV - as the Excel-Importer demonstrates -
so the statement that "no one can split them" is obviously not true in case of Excel
(and for other CSV-import-routines).
Linefeeds within quoted strings are possible (and identifyable).
Also commas (the Field-Separators) are possible within such double-quoted-Fields.
And qouted parts within Fields (double-quoted or not) are possible as well.
Empty Fields at the right-end of a line can be left out, since the LineFeed is enough as a "New-Record" identifier.
CSV-Parsers which can handle these Inputs, work like a state-machine,
as basically shown in the code posted by 'Magic Ink'.
Using a File with the following text-content:
Code:
"A Title, with a "double-quoted" part","First Name","Middle Name","Last Name","Mobile Phone","E-mail Address","Notes"
"","Mehdi","","Jazini","091111111","[email protected]","I'm Mehdi
I'm a programmer"
"","Navion","","Foster","092222222"
A Title with a "double-quoted" part, First Name, Middle Name, Last Name, Mobile Phone, E-mail Address, Notes
My parser here produces the following output:
Row_Index Col_Index Field_Value
Code:
0 0 A Title, with a "double-quoted" part
0 1 First Name
0 2 Middle Name
0 3 Last Name
0 4 Mobile Phone
0 5 E-mail Address
0 6 Notes
1 0
1 1 Mehdi
1 2
1 3 Jazini
1 4 091111111
1 5 [email protected]
1 6 I'm Mehdi
I'm a programmer
2 0
2 1 Navion
2 2
2 3 Foster
2 4 092222222
3 0 A Title with a "double-quoted" part
3 1 First Name
3 2 Middle Name
3 3 Last Name
3 4 Mobile Phone
3 5 E-mail Address
3 6 Notes
Olaf
-
Re: CSV files and a big problem with notes field!
Code:
Private Sub Command1_Click()
Dim a As String, b As String
Dim FinalString As String
Dim InQuotes As Boolean
f = FreeFile
Open "test.csv" For Input As f
a = Input$(LOF(f), f)
Close f
For i = 1 To Len(a)
b = Mid$(a, i, 1)
Select Case b
Case """"
InQuotes = Not InQuotes
Case vbCr, vbLf
If Not InQuotes Then
FinalString = FinalString & b
End If
Case Else
FinalString = FinalString & b
End Select
Next
Close f
Text1 = FinalString
End Sub
Magic Ink can you describe about the red part of your code?
is my guest true: is it processing that the CRLF is in Quotes or out of Quotes?!!! is it really possible?
-
Re: CSV files and a big problem with notes field!
Code tested with the file source supplied to us by you in post #3.
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Magic Ink
Code tested with the file source supplied to us by you in post #3.
Magic Ink this is a very good idea. thank you very much. i didn't know that we can process characters (in) or (out) of Quotes.
But sometimes in a CSV file format there is no any Quote in the file... just with (data and comma and CRLF)
i want to write a code for general use...
the main problem is (Notes) field. in (Notes) field sometimes we have CRLF. and the only way to seperate ROWs is just CRLF :(
is there a way to know: is CRLF in the (Notes) field or out of it?
By this way we can separate each ROW line by line
Microsoft Excel resolved it... i don't know how :'(
-
Re: CSV files and a big problem with notes field!
Mehdi :
I see a couple of pitfalls in the example you just posted about which I prefer not to get into (they may be just typos or something)
but here is a simple line of code to change the embedded coded CrLf (in one line ONLY), replacing it with a real one
Code:
oneline$ = Replace(oneline$, "{CRLF}", vbCrLf)
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Navion
Mehdi :
I see a couple of pitfalls in the example you just posted about which I prefer not to get into (they may be just typos or something)
but here is a simple line of code to change the embedded coded CrLf (in one line ONLY), replacing it with a real one
Code:
oneline$ = Replace(oneline$, "{CRLF}", vbCrLf)
Navion!!!!!!!!!!!!!!!!!!!!
{CRLF} = chr$(13) + chr$(10) = vbCrLf
in the above example {CRLF} is just a symbol to tell programmers it's a chr$(13) + chr$(10)
-
Re: CSV files and a big problem with notes field!
Well then post an actual .CSV file to get the guesswork out of the picture.!!!!!
-
Re: CSV files and a big problem with notes field!
>But sometimes in a CSV file format there is no any Quote in the file
In which case string values must not contain any characters which upset things like vbCr, vbLf, and commas (the latter being affected by the Decimal symbol and List separator character in Windows regional settings). So it can be that a character other than the comma is used to delimit columns (often semi-colon) and that decimalised numbers can contain a comma instead of a period to indicate the decimal point.
>i want to write a code for general use...
It is generally accepted that string values are always surrounded in double quotes in csv files. Try saving an Excel Sheet as a csv file and see how the result looks in NotePad.
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Navion
Well then post an actual .CSV file to get the guesswork out of the picture.!!!!!
OK
Quote:
Originally Posted by
Magic Ink
>But sometimes in a CSV file format there is no any Quote in the file
In which case string values must not contain any characters which upset things like vbCr, vbLf, and commas (the latter being affected by the Decimal symbol and List separator character in Windows regional settings). So it can be that a character other than the comma is used to delimit columns (often semi-colon) and that decimalised numbers can contain a comma instead of a period to indicate the decimal point.
>i want to write a code for general use...
It is generally accepted that string values are always surrounded in double quotes in csv files. Try saving an Excel Sheet as a csv file and see how the result looks in NotePad.
Login to Gmail then click on the (Gmail) caption and change it to (contacts) then try to export your contact database to csv and see the result...
i think CSV format of Google is better than Outlook. i can easily work on it. even with CRLF problem because in every row of it, number of commas are equal to the title.
if in the title there are 96 commas in other rows there are 96 commas too. and it's better for me
-
Re: CSV files and a big problem with notes field!
I guess I'll ask again, have you tried ADO yet?
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
MarkT
I guess I'll ask again, have you tried ADO yet?
Sorry i forgot to reply you.
no i didn't
have the ADO a separated special engine to just open CSV files? if yes, ADO can be a good choice for this situation, otherwise it can't open terrible map of (Old Outlook CSV) files :(
can you make an example?
tnx
-
Re: CSV files and a big problem with notes field!
If you post a sample file I'd give it a try.
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
MarkT
If you post a sample file I'd give it a try.
Copy the following texts to notepad and save it as ansi
Code:
First Name,Middle Name,Last Name,Title,Suffix,Initials,Web Page,Gender,Birthday,Anniversary,Location,Language,Internet Free Busy,Notes,E-mail Address,E-mail 2 Address,E-mail 3 Address,Primary Phone,Home Phone,Home Phone 2,Mobile Phone,Pager,Home Fax,Home Address,Home Street,Home Street 2,Home Street 3,Home Address PO Box,Home City,Home State,Home Postal Code,Home Country,Spouse,Children,Manager's Name,Assistant's Name,Referred By,Company Main Phone,Business Phone,Business Phone 2,Business Fax,Assistant's Phone,Company,Job Title,Department,Office Location,Organizational ID Number,Profession,Account,Business Address,Business Street,Business Street 2,Business Street 3,Business Address PO Box,Business City,Business State,Business Postal Code,Business Country,Other Phone,Other Fax,Other Address,Other Street,Other Street 2,Other Street 3,Other Address PO Box,Other City,Other State,Other Postal Code,Other Country,Callback,Car Phone,ISDN,Radio Phone,TTY/TDD Phone,Telex,User 1,User 2,User 3,User 4,Keywords,Mileage,Hobby,Billing Information,Directory Server,Sensitivity,Priority,Private,Categories
Bill,,Gates,,,,,,,,,,,,,,,,,,0922449384,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Normal,,My Contacts,
John,,Foster,,,,,,,,,,,"this is a test1
this is a test2
",[email protected],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Normal,,My Contacts,
Mehdi,,Jazini,,,,,,,,,,,"this is a test for note1
this is a test for note2
test
",[email protected],,,,,,09354170163,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Normal,,My Contacts,
Norton,,Commander,,,,,,,,,,,,,,,,,,9454857455,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Normal,,My Contacts,
-
Re: CSV files and a big problem with notes field!
I prefer to use ADO vs manually parsing a csv file. Way easier. If ADO can't do it then: 1) pretty shocked or 2) malformated csv format
Also, if interested, I believe in the codebank section, you will find a csv parsing class/control submitted by Merri
-
1 Attachment(s)
Re: CSV files and a big problem with notes field!
Attached is a quick sample project for you.
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
MarkT
Attached is a quick sample project for you.
OMGGGGGGGGGGGG
Finally i have found the easiest and fastest wayyyyy to open any CSV file :D
Thank you MarkT very much for your info...
Now i can open any CSV with any format even none standard formats with ADO :D
MarkT did you know i have wrote a comprehensive code for opening csv files with (Open Binary) !!! :'((((((((
It can work good but it can't open none standard and old csv formats. but the ADO can do anything :D
you can also change:
Code:
strPath = IIf(Right(App.Path, 1) = "\", App.Path, App.Path & "\")
to this short code:
Code:
strPath = replace(App.Path & "\","\\","\")
and thank you for your screen shot but no need because a saved project have saved the reference check mark in itself :)
Thank you again :wave:
-
Re: CSV files and a big problem with notes field!
That isn't "using ADO" as such since ADO is merely used as the connector technology. The work is being done by the Text Driver, a creaky old ODBC Desktop Driver that ADO can only use by thunking through the OLEDB Provider for ODBC. This could be used other ways as well such as direct ODBC calls or RDO.
A cleaner, faster, and more powerful approach would be to use Jet 4.0's Text IISAM for this, which offers a superset of the Text Driver's capabilities with less overhead. That can be used via either ADO or even DAO 3.6 if one insists.
In the posted example the connection string bizarrely specifies HDR=No and yet processes headers. I assume that's fallout from a poorly-formed connection string having no effect (by failing silently), along with a registry default of HDR=Yes for this desktop driver. If that's why the example here (seems to) work it may well fail on other machines that have the default flipped to No.
Flawed example.
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
dilettante
That isn't "using ADO" as such since ADO is merely used as the connector technology. The work is being done by the Text Driver, a creaky old ODBC Desktop Driver that ADO can only use by thunking through the OLEDB Provider for ODBC. This could be used other ways as well such as direct ODBC calls or RDO.
A cleaner, faster, and more powerful approach would be to use Jet 4.0's Text IISAM for this, which offers a superset of the Text Driver's capabilities with less overhead. That can be used via either ADO or even DAO 3.6 if one insists.
In the posted example the connection string bizarrely specifies HDR=No and yet processes headers. I assume that's fallout from a poorly-formed connection string having no effect (by failing silently), along with a registry default of HDR=Yes for this desktop driver. If that's why the example here (seems to) work it may well fail on other machines that have the default flipped to No.
Flawed example.
Did I understand your reply correctly?
- With HDR=Yes i can use MarkT sample in any windows operating system with any personal computer
- With HDR=No i can't use MarkT sample in all computers
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Mehdi Jazini
Did I understand your reply correctly?
No, what dilettante meant was, that the Header-Setting in the used connection-string of the example
had no effect. The driver was just treating the SourceFile in "contains a headerline"-mode due to a
system-default-setting (which might not be the same on all systems your solution later on has to run).
So, it's better to use a connection-string which makes the driver react to the HDR=YES/NO string-setting.
Code:
"Provider=Microsoft.Jet.OLEDB.4.0;Extended Properties=""text;FMT=Delimited;CharacterSet=ANSI;HDR=YES"";Data Source=C:\temp"
Also important...
The above Cnn-string contains a CharacterSet= ... setting for ANSI explicitely (which is normally the
default, but you cannot always be sure about these defaults).
And since you mentioned, that you want to treat (import) files exported from WebApps...
These usually export CSV-content as UTF-8.
The JET-Text-Driver currently cannot handle UTF-8 input directly - though what it
does understand is Unicode-content (files) which exist in 16bit-little-endian-format.
So after downloading an UTF-8 CSV-file from e.g. a google-mail-export, you will
have to transcode it from UTF8 to UTF16LE and save it back to disk (you can do
that operation on the whole file-content in one go).
After that you can force the JET-CSV-parser to treat it appropriately with:
Code:
"Provider=Microsoft.Jet.OLEDB.4.0;Extended Properties=""text;FMT=Delimited;CharacterSet=UNICODE;HDR=YES"";Data Source=C:\temp"
Olaf
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Schmidt
No, what dilettante meant was, that the Header-Setting in the used connection-string of the example
had no effect. The driver was just treating the SourceFile in "contains a headerline"-mode due to a
system-default-setting (which might not be the same on all systems your solution later on has to run).
So, it's better to use a connection-string which makes the driver react to the HDR=YES/NO string-setting.
Code:
"Provider=Microsoft.Jet.OLEDB.4.0;Extended Properties=""text;FMT=Delimited;CharacterSet=ANSI;HDR=YES"";Data Source=C:\temp"
Also important...
The above Cnn-string contains a CharacterSet= ... setting for ANSI explicitely (which is normally the
default, but you cannot always be sure about these defaults).
And since you mentioned, that you want to treat (import) files exported from WebApps...
These usually export CSV-content as UTF-8.
The JET-Text-Driver currently cannot handle UTF-8 input directly - though what it
does understand is Unicode-content (files) which exist in 16bit-little-endian-format.
So after downloading an UTF-8 CSV-file from e.g. a google-mail-export, you will
have to transcode it from UTF8 to UTF16LE and save it back to disk (you can do
that operation on the whole file-content in one go).
After that you can force the JET-CSV-parser to treat it appropriately with:
Code:
"Provider=Microsoft.Jet.OLEDB.4.0;Extended Properties=""text;FMT=Delimited;CharacterSet=UNICODE;HDR=YES"";Data Source=C:\temp"
Olaf
Actually in gmail contacts in the export option there are 2 options to export.
1) Google CSV format
2) Outlook CSV format
i always used (Outlook CSV format) because i didn't have unicode characters in my contact info in gmail :(
now i have tested it :(
the above ADO sample can just open (Outlook CSV format) in just (ANSI) mode. if all your contact data is english and then you click on (Outlook CSV format) the output in the text file will be normal with an ansi format. but if all your contact data is english + unicode characters and then you click on (Outlook CSV format) the output in the text file will be something like this in the file: ???????????????
the only export option that can export unicode characters very well is just (Google CSV format)
but it seems the ADO is unable to open (Google CSV format) even with CharacterSet=UNICODE;HDR=YES !!!
now, how to open (Google CSV format) ?! :'(
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Mehdi Jazini
...now, how to open (Google CSV format) ?! :'(
Already tried to explain that - when the Google-native exportformat is used,
then it is (with very high probabilty) UTF8-format.
And the ADO-setting CharSet=UNICODE expects *not* UTF8-format, but
UTF16LE-format instead.
So you need (after downloading the Google-CSV-File) to transcode it from UTF8
into UTF16.
You can do that per VB6-code - but also (for a fast test) with Notepad.exe
(per File-Open-Dialog, since that one contains the option to specify 'UTF8' there).
After successful loading of your UTF8-file - just save it (per Notepad.exe SaveAs-Dialog)
with the 'Unicode'-option (which on Win-Systems is little endian by default).
Then make sure, you pass the correct path/filename of your new created UTF16-file
into the ADO-related CSV-routines.
Olaf
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Schmidt
Already tried to explain that - when the Google-native exportformat is used,
then it is (with very high probabilty) UTF8-format.
And the ADO-setting CharSet=UNICODE expects *not* UTF8-format, but
UTF16LE-format instead.
So you need (after downloading the Google-CSV-File) to transcode it from UTF8
into UTF16.
You can do that per VB6-code - but also (for a fast test) with Notepad.exe
(per File-Open-Dialog, since that one contains the option to specify 'UTF8' there).
After successful loading of your UTF8-file - just save it (per Notepad.exe SaveAs-Dialog)
with the 'Unicode'-option (which on Win-Systems is little endian by default).
Then make sure, you pass the correct path/filename of your new created UTF16-file
into the ADO-related CSV-routines.
Olaf
i know but now the problem is not encoding of the file. because after downloading the (unicode) csv file. it's encoding is [unicode] by default. it's not utf-8 or utf-16 or ansi or something like this...
now the problem is: ADO is unable to open (Google CSV Format) and it's an unknown format for it. it can just open (Outlook CSV Format) in gmail export options!
you can copy it to a notepad and save it as ansi then try to open by ADO. it's impossible:
Code:
Name,Given Name,Additional Name,Family Name,Yomi Name,Given Name Yomi,Additional Name Yomi,Family Name Yomi,Name Prefix,Name Suffix,Initials,Nickname,Short Name,Maiden Name,Birthday,Gender,Location,Billing Information,Directory Server,Mileage,Occupation,Hobby,Sensitivity,Priority,Subject,Notes,Group Membership,E-mail 1 - Type,E-mail 1 - Value,Phone 1 - Type,Phone 1 - Value
Bill Gates,Bill,,Gates,,,,,,,,,,,,,,,,,,,,,,,* My Contacts,,,Mobile,0922449384
John Foster,John,,Foster,,,,,,,,,,,,,,,,,,,,,,"this is a test1
this is a test2
",* My Contacts,* Home,[email protected],,
Mehdi Jazini,Mehdi,,Jazini,,,,,,,,,,,,,,,,,,,,,,"this is a test for note1
this is a test for note2
test
",* My Contacts,* Home,[email protected],Mobile,09354170163
Norton Commander,Norton,,Commander,,,,,,,,,,,,,,,,,,,,,,,* My Contacts,,,Mobile,9454857455
-
Re: CSV files and a big problem with notes field!
Mehdi, just to add a bit more info (or maybe noise, not sure). But Unicode is not just one thing. There are several "Unicode Transformation Formats" or UTF's available within Unicode. UTF-16 is probably the most popular but UTF-8 is also quite popular. In Notepad, the Save As/Unicode option is actually UTF-16. There is also a UTF-8 option listed there. (There's also an issue of Endianness, but that's probably not something you need to worry about.)
VB6 uses the UTF-16 version of Unicode. This is a nice version in that all of the characters are exactly 2 bytes (or at least in most cases, including ALL of VB6). In the UTF-8 version, the original ASCII characters are actually only one byte, and it's quite easy to have characters that are 3 or 4 bytes.
For a text file, you can check the first two (or three) bytes to see how it's encoded.
If the first two bytes are &hFF &hFE then it's a UTF-16 Unicode encoded file.
If the first three bytes are &hEF &hBB &hBF then it's a UTF-8 Unicode encoded file.
If the high bit of the first byte is off (i.e., the byte < &h80), then it's ASCII.
All other conditions either other encodings or undefined.
Reading an ASCII file into VB6 is trivial, and I won't go into it.
Regarding reading a UTF-8 file into VB6, that'd be a bit tough, and I don't have anything to do it.
If you wish to read a UTF-16 file into VB6, the following is the code to do it.
None of this really addresses your parsing issues, but maybe it'll help you to read a file, re-write it a bit differently, and then parse it. However, if it were me, I'd bite the bullet and just write a parser in VB6. From the above, I'm only partially clear on what you're trying to do but I'd do it in two phases: 1) write a parser to parse it into lines, creating a string array with the lines. That way, you could do whatever tricks you need to to correctly handle your vbLF and vbCR characters. Phase 2, I would parses the individual lines into my fields. Again, you'd have to monitor for your Chr$(34) characters.
Best of Luck.
Code:
Public Sub SaveStringToUnicodeFile(sData As String, sFileSpec As String)
' These are typically .TXT files. They can be read with notepad.
Dim iFle As Long
'
iFle = FreeFile
Open sFileSpec For Binary As iFle
Put iFle, , &HFEFF ' This is the Unicode header to a text file. First byte = FF, second byte = FE.
Put iFle, , UnicodeByteArrayFromString(sData)
Close iFle
End Sub
Public Function LoadStringFromUnicodeFile(sFileSpec As String) As String
' These are typically .TXT files. They can be read with notepad.
Dim iFle As Long
Dim bb() As Byte
Dim i As Integer
'
iFle = FreeFile
Open sFileSpec For Binary As iFle
Get iFle, , i
If i <> &HFEFF Then ' Unicode file header. First byte = FF, second byte = FE.
Close iFle
Exit Function ' It's not a valid Unicode file.
End If
ReDim bb(1 To LOF(iFle) - 2&)
Get iFle, , bb
Close iFle
LoadStringFromUnicodeFile = bb ' This directly copies the byte array to the Unicode string (no conversion).
' Note: If you try to directly read the file as a string, VB6 will attempt to convert the string from ASCII to Unicode.
End Function
Public Function UnicodeByteArrayFromString(s As String) As Byte()
' This directly copies the Unicode string into the byte array, using two bytes per character (i.e., Unicode).
UnicodeByteArrayFromString = s
End Function
-
Re: CSV files and a big problem with notes field!
I don't have a google-mail-account and cannot test this myself.
Would be nice when you could send the original (as downloaded from google, not-manipulated)
two types of files to me per mail (I have a normal E-Mail reply-address defined for my account) -
or in case the content is no problem, then an zipped upload to the "forum-space" would be even better,
so that others can take a look too.
Olaf
-
1 Attachment(s)
Re: CSV files and a big problem with notes field!
Thank you Elroy for your info i'll read your post tomorrow carefully :)
Quote:
Originally Posted by
Schmidt
I don't have a google-mail-account and cannot test this myself.
Would be nice when you could send the original (as downloaded from google, not-manipulated)
two types of files to me per mail (I have a normal E-Mail reply-address defined for my account) -
or in case the content is no problem, then an zipped upload to the "forum-space" would be even better,
so that others can take a look too.
Olaf
Olaf I have attached its original file from gmail export
-
Re: CSV files and a big problem with notes field!
Mehdi,
That file you attached to post #31 isn't Unicode. It's pure ASCII. Here's some code to do it for you. I just hammered it out so be sure to run it through some tests. Also, be sure to patch up the sFileSpec (or maybe do some common dialog and prompt for it). Also, with my above "read Unicode file" routine, this could be fairly easily adapted to a Unicode file, but I'm done with free coding on this thread. I hope you get it going.
Take Care,
Elroy
Code:
Private Sub Form_Load()
Dim iFle As Long
Dim sFileSpec As String
Dim bInQuote As Boolean
Dim sLine As String
Dim bb As Byte
Dim sLines() As String
Dim iLineCount As Long
Dim iFieldCount As Long
Dim iLin As Long
Dim iChr As Long
Dim iFld As Long
Dim sLinesAndFields() As String
Dim sField As String
Dim c As String
'
sFileSpec = "c:\users\elroy\desktop\google.csv"
iFle = FreeFile
Open sFileSpec For Binary As iFle
'
' Read file and parse lines.
Do
If EOF(iFle) Then
If Len(sLine) > 0 Then
iLineCount = iLineCount + 1
ReDim Preserve sLines(1 To iLineCount)
sLines(iLineCount) = sLine
sLine = ""
End If
Exit Do ' The only way out.
End If
Get iFle, , bb
Select Case bb
Case 34 ' Quote.
bInQuote = Not bInQuote
sLine = sLine & Chr$(bb)
Case 10 ' LF.
If bInQuote Then
sLine = sLine & Chr$(bb)
Else
iLineCount = iLineCount + 1
ReDim Preserve sLines(1 To iLineCount)
sLines(iLineCount) = sLine
sLine = ""
End If
Case 13 ' CR.
If bInQuote Then
sLine = sLine & Chr$(bb)
End If ' Else ignore.
Case Else
sLine = sLine & Chr$(bb)
End Select
Loop
Close iFle
'
' Now parse fields.
ReDim sLinesAndFields(1 To iLineCount, 1 To 1)
For iLin = 1 To iLineCount
iFieldCount = 0
bInQuote = False
For iChr = 1 To Len(sLines(iLin))
bb = Asc(Mid$(sLines(iLin), iChr, 1))
Select Case bb
Case 34 ' Quote.
bInQuote = Not bInQuote ' But toss the quote mark.
Case 44 ' Comma.
If bInQuote Then
sField = sField & Chr$(bb)
Else
iFieldCount = iFieldCount + 1
If iFieldCount > UBound(sLinesAndFields, 2) Then ReDim Preserve sLinesAndFields(1 To iLineCount, 1 To iFieldCount)
sLinesAndFields(iLin, iFieldCount) = sField
sField = ""
End If
Case Else
sField = sField & Chr$(bb)
End Select
Next iChr
' Create one more field with whatever's in sField.
iFieldCount = iFieldCount + 1
If iFieldCount > UBound(sLinesAndFields, 2) Then ReDim Preserve sLinesAndFields(1 To iLineCount, 1 To iFieldCount)
sLinesAndFields(iLin, iFieldCount) = sField
sField = ""
Next iLin
'
' Dump it to see if it worked.
For iLin = 1 To iLineCount
Debug.Print "Line " & Format$(iLin)
For iFld = 1 To UBound(sLinesAndFields, 2)
Debug.Print " (" & Format$(iFld) & ") ";
For iChr = 1 To Len(sLinesAndFields(iLin, iFld))
bb = Asc(Mid$(sLinesAndFields(iLin, iFld), iChr, 1))
Select Case bb
Case 10 ' LF
Debug.Print "{LF}";
Case 13
Debug.Print "{CR}";
Case Else
Debug.Print Chr$(bb);
End Select
Next iChr
Next iFld
Debug.Print
Next iLin
End Sub
p.s. If you wind up using this code, I'd like to think that you'd think through it and learn something from it. :)
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Elroy
Mehdi, just to add a bit more info (or maybe noise, not sure). But Unicode is not just one thing. There are several "Unicode Transformation Formats" or UTF's available within Unicode. UTF-16 is probably the most popular but UTF-8 is also quite popular. In Notepad, the Save As/Unicode option is actually UTF-16. There is also a UTF-8 option listed there. (There's also an issue of Endianness, but that's probably not something you need to worry about.)
VB6 uses the UTF-16 version of Unicode. This is a nice version in that all of the characters are exactly 2 bytes (or at least in most cases, including ALL of VB6). In the UTF-8 version, the original ASCII characters are actually only one byte, and it's quite easy to have characters that are 3 or 4 bytes.
For a text file, you can check the first two (or three) bytes to see how it's encoded.
If the first two bytes are &hFF &hFE then it's a UTF-16 Unicode encoded file.
If the first three bytes are &hEF &hBB &hBF then it's a UTF-8 Unicode encoded file.
If the high bit of the first byte is off (i.e., the byte < &h80), then it's ASCII.
All other conditions either other encodings or undefined.
Reading an ASCII file into VB6 is trivial, and I won't go into it.
Regarding reading a UTF-8 file into VB6, that'd be a bit tough, and I don't have anything to do it.
If you wish to read a UTF-16 file into VB6, the following is the code to do it.
None of this really addresses your parsing issues, but maybe it'll help you to read a file, re-write it a bit differently, and then parse it. However, if it were me, I'd bite the bullet and just write a parser in VB6. From the above, I'm only partially clear on what you're trying to do but I'd do it in two phases: 1) write a parser to parse it into lines, creating a string array with the lines. That way, you could do whatever tricks you need to to correctly handle your vbLF and vbCR characters. Phase 2, I would parses the individual lines into my fields. Again, you'd have to monitor for your Chr$(34) characters.
Best of Luck.
Code:
Public Sub SaveStringToUnicodeFile(sData As String, sFileSpec As String)
' These are typically .TXT files. They can be read with notepad.
Dim iFle As Long
'
iFle = FreeFile
Open sFileSpec For Binary As iFle
Put iFle, , &HFEFF ' This is the Unicode header to a text file. First byte = FF, second byte = FE.
Put iFle, , UnicodeByteArrayFromString(sData)
Close iFle
End Sub
Public Function LoadStringFromUnicodeFile(sFileSpec As String) As String
' These are typically .TXT files. They can be read with notepad.
Dim iFle As Long
Dim bb() As Byte
Dim i As Integer
'
iFle = FreeFile
Open sFileSpec For Binary As iFle
Get iFle, , i
If i <> &HFEFF Then ' Unicode file header. First byte = FF, second byte = FE.
Close iFle
Exit Function ' It's not a valid Unicode file.
End If
ReDim bb(1 To LOF(iFle) - 2&)
Get iFle, , bb
Close iFle
LoadStringFromUnicodeFile = bb ' This directly copies the byte array to the Unicode string (no conversion).
' Note: If you try to directly read the file as a string, VB6 will attempt to convert the string from ASCII to Unicode.
End Function
Public Function UnicodeByteArrayFromString(s As String) As Byte()
' This directly copies the Unicode string into the byte array, using two bytes per character (i.e., Unicode).
UnicodeByteArrayFromString = s
End Function
Thank you Elroy for your useful info.
I have no knowledge to read and import UTF-8 or Unicode characters to VB6 objects like textbox and listbox and etc. so it's better to make the following conditions to avoid importing none ansi files:
Asc=255 and Asc=254 is UTF-16 or Unicode {i guess it's &hFF &hFE}
Asc=239 and Asc=187 and Asc=191 is UTF-8 {i guess it's &hEF &hBB &hBF}
Asc=254 and Asc=255 and Asc=0 is Unicode big endian {and i don't guess anything :D just tested}
if the codes detected the above ascii codes at the first of the binary file, it returns this error: "this program is unable to import none ansi files. please convert them to ansi then try again" :D
Quote:
Originally Posted by
Elroy
Mehdi,
That file you attached to post #31 isn't Unicode. It's pure ASCII. Here's some code to do it for you. I just hammered it out so be sure to run it through some tests. Also, be sure to patch up the sFileSpec (or maybe do some common dialog and prompt for it). Also, with my above "read Unicode file" routine, this could be fairly easily adapted to a Unicode file, but I'm done with free coding on this thread. I hope you get it going.
Take Care,
Elroy
Code:
Private Sub Form_Load()
Dim iFle As Long
Dim sFileSpec As String
Dim bInQuote As Boolean
Dim sLine As String
Dim bb As Byte
Dim sLines() As String
Dim iLineCount As Long
Dim iFieldCount As Long
Dim iLin As Long
Dim iChr As Long
Dim iFld As Long
Dim sLinesAndFields() As String
Dim sField As String
Dim c As String
'
sFileSpec = "c:\users\elroy\desktop\google.csv"
iFle = FreeFile
Open sFileSpec For Binary As iFle
'
' Read file and parse lines.
Do
If EOF(iFle) Then
If Len(sLine) > 0 Then
iLineCount = iLineCount + 1
ReDim Preserve sLines(1 To iLineCount)
sLines(iLineCount) = sLine
sLine = ""
End If
Exit Do ' The only way out.
End If
Get iFle, , bb
Select Case bb
Case 34 ' Quote.
bInQuote = Not bInQuote
sLine = sLine & Chr$(bb)
Case 10 ' LF.
If bInQuote Then
sLine = sLine & Chr$(bb)
Else
iLineCount = iLineCount + 1
ReDim Preserve sLines(1 To iLineCount)
sLines(iLineCount) = sLine
sLine = ""
End If
Case 13 ' CR.
If bInQuote Then
sLine = sLine & Chr$(bb)
End If ' Else ignore.
Case Else
sLine = sLine & Chr$(bb)
End Select
Loop
Close iFle
'
' Now parse fields.
ReDim sLinesAndFields(1 To iLineCount, 1 To 1)
For iLin = 1 To iLineCount
iFieldCount = 0
bInQuote = False
For iChr = 1 To Len(sLines(iLin))
bb = Asc(Mid$(sLines(iLin), iChr, 1))
Select Case bb
Case 34 ' Quote.
bInQuote = Not bInQuote ' But toss the quote mark.
Case 44 ' Comma.
If bInQuote Then
sField = sField & Chr$(bb)
Else
iFieldCount = iFieldCount + 1
If iFieldCount > UBound(sLinesAndFields, 2) Then ReDim Preserve sLinesAndFields(1 To iLineCount, 1 To iFieldCount)
sLinesAndFields(iLin, iFieldCount) = sField
sField = ""
End If
Case Else
sField = sField & Chr$(bb)
End Select
Next iChr
' Create one more field with whatever's in sField.
iFieldCount = iFieldCount + 1
If iFieldCount > UBound(sLinesAndFields, 2) Then ReDim Preserve sLinesAndFields(1 To iLineCount, 1 To iFieldCount)
sLinesAndFields(iLin, iFieldCount) = sField
sField = ""
Next iLin
'
' Dump it to see if it worked.
For iLin = 1 To iLineCount
Debug.Print "Line " & Format$(iLin)
For iFld = 1 To UBound(sLinesAndFields, 2)
Debug.Print " (" & Format$(iFld) & ") ";
For iChr = 1 To Len(sLinesAndFields(iLin, iFld))
bb = Asc(Mid$(sLinesAndFields(iLin, iFld), iChr, 1))
Select Case bb
Case 10 ' LF
Debug.Print "{LF}";
Case 13
Debug.Print "{CR}";
Case Else
Debug.Print Chr$(bb);
End Select
Next iChr
Next iFld
Debug.Print
Next iLin
End Sub
p.s. If you wind up using this code, I'd like to think that you'd think through it and learn something from it. :)
Elroyyyyy did you really write this code yourself?
I have read it line by line... it's a very smart and professional code...
i didn't know this trick in vb6:
Code:
bInQuote = Not bInQuote
so by the above trick we can switch between true and false in every process... wow good idea :D
Code:
ReDim Preserve sLines(1 To iLineCount)
sLines(iLineCount) = sLine
but sometimes we can't use arrays in the code because some CSV files have at least 100,000 or more records in a file and in my opinion ram memory of personal computers can't handle this big data :(
i'll change your code to a code without arrays... i think random files can be best choice instead of arrays :D
Code:
Case 44 ' Comma.
If bInQuote Then
sField = sField & Chr$(bb)
Else
omg.......
you also think about comma problem....
by using your professional code we can also use comma between quote in note filed...
amazingggggg thank you for your time :)
actually we can use your professional code instead of ADO... goodbye ADO :D
i have decided to send you 2 rates for your posts but i have got error in 2nd rate.
i'll send the 2nd rate to you later
tnx :wave:
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Mehdi Jazini
...actually we can use your professional code instead of ADO... goodbye ADO :D
I wouldn't ditch it that fast - since it can reliably handle quite a few more things than Elroys code
does currently - which was perhaps thought as a fast hacked example, to be adapted on your own -
but then you already got a few of those also from others...
The code as it currently is, doesn't handle Unicode - it will choke on *.csv Files which have
line-separators different from vbCRLF, it tends to get progressively slower, the larger the files get
(prolonging a String-Array per Redim-Preserve will work Ok for files below 10000 lines or so -
but can be handled better) - also reading FileContent "Byte by Byte" is not really a good idea IMO...
Here's a site where you can download some real-world CSV-samples:
http://support.spatialkey.com/spatia...mple-csv-data/
And e.g. on the first downloadable sample-file (the insurance-data, which contains about 36000 records)
the code you got from Elroy would need about 10seconds or so (just to parse it) - whereas the JET-driver
will hand out the results over to you in a nice format (a Recordset) after only about 500msecs (~factor 20 faster).
If you really plan to import larger files (with 100,000 lines and more), then such performance-considerations
might play a role (depending on your or your customers needs/expectations).
To finish what I started - here's a small UTF8-UTF16 File-Transcoding routine, which (as it is currently) -
should work for transcoding of Utf8-CSV-Files of up to ~100MB - above that you would need to change
to a different approach (e.g. the csv-parser in the vbRichClient5-library - which is even faster than ADO/JET,
the codebank actually has an example for it: http://www.vbforums.com/showthread.p...-and-ADOJet%29 ).
Ok, here the Transcoding-Routine - and how it works in conjunction with UTF8-encoded files
and the JET-TextDriver (+ an MSHFlexGrid, to finally show the retrieved Recordset-Content).
Into a Form (needs an instance MSHFlexGrid1 and a Command1 on it):
Code:
Option Explicit
Private Declare Function MultiByteToWideChar& Lib "kernel32" (ByVal CodePage&, ByVal dwFlags&, MultiBytes As Any, ByVal cBytes&, ByVal pWideChars&, ByVal cWideChars&)
Private Sub Command1_Click()
'UTF8-transcoding related code
Dim CsvPath As String, CsvSrcFile As String, CsvDstFile As String
CsvPath = "c:\temp\"
CsvSrcFile = "FL_insurance_sample.csv" 'downloadable from here (first example): http://support.spatialkey.com/spatialkey-sample-csv-data/
CsvDstFile = "FL_insurance_sampleUTF16.csv"
TranscodeUTF8toUTF16 CsvPath & CsvSrcFile, CsvPath & CsvDstFile
'ADO-Csv-import related stuff
Dim Cnn As New ADODB.Connection, Rs As ADODB.Recordset, T!
T = Timer
Cnn.CursorLocation = adUseClient
Cnn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Extended Properties=""text;FMT=Delimited;CharacterSet=UNICODE;HDR=YES"";Data Source=" & CsvPath
Set Rs = Cnn.Execute("Select * From [" & CsvDstFile & "]")
Set Rs.ActiveConnection = Nothing 'disconnect the Rs from the DataSource
Caption = Rs.RecordCount & " Records imported in " & Timer - T
Set MSHFlexGrid1.DataSource = Rs 'fill a FlexGrid with the Recordsets content
Cnn.Close
End Sub
Sub TranscodeUTF8toUTF16(SrcFileName As String, DstFileName As String)
Dim FNr&, BLen&, B() As Byte, Offs&, WLen&, BW() As Byte
On Error GoTo 1
BLen = FileLen(SrcFileName)
If BLen = 0 Then Err.Raise vbObjectError, , "FileLen=0, nothing to transcode"
FNr = FreeFile: Open SrcFileName For Binary Access Read As FNr
ReDim B(0 To BLen - 1)
Get FNr, , B 'read the UTF8-Bytes
If BLen > 2 Then Offs = IIf(B(0) = 239 And B(1) = 187 And B(2) = 191, 3, 0)
If BLen > Offs Then
WLen = MultiByteToWideChar(65001, 0, B(Offs), BLen - Offs, 0, 0)
ReDim BW(0 To WLen * 2 - 1)
MultiByteToWideChar 65001, 0, B(Offs), BLen - Offs, VarPtr(BW(0)), WLen
End If
Close FNr
FNr = FreeFile: Open DstFileName For Binary Access Write As FNr
Put FNr, , BW 'write the WideChars
1:
If FNr Then Close FNr
If Err Then Err.Raise Err.Number, Err.Source & ".TranscodeUTF8toUTF16", Err.Description
End Sub
Olaf
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Elroy
That file you attached to post #31 isn't Unicode. It's pure ASCII.
The fact that a file has "pure ASCII-content" doesn't necessarily mean, that its content
wasn't produced by an UTF8-encoding Export-Routine (at the server-end of the gmail-
WebApp or by other services).
When there's no explicit UTF8-BOM in the file, then you can not tell an UTF8-exported
File (in case it contains only "english Char-Values" < 128) from a "real ASCII-file" ...
UTF8 is backward-compatible to ASCII.
@Mehdi - I'd try with a GMail-exported CSV, which contains contact-data of people
from a few diffferent countries.
Olaf
-
Re: CSV files and a big problem with notes field!
hahahaha, geez, my code is already taking the hits. And YES, I wrote it yesterday Mehdi. I didn't have anything better to do and I'm having fun on these forums. Also, I've been at this a LOOONNNGGG time. It took me something between one and two hours, but now I truly am bragging. :p
Well, you've got 4gig of memory to play with when using VB6, so unless you're files are VERY large, I wouldn't worry about memory. But yes, the fact that I read the Ascii/Unicode file one byte at a time, it may get a bit slow. I've got other routines to create buffers for reading binary, but that would take more work.
Also, Schmidt, if you study my code, you'll see that I focused on the {LF} only as a line terminator (just tossing the {CR} that weren't surrounded by quotes). That should make it handle Unix files just as well as Windows files.
Also, as previously stated, it wouldn't take too much to use my previously posted LoadStringFromUnicodeFile function (post #29) and adapt the entire parsing routine to work with Unicode. Hardly anything would need to be changed except that it gets its bytes (errr, I mean, characters) from the loaded string rather than directly from disk. I'll admit that that would be getting several copies of the file in memory, but all the pieces are there for a pretty nice routine.
EDIT: I guess, beyond minimal work, the only thing it won't handle is a UTF8 file. Those things are beasts, and I've never really messed with them that much.
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Elroy
Well, you've got 4gig of memory to play with when using VB6,
Nah - typically it's more like 1.8GB (from the theoretical 2GB) we have available with VB6-Apps,
until the VB-Mem-allocator starts choking on larger (or already midsized) allocations.
Quote:
Originally Posted by
Elroy
But yes, the fact that I read the Ascii/Unicode file one byte at a time, it may get a bit slow.
Well, I'd say factor 20 is not only "a bit slower" than it could be - and what contributes to the
performance-penalty is also the char-by-char-appending when gathering the Line-content.
Then you parse for fields a second time (unnecessarily) - and that also one char at a time -
so there's quite some room for further improvements on that front.
Also your simple negation in case of quotes is currently error-prone, since there's no reset
of the Flag to 'False' in case you encounter a (non-quoted) comma... a single double-quote
is often used to specify inches on (otherwise unquoted) number-fields as a 'units'-specifier.
How other CSV-parsers treat certain "hard to parse" fields, is e.g. pointed out here:
http://www.xbeat.net/vbspeed/c_ParseCSV.php
Quote:
Originally Posted by
Elroy
... you'll see that I focused on the {LF} only as a line terminator (just tossing the {CR} that weren't surrounded by quotes).
That should make it handle Unix files just as well as Windows files.
Right you are - was a bit imprecise there - though CSV-files which contain vbCR-only are not uncommon -
e.g. the examples from the link I've posted: http://support.spatialkey.com/spatia...mple-csv-data/
come in that category.
Quote:
Originally Posted by
Elroy
... the only thing it won't handle is a UTF8 file.
A quite common format these days - it's (due to its compactness) quite common in WebApps -
whereas UTF16-format is not that often encountered in the wild.
Quote:
Originally Posted by
Elroy
Those things are beasts, and I've never really messed with them that much.
It's not really handled very differently from ANSI-content (just another CodePage-EnumValue) -
in either case the MultiByteToWideChar-API is used (too bad, that VBs StrConv-function
only supports the different ANSI-Codepages over its last optional LCID-param).
Though it's only a few lines of code - once you have the necessary API-Declaration in place.
The Transcoding-Routine I've posted already contains the necessary bits - and a few other
UTF8-to-UTF16-routines were recently posted and discussed as well here:
http://www.vbforums.com/showthread.p...y-Line-by-Line
Olaf
-
Re: CSV files and a big problem with notes field!
And I guess it all turns to mush if we try to read a csv file created on a computer in e.g. France with one created in e.g. England unless we have prior knowledge of the Regional settings active when/ where it was created; and then I think we would need to parse the whole file character by character.
-
Re: CSV files and a big problem with notes field!
Hmmm, Schmidt, I suppose I sort of agree and disagree about the memory thing. 32 bit applications (which a VB6 program is) have a 4gig virtual address space. And, in fact, with the understanding of how forms, classes, and user controls are actually overlays in the EXE, you can have binary code that's even much larger than that so long as you don't try to pull it all into memory at the same time (with Load Form and Set MyObj = New SomeClass statements). However, you're right in that pretty much all variables and arrays (depending on the situation) are capped at a max of 2gig.
If we're truly dealing with Ascii/Unicode type files, this is still one HECK of a lot of space. I'm not sure I've ever seen an ASCII or Unicode file that's over a few meg, but I'm sure they're out there. Many years ago, I used to be like Mehdi and afraid to use memory. However, machines I make installations on these days have a minimum of 2gig of RAM (usually more like 8), so it's just not a problem.
Now I did recently post a binary file search routine here, and it was certainly disk based, but I view ASCII/Unicode files a bit differently.
I'm trying to be done with this thread, but, considering how boolean variables are initialized as false, I'm pretty sure I handle everything except for one situation, which I did think about. And that's an escaped quote mark inside of quotes. Some CSV formats have a way of doing that, possibly with another escape character. But the following is handled: 1) a line opening a quote but not closing it before EOL, 2) a line terminating in a comma, 3) a null string such as ,"", 4) repeating commas such as ,,,, 4) a comma inside of quotes (which should be treated as part of the field).
Regarding the speed, like I said, I hammered this out in an hour or so. It'd be fairly easy to write a buffer procedure for reading the data faster. It'd also be fairly easy to do some Mid$(s, iSpot, 1) = Chr$(bb) work to speed up string processing. Hmmm, you know what? You won't hurt my feelings one teensy weensy bit if you take my work, make those improvements to it, and re-post. :p
Ahhh, and thanks for the thread. It had some interesting info in it. One thing that was fascinating (sort of expanding a bit on what it said) was to automate Notepad to use it as a UTF-8 to UTF-16 converter. It's certainly built into all later versions of Notepad. There is some other interesting code on that thread though, and I'll be studying it. I certainly wouldn't mind having a bulletproof UTF-8 to UTF-16 converter in my toolbox.
Schmidt, you take care,
Elroy
EDIT: Note to Mehdi. Yes, if you do pull out any of the variables in that routine, and making their scope "module" or "global" then you probably do want to reset their initial values to False, "", and 0. And Erase all the arrays.
-
Re: CSV files and a big problem with notes field!
Quote:
Originally Posted by
Elroy
Hmmm, Schmidt, I suppose I sort of agree and disagree about the memory thing.
...However, you're right in that pretty much all variables and arrays (depending on the situation) are capped at a max of 2gig.
Thought that's what was relevant - and what we were discussing...
I was talking from practical experience - and as for the theoretical limit of 2GB for
mem-allocations in a 32Bit-process, we can reach only about 1.6 - 1.8 GB in practice
(depending on fragmentation).
The below code chokes here on my machine, after reaching 169 allocated chunks of 10MB each.
(about 1.7 GB).
When we change the chunksize to a larger one - e.g. 100MB - then the loop is
exited already at 13 successfully allocated chunks (1.3 GB) here on a Win8-machine with 8GB Ram.
Code:
Option Explicit
Private Type tMemAllocTest
B() As Byte
End Type
Const MemoryChunks As Long = 200
Private MemAllocTest(1 To MemoryChunks) As tMemAllocTest
Private Sub Form_Load()
Dim i As Long
Const ByteArrSize As Long = 10& * 1024& * 1024& '10MB per Chunk
On Error GoTo 1
For i = 1 To MemoryChunks 'let's see how many 10MB-memory-chunks we can allocate
ReDim MemAllocTest(i).B(1 To ByteArrSize)
Next
1: MsgBox "the system was able, to allocate only " & i - 1 & " 10MB-chunks"
End Sub
Quote:
Originally Posted by
Elroy
Hmmm, you know what? You won't hurt my feelings one teensy weensy bit if you take my work, make those improvements to it, and re-post. :p
Nah - I leave that to you ;) - I have my 2 cents of CSV-parsing made available to the community
in compiled form (allowing filesizes above 4GB, as well as Unicode-Filenames) - and its usage would look like this:
Code:
Option Explicit
Implements ICSVCallback
Private CSV As cCSV
Private Sub Form_Load()
Set CSV = New_c.CSV
CSV.ColSepChar = ","
CSV.QuotChar = Chr$(34)
CSV.ParseFile "c:\temp\google.csv", Me
End Sub
Private Function ICSVCallback_NewValue(ByVal RowNr As Long, ByVal ColNr As Long, B() As Byte, _
ByVal BValStartPos As Long, ByVal BValLen As Long) As Long
Debug.Print RowNr, ColNr, CSV.GetStringValue(B, BValStartPos, BValLen)
End Function
Quote:
Originally Posted by
Elroy
EDIT: Note to Mehdi. Yes, if you do pull out any of the variables in that routine, and making their scope "module" or "global" then you probably do want to reset their initial values to False, "", and 0. And Erase all the arrays.
As the example above shows - a class-encapsulation in conjunction with an interface-based callback
allows for a quite generic and fast "SAX-like" Field-parsing - just a suggestion...
Olaf
-
Re: CSV files and a big problem with notes field!
Alright, just because I'm me, I just finished writing code to read either a UTF-16, UTF-8, or ASCII file into memory. The above CSV parser could easily be adapted to use it, and it'd speed up the file I/O because it'd already be in memory. Yeah yeah, it's still doing it in memory, and I'll let others make it all disk-based if they feel the need.
Here's the link to the UTF-16, UTF-8, or ASCII file reader:
http://www.vbforums.com/showthread.p...nd-UTF-8-Files
-
Re: CSV files and a big problem with notes field!
This may not always work, depending on whether or not the UTF-8 has or does not have a BOM. See my example files in the codebank forum.