dcsimg
Results 1 to 2 of 2

Thread: [vb6]Yet Another CSV Parser

  1. #1

    Thread Starter
    VB-aholic & Lovin' It LaVolpe's Avatar
    Join Date
    Oct 2007
    Location
    Beside Waldo
    Posts
    18,642

    [vb6]Yet Another CSV Parser

    A fairly basic CSV parser, with a bit more user-control and a slight twist: Event driven by Record
    The parser does handle quoted field data and delimiters, carriage returns, and other non-printable characters wtihin field data, assuming the field data is properly formatted.

    The parser will raise an event for each record it has finished parsing. You would respond to this event to use/format the parsed field data. The event allows aborting further processing if the CSV appears corrupted. The event will inform you if the CSV file appears corrupt.

    The class offers two ways of feeding it CSV information but only one method is called.
    1) Entire file read into a string
    2) Line by line from the CSV file, read from a loop or from a split string

    As mentioned, an event is called for each record. This event has 3 states:
    1) RecordParsed. A vbNullChar-delimited string is passed which contains the entire record
    2) FieldNamesStatic. A vbNullChar-delimeted string containing field names from the CSV's first row of data
    3) FieldNamesGeneric. A vbNullChar-delimited string containing default field names for CSVs without a header row
    Each event has has three other parameters: HeaderCount, FieldDifferential, and RecordNumber
    RecordNumber will be zero if processing field names else incrementing by 1 each time event is called
    HeaderCount is the number of fields based from the 1st row of parsed data
    FieldDifferential is basically an error if non-zero.
    0 indicates that number of fields in the record equal number of field names
    Negative indicates number of missing fields in the record. HeaderCount + FieldDifferential = number processed fields
    Positive indicates number of extra fields in the record.
    If you feel comfortable trying to handle any discrepancies between field count and header count, no need to reply to the event. However, if you want to abort processing any further records, you simply return the Record parameter as a null string.

    Let's talk about proper formatting
    Delimiters come in various flavors with this class:
    :: Record Delimiter defines when a record ends & a new record begins. Hard coded in the class a vbCr and/or vbLf
    :: Field Delimiter defines when a field ends & a new field begins. This is user-defined & defaults to a comma
    :: Quote/Text Delimiter defines start and end of text where any character (delimiter or not) is not specially handled
    :: Escape Delimiter defines characters not specially handled. Escape delimited files are rare
    Quote and Escape Delimiters are also used to delimit themselves as non-special characters

    1) Every record in a CSV, including any header row, is delimited by a carriage return and/or line feed
    -- Only exception is the final record. It does not require a record delimiter

    2) Every field within a record must be delimited by a character you specify. This class does not process fixed-length field CSVs
    -- Field delimiters are never used before the 1st field and never used after the final field
    -- Any field that contains a null character (ASCII byte 0) will abort the parser

    3) Quotes, i.e., ", are defaulted to be handled as text identifiers. This option can be turned off or changed to a different character
    -- Quote delimiters allow non-printable characters and other delimiters to be treated as just any other character

    4) If any field contains a record and/or field delimiter within the field's data, the delimiters must be identified as non-delimiters
    -- Two options are provided in the class: quoted field data and escape characters

    Delimiters. Let's say the field delimiter is a comma
    :: If any character within a field contains any delimiter, then that delimiter must be escaped
    :: Sample field that contains a field delimiter: Hello, my name is LaVolpe
    If Quote delimit character is " then should be saved to file as: "Hello, my name is LaVolpe"
    If Escape delimit character is \ then should be saved to file as: Hello\, my name is LaVolpe

    Delimiting the delimiters. Simple rule, replace each delimiter with a double delimiter
    :: Sample field with Quote delimiter: Hello, my name is "LaVolpe"
    Saved to file as: "Hello, my name is ""LaVolpe"""
    -- if no field delimiter existed, then this works too: My name is ""LaVolpe""
    :: Sample field with Escape delimiter: C:\My Documents
    Saved to file as: C:\\My Documents
    Note that the Quote & Escape delimiters are a tad different.
    -- Quote delimiters are doubled only within the field data.
    -- If field data has delimiters, then the field data is written to file with a single quote both as a prefix & suffix
    -- Quote delimiters are always paired if used as delimiters
    -- Any record or field delimiter within a field needs no special handling, when that field on disk begins & ends with a quote delimiter
    -- Escape delimiters, if used, are required for every field, record and escape delimiter that exists within a field
    -- Escape delimiters have no 'pairing' requirement unless escaping itself
    -- Mixing Quote & Escape delimiters is not recommended, though can be used if you want to customize your CSV data
    -- Quote and/or Escape delimiter characters must be defined by the user, both are optional
    -- General rule of thumb
    :: Quote/Escape delimiters only needed if record or field delimiters exist in field data
    :: The fact that a field has quotes does not force a requirement of using quote delimiters
    If no quote or no escape delimiters were defined during parser initialization, the following records would work just fine because no record nor field delimiter exists within the individual fields
    Year,Make,Model,Engine,Body,NickName
    1969,Ford,Mustang,351 Windsor,Sport Roof,"Fast Back"


    Quick examples of using the class, both for reading line by line & entire file
    Code:
    Private WithEvents CSVParser As ICSVParser
    
    Private CSVParser_ProcessRecord(ByVal State As ProcessStateEnum, _
    				Record As String, _
    				ByVal FieldDifferential As Long, _
    				ByVal HeaderCount As Long, _
    				ByVal RecordNumber As Long)
    	' process parsed CSV record
    	Dim sData() As String
    	Select Case State
    	Case csvRecordParsed
    		If FieldDifferential Then
    			' handle potentially corrupt CSV
    			' to abort further processing: Record = vbNullString
    		Else
    			sData = Split(Record, vbNullChar)
    			' process data
    		End If
    	Case csvFieldNameStatic
    		sData = Split(Record, vbNullChar)
    		' process field names
    	Case csvFieldNameGeneric
    		sData = Split(Record, vbNullChar)
    		' process field names, optionally, using your own names
    	End Select
    End Sub
    
    Private Sub Command1_Click() ' Full file example
    	If CSVParser Is Nothing Then Set CSVParser = New ICSVParser
    	Dim fnr As Integer, sFile As string
    	fnr = FreeFile()
    	Open "C:\Temp\TestCSV.csv" For Binary As #fnr
    	sFile = Space$(LOF(fnr))
    	Get #fnr, 1, sFile
    	Close #fnr
    	CSVParser.InitializeParser True
    	If CSVParser.ParseRecord(sFile) = False Then
    		' handle informing user of corrupt file
    	End If
    	If CSVParser.TerminateParser() = False Then
    		' handle informing user of corrupt file, final record
    	End If
    End Sub
    
    Private Sub Command2_Click() ' Line by line example
    	If CSVParser Is Nothing Then Set CSVParser = New ICSVParser
    	Dim fnr As Integer, sLine As string
    	fnr = FreeFile()
    	Open "C:\Temp\TestCSV.csv" For Input As #fnr
    	CSVParser.InitializeParser True
    	Do Until EOF(fnr) = True
    		Line Input #fnr, sLine
    		If CSVParser.ParseRecord(sLine) = False Then 
    			' handle informing user of corrupt file
    			Exit Do
    		End If
    	Loop
    	Close #fnr
    	If CSVParser.TerminateParser() = False Then
    		' handle informing user of corrupt file, final record
    	End If
    End Sub
    And, just for the heck of it, a real simple example of loading a CSV to a ListView. In the parser's event:
    Code:
        Dim sData() As String, lItem As Long
        Select Case State
        Case csvRecordParsed
            If FieldDifferential > 0 Then
                Record = ""
            Else
                sData = Split(Record, vbNullChar)
                With ListView1.ListItems.Add(, , sData(0))
                    For lItem = 1 To UBound(sData)
                        .SubItems(lItem) = sData(lItem)
                    Next
                End With
            End If
                    
        Case csvFieldNamesGeneric, csvFieldNamesStatic
            sData = Split(Record, vbNullChar)
            With ListView1
                .ListItems.Clear
                .ColumnHeaders.Clear
                For lItem = 0 To HeaderCount - 1
                    .ColumnHeaders.Add , , sData(lItem)
                Next
            End With
        End Select
    Another parser can be found at this link, this site, by a well-respected coder.
    The attachment below is a class file, remove the .txt extension after downloading
    Attached Files Attached Files
    Last edited by LaVolpe; Nov 27th, 2015 at 11:27 AM.
    Insomnia is just a byproduct of, "It can't be done"

    Classics Enthusiast? Here's my 1969 Mustang Mach I Fastback. Her sister '67 Coupe has been adopted

    Newbie? Novice? Bored? Spend a few minutes browsing the FAQ section of the forum.
    Read the HitchHiker's Guide to Getting Help on the Forums.
    Here is the list of TAGs you can use to format your posts
    Here are VB6 Help Files online


    {Alpha Image Control} {Memory Leak FAQ} {Unicode Open/Save Dialog} {Resource Image Viewer/Extractor}
    {VB and DPI Tutorial} {Manifest Creator} {UserControl Button Template} {stdPicture Render Usage}

  2. #2

    Thread Starter
    VB-aholic & Lovin' It LaVolpe's Avatar
    Join Date
    Oct 2007
    Location
    Beside Waldo
    Posts
    18,642

    Some Expected FAQs

    Some questions are likely to come up, I'll try to head off some of them right here.

    Q. Do you need to prefix/suffix, with a quote, if a field that has quotes but has no other delimiters within the field?
    A. No, but how it was saved and how it is parsed, depends on whether you define a quote delimiter during the parser initialization. Let's say the field is exactly this: My Name is "LaVolpe"
    (1) If the quote delimiter is defined as " then the expected field data should look like one of these
    My Name is ""LaVolpe""
    "My Name is ""LaVolpe"""
    (2) If the quote delimiter is not defined, i.e., passed as vbNullString then the following would be expected
    My Name is "LaVolpe"

    Q. Why is the option to remove leading/trailing spaces defaulted to False?
    A. Per RFC 4180, spaces are considered part of field data. To each their own

    Q. Why do one or more of my fields have carriage returns when they do not within the CSV file?
    A. One or more of the fields have unmatched quotes and the carriage return of one record is being interpreted as part of a field. Note that the FieldDifferential parameter in the raised event should be non-zero in this case.

    Q. Why is the FieldDifferential parameter always a negative value?
    A. You have less fields than columns/header count. This is most likely the result of a corrupted CSV or empty trailing fields not filled with a field delimiter. Example: If 5 fields were expected but the processed record looked like this: 1,2 instead of this: 1,2,,,

    Q. Is there a way to automatically detect if a CSV has a header row?
    A. The real answer is No, but technically, Yes in one scenario.

    Yes, only if there is a Schema.ini file in the same folder as the CSV. That file can contain the CSV file as a section and within that section a setting that explicitly defines your CSV as having a header row or not. Of course, major assumptions need to be made: was the schema.ini updated correctly, was the CSV saved in a format that matches the schema.ini entry for that same CSV? Schema.ini files are not system files, apps & people update them.

    No, in any other scenario. It is possible to analyze field data from the 2nd row on to a portion of the records. But that logic would be a best guess. Assumptions have to be made. If every field, for a specific column, were of a specific data-type or format, say like a format of ###-##-####, then if the 1st record's same field was not that data-type/format, you could assume header row. If every field in a specific column were blank and the 1st record's same field was not blank, you might assume header row. Other commonsense comparisons could also be applied. But bottom line, it's a guess at best. Assuming a header row incorrectly would result in improperly processing the 1st row of data.

    Q. Well, if we can't detect reliably a header row, can we detect whether one definitely does not exist?
    A. 99% reliability in these two scenarios
    (1) If any field in the 1st row is blank, you could almost safely assume no header row
    (2) If any field in the 1st row has same value as any of the other fields in same row, highly likely no header row

    Q. I used Split(myFileData, vbCrLf) and will feed the parser line by line. Will the lack of carriage returns cause a problem or do I need to append carriage returns with every line?
    Q. I am using Line Input to read text from the CSV line by line, the carriage return isn't returned. Will the lack of carriage returns cause a problem or do I need to append carriage returns with every line?
    A. For both questions the answer is No. In fact, don't do it. Doing so can return those added carriage returns as part of field data, incorrectly, if your field data has carriage returns already. Passing a string that contains the entire file, carriage returns and all, is not an issue. Bottom line, don't manually add extra carriage returns, don't manually remove any.

    Q. My CSV files have extraneous blank lines between many records and at the beginning and/or end of the file, is this a problem?
    A. No. The class ignores carriage returns except in two cases. In any other scenario, they are ignored completely.
    1) If processing current field and it is quoted, but the final quote not yet encountered, the carriage return is treated as part of the field data until that final delimiter is encountered.
    2) If a field has not yet been started, carriage returns are ignored. The class does not allow carriage returns to start a record nor a field, they can end fields and end records that have been started. The class was designed to allow some malformatting with carriage returns.
    Some parsers may interpret a blank line as a record where all fields are null/empty. This class does not make that assumption & could be a option I'll add later, defined during parser initialization.
    Last edited by LaVolpe; Nov 28th, 2015 at 01:35 PM. Reason: updated FAQs
    Insomnia is just a byproduct of, "It can't be done"

    Classics Enthusiast? Here's my 1969 Mustang Mach I Fastback. Her sister '67 Coupe has been adopted

    Newbie? Novice? Bored? Spend a few minutes browsing the FAQ section of the forum.
    Read the HitchHiker's Guide to Getting Help on the Forums.
    Here is the list of TAGs you can use to format your posts
    Here are VB6 Help Files online


    {Alpha Image Control} {Memory Leak FAQ} {Unicode Open/Save Dialog} {Resource Image Viewer/Extractor}
    {VB and DPI Tutorial} {Manifest Creator} {UserControl Button Template} {stdPicture Render Usage}

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Featured


Click Here to Expand Forum to Full Width