Results 1 to 2 of 2

Thread: CSV and TAB File Parser

Threaded View

  1. #1

    Thread Starter
    "The" RedHeadedLefty
    Join Date
    Aug 2005
    Location
    College Station, TX Preferred Nickname: Gig Current Mood: Just Peachy Turnons: String Manipulation
    Posts
    4,495

    CSV and TAB File Parser

    Here's a little homebrew parser function that I made to parse Tab seperated files and Comma seperated files using Regex and string manipulation. Yes you can use ADO.NET if you want, but what fun is that? (see "Turnons" ) The below function might not be the most efficient or the most pretty, but it is one that works. Some of you may wonder "why can't you just use a simple split on the delimiter?". Well, some files are in the format like:

    "Doe, John",24,1000

    Where "Doe, John" is supposed to be one value. If you split on the comma, it would return 4 values in the array and not 3, along with the quotes.

    It first uses Regex in order to find the matches of delimiters in between quotes (since you can specify commas in the actual data when you place the data inside of quotes), then checks to see if the delimiter is inside of those quotes. If so, replaces it with another string of characters in order to ease splitting.

    When that is done, you can simply split on the regular delimiter into an array of strings, and then loop through the values in the resulting array, replacing the custom delimiter back to the original delimiter character (the comma), as well as adding the data into a new row.

    ***EDIT - added option for rowheaders

    The below function takes a filename as a parameter, a delimiter as a character, and a third boolean parameter that specifies if the first row are column headers are not. The result of the function is a data table of the parsed data, which you can then manipulate however you want (only tested with comma and tab as delimiters)
    VB Code:
    1. Private Function DeLimitFile(ByVal FileName As String, ByVal Delimiter As Char, ByVal RowHeader As Boolean) As DataTable
    2.         'string that should (hopefully) not occur in the file,
    3.         'used to replace the delimiter that is inside of quotes to ease splitting
    4.         Dim AltDelimiter As String = "GeorgE W. BusH AnD JohN KerrY WerE HappilY MarrieD TodaY..."
    5.         'string to hold final delimiter string, for special escape sequence characters needed in regex
    6.         Dim FinalDelimiter As String
    7.         If Delimiter = Convert.ToChar(9) Then
    8.             'special character sequence for regex to represent a tab in the regex pattern
    9.             FinalDelimiter = "\t"
    10.         Else
    11.             FinalDelimiter = Delimiter
    12.         End If
    13.         Dim HeaderFlag As Boolean = RowHeader 'flag to indicate if rowheaders are present
    14.         Dim Pattern As String = "(?<="")[^" & FinalDelimiter & "].*?[^" & FinalDelimiter & "](?="")"
    15.         Dim Regex As New System.Text.RegularExpressions.Regex(Pattern)
    16.         Dim MyReader As New System.IO.StreamReader(FileName) 'opens file for reading
    17.         Dim MyTable As New DataTable 'to hold resulting data
    18.         Dim GetRowCount As Boolean = True 'flag to get first line column count
    19.         While MyReader.Peek <> -1
    20.             Dim MyLine As String = MyReader.ReadLine
    21.             Dim MyMatches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(MyLine)
    22.             'replaces delimiters inside of quotes, to ease splits
    23.             For Each Match As System.Text.RegularExpressions.Match In MyMatches
    24.                 Dim NewString As String = Match.Value.Replace(FinalDelimiter, AltDelimiter)
    25.                 MyLine = MyLine.Replace("""" & Match.Value & """", NewString)
    26.             Next
    27.             'splits the string on the delimiter, each regular delimiter should now be a column
    28.             Dim Splits() As String = System.Text.RegularExpressions.Regex.Split(MyLine, FinalDelimiter)
    29.             'dont know number of columns until the first line is done parsing, so here we
    30.             'get the number of columns in the datatable and add them
    31.             If GetRowCount = True Then
    32.                 For I As Integer = 0 To Splits.GetLength(0) - 1
    33.                     Dim MyColumn As New DataColumn
    34.                     'adds Column Headers if RowHeader is True
    35.                     If HeaderFlag = True Then
    36.                         MyColumn.ColumnName = Splits(I)
    37.                     End If
    38.                     MyTable.Columns.Add(MyColumn)
    39.                 Next
    40.                 GetRowCount = False
    41.             End If
    42.             'adds data in the array into the data row, replacing the alternate delimiter back to original
    43.             Dim MyRow As DataRow
    44.             If HeaderFlag = False Then
    45.                 'its ok to add to add to table
    46.                 For X As Integer = 0 To Splits.GetLength(0) - 1
    47.                     Splits(X) = Splits(X).Replace(AltDelimiter, FinalDelimiter)
    48.                 Next
    49.                 MyTable.Rows.Add(Splits)
    50.             Else
    51.                 'its a header row, so dont bother adding it and set flag to false
    52.                 HeaderFlag = False
    53.             End If
    54.         End While
    55.         MyReader.Close()
    56.         Return MyTable
    57. End Function
    And samples of usage, using a datagrid for display of the datatable:
    VB Code:
    1. 'for tab files, "9" is the value of a tab character,
    2. 'RowHeader is True, meaning first row of data in file are column headers
    3. DataGrid1.DataSource = DeLimitFile("c:\mytabfile.tab", Convert.ToChar(9), True)
    4.  
    5. 'for csv files, with False for RowHeader, meaning first row does not contain headers
    6. DataGrid1.DataSource = DeLimitFile("c:\mycsvfile.csv", ","c, False)
    It may or may not work with other characters as delimiters, didn't really test any others. If it is a special character or other whitespace characters, then it probably won't work, as you have to convert the special character to a character sequence that Regex recognizes, like what was done in the above for the Tab character.
    Last edited by gigemboy; Apr 18th, 2006 at 07:01 AM. Reason: ***ADDED Header Option to specify if first row are column headers

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width