-
Apr 16th, 2006, 02:28 AM
#1
CSV and TAB File Parser
Here's a little homebrew parser function that I made to parse Tab seperated files and Comma seperated files using Regex and string manipulation. Yes you can use ADO.NET if you want, but what fun is that? (see "Turnons" ) The below function might not be the most efficient or the most pretty, but it is one that works. Some of you may wonder "why can't you just use a simple split on the delimiter?". Well, some files are in the format like:
"Doe, John",24,1000
Where "Doe, John" is supposed to be one value. If you split on the comma, it would return 4 values in the array and not 3, along with the quotes.
It first uses Regex in order to find the matches of delimiters in between quotes (since you can specify commas in the actual data when you place the data inside of quotes), then checks to see if the delimiter is inside of those quotes. If so, replaces it with another string of characters in order to ease splitting.
When that is done, you can simply split on the regular delimiter into an array of strings, and then loop through the values in the resulting array, replacing the custom delimiter back to the original delimiter character (the comma), as well as adding the data into a new row.
***EDIT - added option for rowheaders
The below function takes a filename as a parameter, a delimiter as a character, and a third boolean parameter that specifies if the first row are column headers are not. The result of the function is a data table of the parsed data, which you can then manipulate however you want (only tested with comma and tab as delimiters)
VB Code:
Private Function DeLimitFile(ByVal FileName As String, ByVal Delimiter As Char, ByVal RowHeader As Boolean) As DataTable
'string that should (hopefully) not occur in the file,
'used to replace the delimiter that is inside of quotes to ease splitting
Dim AltDelimiter As String = "GeorgE W. BusH AnD JohN KerrY WerE HappilY MarrieD TodaY..."
'string to hold final delimiter string, for special escape sequence characters needed in regex
Dim FinalDelimiter As String
If Delimiter = Convert.ToChar(9) Then
'special character sequence for regex to represent a tab in the regex pattern
FinalDelimiter = "\t"
Else
FinalDelimiter = Delimiter
End If
Dim HeaderFlag As Boolean = RowHeader 'flag to indicate if rowheaders are present
Dim Pattern As String = "(?<="")[^" & FinalDelimiter & "].*?[^" & FinalDelimiter & "](?="")"
Dim Regex As New System.Text.RegularExpressions.Regex(Pattern)
Dim MyReader As New System.IO.StreamReader(FileName) 'opens file for reading
Dim MyTable As New DataTable 'to hold resulting data
Dim GetRowCount As Boolean = True 'flag to get first line column count
While MyReader.Peek <> -1
Dim MyLine As String = MyReader.ReadLine
Dim MyMatches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(MyLine)
'replaces delimiters inside of quotes, to ease splits
For Each Match As System.Text.RegularExpressions.Match In MyMatches
Dim NewString As String = Match.Value.Replace(FinalDelimiter, AltDelimiter)
MyLine = MyLine.Replace("""" & Match.Value & """", NewString)
Next
'splits the string on the delimiter, each regular delimiter should now be a column
Dim Splits() As String = System.Text.RegularExpressions.Regex.Split(MyLine, FinalDelimiter)
'dont know number of columns until the first line is done parsing, so here we
'get the number of columns in the datatable and add them
If GetRowCount = True Then
For I As Integer = 0 To Splits.GetLength(0) - 1
Dim MyColumn As New DataColumn
'adds Column Headers if RowHeader is True
If HeaderFlag = True Then
MyColumn.ColumnName = Splits(I)
End If
MyTable.Columns.Add(MyColumn)
Next
GetRowCount = False
End If
'adds data in the array into the data row, replacing the alternate delimiter back to original
Dim MyRow As DataRow
If HeaderFlag = False Then
'its ok to add to add to table
For X As Integer = 0 To Splits.GetLength(0) - 1
Splits(X) = Splits(X).Replace(AltDelimiter, FinalDelimiter)
Next
MyTable.Rows.Add(Splits)
Else
'its a header row, so dont bother adding it and set flag to false
HeaderFlag = False
End If
End While
MyReader.Close()
Return MyTable
End Function
And samples of usage, using a datagrid for display of the datatable:
VB Code:
'for tab files, "9" is the value of a tab character,
'RowHeader is True, meaning first row of data in file are column headers
DataGrid1.DataSource = DeLimitFile("c:\mytabfile.tab", Convert.ToChar(9), True)
'for csv files, with False for RowHeader, meaning first row does not contain headers
DataGrid1.DataSource = DeLimitFile("c:\mycsvfile.csv", ","c, False)
It may or may not work with other characters as delimiters, didn't really test any others. If it is a special character or other whitespace characters, then it probably won't work, as you have to convert the special character to a character sequence that Regex recognizes, like what was done in the above for the Tab character.
Last edited by gigemboy; Apr 18th, 2006 at 07:01 AM.
Reason: ***ADDED Header Option to specify if first row are column headers
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|