-
Aug 29th, 2023, 05:26 PM
#1
Thread Starter
Frenzied Member
Ideas for handling large CSV files with malformed lines
I have a program that routinely extracts data from a 237 MB CSV directly to memory with many columns and over half a million lines using TextFieldParser. The creator of the file generator was sloppy and I think just used a join command to make each line without enclosing in quotes and their data entry has inadequate validation. I'm looking at one field on a line around 18,000 that starts with a comma then three quotes and there are no more quotes until a line in the 38k range, so it's seeing all the rows in between as one field and those lines are not being imported. And, as you might expect. Commas sometime appear in string fields and the perceived field count it one too many. A single quote not adjacent to a comma... It's a long list of modes and the detected number of malformed is growing, about 20 now.
I've been reviewing all the exceptions and methods on the MS site and I think I have a good understanding of its capabilities. Now I'm seeking sage advice from programmers on strategies for handling this kind of scenario. I can't make the government fix their data reliably and even when they do, new data is introduced and more malformed line exceptions are created. I don't see a good way to stop and have a user make a correction easily because the malformed line can end up being multiple lines. Do you stash the file on disc and manually fix it and have the program try it again until it runs clean? Maybe just kick out the malformed lines? Any other ideas?
Thanks for taking the time to read my message.
-
Aug 30th, 2023, 08:51 AM
#2
Re: Ideas for handling large CSV files with malformed lines
Sounds like TextFieldParser won't get you anywhere with this crippled source data.
I'd switch to a custom read/parsing function that first splits the file into lines and then validates each single line independently. This will prevent a missing quotation mark resulting in subsequent lines beeing skipped. With 237MB Files you could process the entire file in one go without paging which makes things easier.
Nevertheless, this is working around a problem that lies somewhere else. You need to be aware of this and make it clear to your users/client because down the road someone might fix the bug or replace the software that creates the source files and they will come back to you asking what rubbish tool you created....
-
Aug 30th, 2023, 09:33 AM
#3
Re: Ideas for handling large CSV files with malformed lines
Post the malformed lines or some of them.
-
Aug 30th, 2023, 09:48 AM
#4
Re: Ideas for handling large CSV files with malformed lines
I'm working on a tool to improve the import process with text files ... still in the early stages, but if youi're willing to post some good & bad data as a sample, I'll look to see if it's something that can be dealt with.
-tg
-
Aug 30th, 2023, 10:18 AM
#5
Thread Starter
Frenzied Member
Re: Ideas for handling large CSV files with malformed lines
Well the one bad line I was mainly mentioning is 20,000 lines long. I can make some samples later today though.
Please understand I am not looking for a method where the program fixes the lines. I'm certain I will need to manually do this. I was wondering if people tend to design the program to sop and have the user do things, save to file and abort, or something else. I don't think one can back up the cursor to re-do a line.
-
Aug 30th, 2023, 10:38 AM
#6
Re: Ideas for handling large CSV files with malformed lines
I was wondering if people tend to design the program to sop and have the user do things, save to file and abort, or something else.
i dont think there is a general answer to the question. or at least only this: fail gracefully.
it depends on things like: is your program automated running in the background? vs user attended? how often is it run? etc.
if you are already able to detect the malformed lines, and it is an option to get the file manually corrected, why not. the easiest option would then just be to dump the error lines to a log.
a manual line by line approach would be able to detect the line with the error but resume with the line after the next environment.newline while you say your current approch skipps lines until the closure quote. more work though.
implementing a logic to even parse incorrect lines as far as possible is risky and i'd only implement this if really required and no chance to have a human look into it. e.g. some critical background job. but still would log these lines and have a human review them asap.
but... it really depends.
-
Aug 31st, 2023, 02:42 AM
#7
Re: Ideas for handling large CSV files with malformed lines
If you really have such crappy file-content, IMO the only way is to write your own parser.
Starting with "The end of a Line is denoted by a CR/LF" (Depending on the "source" of the Text-File, --> is it CR, LF, CRLF, w/o BOM, UTF8, blablabla)
That said: i loath CSV-Files with comma-delimiter, especially if you're in a locale which uses comma as decimal separator
Why they cannot use an "unusual" character to delimit fields (like the Pipe-Symbol) is a mystery to me....
Nevermind, that in todays times there are still people and programs using CSV instead of JSON or XML...
Last edited by Zvoni; Tomorrow at 31:69 PM.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------------------
People call me crazy because i'm jumping out of perfectly fine airplanes.
---------------------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad
-
Aug 31st, 2023, 06:18 AM
#8
Re: Ideas for handling large CSV files with malformed lines
IMO issues with CSV files tend to be that creators have a tendency not to follow the rules - but produce something that's sort of csv but not quite.
https://docs.fileformat.com/spreadsheet/csv/
https://datatracker.ietf.org/doc/html/rfc4180
Often these 'not quite correct' formats require their own processing rather than using a 'standards compliant parser'.
All advice is offered in good faith only. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/
C++23 Compiler: Microsoft VS2022 (17.6.5)
-
Aug 31st, 2023, 06:36 AM
#9
Re: Ideas for handling large CSV files with malformed lines
 Originally Posted by 2kaud
IMO issues with CSV files tend to be that creators have a tendency not to follow the rules - but produce something that's sort of csv but not quite.
agreed, but this is not limited you csv. have you ever seen an xml file containing unencoded ampersand's or "less than" signs? I have. The Devs just concated strings to create the xml instead of using validated functions/objects that ensure correct encoding...
-
Aug 31st, 2023, 06:47 AM
#10
Re: Ideas for handling large CSV files with malformed lines
agreed, but this is not limited you csv
Unfortunately - you are correct...
All advice is offered in good faith only. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/
C++23 Compiler: Microsoft VS2022 (17.6.5)
-
Aug 31st, 2023, 06:52 AM
#11
Re: Ideas for handling large CSV files with malformed lines
 Originally Posted by 2kaud
IMO issues with CSV files tend to be that creators have a tendency not to follow the rules
Funny you should mention that....
From the first post:
I can't make the government fix their data reliably and even when they do, new data is introduced and more malformed line exceptions are created.
Last edited by Zvoni; Tomorrow at 31:69 PM.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------------------
People call me crazy because i'm jumping out of perfectly fine airplanes.
---------------------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad
-
Aug 31st, 2023, 07:08 AM
#12
Re: Ideas for handling large CSV files with malformed lines
 Originally Posted by Zvoni
Funny you should mention that....
From the first post:
 Originally Posted by cory_jackson
I can't make the government fix their data reliably and even when they do, new data is introduced and more malformed line exceptions are created. I don't see a good way to stop and have a user make a correction easily because the malformed line can end up being multiple lines.
Well, I can honestly say it isn't one of my files ... I don't just string things together. That said, that's what happens when you go with the lowest bidder who is probably outsourcing bad requirements to some off-shore coding sweatshop.
-tg
Tags for this Thread
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|