Results 1 to 12 of 12

Thread: Ideas for handling large CSV files with malformed lines

  1. #1

    Thread Starter
    Frenzied Member cory_jackson's Avatar
    Join Date
    Dec 2011
    Location
    Fallbrook, California
    Posts
    1,100

    Ideas for handling large CSV files with malformed lines

    I have a program that routinely extracts data from a 237 MB CSV directly to memory with many columns and over half a million lines using TextFieldParser. The creator of the file generator was sloppy and I think just used a join command to make each line without enclosing in quotes and their data entry has inadequate validation. I'm looking at one field on a line around 18,000 that starts with a comma then three quotes and there are no more quotes until a line in the 38k range, so it's seeing all the rows in between as one field and those lines are not being imported. And, as you might expect. Commas sometime appear in string fields and the perceived field count it one too many. A single quote not adjacent to a comma... It's a long list of modes and the detected number of malformed is growing, about 20 now.
    I've been reviewing all the exceptions and methods on the MS site and I think I have a good understanding of its capabilities. Now I'm seeking sage advice from programmers on strategies for handling this kind of scenario. I can't make the government fix their data reliably and even when they do, new data is introduced and more malformed line exceptions are created. I don't see a good way to stop and have a user make a correction easily because the malformed line can end up being multiple lines. Do you stash the file on disc and manually fix it and have the program try it again until it runs clean? Maybe just kick out the malformed lines? Any other ideas?

    Thanks for taking the time to read my message.

  2. #2
    Frenzied Member
    Join Date
    May 2014
    Location
    Central Europe
    Posts
    1,368

    Re: Ideas for handling large CSV files with malformed lines

    Sounds like TextFieldParser won't get you anywhere with this crippled source data.
    I'd switch to a custom read/parsing function that first splits the file into lines and then validates each single line independently. This will prevent a missing quotation mark resulting in subsequent lines beeing skipped. With 237MB Files you could process the entire file in one go without paging which makes things easier.
    Nevertheless, this is working around a problem that lies somewhere else. You need to be aware of this and make it clear to your users/client because down the road someone might fix the bug or replace the software that creates the source files and they will come back to you asking what rubbish tool you created....

  3. #3
    Powered By Medtronic dbasnett's Avatar
    Join Date
    Dec 2007
    Location
    Jefferson City, MO
    Posts
    9,687

    Re: Ideas for handling large CSV files with malformed lines

    Post the malformed lines or some of them.
    My First Computer -- Documentation Link (RT?M) -- Using the Debugger -- Prime Number Sieve
    Counting Bits -- Subnet Calculator -- UI Guidelines -- >> SerialPort Answer <<

    "Those who use Application.DoEvents have no idea what it does and those who know what it does never use it." John Wein

  4. #4
    Smooth Moperator techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,385

    Re: Ideas for handling large CSV files with malformed lines

    I'm working on a tool to improve the import process with text files ... still in the early stages, but if youi're willing to post some good & bad data as a sample, I'll look to see if it's something that can be dealt with.

    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  5. #5

    Thread Starter
    Frenzied Member cory_jackson's Avatar
    Join Date
    Dec 2011
    Location
    Fallbrook, California
    Posts
    1,100

    Re: Ideas for handling large CSV files with malformed lines

    Well the one bad line I was mainly mentioning is 20,000 lines long. I can make some samples later today though.
    Please understand I am not looking for a method where the program fixes the lines. I'm certain I will need to manually do this. I was wondering if people tend to design the program to sop and have the user do things, save to file and abort, or something else. I don't think one can back up the cursor to re-do a line.

  6. #6
    Frenzied Member
    Join Date
    May 2014
    Location
    Central Europe
    Posts
    1,368

    Re: Ideas for handling large CSV files with malformed lines

    I was wondering if people tend to design the program to sop and have the user do things, save to file and abort, or something else.
    i dont think there is a general answer to the question. or at least only this: fail gracefully.
    it depends on things like: is your program automated running in the background? vs user attended? how often is it run? etc.
    if you are already able to detect the malformed lines, and it is an option to get the file manually corrected, why not. the easiest option would then just be to dump the error lines to a log.
    a manual line by line approach would be able to detect the line with the error but resume with the line after the next environment.newline while you say your current approch skipps lines until the closure quote. more work though.
    implementing a logic to even parse incorrect lines as far as possible is risky and i'd only implement this if really required and no chance to have a human look into it. e.g. some critical background job. but still would log these lines and have a human review them asap.
    but... it really depends.

  7. #7
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,154

    Re: Ideas for handling large CSV files with malformed lines

    If you really have such crappy file-content, IMO the only way is to write your own parser.
    Starting with "The end of a Line is denoted by a CR/LF" (Depending on the "source" of the Text-File, --> is it CR, LF, CRLF, w/o BOM, UTF8, blablabla)

    That said: i loath CSV-Files with comma-delimiter, especially if you're in a locale which uses comma as decimal separator
    Why they cannot use an "unusual" character to delimit fields (like the Pipe-Symbol) is a mystery to me....

    Nevermind, that in todays times there are still people and programs using CSV instead of JSON or XML...
    Last edited by Zvoni; Tomorrow at 31:69 PM.
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    One System to rule them all, One Code to find them,
    One IDE to bring them all, and to the Framework bind them,
    in the Land of Redmond, where the Windows lie
    ---------------------------------------------------------------------------------
    People call me crazy because i'm jumping out of perfectly fine airplanes.
    ---------------------------------------------------------------------------------
    Code is like a joke: If you have to explain it, it's bad

  8. #8
    Fanatic Member 2kaud's Avatar
    Join Date
    May 2014
    Location
    England
    Posts
    916

    Re: Ideas for handling large CSV files with malformed lines

    IMO issues with CSV files tend to be that creators have a tendency not to follow the rules - but produce something that's sort of csv but not quite.
    https://docs.fileformat.com/spreadsheet/csv/
    https://datatracker.ietf.org/doc/html/rfc4180

    Often these 'not quite correct' formats require their own processing rather than using a 'standards compliant parser'.
    All advice is offered in good faith only. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  9. #9
    Frenzied Member
    Join Date
    May 2014
    Location
    Central Europe
    Posts
    1,368

    Re: Ideas for handling large CSV files with malformed lines

    Quote Originally Posted by 2kaud View Post
    IMO issues with CSV files tend to be that creators have a tendency not to follow the rules - but produce something that's sort of csv but not quite.
    agreed, but this is not limited you csv. have you ever seen an xml file containing unencoded ampersand's or "less than" signs? I have. The Devs just concated strings to create the xml instead of using validated functions/objects that ensure correct encoding...

  10. #10
    Fanatic Member 2kaud's Avatar
    Join Date
    May 2014
    Location
    England
    Posts
    916

    Re: Ideas for handling large CSV files with malformed lines

    agreed, but this is not limited you csv
    Unfortunately - you are correct...
    All advice is offered in good faith only. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  11. #11
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,154

    Re: Ideas for handling large CSV files with malformed lines

    Quote Originally Posted by 2kaud View Post
    IMO issues with CSV files tend to be that creators have a tendency not to follow the rules
    Funny you should mention that....
    From the first post:
    I can't make the government fix their data reliably and even when they do, new data is introduced and more malformed line exceptions are created.
    Last edited by Zvoni; Tomorrow at 31:69 PM.
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    One System to rule them all, One Code to find them,
    One IDE to bring them all, and to the Framework bind them,
    in the Land of Redmond, where the Windows lie
    ---------------------------------------------------------------------------------
    People call me crazy because i'm jumping out of perfectly fine airplanes.
    ---------------------------------------------------------------------------------
    Code is like a joke: If you have to explain it, it's bad

  12. #12
    Smooth Moperator techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,385

    Re: Ideas for handling large CSV files with malformed lines

    Quote Originally Posted by Zvoni View Post
    Funny you should mention that....
    From the first post:

    Quote Originally Posted by cory_jackson View Post
    I can't make the government fix their data reliably and even when they do, new data is introduced and more malformed line exceptions are created. I don't see a good way to stop and have a user make a correction easily because the malformed line can end up being multiple lines.
    Well, I can honestly say it isn't one of my files ... I don't just string things together. That said, that's what happens when you go with the lowest bidder who is probably outsourcing bad requirements to some off-shore coding sweatshop.


    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width