  1. #1

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2008
    Location
    Kent, England
    Posts
    676

    Reading text file length reports differently

    Hi Guys,

    I have a piece of software I use, which for all intents and purposes is a "grep" tool with a GUI.

    As part of the process, I read in the working file's length to set the maximum value of the progress bar.

    Code:
    srTotalLineCount = File.ReadAllLines(srInputFile).Length
    This line of code works perfectly and generates the correct response (13,609,986 lines). The only problem is that it reads the entire file into my application's memory, so I can end up with an app running at 4GB.

    Code:
    srTotalLineCount = IO.File.OpenRead(srInputFile).Length
    This code works (in the sense that it doesn't read in a 4GB file) but reports an apparent line count of 10,633,270,273, which is entirely inaccurate. That hurts my software's ability to work efficiently and throws off the progress bar and other counters.

    Surely there is a correct answer, where I can read the correct length of a file, and get the correct answer, without reading the entire file into memory?

    James
    "Wisdom is only truly achieved, when you realise you dont know everything" ... I must be a genius because I always have to ask stupid questions...

    Pointing an idiot like me in the right direction, is always appreciated by the idiot, explaining how to do what you have pointed the idiot to, is appreciated even more. I apologise to all experienced coders who will think I am an idiot, you are right, I am an idiot, but I am an idiot who is trying to learn

  2. #2
    .NUT jmcilhinney's Avatar
    Join Date
    May 2005
    Location
    Sydney, Australia
    Posts
    99,149

    Re: Reading text file length reports differently

    That second code snippet isn't reading lines. It's creating a FileStream and the Length of that is the number of bytes in the file.

    If you want to know how many lines there are in the file, you have no choice but to read the whole file, but you can avoid reading it all at once. The ReadLines method reads one line at a time, unlike ReadAllLines, which reads them all at once. Use this:
    vb.net Code:
    srTotalLineCount = File.ReadLines(srInputFile).Count()
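    To make the difference between the two numbers concrete, here is a minimal sketch, assuming a console project. The module name and the temporary sample file are made up for illustration and are not part of the original tool. It reads the size in bytes from the file's metadata and counts the lines by streaming them. LongCount() is used instead of Count() only so the result cannot overflow an Integer on very large files.

    ```vb.net
    Imports System
    Imports System.IO
    Imports System.Linq

    Module LineCountDemo
        Sub Main()
            ' Hypothetical sample file, created only for this demonstration.
            Dim samplePath As String = Path.GetTempFileName()
            Dim sampleLines As String() = {"alpha", "beta", "gamma"}
            File.WriteAllLines(samplePath, sampleLines)

            ' Size in bytes: taken from file metadata, no content is read.
            Dim byteCount As Long = New FileInfo(samplePath).Length

            ' Number of lines: streams the file one line at a time.
            Dim lineCount As Long = File.ReadLines(samplePath).LongCount()

            Console.WriteLine($"Bytes: {byteCount}, Lines: {lineCount}")
            File.Delete(samplePath)
        End Sub
    End Module
    ```

    FileInfo.Length also avoids opening a FileStream just to ask for its Length, so there is no stream left undisposed.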
    Why is my data not saved to my database? | MSDN Data Walkthroughs
    VBForums Database Development FAQ
    My CodeBank Submissions: VB | C#
    My Blog: Data Among Multiple Forms (3 parts)
    Beginner Tutorials: VB | C# | SQL

  3. #3

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2008
    Location
    Kent, England
    Posts
    676

    Re: Reading text file length reports differently

    Thanks for the reply.

    Would this prevent the memory being drained when opening this application? And does it clean up after itself as necessary?

    James

  4. #4
    .NUT jmcilhinney's Avatar
    Join Date
    May 2005
    Location
    Sydney, Australia
    Posts
    99,149

    Re: Reading text file length reports differently

    I've never tested it but I would expect so. I would think that the GC would likely be invoked during the operation or at the end at least. You can always test for yourself though.

  5. #5
    You don't want to know.
    Join Date
    Aug 2010
    Posts
    4,580

    Re: Reading text file length reports differently

    Count()'s implementation is a black box (I mean, you can go look at the reference source, I guess), but we can make some guesses.

    The main reason you have to worry about it in LINQ is Count() can't return without iterating the collection. That means it's a terror for an infinite sequence but "just slow" for a large sequence. We don't like to use Count() when we can avoid it, but if you want to count all of the elements you can't avoid it.

    Problem: all of this code is synchronous, so for a large file it'll take a perceptible amount of time to get the number of lines. It'd be more polite to use an asynchronous background task to update the line count. Microsoft wasn't kind enough to put actual async methods on the File Class, but we can make do.

    So I'd write something like this:
    Code:
    Public Function CountLines() As Task(Of Integer)
        ' srInputFile is the path variable from the first post.
        Return Task.Run(Function()
                            Return File.ReadLines(srInputFile).Count()
                        End Function)
    End Function
    This wraps the job of counting the lines in a Task(Of Integer), which is a type that represents "an operation that will eventually produce an Integer". Task.Run does this work on a thread-pool worker thread.

    The easiest way to use it would be to write some event handler like this:
    Code:
    Private Async Sub YourEventHandlerName(...) Handles ...
        lblLineCount.Text = "Lines: <calculating...>"
        Dim numberOfLines = Await CountLines()
        lblLineCount.Text = $"Lines: {numberOfLines}"
    End Sub
    The 'Async' keyword must be part of the event handler's declaration; it tells the compiler you want to use the 'Await' keyword inside the method. It does some neat magic, so you do want to use it.

    This method sets the line count label to tell the user the lines are being counted, then uses the 'Await' keyword to call CountLines(). This tells the compiler to suspend this event handler without blocking the UI thread, then resume it when CountLines() finishes. When it resumes, the label can be updated.

    Writing code like this means you can load the file a little faster and let the line count magically appear when it finishes.

    Now, philosophically speaking...

    While this may not keep the entire file in memory, iterating all of the lines is still an operation that'll muck up your memory in a lot of ways. At the very least, you're likely to incur several garbage collections. Odds are, for your tool to do its job, you already have to iterate the file at least once. If you can, it'd be best to count the lines while doing that.
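    Counting during the main pass could look something like the sketch below. IsMatch is a hypothetical stand-in for whatever the tool really does with each line, and the module and function names are made up for illustration.

    ```vb.net
    Imports System
    Imports System.IO

    Module SinglePassDemo
        ' Hypothetical stand-in for the tool's real per-line work.
        Function IsMatch(line As String, term As String) As Boolean
            Return line.Contains(term)
        End Function

        ' Scans the file once and returns (match count, total line count).
        Function ScanFile(filePath As String, term As String) As Tuple(Of Long, Long)
            Dim totalLines As Long = 0
            Dim matches As Long = 0
            For Each line In File.ReadLines(filePath)
                totalLines += 1
                If IsMatch(line, term) Then matches += 1
            Next
            Return Tuple.Create(matches, totalLines)
        End Function
    End Module
    ```

    The total is only known once the pass is over, so this suits a "lines processed" counter better than a progress bar maximum.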

    My personal choice, unless a customer just NEEDED the line count, would be to display the total number of bytes in the FileStream. That still gives the user an idea of the file's size, but you can get that instantaneously without iteration. If line count is important, I'd sort of want to put it off or make them push a button to ask for it. I might experiment with memory-mapped files to see if they let me not trash the GC in this case; I've always wanted a reason to play with those. Maybe the upcoming Span(Of T) type could help, too? "Big text files" have always been a sore spot in .NET memory management.
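    For the progress bar specifically, the byte count can serve as the maximum. A rough sketch, assuming the tool reads through a StreamReader; ProcessWithProgress and the reportPercent callback are hypothetical names, not anything from the original code:

    ```vb.net
    Imports System
    Imports System.IO

    Module ByteProgressDemo
        ' Reads the file line by line and reports progress as a percentage of bytes consumed.
        Sub ProcessWithProgress(filePath As String, reportPercent As Action(Of Integer))
            Using stream = File.OpenRead(filePath)
                Using reader As New StreamReader(stream)
                    Dim totalBytes As Long = stream.Length
                    Dim lastPercent As Integer = -1
                    Dim line As String = reader.ReadLine()
                    While line IsNot Nothing
                        ' ... the tool's real per-line work would go here ...

                        ' Position runs slightly ahead because StreamReader buffers its reads,
                        ' which is close enough for a progress bar. totalBytes is never zero
                        ' here, because an empty file yields no lines.
                        Dim percent As Integer = CInt(stream.Position * 100 \ totalBytes)
                        If percent <> lastPercent Then
                            lastPercent = percent
                            reportPercent(percent)
                        End If
                        line = reader.ReadLine()
                    End While
                End Using
            End Using
        End Sub
    End Module
    ```

    The callback fires at most once per percentage point, so the UI isn't updated for each of several million lines. If this runs on a worker thread, the callback still has to get back onto the UI thread before it touches the progress bar.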
    This answer is wrong. You should be using TableAdapter and Dictionaries instead.
