[3.0/LINQ] How would you improve this
Hey all,
The objective is to read a massive text file (200+ MB).
Each line is an "entry".
Then output the total for each entry.
rob
rob
rhaps
tim
output
rob =2
rhaps =1
tim =1
the code i came up with is here
c# Code:
System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
Dictionary<String,int> Count = new Dictionary<string,int>();
string path = @"C:\Documents and Settings\Rob\Desktop\test.txt";
int num = 0;
sw.Reset();
sw.Start();
using (StreamReader sr = new StreamReader(path))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        if (Count.ContainsKey(line))
        {
            Count[line]++;
        }
        else
        {
            Count.Add(line, 0);
        }
        num++;
    }
}
foreach (KeyValuePair<string, int> kvp in Count)
{
    Console.WriteLine(kvp.Key+": "+kvp.Value.ToString());
}
sw.Stop();
long time = sw.ElapsedMilliseconds;
Console.WriteLine("Time = " + time + " milliseconds.");
Console.WriteLine("Total Rows: " + num.ToString());
Console.WriteLine("lines per ms: "+ (num / time).ToString());
I was thinking of some sort of parallel while loop, but I'm not sure how to do that, or if it even exists, and if it did exist, whether it would work with the StreamReader.
The other idea was to load it into a DB and then run a SQL command to get the count or something, but I'm not sure if that would be faster or not.
Any ideas are greatly appreciated.
Thanks
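Since the thread is tagged LINQ, here is a minimal sketch of the same count written as a GroupBy query. The class and method names (LinqLineCounter, CountLines) are made up for illustration and are not from the code above. One caveat I am fairly sure of: GroupBy holds every line in memory until the grouping is complete, so on a 200+ MB file it may well use more memory than the dictionary loop.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original post.
static class LinqLineCounter
{
    // Groups identical lines together and maps each distinct line to its count.
    public static Dictionary<string, int> CountLines(IEnumerable<string> lines)
    {
        return lines
            .GroupBy(line => line)
            .ToDictionary(group => group.Key, group => group.Count());
    }
}
```

On .NET 4 or later you could feed it the file lazily with LinqLineCounter.CountLines(File.ReadLines(path)); on 3.5 you would need your own line iterator around StreamReader.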
Re: [3.0/LINQ] How would you improve this
Well, I caught one error already:
Count.Add(line, 0) should be Count.Add(line, 1), because it's a total, not zero-based.
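For what it's worth, the corrected counting step can be sketched with a single TryGetValue lookup instead of ContainsKey plus the indexer. This is only an illustration: the LineCounter class and its CountLines method are hypothetical names, and it takes a TextReader so it works with a StreamReader or a StringReader.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the original post.
static class LineCounter
{
    // Counts how many times each distinct line occurs in the reader's text.
    public static Dictionary<string, int> CountLines(TextReader reader)
    {
        var counts = new Dictionary<string, int>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int current;
            // TryGetValue leaves current at 0 when the key is missing,
            // so the first sighting of a line is stored as 1.
            counts.TryGetValue(line, out current);
            counts[line] = current + 1;
        }
        return counts;
    }
}
```

Usage would be something like LineCounter.CountLines(new StreamReader(path)) inside a using block.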
Re: [3.0/LINQ] How would you improve this
If it is a 200 MB file, it'll take some time.
You can use a BackgroundWorker (or a thread) to run the process in the background. You could then invoke delegate methods to update the UI to let the user know what is going on, and give them an option to cancel (obviously you'd need to switch to a Windows app instead).
A count through SQL would very likely be faster than opening the file, reading through the entire thing and then telling someone what the count is. The only drawback is the time that it would take SQL to import the file.
Re: [3.0/LINQ] How would you improve this
I'm just doing a console application.
It's pretty fast now: 200 MB in about 20 seconds.
I was just thinking that if I could spread the work over two CPUs it might be a bit faster, but I'm not sure how to coordinate the threading.
I was hoping that there was something built in to do it for me.
Re: [3.0/LINQ] How would you improve this
There is: the System.Threading namespace. I'm not an expert, but I'm not sure you can "divide" the work between the processors. Although you can create new threads and run different methods in different ones, I think that you don't get a lot of control over which processor actually does the work.
If I'm wrong though, someone please correct me :)
Re: [3.0/LINQ] How would you improve this
I think that the processor isn't the bottleneck in this situation, but the drive. So dividing the workload between different cores/processors won't speed it up very much. But MS has a parallel extension (sort of beta) that you could try.
Re: [3.0/LINQ] How would you improve this
I'd recommend the Parallel FX framework too. It should help you take advantage of the multi-core processors.
Have the threads run through the file and then update the Dictionary. Lock the dictionary when updating the corresponding int.
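A rough sketch of what that could look like, assuming .NET 4 or later, where the Parallel FX work shipped as System.Threading.Tasks.Parallel. The ParallelLineCounter class, the CountLines method and the gate lock object are names made up for this example. Since every update happens inside the lock, I would not expect this to beat the single-threaded loop; it only shows the locking pattern.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original thread.
static class ParallelLineCounter
{
    // Counts lines on multiple threads, guarding the shared dictionary with a lock.
    public static Dictionary<string, int> CountLines(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        var gate = new object(); // all dictionary access goes through this lock

        Parallel.ForEach(lines, line =>
        {
            lock (gate)
            {
                int current;
                // current stays 0 when the line has not been seen yet
                counts.TryGetValue(line, out current);
                counts[line] = current + 1;
            }
        });

        return counts;
    }
}
```

It could be called as ParallelLineCounter.CountLines(File.ReadLines(path)), which reads the file lazily while the worker threads update the counts.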
Re: [3.0/LINQ] How would you improve this
The problem isn't that you are using 1 thread. The problem is with how you are using StreamReader. StreamReader uses a FileStream and reads 1 KB at a time by default. So you are accessing the disk 204,800 times while parsing that 200 MB file.
Try changing
using (StreamReader sr = new StreamReader(path))
to
using (StreamReader sr = new StreamReader(path, Encoding.UTF8, true, 0x100000/*1mb*/))
That should increase the speed.
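If you want to measure the difference yourself rather than take it on faith, here is a small timing sketch. BufferTiming and TimeRead are hypothetical names for this example; it only uses the StreamReader(path, encoding, detectEncoding, bufferSize) constructor shown above plus Stopwatch.

```csharp
using System.Diagnostics;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original thread.
static class BufferTiming
{
    // Reads every line of the file with the given StreamReader buffer size.
    // Returns the elapsed milliseconds; lineCount receives the number of lines read.
    public static long TimeRead(string path, int bufferSize, out int lineCount)
    {
        lineCount = 0;
        Stopwatch sw = Stopwatch.StartNew();
        using (StreamReader sr = new StreamReader(path, Encoding.UTF8, true, bufferSize))
        {
            while (sr.ReadLine() != null)
            {
                lineCount++;
            }
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

Call it once with 1024 and once with 0x100000 on the same file. The operating system's file cache will probably favor whichever run comes second, so it is worth repeating each measurement a few times before drawing conclusions.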
Re: [3.0/LINQ] How would you improve this
High6,
I gave that a try and it gave some improvement (generally around 100 extra lines per ms).
The thing I thought I saw, and I could be completely off my rocker, was a parallel foreach.
Basically, it would take a look at the work and determine how many threads were necessary,
then it would take each part of the foreach and split it up:
thread 1 reads the first fifth, thread 2 reads the next fifth, and so on,
and it knew to take care of this.
I could be thinking of Perl, maybe.
Re: [3.0/LINQ] How would you improve this
I should also point out that this is not any sort of work-related project; it's just for the hell of it,
and by "hell of it" I mean I'm trying to show up my Perl-scripting coworker again.
:)
Re: [3.0/LINQ] How would you improve this
Tell me, do you know the names beforehand?
Re: [3.0/LINQ] How would you improve this
What about this: get the file and make a copy. Load the copy into memory, then go through the list. When we hit a unique name, create a regex for that name and use the regex to (1) count the occurrences and (2) replace each matching line with a null, so every time your if() hits one of these lines it just skips over it. When the regex has finished, move to the next unique line and repeat. Could anyone verify whether this would give a speed improvement, since you are only parsing unique lines?
Re: [3.0/LINQ] How would you improve this
Unfortunately, regex won't help with your speed here; it is slow even with RegexOptions.Compiled.
Re: [3.0/LINQ] How would you improve this