How to handle very huge text/log files

May 4th, 2011, 07:01 AM
csKanna

How to handle very huge text/log files

Hi,

I have very large server log files: most are 500 to 600 MB, and some are over 2 GB. The files have been maintained for a few years and will continue to be maintained as they are. Each log file has between 1 million and 20 million lines.

I am looking to create an application that can find a line in a text file using a regex and remove duplicate entries.

Please let me know how these huge files can be handled so that this works very quickly.

Thanks in advance.
May 4th, 2011, 08:46 AM
kareninstructor

Re: How to handle very huge text/log files

Have you considered monitoring each log file’s physical size (or age) and archiving old entries, which would make the log information easier to open and read? The archived parts of these log files could be placed in an archive folder, with a naming convention that allows anyone to go back in time to view information. Of course, going this route would take some initial effort from a developer to run through the current log files and create many archives. I would look at StreamReader and StreamWriter.

http://msdn.microsoft.com/en-us/libr...(v=vs.71).aspx
http://msdn.microsoft.com/en-us/libr...eamwriter.aspx
May 4th, 2011, 09:17 AM
stanav

Re: How to handle very huge text/log files

With such large log files, you certainly don't want to read the whole file into memory. You could, however, read it in chunks. I'd probably use a StreamReader and StreamWriter (as suggested by Kevin) in a loop: read x number of lines, then start a new thread and pass those lines to it for further processing...
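
The chunked reading described above could be sketched roughly as follows. This is only an illustration in Python rather than the .NET StreamReader discussed in the thread; the names read_in_batches and find_matches and the default batch size are assumptions, and the sketch processes each batch in the same thread instead of handing it to a worker.

```python
import re
from itertools import islice
from typing import Iterator, List


def read_in_batches(path: str, batch_size: int = 100_000) -> Iterator[List[str]]:
    """Yield lists of at most batch_size lines, so only one batch is in memory at a time."""
    # errors="replace" keeps the scan going if a log contains bytes that are not valid UTF-8.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        while True:
            batch = list(islice(handle, batch_size))
            if not batch:
                break
            yield batch


def find_matches(path: str, pattern: str, batch_size: int = 100_000) -> Iterator[str]:
    """Lazily yield every line (without its trailing newline) that matches the regex."""
    compiled = re.compile(pattern)  # compile once, reuse for every line
    for batch in read_in_batches(path, batch_size):
        for line in batch:
            if compiled.search(line):
                yield line.rstrip("\n")
```

Because find_matches is a generator, a caller can stop after the first hit without reading the rest of the file. Each batch could instead be passed to a worker thread or process, as suggested above, if the per-line work turns out to be the bottleneck.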
May 7th, 2011, 06:03 AM
csKanna

Re: How to handle very huge text/log files

I can do that going forward, but I have tons of files that were created over the past few years, so I am looking for a solution to handle those files.
May 7th, 2011, 08:50 AM
kareninstructor

Re: How to handle very huge text/log files

Quote:

Originally Posted by csKanna

I can do that going forward, but I have tons of files that were created over the past few years, so I am looking for a solution to handle those files.

Put simply, you need to treat the large existing files no differently from the files that will be logged in the future. Decide on the maximum file size at which a split should occur, and split the file when that size is reached. If removing duplicates is a major concern, that will be the difficult part, whether it is done before or after the split operation. This is what we do as developers, and there is no way around that fact.
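
One possible way to approach the duplicate removal mentioned above is to stream the file once and remember a hash of every line already seen. The sketch below is in Python, not the .NET classes discussed in the thread; the function name remove_duplicate_lines is hypothetical, and it assumes that "duplicate" means an exact repeat of a whole line (ignoring the line ending).

```python
import hashlib


def remove_duplicate_lines(src_path: str, dst_path: str) -> int:
    """Copy src to dst, keeping only the first occurrence of each line.

    Returns the number of duplicate lines that were dropped.
    """
    seen = set()  # digests of the lines written so far
    dropped = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            # Hash the line without its line ending so "x\n" and "x\r\n" compare equal.
            digest = hashlib.sha1(line.rstrip(b"\r\n")).digest()
            if digest in seen:
                dropped += 1
                continue
            seen.add(digest)
            dst.write(line)
    return dropped
```

Storing fixed-size digests instead of the lines themselves keeps memory use independent of line length, but the set still grows with the number of distinct lines, so the largest files may need a different approach (for example, an external sort). To catch duplicates that end up in different split files, the same set would have to be shared across them.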

To get started, I would lay out the processes on paper or in software such as Visio, then work from that design. The design should include an algorithm for naming new files: a method that gets the last file name used and increments it to create the next one. If you need to search within the files, then depending on what the results are for, you could write a search utility or use a third-party search tool to search one or more files. The split utility could be run by some type of task scheduler, or it could be a manual process prompted by an event in a shared calendar.
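
As a rough illustration of the naming and splitting design described above, the following Python sketch finds the highest part number already used, increments it, and starts a new part whenever the size limit would be exceeded. The "<base>.<number>.log" naming scheme, the function names, and the choice to split only on line boundaries are all assumptions made for the example.

```python
import os
import re
from typing import List


def next_part_path(directory: str, base_name: str) -> str:
    """Return the path for the next part: last used number + 1, zero-padded."""
    pattern = re.compile(re.escape(base_name) + r"\.(\d+)\.log$")
    last = 0
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            last = max(last, int(match.group(1)))
    return os.path.join(directory, f"{base_name}.{last + 1:04d}.log")


def split_log(src_path: str, directory: str, base_name: str, max_bytes: int) -> List[str]:
    """Split src into parts of at most max_bytes each, never cutting a line in half.

    A single line longer than max_bytes still gets a part of its own.
    """
    created: List[str] = []
    out = None
    written = 0
    try:
        with open(src_path, "rb") as src:
            for line in src:
                # Start a new part at the beginning, or when this line would not fit.
                if out is None or (written > 0 and written + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    part = next_part_path(directory, base_name)
                    out = open(part, "wb")
                    created.append(part)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return created
```

Because next_part_path scans the folder each time, running the split again later continues the numbering from the last file already present rather than overwriting earlier archives.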
May 8th, 2011, 01:10 AM
ntg

Re: How to handle very huge text/log files

Quote:

Originally Posted by csKanna

I am looking to create an application that can find a line in a text file using a regex and remove duplicate entries.

Just so it's clear to me: do you need to search the files in order to remove duplicate entries, or do you just need to search in the files?