PDA

Click to See Complete Forum and Search --> : Fastest way to compare huge lists


JoeH
Jan 16th, 2000, 10:27 PM
I have 2 lists of file paths that I need to compare and get the difference. The lists are in text files. The first one is about 10474 lines long and the other file ranges from 10474 to 37000 lines long (the longest one I've seen so far). Right now, the program reads in the first file into an array. Then the second file is opened and read in one line at a time. The read line is compared against the array. If it was not found, it is added to a listbox. Right now, it takes about 4 to 5 minutes to do the 37000 line compare on a PII 400 w/ 128 meg ram. This program need to run on a vsriety of PC's ranging from a 386 to a PII. So trying to run it on a 25 MHz 386 would take 30 minutes to compare.
Any one knows of a faster way to compare the 2 files ?

Thanks


------------------
Joe Handal
Workstation Engineer
joe.handal@carle.com

MartinLiss
Jan 16th, 2000, 10:53 PM
Instead of reading the first file into an array and then sequentially reading the array, create a keyed Collection instead. That way you can you can tell with just one read if the string from the second file exists in the first. I think this will be much faster. See my response to this post (http://www.vb-world.net/ubb/Forum1/HTML/010597.html) for an example of the technique. I would be most interested in knowing the results. Please e-mail me.

------------------
Marty

Jan 17th, 2000, 11:28 AM
Is your file sorted in any way?

If so then you can start in the middle and only compare in one direction.
For example :
if the word was monkey
you would compare the string to the middle string of your array
if it was less than then search backward from the middle of the file if not then search forward.

Also you can terminate your loop once a match is found.

Just suggestions

------------------
Boothman
There is a war out there and it is about who controls the information, it's all about the information.

HarryW
Jan 17th, 2000, 11:47 AM
You can take what Boothman said even further and continue to split the remaining records in half until you get to the record if your list is sorted. I think this is a binary search. (If you've ever done A-Level maths this should sound familiar, as you can use it for finding turning points in a function)

Another thing you could do is organise the entries into a binary tree structure. This is a relatively simple process. Once you have that you can just follow the appropriate pointers associated with the items in the list, speeding the whole thing up a lot, especially with that many records :)

JoeH
Jan 17th, 2000, 09:05 PM
I tried MartinLiss's suggestion using Collections and it was very fast.
I timed the old way and the new way. The old way took 6.5 minutes, while the new way using a collection took only 25 seconds WOW!
The only problem with that is that the collection takes up 16 byts of memory no matter what data type is being stored in it. Other than that, collections are great.

Thanks a lot for your help