Algorithm to merge files

**mutley** · Mar 27th, 2015, 01:08 PM

Hi

I would like to build an algorithm to join records of a text file when two records differ in 2 characters or fewer.

Characteristics of the records:

1) The size is always 14 characters
2) The possible characters are: (1,2,3,4,5,6,7)
3) Usually only the characters 1, 2 and/or 4 appear

Should I read the file in sequential order and, for each record, find all records that differ from it in one or two characters?
Example of records:
[pre]
42444442414411
42444441414211
42444441214411
42444421414411
42424441414411
42244441414411
41444442424411
41444442414421
41444442414412

[/pre]
When a record with a difference of 2 characters or fewer is found, the digits that differ should be added together and one of the two records eliminated.
Example :
42444442414411 and 42444441414211

The two records merge into: 42444443414611

As this record has already been summed, it is discarded in the next comparisons
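A minimal sketch of the comparison and merge described above, in Python. The function names `differing_positions` and `merge_records` are made up for illustration, and the sketch assumes every digit sum stays below 10 so the merged record keeps its 14-character length.

```python
def differing_positions(a, b):
    """Return the 0-based positions where two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def merge_records(a, b):
    """Add the digits at differing positions; keep matching digits unchanged.

    Assumption: each sum is a single digit (true for records built from 1, 2, 4).
    """
    return "".join(x if x == y else str(int(x) + int(y)) for x, y in zip(a, b))


first, second = "42444442414411", "42444441414211"
print(differing_positions(first, second))  # [7, 11], i.e. the 8th and 12th characters
print(merge_records(first, second))        # 42444443414611
```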

What is the best way to do this while reading a text file?

**LaVolpe** · Mar 27th, 2015, 01:23 PM

what happens if the 2 numbers that are different sum to a value > 7 or > 10?

Can these records that will be compared exist anywhere in the file or only next to each other in the file?

**DataMiser** · Mar 27th, 2015, 01:26 PM

So what would need to happen after you merge the two, would it continue to compare the original for all the lines in the file? Would it add to that new number? Can you have 8,9 and 0 digits in the output?

A little more detail please.

**mutley** · Mar 27th, 2015, 01:44 PM

Originally Posted by LaVolpe

what happens if the 2 numbers that are different sum to a value > 7 or > 10?

Can these records that will be compared exist anywhere in the file or only next to each other in the file?

Thank you for your answer

But the sum will never be greater than 7. The records generally contain only the digits 1, 2 and 4.

Suppose I have a text file and read it sequentially. I read the first record and look for the first record whose difference from it is less than or equal to 2. I merge these two records by adding the differing digits and write the result to another file.

42444442414411
42444441414211
42444441214411
42444421414411
42424441414411
42244441414411
41444442424411
41444442414421
41444442414412

42444442414411 First record
42444441414211 second record

I merge them into 42444443414611 and save it in another file. Then I must read the third record, because the first was read and the second was merged. Suppose it was not the second record that had a difference of two or less, but the fifth; then the fifth record should be skipped in later reads.

**LaVolpe** · Mar 27th, 2015, 01:49 PM

What DM and I don't quite understand is that you said the array can contain numbers from 1 to 7, so theoretically, 42444442414411 & 42444442414611 would yield a sum of 10. Also, DM was asking how you know that you are not comparing a record that was already merged. When the record is merged and saved to another file, are unmerged records also saved to that file? Are you trying to merge multiple files too?

**mutley** · Mar 27th, 2015, 01:55 PM

Originally Posted by LaVolpe

What DM and I don't quite understand is that you said the array can contain numbers from 1 to 7, so theoretically, 42444442414411 & 42444442414611 would yield a sum of 10. Also, DM was asking how you know that you are not comparing a record that was already merged. When the record is merged and saved to another file, are unmerged records also saved to that file? Are you trying to merge multiple files too?

Show the two records below
42444442414411
42444441414211

The only differences are at the eighth and twelfth characters, so the sums will be (2+1) and (4+2):
42444443414611

The original file almost always contains only 1, 2 and 4. Any other digit would be an exception, which is easy to handle manually.

**DataMiser** · Mar 27th, 2015, 03:58 PM

I still don't know what you are really trying to do. I am assuming that your file has a lot more than just 2 records in it but all your examples are only dealing with two records, can't tell how to process the file based on that limited info.

**passel** · Mar 27th, 2015, 04:16 PM

My interpretation.
Start with a list of strings containing the digits 1,2 and 4. If there are any other digits in the string, they won't match any other strings within two characters so don't need to be concerned with them.

Read first item from the list
Compare to each of the following items in the list until you find one that differs by only one or two characters.
Add the digits that are different together to create a new number and write that to the output file.
Remove the two items combined from the first list.
Start again with the first item in the list.
If you make it through the list with no close matches to the first number then
start the process over again, but starting with the second item in the first list.
Repeat the above until your starting item from the first list is your last item in the first list. You've removed all closely matching pairs.

Based on that you should have a second file with some number of merged pairs from the first list (the merged pairs removed from the first list), with the first list being all the remaining values where all the numbers differ by more than two characters with any other number in that list.
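The interpretation above could be sketched in Python roughly as follows. This is only one reading of the requirements, not a confirmed solution: `merge_close_pairs`, `count_differences` and `merge_records` are illustrative names, and the sketch works on an in-memory list rather than on files. Since removing items never creates new close matches for earlier items, the scan continues from the current position instead of restarting from the top, which gives the same result.

```python
def count_differences(a, b):
    """Number of positions at which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def merge_records(a, b):
    """Add differing digits together (assumes each sum is a single digit)."""
    return "".join(x if x == y else str(int(x) + int(y)) for x, y in zip(a, b))


def merge_close_pairs(records, max_diff=2):
    """Return (merged, remaining) following the pairing steps described above."""
    remaining = list(records)
    merged = []
    i = 0
    while i < len(remaining):
        current = remaining[i]
        match = None
        # Look for the first later record that differs in 1..max_diff characters.
        for j in range(i + 1, len(remaining)):
            if 1 <= count_differences(current, remaining[j]) <= max_diff:
                match = j
                break
        if match is None:
            i += 1                        # no close match: move to the next start item
        else:
            other = remaining.pop(match)  # remove the later item first
            remaining.pop(i)              # then the start item itself
            merged.append(merge_records(current, other))
    return merged, remaining


sample = [
    "42444442414411", "42444441414211", "42444441214411",
    "42444421414411", "42424441414411", "42244441414411",
    "41444442424411", "41444442414421", "41444442414412",
]
merged, remaining = merge_close_pairs(sample)
print(merged)
print(remaining)
```

With the nine sample records from the first post, this pairs the records two by two in file order and leaves the last one unmatched.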

Why? Who knows. It smacks of a compression scheme.

**mutley** · Mar 28th, 2015, 04:25 AM

Originally Posted by DataMiser

I still don't know what you are really trying to do. I am assuming that your file has a lot more than just 2 records in it but all your examples are only dealing with two records, can't tell how to process the file based on that limited info.

Thank you. It works like bitwise operations:

1 ==> 001
2 ==> 010
4 ==> 100

Then using OR Operators
1 OR 2 ==> 011 equal 3
2 OR 4 ==> 110 equal 6
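As a small illustration of the OR idea in Python (the helper name `merge_with_or` is hypothetical): because 1, 2 and 4 each occupy their own bit, OR-ing two different digits gives the same value as adding them, and OR-ing a digit with itself leaves it unchanged. That means the whole record can be merged position by position without first checking which digits differ. This only matches the "sum" rule while the digits being combined share no bits.

```python
def merge_with_or(a, b):
    """Merge two equal-length records by OR-ing the digits position by position."""
    return "".join(str(int(x) | int(y)) for x, y in zip(a, b))


for x, y in [(1, 2), (2, 4), (1, 4)]:
    value = x | y
    print(f"{x} OR {y} ==> {value:03b} equal {value}")

print(merge_with_or("42444442414411", "42444441414211"))  # 42444443414611
```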

My problem is how to read a text file, find all strings with a difference of 2 or less, and, after choosing one of them, save the merge and continue reading the text file while ignoring the records already joined. I am not using a database, only a data file.

**passel** · Mar 28th, 2015, 10:52 AM

How big is the file?
The best speed will be accomplished if you can read the whole list in memory.
For a similar task, I created a second array that can act like a linked list to the array of strings.
The access may be initially slower because of the indirect indexing to get to the item in the array, but as items are merged and "removed" from the link list, things should speed up as you don't have to read multiple flags to find the next non-merged value (if you used a flag to mark an entry that has been merged), or time spent compacting the list to eliminate merged fields.
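A rough sketch of that second-array idea in Python, under the same pairing rules as my interpretation earlier in the thread. All names here (`merge_with_links`, `nxt`, `head`) are invented for the example: `nxt[i]` holds the index of the next record that has not been merged yet, so "removing" a record is just relinking an index, with no flags to skip and no compaction.

```python
def is_close(a, b, max_diff=2):
    """True when the records differ in 1..max_diff positions."""
    return 1 <= sum(x != y for x, y in zip(a, b)) <= max_diff


def merge_records(a, b):
    """Add differing digits together (assumes each sum is a single digit)."""
    return "".join(x if x == y else str(int(x) + int(y)) for x, y in zip(a, b))


def merge_with_links(records, max_diff=2):
    end = len(records)                # sentinel index meaning "end of list"
    nxt = list(range(1, end + 1))     # nxt[i] = index of the next live record
    head = 0
    merged = []
    prev, cur = None, head
    while cur != end:
        # Walk the live records after cur, remembering the link before each one.
        before, cand = cur, nxt[cur]
        while cand != end and not is_close(records[cur], records[cand], max_diff):
            before, cand = cand, nxt[cand]
        if cand == end:               # no close match: advance the start item
            prev, cur = cur, nxt[cur]
            continue
        merged.append(merge_records(records[cur], records[cand]))
        nxt[before] = nxt[cand]       # unlink the matched record
        after = nxt[cur]
        if prev is None:              # unlink the start record as well
            head = after
        else:
            nxt[prev] = after
        cur = after
    remaining = []
    i = head
    while i != end:                   # collect whatever is still linked
        remaining.append(records[i])
        i = nxt[i]
    return merged, remaining


print(merge_with_links(["1111", "1112", "4444", "1122", "4441"]))
```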

If the file is too large to fit into memory, are the lines really consistently sized so that the beginning of each line is guaranteed to be calculable by simply multiplying a fixed value by index to get to the line?
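If the lines really are fixed-size, jumping to a record is just a seek. Here is a small Python sketch; `read_record` is a made-up helper, and it assumes 14 ASCII characters per record followed by a one-byte line terminator (pass `newline_len=2` if the file uses CR+LF). The demo writes a throwaway temporary file so the example runs on its own.

```python
import os
import tempfile

RECORD_CHARS = 14  # record length stated in the first post


def read_record(f, index, newline_len=1):
    """Read record number `index` (0-based) from a file opened in binary mode."""
    f.seek(index * (RECORD_CHARS + newline_len))  # offset = index * bytes per line
    return f.read(RECORD_CHARS).decode("ascii")


# Demo with a temporary file holding three records terminated by "\n".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"42444442414411\n42444441414211\n42444441214411\n")
    path = tmp.name

with open(path, "rb") as f:
    print(read_record(f, 2))  # third record
    print(read_record(f, 0))  # first record

os.remove(path)
```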

Above all other questions, re-reading this thread multiple times, I'm still not sure that my interpretation in post #8 is correct. The question is what DataMiser asked in post #3,

So what would need to happen after you merge the two, would it continue to compare the original for all the lines in the file?...

and still wasn't really addressed in your response in post #4,

42444442414411 First record
42444441414211 second record

I merge them into 42444443414611 and save it in another file. Then I must read the third record, because the first was read and the second was merged.

You say you must read the third in the case that the 1st was merged with the second and the second removed, but it isn't clear if you're reading the third to compare to the first. Or reading the third as the new "first" to compare to the rest of the file until a close match is found.

ReQuestion:
So, is the desired behavior to compare the first item to all the remaining items in the file, merging it with all close matches and writing those matches in order to a second file, removing the matched entries from the first file, and only after all close matches to the first item have been found, merged and removed, do you move on to the next remaining item in the first file and compare it to all the following remaining items in the first file (merging and removing from the first file, and appending the merge to the second file)?

Apparently the original order isn't important in the second file, as you would be shuffling all close matches to the first item to the front of the second file, and then all close matches to the next (more than two differences to the previous entry in the first file), item and moving close matches to follow as a group in the second file. Seems like an odd thing.

I guess, first answer the "Requestion:" above so we know whether post #8 is a wrong interpretation of the requirements.
