[2.0] Efficient Matching

**pjrage** · Jan 29th, 2008, 02:12 PM

This may be more of a general programming question, but here goes:

I have three arrays (could be more, though, in the future) of points. Each point has an X, Y, Z associated with it.

There are as many as 136400 Points in these arrays but could be more in the future. I have them designated as Lists, actually, Lists of my custom class Point.

What I want to do, is remove any redundant Points in the Lists (Points that have the *same* X, Y, Z coordinates as each other). Within a limit anyway, say if the Xs, Ys, and Zs are all within +/- 0.5 of each other.

To do it, I have it set up with a few nested for loops.

The first outer loop loops List1, and inside of this outer loop, List2 is looped to check for matches with the current point we're looking for in List1 (List1[i]), and then does the same with another inner loop which checks List3 for matches with the current List1[i] Point.. If the points are the same as the current point being examined (List1[i]) then the point is removed from List2 or List3 (wherever it was found).

It then outer loops List2, with an inner loop looking for matches with List3.

This process makes takes FOREVER. I calculated it out to be around 58 billion loops through, and takes sooo long.

Is there a more efficient or better way to look for matches? I can type the code out if it helps anyone, I was just thinking that this has to be sort of common and you probably can guess/understand the way I'm attempting it.

It is currently on it's own thread and everything, it just takes a very long time to do all the loops (~20 minutes with the numbers I listed).

**mendhak** · Jan 29th, 2008, 02:54 PM

Can you nip it in the bud and ensure that when you're populating the arrays, the ordinals of each corresponding array don't contain the same values?

Or even better, why not (unless I misunderstand you) have one list of class objects, where the class simply contains x,y and z as int properties? And again, when populating the list, don't populate it if x=y=z.

**pjrage** · Jan 29th, 2008, 04:20 PM

Yes, I the x,y,z are currently in a class I made, called Point, where the x, y, z variables are present as doubles (along with some other variables containing info about the Point that isn't relevent for this problem). And the List1, List2, List3 lists are just lists, i.e. List<Point> List1. Is that what you meant on your second note?

With regards to your first note, that may be possible. It would still require searching through every already existing List, looking at every Point's x,y,z, and then only adding to the new, currently being populated List when the (current) Point does not match an existing Point. I can try this method, but I'm not sure how much speed I will gain since I think the same amount of searching and comparing will need to be done, just not during post processing. I would prefer post processing, however, so as not to slow down the gathering of the point information.

The Lists are made very independently of each other, but may or may not contain some of the same X,Y,Z Points. I don't want them to contain Points that are close to, or the same as, Points that exist in other Lists already, though. Does that make sense?

**Tom Sawyer** · Jan 29th, 2008, 04:23 PM

Well, they're in lists, so just use the Find method and build a predicate function to check for a matching point in the other lists instead of looping through them - that will save tons of time.

Also, you should do that in the add event, so you don't even have to bother inserting the point into a list if it exists in another one.

**pjrage** · Jan 29th, 2008, 04:28 PM

I guess I'm not too familiar with Lists, and the Find event. Or predicate functions for that matter.

I'll try that, though. What I originally thought is that the Find method needed to find an *exact* match to whatever you search for, but in reality, I want to find any point that is +/- 0.5 from the x && +/- 0.5 from the y && +/- 0.5 from the z. And ignore all of the other Properties that are contained in the Point class. I'll see what Find can do for me though, thanks!

I'll have to see what type of speed this provides too, but it would need to search the other Lists of 100,000+ Points very very quickly for matches if I were to add it to the Add method.

**mendhak** · Jan 29th, 2008, 05:24 PM

List comes with a .RemoveAll() method. You pass a predicate to the RemoveAll() method to get the list item removed. Think of a predicate as a method that is passed an object of the list's type and it returns a true or false. So taking List1,

Code:

List1.RemoveAll(XYZAllTheSame);

XYZAllTheSame would have a definition like:

Code:

private bool XYZAllTheSame(Point p)
{
if(p.x == p.y && p.y == p.z)
{
return true;
}
else
{
return false;
}
}

**pjrage** · Jan 30th, 2008, 07:22 AM

I'm actually excited to have learned that for future use, but I'm not sure I can apply it here?

I need to compare the 'x,y,z's from OTHER list's points, not the same list. OR, I can make the 3+ lists into one list, but even then I need to compare the current point's x,y,z to OTHER point's x,y,z in the list, not to it's own x,y,z.

Know what I mean? Sorry that the way I keep explaining it is a bit confusing.

Here is basically what I'm doing..

Code:

List<Point> List1;
List<Point> List2;
List<Point> List3;

//Do stuff here to populate List1, List2, List3

//Check List1 for any point matches with points from List2 or List3
for (int i = 0; i < List1.Count; i++)
{
     //Check the current point, List1[i] for matches with any point in List2
     for (int j = 0; j < List2.Count; j++)
     {
          if (Math.Abs(List1[i].x - List2[j].x) <= 0.5 && Math.Abs(List1[i].y - List2[j].y) <= 0.5 && Math.Abs(List1[i].z - List2[j].z) <= 0.5)
               List2[j].Remove(List2[j]);
     }

     //Check the current point, List1[i] for matches with any point in List3
     for (int j = 0; j < List3.Count; j++)
     {
          if (Math.Abs(List1[i].x - List3[j].x) <= 0.5 && Math.Abs(List1[i].y - List3[j].y) <= 0.5 && Math.Abs(List1[i].z - List3[j].z) <= 0.5)
               List3[j].Remove(List3[j]);
     }
}

//Now check List2 for matches with List3
for (int i = 0; i < List2.Count; i++)
{
     //Check the current point, List2[i] for matches with any point in List3
     for (int j = 0; j < List3.Count; j++)
     {
          if (Math.Abs(List2[i].x - List3[j].x) <= 0.5 && Math.Abs(List2[i].y - List3[j].y) <= 0.5 && Math.Abs(List2[i].z - List3[j].z) <= 0.5)
               List3[j].Remove(List3[j]);
     }
}

You can see how the first for loop is looping List1, which is 136,400 points, or more. And inside of that, for every one of those 136,400 points it loops List2 which is also 136,400 points, then it loops List3 which is also 136,400 points. Then, it moves on to looping List2, 136,400 points, and for every point again, loops List3 at 136,400.

So, basically, [136400 * (136400 * 2)] + (136400 * 136400) = ~55.8 billion loops

**pjrage** · Jan 30th, 2008, 07:25 AM

I was just thinking of a way that maybe I could use the RemoveAll method. Maybe if I just loop through List1, and do a RemoveAll for List2 then a RemoveAll for List3, for the List1[i], then do the same for List2, and RemoveAll for List2[i] only from List3. I'll see how that improves performance.

**wossname** · Jan 30th, 2008, 07:42 AM

Try List.RemoveAt() instead of List.Remove(), that should save a bit of time as well.

Also you don't want to loop forwards through a list that you are deleting items from, thats just nasty. Loop backwards instead, that way you avoid any potential illegal indexes and also you don't have to keep track of the item count. When your loop counter reduces to 0 then you've tested all the items.

**pjrage** · Jan 30th, 2008, 07:46 AM

Just tried what I mentioned above..

Had to use global variables to pass the current point's x,y,z to the predicate.

It *may* have improved speed a little, but definitely not as drastically as I hoped it might have. It still has to loop through a pretty big List in List1, and perform the relatively time consuming search process of RemoveAll for the pretty big lists, List2, and List3. Then it has to do it again for List2 searching for matches with List3. I guess I was hoping for some magical algorithm or some way of doing this that is much faster, but maybe the fact that it has to compare so many individual points is going to make it take a while no matter how it's done?

Here is how I did it the new way, anyway:

Code:

List<Point> List1;
List<Point> List2;
List<Point> List3;

double x_to_check;
double y_to_check;
double z_to_check;

private bool Check_Matches(Point p)
{
     if (Math.Abs(p.x - x_to_check) <= 0.5 && Math.Abs(p.y - y_tocheck) <= 0.5 && Math.Abs(p.z - z_to_check) <= 0.5)
          return true;
     else
          return false;
}

//Do stuff to populate lists here

//Loop List1 looking for matches with List2 or List3
for (int i = 0; i < List1.Count; i++)
{
     x_to_check = List1[i].x;
     y_to_check = List1[i].y;
     z_to_check = List1[i].z;

     List2.RemoveAll(Check_Matches);
     List3.RemoveAll(Check_Matches);
}

//Loop List2 looking for matches with List3
for (int i = 0; i < List2.Count; i++)
{
     x_to_check = List2[i].x;
     y_to_check = List2[i].y;
     z_to_check = List2[i].z;

     List3.RemoveAll(Check_Matches);
}

**wossname** · Jan 30th, 2008, 07:50 AM

Would you be able to sort all the points into ascending order? If that is feasible for your situation then you can just use a custom binary search function to reduce the number of total comparisons a great deal. (eg 1000000 comparisons would be reduced to about 20 or so).

Binary searches only work on sorted data though so it depends whether or not your data is in a particular order for a reason.

**pjrage** · Jan 30th, 2008, 08:01 AM

Ohhh, interesting idea.

No, the data is currently not sorted for any particular reason, but it could be if it helps!

I understand the principle of what you are saying, but I'm not too familiar with binary searches?

And what would the best way to sort the data be?

EDIT: And for looping backwards, you mean: for (int i = List1.Count - 1; i >= 0; i--) instead?

**mendhak** · Jan 30th, 2008, 08:47 AM

OK, so re-re-interpreting your post, you want to remove duplicates from your list. Is that it?

Here's something I found

Code:

static List<string> removeDuplicates(List<string> inputList)
{
    Dictionary<string, int> uniqueStore = new Dictionary<string, int>();
    List<string> finalList = new List<string>();
    foreach (string currValue in inputList)
    {
        if (!uniqueStore.ContainsKey(currValue))
        {
            uniqueStore.Add(currValue, 0);
            finalList.Add(currValue);
        }
    }
    return finalList;
} //kirupa.com

The dictionary allows for a fast search for matching an existing T. You need to modify that code to work with your Point class rather than strings.

**pjrage** · Jan 30th, 2008, 08:59 AM

Originally Posted by mendhak

OK, so re-re-interpreting your post, you want to remove duplicates from your list. Is that it?

Here's something I found

Code:

static List<string> removeDuplicates(List<string> inputList)
{
    Dictionary<string, int> uniqueStore = new Dictionary<string, int>();
    List<string> finalList = new List<string>();
    foreach (string currValue in inputList)
    {
        if (!uniqueStore.ContainsKey(currValue))
        {
            uniqueStore.Add(currValue, 0);
            finalList.Add(currValue);
        }
    }
    return finalList;
} //kirupa.com

The dictionary allows for a fast search for matching an existing T. You need to modify that code to work with your Point class rather than strings.

Can the Dictionary be used for non-exact matches? I think this problem could be simplified if I was looking to match *exact* points, but I'm calling "matches" any point that has an x,y,z that are all within 0.5 of the point to be matched.

I'm currently reading up on binary search trees, and the topic is extremely interesting to me. I think they may turn out to be the sort of "magic alogorithm" I was hoping for. It looks like if I arrange the point lists into an acceptable BST, I can perform my searches at a log base 2 of n rate. In other words, for my 136,400 point lists, it will be as fast as searching a 17 point list, or so (almost 8000 times faster

). I guess that is the theory anyway. I'm finishing reading up on them and then I'll try some implementation, if possible.

Have you guys heard of BSTs before? That is to ask, are they pretty common in the programming world. It seems that they really would be, especially for the sort of thing I'm doing now, or maybe for video/image processing or something? I'm surprised I've never really heard of it before

**wey97** · Jan 30th, 2008, 12:47 PM

Originally Posted by pjrage

Can the Dictionary be used for non-exact matches? I think this problem could be simplified if I was looking to match *exact* points, but I'm calling "matches" any point that has an x,y,z that are all within 0.5 of the point to be matched.

I'm currently reading up on binary search trees, and the topic is extremely interesting to me. I think they may turn out to be the sort of "magic alogorithm" I was hoping for. It looks like if I arrange the point lists into an acceptable BST, I can perform my searches at a log base 2 of n rate. In other words, for my 136,400 point lists, it will be as fast as searching a 17 point list, or so (almost 8000 times faster

). I guess that is the theory anyway. I'm finishing reading up on them and then I'll try some implementation, if possible.

Have you guys heard of BSTs before? That is to ask, are they pretty common in the programming world. It seems that they really would be, especially for the sort of thing I'm doing now, or maybe for video/image processing or something? I'm surprised I've never really heard of it before

Binary Search Trees are one of the fundamental building blocks of programming...in any language.

**pjrage** · Jan 30th, 2008, 12:56 PM

Originally Posted by wey97

Binary Search Trees are one of the fundamental building blocks of programming...in any language.

That's what I'm finding out! And why I'm so surprised I've never really heard of them before!

**wey97** · Jan 30th, 2008, 01:31 PM

You can use a SortedDictionary<TKey, TValue> or SortedList<TKey, TValue> and you won't be able to insert duplicates and the values will already be sorted regardless of what order you insert them. There will be a little overhead to using either of these, but you'll gain that back by not having to search every time before an insertion.

You'll need to inherit from the IComparable<T> interface which requires you to implement a CompareTo function inside your Point class so the dictionary or list can compare two points and decide if they're the same point or not.

If you attempt to insert a duplicate, an exception will be thrown and need to be caught.

For example:

csharp Code:

class Point : IComparable<Point>
{
    public int x;
    public int y;
    public int z;
 
    public Point(int x, int y, int z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }
 
    #region IComparable<Point> Members
 
    public int CompareTo(Point other)
    {
        if (this.x == other.x && this.y == other.y && this.z == other.z)
            return 0;
 
        return -1;
    }
 
    #endregion
}

and to use the dictionary or list:

csharp Code:

SortedDictionary<Point, int> d = new SortedDictionary<Point, int>();
 
Point p = new Point(0, 1, 2);
d.Add(p, 0);
 
try
{
    p = new Point(0, 1, 2);
    d.Add(p, 0);
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
}

Whether you use SortedDictionary<TKey, TValue> or SortedList<TKey, TValue> depends on several things such as initial loading, inserting, removing, and key and value retrieval.

You can view the details of this in http://msdn2.microsoft.com/en-us/lib...19(VS.80).aspx in the Remarks section.

**wossname** · Jan 30th, 2008, 02:02 PM

Originally Posted by pjrage

I understand the principle of what you are saying, but I'm not too familiar with binary searches?

Yes you are, you've looked up a friend's phone number in the yellow pages right? First you opened the book in the middle, then chose which half of the book to search next and so on. You didn't read every name on every single page until you found what you were looking for. At least I hope you didn't.

Originally Posted by pjrage

EDIT: And for looping backwards, you mean: for (int i = List1.Count - 1; i >= 0; i--) instead?

Yes.

Originally Posted by pjrage

And what would the best way to sort the data be?

Well that is rather more complicated. Especially for your multiple array strategy. For a binary search it would be most convenient f you could just dump al your points into a single array and then sort them, but I suspect that is not possible otherwise you would have done that already, please tell me if this is feasible or not. Sorting them is actually very easy if you have all the data in one array.

**wossname** · Jan 30th, 2008, 02:07 PM

General remark:
BST's and dictionaries are complete overkill for this, they may be easy to use but that is not the aim of the excercise judging by the OP. Speed is the objective so we need to get low-level. BST's and dictionaries just soak up more runtime and RAM where this could be avoided by using better logic.

**wey97** · Jan 30th, 2008, 02:33 PM

Originally Posted by wossname

General remark:
BST's and dictionaries are complete overkill for this, they may be easy to use but that is not the aim of the excercise judging by the OP. Speed is the objective so we need to get low-level. BST's and dictionaries just soak up more runtime and RAM where this could be avoided by using better logic.

Why reinvent the wheel?

A BST is built for speed. Are you saying that custom procedures and data structures designed to prune these Lists of "duplicate" Points will be faster than not inserting duplicates to begin with?

**pjrage** · Jan 30th, 2008, 03:10 PM

Thanks for the great advice guys. I'm doing a ton of research into the different data structures out there and I'm going to implement a few just to learn. I do like the idea of not adding duplicates in the first place as mentioned by wey97. Or something similar anyway. If the search to find if the point already exists isn't going to be basically instant, then I'll probably need to do it after the data is gathered.

There is no reason why every point can't be contained in one array (or list). In fact, they start off this way, in a MasterList, but I extract the three specific sections into 3 different lists, List1, List2, List3 to do the search the way I originally was. Because the same point can't exist in more than one list, BUT, the same point CAN exist within it's own list. That is, you can have an entire List full of the exact same point, but if any of the other lists have that point, it needs to be removed from the other list. That is why I needed to compare to the original lists separately.

I'm still not sure the best way to sort, though? Sort by x? Sort by y? Sort by z?

If you sort by x, that will speed it up for sure, but not as much as it could be, right? It will speed it up for any x's that don't match, but once an x matches, it will need to see if the y and z match too, which will be slow again while it iterates through all the possible y and z matches for the particular matching x, right? I think I'll have to do something where it sorts by x originally, if it finds x, it resorts by y and searches y's, then if it finds y it resorts again and searches by z. Does that make sense? Or am I missing the whole point?

This is soo interesting to me, I'm loving it

**wossname** · Jan 30th, 2008, 03:31 PM

Originally Posted by wey97

Why reinvent the wheel?

A BST is built for speed. Are you saying that custom procedures and data structures designed to prune these Lists of "duplicate" Points will be faster than not inserting duplicates to begin with?

Firstly, do not assume that the "wheel" is the most efficient form of transport. The wheel (by "wheel" I mean the libraries that vendors distribute to their customers without prior knowledge of the kind of programs their customers wish to write) in many cases is just enough to get the job done. Not necessarily to always do it in the most efficient manner possible. It has to be generalised by nature. Fair enough. A competent programmer can often out-perform standard libraries when performance at a specific (possibly esoteric) task is critical.

Secondly, BST's are indeed designed for speed and efficiency, but generally used for manipulating more complex types of data. What we are dealing with here is very simple data. On the big scheme of things the quantity of data mentioned here is actually very small. The nature of the data can be easily shoved around using very basic techniques.

Given the task specified ("make this faster"), I'd happily say that a few hours/days spent optimising a custom algorithm is well worth the effort if the long-term effect is a significant saving in time.

**wossname** · Jan 30th, 2008, 03:33 PM

And another thing, the difference between "removing duplicates" and "not inserting them in the first place" is merely a matter of where the data is at any given time. You still have to use logic to decide whether you already have the data or not.

I could go on at great length about "Information Theory" (which I highly recommend all those reading this post to look into) but I'd only be retreading old ground.

**wey97** · Jan 30th, 2008, 03:58 PM

Originally Posted by wossname

...
Given the task specified ("make this faster"), I'd happily say that a few hours/days spent optimizing a custom algorithm is well worth the effort if the long-term effect is a significant saving in time.

I agree. However, if the requirement is "make this faster" and what is supplied "makes this better" which in turn makes it faster, then the original requirement is met, is easier to understand, and incorporates code reuse/inheritance, blah, blah.

We can go on all day about what will take more time, i.e. using a somewhat complete and generic framework versus writing the framework yourself, in a language this is designed for code reuse, no less. What about maintenance? Will the programmer that inherits this code from you know what your intent was, know how to port the code or maintain it, and will he trust that your algorithm is 0.00001 milliseconds faster than the supplied template? That's your call but you're defeating the purpose of the OOP paradigm and should be programming in C or ASM.

**wossname** · Jan 31st, 2008, 01:38 PM

Originally Posted by wey97

I agree. However, if the requirement is "make this faster" and what is supplied "makes this better" which in turn makes it faster, then the original requirement is met, is easier to understand, and incorporates code reuse/inheritance, blah, blah.

We can go on all day about what will take more time, i.e. using a somewhat complete and generic framework versus writing the framework yourself, in a language this is designed for code reuse, no less. What about maintenance? Will the programmer that inherits this code from you know what your intent was, know how to port the code or maintain it, and will he trust that your algorithm is 0.00001 milliseconds faster than the supplied template? That's your call but you're defeating the purpose of the OOP paradigm and should be programming in C or ASM.

Nothing I said is in any way against OOP. I never suggested rewriting the entire framework, where you got that preposterous idea from I don't know. As for maintenance, you use things called "comments and documentation". There is absolutely no benefit in writing code that makes a time difference of 0.00001 milliseconds and I made it clear that the solution I proposed would be a great deal faster than relying on the generalised classes in this situation thus making the task a worthwhile one.

And I find your fatuous comment about C and ASM rather insulting. OOP doesn't mean you are not allowed to think for yourself, there is still a great deal of freedom to implement your own ideas and algorithms without compromising anything that falls under the techniques of OOP.

**Tom Sawyer** · Jan 31st, 2008, 01:52 PM

Originally Posted by wossname

And another thing, the difference between "removing duplicates" and "not inserting them in the first place" is merely a matter of where the data is at any given time. You still have to use logic to decide whether you already have the data or not.

But in both cases, you're searching through a list to see if a value is in there. Regardless of the method of searching used, searching a smaller list is always going to be faster than searching a larger list when the same search is used on both lists. If you perform this search before entering the data into the list, then the searches are necessarily always going to be on smaller groups of data.

Additionally, the overhead required to add and remove items from the list is taken out by doing the search ahead of time. There is no situation where removing duplicates would be equally as efficient as not inserting the duplicates in the first place.

**wey97** · Feb 1st, 2008, 09:57 AM

Originally Posted by wossname

Nothing I said is in any way against OOP. I never suggested rewriting the entire framework, where you got that preposterous idea from I don't know. As for maintenance, you use things called "comments and documentation". There is absolutely no benefit in writing code that makes a time difference of 0.00001 milliseconds and I made it clear that the solution I proposed would be a great deal faster than relying on the generalised classes in this situation thus making the task a worthwhile one.

And I find your fatuous comment about C and ASM rather insulting. OOP doesn't mean you are not allowed to think for yourself, there is still a great deal of freedom to implement your own ideas and algorithms without compromising anything that falls under the techniques of OOP.

Could you relax?

I never said rewrite the ENTIRE framework, I was making the point that why use a platform that already has the functionality you need if you're going to go against what the platform was designed for anyway?

And I agree with the comment above. How can you expect an algorithm that has to search and remove duplicates from a larger list to be faster than searching on a smaller list and not inserting duplicates to begin with?

**wossname** · Feb 1st, 2008, 01:50 PM

I never relax unless I run out of coffee.

The larger list already exists - we can take that for granted.

From what I gathered earlier about what technically constitutes a duplicate (please correct me if I have misinterpreted...), we have to remove any item where all 3 axes (x,y,z) are simultaneously within a given tolerance of at least 1 other item. This sounds like it could originate from a noisy data source.

So if we have this list on our local machine then we need to sort it into ascending order. Something like quicksort (which is what List<>.Sort uses) would make very short work of that.

At this point we know that any given item will be most similar to the ones immediately adjacent to it, not for example one at the other end of the array. Therefore each potential duplicate will need at most 1 comparison to remove it.

So with a single pass through the List<> we can identify a duplicate by comparing [i] with [i+1]. If they are within the tolerance then set [i]=null and continue with the next pair of items.

Then a second pass (left to right) can move all the remaining items into the nulled positions and null the position it occupied a moment ago. This will preserve the ascending order. After this pass the List<> will have all the nulls at the end, leaving us to do a RemoveRange() to dispose of all the nulls, now we have a clean List<> with no superfluous data.

All the while we have never increased the amount of storage space required for the original data set (apart from the tiny overhead for quicksort and the two or three integers needed for the housekeeping.

**wey97** · Feb 1st, 2008, 01:55 PM

Originally Posted by wossname

I never relax unless I run out of coffee.

The larger list already exists - we can take that for granted.

From what I gathered earlier about what technically constitutes a duplicate (please correct me if I have misinterpreted...), we have to remove any item where all 3 axes (x,y,z) are simultaneously within a given tolerance of at least 1 other item. This sounds like it could originate from a noisy data source.

So if we have this list on our local machine then we need to sort it into ascending order. Something like quicksort (which is what List<>.Sort uses) would make very short work of that.

At this point we know that any given item will be most similar to the ones immediately adjacent to it, not for example one at the other end of the array. Therefore each potential duplicate will need at most 1 comparison to remove it.

So with a single pass through the List<> we can identify a duplicate by comparing [i] with [i+1]. If they are within the tolerance then set [i]=null and continue with the next pair of items.

Then a second pass (left to right) can move all the remaining items into the nulled positions and null the position it occupied a moment ago. This will preserve the ascending order. After this pass the List<> will have all the nulls at the end, leaving us to do a RemoveRange() to dispose of all the nulls, now we have a clean List<> with no superfluous data.

All the while we have never increased the amount of storage space required for the original data set (apart from the tiny overhead for quicksort and the two or three integers needed for the housekeeping.

How would you sort a list of Points?

**wossname** · Feb 1st, 2008, 02:05 PM

Prioritise X, then Y, then Z last.

You can do this easily if you use a custom class that implements IComparer<Point>.

So when the time comes to test X if it's outside tolerance then don't bother testing Y and Z, just move on.

List<>.Sort() lets you specify such a class, in order for the underlying quicksort to utilise that class's Compare() method to do this custom sort.

I tried a quick test of this at lunchtime today, it took about 1 second to deduplicate an array of 1 million xyz point items. (on a 2.8Ghz HT with 2GB ram). The trick is not to actually remove things from the array during the main loop (setting an item to null doesn't affect the size of the list and therefore is fast), wait until you've clustered all your nulls at one end and use a single call to RemoveRange().

**wey97** · Feb 1st, 2008, 02:29 PM

Originally Posted by wossname

Prioritise X, then Y, then Z last.

You can do this easily if you use a custom class that implements IComparer<Point>.

So when the time comes to test X if it's outside tolerance then don't bother testing Y and Z, just move on.

List<>.Sort() lets you specify such a class, in order for the underlying quicksort to utilise that class's Compare() method to do this custom sort.

You can make the comparer integral to the Point class. I think this should work:

csharp Code:

class Point : IComparable<Point>
{
    public double x;
    public double y;
    public double z;
 
    public int CompareTo(Point other)
    {
        if (Math.Abs(this.x - other.x) <= 0.5)
        {
            if (Math.Abs(this.y - other.y) <= 0.5)
            {
                if (Math.Abs(this.z - other.z) <= 0.5)
                {
                    return 0;  // the points are "equal"
                }
            }
        }
 
        if (this.x < other.x)
            return -1;
    
        return 1;
 
    }
 
}

**wossname** · Feb 1st, 2008, 03:06 PM

List.Sort() requires either a IComparer or a Comparison delegate, but not an IComparable.

You need to look at your logic but yes, thats the way to do it.

**pjrage** · Feb 4th, 2008, 07:24 AM

Thanks guys for going back and forth with some great ideas on how to handle this situation.

I've been playing with some different things myself, some of which I found here: http://msdn2.microsoft.com/en-us/lib...04(VS.71).aspx. There are 6 parts, and it was a great read IMO. I definitely learned alot, because alot of it was new to me.

I implemented their SkipList class just to try it out, with a custom CompareTo in my Point class that looks alot like what wey97 just posted. This worked really well. What used to take ~20min now takes ~5 seconds, so yeah, there was some speed improvement.

I don't think I'll get much time to work on it any further this week, but when I get to it again, I might try some of the other suggestions to see how they pan out. But since this is just a prototypey thing, there is no real drive for me, other than to learn, to change it from what it is now, since it's working pretty good.

But feel free to continue with the suggestions. I'm learning alot just listening to the different ways to do this and how some of the built in classes work.

Thread: [2.0] Efficient Matching

Thread Tools

Display

[2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Re: [2.0] Efficient Matching

Posting Permissions