Results 1 to 11 of 11

Thread: [RESOLVED] Relationship key with a wildcard?

  1. #1

    Thread Starter
    Frenzied Member cory_jackson's Avatar
    Join Date
    Dec 2011
    Location
    Fallbrook, California
    Posts
    1,104

    Resolved [RESOLVED] Relationship key with a wildcard?

    I have a column of business names. In some cases the name is an alias and I want to relate it to the parent company so a query will show the real company. In both cases the data is mostly human entered and isn't consistent. A simple example is one might be entered variously as "Mike's Plumbing", "Mike's Plumbing LLC", "Mike's Plumbing, LLC". I plan on having a hash table to resolve the variations, but I know in the future new ones will crop up. If I was executing a select query I'd use a wildcard like "Mike's Plumbing%". Is there any way to do that or something similar with a relationship key?

  2. #2
    Smooth Moperator techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,532

    Re: Relationship key with a wildcard?

    I don't think so.
    Something you could do - I've seen this done in a previous life for resolving such things - is this:
    1) strip off any trailing suffixes (LLC, Corp, Inc, etc) ... so all three of those become "Mike's Plumbing"
    2) Run the name through a soundex function... that should return a number, let's for example say it's 648353
    3) combine that with the first two letters: MI ... MI648353 ... you can now store that in the database as part of the record.

    So now, let's say someone is feeling cheeky and enters "Myke's Plumbing" with a y instead of an i ... because it has the same sound as "Mike's Plumbing" it would get the same soundex value back: 648353 ... you can then look for other entries with 648353 and compare MY with MI ... and see, it's different... then prompt the user "Hey, you entered 'Myke's Plumbing' but we found 'Mike's Plumbing' ... is this the correct record?" ... and then you can decide what to do from there.

    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  3. #3

    Thread Starter
    Frenzied Member cory_jackson's Avatar
    Join Date
    Dec 2011
    Location
    Fallbrook, California
    Posts
    1,104

    Re: Relationship key with a wildcard?

    Thanks TG. That's what I need to know. In the meantime I thought of an idea where I'll make a program that will use RegEx patterns to look for all matches per company then indicate which are not in the hash table and give the user the option to add them. This is an annual occurrence so we can just add this to the SOPs.

  4. #4
    Smooth Moperator techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,532

    Re: [RESOLVED] Relationship key with a wildcard?

    That's why we used soundex, because regex wont pick up that Mykes and Mikes sound the same and are likely to be the same, while a soundex value of each would. We only then included the two letters for indexing purposes as we had a clustered index on it, so that AC234% comes before TG8475, allowing for quick look ups.

    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  5. #5
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,900

    Re: [RESOLVED] Relationship key with a wildcard?

    I've got some relevant knowledge on this. I wrote the software that supports the "Troubled Families" program for several local authorities in the UK. This consists (among other things) of taking data source from lots of different datasource and trying to identify the individuals across them. It sounds like the problem you're facing has a lot of cross-over with that so here's a mini case study that you can probably take alot from:-

    There are two problems: The data structure and the matching.

    I would suggest the data structure consists of two tables: individuals and aliases with a one to many relationship between them - one individual many aliases. You may or may not want to specify a single "canon" name in which case put it on the individual but I always preferred to leave it off i.e. an individual is a combination of all their aliases.

    I'm not clear whether you're pulling from multiple source or just different spellings within a single source but, if you have multiples, I liked to have a separate alias table for each source (or a "SourceID" field).

    Also consider having a different alias table for first and last names as these are often combined in multiple permutations. Similarly, you might have lots of other entities you could match on: Date of Birth, postcode, identifiers (National Insurance Number etc.), if you're going to match on these things then give each an alias table.



    On the matching side, I personally wouldn't stop at names if you can help it. Information like age (or preferably date of birth), address, telephone numbers, email etc. are all really valuable. In particular, go looking for unique identifiers like membership numbers, National Insurance numbers etc. as these are likely to be your most reliable matching data.

    Don't rely on solely Soundex. There are lots of fuzzy matching algorithms out there, each with different strengths and weaknesses. You best approach s to use several of them and combine the results to calculate a possible match. A bunch of these are built in to sql server's data analysis package (whose name now escapes me) so leverage that if you can. Failing that the fuzzy match and fuzzy group operations in SSIS support a few of them and it's pretty easy to write your own once you understand the algorithm.

    Understand the difference between approximation and significance. Smith and Smyth are highly approximate but are actually far less likely to indicate an individual than a looser match on O'Hanranhan-Billingsworth-Shore-Winding-Knightly. The latter surname is extremely rare so a non-approximate match on it is far more significant when identifying an individual than an approximate match on the far more common Smith. In essence, significance = approximation / occurrence and is the measure you're really after.

    Understand that a single individual can have radically different aliases. They might call himself John amongst one group of peers and Billy Bob amongst a different group and they WILL use those interchangeably when interacting with your system. This isn't just true forenames either, it's true of surnames too. Particularly amongst divorced or recently married women and children of separated parents.

    Given the above, for any incoming datum you can do a comparison to every alias permutation in your existing dataset and assign a match probability score to it. The chance of the datum matching in individual is then the average of the scores for each of it's alias permutations. BUT the number of calculation this would require is insane and almost certainly not practical. Instead, what I used a slightly different algorithm (from memory but I think this is right):-
    1. Get the unique list of values for each class of alias (first name, surname, email, telephone etc)
    2. Calculate a match probability of each class in the incoming datum against every value in it's corresponding unique list
    3. Discard any scores below a threshold as non matches - this just reduces the size of the data set and reduces the number of calculation the engine's going to have to do.
    4. For each class, for each individual calculate an average score from the aliases that have an undiscarded score - call this your individual_class score.
    5. Combine the individual_class scores into an individual score using a weighted average. A match on an email address, for example, is more significant than a match on a name so give that a higher weighting.
    6. So for your incoming datum you now have a probability of a match against every single individual in your system. You can consider it an automatic match against everyone which it has a very high score for, a non-match against everyone it has a low match for and you pass anything in the middle for manual review (In practice I found this review never happened and we relied on linking and unlinking to clean up false positives and negatives - see below)

    This give's you a fair stab at automatching an incoming datum to the data in your system but, no matter how you tweak it you WILL get false positives and negatives. For this reason you need the ability to manually clean things up. There are two basic approaches and you can use both.

    The first is to link and unlink individuals. combine two separate individuals into one (to fix a false negative) and to split an individual into two (to fix false positives). When combining there's a temptation to simply copy all the aliases from one individual to the other (or repoint the foreign key) and then delete the other individual. I would advise against this as it's difficult to undo if you get it wrong. I preferred to add a link between the two individuals but have each retain their own aliases. This does affect step 4 above as an individual is now a combination of the aliases for all linked individuals and this is recursive (A can link to B which can link to C) but I found the number of occurrences of this was sufficiently small that a recursive view would perform well enough. If you link two records in error you can simply delete the link back off again.

    The best way of handling false positives to add alias exclusions. It's tempting to think that simply removing an alias from an individual will prevent that false positive from matching again - it won't. Whatever put that alias on there in the first place is highly likely to put it on there again. For this reason you should add an exclusion. This can be implemented in a separate table and can be check every time you add a new alias to an individual. If there's an exclusion for that alias, don't add it.




    Sorry if this was hugely long and overly complicated. The system I was working on was highly complex and BIG so a lot of it is probably overkill for most applications but hopefully there's some god stuff in here you can cherry pick from.
    The best argument against democracy is a five minute conversation with the average voter - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

  6. #6
    PowerPoster
    Join Date
    Sep 2005
    Location
    Modesto, Ca.
    Posts
    5,196

    Re: [RESOLVED] Relationship key with a wildcard?

    I didn't read all of FD post this might be a repeat.

    I don't have a solution for your current problem but you can avoid this situation in the future by simply creating a Business Names table, to use as a Look Up table. Which means the first step is to add the business to the table.
    Then anytime you want data on a certain business you just select the name from a list. The only "human entry" is when the business is first added to the system. Any field that is going to be used in lookups, primary keys or relationship keys, I always set them up this way if possible. Anyway, food for thought.

  7. #7

    Thread Starter
    Frenzied Member cory_jackson's Avatar
    Join Date
    Dec 2011
    Location
    Fallbrook, California
    Posts
    1,104

    Re: [RESOLVED] Relationship key with a wildcard?

    Wes4dbt: Unfortunately these business names, also mixed with individual names, comes from varied sources that I have no control over. And you're completely correct, that's what they should do. I was thinking each source might have one spelling, but then I found all kinds that were varied. Obviously a human typing it in someplace, which is sad.

  8. #8

    Thread Starter
    Frenzied Member cory_jackson's Avatar
    Join Date
    Dec 2011
    Location
    Fallbrook, California
    Posts
    1,104

    Re: [RESOLVED] Relationship key with a wildcard?

    Techgnome: I just recently learned about the existence of Soundex in SQL Server. I thought about it a bit, but unfortunately it will not work here. For instance "Mike's Plumbing", "Mike's Plumbing LLC", "Mikes Plumbing, LLC", and "Mike's Plubming". But I'm excited about Soundex and am looking for an application. However I think I would still want to use it to find possible matches that then require Mk1 eyeball to review.

  9. #9

    Thread Starter
    Frenzied Member cory_jackson's Avatar
    Join Date
    Dec 2011
    Location
    Fallbrook, California
    Posts
    1,104

    Re: [RESOLVED] Relationship key with a wildcard?

    FunkyDexter:
    Don't apologize for being long winded, it was good info. I've already thought of a few other applications of a few things you said.
    Unfortunately I only have names and a dollar amount. But you're correct to spread a wide net over any data available. And what I have is a simple table for entities that two columns, an ID and a Name only now but room to expand later, and a Alias table. Which has all the potential aliases as varchar as a PK, and the entity ID right of it. Since the data doesn't change often, the PK has proven quick in testing.
    I have another problem child that has multiple lines of address and I must identify what type of data is in each. With 3 it's simple, but with 4 the second line could be a second line of name, attention, or Care/Of or it could be the street line and the third line contains a suite number or such. My solution was to create a table for every line of raw data. Now the first and last lines I'm always sure of but I could have as many as 3 additional lines. The table persists after the runs and has a Boolean column for changes. If my program sees a change has happened, it puts in the new value and flags it as changed. There were too many for human review but there were some tricks that could reduce the quantity. For instance if it starts with P.O. Box I know it's a street line or if it starts with "Attn" I know it's a name line. This program automatically set another Boolean column that indicates if it's a name or street type and changes the 'changed' flag to false. This left a reasonable number to do a human review on the first time so I created another program for the Mark 1 eyeball and a user where they could sort and filter easily and identify as name or street in chunks. It was slightly laborious the first time, but now each time it runs there are relatively few that change that the auto-tagging program can't determine.
    So I'm thinking I'll employ something similar here. These are company names so I'm thinking I'll also have a table of regular expression patterns like "\AMike(')?s\sPlumbing". I can have a program that displays the candidates to a user and they can decide. "Mike's Plumbing, LLC" yes, "Mike's Plumbing, heating, and Air Conditioning" no. But then also remember the alias and the decision. And if he has multiple DBAs, an entry for "\AAces\sPlumbing".

  10. #10
    Smooth Moperator techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,532

    Re: [RESOLVED] Relationship key with a wildcard?

    Oh definately, it shouldn't be used as a replacement for anything.... just as an aide for gathering possible matches, but, yes you'll still want human eyebolls on it to say for sure, yes that's a match, or nay it is not, and should be a new record.

    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  11. #11

    Thread Starter
    Frenzied Member cory_jackson's Avatar
    Join Date
    Dec 2011
    Location
    Fallbrook, California
    Posts
    1,104

    Re: [RESOLVED] Relationship key with a wildcard?

    TG: I can't wait to try it.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width