Results 1 to 4 of 4

Thread: Data Comparison/Scrubbing software or tool suggestions?

  1. #1

    Thread Starter
    Fanatic Member
    Join Date
    Aug 2009
    Posts
    540

    Data Comparison/Scrubbing software or tool suggestions?

    I'm currently running SQL Server '05, an upgrade to '12 is in the works in the near future.

    My current problem is this. I maintain a table which contains a list of 20K records of schools we currently do business with. We recently purchased a "database" (really just a gigantic excel spreadsheet) of 150+K schools. I'm looking for software, tools, or methods to help me merge our current information with the purchased information. Obviously we want to remove any duplicate entries (those school which currently exist in our live database).

    However my issue is I can not for the life of me come up with any thing that produces a solid match between the two sets of data. If the names of the school were in a standardized format I would be have some solid ground to work with. However, you run into issues such as our DB has FooBar High School and the purchased DB has Foo Bar HS. Just a quick sort and visual comparison of the two datasets have shown that this is a common issue for us right now. I'd rather not have to comb through all this data by hand, but I'm coming up short on ideas to avoid that nightmare.

    Hopefully someone out here may point me in the right direction?
    Where I'm from we only have one bit of advice for new comers: "If you hear banjos, turn and run".


    VS 2008 .NetFW 2.0

  2. #2
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,902

    Re: Data Comparison/Scrubbing software or tool suggestions?

    If you're in the UK and the data is address based then AFD have a decent solution called Refiner which is fairly cheap. (I was going to provide a link there but the forums playing up again and the option's missing). Other than that I also used Paribus about 20 years ago and their solution was pretty reasonable at the time but I have no idea where they stand now. Be aware, though, that event with a decent tool you're either going to have to accept an imperfect result or be prepared to invest alot of manual time eyeballing similar records to decide if they're the same.

    Another option is to get a third party agency to cleanse your data for you, there are plenty out there. Be aware, though, that they're still going to have to make a bunch of judgement calls so their result won't be perfect either and they tend to be expensive. And most of them will try and take over your marketing function because that's where they really make their money.

    There's no easy solution to data cleansing, I'm afraid, and no matter what you do you're going to spend significant time and money if you want a decent result.
    The best argument against democracy is a five minute conversation with the average voter - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

  3. #3

    Thread Starter
    Fanatic Member
    Join Date
    Aug 2009
    Posts
    540

    Re: Data Comparison/Scrubbing software or tool suggestions?

    Yeah, I kind of figured we were pretty much hosed when it comes to this. However, you never really know and someone may have stumbled upon an excellent solution to this problem. Unfortunately we are a US based company so your first suggestion will not work for us. I will examine Paribus and see if it can help. There are a few other tools out there that I'm looking at as well, but I would be massaging the tools in a way to work for what we need. We accept the fact that there will be inaccuracies in what we do unless we go through everything one by one (and even then, I think that method would most likely produce an even higher error rate). So we are just searching for a solution to reduce the error rate as much as possible.

    Thank you for your help, judging from the number of views to number of replies ratio, I see I'm not the only one who is looking for something like this
    Where I'm from we only have one bit of advice for new comers: "If you hear banjos, turn and run".


    VS 2008 .NetFW 2.0

  4. #4
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,902

    Re: Data Comparison/Scrubbing software or tool suggestions?

    I see I'm not the only one who is looking for something like this
    Most definitely not. It's a huge problem in the CRM and direct marketting industries (which seems to be where I spend most of my time) and there's ALOT of money to be made by the first person who can come up with a really good solution. The tools that are out there help but none of them's a silver bullet.

    One thought, it's a lot easier to match by telephone number and/or email than it is by name and/or address because those things have to be exact or they don't function. If your data contains either of those it'll be a great help. There are two areas of concern you need to cater for though:-
    1. It's likely that you won't have collected that information for every record in your DB and/or the import file
    2. Each school is likely to have more than one TN or Email - that's OK, you just look for a single instance of commonality - but it may also be possible that two or more schools may share the same TN or email (if they're pooled their HR functions, for example) - that's alot less likely but watch for it just in case.

    If you need to match by address then you need to get the import address esand the addresses in your DB into a common format. Over here the post office maintain a PAF file whilch contains every address in the country and they're held in a standard format. The AFD software I mentioned does some fuzzy matching against the PAF file and reformats every adress it finds to the same PAF format. Any address it doesn't find are immediately suspect and need to be manually reviewed. Once you've formatted both your current data and your import apropriately then matching becomes reasonably easy. Is there a US equivalent of a PAF file? and if so, can you find some software that will do a similar job to what AFD does?

    Other than that you need to go looking for a real world unique identifier. For example, do all US educational establishments get issued with a unique code - sort of an academic National Insurance Number? If it does, and if you make a point of storing it in your DB, and if you demand it in your imports, then matching becomes a doddle.
    The best argument against democracy is a five minute conversation with the average voter - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width