Best way to match names

system · March 11, 2015, 7:29pm

Hi,

I have to match incorrect or misspelled names of two tables, what could be the best way to do it?
Match transform? Python any other fuzzy logic?

For example
Table1
Fname—Lname—Middlename
Smith----Jones----L
Scott----Jonathan—Junior
Gray----Brown----Jones

Table2
Fname—Lname—Middlename
Smith-----L--------Jones
Scott-----Jr.--------null
Gray-----Brown-Jones—null

Any suggestion appreciated.

Arun

Arun.K (BOB member since 2011-10-18)

system · March 11, 2015, 8:10pm

Run it through a data cleanse first.

Then run it through a match.

Depending on source and what your expectations are, you may be in for some work … if you have to start monkeying with the cleansing package. (the out of box cp has some issues)

jlynn73 (BOB member since 2009-10-27)

system · March 11, 2015, 8:34pm

I don’t have a cleansing package so what could be options? Base match transform?

Arun.K (BOB member since 2011-10-18)

system · March 12, 2015, 1:00pm

Do you have any other data you can use to match on ? (address/phone/email/shoe size)

I dont know what you’re doing, but for me to just match on a name doesnt usually buy me much. It ends up being just a fancy way to sort your data which you’re still going to have to look at.

So from a high level, what is it you’re trying to accomplish ?

jlynn73 (BOB member since 2009-10-27)

system · March 12, 2015, 1:19pm

Ok i will explain in detail -

I have two tables Table1 contains those you are travelling and Table2 contains all the people in a certain department, which include those travelling, not-travelling, dependents etc. Both the tables are in separate database managed by two different depts, hence they have different person_ids. When i say managed by two depts, the details are keyed in by two different set of users, so the names and details entered not always match in both the tables. Now i have a requirement to join both these tables to eliminate people based on a certain code. Since there are no unique keys (person_ids are different), i have to join these tables using their firstname and lastname. In some cases like the examples given in the first post they dont match because of the obvious reasons explained above. But i need to extract the correct record from both the tables in the output with those names. Right now i am doing an outer join and getting names that have issues, but this doesn’t solve the problem permenantly. I am looking for cleaning the data using some logic and match the names.

Other data which i can use to match could be date of birth (in some cases they are null in the table) or SSN (some cases null means not entered).

Arun.K (BOB member since 2011-10-18)

system · March 13, 2015, 12:40pm

You would want to do a merge of the two tables. Depending on the size of these tables, you’re going to need a break key for the match. Could be first byte of fn/ln. Then feed it into a name match.

The more data you have for a match the better, and it doesnt always have to be there. (doesnt hurt anything). In the end… someone is going to have to sift through the results, which is what I try to steer clear of.

Im surprised you have a license for a match but no data cleanse.

jlynn73 (BOB member since 2009-10-27)

system · March 16, 2015, 1:54pm

Sorry that’s a miscommunication. I do not have DQM license so it applies to match also.

Is there any python programming for matching using UDT? Any links, suggestions?

Thanks,
Arun

Arun.K (BOB member since 2011-10-18)

system · March 16, 2015, 8:22pm

There is a plethora of Python modules out there to help perform some basic fuzzy matching. You would still need the logic to apply it across the two sets of data, and analyze the results.

You’re not going to find a simple cut & paste solution to this one.

jlynn73 (BOB member since 2009-10-27)

system · March 16, 2015, 8:44pm

Best practice is to let the DBA clean up your data first.

There is no guarantee any kind of fuzzy logic fixes will give you results you are looking for. Garbage in Garbage out as they say…

MoonDragn (BOB member since 2004-05-12)

system · March 20, 2015, 8:24pm

I have to figure out a way to do it, python seems to be much more complicated to implement in DI than I thought. May be I am wrong.

One question - for Match Transform do I still need a directory/rule dictionary to work on? I do not have any directory/rule set up in my environment, so basically I am looking at some thing to work on like Match Transform which don’t need any external package.

I see the DQM enabled in my environment but do not have any USPS address license or name directory etc.

Arun.K (BOB member since 2011-10-18)

system · March 20, 2015, 10:44pm

You should not need a cleansing package to run a match.

Your results may not be sufficient to satisfy your business requirements, which are often difficult to nail down.

In situations like this, it seems like anything you can provide for them is going to be better than what they had before. We have some matching that we do which only matches at a 50-60% match rate, but still saves our staff hundreds of hours of manual lookups.

jlynn73 (BOB member since 2009-10-27)

system · March 24, 2015, 3:18pm

Thank you all, will update on the progress.

Arun.K (BOB member since 2011-10-18)