C# 地址匹配密钥算法

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/824588/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-05 03:06:32  来源:igfitidea点击:

Address Match Key Algorithm

c#algorithmtsqljoinstreet-address

提问by sestocker

I have a list of addresses in two separate tables that are slightly off that I need to be able to match. For example, the same address can be entered in multiple ways:

我在两个单独的表中有一个地址列表,这些表稍微偏离了我需要能够匹配的。例如,可以通过多种方式输入相同的地址:

  • 110 Test St
  • 110 Test St.
  • 110 Test Street
  • 110测试街
  • 110 测试圣。
  • 测试街110号

Although simple, you can imagine the situation in more complex scenerios. I am trying to develop a simple algorithm that will be able to match the above addresses as a key.

虽然简单,但你可以想象更复杂的场景中的情况。我正在尝试开发一种简单的算法,能够将上述地址作为密钥进行匹配。

For example. the key might be "11TEST" - first two of 110, first two of Test and first two of street variant. A full match key would also include first 5 of the zipcode as well so in the above example, the full key might look like "11TEST44680".

例如。关键可能是“11TEST”——110 的前两个,Test 的前两个和街道变体的前两个。完整匹配键还包括邮政编码的前 5 个,因此在上面的示例中,完整键可能类似于“11TEST44680”。

I am looking for ideas for an effective algorithm or resources I can look at for considerations when developing this. Any ideas can be pseudo code or in your language of choice.

我正在寻找有效算法或资源的想法,我可以在开发时考虑考虑因素。任何想法都可以是伪代码或您选择的语言。

We are only concerned with US addresses. In fact, we are only looking at addresses from 250 zip codes from Ohio and Michigan. We also do not have access to any postal software although would be open to ideas for cost effective solutions (it would essentially be a one time use). Please be mindful that this is an initial dump of data from a government source so suggestions of how users can clean it are helpful as I build out the application but I would love to have the best initial I possibly can by being able to match addresses as best as possible.

我们只关心美国地址。事实上,我们只查看来自俄亥俄州和密歇根州 250 个邮政编码的地址。我们也无法访问任何邮政软件,尽管我们愿意接受具有成本效益的解决方案的想法(它本质上是一次性使用)。请注意,这是来自政府来源的初始数据转储,因此用户如何清理它的建议在我构建应用程序时很有帮助,但我希望通过能够匹配地址来获得最好的初始值尽可能最好。

采纳答案by BenAlabaster

I'm working on a similar algorithm as we speak, it should handle addresses in Canada, USA, Mexico and the UK by the time I'm done. The problem I'm facing is that they're in our database in a 3 field plaintext format [whoever thought thatwas a good idea should be shot IMHO], so trying to handle rural routes, general deliveries, large volume receivers, multiple countries, province vs. state vs. county, postal codes vs. zip codes, spelling mistakes is no small or simple task.

我正在研究我们所说的类似算法,它应该在我完成时处理加拿大、美国、墨西哥和英国的地址。我面临的问题是,它们以 3 字段纯文本格式存在于我们的数据库中 [任何认为是个好主意的人都应该开枪恕我直言],因此尝试处理农村路线、一般交付、大容量接收器、多个国家, 省 vs. 州 vs. 县, 邮政编码 vs. 邮政编码, 拼写错误是不小的或不简单的任务。

Spelling mistakes alone was no small feat - especially when you get to countries that use French names - matching Saint, Sainte, St, Ste, Saints, Saintes, Sts, Stes, Grand, Grande, Grands, Grandes with or without period or hyphenation to the larger part of a name cause no end of performance issues - especially when St could mean saint orstreet and may or may not have been entered in the correct context (i.e. feminine vs. masculine). What if the address has largely been entered correctly but has an incorrect province or postal code?

仅拼写错误就不是一件小事——尤其是当你到达使用法语名称的国家时——匹配 Saint、Sainte、St、Ste、Saints、Saintes、Sts、Stes、Grand、Grande、Grands、Grandes,有或没有句号或连字号名称的较大部分会导致性能问题永无止境 - 特别是当 St 可能意味着圣人街道并且可能已或可能未在正确的上下文中输入(即女性与男性)。如果地址大部分输入正确,但省或邮政编码不正确怎么办?

One place to start your search is the Levenstein Distance Algorithmwhich I've found to be really useful for eliminating a large portion of spelling mistakes. After that, it's mostly a case of searching for keywords and comparing against a postal database.

开始搜索的一个地方是Levenstein 距离算法,我发现它对于消除大部分拼写错误非常有用。之后,主要是搜索关键字并与邮政数据库进行比较。

I would be really interested in collaborating with anyone that is currently developing tools to do this, perhaps we can assist each other to a common solution. I'm already part of the way there and have overcome all the issues I've mentioned so far, having someone else working on the same problem would be really helpful to bounce ideas off.

我真的很想与任何目前正在开发工具来做到这一点的人合作,也许我们可以互相帮助找到一个共同的解决方案。我已经参与其中,并且已经克服了到目前为止我提到的所有问题,让其他人解决相同的问题对于激发想法非常有帮助。

Cheers - [ben at afsinc dot ca]

干杯 - [本在 afsinc dot ca]

回答by JeeBee

In the UK we would use:

在英国,我们会使用:

  • House Name or Number (where name includes Flat number for apartment blocks)
  • Postcode
  • 房屋名称或编号(其中名称包括公寓楼的单位编号)
  • 邮政编码

You should certainly be using the postcode, but in the US I believe your Zip codes cover very wide areas compared to postcodes in the UK. You would therefore need to use the street and city.

您当然应该使用邮政编码,但在美国,与英国的邮政编码相比,我相信您的邮政编码涵盖的范围非常广泛。因此,您需要使用街道和城市。

Your example wouldn't differentiate between 11 Test Street, 110 - 119 Test Street, etc.

您的示例不会区分 11 Test Street、110 - 119 Test Street 等。

If your company has access to an address lookup system, I would run all the data through that to get the data back in a consistent format, possibly with address keys that can be used for matching.

如果您的公司可以访问地址查找系统,我会通过该系统运行所有数据,以便以一致的格式获取数据,可能带有可用于匹配的地址键。

回答by Brimstedt

If you dont chose to use an existing system, one idea is to do the following:

如果您不选择使用现有系统,一个想法是执行以下操作:

  • Extract numbers from the address line
  • replace common street words with blanks
  • create match string
  • 从地址行中提取数字
  • 用空格替换常见的街头词
  • 创建匹配字符串

ie: "555 Canal Street":

即:“555 运河街”:

  • Extract number gives "555" + "Canal Street"
  • Replace street words gives "555" + "Canal"
  • Create match string gives "555Canal"
  • 提取编号给出“555”+“运河街”
  • 替换街道词给“555”+“运河”
  • 创建匹配字符串给出“555Canal”

"Canal st 555" would give the same match string.

“Canal st 555”会给出相同的匹配字符串。

By street words i mean words and abbreviations for "street" in your language, for example "st", "st.", "blv", "ave", "avenue", etc etc all are removed from the string.

我所说的街道词是指您的语言中“街道”的单词和缩写,例如“st”、“st.”、“blv”、“ave”、“avenue”等都从字符串中删除。

By extracting numbers and separating them from the string it does not matter if they are first or last.

通过提取数字并将它们从字符串中分离出来,它们是第一个还是最后一个都没有关系。

回答by Robert Gowland

If I was to take a crack at this I'd convert each address string into a tree using a pre-defined order of operations.

如果我要对此进行破解,我会使用预定义的操作顺序将每个地址字符串转换为一棵树。

Eg. 110 Test Street Apt 3. Anywhere California 90210 =>

例如。110 Test Street Apt 3. 加利福尼亚州任何地方 90210 =>

  1. Get the type of address. Eg Street addresses have different formats that rural route addresses and this is different by country.
  2. Given that this is a street address, get the string that represents the type of street and convert that to an enum (eBoulevard, eRoad, etc..)
  3. Given that this is a street address, pull out the street name (store in lower case)
  4. Given that this is a street address, pull out the street number
  5. Given that this is a street address, look for any apartment number (could be before the street number with a dash, could be after "Apt.", etc...)

       eStreet  //1.an enum of possible address types eg. eStreet, eRuralRoute,...
          |
       eStreet        //2.an enum of street types eg. eStreet, eBlvd, eWay,...
       /   |   \
    

    Name Number Apt | | | test 110 3

  1. 获取地址类型。例如,街道地址与农村路线地址的格式不同,这因国家/地区而异。
  2. 鉴于这是一个街道地址,获取表示街道类型的字符串并将其转换为枚举(eBoulevard、eRoad 等)。
  3. 鉴于这是一个街道地址,拉出街道名称(小写的商店)
  4. 鉴于这是一个街道地址,拉出街道号码
  5. 鉴于这是一个街道地址,请查找任何公寓号(可以在带有破折号的街道号之前,也可以在“Apt.”之后,等等...)

       eStreet  //1.an enum of possible address types eg. eStreet, eRuralRoute,...
          |
       eStreet        //2.an enum of street types eg. eStreet, eBlvd, eWay,...
       /   |   \
    

    姓名编号 Apt | | | 测试 110 3

Eg. RR#3 Anywhere California 90210 =>

例如。RR#3 任何地方加利福尼亚 90210 =>

  1. Get the type of address: rural route
  2. Given that this is a rural route address, get the route number

       eRuralRoute 
          |
          3
    
  1. 获取地址类型:乡村路线
  2. 鉴于这是一个乡村路线地址,获取路线编号

       eRuralRoute 
          |
          3
    

You'll need to do something similar for country state and zip information.

您需要对国家/地区和邮政编码信息执行类似的操作。

Then compare the resulting trees.

然后比较生成的树。

This makes the comparison very simple, however, the code to generate the trees is very tricky. You'd want to test the crap out of it on thousands and thousands of addresses. Your problem is simpler if it is only US addresses you care about; British addresses as already mentioned are quite different, and Canadian address may have French in them (eg. Place D'Arms, Rue Laurent, etc...)

这使得比较非常简单,但是,生成树的代码非常棘手。你想在成千上万的地址上测试它的废话。如果您只关心美国地址,那么您的问题会更简单;已经提到的英国地址完全不同,加拿大地址中可能包含法语(例如 Place D'Arms、Rue Laurent 等...)

回答by KM.

use an identity for the primary key, this will always be unique and will make it easier to merge duplicates later.

为主键使用一个身份,这将始终是唯一的,并且以后可以更容易地合并重复项。

force proper data entry with the user interface. Make them enter each component in its own text box. The house number is entered in own box, the street name in its own box, city in own box, state from select list, etc.. This will make looking for matches easier

强制使用用户界面正确输入数据。让他们在自己的文本框中输入每个组件。在自己的框中输入门牌号,在自己的框中输入街道名称,在自己的框中输入城市,选择列表中的州等。这将使查找匹配更容易

have a two process "save"

有两个过程“保存”

  • after initial save, do a search to look up matches, present them with list of possible matches as well as the new one.
  • after they select the new one save it, if they pick an existing one use that ID
  • 初始保存后,进行搜索以查找匹配项,向它们显示可能的匹配项列表以及新的匹配项。
  • 在他们选择新的后保存它,如果他们选择一个现有的使用该 ID

clean the data. Try to strip out "street", "st", "drive", etc and store it as a StreetType char(1) that uses a FK to a table containing the proper abbreviations, so you can build the street.

清理数据。尝试去除“street”、“st”、“drive”等,并将其存储为 StreetType char(1),它使用 FK 到包含正确缩写的表,以便您可以构建街道。

look into SOUNDEX and DIFFERENCE

研究 SOUNDEX 和 DIFFERENCE

I have worked at large companies that maintain mailinig lists, and they did not attempt to do it automatically, they used people to filter out the new from the dups because it is so hard to do. Plan for a merge feature so you can manually merge duplicates when they occur, and ripple the values through the PKs.

我曾在维护邮件列表的大公司工作过,他们并没有尝试自动完成,他们使用人员从重复中过滤掉新的,因为这很难做到。计划合并功能,以便您可以在出现重复项时手动合并它们,并通过 PK 对值进行波动。

You might look into the google maps api and see if you can pass in you address and get a match back. I'm not familiar with it, this is just speculation.

您可能会查看 google maps api,看看是否可以传入您的地址并返回匹配项。我不熟悉,这只是猜测。

回答by Marc Bernier

If you would prefer tonot develop one and rather use an off-the-shelf product that uses many of the technologies mentioned here, see: http://www.melissadata.com/dqt/matchup-api.htm

如果您不想开发一个,而是使用使用了这里提到的许多技术的现成产品,请参阅:http: //www.melissadata.com/dqt/matchup-api.htm

Disclaimer: I had a role in its development and work for the company.

免责声明:我在公司的发展和工作中发挥了作用。

回答by Handcraftsman

If it is cost-effective for your company to write its own address normalization tool then I'd suggest starting with the USPS address standard. Alternatively, there are any number of vendors offering server side tools and web services to normalize, correct and verify addresses.

如果贵公司编写自己的地址规范化工具具有成本效益,那么我建议从USPS 地址标准开始。或者,有许多供应商提供服务器端工具和 Web 服务来规范、更正和验证地址。

My company uses AccuMail Goldfor this purpose because it does a lot more than just standardize & correct the address. When we considered the cost of even one week's worth of salary to develop a tool in-house the choice to buy an off-the-shelf product was obvious.

我的公司为此目的使用AccuMail Gold,因为它所做的不仅仅是标准化和更正地址。当我们考虑到在内部开发工具甚至一周的薪水的成本时,购买现成产品的选择是显而易见的。