在 SQL Server 中匹配街道地址的方法有哪些?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/2097323/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 05:08:39  来源:igfitidea点击:

What are ways to match street addresses in SQL Server?

sqlsql-serverstreet-address

提问by dtc

We have a column for street addresses:

我们有一列街道地址:

123 Maple Rd.
321 1st Ave.
etc...

枫树路123号
321 1st Ave.
等...

Is there any way to match these addresses to a given input? The input would be a street address, but it might not be in the same format. For example:

有没有办法将这些地址与给定的输入匹配?输入将是街道地址,但格式可能不同。例如:

123 Maple Road
321 1st Avenue

枫树路 123 号,
第一大道 321 号

Our first thought is to strip the input of all street terms (rd, st, ave, blvd, etc).

我们的第一个想法是去除所有街道术语(rd、st、ave、blvd 等)的输入。

Obviously that won't match reliably all the time. Are there any other ways to try to match street addresses in SQL server?

显然,这不会一直可靠地匹配。还有其他方法可以尝试匹配 SQL Server 中的街道地址吗?

We can use user defined functions, stored procs and regular old t-sql. We cannot use clr.

我们可以使用用户定义的函数、存储过程和常规的旧 t-sql。我们不能使用 clr。

采纳答案by TLiebe

In order to do proper street address matching, you need to get your addresses into a standardized form. Have a look at the USPS postal standards here(I'm asssuming you're dealing with US addresses). It is by no means an easy process if you want to be able to deal with ALL types of US mail addresses. There is software available from companies like QAS and Satori Software that you can use to do the standardization for you. You'll need to export your addresses, run them through the software and then load the database with the updated addresses. There are also third party vendors that will perform the address standardization as well. It may be overkill for what you are trying to do but it's the best way to do it. if the addresses in your database are standardized you'll have a better chance of matching them (especially if you can standardize the input as well).

为了进行正确的街道地址匹配,您需要将您的地址转换为标准化的形式。在此处查看 USPS 邮政标准(我假设您正在处理美国地址)。如果您希望能够处理所有类型的美国邮件地址,这绝不是一个简单的过程。您可以使用 QAS 和 Satori Software 等公司提供的软件为您进行标准化。您需要导出地址,通过软件运行它们,然后将更新后的地址加载到数据库中。还有第三方供应商也将执行地址标准化。对于您正在尝试做的事情来说,这可能有点矫枉过正,但这是最好的方法。如果您的数据库中的地址是标准化的,您将有更好的机会匹配它们(特别是如果您也可以标准化输入)。

回答by Mark Byers

Rather than stripping out the things that can be variable, try to convert them to a "canonical form" that can be compared.

与其剥离可变的事物,不如尝试将它们转换为可以进行比较的“规范形式”。

For example, replace 'rd' or 'rd.' with 'road' and 'st' or 'st.' with 'street' before comparing.

例如,替换“rd”或“rd”。带有“路”和“st”或“st”。在比较之前用'street'。

回答by Daniel Vassallo

You may want to consider using the Levenshtein Distancealgorithm.

您可能需要考虑使用Levenshtein 距离算法。

You can create it as a user-defined function in SQL Server, where it will return the number of operations that need to be performed on String_A so that it becomes String_B. You can then compare the result of the Levenshtein Distance function against some fixed threshold, or against some value derived from the length of the strings.

您可以将其创建为 SQL Server 中的用户定义函数,它将返回需要对 String_A 执行的操作数,以便它变为 String_B。然后,您可以将 Levenshtein 距离函数的结果与某个固定阈值或从字符串长度派生的某个值进行比较。

You would simply use it as follows:

您只需按如下方式使用它:

... WHERE LEVENSHTEIN(address_in_db, address_to_search) < 5;

As Mark Byers suggested, converting variable terms into canonical form will help if you use Levenshtein Distance.

正如Mark Byers 建议的那样,如果您使用 Levenshtein Distance,将可变术语转换为规范形式会有所帮助。

Using Full-Text Searchmay be another option, especially since Levenshtein would normally require a full table scan. This decision may depend on how frequently you intend to do these queries.

使用全文搜索可能是另一种选择,特别是因为 Levenshtein 通常需要全表扫描。此决定可能取决于您打算执行这些查询的频率。

You may want to check out the following Levenshtein Distance implementation for SQL Server:

您可能需要查看 SQL Server 的以下 Levenshtein Distance 实现:

Note: You would need to implement a MIN3 function for the above implementation. You can use the following:

注意:您需要为上述实现实现一个 MIN3 函数。您可以使用以下内容:

CREATE FUNCTION MIN3(@a int, @b int,  @c int)
RETURNS int
AS
BEGIN
    DECLARE @m INT
    SET @m = @a

    IF @b < @m SET @m = @b
    IF @c < @m SET @m = @c

    RETURN @m
END

You may also be interested in checking out the following articles:

您可能还有兴趣查看以下文章:

回答by David Oneill

I think the first step for you is to better define how generous or not you're going to be regarding differing addresses. For example, which of these match and which don't:

我认为您的第一步是更好地定义您对不同地址的慷慨与否。例如,哪些匹配,哪些不匹配:

123 Maple Street
123 Maple St
123 maple street
123 mpale street
123 maple
123. maple st
123 N maple street
123 maple ave
123 maple blvd

Are there both a Maple Street and a Maple Blvd in the same area? What about Oak Street vs Oak Blvd.

同一地区有枫树街和枫树大道吗?橡树街 vs 橡树大道怎么样?

For example, where I live there many streets/roads/blvds/ave that are all named Owasso. I live on Owasso Street, which connects to North Owasso Blvd, which connects to South Owasso Blvd. However, there is only one Victoria Ave.

例如,我住的地方有很多街道/道路/大道/大道,它们都被命名为 Owasso。我住在 Owasso Street,它连接 North Owasso Blvd,连接 South Owasso Blvd。然而,只有一个维多利亚大道。

Given that reality, you must either have a database of all road names, and look for the closest road (and deal with the number seperately)

鉴于这种情况,您必须拥有所有道路名称的数据库,并寻找最近的道路(并单独处理编号)

OR

或者

Make an decision ahead of time what you'll insist on and what you won't.

提前决定你会坚持什么,你不会。

回答by Jason

Stripping out data is a bad idea. Many towns will have dozens of variations of the same street - Oak Street, Oak Road, Oak Lane, Oak Circle, Oak Court, Oak Avenue, etc... As mentioned above converting to the canonical USPS abbreviation is a better approach.

剥离数据是一个坏主意。许多城镇的同一条街道会有数十种变化——橡树街、橡树路、橡树巷、橡树圈、橡树阁、橡树大道等……如上所述,转换为规范的 USPS 缩写是更好的方法。

回答by Greg

You could try SOUNDEX to see if that gets you close. http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx

你可以试试 SOUNDEX 看看是否能让你接近。http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx

回答by Davin Perkins

Address matching and deduplication is a messy business. Other posters are correct when they say that the addresses need to be standardized first to the local postal standards authority (The USPS for example if it is a US addresses). Once the addresses are in standard format the rest is easy.

地址匹配和重复数据删除是一项杂乱无章的业务。其他海报是正确的,他们说地址需要首先向当地邮政标准机构标准化(例如,如果是美国地址,则为 USPS)。一旦地址采用标准格式,剩下的就很容易了。

There are several third-party services which will flag duplicates in a list for you. Doing this solely with a MySQL subquery will not account for differences in address formats and standards. The USPS (for US address) has certain guidelines to make these standard, but only a handful of vendors are certified to perform such operations.

有几种第三方服务可以为您标记列表中的重复项。仅使用 MySQL 子查询执行此操作不会考虑地址格式和标准的差异。USPS(用于美国地址)有制定这些标准的某些指导方针,但只有少数供应商获得了执行此类操作的认证。

So, I would recommend the best answer for you is to export the table into a CSV file, for instance, and submit it to a capable list processor. One such is SmartyStreets' Bulk Address Validation Toolwhich will have it done for you in a few seconds to a few minutes automatically. It will flag duplicate rows with a new field called "Duplicate" and a value of Y in it.

因此,我建议您最好的答案是将表格导出为 CSV 文件,例如,并将其提交给功能强大的列表处理器。其中一个是 SmartyStreets 的批量地址验证工具,它会在几秒钟到几分钟内自动为您完成。它将使用名为“Duplicate”的新字段和其中的 Y 值标记重复行。

Try standardizing and validating a couple of addresses hereto get an idea for what the output will look like.

尝试在此处标准化和验证几个地址以了解输出的外观。

Full Disclosure: I work for SmartyStreets

全面披露:我为SmartyStreets工作