string: Parse usable Street Address, City, State, Zip from a string

Question

Asked by Rob Allen

Problem: I have an address field from an Access database which has been converted to Sql Server 2005. This field has everything all in one field. I need to parse out the individual sections of the address into their appropriate fields in a normalized table. I need to do this for approximately 4,000 records and it needs to be repeatable.

Assumptions:

Assume an address in the US (for now)
assume that the input string will sometimes contain an addressee (the person being addressed) and/or a second street address line (e.g. Suite B)
states may be abbreviated
zip code could be standard 5 digit or zip+4
there are typos in some instances

UPDATE: In response to the questions posed: standards were not universally followed, I need to store the individual values rather than just geocode, and by "errors" I meant typos (corrected above).

Sample Data:

A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947
11522 Shawnee Road, Greenwood DE 19950
144 Kings Highway, S.W. Dover, DE 19901
Intergrated Const. Services 2 Penns Way Suite 405 New Castle, DE 19720
Humes Realty 33 Bridle Ridge Court, Lewes, DE 19958
Nichols Excavation 2742 Pulaski Hwy Newark, DE 19711
2284 Bryn Zion Road, Smyrna, DE 19904
VEI Dover Crossroads, LLC 1500 Serpentine Road, Suite 100 Baltimore MD 21
580 North Dupont Highway Dover, DE 19901
P.O. Box 778 Dover, DE 19903

Answer 1

Accepted answer by Tim Sullivan

I've done a lot of work on this kind of parsing. Because there are errors you won't get 100% accuracy, but there are a few things you can do to get most of the way there, and then do a visual BS test. Here's the general way to go about it. It's not code, because it's pretty academic to write it, there's no weirdness, just lots of string handling.

(Now that you've posted some sample data, I've made some minor changes)

Work backward. Start from the zip code, which will be near the end and in one of two known formats: XXXXX or XXXXX-XXXX. If it doesn't appear, you can assume you're in the city/state portion, described below.
The next thing, before the zip, is going to be the state, either as a two-letter abbreviation or spelled out. You know what these will be, too: there are only 50 of them. You could also Soundex the words to help compensate for spelling errors.
Before that is the city, and it's probably on the same line as the state. You could use a zip-code database to check the city and state based on the zip, or at least use it as a BS detector.
The street address will generally be one or two lines. The second line will generally be the suite number if there is one, but it could also be a PO box.
It's going to be near-impossible to detect a name on the first or second line, though if a line isn't prefixed with a number (or if it's prefixed with "attn:" or "attention to:"), that could give you a hint as to whether it's a name or an address line.
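
As a rough illustration of the first three steps, here is a minimal Python sketch that peels the zip, then the state, then (when a comma makes it possible) the city off the end of a one-line address. The function name parse_tail, the partial STATES table, and the 0.8 similarity cutoff are all assumptions made for this example. It uses difflib similarity in place of Soundex to tolerate misspelled state names, and it only finds a city when a comma separates it from the street.

```python
import difflib
import re

# Assumption: a deliberately partial state table, for illustration only.
STATES = {"delaware": "DE", "maryland": "MD", "new jersey": "NJ", "pennsylvania": "PA"}
ABBREVIATIONS = set(STATES.values())

ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\s*$")      # XXXXX or XXXXX-XXXX at the end
STATE_ABBR_RE = re.compile(r"\b([A-Za-z]{2})\.?$")   # trailing two-letter token


def parse_tail(address):
    """Work backward: strip the zip, then the state, then the city if a comma marks it."""
    rest = address.strip()
    zip_code = state = city = None

    match = ZIP_RE.search(rest)
    if match:
        zip_code = match.group(0).strip()
        rest = rest[:match.start()].rstrip(" ,")

    match = STATE_ABBR_RE.search(rest)
    if match and match.group(1).upper() in ABBREVIATIONS:
        state = match.group(1).upper()
        rest = rest[:match.start()].rstrip(" ,")
    else:
        # Spelled-out state: try the last two words, then the last word, fuzzily.
        words = rest.split()
        for count in (2, 1):
            if len(words) <= count:
                continue
            candidate = " ".join(words[-count:]).strip(",.").lower()
            close = difflib.get_close_matches(candidate, list(STATES), n=1, cutoff=0.8)
            if close:
                state = STATES[close[0]]
                rest = rest.rsplit(None, count)[0].rstrip(" ,")
                break

    # Without a comma the city boundary is ambiguous; leave it for a zip lookup.
    if state and "," in rest:
        rest, city = (part.strip() for part in rest.rsplit(",", 1))

    return {"rest": rest, "city": city, "state": state, "zip": zip_code}


print(parse_tail("11522 Shawnee Road, Greenwood DE 19950"))
```

Whatever is left in "rest" still holds the street lines and any addressee. Rows where state or zip come back as None (for example a truncated zip) are the ones to route to the visual check.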

I hope this helps somewhat.

Answer 2

Answered by James A. Rosen

I think outsourcing the problem is the best bet: send it to the Google (or Yahoo) geocoder. The geocoder returns not only the lat/long (which aren't of interest here), but also a rich parsing of the address, with fields filled in that you didn't send (including ZIP+4 and county).

For example, parsing "1600 Amphitheatre Parkway, Mountain View, CA" yields

{
  "name": "1600 Amphitheatre Parkway, Mountain View, CA, USA",
  "Status": {
    "code": 200,
    "request": "geocode"
  },
  "Placemark": [
    {
      "address": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA",
      "AddressDetails": {
        "Country": {
          "CountryNameCode": "US",
          "AdministrativeArea": {
            "AdministrativeAreaName": "CA",
            "SubAdministrativeArea": {
              "SubAdministrativeAreaName": "Santa Clara",
              "Locality": {
                "LocalityName": "Mountain View",
                "Thoroughfare": {
                  "ThoroughfareName": "1600 Amphitheatre Pkwy"
                },
                "PostalCode": {
                  "PostalCodeNumber": "94043"
                }
              }
            }
          }
        },
        "Accuracy": 8
      },
      "Point": {
        "coordinates": [-122.083739, 37.423021, 0]
      }
    }
  ]
}

Now that's parseable!
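
To make that concrete, here is a small Python sketch that pulls the normalized fields out of a response shaped like the one above. The function name extract_fields and the output keys are made up for this example, and the nesting is taken only from the sample response, so a live service may well return a different layout.

```python
import json

# The sample response from above, trimmed to the parts this sketch reads.
RESPONSE = """
{
  "Status": {"code": 200, "request": "geocode"},
  "Placemark": [
    {
      "address": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA",
      "AddressDetails": {
        "Country": {
          "CountryNameCode": "US",
          "AdministrativeArea": {
            "AdministrativeAreaName": "CA",
            "SubAdministrativeArea": {
              "SubAdministrativeAreaName": "Santa Clara",
              "Locality": {
                "LocalityName": "Mountain View",
                "Thoroughfare": {"ThoroughfareName": "1600 Amphitheatre Pkwy"},
                "PostalCode": {"PostalCodeNumber": "94043"}
              }
            }
          }
        },
        "Accuracy": 8
      }
    }
  ]
}
"""


def extract_fields(response_text):
    """Return street/city/county/state/zip from the first placemark, or None on failure."""
    data = json.loads(response_text)
    if data["Status"]["code"] != 200 or not data.get("Placemark"):
        return None
    country = data["Placemark"][0]["AddressDetails"]["Country"]
    area = country.get("AdministrativeArea", {})
    sub_area = area.get("SubAdministrativeArea", {})
    # Assumption: tolerate responses where Locality sits directly under the state.
    locality = sub_area.get("Locality", area.get("Locality", {}))
    return {
        "street": locality.get("Thoroughfare", {}).get("ThoroughfareName"),
        "city": locality.get("LocalityName"),
        "county": sub_area.get("SubAdministrativeAreaName"),
        "state": area.get("AdministrativeAreaName"),
        "zip": locality.get("PostalCode", {}).get("PostalCodeNumber"),
    }


print(extract_fields(RESPONSE))
```

Using .get() with empty-dict defaults means a missing level yields None for that field instead of raising, which is handy when loading thousands of rows.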

Answer 3

Answered by Nicholas Piasecki

The original poster has likely long moved on, but I took a stab at porting the Perl Geo::StreetAddress::US module used by geocoder.us to C#, dumped it on CodePlex, and think that people stumbling across this question in the future may find it useful:

US Address Parser

On the project's home page, I try to talk about its (very real) limitations. Since it is not backed by the USPS database of valid street addresses, parsing can be ambiguous and it can't confirm nor deny the validity of a given address. It can just try to pull data out from the string.

It's meant for the case when you need to get a set of data mostly in the right fields, or want to provide a shortcut to data entry (letting users paste an address into a textbox rather than tabbing among multiple fields). It is not meant for verifying the deliverability of an address.

It doesn't attempt to parse out anything above the street line, but one could probably diddle with the regex to get something reasonably close--I'd probably just break it off at the house number.

Answer 4

Answered by Christopher Mahan

I've done this in the past.

Either do it manually, (build a nice gui that helps the user do it quickly) or have it automated and check against a recent address database (you have to buy that) and manually handle errors.

Manual handling will take about 10 seconds each, meaning you can do 3600/10 = 360 per hour, so 4000 should take you approximately 11-12 hours. This will give you a high rate of accuracy.

For automation, you need a recent US address database, and you tweak your rules against that. I suggest not going fancy on the regex (hard to maintain long-term, so many exceptions). Go for a 90% match against the database and do the rest manually.

Do get a copy of Postal Addressing Standards (USPS) at http://pe.usps.gov/cpim/ftp/pubs/Pub28/pub28.pdf and notice it is 130+ pages long. Regexes to implement that would be nuts.

For international addresses, all bets are off. US-based workers would not be able to validate.

Alternatively, use a data service. I have, however, no recommendations.

Furthermore: when you do send out the stuff in the mail (that's what it's for, right?) make sure you put "address correction requested" on the envelope (in the right place) and update the database. (We made a simple GUI for the front desk person, the one who actually sorts through the mail, to do that.)

Finally, when you have scrubbed data, look for duplicates.

Answer 5

Answered by Rob Allen

After the advice here, I have devised the following function in VB which creates passable, although not always perfect (if a company name and a suite line are given, it combines the suite and city) usable data. Please feel free to comment/refactor/yell at me for breaking one of my own rules, etc.:

Public Function parseAddress(ByVal input As String) As Collection
    input = input.Replace(",", "")
    input = input.Replace("  ", " ")
    Dim splitString() As String = Split(input)
    Dim streetMarker() As String = New String() {"street", "st", "st.", "avenue", "ave", "ave.", "blvd", "blvd.", "highway", "hwy", "hwy.", "box", "road", "rd", "rd.", "lane", "ln", "ln.", "circle", "circ", "circ.", "court", "ct", "ct."}
    Dim address1 As String
    Dim address2 As String = ""
    Dim city As String
    Dim state As String
    Dim zip As String
    Dim streetMarkerIndex As Integer

    zip = splitString(splitString.Length - 1).ToString()
    state = splitString(splitString.Length - 2).ToString()
    streetMarkerIndex = getLastIndexOf(splitString, streetMarker) + 1
    Dim sb As New StringBuilder

    For counter As Integer = streetMarkerIndex To splitString.Length - 3
        sb.Append(splitString(counter) + " ")
    Next counter
    city = RTrim(sb.ToString())
    Dim addressIndex As Integer = 0

    ' Find where the street address starts: the first numeric token or a "PO"/"P.O." token.
    For counter As Integer = 0 To streetMarkerIndex
        If IsNumeric(splitString(counter)) _
            OrElse splitString(counter).ToLower().Replace(".", "") = "po" Then
            addressIndex = counter
            Exit For
        End If
    Next counter

    sb = New StringBuilder
    For counter As Integer = addressIndex To streetMarkerIndex - 1
        sb.Append(splitString(counter) + " ")
    Next counter

    address1 = RTrim(sb.ToString())

    sb = New StringBuilder

    If addressIndex = 0 Then
        If splitString(splitString.Length - 2).ToString() <> splitString(streetMarkerIndex + 1) Then
            For counter As Integer = streetMarkerIndex To splitString.Length - 2
                sb.Append(splitString(counter) + " ")
            Next counter
        End If
    Else
        For counter As Integer = 0 To addressIndex - 1
            sb.Append(splitString(counter) + " ")
        Next counter
    End If
    address2 = RTrim(sb.ToString())

    Dim output As New Collection
    output.Add(address1, "Address1")
    output.Add(address2, "Address2")
    output.Add(city, "City")
    output.Add(state, "State")
    output.Add(zip, "Zip")
    Return output
End Function

Private Function getLastIndexOf(ByVal sArray As String(), ByVal checkArray As String()) As Integer
    Dim sourceIndex As Integer = 0
    Dim outputIndex As Integer = 0
    For Each item As String In checkArray
        For Each source As String In sArray
            If source.ToLower = item.ToLower Then
                outputIndex = sourceIndex
                If item.ToLower = "box" Then
                    outputIndex = outputIndex + 1
                End If
            End If
            sourceIndex = sourceIndex + 1
        Next
        sourceIndex = 0
    Next
    Return outputIndex
End Function

Passing the parseAddress function "A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947" returns:

2299 Lewes-Georgetown Hwy
A. P. Croll & Son  
Georgetown
DE
19947

Answer 6

Answered by Nicholas Trandem

I've been working in the address processing domain for about 5 years now, and there really is no silver bullet. The correct solution is going to depend on the value of the data. If it's not very valuable, throw it through a parser as the other answers suggest. If it's even somewhat valuable, you'll definitely need to have a human evaluate and correct all the results of the parser. If you're looking for a fully automated, repeatable solution, you probably want to talk to an address correction vendor like Group1 or Trillium.

Answer 7

Answered by Matt

SmartyStreets has a new feature that extracts addresses from arbitrary input strings. (Note: I don't work at SmartyStreets.)

It successfully extracted all addresses from the sample input given in the question above. (By the way, only 9 of those 10 addresses are valid.)

Here's some of the output (screenshot not available):

And here's the CSV-formatted output of that same request:

ID,Start,End,Segment,Verified,Candidate,Firm,FirstLine,SecondLine,LastLine,City,State,ZIPCode,County,DpvFootnotes,DeliveryPointBarcode,Active,Vacant,CMRA,MatchCode,Latitude,Longitude,Precision,RDI,RecordType,BuildingDefaultIndicator,CongressionalDistrict,Footnotes
1,32,79,"2299 Lewes-Georgetown Hwy, Georgetown, DE 19947",N,,,,,,,,,,,,,,,,,,,,,,
2,81,119,"11522 Shawnee Road, Greenwood DE 19950",Y,0,,11522 Shawnee Rd,,Greenwood DE 19950-5209,Greenwood,DE,19950,Sussex,AABB,199505209226,Y,N,N,Y,38.82865,-75.54907,Zip9,Residential,S,,AL,N#
3,121,160,"144 Kings Highway, S.W. Dover, DE 19901",Y,0,,144 Kings Hwy,,Dover DE 19901-7308,Dover,DE,19901,Kent,AABB,199017308444,Y,N,N,Y,39.16081,-75.52377,Zip9,Commercial,S,,AL,L#
4,190,232,"2 Penns Way Suite 405 New Castle, DE 19720",Y,0,,2 Penns Way Ste 405,,New Castle DE 19720-2407,New Castle,DE,19720,New Castle,AABB,197202407053,Y,N,N,Y,39.68332,-75.61043,Zip9,Commercial,H,,AL,N#
5,247,285,"33 Bridle Ridge Court, Lewes, DE 19958",Y,0,,33 Bridle Ridge Cir,,Lewes DE 19958-8961,Lewes,DE,19958,Sussex,AABB,199588961338,Y,N,N,Y,38.72749,-75.17055,Zip7,Residential,S,,AL,L#
6,306,339,"2742 Pulaski Hwy Newark, DE 19711",Y,0,,2742 Pulaski Hwy,,Newark DE 19702-3911,Newark,DE,19702,New Castle,AABB,197023911421,Y,N,N,Y,39.60328,-75.75869,Zip9,Commercial,S,,AL,A#
7,341,378,"2284 Bryn Zion Road, Smyrna, DE 19904",Y,0,,2284 Bryn Zion Rd,,Smyrna DE 19977-3895,Smyrna,DE,19977,Kent,AABB,199773895840,Y,N,N,Y,39.23937,-75.64065,Zip7,Residential,S,,AL,A#N#
8,406,450,"1500 Serpentine Road, Suite 100 Baltimore MD",Y,0,,1500 Serpentine Rd Ste 100,,Baltimore MD 21209-2034,Baltimore,MD,21209,Baltimore,AABB,212092034250,Y,N,N,Y,39.38194,-76.65856,Zip9,Commercial,H,,03,N#
9,455,495,"580 North Dupont Highway Dover, DE 19901",Y,0,,580 N DuPont Hwy,,Dover DE 19901-3961,Dover,DE,19901,Kent,AABB,199013961803,Y,N,N,Y,39.17576,-75.5241,Zip9,Commercial,S,,AL,N#
10,497,525,"P.O. Box 778 Dover, DE 19903",Y,0,,PO Box 778,,Dover DE 19903-0778,Dover,DE,19903,Kent,AABB,199030778781,Y,N,N,Y,39.20946,-75.57012,Zip5,Residential,P,,AL,
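
For loading output like this into a normalized table, a short Python sketch along the following lines may help. It reads three of the rows above with the standard csv module, keeps only the verified ones, and flags rows where the verified ZIP differs from the one in the input text. The function name verified_addresses and the output keys are assumptions for this example; the column names come from the header above.

```python
import csv
import io

# Header and three rows taken from the output above (rows 1, 2 and 6).
SAMPLE = '''ID,Start,End,Segment,Verified,Candidate,Firm,FirstLine,SecondLine,LastLine,City,State,ZIPCode,County,DpvFootnotes,DeliveryPointBarcode,Active,Vacant,CMRA,MatchCode,Latitude,Longitude,Precision,RDI,RecordType,BuildingDefaultIndicator,CongressionalDistrict,Footnotes
1,32,79,"2299 Lewes-Georgetown Hwy, Georgetown, DE 19947",N,,,,,,,,,,,,,,,,,,,,,,
2,81,119,"11522 Shawnee Road, Greenwood DE 19950",Y,0,,11522 Shawnee Rd,,Greenwood DE 19950-5209,Greenwood,DE,19950,Sussex,AABB,199505209226,Y,N,N,Y,38.82865,-75.54907,Zip9,Residential,S,,AL,N#
6,306,339,"2742 Pulaski Hwy Newark, DE 19711",Y,0,,2742 Pulaski Hwy,,Newark DE 19702-3911,Newark,DE,19702,New Castle,AABB,197023911421,Y,N,N,Y,39.60328,-75.75869,Zip9,Commercial,S,,AL,A#
'''


def verified_addresses(csv_text):
    """Return one dict per verified row, ready to insert into separate columns."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Verified"] != "Y":
            continue  # unverified segments still need a human look
        results.append({
            "street": row["FirstLine"],
            "city": row["City"],
            "state": row["State"],
            "zip": row["ZIPCode"],
            # True when the service corrected the ZIP found in the input text.
            "zip_changed": not row["Segment"].rstrip().endswith(row["ZIPCode"]),
        })
    return results


for address in verified_addresses(SAMPLE):
    print(address)
```

Flagging changed ZIPs keeps the typo corrections visible (row 6 moves from 19711 to 19702), so they can be spot-checked rather than silently overwritten.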

I was the developer who originally wrote the service. The algorithm we implemented is a bit different from any specific answers here, but each extracted address is verified against the address lookup API, so you can be sure whether it's valid or not. Each verified result is guaranteed, but we know the other results won't be perfect because, as has been made abundantly clear in this thread, addresses are unpredictable, sometimes even for humans.

Answer 8

Answered by Kevin

This won't solve your problem, but if you only needed lat/long data for these addresses, the Google Maps API will parse non-formatted addresses pretty well.

Good suggestion, alternatively you can execute a CURL request for each address to Google Maps and it will return the properly formatted address. From that, you can regex to your heart's content.

Answer 9

Answered by weston

+1 on James A. Rosen's suggested solution as it has worked well for me, however for completists this site is a fascinating read and the best attempt I've seen in documenting addresses worldwide: http://www.columbia.edu/kermit/postal.html

Answer 10

Answered by Chuck

Another request for sample data.

As has been mentioned I would work backwards from the zip.

Once you have a zip I would query a zip database, store the results, and remove them & the zip from the string.

That will leave you with the address mess. MOST (All?) addresses will start with a number so find the first occurrence of a number in the remaining string and grab everything from it to the (new) end of the string. That will be your address. Anything to the left of that number is likely an addressee.

You should now have the City, State, & Zip stored in a table and possibly two strings, addressee and address. For the address, check for the existence of "Suite" or "Apt." etc. and split that into two values (address lines 1 & 2).

For the addressee I would punt and grab the last word of that string as the last name and put the rest into the first name field. If you don't want to do that, you'll need to check for salutation (Mr., Ms., Dr., etc.) at the start and make some assumptions based on the number of spaces as to how the name is made up.
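
The splitting steps described above might look roughly like this in Python, assuming the city, state and zip have already been removed from the string. The names split_remainder and split_name, the list of unit keywords, and the special case for PO boxes (which start with letters rather than a number) are assumptions added for this sketch.

```python
import re

# Assumption: a short list of secondary-unit keywords; extend as the data requires.
UNIT_RE = re.compile(r"\b(Suite|Ste|Apt|Unit)\b\.?", re.IGNORECASE)
PO_BOX_RE = re.compile(r"\bP\.?\s?O\.?\s+Box\b", re.IGNORECASE)


def split_remainder(text):
    """Split what is left of the string into addressee, address line 1 and line 2."""
    text = text.strip(" ,")
    po_box = PO_BOX_RE.search(text)
    first_digit = re.search(r"\d", text)
    if po_box:
        start = po_box.start()
    elif first_digit:
        start = first_digit.start()
    else:
        # No number at all: treat the whole thing as an addressee for manual review.
        return {"addressee": text, "line1": "", "line2": ""}

    addressee = text[:start].strip(" ,")
    street = text[start:].strip(" ,")

    unit = UNIT_RE.search(street)
    if unit:
        line1 = street[:unit.start()].strip(" ,")
        line2 = street[unit.start():].strip(" ,")
    else:
        line1, line2 = street, ""
    return {"addressee": addressee, "line1": line1, "line2": line2}


def split_name(addressee):
    """Punt: last word is the last name, everything before it is the first name."""
    parts = addressee.split()
    if len(parts) < 2:
        return addressee, ""
    return " ".join(parts[:-1]), parts[-1]


print(split_remainder("Intergrated Const. Services 2 Penns Way Suite 405"))
print(split_remainder("P.O. Box 778"))
```

An addressee that itself contains a digit will be split in the wrong place, so rows with an unusually short addressee are worth reviewing by hand.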

I don't think there's any way you can parse with 100% accuracy.
