Original question: http://stackoverflow.com/questions/3983325/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Calculate distance between Zip Codes... AND users.
Asked by bopapa_1979
This is more of a challenge question than something I urgently need, so don't spend all day on it, guys.
I built a dating site (long gone) back in 2000 or so, and one of the challenges was calculating the distance between users so we could present your "matches" within an X mile radius. To state the problem, given the following database schema (roughly):
USER TABLE: UserId, UserName, ZipCode
ZIPCODE TABLE: ZipCode, Latitude, Longitude
With USER and ZIPCODE being joined on USER.ZipCode = ZIPCODE.ZipCode.
What approach would you take to answer the following question: what other users live in Zip Codes that are within X miles of a given user's Zip Code?
We used the 2000 census data, which has tables for zip codes and their approximate latitude and longitude.
We also used the Haversine Formula to calculate distances between any two points on a sphere... pretty simple math really.
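For reference, a minimal Python sketch of the Haversine formula mentioned here. The function name and the Earth radius constant (3958.8 miles, a commonly quoted mean radius) are assumptions for illustration, not something taken from the original site.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

The inputs are plain latitude/longitude pairs, so the function can be fed directly from the ZIPCODE table columns.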
The question, at least for us, being the 19-year-old college students we were, really became how to efficiently calculate and/or store distances from all members to all other members. One approach (the one we used) would be to import all the data and calculate the distance FROM every zip code TO every other zip code. Then you'd store and index the results. Something like:
SELECT User.UserId
FROM ZipCode AS MyZipCode
INNER JOIN ZipDistance ON MyZipCode.ZipCode = ZipDistance.MyZipCode
INNER JOIN ZipCode AS TheirZipCode ON ZipDistance.OtherZipCode = TheirZipCode.ZipCode
INNER JOIN User AS User ON TheirZipCode.ZipCode = User.ZipCode
WHERE ( MyZipCode.ZipCode = 75044 )
AND ( ZipDistance.Distance < 50 )
The problem, of course, is that the ZipDistance table is going to have a LOT of rows in it. It isn't completely unworkable, but it is really big. Also it requires complete pre-work on the whole data set, which is also not unmanageable, but not necessarily desirable.
Anyway, I was wondering what approach some of you gurus might take on something like this. Also, I think this is a common issue programmers have to tackle from time to time, especially if you consider problems that are just algorithmically similar. I'm interested in a thorough solution that includes at least HINTS on all the pieces to do this really quickly and efficiently. Thanks!
Answered by Paul McMillan
Ok, for starters, you don't really need to use the Haversine formula here. For large distances where a less accurate formula produces a larger error, your users don't care if the match is plus or minus a few miles, and for closer distances, the error is very small. There are easier (to calculate) formulas listed on the Geographical Distance Wikipedia article.
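One of the cheaper formulas of the kind referred to here is the equirectangular (flat-earth) approximation. The sketch below is an illustration under that assumption; the answer does not say which formula it had in mind, and the function name and radius constant are made up for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def equirectangular_miles(lat1, lon1, lat2, lon2):
    """Approximate distance: shrink longitude by cos(mean latitude), then use Pythagoras."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    x = math.radians(lon2 - lon1) * math.cos((phi1 + phi2) / 2)
    y = phi2 - phi1
    return EARTH_RADIUS_MILES * math.hypot(x, y)
```

It needs one cosine and one square root per pair, and for points a few tens of miles apart it stays very close to the great-circle distance.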
Since zip codes are nothing like evenly spaced, any process that partitions them evenly is going to suffer mightily in areas where they are clustered tightly (east coast near DC being a good example). If you want a visual comparison, check out http://benfry.com/zipdecode and compare the zipcode prefix 89 with 07.
A far better way to deal with indexing this space is to use a data structure like a Quadtree or an R-tree. This structure allows you to do spatial and distance searches over data which is not evenly spaced.
Here's what a Quadtree looks like:
[Figure: Quadtree]
To search over it, you drill down through each larger cell using the index of smaller cells that are within it. Wikipedia explains it more thoroughly.
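To make the drill-down concrete, here is a small point-quadtree sketch in Python. The class, its `capacity` parameter, and the sample coordinates (taken from the zipcode rows quoted elsewhere on this page) are illustrative assumptions. It is not how PostGIS or any particular database implements its index, and it does not guard against more than `capacity` points sharing identical coordinates.

```python
class QuadTree:
    """Point quadtree over a rectangle; a full node splits into four children."""

    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []      # (x, y, payload) tuples held by this leaf
        self.children = None  # four sub-quadrants once the node has split

    def insert(self, x, y, payload):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return False  # point lies outside this cell
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y, payload))
                return True
            self._split()
        # Hand the point to the first child whose cell contains it.
        return any(child.insert(x, y, payload) for child in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [
            QuadTree(x0, y0, mx, my, self.capacity),
            QuadTree(mx, y0, x1, my, self.capacity),
            QuadTree(x0, my, mx, y1, self.capacity),
            QuadTree(mx, my, x1, y1, self.capacity),
        ]
        old, self.points = self.points, []
        for x, y, payload in old:
            any(child.insert(x, y, payload) for child in self.children)

    def query(self, qx0, qy0, qx1, qy1):
        """Return payloads of all points inside the query rectangle."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return []  # no overlap: prune this whole subtree
        found = [p for (x, y, p) in self.points if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        if self.children is not None:
            for child in self.children:
                found.extend(child.query(qx0, qy0, qx1, qy1))
        return found


# x is longitude, y is latitude.
tree = QuadTree(-180.0, -90.0, 180.0, 90.0)
for zip_code, lat, lon in [
    ("00601", 18.166, -66.7236),
    ("00602", 18.383, -67.1866),
    ("91210", 34.1419, -118.261),
    ("91214", 34.2325, -118.246),
    ("91221", 34.1653, -118.289),
]:
    tree.insert(lon, lat, zip_code)

nearby = sorted(tree.query(-119.0, 34.0, -118.0, 35.0))
```

Dense areas end up with deep subtrees and sparse areas with shallow ones, which is why the structure copes with unevenly spaced zip codes.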
Of course, since this is a fairly common thing to do, someone else has already done the hard part for you. Since you haven't specified what database you're using, the PostgreSQL extension PostGIS will serve as an example. PostGIS includes R-tree spatial indexes, which allow you to do efficient spatial querying.
Once you've imported your data and built the spatial index, a distance query looks like:
SELECT zip
FROM zipcode
WHERE
geom && expand(transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661), 16093)
AND
distance(
transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661),
geom) < 16093
I'll let you work through the rest of the tutorial yourself.
Here are some other references to get you started.
Answered by Jon Black
I'd just create a zip_code_distances table and pre-compute the distances between all 42K zipcodes in the US which are within a 20-25 mile radius of each other.
create table zip_code_distances
(
from_zip_code mediumint not null,
to_zip_code mediumint not null,
distance decimal(6,2) default 0.0,
primary key (from_zip_code, to_zip_code),
key (to_zip_code)
)
engine=innodb;
Only including zipcodes within a 20-25 mile radius of each other reduces the number of rows you need to store in the distance table from its maximum of about 1.7 billion (42K ^ 2 - 42K) to a much more manageable 4 million or so.
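The maximum row count quoted here can be checked with a couple of lines of Python. The variable names are illustrative; 42,000 is the rounded zipcode count from the answer.

```python
n = 42_000  # approximate number of US zipcodes, as quoted in the answer

# Every from -> to combination, minus each zipcode paired with itself.
ordered_pairs = n * n - n

# Distance is symmetric, so only half of those are distinct unordered pairs.
unordered_pairs = ordered_pairs // 2

print(f"{ordered_pairs:,}")    # 1,763,958,000
print(f"{unordered_pairs:,}")  # 881,979,000
```

The table in the answer stores both directions of each pair, which keeps lookups by from_zip_code simple at the cost of twice the rows.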
I downloaded a zipcode datafile from the web which contained the longitudes and latitudes of all the official US zipcodes in csv format:
"00601","Adjuntas","Adjuntas","Puerto Rico","PR","787","Atlantic", 18.166, -66.7236
"00602","Aguada","Aguada","Puerto Rico","PR","787","Atlantic", 18.383, -67.1866
...
"91210","Glendale","Los Angeles","California","CA","818","Pacific", 34.1419, -118.261
"91214","La Crescenta","Los Angeles","California","CA","818","Pacific", 34.2325, -118.246
"91221","Glendale","Los Angeles","California","CA","818","Pacific", 34.1653, -118.289
...
I wrote a quick and dirty C# program to read the file and compute the distance between every pair of zipcodes, but only output pairs that fall within a 25 mile radius of each other:
// Write every zipcode pair within 25 miles as "from|to|distance".
using (var sw = new StreamWriter(path))
{
    foreach (ZipCode fromZip in zips)
    {
        foreach (ZipCode toZip in zips)
        {
            if (toZip.ZipArea == fromZip.ZipArea) continue; // skip self-pairs
            double dist = ZipCode.GetDistance(fromZip, toZip);
            if (dist > 25) continue; // outside the radius
            sw.WriteLine(string.Format("{0}|{1}|{2}", fromZip.ZipArea, toZip.ZipArea, dist));
        }
    }
}
The resultant output file looks as follows:
from_zip_code|to_zip_code|distance
...
00601|00606|16.7042215574185
00601|00611|9.70353520976393
00601|00612|21.0815707704904
00601|00613|21.1780461311929
00601|00614|20.101431539283
...
91210|90001|11.6815708119899
91210|90002|13.3915723402714
91210|90003|12.371251171873
91210|90004|5.26634939906721
91210|90005|6.56649623829871
...
I would then just load this distance data into my zip_code_distances table using load data infile and then use it to limit the search space of my application.
For example if you have a user whose zipcode is 91210 and they want to find people who are within a 10 mile radius of them then you can now simply do the following:
select
p.*
from
people p
inner join
(
select
to_zip_code
from
zip_code_distances
where
from_zip_code = 91210 and distance <= 10
) search
on p.zip_code = search.to_zip_code
where
p.gender = 'F'....
Hope this helps
EDIT: extended the radius to 100 miles, which increased the number of zipcode distances to 32.5 million rows.
A quick performance check for zipcode 91210 ran in 0.009 seconds.
select count(*) from zip_code_distances
count(*)
========
32589820
select
to_zip_code
from
zip_code_distances
where
from_zip_code = 91210 and distance <= 10;
0:00:00.009: Query OK
Answered by babtek
You could shortcut the calculation by just assuming a box instead of a circular radius. Then when searching you simply calculate the lower/upper bound of lat/lon for a given point+"radius", and as long as you have an index on the lat/lon columns you could pull back all records that fall within the box pretty easily.
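A minimal sketch of the box lookup, using Python's built-in sqlite3 module so it runs without a database server. The table name, column names, and index name are assumptions, and the sample rows come from the zipcode data quoted elsewhere on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zipcode (zip TEXT PRIMARY KEY, lat REAL, lon REAL)")
# Index on the coordinate columns so the range filter does not scan the whole table.
conn.execute("CREATE INDEX idx_zipcode_lat_lon ON zipcode (lat, lon)")
conn.executemany(
    "INSERT INTO zipcode VALUES (?, ?, ?)",
    [
        ("00601", 18.166, -66.7236),
        ("00602", 18.383, -67.1866),
        ("91210", 34.1419, -118.261),
        ("91214", 34.2325, -118.246),
        ("91221", 34.1653, -118.289),
    ],
)


def zips_in_box(conn, min_lat, max_lat, min_lon, max_lon):
    """Return the zipcodes whose coordinates fall inside the lat/lon box."""
    cur = conn.execute(
        "SELECT zip FROM zipcode "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ? ORDER BY zip",
        (min_lat, max_lat, min_lon, max_lon),
    )
    return [zip_code for (zip_code,) in cur]


print(zips_in_box(conn, 34.0, 34.3, -118.4, -118.1))  # ['91210', '91214', '91221']
```

Zipcodes are stored as text here so that leading zeros such as "00601" survive.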
Answered by David Watson
I would use latitude and longitude. For example, if you have a latitude of 45 and a longitude of 45 and were asked to find matches within 50 miles, then you could do it by moving 50/69ths of a degree up in latitude and 50/69ths of a degree down in latitude (1 degree of latitude ~ 69 miles). Select zip codes with latitudes in this range. Longitudes are a little different, because a degree of longitude gets smaller as you move closer to the poles.
But at 45 degrees, 1 degree of longitude ~ 49 miles, so you could move 50/49ths of a degree left in longitude and 50/49ths of a degree right in longitude, and select all zip codes from the latitude set that fall within this longitude range. This gives you all zip codes within a square with sides of a hundred miles. If you wanted to be really precise, you could then use the Haversine formula which you mentioned to weed out zips in the corners of the box, to give you a circle.
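The arithmetic in this answer can be sketched as a small Python helper. The 69 miles per degree figure is the approximation quoted above; the function name is made up, and the sketch assumes the point is not close to a pole or to the 180 degree meridian.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # approximate figure quoted in the answer above


def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) for a box around the point."""
    dlat = radius_miles / MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks by cos(latitude) as you move toward the poles.
    miles_per_degree_lon = MILES_PER_DEGREE_LAT * math.cos(math.radians(lat))
    dlon = radius_miles / miles_per_degree_lon
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


min_lat, max_lat, min_lon, max_lon = bounding_box(45.0, 45.0, 50.0)
```

At latitude 45 the helper gives roughly 48.8 miles per degree of longitude, which matches the "~ 49 miles" figure in the answer.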
Answered by Jander
You could divide your space into regions of roughly equal size -- for instance, approximate the earth as a buckyball or icosahedron. The regions could even overlap a bit, if that's easier (e.g. make them circular). Record which region(s) each ZIP code is in. Then you can precalculate the maximum distance possible between every region pair, which has the same O(n^2) problem as calculating all the ZIP code pairs, but for smaller n.
Now, for any given ZIP code, you can get a list of regions that are definitely within your given range, and a list of regions that cross the border. For the former, just grab all the ZIP codes. For the latter, drill down into each border region and calculate against individual ZIP codes.
It's certainly more complex mathematically, and in particular the number of regions would have to be chosen for a good balance between the size of the table vs. the time spent calculating on the fly, but it reduces the size of the precalculated table by a good margin.
Answered by bopapa_1979
I have the solution running great, and pretty much everyone's answer got used. I was thinking about this in terms of the old solution instead of just "starting over." Babtek gets the nod for stating it in the simplest terms.
I'll skip the code because I'll provide references to derive the needed formulas, and there is too much to cleanly post here.
1) Consider Point A on a sphere, represented by latitude and longitude. Figure out the North, South, East, and West edges of a box 2X miles across with Point A at the center.
2) Select all points within the box from the ZipCode table. This involves a simple WHERE clause with two BETWEEN conditions limiting by Lat and Long.
3) Use the haversine formula to determine the spherical distance between Point A and every point B returned in step 2.
4) Discard all points B where distance A -> B > X.
5) Select users where ZipCode is in the remaining set of points B.
This is pretty fast for > 100 miles. The longest result was ~ 0.014 seconds to calculate the match, and the select statement was trivial to run.
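The five steps above can be sketched in Python as follows. This is an illustration under assumptions, not the code the author skipped: the in-memory dictionaries stand in for the ZipCode and User tables, the user names are invented, the coordinates come from the zipcode rows quoted elsewhere on this page, and the box formula assumes the search circle does not reach a pole or the 180 degree meridian.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def zips_within(zip_table, center_zip, radius_miles):
    lat, lon = zip_table[center_zip]
    # Step 1: edges of the box, as maximum latitude and longitude differences.
    angular = radius_miles / EARTH_RADIUS_MILES
    dlat = math.degrees(angular)
    dlon = math.degrees(math.asin(math.sin(angular) / math.cos(math.radians(lat))))
    # Step 2: cheap box filter (the two BETWEEN conditions in SQL).
    candidates = [
        (z, la, lo)
        for z, (la, lo) in zip_table.items()
        if abs(la - lat) <= dlat and abs(lo - lon) <= dlon
    ]
    # Steps 3 and 4: exact distance, discarding anything beyond the radius.
    return sorted(
        z for z, la, lo in candidates if haversine_miles(lat, lon, la, lo) <= radius_miles
    )


def users_near(users, zip_table, center_zip, radius_miles):
    # Step 5: users whose zipcode is in the remaining set.
    nearby = set(zips_within(zip_table, center_zip, radius_miles))
    return sorted(name for name, z in users.items() if z in nearby)


ZIPS = {
    "00601": (18.166, -66.7236),
    "91210": (34.1419, -118.261),
    "91214": (34.2325, -118.246),
    "91221": (34.1653, -118.289),
}
USERS = {"ann": "91214", "bob": "00601", "cat": "91221"}  # hypothetical users
```

The box filter does the bulk of the elimination, so the trigonometry in steps 3 and 4 only runs on the handful of candidates that survive it.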
Also, as a side note, it was necessary to implement the math in a couple of functions and call them in SQL. Once I got past a certain distance the matching number of ZipCodes was too large to pass back to SQL and use as an IN statement, so I had to use a temp table and join the resulting ZipCodes to User on the ZipCode column.
I suspect that using a ZipDistance table will not provide a long-term performance gain. The number of rows just gets really big. If you calculate the distance from every zip code to every other zip code (eventually) then the resultant row count from 40,000 zip codes would be ~ 1.6B. Whoa!
Alternatively, I am interested in using SQL's built-in geography type to see if that will make this easier, but good old int/float types served fine for this sample.
So... final list of online resources I used, for your easy reference:
1) Maximum Difference, Latitude and Longitude.
2) Haversine Formula.
3) Lengthy but complete discussion of the whole process, which I found from Googling stuff in your answers.
Answered by John Smith
Not every possible pair of zip codes is going to be used. I would build zipdistance as a 'cache' table. For each request, calculate the distance for that pair and save it in the cache. When a request for a distance pair comes in, first look in the cache, then compute the distance if it's not there.
I do not know the intricacies of distance calculations, so I would also check whether computing on the fly is cheaper than looking up (also taking into consideration how often you have to compute).
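A minimal in-memory sketch of the cache idea in Python. A dictionary stands in for the cache table, the class and attribute names are hypothetical, and the distance function is the Haversine formula with an assumed Earth radius of 3958.8 miles.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


class DistanceCache:
    """Compute zipcode-to-zipcode distances lazily and remember the results."""

    def __init__(self, coords):
        self.coords = coords  # zip -> (lat, lon)
        self.cache = {}       # (zip_a, zip_b) -> miles; stands in for the cache table
        self.misses = 0       # how many distances actually had to be computed

    def distance(self, zip_a, zip_b):
        # Distance is symmetric, so store each pair once under a sorted key.
        key = (zip_a, zip_b) if zip_a <= zip_b else (zip_b, zip_a)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self._compute(*key)
        return self.cache[key]

    def _compute(self, zip_a, zip_b):
        (lat1, lon1), (lat2, lon2) = self.coords[zip_a], self.coords[zip_b]
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dlam = math.radians(lon2 - lon1)
        a = (math.sin((phi2 - phi1) / 2) ** 2
             + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
        return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


cache = DistanceCache({"91210": (34.1419, -118.261), "91214": (34.2325, -118.246)})
first = cache.distance("91210", "91214")
second = cache.distance("91214", "91210")  # served from the cache
```

Comparing `misses` against the total number of calls gives a rough way to check whether the lookup is actually saving work compared with computing on the fly.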
Answered by Facundo Colombier
I know that this post is very old, but while doing some research for a client I found some useful functionality in the Google Maps API that is simple to implement. You just need to pass the origin and destination ZIP codes in the URL, and it calculates the distance, even taking traffic into account. You can use it from any language:
origins = 90210
destinations = 93030
mode = driving
Following the link, you can see that it returns JSON. Remember that you need an API key to use this on your own hosting.
Source: http://stanhub.com/find-distance-between-two-postcodes-zipcodes-driving-time-in-current-traffic-using-google-maps-api/
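A short Python sketch of building such a request URL. It assumes the answer is describing the Google Maps Distance Matrix API, whose JSON endpoint takes origins, destinations, mode and key parameters; the helper name and the placeholder key are made up, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed endpoint: the Distance Matrix API's JSON output.
BASE_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"


def distance_matrix_url(origin_zip, destination_zip, api_key, mode="driving"):
    """Build the request URL for one origin/destination pair."""
    params = {
        "origins": origin_zip,
        "destinations": destination_zip,
        "mode": mode,
        "key": api_key,
    }
    return BASE_URL + "?" + urlencode(params)


url = distance_matrix_url("90210", "93030", "YOUR_API_KEY")
```

Note that this returns driving distance along roads, which is a different quantity from the straight-line distances discussed in the other answers.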