Original question: http://stackoverflow.com/questions/4349160/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to group latitude/longitude points that are 'close' to each other?
Asked by Tim Lytle
I have a database of user-submitted latitude/longitude points and am trying to group 'close' points together. 'Close' is relative, but for now it seems to be ~500 feet.
At first it seemed I could just group by rows that have the same latitude/longitude for the first 3 decimal places (roughly a 300x300 box, understanding that it changes as you move away from the equator).
However, that method seems to be quite lacking. 'Closeness' can't be significantly different than the distance each decimal place represents. It doesn't take into account that two locations may have different digits in the 3rd (or any) decimal place, but still be within the distance that place represents (33.1239 and 33.1240).
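The truncation idea and its boundary problem can be sketched in a few lines of Python. This is only an illustration: `cell_key` and `group_by_cell` are hypothetical helper names, the longitude value is made up, and the small `round` call is there purely to guard against floating-point noise before flooring.

```python
import math
from collections import defaultdict

def cell_key(lat, lon, places=3):
    """Truncate both coordinates to `places` decimals to get a grid-cell key."""
    factor = 10 ** places
    # round(..., 6) absorbs tiny floating-point errors before flooring
    return (math.floor(round(lat * factor, 6)),
            math.floor(round(lon * factor, 6)))

def group_by_cell(points, places=3):
    """Group (lat, lon) pairs that share the same truncated cell key."""
    groups = defaultdict(list)
    for lat, lon in points:
        groups[cell_key(lat, lon, places)].append((lat, lon))
    return dict(groups)

# 33.1239 and 33.1240 are only 0.0001 degrees apart,
# yet they fall on opposite sides of a cell boundary.
points = [(33.1239, -117.3001), (33.1240, -117.3001), (33.1241, -117.3001)]
groups = group_by_cell(points)
```

Here the first point ends up alone in its cell while the other two share a cell, even though all three are a few dozen feet apart.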
I've also mulled over the situation where Point A and Point C are both 'close' to Point B (but not to each other): should they be grouped together? If so, what happens when Point D is 'close' to Point C (and no other points): should it be grouped as well? Certainly I have to determine the desired behavior, but how would either be implemented?
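If the "chained" behavior is the one wanted (A, B, C and D all end up together because each is close to the next), one possible way to get it is a union-find pass over all close pairs. The sketch below is not taken from any of the answers; it assumes the points have already been projected to planar coordinates in feet, and `transitive_groups` is a hypothetical name. It compares every pair, so it is only reasonable for small data sets.

```python
import math

def transitive_groups(points, threshold):
    """Group point indices so that any chain of 'close' pairs shares a group."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# A, B, C, D are 400 ft apart in a row; E is far away.
points = [(0, 0), (400, 0), (800, 0), (1200, 0), (5000, 0)]
groups = transitive_groups(points, threshold=500)
```

With a 500 ft threshold, A and C are not close to each other, but they still land in the same group through B. If that is not the desired behavior, a centroid-based check is the usual alternative.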
Can anyone point me in the right direction as to how this can be done and what different methods/approaches can be used?
I feel a bit like I'm missing something obvious.
Currently the data is in a MySQL database, used by a PHP application; however, I'm open to other storage methods if they're a key part of accomplishing this.
Accepted answer by eykanal
There are a number of ways of determining the distance between two points, but for plotting points on a 2-D graph you probably want the Euclidean distance. If (x1, y1) represents your first point and (x2, y2) represents your second, the distance is
d = sqrt( (x2-x1)^2 + (y2-y1)^2 )
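Applied directly to raw degrees, this formula treats a degree of longitude as if it were as long as a degree of latitude, which is only true at the equator. One hedged way to use it for points a few hundred feet apart is to convert the differences to feet first, shrinking the longitude difference by the cosine of the latitude. The sketch below assumes a spherical Earth with a mean radius of about 6371 km; `approx_distance_ft` is a hypothetical name.

```python
import math

# Assumed mean Earth radius (about 6371 km), converted to feet
EARTH_RADIUS_FT = 6_371_000 / 0.3048

def approx_distance_ft(lat1, lon1, lat2, lon2):
    """Euclidean distance on a local flat projection; fine for short ranges."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    # East-west offset: longitude lines converge by cos(latitude)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_FT
    # North-south offset
    y = math.radians(lat2 - lat1) * EARTH_RADIUS_FT
    return math.hypot(x, y)  # sqrt(x^2 + y^2)

d = approx_distance_ft(33.1239, -117.3, 33.1240, -117.3)
```

For the two latitudes from the question, `d` comes out at roughly 36 feet, well inside a 500 foot threshold.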
Regarding grouping, you may want to use some sort of 2-D mean to determine how "close" things are to each other. For example, if you have three points, (x1, y1), (x2, y2), (x3, y3), you can find the center of these three points by simple averaging:
x(mean) = (x1+x2+x3)/3
y(mean) = (y1+y2+y3)/3
You can then see how close each is to the center to determine whether it should be part of the "cluster".
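A minimal sketch of the averaging step and the distance-to-center check follows. It assumes planar (x, y) coordinates in a common unit; `centroid` and `members_near_center` are hypothetical names, and the cutoff `max_dist` is something you would have to choose.

```python
import math

def centroid(points):
    """Simple average of the x and y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def members_near_center(points, max_dist):
    """Keep only the points within max_dist of the centroid."""
    center = centroid(points)
    return [p for p in points if math.dist(p, center) <= max_dist]

points = [(0, 0), (6, 0), (0, 6)]
center = centroid(points)                        # (2.0, 2.0)
near = members_near_center(points, max_dist=3)   # only (0, 0) is within 3 units
```

Note that the center moves whenever a point is added or dropped, so in practice this check is usually repeated until the membership stops changing, which is essentially what centroid-based clustering algorithms do.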
There are a number of ways one can define clusters, all of which use some variant of a clustering algorithm. I'm in a rush now and don't have time to summarize, but check out the link and the algorithms, and hopefully other people will be able to provide more detail. Good luck!
Answer by araqnid
Use something similar to the method you outlined in your question to get an approximate set of results, then whittle that approximate set down by doing proper calculations. If you pick your grid size (i.e. how much you round off your co-ordinates) correctly, you can at least hope to reduce the amount of work to be done to an acceptable level, although you have to manage what that grid size is.
For example, the earthdistance extension to PostgreSQL works by converting lat/long pairs to (x,y,z) Cartesian co-ordinates, modelling the Earth as a uniform sphere. PostgreSQL has a sophisticated indexing system that allows these co-ordinates, or boxes around them, to be indexed into R-trees, but you can whack something together that is still useful without that.
If you take your (x,y,z) triple and round it off (i.e. multiply by some factor and truncate to integer), you then have three integers that you can concatenate to produce a "box name", which identifies the box in your "grid" that the point is in.
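The conversion and the "box name" could look something like the sketch below. It assumes a spherical Earth of radius 6371 km and a box edge given in kilometres; `to_xyz` and `box_name` are hypothetical names, the `:` separator is arbitrary, and flooring is used instead of plain truncation so that boxes on both sides of zero have the same size.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth

def to_xyz(lat, lon, radius=EARTH_RADIUS_KM):
    """Convert degrees of latitude/longitude to Cartesian (x, y, z) in km."""
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    return (radius * math.cos(lat_r) * math.cos(lon_r),
            radius * math.cos(lat_r) * math.sin(lon_r),
            radius * math.sin(lat_r))

def box_name(lat, lon, box_size_km=1.0):
    """Scale each co-ordinate, floor to an integer, and join into one key."""
    ix, iy, iz = (math.floor(c / box_size_km) for c in to_xyz(lat, lon))
    return f"{ix}:{iy}:{iz}"

name = box_name(0.0, 0.0)  # "6371:0:0"
```

The resulting string can be stored in an ordinary indexed column, so finding everything in a box becomes a plain equality lookup.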
If you want to search for all points within X km of some target point, you generate all the "box names" around that point (easy once you've converted your target point to an (x,y,z) triple as well) and eliminate all the boxes that don't intersect the Earth's surface (trickier, but applying the x^2+y^2+z^2=R^2 formula at each corner will tell you). You end up with a list of boxes that matching points can be in, so just search for all points matching one of those boxes; this will also return some extra points. As a final stage you need to calculate the actual distance to your target point and eliminate the points that are too far away (again, this can be sped up by working in Cartesian co-ordinates and converting your target great-circle distance radius to secant distance).
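The search described above can be sketched with an in-memory dictionary standing in for the database index. This is an illustration under several assumptions: a spherical Earth of radius 6371 km, hypothetical function names throughout, and no pruning of boxes that miss the Earth's surface (empty boxes simply return nothing from the lookup, so skipping that step costs a few wasted lookups rather than correctness).

```python
import math
from collections import defaultdict
from itertools import product

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth

def to_xyz(lat, lon):
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    return (EARTH_RADIUS_KM * math.cos(lat_r) * math.cos(lon_r),
            EARTH_RADIUS_KM * math.cos(lat_r) * math.sin(lon_r),
            EARTH_RADIUS_KM * math.sin(lat_r))

def box_of(xyz, size_km):
    return tuple(math.floor(c / size_km) for c in xyz)

def chord_km(arc_km):
    """Straight-line distance through the sphere for a great-circle arc."""
    return 2 * EARTH_RADIUS_KM * math.sin(arc_km / (2 * EARTH_RADIUS_KM))

def build_index(points, size_km):
    index = defaultdict(list)
    for lat, lon in points:
        index[box_of(to_xyz(lat, lon), size_km)].append((lat, lon))
    return index

def search(index, size_km, lat, lon, radius_km):
    target = to_xyz(lat, lon)
    limit = chord_km(radius_km)
    # Every box touched by the cube that encloses the search sphere
    lo = box_of([c - limit for c in target], size_km)
    hi = box_of([c + limit for c in target], size_km)
    hits = []
    for box in product(*(range(a, b + 1) for a, b in zip(lo, hi))):
        for p in index.get(box, ()):
            # Final stage: exact straight-line check in Cartesian space
            if math.dist(to_xyz(*p), target) <= limit:
                hits.append(p)
    return hits

points = [(33.1239, -117.3001), (33.1240, -117.3001), (34.0, -117.3)]
index = build_index(points, size_km=1.0)
hits = search(index, 1.0, 33.1239, -117.3001, radius_km=0.2)
```

In a SQL database the inner lookup would become an `IN (...)` query over the box-name column, followed by the same distance filter.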
The fiddling around comes down to making sure you don't have to search too many boxes, but at the same time don't bring in too many extra points. I've found it useful to index each point on several different grids (e.g. resolutions of 1 km, 5 km, 25 km, 125 km etc.). Ideally you want to be searching just one box; remember that the search expands to at least 27 boxes as soon as your target radius exceeds your grid size.
I've used this technique to construct a spatial index using Lucene rather than doing calculations in a SQL database. It does work, although there is some fiddling to set it up, and the indices take a while to generate and are quite big. Using an R-tree to hold all the co-ordinates is a much nicer approach, but would take more custom coding. This technique basically just requires a fast hash-table lookup (so it would probably work well with all the NoSQL databases that are the rage these days, and should be usable in a SQL database too).
Answer by Roberto Russo
Maybe overkill, but this looks to me like a clustering problem: the distance measure will determine how the similarity of two elements is calculated. If you need a less naive solution, try Data Mining: Practical Machine Learning Tools and Techniques, and use Weka or Orange.
Answer by Deepak Upreti
If you are working with latitude and longitude, there are several factors to consider in real-time data: obstructions, such as rivers and lakes, and facilities, such as bridges and tunnels. You cannot group the points simply; if you use a simple algorithm such as k-means, you will not be able to group them. I think you should go for spatial clustering methods, such as the partitioning method CLARANS.
Answer by patros
If I were tackling it, I'd start with a grid. Put each point into a square on the grid. Look for grids that are densely populated. If the adjacent grids aren't populated, then you have a decent group.
If you have adjacent densely populated grids, you can always drop a circle at the center of each grid and optimize for circle area vs (number of points in the circle * some tunable weight). Not perfect, but easy. Better groupings are much more complicated optimization problems.
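The first half of this idea (bin the points, then look for dense squares whose neighbours are empty) might be sketched as follows. It assumes planar (x, y) coordinates; `cell_counts`, `isolated_dense_cells` and the `min_points` cutoff are hypothetical, and the circle-fitting step for adjacent dense squares is left out.

```python
from collections import Counter
from itertools import product

def cell_counts(points, cell_size):
    """Count how many points fall in each grid square."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

def isolated_dense_cells(points, cell_size, min_points):
    """Squares holding at least min_points whose 8 neighbours are all empty."""
    counts = cell_counts(points, cell_size)
    result = []
    for (cx, cy), n in counts.items():
        if n < min_points:
            continue
        neighbours = [(cx + dx, cy + dy)
                      for dx, dy in product((-1, 0, 1), repeat=2)
                      if (dx, dy) != (0, 0)]
        if all(counts.get(c, 0) == 0 for c in neighbours):
            result.append((cx, cy))
    return sorted(result)

points = [(10, 10), (20, 30), (40, 40),       # square (0, 0)
          (510, 510), (520, 520), (610, 510)]  # squares (5, 5) and (6, 5)
groups = isolated_dense_cells(points, cell_size=100, min_points=2)
```

Square (0, 0) is reported as a clean group, while (5, 5) is dense but has a populated neighbour, so it would need the second, more involved step.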