DBSCAN for clustering of geographic location data
Original question: http://stackoverflow.com/questions/34579213/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow and keep it under the same license.
Asked by Neil
I have a dataframe with latitude and longitude pairs.
Here is what my dataframe looks like.
order_lat order_long
0 19.111841 72.910729
1 19.111342 72.908387
2 19.111342 72.908387
3 19.137815 72.914085
4 19.119677 72.905081
5 19.119677 72.905081
6 19.119677 72.905081
7 19.120217 72.907121
8 19.120217 72.907121
9 19.119677 72.905081
10 19.119677 72.905081
11 19.119677 72.905081
12 19.111860 72.911346
13 19.111860 72.911346
14 19.119677 72.905081
15 19.119677 72.905081
16 19.119677 72.905081
17 19.137815 72.914085
18 19.115380 72.909144
19 19.115380 72.909144
20 19.116168 72.909573
21 19.119677 72.905081
22 19.137815 72.914085
23 19.137815 72.914085
24 19.112955 72.910102
25 19.112955 72.910102
26 19.112955 72.910102
27 19.119677 72.905081
28 19.119677 72.905081
29 19.115380 72.909144
30 19.119677 72.905081
31 19.119677 72.905081
32 19.119677 72.905081
33 19.119677 72.905081
34 19.119677 72.905081
35 19.111860 72.911346
36 19.111841 72.910729
37 19.131674 72.918510
38 19.119677 72.905081
39 19.111860 72.911346
40 19.111860 72.911346
41 19.111841 72.910729
42 19.111841 72.910729
43 19.111841 72.910729
44 19.115380 72.909144
45 19.116625 72.909185
46 19.115671 72.908985
47 19.119677 72.905081
48 19.119677 72.905081
49 19.119677 72.905081
50 19.116183 72.909646
51 19.113827 72.893833
52 19.119677 72.905081
53 19.114100 72.894985
54 19.107491 72.901760
55 19.119677 72.905081
I want to cluster the points that are close to each other (within 200 meters). The following is my distance matrix.
from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))
array([[ 0. , 0.2522482 , 0.2522482 , ..., 1.67313071,
1.05925366, 1.05420922],
[ 0.2522482 , 0. , 0. , ..., 1.44111548,
0.81742536, 0.98978355],
[ 0.2522482 , 0. , 0. , ..., 1.44111548,
0.81742536, 0.98978355],
...,
[ 1.67313071, 1.44111548, 1.44111548, ..., 0. ,
1.02310118, 1.22871515],
[ 1.05925366, 0.81742536, 0.81742536, ..., 1.02310118,
0. , 1.39923529],
[ 1.05420922, 0.98978355, 0.98978355, ..., 1.22871515,
1.39923529, 0. ]])
Then I apply the DBSCAN clustering algorithm to the distance matrix.
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
y_db = db.fit_predict(distance_matrix)
I don't know how to choose the eps and min_samples values. It puts points that are far too distant from each other (approx. 2 km apart) into one cluster. Is it because it calculates Euclidean distance while clustering? Please help.
Accepted answer by Has QUIT--Anony-Mousse
DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know with acceleration for geo distances is ELKI (Java); scikit-learn unfortunately only supports this for a few distances like Euclidean distance (see sklearn.neighbors.NearestNeighbors).
But apparently, you can afford to precompute pairwise distances, so this is not (yet) an issue.
However, you did not read the documentation carefully enough, and your assumption that DBSCAN uses a distance matrix is wrong:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)
uses Euclidean distance on the distance matrix rows, which obviously does not make any sense.
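To see why this goes wrong, here is a minimal sketch using only the standard library. The three points and their distances are made up for illustration (points on a line at 0, 1 and 10 km), and row_euclidean is a hypothetical helper, not part of scikit-learn. It compares the true distance stored in the matrix with the Euclidean distance between two rows of that matrix, which is roughly what gets measured when the matrix is passed in as if it were a feature array.

```python
import math

# Made-up pairwise distances (km) for three points on a line at 0, 1 and 10 km.
D = [
    [0.0, 1.0, 10.0],
    [1.0, 0.0, 9.0],
    [10.0, 9.0, 0.0],
]


def row_euclidean(matrix, i, j):
    """Euclidean distance between rows i and j, treating each row as a feature vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(matrix[i], matrix[j])))


for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(
        f"points {i},{j}: true distance {D[i][j]:.2f} km, "
        f"row-wise Euclidean {row_euclidean(D, i, j):.2f}"
    )
```

The row-wise values do not match the true distances, so an eps chosen in km no longer means what you intended.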
See the documentation of DBSCAN (emphasis added):
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
metric: string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only "nonzero" elements may be considered neighbors for DBSCAN.
Similarly for fit_predict:
X: array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
In other words, you need to do:
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
Answered by Jamie Bull
I don't know what implementation of haversine you're using, but it looks like it returns results in km, so eps should be 0.2, not 2, for 200 m.
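As a quick check of the units, here is a minimal standard-library sketch. It assumes a mean Earth radius of 6371 km; haversine_km is a hypothetical helper and not necessarily the implementation used in the question. It computes the distance between rows 0 and 1 of the dataframe, which should come out close to the 0.2522482 entry in the question's distance matrix if that matrix is in km.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Rows 0 and 1 of the dataframe in the question.
d_km = haversine_km(19.111841, 72.910729, 19.111342, 72.908387)
eps_km = 0.2  # 200 m expressed in km

print(f"distance: {d_km:.4f} km, within eps: {d_km <= eps_km}")
```

With eps=2 these two points, and everything else within 2 km, would count as neighbours, which fits the oversized cluster described in the question.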
For the min_samples parameter, that depends on what your expected output is. Here are a couple of examples. My outputs use an implementation of haversine based on this answer, which gives a distance matrix similar, but not identical, to yours.
This is with db = DBSCAN(eps=0.2, min_samples=5)
[ 0 -1 -1 -1 1 1 1 -1 -1 1 1 1 2 2 1 1 1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 1 1 1 2 0 -1 1 2 2 0 0 0 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1]
This creates three clusters, 0, 1 and 2, and a lot of the samples don't fall into a cluster with at least 5 members and so are not assigned to a cluster (shown as -1).
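To illustrate how min_samples interacts with eps, here is a minimal standard-library sketch of the core-point test. The distance matrix is made up (four points on a line at 0, 0.1, 0.15 and 5 km) and core_points is a hypothetical helper. It assumes, as I understand the scikit-learn documentation, that the neighbourhood count includes the point itself.

```python
# Made-up pairwise distances (km) for points on a line at 0, 0.1, 0.15 and 5 km.
D = [
    [0.00, 0.10, 0.15, 5.00],
    [0.10, 0.00, 0.05, 4.90],
    [0.15, 0.05, 0.00, 4.85],
    [5.00, 4.90, 4.85, 0.00],
]


def core_points(matrix, eps, min_samples):
    """Flag each point whose eps-neighbourhood (itself included) has >= min_samples points."""
    return [sum(1 for d in row if d <= eps) >= min_samples for row in matrix]


print("min_samples=2:", core_points(D, eps=0.2, min_samples=2))
print("min_samples=5:", core_points(D, eps=0.2, min_samples=5))
```

With min_samples=5 no point in this toy data is a core point, so nothing can seed a cluster; lowering min_samples lets the three nearby points form one.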
Trying again with a smaller min_samples value:
db = DBSCAN(eps=0.2, min_samples=2)
[ 0 1 1 2 3 3 3 4 4 3 3 3 5 5 3 3 3 2 6 6 7 3 2 2 8 8 8 3 3 6 3 3 3 3 3 5 0 -1 3 5 5 0 0 0 6 -1 -1 3 3 3 7 -1 3 -1 -1 3]
Here most of the samples are within 200 m of at least one other sample and so fall into one of nine clusters, 0 to 8.
Edited to add
It looks like @Anony-Mousse is right, though I didn't see anything wrong in my results. For the sake of contributing something, here's the code I was using to see the clusters:
from math import radians, cos, sin, asin, sqrt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd


def haversine(latlon1, latlon2):
    """
    Calculate the great circle distance in kilometers between two points
    on the earth (each given as (lat, lon) in decimal degrees)
    """
    lat1, lon1 = latlon1
    lat2, lon2 = latlon2
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r


# The CSV is assumed to contain exactly two columns, 'lat' and 'lng', in that order
X = pd.read_csv('dbscan_test.csv')
# Pairwise haversine distances in km, expanded to a square matrix
distance_matrix = squareform(pdist(X, haversine))

db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed')  # using "precomputed" as recommended by @Anony-Mousse
y_db = db.fit_predict(distance_matrix)

# Colour each point by its cluster label (-1 means noise)
X['cluster'] = y_db
plt.scatter(X['lat'], X['lng'], c=X['cluster'])
plt.show()
Answered by eos
You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps value is still 2 km, but it's divided by 6371 to convert it to radians. Also, notice that .fit() takes the coordinates in radian units for the haversine metric.
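The unit conversion can be sketched with the standard library alone. This assumes a mean Earth radius of 6371 km; km_to_radians is a hypothetical helper name. For the 200 m threshold from the question, eps would be 0.2/6371 rather than 2/6371.

```python
from math import radians

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def km_to_radians(distance_km):
    """Convert a surface distance in km to the angle (radians) it spans on the sphere."""
    return distance_km / EARTH_RADIUS_KM


# 200 m, the threshold from the question, expressed as an angle.
eps_rad = km_to_radians(0.2)

# Rows 0 and 1 of the question's dataframe as (lat, lon), converted to radians,
# which is the input order and unit the haversine metric expects.
coords_deg = [(19.111841, 72.910729), (19.111342, 72.908387)]
coords_rad = [(radians(lat), radians(lon)) for lat, lon in coords_deg]

print(f"eps in radians: {eps_rad:.3e}")
print(coords_rad)
```

The resulting eps_rad and coords_rad are the kind of values you would pass as eps and to .fit() in the call above.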
Answered by big-o
@eos gives the best answer, I think: as well as making use of Haversine distance (the most relevant distance measure in this case), it avoids the need to generate a precomputed distance matrix. If you create a distance matrix then you need to calculate the pairwise distances for every combination of points (although you can obviously save a bit of time by taking advantage of the fact that your distance metric is symmetric).
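To put a number on that cost, here is a tiny sketch that counts the unordered pairs for the 56 points listed in the question. A symmetric metric needs n(n-1)/2 distance evaluations, which is also the length of the condensed array that pdist returns.

```python
from itertools import combinations

n_points = 56  # rows 0..55 in the question's dataframe

# Each unordered pair (i, j) with i < j needs one distance evaluation.
pairs = sum(1 for _ in combinations(range(n_points), 2))

print(f"{n_points} points -> {pairs} pairwise distances")
print(f"full square matrix entries: {n_points * n_points}")
```

The count grows quadratically, so doubling the number of points roughly quadruples the work.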
If you just supply DBSCAN with a distance measure and use the ball_tree algorithm though, it can avoid the need to calculate every possible distance. This is because the ball tree algorithm can use the triangle inequality to reduce the number of candidates that need to be checked to find the nearest neighbours of a data point (this is the biggest job in DBSCAN).
The triangle inequality states:
|x+y| <= |x| + |y|
...so if a point p is distance x from its neighbour n, and another point q is a distance y from p, then q is at least |x - y| away from n. If |x - y| is greater than our nearest neighbour radius, we know that q must be too far away from n to be considered a neighbour, so we don't need to calculate its distance.
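A minimal sketch of this kind of pruning, using the lower bound |x - y| <= d(q, n) that follows from the triangle inequality. The planar points and the can_skip helper are made up for illustration; a real ball tree applies the idea to whole groups of points rather than one at a time.

```python
import math


def can_skip(d_pn, d_pq, radius):
    """True if the triangle inequality alone proves that d(q, n) > radius."""
    return abs(d_pn - d_pq) > radius


radius = 1.0
n = (0.0, 0.0)       # the point whose neighbours we want
p = (3.0, 4.0)       # a reference point with known distance to n
q_far = (30.0, 40.0)
q_near = (0.5, 0.0)

d_pn = math.dist(p, n)

for name, q in [("q_far", q_far), ("q_near", q_near)]:
    d_pq = math.dist(p, q)
    if can_skip(d_pn, d_pq, radius):
        print(f"{name}: skipped, lower bound {abs(d_pn - d_pq):.2f} exceeds radius")
    else:
        # The bound is inconclusive, so the real distance has to be computed.
        print(f"{name}: computed d(q, n) = {math.dist(q, n):.2f}")
```

The check never discards a true neighbour: a point is skipped only when its lower bound already exceeds the radius.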
Read more about how ball trees work in the scikit-learn documentation