Python: using DBSCAN to cluster geographic location data

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/34579213/

Time: 2020-08-19 15:12:52  Source: igfitidea

DBSCAN for clustering of geographic location data

python cluster-analysis dbscan

Asked by Neil

I have a dataframe with latitude and longitude pairs.

Here is what my dataframe looks like.

    order_lat  order_long
0   19.111841   72.910729
1   19.111342   72.908387
2   19.111342   72.908387
3   19.137815   72.914085
4   19.119677   72.905081
5   19.119677   72.905081
6   19.119677   72.905081
7   19.120217   72.907121
8   19.120217   72.907121
9   19.119677   72.905081
10  19.119677   72.905081
11  19.119677   72.905081
12  19.111860   72.911346
13  19.111860   72.911346
14  19.119677   72.905081
15  19.119677   72.905081
16  19.119677   72.905081
17  19.137815   72.914085
18  19.115380   72.909144
19  19.115380   72.909144
20  19.116168   72.909573
21  19.119677   72.905081
22  19.137815   72.914085
23  19.137815   72.914085
24  19.112955   72.910102
25  19.112955   72.910102
26  19.112955   72.910102
27  19.119677   72.905081
28  19.119677   72.905081
29  19.115380   72.909144
30  19.119677   72.905081
31  19.119677   72.905081
32  19.119677   72.905081
33  19.119677   72.905081
34  19.119677   72.905081
35  19.111860   72.911346
36  19.111841   72.910729
37  19.131674   72.918510
38  19.119677   72.905081
39  19.111860   72.911346
40  19.111860   72.911346
41  19.111841   72.910729
42  19.111841   72.910729
43  19.111841   72.910729
44  19.115380   72.909144
45  19.116625   72.909185
46  19.115671   72.908985
47  19.119677   72.905081
48  19.119677   72.905081
49  19.119677   72.905081
50  19.116183   72.909646
51  19.113827   72.893833
52  19.119677   72.905081
53  19.114100   72.894985
54  19.107491   72.901760
55  19.119677   72.905081

I want to cluster the points that are close to each other (within 200 meters). The following is my distance matrix.

from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

array([[ 0.        ,  0.2522482 ,  0.2522482 , ...,  1.67313071,
     1.05925366,  1.05420922],
   [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,
     0.81742536,  0.98978355],
   [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,
     0.81742536,  0.98978355],
   ..., 
   [ 1.67313071,  1.44111548,  1.44111548, ...,  0.        ,
     1.02310118,  1.22871515],
   [ 1.05925366,  0.81742536,  0.81742536, ...,  1.02310118,
     0.        ,  1.39923529],
   [ 1.05420922,  0.98978355,  0.98978355, ...,  1.22871515,
     1.39923529,  0.        ]])

Then I apply the DBSCAN clustering algorithm to the distance matrix.

 from sklearn.cluster import DBSCAN

 db = DBSCAN(eps=2,min_samples=5)
 y_db = db.fit_predict(distance_matrix)

I don't know how to choose the eps and min_samples values. It puts points that are far too distant from each other (approx. 2 km apart) into one cluster. Is it because it calculates Euclidean distance while clustering? Please help.

Accepted answer by Has QUIT--Anony-Mousse

DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know with acceleration for geo distances is ELKI (Java); scikit-learn unfortunately only supports this for a few distances like Euclidean distance (see sklearn.neighbors.NearestNeighbors). But apparently, you can afford to precompute pairwise distances, so this is not (yet) an issue.

However, you did not read the documentation carefully enough, and your assumption that DBSCAN uses a distance matrix is wrong:

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)

uses Euclidean distance on the distance matrix rows, which obviously does not make any sense.

See the documentation of DBSCAN (emphasis added):

class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)

metric: string, or callable

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.

Similarly for fit_predict:

X: array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

In other words, you need to do:

db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
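
As a minimal sketch of what this looks like end to end, the example below runs DBSCAN on a small precomputed matrix. The four-point distance matrix is made up for illustration (values assumed to be in km), and eps=0.2 with min_samples=2 are likewise assumed values chosen so that the toy result is easy to read, not the settings from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy distances (km) between four points: points 0, 1 and 2 lie
# within 0.2 km of each other, point 3 is about 1 km away from all of them.
distance_matrix = np.array([
    [0.00, 0.05, 0.10, 1.00],
    [0.05, 0.00, 0.08, 1.05],
    [0.10, 0.08, 0.00, 0.95],
    [1.00, 1.05, 0.95, 0.00],
])

# metric="precomputed" tells DBSCAN that the input already holds pairwise
# distances, so eps is compared against the matrix entries directly.
db = DBSCAN(eps=0.2, min_samples=2, metric="precomputed")
labels = db.fit_predict(distance_matrix)

print(labels.tolist())  # [0, 0, 0, -1]
```

Points 0 to 2 form one cluster and point 3 is labelled -1 (noise), because no other point lies within eps of it.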

Answer by Jamie Bull

I don't know what implementation of haversine you're using, but it looks like it returns results in km, so eps should be 0.2, not 2, for 200 m.

For the min_samples parameter, that depends on what your expected output is. Here are a couple of examples. My outputs use an implementation of haversine based on this answer, which gives a distance matrix similar, but not identical, to yours.

This is with db = DBSCAN(eps=0.2, min_samples=5):

[ 0 -1 -1 -1 1 1 1 -1 -1 1 1 1 2 2 1 1 1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 1 1 1 2 0 -1 1 2 2 0 0 0 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1]

This creates three clusters, 0, 1 and 2, and a lot of the samples don't fall into a cluster with at least 5 members and so are not assigned to a cluster (shown as -1).

Trying again with a smaller min_samples value:

db = DBSCAN(eps=0.2, min_samples=2)

[ 0 1 1 2 3 3 3 4 4 3 3 3 5 5 3 3 3 2 6 6 7 3 2 2 8 8 8 3 3 6 3 3 3 3 3 5 0 -1 3 5 5 0 0 0 6 -1 -1 3 3 3 7 -1 3 -1 -1 3]

Here most of the samples are within 200 m of at least one other sample and so fall into one of nine clusters, 0 to 8.
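
A quick way to check how many clusters and noise points a label array contains is to count the distinct labels, leaving out -1. The sketch below copies the labels from the min_samples=2 output above into a plain list; the variable names are only illustrative.

```python
# Labels copied from the DBSCAN(eps=0.2, min_samples=2) output above.
labels = [
    0, 1, 1, 2, 3, 3, 3, 4, 4, 3,
    3, 3, 5, 5, 3, 3, 3, 2, 6, 6,
    7, 3, 2, 2, 8, 8, 8, 3, 3, 6,
    3, 3, 3, 3, 3, 5, 0, -1, 3, 5,
    5, 0, 0, 0, 6, -1, -1, 3, 3, 3,
    7, -1, 3, -1, -1, 3,
]

# DBSCAN marks noise with -1, so it is excluded from the cluster count.
n_clusters = len(set(labels) - {-1})
n_noise = labels.count(-1)

print(n_clusters)  # 9
print(n_noise)     # 6
```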

Edited to add

It looks like @Anony-Mousse is right, though I didn't see anything wrong in my results. For the sake of contributing something, here's the code I was using to see the clusters:

from math import radians, cos, sin, asin, sqrt

from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

import matplotlib.pyplot as plt
import pandas as pd


def haversine(latlon1, latlon2):
    """
    Calculate the great circle distance in kilometers between two points
    on the earth, each given as (lat, lon) in decimal degrees
    """
    # unpack in (lat, lon) order, matching the dataframe's column order
    lat1, lon1 = latlon1
    lat2, lon2 = latlon2
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r


X = pd.read_csv('dbscan_test.csv')
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed')  # using "precomputed" as recommended by @Anony-Mousse
y_db = db.fit_predict(distance_matrix)

X['cluster'] = y_db

plt.scatter(X['lat'], X['lng'], c=X['cluster'])
plt.show()
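
If a precomputed matrix is still wanted, one alternative to calling a Python-level haversine function for every pair is scikit-learn's own haversine_distances helper, which is available in recent scikit-learn releases. This is only a sketch and not the code used for the outputs above. It assumes the input is ordered (lat, lon), and the constant name EARTH_RADIUS_KM is made up for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# First two rows of the question's dataframe, as (lat, lon) in decimal degrees.
coords_deg = np.array([
    [19.111841, 72.910729],
    [19.111342, 72.908387],
])

EARTH_RADIUS_KM = 6371.0  # same mean radius as in the haversine() function

# haversine_distances expects (lat, lon) in radians and returns angular
# distances in radians; multiplying by the radius converts them to km.
distance_matrix = haversine_distances(np.radians(coords_deg)) * EARTH_RADIUS_KM

print(distance_matrix.round(4))
```

The off-diagonal entry comes out at roughly 0.252 km, in line with the 0.2522482 shown in the question's distance matrix for the same two points.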

Answer by eos

You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.

db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))

This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps value is still 2 km, but it's divided by 6371 (the Earth's radius in km) to convert it to radians. Also, notice that .fit() takes the coordinates in radians for the haversine metric.
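
A self-contained sketch of the same call is shown below, using a handful of coordinates from the question's dataframe. It makes two assumptions that differ from the one-liner: eps is set to 0.2 km to match the 200 m target stated in the question, and min_samples=2 is chosen so that the tiny sample produces clusters at all.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A few (lat, lon) pairs in decimal degrees taken from the question's dataframe.
coordinates = np.array([
    [19.111841, 72.910729],
    [19.111860, 72.911346],  # roughly 65 m from the first point
    [19.119677, 72.905081],
    [19.119677, 72.905081],  # exact duplicate of the previous point
    [19.137815, 72.914085],  # more than 1 km from every other point
])

EARTH_RADIUS_KM = 6371.0
eps_km = 0.2  # assumed here: the 200 m neighbourhood from the question

# The haversine metric works on (lat, lon) in radians, so eps must also be
# an angle in radians: distance in km divided by the Earth's radius in km.
db = DBSCAN(eps=eps_km / EARTH_RADIUS_KM, min_samples=2,
            algorithm='ball_tree', metric='haversine')
labels = db.fit_predict(np.radians(coordinates))

print(labels.tolist())  # [0, 0, 1, 1, -1]
```

The first two points end up in one cluster, the duplicated point forms a second cluster, and the isolated last point is labelled as noise.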

Answer by big-o

@eos gives the best answer, I think. As well as making use of haversine distance (the most relevant distance measure in this case), it avoids the need to generate a precomputed distance matrix. If you create a distance matrix, you need to calculate the pairwise distances for every combination of points (although you can obviously save a bit of time by taking advantage of the fact that your distance metric is symmetric).

If you just supply DBSCAN with a distance measure and use the ball_tree algorithm, though, it can avoid the need to calculate every possible distance. This is because the ball tree algorithm can use the triangle inequality to reduce the number of candidates that need to be checked to find the nearest neighbours of a data point (this is the biggest job in DBSCAN).

The triangle inequality states:

|x+y| <= |x| + |y|

...so if a point p is a distance x from its neighbour n, and another point q is a distance y from p, then the distance from n to q is at least |x - y|. If |x - y| is greater than our nearest-neighbour radius, we know that q must be too far away from n to be considered a neighbour, so we don't need to calculate its distance.
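
The pruning idea can be illustrated with the lower bound that follows from the triangle inequality: the distance from n to q can never be smaller than |x - y|. The sketch below uses plain planar Euclidean distance for simplicity, which is an assumption (a ball tree applies the same reasoning to whichever metric it is built with), and the helper name can_skip is hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def can_skip(x, y, radius):
    """Return True if the triangle inequality alone proves that q lies
    farther than `radius` from n, given x = d(n, p) and y = d(p, q)."""
    # d(n, q) >= |x - y|, so a lower bound above the radius rules q out.
    return abs(x - y) > radius


n, p, q = (0.0, 0.0), (1.0, 0.0), (5.0, 0.0)
x = dist(n, p)  # 1.0
y = dist(p, q)  # 4.0
radius = 2.0

print(can_skip(x, y, radius))  # True: the bound |1 - 4| = 3 exceeds 2
print(dist(n, q) > radius)     # True: the actual distance 5 confirms it
```

When can_skip returns False the bound is inconclusive, and the actual distance still has to be computed.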

Read more about how ball trees work in the scikit-learn documentation.