
Warning: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18364026/

Date: 2020-08-19 10:33:46  Source: igfitidea

Clustering values by their proximity in python (machine learning?)

python, machine-learning, cluster-analysis, data-mining

Asked by PCoelho

I have an algorithm that is running on a set of objects. This algorithm produces a score value that dictates the differences between the elements in the set.


The sorted output is something like this:


[1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]


If you lay these values down on a spreadsheet you see that they make up groups


[1,1,5,6,1,5] [10,22,23,23] [50,51,51,52] [100,112,130] [500,512,600] [12000,12230]


Is there a way to programatically get those groupings?


Maybe some clustering algorithm using a machine learning library? Or am I overthinking this?


I've looked at scikit, but their examples are way too advanced for my problem...


Answer by David Robinson

You can use clustering to group these. The trick is to understand that there are two dimensions to your data: the dimension you can see, and the "spatial" dimension that looks like [1, 2, 3, ..., 22]. You can create this matrix in numpy like so:


import numpy as np

y = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
x = range(len(y))
m = np.column_stack((x, y)).astype(float)  # shape (22, 2): position, value

Then you can perform clustering on the matrix, with:


from scipy.cluster.vq import kmeans
kclust = kmeans(m, 5)

kclust's output will look like this:


(array([[   11,    51],
       [   15,   114],
       [   20, 12115],
       [    4,     9],
       [   18,   537]]), 21.545126372346271)

For you, the most interesting part is the first column of the matrix, which says what the centers are along that x dimension:


kclust[0][:, 0]
# [20 18 15  4 11]

You can then assign your points to a cluster based on which of the five centers they are closest to:


cluster_indices = kclust[0][:, 0]  # the cluster centers along the x dimension
assigned_clusters = [abs(cluster_indices - e).argmin() for e in x]
# [3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 2, 2, 2, 2, 1, 1, 0, 0, 0]

Answer by jabaldonedo

A good option if you don't know the number of clusters is MeanShift:


import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

x = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]

X = np.array(list(zip(x, np.zeros(len(x)))), dtype=int)  # add a dummy second dimension
bandwidth = estimate_bandwidth(X, quantile=0.1)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

for k in range(n_clusters_):
    my_members = labels == k
    print("cluster {0}: {1}".format(k, X[my_members, 0]))

Output for this algorithm:


cluster 0: [ 1  1  5  6  1  5 10 22 23 23 50 51 51 52]
cluster 1: [100 112 130]
cluster 2: [500 512]
cluster 3: [12000]
cluster 4: [12230]
cluster 5: [600]

By modifying the quantile parameter you can change the criterion used to select the number of clusters.


Answer by Has QUIT--Anony-Mousse

Don't use clustering for 1-dimensional data


Clustering algorithms are designed for multivariate data. When you have 1-dimensional data, sort it and look for the largest gaps. This is trivial and fast in 1d, and not possible in 2d. If you want something more advanced, use Kernel Density Estimation (KDE) and look for local minima to split the data set.

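The sort-and-split idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the answer; the function name split_by_gaps and the choice of six groups are assumptions:

```python
def split_by_gaps(values, n_groups):
    """Sort values, then cut at the (n_groups - 1) largest gaps."""
    vals = sorted(values)
    # Rank boundary indices by the size of the gap to their left, largest first
    gap_order = sorted(range(1, len(vals)),
                       key=lambda i: vals[i] - vals[i - 1],
                       reverse=True)
    cuts = sorted(gap_order[:n_groups - 1])
    # Slice the sorted list at the chosen cut points
    return [vals[a:b] for a, b in zip([0] + cuts, cuts + [len(vals)])]

data = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
print(split_by_gaps(data, 6))
# [[1, 1, 1, 5, 5, 6, 10, 22, 23, 23, 50, 51, 51, 52], [100, 112, 130],
#  [500, 512], [600], [12000], [12230]]
```

Note that with absolute gaps the five largest jumps all lie among the large values, so the small values collapse into one group (much like the MeanShift output above); using relative gaps, e.g. (vals[i] - vals[i - 1]) / vals[i - 1], changes which boundaries are chosen.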

There are a number of duplicates of this question on Stack Overflow.