Python sklearn KMeans: how to get the samples/points in each cluster
Original question: http://stackoverflow.com/questions/36195457/
Note: this content is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me): Stack Overflow
Asked by user77005
I am using KMeans from the sklearn.cluster package. Once the clustering is finished, how can I find out which values were grouped together?
Say I had 100 data points and KMeans gave me 5 clusters. Now I want to know which data points are in cluster 5. How can I do that?
Is there a function that takes a cluster id and lists all the data points in that cluster?
Thanks.
Answered by Praveen
I had a similar requirement, and I use pandas to create a new DataFrame with the index of the dataset and the cluster labels as columns.
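import pandas as pd
from sklearn.cluster import KMeans
data = pd.read_csv('filename')  # placeholder path; the columns must be numeric
km = KMeans(n_clusters=5).fit(data)
cluster_map = pd.DataFrame()
cluster_map['data_index'] = data.index.values
cluster_map['cluster'] = km.labels_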
Once the DataFrame is available, it is easy to filter. For example, to select all data points in cluster 3:
cluster_map[cluster_map.cluster == 3]
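For a self-contained version of the same idea, the sketch below uses a small synthetic DataFrame in place of the CSV file and groups the row indices by cluster label with groupby. The column names, the generated data, n_init and random_state are assumptions made only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for pd.read_csv('filename'): two well-separated blobs.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))]),
    columns=["x", "y"],  # hypothetical column names
)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

cluster_map = pd.DataFrame({"data_index": data.index.values, "cluster": km.labels_})

# One entry per cluster: the list of row indices assigned to it.
members = cluster_map.groupby("cluster")["data_index"].apply(list)
print(members)

# Rows of the original data that fall in cluster 0.
print(data.loc[members[0]])
```

Grouping once is convenient when every cluster is needed; the filter shown above is enough when only one cluster is of interest.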
Answered by Kevin
If you have a large dataset and need to extract clusters on demand, numpy.where is considerably faster than a plain list comprehension. Here is an example on the iris dataset:
from sklearn.cluster import KMeans
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()
X = iris.data  # feature matrix, shape (150, 4)
y = iris.target  # true species labels (not used by KMeans)
km = KMeans(n_clusters=3)
km.fit(X)
Define a function that returns the indices of the samples in the cluster_id you provide. (Two functions are given here for benchmarking; they both return the same values.)
def ClusterIndicesNumpy(clustNum, labels_array):  # numpy
    return np.where(labels_array == clustNum)[0]
def ClusterIndicesComp(clustNum, labels_array):  # list comprehension
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])
Let's say you want all samples that are in cluster 2:
ClusterIndicesNumpy(2, km.labels_)
array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])
NumPy wins the benchmark:
%timeit ClusterIndicesNumpy(2,km.labels_)
100000 loops, best of 3: 4 μs per loop
%timeit ClusterIndicesComp(2,km.labels_)
1000 loops, best of 3: 479 μs per loop
Now you can extract all of your cluster 2 data points like so:
X[ClusterIndicesNumpy(2,km.labels_)]
array([[ 6.9, 3.1, 4.9, 1.5],
[ 6.7, 3. , 5. , 1.7],
[ 6.3, 3.3, 6. , 2.5],
... #truncated
Double-check the first three indices from the truncated array above:
print(X[52], km.labels_[52])
print(X[77], km.labels_[77])
print(X[100], km.labels_[100])
[ 6.9 3.1 4.9 1.5] 2
[ 6.7 3. 5. 1.7] 2
[ 6.3 3.3 6. 2.5] 2
Answered by seralouk
To get the IDs of the points/samples/observations inside each cluster, you can build a dictionary that maps each cluster to the indices of its members.
Example using the Iris data and a dictionary comprehension:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(0)
# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X)  # KMeans is unsupervised, so the targets y are not needed
#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_
# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}
# Transform this dictionary into a list (if you need a list as the result)
dictlist = []
for key, value in mydict.items():  # .iteritems() exists only in Python 2
    temp = [key, value]
    dictlist.append(temp)
Results:
#dict format
{0: array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149]),
1: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
2: array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])}
# list format
[[0, array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
[1, array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
[2, array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
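If the data points themselves are needed rather than their indices, the same dictionary comprehension can index X directly. This is a minimal sketch that repeats the Iris setup so it runs on its own; n_init and random_state are added assumptions to make the run repeatable, and the names indices and points are illustrative:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data
clf = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Indices per cluster, as in the answer above.
indices = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}

# Data points per cluster: index X with each index array.
points = {i: X[idx] for i, idx in indices.items()}

for i, pts in points.items():
    print(i, pts.shape)
```

Each value in points is a 2-D array with one row per sample in that cluster, so points[2] holds the feature rows of cluster 2.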
Answered by john gonidelis
Actually, a very simple way to do this is:
clusters = KMeans(n_clusters=5).fit(df)  # fit first, otherwise labels_ does not exist
df[clusters.labels_ == 0]
The second line returns all the rows of df that belong to cluster 0. The members of the other clusters can be found the same way.
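A runnable version of this approach might look like the sketch below. It assumes df is the DataFrame that was clustered; here it is built from the Iris data purely for illustration, and n_init and random_state are added assumptions:

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit(df)

# Boolean mask: True for the rows assigned to cluster 0.
cluster_0 = df[clusters.labels_ == 0]
print(cluster_0.head())
```

The mask has one entry per row of df, so it only works if df has the same rows, in the same order, as the data passed to fit.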
Answered by Farseer
You can look at the labels_ attribute.
For example:
from sklearn.cluster import KMeans
km = KMeans(2)
km.fit([[1,2,3],[2,3,4],[5,6,7]])
print(km.labels_)
output: array([1, 1, 0], dtype=int32)
As you can see, the first and second points are in cluster 1 and the last point is in cluster 0. (The label numbering is arbitrary, so another run may swap the two ids.)
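Building on labels_, the following sketch counts how many points landed in each cluster and prints the members of each one. The variable names, n_init and random_state are assumptions added for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Number of points assigned to each cluster id.
sizes = np.bincount(km.labels_, minlength=km.n_clusters)
print(sizes)

# Points grouped by cluster id.
for cluster_id in range(km.n_clusters):
    print(cluster_id, points[km.labels_ == cluster_id].tolist())
```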
Answered by Sandeep Shahi
You can simply store the labels in an array and convert the array to a data frame. Then merge the data that you used to fit KMeans with the new data frame of cluster labels.
Display the data frame. You should now see each row with its corresponding cluster. If you want to list all the data in a specific cluster, use something like data.loc[data['cluster_label_name'] == 2], taking 2 as the cluster of interest.
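The steps described above can be sketched as follows. The input data, the feature column names and the use of DataFrame.join for the merge are assumptions for illustration; cluster_label_name is the placeholder column name from the answer:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical input data; replace with the data you clustered.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(30, 2)), columns=["f1", "f2"])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# 1. Store the labels and convert them to a data frame that shares data's index.
labels_df = pd.DataFrame({"cluster_label_name": km.labels_}, index=data.index)

# 2. Merge the original data with the labels (aligned on the index).
merged = data.join(labels_df)

# 3. List all rows that belong to one cluster, here cluster 2.
print(merged.loc[merged["cluster_label_name"] == 2])
```

Giving labels_df the same index as data keeps the labels aligned with their rows even when the index is not a simple 0..n-1 range.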

