Python Scikit Learn - K-Means - Elbow - Criterion

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/19197715/ on StackOverflow.

Scikit Learn - K-Means - Elbow - criterion

python, machine-learning, scikit-learn, cluster-analysis, k-means

Asked by Linda

Today I'm trying to learn something about K-means. I have understood the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the right k, but I do not understand how to use it with scikit-learn?! In scikit-learn I'm clustering things this way:

kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10) 
kmeans.fit(data)

So should I do this several times for n_clusters = 1...n and watch the error rate to get the right k? I think this would be stupid and would take a lot of time?!

Accepted answer by Has QUIT--Anony-Mousse

The elbow criterion is a visual method. I have not yet seen a robust mathematical definition of it. But k-means is a pretty crude heuristic, too.

So yes, you will need to run k-means with k=1...kmax, then plot the resulting SSQ and decide upon an "optimal" k.

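A minimal sketch of that loop, using KMeans.inertia_ as the SSQ (the random data and kmax=10 here are placeholders):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)  # placeholder data
ssq = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    ssq.append(km.inertia_)  # within-cluster sum of squares for this k
# plot range(1, 11) against ssq and pick the "elbow" by eye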

There exist advanced versions of k-means such as X-means that will start with k=2 and then increase it until a secondary criterion (AIC/BIC) no longer improves. Bisecting k-means is an approach that also starts with k=2 and then repeatedly splits clusters until k=kmax. You could probably extract the interim SSQs from it.

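For reference, recent scikit-learn releases (1.1 and later) ship a BisectingKMeans estimator; a minimal sketch, assuming such a version is installed (it exposes only the final inertia, not the interim SSQs):

import numpy as np
from sklearn.cluster import BisectingKMeans

X = np.random.rand(100, 2)  # placeholder data
bkm = BisectingKMeans(n_clusters=8, random_state=0).fit(X)
print(bkm.inertia_)  # SSQ at the final k; rerun with a smaller n_clusters for other values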

Either way, I have the impression that in any actual use case where k-means is really good, you do actually know the k you need beforehand. In these cases, k-means is actually not so much a "clustering" algorithm, but a vector quantization algorithm. E.g. reducing the number of colors of an image to k (where you would often choose k to be e.g. 32, because that is then 5 bits of color depth and can be stored in a bit-compressed way). Or e.g. in bag-of-visual-words approaches, where you would choose the vocabulary size manually. A popular value seems to be k=1000. You then don't really care much about the quality of the "clusters", but the main point is to be able to reduce an image to a 1000-dimensional sparse vector. The performance of a 900-dimensional or an 1100-dimensional representation will not be substantially different.

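A sketch of the color-reduction use case (the sample image and k=32 are illustrative; fitting on a random subsample of the pixels would be faster):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

img = load_sample_image("china.jpg")  # (height, width, 3) uint8 image
pixels = img.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=32, n_init=10, random_state=0).fit(pixels)
# replace every pixel by its cluster center: 32 colors = 5 bits of color depth
quantized = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)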

For actual clustering tasks, i.e. when you want to analyze the resulting clusters manually, people usually use more advanced methods than k-means. K-means is more of a data simplification technique.

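One example of such a method (my illustration; the answer does not name one) is DBSCAN, which does not require choosing k at all:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)  # placeholder data
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
# label -1 marks noise; the number of clusters falls out of the data
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)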

Answered by Om Prakash

If the true label is not known in advance (as in your case), then K-Means clustering can be evaluated using either the Elbow Criterion or the Silhouette Coefficient.

Elbow Criterion Method:

The idea behind the elbow method is to run k-means clustering on a given dataset for a range of values of k (num_clusters, e.g. k=1 to 10) and, for each value of k, calculate the sum of squared errors (SSE).

After that, plot a line graph of the SSE for each value of k. If the line graph looks like an arm, the "elbow" on the arm (marked with a red circle in the plot below) is the value of the optimal k (the number of clusters). Here, we want to minimize the SSE. The SSE tends to decrease toward 0 as we increase k (and the SSE is 0 when k is equal to the number of data points in the dataset, because then each data point is its own cluster, and there is no error between it and the center of its cluster).

So the goal is to choose a small value of k that still has a low SSE, and the elbow usually represents where we start to have diminishing returns by increasing k.

Let's consider the iris dataset:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris['feature_names'])
data = X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']].copy()

sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000, n_init=10).fit(data)
    # (cluster assignments are available via kmeans.labels_ if needed;
    # do not write them back into `data`, or they become a feature on the next fit)
    # Inertia: sum of squared distances of samples to their closest cluster center
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

Plot for the above code: [line plot of SSE versus number of clusters; the elbow at k=3 is circled in red]

We can see from the plot that 3 is the optimal number of clusters (circled in red) for the iris dataset, which is indeed correct.

Silhouette Coefficient Method:

From the sklearn documentation:

A higher Silhouette Coefficient score relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

a: The mean distance between a sample and all other points in the same class.

b: The mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient for a single sample is then given as:

s = (b - a) / max(a, b)

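A tiny sketch that checks this formula against scikit-learn's silhouette_samples on made-up 1-D points (the data and labels are illustrative):

import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0], [0.2], [3.0], [3.2]])
labels = np.array([0, 0, 1, 1])
# for the first sample: a = mean distance to its own cluster,
# b = mean distance to the nearest other cluster
a = 0.2                  # distance to the only other point in cluster 0
b = np.mean([3.0, 3.2])  # distances to the two points in cluster 1
s_manual = (b - a) / max(a, b)
print(s_manual, silhouette_samples(X, labels)[0])  # the two values agree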

Now, to find the optimal value of k for KMeans, loop through 2..n for n_clusters in KMeans and calculate the Silhouette Coefficient for each value of k.

A higher Silhouette Coefficient indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

from sklearn.metrics import silhouette_score
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
y = load_iris().target

for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(X)
    label = kmeans.labels_
    sil_coeff = silhouette_score(X, label, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, sil_coeff))

Output -

For n_clusters=2, The Silhouette Coefficient is 0.680813620271
For n_clusters=3, The Silhouette Coefficient is 0.552591944521
For n_clusters=4, The Silhouette Coefficient is 0.496992849949
For n_clusters=5, The Silhouette Coefficient is 0.488517550854
For n_clusters=6, The Silhouette Coefficient is 0.370380309351
For n_clusters=7, The Silhouette Coefficient is 0.356303270516
For n_clusters=8, The Silhouette Coefficient is 0.365164535737
For n_clusters=9, The Silhouette Coefficient is 0.346583642095
For n_clusters=10, The Silhouette Coefficient is 0.328266088778

As we can see, n_clusters=2 has the highest Silhouette Coefficient. This means that 2 should be the optimal number of clusters, right?

But here's the catch.

The iris dataset has 3 species of flower, which contradicts 2 as the optimal number of clusters. So despite n_clusters=2 having the highest Silhouette Coefficient, we would consider n_clusters=3 the optimal number of clusters due to:

  1. The iris dataset has 3 species. (Most important.)
  2. n_clusters=3 has the 2nd-highest value of the Silhouette Coefficient.

So choosing n_clusters=3 is the optimal number of clusters for the iris dataset.

Choosing the optimal number of clusters will depend on the type of dataset and the problem we are trying to solve. But in most cases, taking the highest Silhouette Coefficient will yield an optimal number of clusters.

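If you want to automate that rule of thumb, here is a minimal sketch (reusing X, KMeans, and silhouette_score from the snippet above) that keeps the scores in a dict and picks the k with the highest average silhouette:

scores = {}
for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(X)
    scores[n_cluster] = silhouette_score(X, kmeans.labels_, metric='euclidean')
best_k = max(scores, key=scores.get)  # k with the highest average silhouette
print("Best k by silhouette:", best_k)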

Hope it helps!

Answered by Karan Maheshwari

This answer is inspired by what Om Prakash has written. It contains code to plot both the SSE and the Silhouette Score. What I've given is a general code snippet you can follow in all cases of unsupervised learning where you don't have labels and want to know the optimal number of clusters. There are 2 criteria: 1) the sum of squared errors (SSE) and 2) the Silhouette Score. You can follow Om Prakash's answer for the explanation; he's done a good job at that.

Assume your dataset is a data frame df1. Here I have used a different dataset, just to show how we can use both criteria to help decide the optimal number of clusters. Here I think 6 is the correct number of clusters. Then:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
elbow = []
ss = []
for n_clusters in range_n_clusters:
    # iterating through cluster sizes
    clusterer = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = clusterer.fit_predict(df1)
    # finding the average silhouette score
    silhouette_avg = silhouette_score(df1, cluster_labels)
    ss.append(silhouette_avg)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
    # finding the SSE
    elbow.append(clusterer.inertia_)  # Inertia: sum of squared distances of samples to their closest cluster center
fig = plt.figure(figsize=(14, 7))
fig.add_subplot(121)
plt.plot(range_n_clusters, elbow, 'b-', label='Sum of squared error')
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.legend()
fig.add_subplot(122)
plt.plot(range_n_clusters, ss, 'b-', label='Silhouette Score')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.legend()
plt.show()

Graphs comparing both criteria to help us find the optimal cluster value: [side-by-side line plots of SSE and Silhouette Score versus number of clusters]
