Python: Understanding the "score" returned by scikit-learn KMeans
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/32370543/
Understanding "score" returned by scikit-learn KMeans
Asked by Prateek Dewan
I applied clustering to a set of text documents (about 100). I converted them to TF-IDF vectors using TfidfVectorizer and supplied the vectors as input to sklearn.cluster.KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10). Now when I run
model.fit()
print model.score()
on my vectors, I get a very small value if all the text documents are very similar, and I get a very large negative value if the documents are very different.
It serves my basic purpose of finding which sets of documents are similar, but can someone help me understand what exactly this model.score() value signifies for a fit? How can I use this value to justify my findings?
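To make the setup concrete, here is a minimal sketch of the pipeline the question describes, using a few made-up toy documents in place of the asker's corpus of ~100 documents:

```python
# A minimal sketch of the asker's setup (the documents are invented for
# illustration; the KMeans parameters match those given in the question).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

X = TfidfVectorizer().fit_transform(docs)
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10)
model.fit(X)

# score() returns the negative of the within-cluster sum of squared
# distances, so it is always <= 0; closer to 0 means tighter clusters.
print(model.score(X))
```

Running this prints a non-positive number whose magnitude grows as the documents spread further from their cluster centers.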
Answered by ypnos
In the documentation it says:
Returns:
score : float
Opposite of the value of X on the K-means objective.
To understand what that means you need to have a look at the k-means algorithm. What k-means essentially does is find cluster centers that minimize the sum of distances between data samples and their associated cluster centers.
It is a two-step process, where (a) each data sample is associated to its closest cluster center, (b) cluster centers are adjusted to lie at the center of all samples associated to them. These steps are repeated until a criterion (max iterations / min change between last two iterations) is met.
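The two alternating steps above can be sketched in plain NumPy (an illustrative toy implementation on made-up 2-D data, not scikit-learn's optimized one):

```python
# Bare-bones Lloyd's algorithm illustrating the two-step process described
# above (toy implementation; data points are invented for illustration).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # (a) associate each data sample with its closest cluster center
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # (b) adjust each center to the mean of the samples assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # stopping criterion: centers no longer move
        centers = new_centers
    return centers, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
centers, labels = kmeans(X, k=2)
print(labels)
```

On this data the two tight pairs of points end up in separate clusters regardless of which points are picked as initial centers.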
As you can see, a distance remains between the data samples and their associated cluster centers, and the objective of our minimization is that distance (the sum of all distances).
You naturally get large distances if your data samples vary a lot and the number of samples is significantly higher than the number of clusters, which in your case is only two. Conversely, if all data samples were identical, you would always get a zero distance regardless of the number of clusters.
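A small experiment illustrates both extremes (the data is made up for illustration; scikit-learn may warn about duplicate cluster centers in the identical case):

```python
# Toy check of the point above: identical samples give a zero objective no
# matter how many clusters are requested, while varied samples do not.
import numpy as np
from sklearn.cluster import KMeans

identical = np.zeros((10, 3))                          # ten copies of one point
varied = np.random.default_rng(0).normal(size=(10, 3))

km_same = KMeans(n_clusters=2, n_init=10).fit(identical)
km_var = KMeans(n_clusters=2, n_init=10).fit(varied)

print(km_same.score(identical))  # zero distance: no spread at all
print(km_var.score(varied))      # negative: samples spread around their centers
```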
From the documentation I would expect that all values are negative, though. If you observe both negative and positive values, maybe there is more to the score than that.
I wonder how you got the idea of clustering into two clusters, though.
Answered by Mark Yang
ypnos is right, you can find some detail here: https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/cluster/k_means_.py#L893
inertia : float
    Sum of distances of samples to their closest cluster center.
Answered by Tarun Kumar Yellapu
The wording chosen by the documentation is a bit confusing. It says "Opposite of the value of X on the K-means objective." This means the negative of the K-means objective.
K-Means Objective
The objective of k-means is to minimize the sum of squared distances between points and their respective cluster centroids. It goes by other names, such as the J-squared error function, the J-score, or the within-cluster sum of squares. This value indicates how internally coherent the clusters are (the lower, the better).
The value of the objective function can be obtained directly from the following attribute:
model.inertia_
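The relationship the answers describe can be verified directly on toy data (random points invented for illustration): score() on the training data is simply the negated inertia_.

```python
# Quick numerical check: model.score(X) == -model.inertia_ on the training
# data (toy random data, invented for illustration).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(42).normal(size=(50, 4))
model = KMeans(n_clusters=3, n_init=10).fit(X)

print(model.inertia_)   # within-cluster sum of squared distances
print(model.score(X))   # the same value, negated
```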