Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/51382250/

Date: 2020-09-14 05:48:56  Source: igfitidea

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score

Tags: python, pandas, machine-learning, scikit-learn, k-means

Asked by Suhail Gupta

I am trying to calculate the silhouette score to find the optimal number of clusters to create, but I get an error that says:

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

I am unable to understand the reason for this. Here is the code that I am using to cluster and calculate the silhouette score.

I read the csv that contains the text to be clustered and run K-Means on the n cluster values. What could be the reason I am getting this error?

# Create cluster using K-Means
# Only creates graph
import matplotlib
#matplotlib.use('Agg')
import re
import os
import nltk, math, codecs
import csv
from nltk.corpus import stopwords
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import silhouette_score

model_name = checkpoint_save_path
loaded_model = Doc2Vec.load(model_name)

#Load the test csv file
data = pd.read_csv(test_filename)
overview = data['overview'].astype('str').tolist()
overview = filter(bool, overview)
vectors = []

def split_words(text):
  return ''.join([x if x.isalnum() or x.isspace() else " " for x in text ]).split()

def preprocess_document(text):
  sp_words = split_words(text)
  return sp_words

for i, t in enumerate(overview):
  vectors.append(loaded_model.infer_vector(preprocess_document(t)))

sse = {}
silhouette = {}


for k in range(1,15):
  km = KMeans(n_clusters=k, max_iter=1000, verbose = 0).fit(vectors)
  sse[k] = km.inertia_
  #FOLLOWING LINE CAUSES ERROR
  silhouette[k] = silhouette_score(vectors, km.labels_, metric='euclidean')

best_cluster_size = 1
min_error = float("inf")

for cluster_size in sse:
    if sse[cluster_size] < min_error:
        min_error = sse[cluster_size]
        best_cluster_size = cluster_size

print(sse)
print("====")
print(silhouette)

Answered by seralouk

The error is produced because you loop over different numbers of clusters n. During the first iteration, n_clusters is 1, which leads to all(km.labels_ == 0) being True.

In other words, you have only one cluster, with label 0 (thus, np.unique(km.labels_) prints array([0], dtype=int32)).



silhouette_score requires more than 1 cluster label. This causes the error. The error message is clear.
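
To see why a single label cannot work, it may help to recall the standard definition of the silhouette coefficient for a sample i. The symbols a(i) and b(i) below are the usual textbook notation, not names taken from the code in this thread:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

Here a(i) is the mean distance from sample i to the other members of its own cluster, and b(i) is the mean distance from i to the members of the nearest other cluster. With only one cluster there is no other cluster, so b(i), and therefore s(i), is undefined.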



Example:

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

km = KMeans(n_clusters=3)
km.fit(X, y)  # y is ignored by KMeans; kept as in the original call

# check how many unique labels you have
np.unique(km.labels_)
# array([0, 1, 2], dtype=int32)

We have 3 different clusters/cluster labels.

silhouette_score(X, km.labels_, metric='euclidean')
# 0.38788915189699597

The function works fine.



Now, let's cause the error:

km2 = KMeans(n_clusters=1)
km2.fit(X,y)

silhouette_score(X, km2.labels_, metric='euclidean')
ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

Answered by Yuan

From the documentation,

Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1
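
The rule quoted above can be checked before calling silhouette_score. The following is a minimal sketch; the helper name has_valid_label_count is hypothetical (it is not part of scikit-learn) and simply mirrors the documented condition:

```python
def has_valid_label_count(labels):
    """Return True if 2 <= n_labels <= n_samples - 1 (hypothetical helper)."""
    n_samples = len(labels)
    n_labels = len(set(labels))  # number of distinct cluster labels
    return 2 <= n_labels <= n_samples - 1


print(has_valid_label_count([0, 0, 0, 0]))  # one label, as with n_clusters=1
print(has_valid_label_count([0, 1, 0, 1]))  # two labels for four samples
print(has_valid_label_count([0, 1, 2, 3]))  # as many labels as samples
```

The three calls print False, True and False. For a NumPy array such as km.labels_, passing km.labels_.tolist() keeps the sketch to plain Python types.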

So one way to solve this problem is to start the iteration from k = 2: instead of using for k in range(1,15), use for k in range(2,15). That works for me.
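
Applied to a loop like the one in the question, the fix might look like the sketch below. The data here is synthetic (three well-separated blobs from make_blobs, chosen only so the example is self-contained), and picking the k with the highest silhouette score is one common convention, not something stated in the answers above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the document vectors (assumption for illustration)
X, _ = make_blobs(
    n_samples=150,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=0,
)

sse = {}
silhouette = {}

# Start at k = 2: silhouette_score is undefined for a single cluster
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_
    silhouette[k] = silhouette_score(X, km.labels_, metric='euclidean')

# Higher silhouette is better, so take the k with the maximum score
best_k = max(silhouette, key=silhouette.get)
print(silhouette)
print("best k by silhouette:", best_k)
```

If the SSE for k = 1 is still wanted, for example for an elbow plot, it can be computed in a separate step that does not call silhouette_score.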