pandas ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score

Question

Asked by Suhail Gupta

I am trying to calculate the silhouette score while searching for the optimal number of clusters to create, but I get an error that says:

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

I am unable to understand the reason for this. Here is the code that I am using to cluster and calculate the silhouette score.

I read the CSV that contains the text to be clustered and run K-Means for each of the n_clusters values. What could be the reason I am getting this error?

#Create cluster using K-Means
#Only creates graph
import matplotlib
#matplotlib.use('Agg')
import re
import os
import nltk, math, codecs
import csv
from nltk.corpus import stopwords
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import silhouette_score

model_name = checkpoint_save_path
loaded_model = Doc2Vec.load(model_name)

#Load the test csv file
data = pd.read_csv(test_filename)
overview = data['overview'].astype('str').tolist()
overview = filter(bool, overview)
vectors = []

def split_words(text):
  return ''.join([x if x.isalnum() or x.isspace() else " " for x in text ]).split()

def preprocess_document(text):
  sp_words = split_words(text)
  return sp_words

for i, t in enumerate(overview):
  vectors.append(loaded_model.infer_vector(preprocess_document(t)))

sse = {}
silhouette = {}


for k in range(1,15):
  km = KMeans(n_clusters=k, max_iter=1000, verbose = 0).fit(vectors)
  sse[k] = km.inertia_
  #FOLLOWING LINE CAUSES ERROR
  silhouette[k] = silhouette_score(vectors, km.labels_, metric='euclidean')

best_cluster_size = 1
min_error = float("inf")

for cluster_size in sse:
    if sse[cluster_size] < min_error:
        min_error = sse[cluster_size]
        best_cluster_size = cluster_size

print(sse)
print("====")
print(silhouette)

Answer 1

Answered by seralouk

The error is produced because you loop over different numbers of clusters. During the first iteration, n_clusters is 1, which makes all(km.labels_ == 0) evaluate to True.

In other words, you have only one cluster, with label 0 (thus, np.unique(km.labels_) prints array([0], dtype=int32)).

`silhouette_score` requires at least 2 distinct cluster labels, so calling it with a single label raises the error, exactly as the error message states.

Example:

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

km = KMeans(n_clusters=3)
km.fit(X,y)

# check how many unique labels you have
np.unique(km.labels_)
#array([0, 1, 2], dtype=int32)

We have 3 different clusters/cluster labels.

silhouette_score(X, km.labels_, metric='euclidean')
0.38788915189699597

The function works fine.

Now, let's cause the error:

km2 = KMeans(n_clusters=1)
km2.fit(X,y)

silhouette_score(X, km2.labels_, metric='euclidean')

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
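
Building on the point that silhouette_score needs at least two distinct labels, the check can be made explicit before calling it. The following is only a minimal sketch: safe_silhouette is a hypothetical helper name (not part of scikit-learn), and n_init=10 and random_state=0 are assumptions added for reproducibility, not settings from the original answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score


def safe_silhouette(X, labels, metric="euclidean"):
    """Return the silhouette score, or None when it is undefined.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    n_labels = len(np.unique(labels))
    n_samples = len(labels)
    # The score is only defined for 2 <= n_labels <= n_samples - 1.
    if n_labels < 2 or n_labels > n_samples - 1:
        return None
    return silhouette_score(X, labels, metric=metric)


X = load_iris().data

# n_init and random_state are assumptions chosen for reproducibility.
one_cluster = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
three_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

score_k1 = safe_silhouette(X, one_cluster.labels_)    # single label: undefined
score_k3 = safe_silhouette(X, three_clusters.labels_)  # three labels: defined

print(score_k1)
print(score_k3)
```

With a guard like this, a loop that includes k = 1 can skip the undefined case instead of raising.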

Answer 2

Answered by Yuan

From the documentation,

Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1

So one way to solve this problem is, instead of using for k in range(1,15), to start the iteration from k = 2, that is, for k in range(2,15). That works for me.
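
As a rough illustration of that change, the loop from the question could look like the sketch below. It makes several assumptions: the Iris data stands in for the Doc2Vec vectors, which are not available here; n_init=10 and random_state=0 are added only for reproducibility; and picking the k with the highest silhouette score is one common heuristic, not something the original answer prescribes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Stand-in data (assumption); the question uses inferred Doc2Vec vectors.
X = load_iris().data

sse = {}
silhouette = {}

# Start at k = 2: the silhouette score is undefined for a single cluster.
for k in range(2, 15):
    km = KMeans(n_clusters=k, max_iter=1000, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_
    silhouette[k] = silhouette_score(X, km.labels_, metric="euclidean")

# One possible selection rule (assumption): highest silhouette score wins.
best_k = max(silhouette, key=silhouette.get)

print(sse)
print("====")
print(silhouette)
print("best k by silhouette:", best_k)
```

If the SSE for k = 1 is still wanted, for example for an elbow plot, it can be computed in a separate step that does not call silhouette_score.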
