Note: this page reproduces a popular StackOverflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/28114630/

Pandas + scikit-learn K-means not working properly - treats all of dataframe rows as one big multi-dimensional example

Tags: python, pandas, scikit-learn

Asked by Maksim Khaitovich

I am trying to do some k-means clustering on data stored in a pandas DataFrame (actually in one of its columns). The odd thing is that instead of treating each row as a separate example, it treats all rows as a single example with a very high dimension. For example:

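import numpy as np
import pandas as pd

# Raw string so backslashes in the Windows path (e.g. "\t" in "\train.csv") are not read as escapes
df = pd.read_csv(r'D:\Apps\DataSciense\Kaggle Challenges\Titanic\Source Data\train.csv', header=0)

# Median age for each (Gender, Pclass) combination; assumes df.Gender already holds 0/1 codes
median_ages = np.zeros((2, 3))

for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i, j] = df[(df.Gender == i) & (df.Pclass == j + 1)].Age.dropna().median()

# Copy Age, then fill the missing values with the matching group median
df['AgeFill'] = df['Age']

for i in range(0, 2):
    for j in range(0, 3):
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j + 1), 'AgeFill'] = median_ages[i, j]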
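
The nested loops above fill each missing age with the median of its (Gender, Pclass) group. As a minimal sketch of the same idea, a grouped transform can do this in one step. The toy frame below is made up for illustration and is not Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame; values are made up.
df = pd.DataFrame({
    "Gender": [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# Median age per (Gender, Pclass) group, aligned row by row with df.
# The median skips NaN values by default.
group_median = df.groupby(["Gender", "Pclass"])["Age"].transform("median")

# Keep known ages and fill only the missing ones with their group median.
df["AgeFill"] = df["Age"].fillna(group_median)

print(df["AgeFill"].tolist())
```

This avoids hard-coding the number of genders and classes, at the cost of being a little less explicit than the loops.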

Then I check that it looks fine:

df.AgeFill

Name: AgeFill, Length: 891, dtype: float64

Looks OK: 891 float64 numbers. Then I do the clustering:

from sklearn import cluster

k_means = cluster.KMeans(n_clusters=1, init='random')
k_means.fit(df.AgeFill)

And I check for cluster centers:

k_means.cluster_centers_

It returns one giant array.

Furthermore:

k_means.labels_

Gives me:

array([0])

What am I doing wrong? Why does it think I have one example with 891 dimensions instead of 891 examples?

Just to illustrate it better, if I try 2 clusters:

k_means = cluster.KMeans(n_clusters=2, init='random')
k_means.fit(df.AgeFill)

Traceback (most recent call last):
  File "", line 1, in
    k_means.fit(df.AgeFill)
  File "D:\Apps\Python\lib\site-packages\sklearn\cluster\k_means_.py", line 724, in fit
    X = self._check_fit_data(X)
  File "D:\Apps\Python\lib\site-packages\sklearn\cluster\k_means_.py", line 693, in _check_fit_data
    X.shape[0], self.n_clusters))
ValueError: n_samples=1 should be >= n_clusters=2

So you can see that it really does treat the whole column as one giant sample.

But:

df.AgeFill.shape
(891,)

Answered by elyase

You are passing a 1D array, while scikit-learn expects a 2D array with a samples axis and a features axis. This should do it:

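# Reshape the underlying NumPy array; Series.reshape is not available in recent pandas versions
k_means.fit(df.AgeFill.values.reshape(-1, 1))

Before:

>>> df.AgeFill.shape
(891,)

After:

>>> df.AgeFill.values.reshape(-1, 1).shape
(891, 1)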
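
To see the shape issue end to end, here is a small sketch on synthetic ages. The data and the `n_init` and `random_state` values are assumptions chosen for illustration, not part of the original question. The column is converted to a 2D array of shape (n_samples, 1) before fitting, so every row becomes its own sample:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for df.AgeFill: two well-separated age groups (made-up values).
ages = pd.Series([2.0, 3.0, 4.0, 60.0, 61.0, 62.0], name="AgeFill")
print(ages.shape)                      # 1-D: (6,)

# Option 1: reshape the underlying NumPy array to (n_samples, 1).
X = ages.to_numpy().reshape(-1, 1)
# Option 2: select with a list of column names, which yields a 2-D DataFrame.
X_alt = ages.to_frame()[["AgeFill"]]
print(X.shape, X_alt.shape)

k_means = KMeans(n_clusters=2, n_init=10, random_state=0)
k_means.fit(X)

print(k_means.labels_.shape)           # one label per row
print(k_means.cluster_centers_.shape)  # (n_clusters, n_features)
```

With a real DataFrame, selecting `df[['AgeFill']]` (double brackets) gives the same 2D layout without an explicit reshape.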