pandas: Using k-means, I got an error: array with 0 features

Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/31335602/

python, pandas, matplotlib, scikit-learn, k-means

Asked by Suzuki Soma

I am trying to cluster my csv data, using matplotlib and k-means.

My csv data is about energy consumption. https://github.com/camenergydatalab/EnergyDataSimulationChallenge/blob/master/challenge2/data/total_watt.csv

I want to cluster the values per day into 3 groups: low, medium, and high energy consumption.

This is my code.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import pandas as pd
from sklearn.cluster import KMeans



MY_FILE='total_watt.csv'
date = []
consumption = []


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0])
df = df.resample('1D', how='sum')


for row in df:
    if len(row) ==2 :
        date.append(row[0])
        consumption.append(row[1])


import datetime
for x in range(len(date)):
    date[x]=datetime.datetime.strptime(date[x], '%Y-%m-%d %H:%M:%S')

X = np.array([date, consumption])
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

colors = ["b.","g.","r."]

for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)

plt.show()

But when I ran this code, I got the following error:

(DataVizProj)Soma-Suzuki:Soma Suzuki$ python 4.clusters.py
Traceback (most recent call last):
  File "4.clusters.py", line 31, in <module>
    kmeans.fit(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 785, in fit
    X = self._check_fit_data(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 755, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=np.float64)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/utils/validation.py", line 367, in check_array
    % (n_features, shape_repr, ensure_min_features))
ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required.

How can I cluster my csv data properly?

EDIT-----------------------------------------------------

This is my new code. Thank you!

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import pandas as pd
from sklearn.cluster import KMeans



MY_FILE='total_watt.csv'
date = []
consumption = []


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0])
df = df.resample('1D', how='sum')
df = df.dropna()

date = df.index.tolist()
consumption = df[df.columns[0]].values



X = np.array([date, consumption])
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

colors = ["b.","g.","r."]

for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)

plt.show()

And the new error:

(DataVizProj)Soma-Suzuki:Soma Suzuki$ python 4.clusters.py
Traceback (most recent call last):
  File "4.clusters.py", line 26, in <module>
    kmeans.fit(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 785, in fit
    X = self._check_fit_data(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 755, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=np.float64)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/utils/validation.py", line 344, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number

EDITED2-----------------------------------------

Thank you Jianxun!!

I finally succeeded in clustering my csv data!! Thank you so much!!

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import pandas as pd
from sklearn.cluster import KMeans



MY_FILE='total_watt.csv'
date = []
consumption = []


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0])
df = df.resample('1D', how='sum')
df = df.dropna()

date = df.index.tolist()
date = [x.strftime('%Y-%m-%d') for x in date]
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
date_numeric = encoder.fit_transform(date)
consumption = df[df.columns[0]].values

X = np.array([date_numeric, consumption]).T




kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

colors = ["b.","r.","g."]

for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)

plt.show()

But as you can see, the x-axis does not reflect time, although we set it properly.

Answered by Jianxun Li

First problem:

for row in df:
    if len(row) ==2 :
        date.append(row[0])
        consumption.append(row[1])

This leaves the lists date and consumption unexpectedly empty, because for row in df actually loops over the column labels rather than the rows. That is precisely why the error message says the array has no features.
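
As a rough illustration of that point, the sketch below builds a small stand-in DataFrame (the column name consumption and the dates are assumptions, not taken from the real csv) and shows what iteration yields. It also shows iterrows as one possible way to walk the rows, although the index/values approach further down is simpler.

```python
import pandas as pd

# Hypothetical stand-in for the resampled data: a date index and one column.
df = pd.DataFrame(
    {"consumption": [10.0, 20.0, 30.0]},
    index=pd.to_datetime(["2011-04-18", "2011-04-19", "2011-04-20"]),
)

# Iterating over a DataFrame yields its column labels, not its rows.
labels_seen = [row for row in df]
print(labels_seen)

# The 'if len(row) == 2' test is therefore applied to a label string, so it
# only passes if a column label happens to be two characters long.

# To walk the rows instead, iterate over (index, row) pairs.
date, consumption = [], []
for timestamp, row in df.iterrows():
    date.append(timestamp)
    consumption.append(row["consumption"])
print(len(date), len(consumption))
```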

Also, there are two NaN values in consumption, so you need to call df = df.dropna() (or impute the missing values), because sklearn does not tolerate NaN.
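
A minimal sketch of the two options, using made-up numbers rather than the real data. Filling with the column mean is only one possible imputation, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily totals with two missing days.
df = pd.DataFrame(
    {"consumption": [10.0, np.nan, 30.0, np.nan, 50.0]},
    index=pd.date_range("2011-04-18", periods=5, freq="D"),
)

# Option 1: drop the days that have no value.
dropped = df.dropna()

# Option 2: impute, here with the mean of the observed values (an assumption).
imputed = df.fillna(df["consumption"].mean())

print(len(dropped))
print(imputed["consumption"].tolist())
```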

To get data from your dataframe, you can write something like this:

date = df.index.tolist()
consumption = df[df.columns[0]].values

Next, you've already parsed the dates in pd.read_csv, so the index already holds timestamps rather than strings. The following part of your code is therefore unnecessary and will not work at all, because strptime expects a string.

import datetime
for x in range(len(date)):
    date[x]=datetime.datetime.strptime(date[x], '%Y-%m-%d %H:%M:%S')
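
To see why, the sketch below uses a small hand-made index (the dates are assumptions) in place of the one produced by parse_dates. Its elements are already datetime objects, so passing one to strptime raises a TypeError.

```python
import datetime
import pandas as pd

# Stand-in for the index that parse_dates=[0], index_col=[0] produces.
index = pd.to_datetime(["2011-04-18 00:00:00", "2011-04-19 00:00:00"])
first = index[0]

# A pandas Timestamp is already a datetime.datetime subclass.
print(isinstance(first, datetime.datetime))

# strptime only accepts strings, so calling it on a Timestamp fails.
try:
    datetime.datetime.strptime(first, "%Y-%m-%d %H:%M:%S")
    strptime_failed = False
except TypeError:
    strptime_failed = True
print(strptime_failed)
```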

Finally, just feeding the raw date with consumption into KMeans won't produce very useful results. You should consider converting date into numeric data, for example, dummies for dayofweek.
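
One way the dayofweek idea might look is sketched below. The data, the column name consumption, and the dow prefix are all made up for illustration. Whether these features give meaningful clusters depends on the data, and the consumption column may need scaling relative to the 0/1 dummies.

```python
import numpy as np
import pandas as pd

# Hypothetical week of daily totals.
df = pd.DataFrame(
    {"consumption": [10.0, 12.0, 30.0, 31.0, 55.0, 60.0, 11.0]},
    index=pd.date_range("2011-04-18", periods=7, freq="D"),
)

# dayofweek is 0 for Monday through 6 for Sunday.
day_of_week = pd.Series(df.index.dayofweek)

# One indicator column per weekday that occurs in the data.
dummies = pd.get_dummies(day_of_week, prefix="dow").astype(float)

# Rows are samples (days), columns are features: the dummies plus consumption.
X = np.column_stack([dummies.to_numpy(), df["consumption"].to_numpy()])
print(X.shape)
```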

To use LabelEncoder:

date = df.index.tolist()

date = [x.strftime('%Y-%m-%d') for x in date]

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
date_numeric = encoder.fit_transform(date)

# feed date_numeric with consumption into your KMeans
# you must use .T to transpose X, because sklearn treats each column as a feature
X = np.array([date_numeric, consumption]).T
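
The effect of the transpose can be checked on a toy example. The numbers below are invented, and n_init and random_state are set only to make the sketch repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical encoded dates and daily totals.
date_numeric = np.arange(6)
consumption = np.array([10.0, 11.0, 50.0, 52.0, 90.0, 95.0])

# Without .T there are 2 samples with 6 features each, which is not intended.
wrong = np.array([date_numeric, consumption])
print(wrong.shape)

# With .T there are 6 samples (days) with 2 features each.
X = np.array([date_numeric, consumption]).T
print(X.shape)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_.shape)
print(len(kmeans.labels_))
```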

For your plotting issue:

fig, ax = plt.subplots(figsize=(10,8))

colors = ["b.","r.","g."]

for i in range(len(X)):
    # inverse_transform expects a 1-d sequence, so wrap the single code in a list
    print("coordinate:", encoder.inverse_transform([int(X[i, 0])])[0], X[i,1], "label:", labels[i])
    ax.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

ax.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
a = np.arange(0, len(X), 5)
ax.set_xticks(a)
ax.set_xticklabels(encoder.inverse_transform(a.astype(int)))
