pandas: Using k-means, I got an error: array with 0 features
Original question: http://stackoverflow.com/questions/31335602/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Suzuki Soma
I am trying to cluster my CSV data using matplotlib and k-means.
My CSV data is about energy consumption: https://github.com/camenergydatalab/EnergyDataSimulationChallenge/blob/master/challenge2/data/total_watt.csv
I want to cluster the daily values into 3 groups: low, medium, and high energy consumption.
This is my code:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import pandas as pd
from sklearn.cluster import KMeans
MY_FILE='total_watt.csv'
date = []
consumption = []
df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0])
df = df.resample('1D', how='sum')
for row in df:
    if len(row) ==2 :
        date.append(row[0])
        consumption.append(row[1])
import datetime
for x in range(len(date)):
    date[x]=datetime.datetime.strptime(date[x], '%Y-%m-%d %H:%M:%S')
X = np.array([date, consumption])
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["b.","g.","r."]
for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()
But when I ran this code, I got the following error:
(DataVizProj)Soma-Suzuki:Soma Suzuki$ python 4.clusters.py
Traceback (most recent call last):
  File "4.clusters.py", line 31, in <module>
    kmeans.fit(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 785, in fit
    X = self._check_fit_data(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 755, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=np.float64)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/utils/validation.py", line 367, in check_array
    % (n_features, shape_repr, ensure_min_features))
ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required.
How can I cluster my CSV data properly?
EDIT-----------------------------------------------------
This is my new code. Thank you!
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import pandas as pd
from sklearn.cluster import KMeans
MY_FILE='total_watt.csv'
date = []
consumption = []
df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0])
df = df.resample('1D', how='sum')
df = df.dropna()
date = df.index.tolist()
consumption = df[df.columns[0]].values
X = np.array([date, consumption])
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["b.","g.","r."]
for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()
And the new error:
(DataVizProj)Soma-Suzuki:Soma Suzuki$ python 4.clusters.py
Traceback (most recent call last):
  File "4.clusters.py", line 26, in <module>
    kmeans.fit(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 785, in fit
    X = self._check_fit_data(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 755, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=np.float64)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/utils/validation.py", line 344, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number
EDITED2-----------------------------------------
Thank you Jianxun!!
I finally succeeded in clustering my CSV data!! Thank you so much!!
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import pandas as pd
from sklearn.cluster import KMeans
MY_FILE='total_watt.csv'
date = []
consumption = []
df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0])
df = df.resample('1D', how='sum')
df = df.dropna()
date = df.index.tolist()
date = [x.strftime('%Y-%m-%d') for x in date]
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
date_numeric = encoder.fit_transform(date)
consumption = df[df.columns[0]].values
X = np.array([date_numeric, consumption]).T
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["b.","r.","g."]
for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()
But as you can see, the x-axis does not reflect time, even though we set it up properly...
Answered by Jianxun Li
First problem:
for row in df:
    if len(row) ==2 :
        date.append(row[0])
        consumption.append(row[1])
This leaves date and consumption as unexpectedly empty lists, because for row in df actually loops over the columns rather than the rows. That is precisely why the error message says the array has 0 features.
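As a minimal sketch of the behaviour described above, the snippet below builds a small stand-in DataFrame (the column name "consumption" and the values are made up for illustration, not taken from the real CSV) and shows what plain iteration yields compared with iterrows().

```python
import pandas as pd

# Hypothetical stand-in for the resampled data: one column, a date index.
df = pd.DataFrame(
    {"consumption": [10.0, 12.5, 9.0]},
    index=pd.to_datetime(["2011-04-18", "2011-04-19", "2011-04-20"]),
)

# Iterating over a DataFrame yields its column labels, not its rows.
labels_seen = [item for item in df]

# To walk the rows, iterate over (index, row) pairs instead.
rows_seen = [(ts, row["consumption"]) for ts, row in df.iterrows()]

print(labels_seen)     # ['consumption']
print(len(rows_seen))  # 3
```

In this sketch the loop body sees only the string 'consumption', so a check such as len(row) == 2 never passes and nothing is appended.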
Also, I've seen that there are two NaN values in consumption, so you need to call df = df.dropna() (or impute these missing values), because sklearn does not tolerate NaN.
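The two options mentioned above could look roughly like this. The data is invented for illustration, and filling with the column mean is only one simple imputation choice, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical daily totals with two missing days.
daily = pd.DataFrame(
    {"consumption": [10.0, np.nan, 12.0, np.nan, 14.0]},
    index=pd.date_range("2011-04-18", periods=5, freq="D"),
)

# Option 1: drop the days with missing values.
dropped = daily.dropna()

# Option 2: impute, here with the column mean (one simple choice among many).
imputed = daily.fillna(daily["consumption"].mean())

print(len(dropped))                     # 3
print(imputed["consumption"].tolist())  # [10.0, 12.0, 12.0, 12.0, 14.0]
```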
To get the data out of your DataFrame, you can write something like this:
date = df.index.tolist()
consumption = df[df.columns[0]].values
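For readers on a recent pandas release, where resample no longer accepts the how= keyword used in the question, a rough equivalent of the resample, dropna, and extraction steps might look like the sketch below. The readings and the column name "watt" are hypothetical. Passing min_count=1 is an assumption about the desired behaviour: it keeps days without any readings as NaN so that dropna() can still remove them.

```python
import pandas as pd

# Hypothetical half-hourly readings with one whole day missing (2011-04-19).
index = pd.to_datetime(
    ["2011-04-18 13:00", "2011-04-18 13:30", "2011-04-20 09:00", "2011-04-20 09:30"]
)
df = pd.DataFrame({"watt": [100.0, 200.0, 300.0, 50.0]}, index=index)

# Method-chaining form of the daily sum. min_count=1 makes a day with no
# readings come out as NaN rather than 0, so dropna() can remove it.
daily = df.resample("1D").sum(min_count=1).dropna()

date = daily.index.tolist()
consumption = daily[daily.columns[0]].values

print(len(date))             # 2
print(consumption.tolist())  # [300.0, 350.0]
```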
Next, you've already parsed the dates in pd.read_csv, so the following part of your code will not work at all:
import datetime
for x in range(len(date)):
    date[x]=datetime.datetime.strptime(date[x], '%Y-%m-%d %H:%M:%S')
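A small check, using a made-up two-day index, illustrates why that loop fails: the index entries are already Timestamp objects, and strptime only accepts strings.

```python
import datetime
import pandas as pd

# Hypothetical index, as produced by read_csv(..., parse_dates=[0], index_col=[0]).
index = pd.to_datetime(["2011-04-18", "2011-04-19"])
date = index.tolist()

# The elements are already pandas Timestamps (a datetime.datetime subclass).
already_parsed = isinstance(date[0], datetime.datetime)

# strptime expects a string, so passing a Timestamp raises TypeError.
try:
    datetime.datetime.strptime(date[0], "%Y-%m-%d %H:%M:%S")
    failed = False
except TypeError:
    failed = True

print(already_parsed, failed)  # True True
```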
Finally, just feeding the raw date together with consumption into KMeans won't produce very useful results. You should consider converting date into numeric data, for example, dummies for dayofweek.
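One way the day-of-week dummies suggested above might be built is sketched here. The week of data, the column name "consumption", and the "dow" prefix are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical daily totals for one week starting on a Monday.
daily = pd.DataFrame(
    {"consumption": [10.0, 11.0, 9.5, 12.0, 10.5, 20.0, 22.0]},
    index=pd.date_range("2011-04-18", periods=7, freq="D"),
)

# Weekday number for each day (0 = Monday ... 6 = Sunday), aligned to the index.
weekday = pd.Series(daily.index.dayofweek, index=daily.index)

# One indicator column per weekday.
dummies = pd.get_dummies(weekday, prefix="dow").astype(float)

# Feature matrix: consumption plus the weekday indicators, one row per day.
X = pd.concat([daily, dummies], axis=1).to_numpy()

print(X.shape)  # (7, 8)
```

Because consumption is on a much larger scale than the 0/1 indicators, it would probably need rescaling before clustering; that step is left out of this sketch.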
To use LabelEncoder:
date = df.index.tolist()
date = [x.strftime('%Y-%m-%d') for x in date]
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
date_numeric = encoder.fit_transform(date)
# feed date_numeric together with consumption into your KMeans
# you must use .T to transpose X, because sklearn treats each column as a feature
X = np.array([date_numeric, consumption]).T
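The effect of the transpose can be checked on a few made-up values. The sketch also shows np.column_stack as an equivalent way to get one row per sample, and reproduces the (2, 0) shape from the first traceback using empty lists.

```python
import numpy as np

date_numeric = np.array([0, 1, 2, 3])             # hypothetical encoded dates
consumption = np.array([10.0, 12.5, 9.0, 30.0])   # hypothetical daily totals

wrong = np.array([date_numeric, consumption])     # 2 samples, 4 features
right = np.array([date_numeric, consumption]).T   # 4 samples, 2 features
same = np.column_stack([date_numeric, consumption])

print(wrong.shape, right.shape)     # (2, 4) (4, 2)
print(np.array_equal(right, same))  # True

# With empty lists, as in the first version of the question's code:
print(np.array([[], []]).shape)     # (2, 0)
```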
For your plotting issue:
fig, ax = plt.subplots(figsize=(10,8))
colors = ["b.","r.","g."]
for i in range(len(X)):
    # inverse_transform expects a 1-D array-like, so wrap the single code in a list
    print("coordinate:",encoder.inverse_transform([int(X[i,0])])[0], X[i,1], "label:", labels[i])
    ax.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)
ax.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
a = np.arange(0, len(X), 5)
ax.set_xticks(a)
ax.set_xticklabels(encoder.inverse_transform(a.astype(int)))
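Since the question's goal is low, medium, and high groups, it may help to note that the cluster numbers KMeans assigns carry no particular order. A possible way to name the clusters by their consumption centroid is sketched below. The centroids and labels arrays are hypothetical stand-ins for kmeans.cluster_centers_ and kmeans.labels_, and column 1 is assumed to be consumption, as in X above.

```python
import numpy as np

# Hypothetical outputs standing in for kmeans.cluster_centers_ and kmeans.labels_;
# column 1 of each centroid is the consumption coordinate.
centroids = np.array([[12.0, 30000.0], [5.0, 9000.0], [20.0, 18000.0]])
labels = np.array([1, 1, 2, 0, 2, 0])

# Rank the clusters by their consumption centroid, lowest first.
order = np.argsort(centroids[:, 1])  # cluster ids from low to high
names = np.empty(len(order), dtype=object)
names[order] = ["low", "medium", "high"]

# Translate each sample's cluster id into its name.
named_labels = names[labels].tolist()
print(named_labels)  # ['low', 'low', 'medium', 'high', 'medium', 'high']
```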



