pandas: using k-means, I got an error: array with 0 feature(s)

Question

Asked by Suzuki Soma

I am trying to cluster my CSV data using matplotlib and k-means.

My CSV data is about energy consumption: https://github.com/camenergydatalab/EnergyDataSimulationChallenge/blob/master/challenge2/data/total_watt.csv

I want to cluster the daily values into 3 groups: low, medium, and high energy consumption.

This is my code.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import pandas as pd
from sklearn.cluster import KMeans



MY_FILE='total_watt.csv'
date = []
consumption = []


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0])
df = df.resample('1D', how='sum')


for row in df:
    if len(row) ==2 :
        date.append(row[0])
        consumption.append(row[1])


import datetime
for x in range(len(date)):
    date[x]=datetime.datetime.strptime(date[x], '%Y-%m-%d %H:%M:%S')

X = np.array([date, consumption])
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

colors = ["b.","g.","r."]

for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)

plt.show()

But when I ran this code, I got the following error:

(DataVizProj)Soma-Suzuki:Soma Suzuki$ python 4.clusters.py
Traceback (most recent call last):
  File "4.clusters.py", line 31, in <module>
    kmeans.fit(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 785, in fit
    X = self._check_fit_data(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 755, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=np.float64)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/utils/validation.py", line 367, in check_array
    % (n_features, shape_repr, ensure_min_features))
ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required.

How can I cluster my CSV data properly?

EDIT-----------------------------------------------------

This is my new code. Thank you!

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import pandas as pd
from sklearn.cluster import KMeans



MY_FILE='total_watt.csv'
date = []
consumption = []


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0])
df = df.resample('1D', how='sum')
df = df.dropna()

date = df.index.tolist()
consumption = df[df.columns[0]].values



X = np.array([date, consumption])
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

colors = ["b.","g.","r."]

for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)

plt.show()

And this is the new error:

(DataVizProj)Soma-Suzuki:Soma Suzuki$ python 4.clusters.py
Traceback (most recent call last):
  File "4.clusters.py", line 26, in <module>
    kmeans.fit(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 785, in fit
    X = self._check_fit_data(X)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 755, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=np.float64)
  File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/utils/validation.py", line 344, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number

EDITED2-----------------------------------------

Thank you, Jianxun!

I finally succeeded in clustering my CSV data! Thank you so much!

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import pandas as pd
from sklearn.cluster import KMeans



MY_FILE='total_watt.csv'
date = []
consumption = []


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0])
df = df.resample('1D', how='sum')
df = df.dropna()

date = df.index.tolist()
date = [x.strftime('%Y-%m-%d') for x in date]
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
date_numeric = encoder.fit_transform(date)
consumption = df[df.columns[0]].values

X = np.array([date_numeric, consumption]).T




kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

colors = ["b.","r.","g."]

for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)

plt.show()

But as the resulting plot shows, the x-axis does not reflect time, although we set it properly.

Answer 1

Answered by Jianxun Li

First problem:

for row in df:
    if len(row) ==2 :
        date.append(row[0])
        consumption.append(row[1])

This leaves the lists date and consumption unexpectedly empty, because for row in df actually loops over the columns (their labels) instead of the rows. That is precisely why the error message says the array has 0 features.
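To make the point above concrete, here is a minimal sketch using a small made-up DataFrame in place of the real CSV (the column name consumption and the values are assumptions for illustration only). It shows that iterating over a DataFrame yields column labels, and that the original loop therefore builds an array of shape (2, 0).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the resampled daily data (values are made up).
df = pd.DataFrame(
    {"consumption": [10.0, 20.0, 30.0]},
    index=pd.date_range("2011-04-18", periods=3, freq="D"),
)

# Iterating over a DataFrame yields its column labels, not its rows.
iterated = [item for item in df]
print(iterated)  # ['consumption']

# The loop from the question: the only item is the string "consumption",
# whose length is not 2, so nothing is ever appended.
date, consumption = [], []
for row in df:
    if len(row) == 2:
        date.append(row[0])
        consumption.append(row[1])

X = np.array([date, consumption])
print(X.shape)  # (2, 0)

# One way to walk over the rows instead: pair each index entry with its value.
rows = [(day, value) for day, value in df["consumption"].items()]
print(len(rows))  # 3
```

The shape (2, 0) is exactly what the traceback reports: two "samples" (the two empty lists) with zero features each.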

Also, I see there are two NaN values in consumption, so you need to call df = df.dropna() (or impute these missing values), because sklearn does not tolerate NaN.
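The two options mentioned above might look like the following sketch. The data is made up, and filling with the column mean is only one possible imputation strategy, chosen here as an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily totals with two missing days (values are made up).
daily = pd.DataFrame(
    {"consumption": [10.0, np.nan, 30.0, np.nan, 50.0]},
    index=pd.date_range("2011-04-18", periods=5, freq="D"),
)

# Option 1: drop the days that have no value.
dropped = daily.dropna()
print(len(dropped))  # 3

# Option 2: impute the gaps, here with the mean of the observed values.
imputed = daily.fillna(daily["consumption"].mean())
print(imputed["consumption"].tolist())  # [10.0, 30.0, 30.0, 30.0, 50.0]
```

Dropping is the simpler choice when only a couple of days are missing; imputing keeps every day in the data set at the cost of inventing values for it.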

To get the data out of your DataFrame, you can write something like this:

date = df.index.tolist()
consumption = df[df.columns[0]].values

Next, you've already parsed the dates in pd.read_csv, so the following part of your code will not work at all: the index values are already datetime objects, while strptime expects strings.

import datetime
for x in range(len(date)):
    date[x]=datetime.datetime.strptime(date[x], '%Y-%m-%d %H:%M:%S')

Finally, just feeding the raw date together with consumption into KMeans won't produce very useful results. You should consider converting date into numeric data, for example, dummies for dayofweek.
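As a rough sketch of the dayofweek suggestion, the snippet below builds one indicator column per weekday with pd.get_dummies. The data, the column name consumption, and the dow prefix are assumptions for illustration; this is not the approach the asker ended up using.

```python
import pandas as pd

# Hypothetical week of daily totals (values are made up).
idx = pd.date_range("2011-04-18", periods=7, freq="D")
daily = pd.DataFrame(
    {"consumption": [10.0, 12.0, 11.0, 30.0, 31.0, 50.0, 52.0]},
    index=idx,
)

# dayofweek is 0 for Monday through 6 for Sunday; keep the date index so the
# dummy columns line up with the consumption column.
day_of_week = pd.Series(daily.index.dayofweek, index=daily.index)
dow = pd.get_dummies(day_of_week, prefix="dow")

# Numeric feature table: consumption plus one indicator column per weekday.
features = pd.concat([daily, dow], axis=1)
print(features.shape)  # (7, 8)
```

A table like features can be passed to KMeans directly, although consumption would probably need scaling so that it does not dominate the 0/1 indicator columns.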

To use LabelEncoder:

date = df.index.tolist()

date = [x.strftime('%Y-%m-%d') for x in date]

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
date_numeric = encoder.fit_transform(date)

# feed date_numeric with consumption into your KMeans
# must use .T to transpose your X, sklearn think each column is a feature
X = np.array([date_numeric, consumption]).T
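The effect of the transpose can be checked with a small sketch like the one below. The arrays are made-up stand-ins for date_numeric and consumption, and n_init and random_state are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical encoded dates and daily totals (values are made up).
date_numeric = np.arange(6)
consumption = np.array([10.0, 11.0, 30.0, 31.0, 50.0, 52.0])

# Without .T the array has 2 rows, which sklearn would read as 2 samples.
stacked = np.array([date_numeric, consumption])
print(stacked.shape)  # (2, 6)

# With .T each row is one day and each column is one feature.
X = stacked.T
print(X.shape)  # (6, 2)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
print(centroids.shape)  # (3, 2)
```

Fitting on stacked instead of X would treat the data as 2 samples with 6 features each, which is fewer samples than the 3 requested clusters.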

For your plotting issue:

fig, ax = plt.subplots(figsize=(10,8))

colors = ["b.","r.","g."]

for i in range(len(X)):
    # inverse_transform expects a 1-D sequence, so wrap the single code in a list
    print("coordinate:", encoder.inverse_transform([int(X[i, 0])])[0], X[i, 1], "label:", labels[i])
    ax.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

ax.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
a = np.arange(0, len(X), 5)
ax.set_xticks(a)
ax.set_xticklabels(encoder.inverse_transform(a.astype(int)))

